
PatternQuery: web application for fast detection of biomacromolecular structural patterns in the entire Protein Data Bank.

David Sehnal¹, Lukáš Pravda², Radka Svobodová Vařeková², Crina-Maria Ionescu³, Jaroslav Koča⁴.

Abstract

Well-defined biomacromolecular patterns such as binding sites, catalytic sites, specific protein or nucleic acid sequences, etc. precisely modulate many important biological phenomena. We introduce PatternQuery, a web-based application designed for detection and fast extraction of such patterns. The application uses a unique query language with Python-like syntax to define the patterns that will be extracted from datasets provided by the user, or from the entire Protein Data Bank (PDB). Moreover, the database-wide search can be restricted using a variety of criteria, such as PDB ID, resolution, and organism of origin, to provide only relevant data. The extraction generally takes a few seconds for several hundred entries, up to approximately one hour for the whole PDB. The detected patterns are made available for download to enable further processing, as well as presented in a clear tabular and graphical form directly in the browser. The unique design of the language and the provided service could pave the way towards novel PDB-wide analyses, which were either difficult or unfeasible in the past. The application is available free of charge at http://ncbr.muni.cz/PatternQuery.

Year: 2015 PMID: 26013810 PMCID: PMC4489247 DOI: 10.1093/nar/gkv561

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

In recent years, an overwhelming volume of biomacromolecular structures has been deposited in the worldwide deposition system Protein Data Bank (PDB) (1). The amount of data which was available 20 years ago is nowadays released every week, and this rapid pace is maintained. Small high-resolution protein structures are deposited, as well as extensive ribosomes or viral capsids. The whole scientific community can benefit from this abundance of biomacromolecular structures, being enabled to carry out experiments and analyses which were not feasible before (2–4). Such richness of 3D data accentuates the immense need for structural bioinformatics tools and services to help in reasoning out a variety of structural properties, which often go hand in hand with biological function. Presently, various computational tools and frameworks exist for the definition of molecular (sub)structure, such as SMILES (5), MQL (6), or SLN (7), which are mainly focused on small organic compounds. There are also tools that enable the definition and analysis of more general structural patterns, some of which rely on an internal molecular language (8–14). A structural pattern can, in principle, be any part of a biomacromolecule, e.g. protein backbone, ligands or metals together with their binding sites or surroundings, specific amino acids or nucleotide sequences, and sets of atoms or residues satisfying given criteria (distance, composition, intramolecular connectivity, etc.). Nevertheless, these tools are either designed to operate on a small number of structures, or focused on very specific and narrow applications. Furthermore, some of the most popular services and databases use structure information for defining or inferring structure-function relationships (15,16). Even critical interaction sites are defined at the primary and secondary structure level (17,18), mainly because of the large structural variation of biomacromolecules.
Ultimately, to our knowledge, there is no tool available for the general and systematic description and extraction of 3D structural patterns from biomacromolecules tailored for the mining of structural databases. In this article, we address the general philosophy of describing 3D structural patterns, and present an approach for their effective identification and extraction from individual biomacromolecules, as well as from the PDB archive. This approach is implemented as the user-friendly web service PatternQuery (PQ). The service is built on a simple yet powerful language for the description of any molecular structural patterns based on the nature and relationship between atoms, residues and other structural elements. The unique design of PQ allows the user to simultaneously operate at the primary, secondary and tertiary level of biomacromolecular structure. The results provided by PQ can serve as a source of input data in further analyses, such as structural and functional assignment of uncharacterized proteins, analysis of newly determined structures, comparative structural analysis, design and engineering of novel functional sites, etc.

DESCRIPTION OF THE TOOL

PatternQuery is an interactive web application for the optimal definition of biomacromolecular structural patterns, followed by their fast detection and extraction from the entire PDB or user defined datasets. These patterns are described by unique expressions based on the Python programming language, which are designed to define biomacromolecular structural patterns based on the nature and relationship between atoms, residues and other structural elements. These expressions define the composition, topology, connectivity, and 3D structure of a pattern. By composing these expressions into a query, 3D structural patterns can be identified inside biomacromolecules. Figure 1 gives the PQ query example that identifies and extracts a 3D pattern made up of a residue containing a pyranose moiety, together with its immediate surroundings.

Figure 1.

The query recognizes the binding pocket of any residue containing a pyranose moiety in the envelope glycoprotein gp160 from Human immunodeficiency virus 1 in complex with Homo sapiens immunoglobulins (3u7y). One of the recognized patterns is highlighted in the box. (A) First, the query identifies a pyranose moiety (a ring composed of 5 carbons and an oxygen atom). (B) Then, all residues which include this pattern in their structure are identified. (C) Finally, all the residues that are at most 4 Å from any of the pyranose containing residues are detected as well. This ensures all the potential coordination partners are recognized properly. The molecules were visualized using PyMOL.

The PatternQuery application can be used in two modes. The PQ Explorer mode (Figure 2) is useful for real-time investigation of smaller datasets (either user-uploaded or a small subset of the PDB), and tuning the queries prior to searching the whole PDB. The PQ Service mode (Supplementary Figure S1) is optimized for querying the entire PDB archive. Finally, a command-line version of the PQ application is available for processing in-house databases of 3D biomacromolecular structure data.
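The three steps (A) to (C) of the Figure 1 query can be illustrated with a minimal Python sketch. This is not PatternQuery syntax and not the authors' implementation: the toy atoms, coordinates, bond list and the helper names `find_rings`, `is_pyranose_like` and `ambient_residues` are all hypothetical, and the sketch pools all matches into one set instead of reporting one pattern per ring.

```python
from math import dist

# Hypothetical toy data: atom id -> (element, residue label, xyz in Å).
ATOMS = {
    1: ("C", "MAN 1", (0.0, 0.0, 0.0)),
    2: ("C", "MAN 1", (1.5, 0.0, 0.0)),
    3: ("C", "MAN 1", (2.2, 1.3, 0.0)),
    4: ("C", "MAN 1", (1.5, 2.6, 0.0)),
    5: ("C", "MAN 1", (0.0, 2.6, 0.0)),
    6: ("O", "MAN 1", (-0.7, 1.3, 0.0)),
    7: ("N", "ASN 2", (-3.0, 1.3, 0.0)),   # 2.3 Å from atom 6
    8: ("C", "GLY 3", (12.0, 0.0, 0.0)),   # far from the ring
}
BONDS = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)]


def find_rings(atoms, bonds, size=6):
    """Return the atom sets of all simple cycles with `size` members."""
    adjacency = {atom_id: set() for atom_id in atoms}
    for a, b in bonds:
        adjacency[a].add(b)
        adjacency[b].add(a)
    rings = set()

    def extend(path):
        if len(path) == size:
            if path[0] in adjacency[path[-1]]:   # the cycle closes
                rings.add(frozenset(path))
            return
        for nxt in adjacency[path[-1]]:
            # Start every cycle at its smallest atom id to limit duplicates.
            if nxt > path[0] and nxt not in path:
                extend(path + [nxt])

    for start in atoms:
        extend([start])
    return rings


def is_pyranose_like(ring, atoms):
    """Step (A): a ring made of five carbons and one oxygen."""
    return sorted(atoms[a][0] for a in ring) == ["C"] * 5 + ["O"]


def ambient_residues(core, atoms, radius=4.0):
    """Step (C): residues with any atom within `radius` Å of a core residue."""
    core_xyz = [xyz for _, res, xyz in atoms.values() if res in core]
    return {res for _, res, xyz in atoms.values()
            if any(dist(xyz, c) <= radius for c in core_xyz)}


rings = [r for r in find_rings(ATOMS, BONDS) if is_pyranose_like(r, ATOMS)]
core = {ATOMS[a][1] for r in rings for a in r}      # step (B)
pattern = ambient_residues(core, ATOMS, radius=4.0)
print(sorted(pattern))
```

In this toy input the single six-membered ring belongs to residue "MAN 1", and only "ASN 2" lies within the 4 Å cutoff, so the distant "GLY 3" is left out of the pattern.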

Figure 2.

The PatternQuery Explorer mode is tailored for querying smaller user-defined datasets (up to 100 entries) uploaded in one of the supported formats. Additionally, a subset of the PDB archive can be queried as well, based on PDB ID or a variety of metadata.

The PQ web pages contain several interactive guides, which explain the features and give an easy walk-through of the application, along with plenty of tips. Rich documentation is provided as well, in the form of a Wiki user manual with many examples.

PatternQuery workflow

The procedure of using the PatternQuery application involves four steps: (i) query definition; (ii) input data specification; (iii) running the PQ query; (iv) visualization and analysis of retrieved patterns.

Query definition

First, it is necessary to build a query that optimally describes the structural pattern(s) of interest. The PatternQuery language is well documented, and its usage is richly illustrated with many examples and several case studies. Detailed knowledge of the language is not required, since the integrated high performance coding editor (ACE, http://ace.c9.io) provides syntax suggestions and relevant query examples. Multiple queries can be defined for a single run.

Input data specification

Second, the queried data has to be specified. Small subsets of the PDB, or the user's own datasets can be queried in the PQ Explorer mode. Large custom databases can be queried using the command line version of PatternQuery. In the PQ Service mode, the default queried dataset is a weekly updated mirror of the latest release of the PDB stored in the PDBx/mmCIF file format. Alternatively, a subset of the PDB can be specified based on a list of PDB entry IDs, or on various metadata criteria. By specifying a subset of the PDB as input, it is ensured that only patterns from relevant structures are retrieved, and the query can be executed in a more time efficient manner. For example, one may restrict the search only to biomacromolecules including a DNA chain from Homo sapiens, determined by X-ray diffraction at a resolution better than 2 Å and published in the past 3 years.

Optionally, all the patterns identified while running PQ may be subjected to the structural validation of annotation (19). During this process, which is briefly described in the supplementary materials in the section SI Structure Validation, all ligands and non-standard residues larger than six heavy atoms are inspected for their completeness and chirality correctness. Possible discrepancies or structural inconsistencies are highlighted. This may aid further processing of the results by discarding low-quality patterns.

Running the PQ query

After setup, the specified data set is queried with all the defined PQ expressions. This process involves generating the structure's internal representation, together with proper bond identification based on the intramolecular atomic distances, and then attempting to match the PQ query with any suitable substructure. The theoretical framework behind this process is given in the supplementary materials (Theoretical Background section). Depending on the complexity of the defined queries and the number of dataset entries, running the queries may take from a few seconds (for a few hundred small to medium-sized entries), up to approximately one hour for 100 000 PDB entries. Most types of queries have O(N) or O(N log N) time complexity (where N is the number of atoms in the structure), so the running time grows roughly linearly with the amount of data: doubling the number of structures being processed will roughly double the running time. A benchmark of the application is available in the supplementary materials (section Performance Overview).

Visualization and analysis of retrieved patterns

The PQ results consist of structure files with the patterns, and statistics about their origin and composition. All the results are made available for inspection or download under a unique web address for at least a month, in both the PQ Explorer and PQ Service modes. The PatternQuery output provides a straightforward and rich report in both tabular and graphical form, including summary and detailed information about each pattern identified. The summary includes the number of detected patterns and PDB entries that the patterns were extracted from, together with possible errors and warnings, often caused by discrepancies either in the biomacromolecular structure, or in the file format. The detailed report provides a pattern view, focused on each individual pattern identified, and a PDB entry view, focused on each PDB entry queried. Additionally, in the PDB entry view, the results for all patterns identified in that particular PDB entry can be accessed together. Useful statistics in the form of the atom and residue composition are given for each extracted pattern, along with all the metadata from the parent data set entry (PDB entry). These can serve for further filtering of interesting results. Each extracted pattern can be visualized interactively (ChemDoodle, http://www.chemdoodle.com). Optionally, the validation report can be readily accessed.
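The metadata-based restriction of the input data described in the workflow (a DNA chain from Homo sapiens, X-ray diffraction, resolution better than 2 Å, published in the past 3 years) amounts to a filter over per-entry metadata. The following Python sketch shows one way to picture it. The record layout `EntryMetadata`, its field names, the method strings and the entry IDs are assumptions made for illustration, the release date is used as a stand-in for the publication date, and none of this reflects how the PQ service stores or filters its metadata.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class EntryMetadata:
    """Hypothetical per-entry record; field names are illustrative only."""
    pdb_id: str
    organisms: Tuple[str, ...]
    polymer_types: Tuple[str, ...]   # e.g. ("Protein", "DNA")
    method: str
    resolution: Optional[float]      # in Å; None when not applicable
    released: date                   # assumed proxy for "published"


def is_relevant(entry, today, max_resolution=2.0, max_age_years=3):
    """True for recent human DNA-containing X-ray entries below the cutoff."""
    age_days = (today - entry.released).days
    return (
        "DNA" in entry.polymer_types
        and "Homo sapiens" in entry.organisms
        and entry.method == "X-RAY DIFFRACTION"
        and entry.resolution is not None
        and entry.resolution < max_resolution
        and 0 <= age_days <= 365.25 * max_age_years
    )


# Made-up entries; the IDs are placeholders, not real PDB entries.
ENTRIES = [
    EntryMetadata("xxx1", ("Homo sapiens",), ("Protein", "DNA"),
                  "X-RAY DIFFRACTION", 1.8, date(2014, 6, 1)),
    EntryMetadata("xxx2", ("Homo sapiens",), ("Protein", "DNA"),
                  "X-RAY DIFFRACTION", 2.6, date(2014, 6, 1)),   # too coarse
    EntryMetadata("xxx3", ("Homo sapiens",), ("Protein", "DNA"),
                  "SOLUTION NMR", None, date(2014, 6, 1)),       # not X-ray
    EntryMetadata("xxx4", ("Mus musculus",), ("Protein", "DNA"),
                  "X-RAY DIFFRACTION", 1.5, date(2014, 6, 1)),   # other organism
    EntryMetadata("xxx5", ("Homo sapiens",), ("Protein", "DNA"),
                  "X-RAY DIFFRACTION", 1.9, date(2009, 1, 15)),  # too old
]

TODAY = date(2015, 4, 25)
selected = [e.pdb_id for e in ENTRIES if is_relevant(e, TODAY)]
print(selected)
```

Only the first placeholder entry passes all four criteria. Applying such a filter before pattern matching is what lets a restricted search touch far fewer structures than a PDB-wide run.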

Limitations

The setup of the PatternQuery web application, particularly in the PQ Service mode, is limited to 10 queries to be executed during a single run. The maximum number of results that can be returned by a single query execution on our server is one million patterns or ten million atoms, whichever is reached first. This limitation is not present in the command line tool. Additional limitations are discussed in detail in the supplementary materials (Limitations section).

RESULTS AND DISCUSSION

We provide two case studies, which demonstrate the possible usage of the PatternQuery web application. Additional biologically relevant examples, together with the corresponding PQ queries, are available on our wiki pages. All the queries used in the case studies can be found in the supplementary materials.

Case study I - LecB sugar binding sites

Pseudomonas aeruginosa is an opportunistic pathogen associated with a number of chronic infections. This pathogen forms a biofilm enabling it to survive both the response of the host immune system, and antibiotic treatment (20). One of the cornerstones of biofilm formation, in the case of P. aeruginosa, is the presence of sugar-binding proteins on the outer cell membrane: LecA (PA-IL) and LecB (PA-IIL). Their inhibition is considered to be a promising approach for anti-pseudomonadal treatment (21). LecB binds with the highest affinity to L-fucosides and D-mannosides (22); however, other monosaccharides are recognized as well (23). The sugar-binding domain is calcium dependent, with two calcium ions stabilizing the binding site. We employed PQ in the discovery of sugar binding sites with a geometry similar to that of the tetrameric LecB entry in the PDB. Specifically, we searched for 2 calcium ions at most 4 Å apart, and all the residues with direct interaction with either of these ions. Furthermore, only the molecular patterns containing a residue with a furan or pyran ring were preserved. The complete PQ query which identifies such patterns is given as SI Query 1. Because the sugar-binding domain is calcium dependent, we were able to restrict the search to the biomacromolecules having a calcium ion in their structure, and containing a pyranose or furanose moiety (3074 PDB entries as of 25.4.2015), which tremendously reduced query-running time. The initial analysis of the PDB archive revealed 355 different patterns originating from 231 PDB entries. However, the majority of the sugar moieties originated from nucleotides. To remove them, a simple filter was employed (SI Query 2), which provided 108 distinct patterns originating from 36 PDB entries of 7 different organisms. The majority of them originated from P. aeruginosa; however, other pathogens such as R. solanacearum, B. cenocepacia or C. violaceum were identified among the organisms of origin.
In 87 of the patterns (from a total of 24 PDB entries and 3 organisms), the sugar-binding domain is composed of 3x Asp, 2x Asn, Glu and Gly residues, which is the binding site referred to as the sugar binding motif in the literature (24). In 12 further patterns a glycine residue was not present because the structure stored in the PDB is only the asymmetric unit, rather than the expected biological unit, which is a tetramer. Finally, the remaining 9 patterns, originating from 6 different pectate lyase (EC: 4.2.2.2) structures, exhibited a different binding motif in comparison to the LecB protein. These patterns contained α-D-galactopyranuronic acid and its derivatives rather than a fucose or mannose derivative. A detailed list of these sugar ligands is given in the Supplementary Tables S1 and S2. In addition, the quality of the 3D structure of the patterns was examined. A total of 9 patterns originating from 3 PDB entries exhibited a serious structural issue, e.g. half of the α-L-fucose ligands in complex with the 1oxc PDB entry exhibit incorrect chirality at the C1 carbon atom. The details of this analysis can be found in the supplementary materials (SI Query Validation 1).
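The geometric core of the search described in this case study (two calcium ions at most 4 Å apart, the residues in direct contact with either ion, and the requirement that a sugar residue is among them) can be sketched in plain Python. The actual queries are SI Query 1 and SI Query 2, which are not reproduced here. In this sketch the toy coordinates, the residue labels, the helper name `calcium_sites` and the 3 Å contact cutoff used as a stand-in for "direct interaction" are all assumptions, and the set of sugar residues is taken as given rather than detected from ring topology.

```python
from itertools import combinations
from math import dist

CA_CA_MAX = 4.0     # Å, ion-ion distance limit quoted in the text
CONTACT_MAX = 3.0   # Å, assumed distance stand-in for "direct interaction"

# Hypothetical toy atoms: (element, residue label, xyz in Å).
ATOMS = [
    ("Ca", "CA 301", (0.0, 0.0, 0.0)),
    ("Ca", "CA 302", (3.7, 0.0, 0.0)),
    ("Ca", "CA 303", (20.0, 0.0, 0.0)),     # isolated ion, no partner ion
    ("O", "ASP 99", (-2.4, 0.0, 0.0)),
    ("O", "FUC 201", (1.85, 2.0, 0.0)),
    ("C", "GLY 114", (10.0, 10.0, 10.0)),   # too far from any ion
]
# Assumed to be known already (residues containing a furan or pyran ring).
SUGAR_RESIDUES = {"FUC 201"}


def calcium_sites(atoms, sugar_residues):
    """Return (ion pair, contacting residues) for Ca pairs that bind a sugar."""
    ions = [a for a in atoms if a[0] == "Ca"]
    sites = []
    for first, second in combinations(ions, 2):
        if dist(first[2], second[2]) > CA_CA_MAX:
            continue
        partners = {
            res for element, res, xyz in atoms
            if element != "Ca"
            and min(dist(xyz, first[2]), dist(xyz, second[2])) <= CONTACT_MAX
        }
        # Keep the site only if a sugar residue is among the contacts.
        if partners & sugar_residues:
            sites.append(({first[1], second[1]}, partners))
    return sites


sites = calcium_sites(ATOMS, SUGAR_RESIDUES)
print(sites)
```

The pairwise loop over ions is quadratic, which is acceptable for a sketch because a structure rarely contains many calcium ions. A production search would more likely use a spatial index for the neighbour lookups.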

Case study II - C2H2 zinc fingers

The class of zinc finger DNA-binding proteins is the most abundant across all biology (25). They fulfill a remarkable range of diverse functions, including DNA recognition, transcriptional activation, regulation of apoptosis or lipid binding (25). Due to their specificity and modular architecture, they often serve as a rational engineering target for binding a wide range of DNA sequences to activate, repress, cut or paste genes (26). The classical C2H2 zinc finger domain is composed of a simple ββα fold, which is stabilized by a zinc ion coordinated by two histidine and two cysteine residues. The fold is often described by the pattern X2-C-X(2–4)-C-X12-H-X(3–5)-H, where X stands for any amino acid, the number or range after X gives how many such residues occur, C is cysteine and H is histidine. Nevertheless, atypical variations also exist, which differ from the consensus profile (27) (e.g. UniProt ID (28): P47043). The X12 region of the consensus profile is usually further decomposed into the sequence X3-[F|Y]-X5-Ψ-X2, where [F|Y] represents either a phenylalanine or tyrosine residue, and Ψ denotes a hydrophobic residue (29). We have queried the whole PDB archive (access date 25.4.2015) using several different PQ queries. First, we searched only for patterns with primary sequences which satisfy the basic consensus profile of the typical C2H2 zinc finger domain, without further specification of the X12 region (SI Query 3). We identified 595 patterns in 342 different PDB entries. The results of such a query will inevitably be plagued by a number of false positive hits, i.e. patterns satisfying the primary sequence criteria, but which are not zinc fingers. This is because no further checks for the presence of a zinc ion or stabilizing residues were included in the query. Closer inspection of the results revealed that the above-defined primary sequence corresponds not only to the C2H2 zinc finger fold, but also to a variety of fumarate reductases and hydrolases.
In order to filter out these false positive hits, we adjusted the query so that the pattern must contain a zinc ion stabilized by two cysteine and two histidine residues from the consensus profile (SI Query 4). This final query resulted in 461 different patterns originating from 278 PDB entries. The majority of the results (356 patterns in 239 PDB entries) also satisfied the special pattern of the X12 region between the second cysteine and the first histidine (SI Query 5). The largest number of structures was isolated from Eukaryotes, mainly Homo sapiens, and determined by solution nuclear magnetic resonance spectroscopy. However, a few structures originating from viruses and bacteria were found as well. No residues relevant for validation were detected inside the input patterns, so none were validated. Furthermore, it has been reported that the zinc finger fold may also be stabilized by other metals (30). We have modified the query so that possible substitutions of zinc with other metals can also be considered (SI Query 6). Running this query returned five additional patterns from two PDB entries, where the zinc ion was substituted by cobalt (31) and cadmium (32). Although the cobalt-binding protein contains 5 zinc finger domains, just 4 patterns were identified, due to the alternate primary sequence in one of the domains. These primary sequence modifications can be accounted for by modifying the regular expression in the PQ query.
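The two primary-sequence profiles used in this case study translate directly into regular expressions. The sketch below uses Python's `re` module rather than the PQ query language, and covers only the sequence criterion, not the zinc coordination check added in SI Query 4. The residue set chosen for the hydrophobic position Ψ, the helper name `find_fingers` and the two test sequences are assumptions made for illustration.

```python
import re

# X2-C-X(2-4)-C-X12-H-X(3-5)-H, one-letter amino acid codes.
BASIC = re.compile(r"..C.{2,4}C.{12}H.{3,5}H")

# Assumed residue set for the hydrophobic position (the text only says
# "a hydrophobic residue"); adjust to the definition you need.
HYDROPHOBIC = "AVILMFWY"

# Same profile with X12 decomposed into X3-[F|Y]-X5-hydrophobic-X2.
REFINED = re.compile(
    rf"..C.{{2,4}}C.{{3}}[FY].{{5}}[{HYDROPHOBIC}].{{2}}H.{{3,5}}H"
)


def find_fingers(sequence, profile):
    """Return (start index, matched segment) for non-overlapping matches."""
    return [(m.start(), m.group()) for m in profile.finditer(sequence)]


# Illustrative sequences: the second differs only at the [F|Y] position.
typical = "MKRPYACPVESCDRRFSRSDELTRHIRIHTGQK"
atypical = "MKRPYACPVESCDRRASRSDELTRHIRIHTGQK"

print(find_fingers(typical, BASIC))
print(find_fingers(typical, REFINED))
print(find_fingers(atypical, BASIC))
print(find_fingers(atypical, REFINED))
```

The second sequence satisfies the basic profile but not the refined one, which mirrors how tightening the X12 region reduces the number of hits. Relaxing a quantifier or a character class in the same way is how an atypical finger could be admitted.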

CONCLUSION

In this article, we presented PatternQuery, a novel web application for rapid definition and extraction of 3D structural patterns from the entire PDB. The web application is easy to use and platform-independent. Results are presented in a clear graphical and tabular form. Rich documentation regarding both the underlying language and the features of the web application, along with several biologically relevant case studies, is available at http://ncbr.muni.cz/PatternQuery. The innovative approach described in the present study enables mining large databases (entire PDB or in-house structural databases), a task which was unfeasible in the past, or was difficult for patterns with more complex structure.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

30 in total

1. Structural basis for oligosaccharide-mediated adhesion of Pseudomonas aeruginosa in the lungs of cystic fibrosis patients.

Authors: Edward Mitchell; Corinne Houles; Dvora Sudakevitz; Michaela Wimmerova; Catherine Gautier; Serge Pérez; Albert M Wu; Nechama Gilboa-Garber; Anne Imberty
Journal: Nat Struct Biol Date: 2002-12

2. CSB: a Python framework for structural bioinformatics.

Authors: Ivan Kalev; Martin Mechelke; Klaus O Kopec; Thomas Holder; Simeon Carstens; Michael Habeck
Journal: Bioinformatics Date: 2012-08-31 Impact factor: 6.937

3. Zinc to cadmium replacement in the A. thaliana SUPERMAN Cys₂ His₂ zinc finger induces structural rearrangements of typical DNA base determinant positions.

Authors: Gaetano Malgieri; Laura Zaccaro; Marilisa Leone; Enrico Bucci; Sabrina Esposito; Ilaria Baglivo; Annarita Del Gatto; Luigi Russo; Roberto Scandurra; Paolo V Pedone; Roberto Fattorusso; Carla Isernia
Journal: Biopolymers Date: 2011-05-25 Impact factor: 2.505

4. Solution structure of a Zap1 zinc-responsive domain provides insights into metalloregulatory transcriptional repression in Saccharomyces cerevisiae.

Authors: Zhonghua Wang; Linda S Feng; Viktor Matskevich; Krishna Venkataraman; Priya Parasuram; John H Laity
Journal: J Mol Biol Date: 2006-01-24 Impact factor: 5.469

5. PAST: fast structure-based searching in the PDB.

Authors: Hanjo Täubig; Arno Buchner; Jan Griebsch
Journal: Nucleic Acids Res Date: 2006-07-01 Impact factor: 16.971

6. SCOP2 prototype: a new approach to protein structure mining.

Authors: Antonina Andreeva; Dave Howorth; Cyrus Chothia; Eugene Kulesha; Alexey G Murzin
Journal: Nucleic Acids Res Date: 2013-11-29 Impact factor: 16.971

7. ValidatorDB: database of up-to-date validation results for ligands and non-standard residues from the Protein Data Bank.

Authors: David Sehnal; Radka Svobodová Vařeková; Lukáš Pravda; Crina-Maria Ionescu; Stanislav Geidl; Vladimír Horský; Deepti Jaiswal; Michaela Wimmerová; Jaroslav Koča
Journal: Nucleic Acids Res Date: 2014-11-11 Impact factor: 16.971

8. PDBe: Protein Data Bank in Europe.

Authors: Aleksandras Gutmanas; Younes Alhroub; Gary M Battle; John M Berrisford; Estelle Bochet; Matthew J Conroy; Jose M Dana; Manuel A Fernandez Montecelo; Glen van Ginkel; Swanand P Gore; Pauline Haslam; Rowan Hatherley; Pieter M S Hendrickx; Miriam Hirshberg; Ingvar Lagerstedt; Saqib Mir; Abhik Mukhopadhyay; Thomas J Oldfield; Ardan Patwardhan; Luana Rinaldi; Gaurav Sahni; Eduardo Sanz-García; Sanchayita Sen; Robert A Slowley; Sameer Velankar; Michael E Wainwright; Gerard J Kleywegt
Journal: Nucleic Acids Res Date: 2013-11-27 Impact factor: 16.971

9. Comprehensive analysis of loops at protein-protein interfaces for macrocycle design.

Authors: Jason Gavenonis; Bradley A Sheneman; Timothy R Siegert; Matthew R Eshelman; Joshua A Kritzer
Journal: Nat Chem Biol Date: 2014-07-20 Impact factor: 15.040

10. Synthetic zinc finger proteins: the advent of targeted gene regulation and genome modification technologies.

Authors: Charles A Gersbach; Thomas Gaj; Carlos F Barbas
Journal: Acc Chem Res Date: 2014-05-30 Impact factor: 22.384
