Literature DB >> 30295745

cath-resolve-hits: a new tool that resolves domain matches suspiciously quickly.

Abstract

MOTIVATION: Many bioinformatics areas require us to assign domain matches onto stretches of a query protein. Starting with a set of candidate matches, we want to identify the optimal subset that has limited/no overlap between matches. This may be further complicated by discontinuous domains in the input data. Existing tools are increasingly facing very large data-sets for which they require prohibitive amounts of CPU-time and memory.
RESULTS: We present cath-resolve-hits (CRH), a new tool that uses a dynamic-programming algorithm implemented in open-source C++ to handle large datasets quickly (up to ∼1 million hits/second) and in reasonable amounts of memory. It accepts multiple input formats and provides its output in plain text, JSON or graphical HTML. We describe a benchmark against an existing algorithm, which shows CRH delivers very similar or slightly improved results and very much improved CPU/memory performance on large datasets.
AVAILABILITY AND IMPLEMENTATION: CRH is available at https://github.com/UCLOrengoGroup/cath-tools; documentation is available at http://cath-tools.readthedocs.io. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Entities: Chemical Disease Gene Species

Mesh：

Substances：
Proteins

Year: 2019 PMID： 30295745 PMCID： PMC6513158 DOI： 10.1093/bioinformatics/bty863

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

Accurately annotating protein domains is essential for a number of tasks such as genome annotation. Various resources exist for assigning domains to proteins, each with its own distinct philosophy and approach (e.g. Dawson et al., 2016; Finn ; Lam ). Predicting domains for a query protein typically involves scanning the amino acid sequence against domain libraries and then resolving the candidate matches to obtain a final set of non-overlapping domain assignments. The scans typically assign a score (e.g. bit-score or e-value) to the candidate matches and this can be used to prioritise strong matches in cases of domain overlaps. The simple greedy approach (select the best hit, followed by the next non-conflicting best hit, etc.) has been shown to be outperformed by a method that seeks a global optimum, DomainFinder3 (DF3) (Yeats ). DF3 is also able to deal with discontinuous domains, which arise from domains’ insertions into other domains (meaning these domains do not have a single continuous region on the protein sequence, but have multiple starts and stops). However, DF3 is based on a graph-based, maximal-weighted-clique algorithm and it becomes increasingly slow and memory intensive for larger proteins. Similar problems, such as weighted interval scheduling, can be tackled with fast, optimal dynamic-programming algorithms. However, naïvely translating such algorithms to domain resolution would not account for discontinuous domains and so would disregard solutions in which one domain is inserted within the gap of another, discontinuous domain.

2 Materials and methods

In this work, we present cath-resolve-hits (CRH), a new tool that uses a dynamic-programming algorithm so that CPU/memory usage scales better with increasing problem size. CRH is implemented in C++ for speed and memory efficiency. We have chosen to frame the problem such that the algorithm itself is a deterministic search for an unambiguously optimal solution and all choices and trade-offs are pushed into the generation of the input data. This separation encourages the sort of transparency, reproducibility and debuggability that has previously benefited approaches to other bioinformatics problems such as sequence alignment. The algorithm maximizes the sum of the selected matches’ scores. Users may provide any positive scores with their input data, though CRH also provides default translations from HMMER (hmmer.org) bit scores or e-values to scores suitable for CRH. This provides tremendous flexibility without sacrificing ease-of-use. CRH also provides options that re-weight the scores to adjust the strength of preference for high scores or for long/short domains. CRH allows for limited overlaps between matches by trimming the ends of domains’ segments (according to a user-configurable policy) before resolving them. CRH can handle data for multiple query proteins in a single input file. It does not require that the data be pre-grouped by query (but if notified of such pre-grouping, can exploit it to reduce run-time and memory usage). See Supplementary Material for details. To assess CRH performance, we built a benchmark set by mapping known CATH v4.2 domains from PDB structures to UniProt sequences, using the SIFTS resource (Velankar ). To remove redundancy, we clustered the SIFTS-mapped protein sequences with CD-HIT at 70% sequence identity, choosing the longest protein sequence from each cluster. The final benchmark dataset consisted of 4738 protein sequences (see Supplementary Material for details). For these UniProt sequences, we assessed each method’s ability to reconstruct the original CATH domain assignments using domain predictions from a library of HMMs derived from CATH v4.2 definitions (Lewis ). Real-world domain assignment tasks typically involve low sequence identity between the known domain used to build the HMM model and the domain being predicted. To simulate this, we removed any HMM models built from seed domains with more than some specified percentage identity to the known CATH domains in the query sequence. We applied this at three levels of sequence-identity cut-off: 100%, 60% and 30%.

3 Results

We found that CRH’s performance is very similar to or slightly better than DF3’s (Fig. 1A). Both methods exhibited overall improvement over naïve-greedy approaches (both with and without domain overlap trimming) (Fig. 1A).

Fig. 1.

(A) performance of CRH, DF3 and Naïve Greedy at 100%, 60% and 30% sequence identity homology removal (see Methods). The axes show the proportion of domains assigned to: the correct domain superfamily (y-axis); an incorrect domain superfamily (x-axis). CRH assignments for all the Benchmark HMM assignments with 475, 161 hits took 3.3 s (Intel i7-7500U up to 3.5 GHz) and peak memory usage of 143 MB. A perfect result would appear at the top-left corner. B/C) Rate of use of CPU time in minutes (B)/memory in GBs (C) per 100 000 inputs to resolve a randomly chosen subset of hits to a large protein (human titin), averaged over 100 runs. The stars indicate the points beyond which DF3 failed to run, even with ample memory available A few sequences from this dataset were used in CRH’s development so we cannot exclude the possibility of overfitting, however contact was minimal and we think this is unlikely. The main difference we found between CRH and DF3 was that CRH shows greatly improved memory efficiency and speed. We demonstrated this by measuring the time/memory that each program required to resolve random subsets of 263 312 Gene3D-v16 HMM predictions to the 34 350-residue TITIN_HUMAN sequence (Q8WZ42) on the same CentOS 6 machine (Fig. 1B and C). Each measurement was averaged over 100 runs. CRH appears to exhibit a constant rate of CPU/memory usage per input (hence linear growth overall), whereas DF3 appears to exhibit a linear rate of usage (hence quadratic overall). Further, DF3 crashed when run on any datasets of 84 636 models or more, even with ample memory provided. This shows CRH’s better suitability for tackling the enormous growth in biological data [illustrated by the tens of billions of sequences now available from the IMG/M resource (Markowitz )]. CRH also provides greater flexibility in both input and output formats. Though DF3 and CRH both accept simple generic formats, CRH can also process both the raw and domain table outputs from hmmsearch and hmmscan (hmmer.org). Furthermore, there are several available output formats from CRH, including basic text, graphical HTML and JSON. The graphical HTML output (Supplementary Fig. S1) is useful for laying the domain resolution process bare, revealing why specified domains are included/excluded in the final resolved domain architecture. CRH is available for Linux and Mac as part of a suite of tools at https://github.com/UCLOrengoGroup/cath-tools. The project is written in C++14. The code compiles without warning or error under strict settings of both GCC and Clang. Travis-CI is used for builds and for continuous-integration execution of >¼-million test assertions in >1000 test cases, with and without Clang’s AddressSanitizer.

Funding

JGL was funded by BBSRC (Ref: BB/L002817/1). Conflict of Interest: none declared. Click here for additional data file.

7 in total

1. A fast and automated solution for accurately resolving protein domain architectures.

Authors: Corin Yeats; Oliver C Redfern; Christine Orengo
Journal: Bioinformatics Date: 2010-01-29 Impact factor: 6.937

2. IMG/M: the integrated metagenome data management and comparative analysis system.

Authors: Victor M Markowitz; I-Min A Chen; Ken Chu; Ernest Szeto; Krishna Palaniappan; Yuri Grechkin; Anna Ratner; Biju Jacob; Amrita Pati; Marcel Huntemann; Konstantinos Liolios; Ioanna Pagani; Iain Anderson; Konstantinos Mavromatis; Natalia N Ivanova; Nikos C Kyrpides
Journal: Nucleic Acids Res Date: 2011-11-15 Impact factor: 16.971

3. InterPro in 2017-beyond protein family and domain annotations.

Authors: Robert D Finn; Teresa K Attwood; Patricia C Babbitt; Alex Bateman; Peer Bork; Alan J Bridge; Hsin-Yu Chang; Zsuzsanna Dosztányi; Sara El-Gebali; Matthew Fraser; Julian Gough; David Haft; Gemma L Holliday; Hongzhan Huang; Xiaosong Huang; Ivica Letunic; Rodrigo Lopez; Shennan Lu; Aron Marchler-Bauer; Huaiyu Mi; Jaina Mistry; Darren A Natale; Marco Necci; Gift Nuka; Christine A Orengo; Youngmi Park; Sebastien Pesseat; Damiano Piovesan; Simon C Potter; Neil D Rawlings; Nicole Redaschi; Lorna Richardson; Catherine Rivoire; Amaia Sangrador-Vegas; Christian Sigrist; Ian Sillitoe; Ben Smithers; Silvano Squizzato; Granger Sutton; Narmada Thanki; Paul D Thomas; Silvio C E Tosatto; Cathy H Wu; Ioannis Xenarios; Lai-Su Yeh; Siew-Yit Young; Alex L Mitchell
Journal: Nucleic Acids Res Date: 2016-11-29 Impact factor: 16.971

4. CATH: an expanded resource to predict protein function through structure and sequence.

Authors: Natalie L Dawson; Tony E Lewis; Sayoni Das; Jonathan G Lees; David Lee; Paul Ashford; Christine A Orengo; Ian Sillitoe
Journal: Nucleic Acids Res Date: 2016-11-28 Impact factor: 16.971

5. SIFTS: Structure Integration with Function, Taxonomy and Sequences resource.

Authors: Sameer Velankar; José M Dana; Julius Jacobsen; Glen van Ginkel; Paul J Gane; Jie Luo; Thomas J Oldfield; Claire O'Donovan; Maria-Jesus Martin; Gerard J Kleywegt
Journal: Nucleic Acids Res Date: 2012-11-29 Impact factor: 16.971

6. Gene3D: expanding the utility of domain assignments.

Authors: Su Datt Lam; Natalie L Dawson; Sayoni Das; Ian Sillitoe; Paul Ashford; David Lee; Sonja Lehtinen; Christine A Orengo; Jonathan G Lees
Journal: Nucleic Acids Res Date: 2015-11-17 Impact factor: 16.971

7. Gene3D: Extensive prediction of globular domains in proteins.

Authors: Tony E Lewis; Ian Sillitoe; Natalie Dawson; Su Datt Lam; Tristan Clarke; David Lee; Christine Orengo; Jonathan Lees
Journal: Nucleic Acids Res Date: 2018-01-04 Impact factor: 16.971

7 in total

1. SARS-CoV-2 structural coverage map reveals viral protein assembly, mimicry, and hijacking mechanisms.

Authors: Seán I O'Donoghue; Andrea Schafferhans; Neblina Sikta; Christian Stolte; Sandeep Kaur; Bosco K Ho; Stuart Anderson; James B Procter; Christian Dallago; Nicola Bordin; Matt Adcock; Burkhard Rost
Journal: Mol Syst Biol Date: 2021-09 Impact factor: 13.068

2. Exploiting protein family and protein network data to identify novel drug targets for bladder cancer.

Authors: Tolulope Tosin Adeyelu; Aurelio A Moya-Garcia; Christine Orengo
Journal: Oncotarget Date: 2022-01-12

3. Infant gut strain persistence is associated with maternal origin, phylogeny, and traits including surface adhesion and iron acquisition.

Authors: Yue Clare Lou; Matthew R Olm; Spencer Diamond; Alexander Crits-Christoph; Brian A Firek; Robyn Baker; Michael J Morowitz; Jillian F Banfield
Journal: Cell Rep Med Date: 2021-09-07

4. Characterizing and explaining the impact of disease-associated mutations in proteins without known structures or structural homologs.

Authors: Neeladri Sen; Ivan Anishchenko; Nicola Bordin; Ian Sillitoe; Sameer Velankar; David Baker; Christine Orengo
Journal: Brief Bioinform Date: 2022-07-18 Impact factor: 13.994

5. Ultra-deep Sequencing of Hadza Hunter-Gatherers Recovers Vanishing Gut Microbes.

Authors: Bryan D Merrill; Matthew M Carter; Matthew R Olm; Dylan Dahan; Surya Tripathi; Sean P Spencer; Brian Yu; Sunit Jain; Norma Neff; Aashish R Jha; Erica D Sonnenburg; Justin L Sonnenburg
Journal: bioRxiv Date: 2022-10-06

6. Protein folds as synapomorphies of the tree of life.

Authors: Martin Romei; Guillaume Sapriel; Pierre Imbert; Théo Jamay; Jacques Chomilier; Guillaume Lecointre; Mathilde Carpentier
Journal: Evolution Date: 2022-07-13 Impact factor: 4.171

7. Genomic and transcriptional analysis of genes containing fibrinogen and IgSF domains in the schistosome vector Biomphalaria glabrata, with emphasis on the differential responses of snails susceptible or resistant to Schistosoma mansoni.

Authors: Lijun Lu; Eric S Loker; Coen M Adema; Si-Ming Zhang; Lijing Bu
Journal: PLoS Negl Trop Dis Date: 2020-10-14

7 in total