T E Lewis1, I Sillitoe1, J G Lees2.
Abstract
MOTIVATION: Many areas of bioinformatics require us to assign domain matches onto stretches of a query protein. Starting with a set of candidate matches, we want to identify the optimal subset that has limited or no overlap between matches. This may be further complicated by discontinuous domains in the input data. Existing tools are increasingly facing very large datasets for which they require prohibitive amounts of CPU time and memory.
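The problem stated above can be illustrated with a small sketch. For contiguous hits with no overlap allowed, choosing the highest-scoring compatible subset is the classic weighted interval scheduling problem, which dynamic programming solves after sorting by end position. This is only an illustration under those assumptions, not the published CRH implementation: the `Hit` type, the function names and the example scores are hypothetical, and discontinuous domains are not handled. A score-ordered greedy selection is included for comparison.

```python
from bisect import bisect_right
from typing import List, NamedTuple


class Hit(NamedTuple):
    """Hypothetical candidate match on a query protein (contiguous only)."""
    start: int    # first residue, inclusive
    stop: int     # last residue, inclusive
    score: float  # higher is better
    label: str


def resolve_hits(hits: List[Hit]) -> List[Hit]:
    """Return a maximum-total-score subset of mutually non-overlapping hits."""
    ordered = sorted(hits, key=lambda h: h.stop)
    stops = [h.stop for h in ordered]
    # best[i] = best total score achievable using only the first i hits.
    best = [0.0] * (len(ordered) + 1)
    # prev[i] = number of earlier hits that end strictly before hit i starts.
    prev = [0] * len(ordered)
    for i, h in enumerate(ordered):
        prev[i] = bisect_right(stops, h.start - 1, 0, i)
        best[i + 1] = max(best[i], best[prev[i]] + h.score)

    # Trace back through the table to recover the chosen hits.
    chosen = []
    i = len(ordered)
    while i > 0:
        if best[i] == best[i - 1]:
            i -= 1  # hit i-1 was not needed for the best score
        else:
            chosen.append(ordered[i - 1])
            i = prev[i - 1]
    return chosen[::-1]


def greedy_hits(hits: List[Hit]) -> List[Hit]:
    """Take hits in descending score order, skipping any that overlap."""
    chosen: List[Hit] = []
    for h in sorted(hits, key=lambda h: -h.score):
        if all(h.stop < c.start or c.stop < h.start for c in chosen):
            chosen.append(h)
    return sorted(chosen, key=lambda h: h.start)


# Made-up example: the best single hit (B) blocks two hits that together score more.
example = [
    Hit(1, 60, 40.0, "A"),
    Hit(50, 150, 50.0, "B"),
    Hit(140, 200, 40.0, "C"),
]
print([h.label for h in resolve_hits(example)])
print([h.label for h in greedy_hits(example)])
```

In this toy input the dynamic-programming version selects A and C (total score 80), whereas the greedy version keeps only B (score 50). Sorting dominates the cost, so the sketch runs in O(n log n) time for n hits.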
Year: 2019 PMID: 30295745 PMCID: PMC6513158 DOI: 10.1093/bioinformatics/bty863
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. (A) Performance of CRH, DF3 and Naïve Greedy at 100%, 60% and 30% sequence identity homology removal (see Methods). The axes show the proportion of domains assigned to the correct domain superfamily (y-axis) and to an incorrect domain superfamily (x-axis). A perfect result would appear at the top-left corner. CRH assignment for all the Benchmark HMM assignments, with 475 161 hits, took 3.3 s (Intel i7-7500U, up to 3.5 GHz) with a peak memory usage of 143 MB. (B/C) Rate of use of CPU time in minutes (B) and memory in GB (C) per 100 000 inputs to resolve a randomly chosen subset of hits to a large protein (human titin), averaged over 100 runs. The stars indicate the points beyond which DF3 failed to run, even with ample memory available.