Kyungtaek Lim, Kazunori D Yamada, Martin C Frith, Kentaro Tomii.
Abstract
Searching public protein databases is a fundamental step in target selection for structural and functional genomics, and in inferring protein structure, function, and evolution. Most database search methods employ amino acid substitution matrices to score amino acid pairs, and the choice of substitution matrix strongly affects homology detection performance. We earlier proposed a substitution matrix named MIQS that was optimized for distant protein homology search. Herein we further evaluate MIQS in combination with LAST, a fast heuristic database search tool with a tunable sensitivity parameter m, where larger m denotes higher sensitivity. Results show that MIQS substantially improves the homology detection and alignment quality of LAST across diverse values of m. Against a protein database consisting of approximately 15 million sequences, LAST with m = 10^5 achieves better homology detection performance than BLASTP and completes the search 20 times faster. Compared to the most sensitive methods in current use, CS-BLAST and SSEARCH, LAST with MIQS and m = 10^6 shows comparable homology detection performance at 2.0 and 3.9 times greater speed, respectively. Results demonstrate that MIQS-powered LAST is a time-efficient method for sensitive and accurate homology search.
Keywords: Alignment quality; Amino acid substitution matrix; Homology detection
Year: 2017 PMID: 28083762 PMCID: PMC5274646 DOI: 10.1007/s10969-016-9210-4
Source DB: PubMed Journal: J Struct Funct Genomics ISSN: 1345-711X
Fig. 1 Superfamily level homology detection benchmark across database searches of the SCOP20 validation sequences against UniRef50+. ROC plot for weighted FP versus weighted TP counts up to particular E-values. Each FP or TP is weighted by 1/(number of other domains in the query superfamily). Some FPs are ignored according to the JG standard in (b) but not in (a). The solid black line represents FDR = 10%. See “Results” section for additional details.
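The weighting described in this caption can be illustrated with a minimal Python sketch. It is not the authors' benchmark code: the `Hit` record layout, the function names, and the choice to skip singleton superfamilies are assumptions made for illustration, and FDR is taken here as FP/(FP + TP) on the weighted counts.

```python
from typing import Iterable, List, Tuple

# Hypothetical hit record: (E-value, is true positive, number of domains
# in the query's superfamily, including the query itself).
Hit = Tuple[float, bool, int]


def weighted_roc(hits: Iterable[Hit]) -> List[Tuple[float, float]]:
    """Return cumulative (weighted FP, weighted TP) points, best E-value first."""
    points: List[Tuple[float, float]] = []
    w_fp = 0.0
    w_tp = 0.0
    for _e_value, is_tp, superfamily_size in sorted(hits, key=lambda h: h[0]):
        others = superfamily_size - 1  # domains other than the query
        if others <= 0:
            # Assumption: a query with no other superfamily member is skipped.
            continue
        weight = 1.0 / others
        if is_tp:
            w_tp += weight
        else:
            w_fp += weight
        points.append((w_fp, w_tp))
    return points


def fdr(w_fp: float, w_tp: float) -> float:
    """False discovery rate on weighted counts, assumed to be FP / (FP + TP)."""
    total = w_fp + w_tp
    return w_fp / total if total > 0 else 0.0
```

Under these assumptions, the FDR = 10% line in the plot corresponds to the points where `fdr(w_fp, w_tp)` equals 0.1.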
Fig. 2 Homology detection benchmark per query. Superfamily level homology detection performances are shown for all-against-all search of the SCOP20 validation set. Mean ROC5 scores for TPs and FPs collected until FDR = 10% in the ROC curve (Fig. 1) are shown. ‘JG’: some FPs are ignored according to the JG standard. See “Results” section for additional details.
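For readers unfamiliar with ROC5, the sketch below computes a per-query ROC_n score using the commonly used definition: the number of TPs ranked above each of the first n FPs, summed and divided by n times the total number of true homologs. The function name and arguments are hypothetical, and the caption does not specify exactly how hits are truncated at FDR = 10%, so the function simply scores whatever ranked list it is given.

```python
from typing import Sequence


def roc_n(labels: Sequence[bool], total_tp: int, n: int = 5) -> float:
    """ROC_n for one query.

    labels:   hit labels in ranked order (True = TP, False = FP).
    total_tp: number of true homologs of the query in the database.
    n:        number of false positives to consider (5 for ROC5).
    """
    if total_tp <= 0 or n <= 0:
        raise ValueError("total_tp and n must be positive")
    tp_seen = 0
    fp_seen = 0
    area = 0
    for is_tp in labels:
        if is_tp:
            tp_seen += 1
        else:
            fp_seen += 1
            area += tp_seen  # TPs ranked above this FP
            if fp_seen == n:
                break
    # If fewer than n FPs were found, treat the missing FPs as ranked
    # after every hit that was found.
    area += (n - fp_seen) * tp_seen
    return area / (n * total_tp)
```

A mean over queries, as in the figure, would then be the average of `roc_n` across all query sequences.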
Fig. 3 Superfamily level homology detection benchmark across database searches of CATH20-SCOP versus CATH20-SCOP. ROC plot for weighted FP versus weighted TP counts up to particular E-values. Each FP or TP is weighted by 1/(number of other domains in the query superfamily). The solid black line represents FDR = 10%. See “Results” section for additional details.
Fig. 4 Alignment quality benchmark for pairwise alignments (n = 588) constructed using sequences in the CATH20-SCOP set. ROC plot for the sum of sensitivity against the sum of (1 − precision) up to varying E-values, taken across all pairwise alignments, where sensitivity = TP/(TP + FN) and precision = TP/(TP + FP).
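The two formulas in this caption can be made concrete with a short Python sketch. It assumes, as an illustration only, that each alignment is represented as a set of aligned residue index pairs and that a TP is a pair shared between the predicted alignment and a reference alignment; the names `Pair` and `alignment_quality` are hypothetical.

```python
from typing import Set, Tuple

# (residue index in sequence A, residue index in sequence B)
Pair = Tuple[int, int]


def alignment_quality(predicted: Set[Pair], reference: Set[Pair]) -> Tuple[float, float]:
    """Return (sensitivity, precision) of a predicted alignment."""
    tp = len(predicted & reference)  # pairs present in both alignments
    fp = len(predicted - reference)  # predicted pairs absent from the reference
    fn = len(reference - predicted)  # reference pairs that were missed
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, precision
```

For example, a predicted alignment that recovers two of four reference pairs and adds one incorrect pair has sensitivity 0.5 and precision 2/3.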
Fig. 5 Running time and maximum memory usage of ten searches against UniRef50+. Time (s) is shown on a log10 scale. ‘LASTn_small’: the UniRef50+ database for LAST was constructed with the ‘-s 7G’ option, so that the LAST search occupies less than 7G of memory.