Maricel G Kann, Sergey L Sheetlin, Yonil Park, Stephen H Bryant, John L Spouge.
Abstract
The sequencing of complete genomes has created a pressing need for automated annotation of gene function. Because domains are the basic units of protein function and evolution, a gene can be annotated from a domain database by aligning domains to the corresponding protein sequence. Ideally, complete domains are aligned to protein subsequences, in a 'semi-global alignment'. In high-throughput annotation applications, however, local alignment, which aligns pieces of domains to subsequences, is more common. It is a mature technique, with the heuristics and accurate E-values required for screening large databases and evaluating the screening results. Hidden Markov models (HMMs) provide an alternative theoretical framework for semi-global alignment, but their use is limited because they lack heuristic acceleration and accurate E-values. Our new tool, GLOBAL, overcomes some limitations of previous semi-global HMMs: it has accurate E-values and the possibility of the heuristic acceleration required for high-throughput applications. Moreover, according to a standard of truth based on protein structure, two semi-global HMM alignment tools (GLOBAL and HMMer) had comparable performance in identifying complete domains, but distinctly outperformed two tools based on local alignment. When searching for complete protein domains, therefore, GLOBAL avoids disadvantages commonly associated with HMMs, yet maintains their superior retrieval performance.
Year: 2007 PMID: 17596268 PMCID: PMC1950549 DOI: 10.1093/nar/gkm414
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 6. The accuracy of P-values for GLOBAL. Figure 6 plots the ratio of the calculated GLOBAL P-value to p on a logarithmic scale against p, where p is the P-value from the simulation. Thus, the horizontal solid black line (shown) corresponds to perfect P-value estimation. The error bars correspond to 1 SEM. (The error bars are asymmetric because of the logarithmic scale. In addition, they were omitted for some points on the right if they included negative values and so could not be plotted on a logarithmic scale.) Figure 6 shows GLOBAL P-value results for three CDs: cd00030 (black triangle), having 8 blocks of lengths 16, 16, 12, 6, 11, 15, 17 and 12; cd00083 (red square), having 2 blocks of lengths 34 and 26; and cd00288 (blue diamond), having 44 blocks of lengths 13, 5, 10, 10, 21, 13, 8, 8, 10, 6, 6, 10, 7, 15, 8, 7, 7, 10, 7, 6, 8, 17, 12, 8, 6, 6, 9, 19, 8, 12, 14, 8, 8, 17, 17, 18, 11, 10, 12, 10, 13, 19, 18 and 13.
Figure 7. The accuracy of P-values for HMMer_semi-global. HMMer_semi-global P-value results for the same three CDs and in the same format as in Figure 6: cd00030 (black triangle); cd00083 (red square); and cd00288 (blue diamond).
Figure 1. Three possible GLOBAL alignments of a CD to a protein sequence. A protein sequence (bottom) and three alternative GLOBAL alignments of a single CD (π1, π2 and π3) (above). The CD consists of three blocks (B1, B2 and B3, shown as yellow, purple and blue rectangles). Each block corresponds to a PSSM, but for simplicity, PSSM scores are not diagrammed. Usually, GLOBAL aligns only some block columns (solid colors) but not others (diagonally striped colors).
Figure 2. A GLOBAL alignment graph Γ. The alignment graph for a fixed protein sequence (on the y-axis) against b = 3 blocks (colored boxes on the x-axis, corresponding to the blocks B1, B2 and B3 in Figure 1). The graph has vertices V = {(i, j): 0 ≤ i ≤ m, 0 ≤ j ≤ n} (circles); its directed edges e have integer weights W(e) (not shown). Dotted black arrows correspond to edges of weight 0, indicating unaligned block columns (eastward edges) and unaligned sequence letters (northward edges). The red path corresponds to an optimal alignment, with the solid red edges indicating block columns aligned to sequence letters, and the dotted red edges (of weight 0) indicating unaligned block columns (eastward edges) and unaligned sequence letters (northward edges).
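The optimal (red) path in a graph of this shape can be found by dynamic programming. The following is a minimal sketch, not GLOBAL's implementation: it assumes every vertex has an eastward edge of weight 0, a northward edge of weight 0 and a diagonal edge with an integer weight, whereas the actual graph Γ may restrict which edges exist. The function name `best_path_weight` and the toy match/mismatch weights are hypothetical; GLOBAL scores diagonal edges with PSSM entries.

```python
from typing import Callable


def best_path_weight(m: int, n: int, weight: Callable[[int, int], int]) -> int:
    """Maximum total weight of a path from vertex (0, 0) to vertex (m, n).

    i indexes block columns (x-axis), j indexes sequence letters (y-axis).
    weight(i, j) is the weight of the diagonal edge into (i, j), i.e. the
    score for aligning block column i to sequence letter j (1-based).
    Assumption: all three edge types exist at every vertex.
    """
    # best[i][j] = best weight of any path from (0, 0) to (i, j).
    # Row 0 and column 0 stay 0: only weight-0 edges reach them.
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i - 1][j],                     # eastward: column i unaligned
                best[i][j - 1],                     # northward: letter j unaligned
                best[i - 1][j - 1] + weight(i, j),  # diagonal: column i aligned to letter j
            )
    return best[m][n]


# Toy example (hypothetical weights, not PSSM scores): +2 match, -1 mismatch.
columns = "ACG"
sequence = "TACGT"


def toy_weight(i: int, j: int) -> int:
    return 2 if columns[i - 1] == sequence[j - 1] else -1


print(best_path_weight(len(columns), len(sequence), toy_weight))  # prints 6
```

In the toy example all three columns align to the letters A, C and G of the sequence, and the flanking letters T are left unaligned along northward edges of weight 0.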
Figure 3. An alignment graph showing the ‘independent alignment approximation’. For a random sequence, the alignment graph in Figure 2 corresponds, under the independent alignment approximation, to the b = 3 alignment graphs in Figure 3. Each of the three graphs in Figure 3 aligns a random sequence (not shown) against one of the b = 3 blocks B1, B2 and B3 from Figure 2. Each of the three alignment matrices shown has the same vertical dimension, the effective length j = {(n + b − 1)!/[(n − 1)! b!]}^(1/b), where n is the sequence length. For example, the sequence in Figure 2 has length 29, so the effective length in Figure 3 is [31!/(28! 3!)]^(1/3) ≈ 16.5. The effective length j is determined by Equation (1), which equates the number of combinations of b = 3 starting points for (ordered) optimal matches in Figure 2 and (independent) optimal matches in Figure 3.
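The effective-length arithmetic in the Figure 3 caption can be checked with a few lines of Python. This is only a sketch of the caption's formula: the function name `effective_length` is hypothetical, and the factorial ratio (n + b − 1)!/[(n − 1)! b!] is written as the equivalent binomial coefficient C(n + b − 1, b).

```python
from math import comb


def effective_length(n: int, b: int) -> float:
    """Effective length j = [(n + b - 1)! / ((n - 1)! b!)] ** (1 / b).

    n is the sequence length and b is the number of blocks.
    The factorial ratio equals the binomial coefficient C(n + b - 1, b).
    """
    return comb(n + b - 1, b) ** (1.0 / b)


# Values from the caption: sequence length 29, b = 3 blocks.
print(comb(31, 3))                         # prints 4495
print(round(effective_length(29, 3), 1))   # prints 16.5
```

With a single block (b = 1) the formula reduces to j = n, so the effective length equals the sequence length.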
Figure 4. LROC curves comparing CD retrieval performances. The LROC curves for GLOBAL (green solid line), HMMer_semi-global (blue dashed line), HMMer_local (red dashed line) and RPS-BLAST (black solid line) up to a 5% false-positive rate.
The LROC score for GLOBAL, HMMer_semi-global, HMMer_local and RPS-BLAST
| | LROC10 000 | LROC50 000 | LROC100 000 | LROC200 000 |
|---|---|---|---|---|
| GLOBAL | 0.181 | 0.224 | 0.260 | 0.313 |
| HMMer_semi-global | 0.185 | 0.224 | 0.254 | 0.299 |
| HMMer_local | 0.169 | 0.194 | 0.213 | 0.239 |
| RPS-BLAST | 0.168 | 0.192 | 0.207 | 0.229 |
The LROC score is given for n = 10 000, 50 000, 100 000 and 200 000 in the pooled retrieval list (see ‘Materials and Methods’ section). These values of n correspond to ∼1, 5, 10 and 20 unrelated CDs per protein query, respectively. All LROCn values shown have an error of ±0.003, as estimated by bootstrap (13).
Figure 5. An EPQ plot, graphing errors per query against E-value. For a given protein query and a particular E-value threshold, a retrieval tool might make ‘errors’ by assigning E-values below the threshold to irrelevant CDs (i.e. CDs not in the query). Figure 5 plots the average number of errors per protein query against the E-value threshold for GLOBAL (green solid line), HMMer_semi-global (blue dashed line), HMMer_local (red dashed line) and RPS-BLAST (black solid line). All curves intersect the y-axis at about 0.02, probably because our structural standard of truth misclassifies as unrelated about 2% of the related pairs in DB_10185 and DB_331_CD.
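The quantity plotted in Figure 5 can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the function name `errors_per_query`, the input layout (a pooled list of `(e_value, is_relevant)` pairs) and the example numbers are all hypothetical, and "below the threshold" is taken as a strict inequality.

```python
from bisect import bisect_left
from typing import Iterable, List, Sequence, Tuple


def errors_per_query(
    hits: Iterable[Tuple[float, bool]],
    num_queries: int,
    thresholds: Sequence[float],
) -> List[float]:
    """Average number of errors per query at each E-value threshold.

    hits: (e_value, is_relevant) pairs pooled over all protein queries.
    An 'error' is an irrelevant CD whose E-value is strictly below the threshold.
    """
    # Sort the E-values of irrelevant hits once, then count by binary search.
    irrelevant = sorted(e for e, is_relevant in hits if not is_relevant)
    return [bisect_left(irrelevant, t) / num_queries for t in thresholds]


# Made-up data: two queries, five pooled hits.
hits = [(0.001, True), (0.005, False), (0.5, False), (2.0, False), (0.02, True)]
print(errors_per_query(hits, num_queries=2, thresholds=[0.01, 1.0, 10.0]))
# prints [0.5, 1.0, 1.5]
```

For a tool with accurate E-values, the errors per query at a given threshold would be expected to track the threshold itself, which is why such a curve is commonly read against the diagonal.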