Jessica H Fong, Aron Marchler-Bauer.
Abstract
BACKGROUND: Domains, evolutionarily conserved units of proteins, are widely used to classify protein sequences and infer protein function. Often, two or more overlapping domain models match the same region of a protein sequence, so a procedure is required to choose the appropriate domain annotation for the protein. Here, we propose a method for assigning NCBI-curated domains from the Conserved Domain Database (CDD) that takes into account the organization of the domains into hierarchies of homologous domain models.
Year: 2008 PMID: 19014584 PMCID: PMC2632666 DOI: 10.1186/1756-0500-1-114
Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500
Figure 1. Domain assignment among closely related domains. (A) Schematic of NCBI-curated hierarchy illustrating domain models from the Ribokinase/pfkB superfamily (cl00192). Each domain model is pictured as a multiple sequence alignment. Parent and child models are defined to share at least one (overlapping) sequence (blue, purple, and green lines). The root domain of this hierarchy represents the whole superfamily and provides information about the conserved core regions and sequence variation within the superfamily. Its many subfamilies include the ribokinase-like subgroups A and D and KdgK. Ideally, a query sequence is labelled by the most specific domain that matches the sequence, and that domain would yield the most significant hit. (B) A partial list of domain hits to query protein [Entrez:BAA97341] with domain accessions, names, and the E-values of their RPS-BLAST alignments to the query sequence. Previously, BAA97341 would have been assigned the domain with the lowest E-value, cd01942. Our proposed method assesses the hit to cd01942 against a pre-computed, domain-specific threshold and determines that the hit with the lowest E-value is not significant enough to be a confident match. Instead, we label the sequence generically by the superfamily of the best-matching domain, cl00192.
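The decision described in panel (B) can be sketched in a few lines of Python. This is only an illustration of the idea, not the authors' implementation: the `Hit` record, the `assign_domain` function, the second accession, and all scores and thresholds below are hypothetical, and the fallback to the superfamily when a domain has no pre-computed threshold is an assumption made here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Hit:
    """One domain hit to a query sequence (hypothetical record layout)."""
    accession: str    # specific domain model, e.g. "cd01942"
    superfamily: str  # superfamily of that model, e.g. "cl00192"
    score: float      # alignment score; higher is better
    evalue: float     # E-value of the alignment; lower is better


def assign_domain(hits: List[Hit], thresholds: Dict[str, float]) -> Optional[str]:
    """Label a query by its best specific domain, or by the superfamily.

    The hit with the lowest E-value is kept as a specific assignment only
    if its score reaches that domain's pre-computed threshold. Otherwise
    the query is labelled generically by the superfamily of that hit.
    """
    if not hits:
        return None
    best = min(hits, key=lambda h: h.evalue)
    threshold = thresholds.get(best.accession)
    # Assumption: a domain without a known threshold is never assigned
    # specifically; the superfamily label is used instead.
    if threshold is not None and best.score >= threshold:
        return best.accession
    return best.superfamily


# Illustrative numbers only; they are not taken from the figure.
example_hits = [
    Hit("cd01942", "cl00192", score=120.0, evalue=1e-30),
    Hit("cd_hypothetical", "cl00192", score=110.0, evalue=1e-27),
]
label = assign_domain(example_hits, {"cd01942": 200.0})
```

With these made-up values the best hit, cd01942, scores below its threshold, so the query receives the generic superfamily label cl00192, mirroring the outcome described for BAA97341.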
Rate of incorrect domain assignment by type of other domain
| (A) Type of incorrect hit | (B) # Sequences with self hit and other hit (# domains) | (C) # Sequences with higher score from other hit (# domains) | (D) Error rate |
| Self only | 23918 (264) | - | - |
| Parent/ancestor | 49934 (2402) | 137 (58) | 0.35% |
| Child/descendant | 9822 (274) | 2135 (129) | 21.8% |
| Other domain in hierarchy | 47362 (2306) | 100 (37) | 0.15% |
| Domain outside hierarchy; sequence not in other domain model | 6747 (506) | 421 (32) | 3.0% |
| Domain outside hierarchy; sequence in other domain model | 736 (66) | 313 (18) | 21.1% |
(A) Classes of incorrect hits, labelled by hierarchical relationship to the specific domain. (B) Number of sequences with a self hit and another hit of the respective category; the number of domains containing these sequences is in parentheses. (C) Number of sequences for which the other domain scores higher than the correct domain; the number of domains containing these sequences is in parentheses. (D) Error rate: the percentage of sequences for which an incorrect domain scores higher than the self hit, averaged over domains containing sequences with both types of hits.
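Because column (D) is averaged over domains, it need not equal column (C) divided by column (B). In the parent/ancestor row, for example, 137/49934 is about 0.27%, while the reported error rate is 0.35%. The sketch below shows one way such a per-domain average could be computed. The function name, the record layout, and the sample data are hypothetical, and an unweighted mean over domains is an assumption made here, since the legend does not specify the weighting.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def mean_per_domain_error_rate(
    records: Iterable[Tuple[str, float, float]]
) -> float:
    """Average, over domains, the percent of sequences mis-ranked.

    Each record is (domain, self_score, other_score) for one sequence
    that has both a self hit and a hit from another domain. A sequence
    counts as an error when the other domain scores strictly higher.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for domain, self_score, other_score in records:
        totals[domain] += 1
        if other_score > self_score:
            errors[domain] += 1
    if not totals:
        return 0.0
    # Compute one rate per domain, then take the unweighted mean.
    rates = [errors[d] / totals[d] for d in totals]
    return 100.0 * sum(rates) / len(rates)


# Made-up data: domain "A" has 1 error in 4 sequences, "B" has 1 in 1.
sample = [
    ("A", 50.0, 60.0),
    ("A", 50.0, 40.0),
    ("A", 50.0, 30.0),
    ("A", 50.0, 20.0),
    ("B", 50.0, 70.0),
]
rate = mean_per_domain_error_rate(sample)
```

For this sample the per-domain rates are 25% and 100%, so the average over domains is 62.5%, whereas pooling all sequences would give 2/5 = 40%.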
Incorrect domain hits with alignment scores above domain-specific thresholds
| (A) Type of incorrect hit | (B) Domains with incorrect hits | (C) No incorrect hits above threshold | (D) With incorrect hits above threshold | (E) Incorrect hits above correct domain threshold | (F) Incorrect hits above hit domain threshold |
| None | 9.0% | 264 | - | - | - |
| All incorrect hits | 91.0% | 1778 | 887 | 9.1% | 25.9% |
| Parent/ancestor | 82.0% | 1838 | 564 | 9.0% | 82.3% |
| Child/descendant | 9.4% | 59 | 215 | 42.8% | 9.1% |
| Other domain in hierarchy | 78.7% | 1939 | 367 | 2.7% | 5.5% |
| Domain outside hierarchy | 17.3% | 444 | 64 | 9.3% | 29.7% |
(A) Classes of incorrect hits, labelled by hierarchical relationship to the specific domain. (B) Percent of domains with each type of incorrect hit to its representative sequences, among a total of 2929 domains. (C) The number of domains without incorrect hits that score above the correct domain's threshold and (D) with incorrect hits that score above the correct domain's threshold. (E) Percent of non-self hits that score above the correct domain's threshold or (F) the hit domain's threshold. In (E-F), the percent is averaged over domains with the respective class of non-self hits.