Armen Abnousi, Shira L Broschat, Ananth Kalyanaraman.
Abstract
BACKGROUND: Identifying conserved regions in protein sequences is a fundamental operation, occurring in numerous sequence-driven analysis pipelines. It is used as a way to decode domain-rich regions within proteins, to compute protein clusters, to annotate sequence function, and to compute evolutionary relationships among protein sequences. A number of approaches exist for identifying and characterizing protein families based on their domains, and because domains represent conserved portions of a protein sequence, the primary computation involved in protein family characterization is identification of such conserved regions. However, identifying conserved regions from large collections (millions) of protein sequences presents significant challenges.
Year: 2016 PMID: 27552220 PMCID: PMC4995020 DOI: 10.1371/journal.pone.0161338
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Structure of small data sets (#1-#10) used for evaluation of de novo detection of conserved regions and for runtime studies.
| Data Set | #sequences | % bacteria | % archaea | % eukaryota |
|---|---|---|---|---|
| #1 | 1,424 | 100% | 0% | 0% |
| #2 | 1,542 | 100% | 0% | 0% |
| #3 | 1,479 | 100% | 0% | 0% |
| #4 | 1,494 | 100% | 0% | 0% |
| #5 | 2,037 | 95.4% | 2.6% | 2.0% |
| #6 | 808 | 93.1% | 3.4% | 3.5% |
| #7 | 2,565 | 63.4% | 1.2% | 35.4% |
| #8 | 2,031 | 48.9% | 0% | 51.1% |
| #9 | 2,138 | 29.5% | 1.7% | 68.8% |
| #10 | 1,938 | 11.4% | 1.8% | 86.8% |
Data Set #1.
| Domain Name | #sequences |
|---|---|
| FAD_binding_9 | 750 |
| FixS | 350 |
| Gas_vesicle | 368 |
| total | 1,424 |
Data Set #10.
| Domain Name | #sequences |
|---|---|
| Has-barrel | 135 |
| EccE | 122 |
| EFhand_Ca_insen | 724 |
| KA1 | 959 |
| total | 1,938 |
Common bacterial protein domains used for construction of data set #11.
| Protein Domain Name | Number of Sequences |
|---|---|
| TOP1Bc | 11,545 |
| CBM_2 | 726 |
| ZnMc | 2,967 |
| ZipA_C | 1,508 |
| HLH | 16 |
| NADH-G_4Fe-4S_3 | 5,476 |
| POLAc | 7,110 |
| PP2Ac | 907 |
| Resolvase | 16,007 |
| S_TKc | 1,519 |
| Endonuclease_NS | 2,435 |
| total | 50,214 |
Performance of NADDA, evaluated against Pfam, when similar domains are present in the training set (repetitive) and when some domains are withheld from the training set (non-repetitive).
| Training set | AC | SN | SP |
|---|---|---|---|
| Repetitive | 83.4% | 96.9% | 44.1% |
| Non-repetitive | 80.5% | 95.7% | 25.3% |
Fig 1. Comparison of results from NADDA, ADDA, and MKDOM2 with Pfam on small data sets.
(AC = accuracy, SN = sensitivity, SP = specificity).
Performance of the algorithm compared to other methods based on Pfam.
| Data Set | Percentage Bacterial | Mean Freq. | Freq. Variance | NADDA AC | NADDA SN | NADDA SP | ADDA AC | ADDA SN | ADDA SP | MKDOM2 AC | MKDOM2 SN | MKDOM2 SP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| #1 | 100% | 5.9 | 169.1 | 64.9% | 61.9% | 73.0% | 82.7% | 99.5% | 3.7% | 45.4% | 35.5% | 72.0% |
| #2 | 100% | 13.4 | 3,386.4 | 65.1% | 61.1% | 73.5% | 78.1% | 99.6% | 4.9% | 36.9% | 25.0% | 61.8% |
| #3 | 100% | 2.5 | 30.2 | 47.8% | 37.2% | 94.5% | 82.7% | 99.4% | 9.3% | 72.7% | 77.5% | 51.4% |
| #4 | 100% | 1.4 | 3.3 | 42.6% | 12.5% | 86.2% | 69.1% | 99.0% | 3.1% | 50.9% | 54.7% | 42.7% |
| #5 | 95.4% | 36.3 | 13,466.3 | 76.3% | 85.7% | 48.9% | 74.6% | 99.9% | 2.2% | 55.8% | 56.0% | 54.9% |
| #6 | 93.1% | 23.7 | 1,833.9 | 78.9% | 79.7% | 67.6% | 93.3% | 99.8% | 6.5% | 25.4% | 23.7% | 47.5% |
| #7 | 63.4% | 25.2 | 9,628.8 | 64.5% | 86.6% | 42.6% | 51.0% | 99.5% | 3.3% | 62.8% | 60.4% | 65.1% |
| #8 | 48.9% | 10.8 | 522.6 | 58.6% | 66.1% | 45.1% | 65.9% | 99.4% | 5.5% | 56.6% | 48.9% | 70.7% |
| #9 | 29.5% | 18.2 | 1,108.6 | 69.1% | 86.3% | 37.4% | 66.1% | 99.5% | 4.2% | 46.5% | 33.7% | 70.3% |
| #10 | 11.4% | 54.8 | 7,183.5 | 65.0% | 90.4% | 29.9% | 59.7% | 99.5% | 3.5% | 62.5% | 56.5% | 70.9% |
| Average | 74.17% | 19.22 | 3,733.26 | 63.3% | 66.7% | 59.9% | 72.3% | 99.5% | 4.6% | 51.5% | 47.1% | 60.7% |
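The AC, SN, and SP columns above can be read through the standard confusion-matrix definitions. The following is a minimal sketch under that assumption; the record does not spell out the formulas, and the per-residue boolean labelling, the `ConfusionCounts` type, and the function names are hypothetical illustrations, not the authors' evaluation code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ConfusionCounts:
    """Per-residue agreement between a prediction and a reference annotation."""
    tp: int  # conserved in both prediction and reference
    fp: int  # predicted conserved, reference says not conserved
    tn: int  # not conserved in both
    fn: int  # predicted not conserved, reference says conserved


def count_residues(predicted: Sequence[bool], reference: Sequence[bool]) -> ConfusionCounts:
    """Tally the four confusion-matrix cells over aligned residue labels."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    tp = fp = tn = fn = 0
    for p, r in zip(predicted, reference):
        if p and r:
            tp += 1
        elif p and not r:
            fp += 1
        elif not p and not r:
            tn += 1
        else:
            fn += 1
    return ConfusionCounts(tp, fp, tn, fn)


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def accuracy(c: ConfusionCounts) -> float:
    return _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn)


def sensitivity(c: ConfusionCounts) -> float:
    return _ratio(c.tp, c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    return _ratio(c.tn, c.tn + c.fp)


if __name__ == "__main__":
    # Hypothetical five-residue example, not data from the paper.
    predicted = [True, True, False, False, True]
    reference = [True, False, False, True, True]
    counts = count_residues(predicted, reference)
    print(counts, accuracy(counts), sensitivity(counts), specificity(counts))
```

Under these definitions, a method that labels nearly every residue as conserved scores a sensitivity close to 100% and a specificity close to 0%, which is the pattern the ADDA columns show.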
Fig 2. Comparison of NADDA with InterPro and Pfam using small data sets.
(AC = accuracy, SN = sensitivity, SP = specificity).
Fig 3. Comparison between the NADDA outputs and InterPro annotations for a few example sequences.
Runtimes of different methods for Data Set #7.
The times reported here are for a serial (single-core) run. As shown in Fig 4, NADDA scales almost linearly in the number of cores, allowing the use of the method on even larger data sets.
| Method | Time (s) |
|---|---|
| NADDA | 49 |
| ADDA | 10,566 |
| MKDOM2 | 456 |
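To make the gap between the methods concrete, the serial runtimes in the table can be expressed relative to a baseline. This is a small sketch using only the three values above; the helper name `relative_runtime` is an assumption for illustration.

```python
# Serial runtimes in seconds for Data Set #7, copied from the table above.
RUNTIMES_S = {"NADDA": 49, "ADDA": 10_566, "MKDOM2": 456}


def relative_runtime(runtimes: dict[str, float], baseline: str) -> dict[str, float]:
    """Return each method's runtime divided by the baseline method's runtime."""
    base = runtimes[baseline]
    return {method: seconds / base for method, seconds in runtimes.items()}


if __name__ == "__main__":
    for method, ratio in relative_runtime(RUNTIMES_S, "NADDA").items():
        print(f"{method}: {ratio:.1f}x the NADDA runtime")
```

On these figures ADDA takes roughly 216 times as long as NADDA and MKDOM2 roughly 9 times as long, for this one data set on a single core.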
Fig 4. Speedup for parallel execution of NADDA on data set #11.
The dotted line shows the ideal linear speedup; the solid line is the actual speedup using NADDA.
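Speedup curves of this kind are conventionally computed as the serial time divided by the parallel time, with the ideal line equal to the core count. The sketch below assumes that convention; the timings in it are hypothetical placeholders, since this record does not list the measured values behind the figure.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup on p cores: serial runtime divided by parallel runtime."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, cores: int) -> float:
    """Parallel efficiency: achieved speedup as a fraction of the ideal (cores)."""
    return speedup(t_serial, t_parallel) / cores


if __name__ == "__main__":
    # HYPOTHETICAL timings in seconds, keyed by core count; not from the paper.
    timings = {1: 100.0, 2: 52.0, 4: 27.0, 8: 14.5}
    t1 = timings[1]
    for cores, t in sorted(timings.items()):
        print(f"{cores} cores: speedup {speedup(t1, t):.2f} "
              f"(ideal {cores}), efficiency {efficiency(t1, t, cores):.2f}")
```

An efficiency that stays close to 1.0 as cores are added corresponds to the solid line tracking the dotted ideal line.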
Data Set #2.
| Domain Name | #sequences |
|---|---|
| Caa3_CtaG | 499 |
| Dak1_2 | 699 |
| dCache_3 | 351 |
| total | 1,542 |
Data Set #3.
| Domain Name | #sequences |
|---|---|
| XisI | 213 |
| NapB | 179 |
| EutN_CcmL | 330 |
| LptC | 823 |
| total | 1,479 |
Data Set #4.
| Domain Name | #sequences |
|---|---|
| EpsG | 313 |
| RcnB | 207 |
| FlgN | 542 |
| Lipoprotein_17 | 353 |
| LolA_like | 344 |
| total | 1,494 |
Data Set #5.
| Domain Name | #sequences |
|---|---|
| NA37 | 384 |
| DbpA | 1,232 |
| AAA_PrkA | 426 |
| total | 2,037 |
Data Set #6.
| Domain Name | #sequences |
|---|---|
| NA37 | 384 |
| AAA_PrkA | 426 |
| total | 808 |
Data Set #7.
| Domain Name | #sequences |
|---|---|
| Rad4 | 693 |
| YccF | 1,034 |
| DbpA | 1,232 |
| total | 2,565 |
Data Set #8.
| Domain Name | #sequences |
|---|---|
| RbsD_FucU | 567 |
| Vasohibin | 180 |
| NA37 | 384 |
| Ndc1_Nup | 431 |
| total | 2,031 |
Data Set #9.
| Domain Name | #sequences |
|---|---|
| AAA_PrkA | 426 |
| Dnal_N | 140 |
| FTCD | 262 |
| FACT-Spt16_Nlob | 433 |
| SAD_SRA | 890 |
| total | 2,138 |