Abstract
MOTIVATION: With rapidly increasing volumes of biological sequence data, the functional analysis of new sequences in terms of similarities to known protein families challenges classical bioinformatics.
Year: 2014 PMID: 25540185 PMCID: PMC4410661 DOI: 10.1093/bioinformatics/btu843
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. UProC workflow and Mosaic Matching sketch. For DNA input sequences, all ORFs of at least 60 bp are first identified, filtered and translated. The protein sequences are then analysed with the Mosaic Matching algorithm, which compares all oligopeptides in the query sequence with oligopeptides from reference sequences in the database. From all matching reference oligopeptides with the same family label, a maximum substitution score is computed for each residue and summed over the whole sequence to give the total Mosaic Matching score. If this score exceeds a length-dependent noise threshold, the protein hit and the corresponding score are written to the output. The substitution scores that result from oligopeptide comparisons using the PSSM are indicated by heatmap color (red: high, blue: low). The example shows all matching oligopeptides that contribute to the total score of Pfam family PF01370.
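The scoring step described in the Fig. 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the UProC implementation: the word length `k`, the rule that a reference word "matches" when its summed residue scores are positive, the `residue_score` callback standing in for the PSSM, and the `noise_threshold` callback are all hypothetical, and the brute-force comparison replaces whatever lookup structure the tool actually uses.

```python
from collections import defaultdict


def mosaic_matching_scores(query, references, residue_score, k=3):
    """Return {family: total score} for one protein query.

    references    -- iterable of (oligopeptide, family_label), each of length k
    residue_score -- hypothetical stand-in for the PSSM:
                     f(word_position, query_residue, ref_residue) -> float
    """
    # best[family][i] = highest substitution score seen for query residue i
    best = defaultdict(dict)
    for start in range(len(query) - k + 1):
        q_word = query[start:start + k]
        for r_word, family in references:
            scores = [residue_score(i, q, r)
                      for i, (q, r) in enumerate(zip(q_word, r_word))]
            # Assumption: a reference word only counts as a match
            # when its summed residue scores are positive.
            if sum(scores) <= 0:
                continue
            for i, s in enumerate(scores):
                pos = start + i
                if s > best[family].get(pos, float("-inf")):
                    best[family][pos] = s
    # Sum the per-residue maxima over the whole sequence, per family.
    return {family: sum(per_pos.values()) for family, per_pos in best.items()}


def report_hits(scores, query_length, noise_threshold):
    """Keep families whose score exceeds the length-dependent threshold."""
    cutoff = noise_threshold(query_length)
    return {family: s for family, s in scores.items() if s > cutoff}


# Toy data: identity scoring and made-up family labels.
identity = lambda i, q, r: 1.0 if q == r else -1.0
refs = [("ACD", "PF_A"), ("DEF", "PF_A"), ("EFG", "PF_B")]
scores = mosaic_matching_scores("ACDEFG", refs, identity)
hits = report_hits(scores, len("ACDEFG"), lambda n: 0.6 * n)
```

In this toy run, two overlapping words labelled `PF_A` jointly cover five of the six residues, so their per-residue maxima add up to a higher total than the single `PF_B` word. This is the "mosaic" effect the caption describes.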
Fig. 2. Contributions of different word positions to the PSSM in terms of the SSW obtained from regularized least-squares classifier training (see text).
Test data subsets from HMP, GOS and GNHM with number of genomes, number of simulated reads and the percentage of reads with annotated Pfam domains, i.e. the fraction of positive test cases.
| Source | Subset | No. genomes | No. reads | % reads with Pfam domains |
|---|---|---|---|---|
| HMP | Airways | 51 | 2 268 424 | 66.7 |
| | Blood | 42 | 2 057 852 | 67.4 |
| | GI tract | 363 | 22 224 068 | 61.8 |
| | Oral | 193 | 7 749 640 | 61.0 |
| | Skin | 114 | 5 724 140 | 66.0 |
| | UG tract | 132 | 4 494 476 | 64.4 |
| GOS | 100 bp | - | 742 527 | 76.6 |
| | 150 bp | - | 492 466 | 81.3 |
| | 200 bp | - | 361 977 | 84.9 |
| | 250 bp | - | 279 125 | 87.7 |
| GNHM | 100 bp | - | 415 630 | 68.9 |
| | 150 bp | - | 272 129 | 74.6 |
| | 200 bp | - | 197 806 | 79.0 |
| | 250 bp | - | 150 548 | 82.8 |
Sensitivity (TPR) of protein domain detection on simulated short reads with best value in bold face
| Source | Subset | HMMER | RPS-BLAST | UProC |
|---|---|---|---|---|
| HMP | Airways | 52.4 | 48.6 | |
| | Blood | 52.3 | 48.5 | |
| | GI tract | 49.2 | 46.0 | |
| | Oral | 49.3 | 46.0 | |
| | Skin | 48.6 | 45.3 | |
| | UG tract | 49.2 | 45.5 | |
| GOS | 100 bp | 47.5 | 44.8 | |
| | 150 bp | 67.4 | 61.4 | |
| | 200 bp | 77.3 | 70.6 | |
| | 250 bp | 76.7 | 80.9 | |
| GNHM | 100 bp | 42.8 | 39.6 | |
| | 150 bp | 56.7 | 58.0 | |
| | 200 bp | 66.7 | 62.5 | |
| | 250 bp | 73.1 | 65.7 | |
Specificity (PPV) of protein domain detection on simulated short reads with best value in bold face
| Source | Subset | HMMER | RPS-BLAST | UProC |
|---|---|---|---|---|
| HMP | Airways | 97.5 | 97.5 | |
| | Blood | 97.9 | 98.1 | |
| | GI tract | 97.1 | 97.2 | |
| | Oral | 97.2 | 97.3 | |
| | Skin | 97.1 | 97.5 | |
| | UG tract | 97.6 | 97.8 | |
| GOS | 100 bp | 98.6 | 98.1 | |
| | 150 bp | 97.7 | 98.1 | |
| | 200 bp | 96.2 | 96.5 | |
| | 250 bp | 94.1 | 94.3 | |
| GNHM | 100 bp | 97.8 | 97.5 | |
| | 150 bp | 97.4 | 97.4 | |
| | 200 bp | 96.3 | 96.3 | |
| | 250 bp | 95.0 | 94.7 | |
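
The two measures reported above follow the standard definitions: sensitivity is TPR = TP / (TP + FN), and the measure the tables label specificity is the positive predictive value, PPV = TP / (TP + FP). A minimal sketch of the arithmetic, expressed in percent as in the tables, is shown below. The function name and the example counts are hypothetical, and how true and false positives are counted per read and domain is an assumption not specified here.

```python
def tpr_ppv(tp, fp, fn):
    """Return (TPR, PPV) in percent from raw prediction counts.

    tp -- predictions that agree with an annotated domain
    fp -- predictions without a matching annotation
    fn -- annotated domains that were not predicted
    """
    tpr = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    ppv = 100.0 * tp / (tp + fp) if (tp + fp) else 0.0
    return tpr, ppv


# Hypothetical counts, chosen only to illustrate the calculation.
example_tpr, example_ppv = tpr_ppv(tp=50, fp=0, fn=50)
```

Because PPV ignores false negatives, a method can reach a high PPV while still missing many domains, which is why the two tables are best read together.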