Rohita Sinha, Jennifer Clarke, Andrew K Benson.
Abstract
BACKGROUND: Functional assignments for short-read metagenomic data pose a significant computational challenge because of the perceived unpredictability of alignment behavior and the inability to infer useful functional information from translated protein fragments (peptides). To address this problem, we examined the predictability of short-peptide alignments by systematically studying the alignment behavior of large sets of short peptides generated from well-characterized proteins as well as hypothetical proteins in the KEGG database.
Year: 2015 PMID: 26691573 PMCID: PMC4687345 DOI: 10.1186/s12864-015-2272-z
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Description of test case notations used in the current study
| Test case type | Description |
|---|---|
| Type1_11–81aa | Peptides derived from well-characterized proteins. Eight independent test cases, with peptide lengths ranging from 11 to 81 amino acids. |
| Type2.1–2.3 | Peptides derived from uncharacterized proteins; test cases are classified by the degree of sequence similarity between these proteins and well-annotated proteins. |
| Type2.1 | Coverage <70 % and identity <70 % |
| Type2.2 | Coverage <70 % and identity <50 % |
| Type2.3 | Coverage <70 % and identity <35 % |
| Type3 | Simulated data to test the “Frequency weighted method” |
Fig. 1. Alignment profiles of short peptides to parental vs non-parental KO-families. Comparison of alignment behavior when short peptides align to members of their parent KO-families (left panel) and when they align to members of non-parent KO-families (right panel). Although a high proportion of hits to members of the same KO-family have alignment identity above 80 %, a major fraction of hits in both panels remains in the 40–80 % range, which makes it difficult to discriminate between true and false positive hits. Hexbin colors are proportional to the number of members within each bin; the relationship between member frequency and color is shown in the arrow-headed color bar.
Fig. 2. Three-dimensional plot of alignment length, percent identity and hit frequency of type 1 peptides. The plot is colored to differentiate hits to the parental KO-family (red) from hits to non-parental KO-families (blue). Data for the 61-mer peptides are shown.
Alignment behavior of short-peptides (peptide length 11aa–81aa)
| Peptide length (aa) | Total peptides (a) | Fraction aligned, n (%) (b) | Total number of BLAST hits (c) | Total aligned to same KO, n (%) (d) | Total aligned to different KO, n (%) (e) | Best hits aligned to same KO, n (%) (f) | Best hits aligned to different KO, n (%) (g) |
|---|---|---|---|---|---|---|---|
| 11 | 18,981 | 4196 (22.1) | 57,705 | 48,677 (84.3) | 9028 (15.6) | 3964 (94.4) | 232 (5.6) |
| 21 | 18,978 | 16,482 (86.8) | 1,200,295 | 815,915 (67.9) | 384,380 (32.0) | 15,366 (93.2) | 1116 (6.8) |
| 31 | 18,960 | 18,104 (95.4) | 2,348,640 | 1,409,131 (59.9) | 939,509 (40.0) | 16,449 (90.8) | 1655 (9.2) |
| 41 | 18,900 | 18,626 (98.5) | 3,181,987 | 1,784,051 (56.0) | 1,397,936 (43.9) | 16,855 (90.5) | 1771 (9.5) |
| 51 | 18,807 | 18,728 (99.5) | 3,829,200 | 2,050,119 (53.5) | 1,779,081 (46.4) | 16,912 (90.3) | 1816 (9.7) |
| 61 | 18,723 | 18,701 (99.8) | 4,266,683 | 2,199,736 (51.5) | 2,066,947 (48.4) | 16,986 (90.8) | 1715 (9.2) |
| 71 | 18,564 | 18,554 (99.9) | 4,598,168 | 2,313,719 (50.0) | 2,284,449 (49.6) | 16,907 (91.1) | 1647 (8.9) |
| 81 | 18,396 | 18,391 (99.9) | 4,839,697 | 2,387,627 (49.3) | 2,452,070 (50.6) | 16,765 (91.1) | 1626 (8.9) |
(a) Total number of short peptides used in the study
(b) Total fraction of peptides having a significant alignment with at least one other protein (self-hits are not considered)
(c) Total count of significant BLAST hits
(d) Total count of significant BLAST hits to the same KO group (percentage)
(e) Total count of significant BLAST hits to a different KO group (percentage)
(f) Total count of best BLAST hits aligning to the same KO group (percentage)
(g) Total count of best BLAST hits aligning to a different KO group (percentage)
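
The percentages in the table above can be recomputed from the raw counts. The short Python sketch below does this for two columns: the share of all significant hits that fall in the same KO group, and the share of best hits that do. The counts are copied from the table; the names `ROWS`, `same_ko_percentages` and `results` are illustrative and not part of the original study.

```python
# Counts copied from the table above, one tuple per peptide length:
# (peptide length, total BLAST hits, hits to same KO,
#  best hits to same KO, best hits to different KO)
ROWS = [
    (11, 57_705, 48_677, 3_964, 232),
    (21, 1_200_295, 815_915, 15_366, 1_116),
    (31, 2_348_640, 1_409_131, 16_449, 1_655),
    (41, 3_181_987, 1_784_051, 16_855, 1_771),
    (51, 3_829_200, 2_050_119, 16_912, 1_816),
    (61, 4_266_683, 2_199_736, 16_986, 1_715),
    (71, 4_598_168, 2_313_719, 16_907, 1_647),
    (81, 4_839_697, 2_387_627, 16_765, 1_626),
]


def same_ko_percentages(total_hits, same_ko_hits, best_same, best_diff):
    """Return (% of all hits in the same KO, % of best hits in the same KO)."""
    all_hits_pct = 100.0 * same_ko_hits / total_hits
    # Each aligned peptide has exactly one best hit, so the denominator
    # is the number of aligned peptides.
    best_hits_pct = 100.0 * best_same / (best_same + best_diff)
    return all_hits_pct, best_hits_pct


results = []
for length, total, same, best_same, best_diff in ROWS:
    all_pct, best_pct = same_ko_percentages(total, same, best_same, best_diff)
    results.append((length, all_pct, best_pct))
    print(f"{length:>2} aa: all hits {all_pct:5.1f} %  best hits {best_pct:5.1f} %")
```

Read this way, the share of all hits landing in the same KO group falls steadily as peptides get longer, while the share of best hits landing in the same KO group stays above 90 % at every length.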
Fig. 3. True KO-family abundances compared with frequency-weighted read counts. Performance of the ‘Frequency weighted read count’ method when the test case comprises peptides originating from proteins that have family members in the ‘Reference set’.
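
The caption names the ‘Frequency weighted read count’ method without defining it. As a rough illustration only, the sketch below assumes one plausible reading: each peptide carries a total weight of 1, split across KO-families in proportion to how many of its significant hits fall in each family. This weighting rule, the function name `frequency_weighted_counts`, and the example peptide and KO identifiers are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter, defaultdict


def frequency_weighted_counts(hits_per_peptide):
    """Sum fractional KO assignments over peptides.

    hits_per_peptide maps a peptide id to a list of KO ids, one entry per
    significant hit. ASSUMPTION: a peptide's unit weight is divided among
    KO-families in proportion to its hit counts in each family.
    """
    abundance = defaultdict(float)
    for ko_hits in hits_per_peptide.values():
        if not ko_hits:
            continue  # peptides without hits contribute nothing
        freq = Counter(ko_hits)
        total = sum(freq.values())
        for ko, n in freq.items():
            abundance[ko] += n / total
    return dict(abundance)


# Hypothetical toy input; the KO ids are placeholders.
example = {
    "pep1": ["K00001", "K00001", "K00001", "K00002"],
    "pep2": ["K00002"],
    "pep3": [],
}
weighted = frequency_weighted_counts(example)
print(weighted)
```

Under this assumed rule, the weighted counts sum to the number of peptides with at least one hit, so they can be compared directly with true per-family peptide counts.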
Fig. 4. Comparison of alignment-identity profiles of the best hits of peptides from known and unknown protein-space. The identity profiles of best hits for peptides from uncharacterized proteins (first three boxplots, pink) are compared with the profile for peptides from proteins that have family members in the reference protein set (green).
Fig. 5. Effect of peptides from “unknown protein-space” on the “frequency-weighted” abundance profiles of proteins from “known space”. The artificial boost in KO-family abundances is shown using the output of the ‘Frequency weighted read count’ method when ‘Test case type2.3’ (peptides from unknown space) is added to ‘Selected_4K_KO_Hits’. The red line reflects the true correspondence values.
Fig. 6. Corrected abundance profiles of KO-families using the “Filter-enabled frequency weighted” method. The artificial boost in KO-family abundance values caused by peptides from “unknown protein space” is corrected by extending our frequency-weighted method so that it filters out peptides with characteristics of those from hypothetical proteins. The new abundance profile for the same test data used in Fig. 5 is plotted here, so the two plots can be compared directly. The red line reflects the true correspondence values.