Erez Persi, Uri Weingart, Shiri Freilich, David Horn.
Abstract
BACKGROUND: Taxa counting is a major problem in the analysis of metagenomic data. The most popular method relies on analysis of 16S rRNA sequences, but some studies also employ protein-based analyses. It would be advantageous to have a method that is applicable directly to short sequences of the kind extracted from samples in modern metagenomic research. This is achieved by the technique proposed here.
Year: 2012 PMID: 22325056 PMCID: PMC3319421 DOI: 10.1186/1471-2164-13-65
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Probability of inequality. Probability of inequality vs. window size, as deduced from S61 enzymes of bacteria in Swiss-Prot. The curves are ordered according to the legend of higher-lower hierarchies, testing for probabilities of inequality of sequences belonging to two different members of the lower hierarchy within the same higher hierarchy.
Taxa-counting example of Rios Mesquites metagenomic short reads carrying a common SP.
| Index | Short read (translated to amino-acid string) | |
|---|---|---|
| 1 | ||
| 2 | ||
| 3 | ||
| 4 | ||
| 5 | ||
| 6 | ||
| 7 | ||
| 8 | ||
| 9 | ||
| 10 | ||
| 11 | ||
| 12 | ||
| 13 | ||
| 14 | ||
| 3 = 3U10U13 | ||
| 1U6U8 | ||
The first 14 rows display the short reads. The two last rows are fused strings of short reads used for solving the minimal chromatic number problem. Entries in the last column indicate 10 independent sequences (inconsistent with each other).
Consistency and inconsistency relations among the 6 sequences of Table 1 that require further analysis.
| | 1 | 3 | 6 | 10 | 13 | 8 |
|---|---|---|---|---|---|---|
| 1 | ~ | ~ | ~ | ~ | ~ | ~ |
| 3 | ~ | ~ | ~ | ~ | ~ | X |
| 6 | ~ | ~ | ~ | ~ | ~ | ~ |
| 10 | ~ | ~ | ~ | ~ | ~ | X |
| 13 | ~ | ~ | ~ | ~ | ~ | X |
| 8 | ~ | X | ~ | X | X | ~ |
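The X entries in the table above can be read as edges of an inconsistency graph, and grouping mutually consistent reads into fused strings then becomes a graph-colouring problem. The following is a minimal sketch of that idea, assuming a simple greedy colouring; the source does not state which solver was used, and the names `INCONSISTENT`, `conflicts` and `greedy_groups` are hypothetical. Greedy colouring only gives an upper bound on the chromatic number in general.

```python
# Inconsistency pairs read off the table above (cells marked X).
INCONSISTENT = {(3, 8), (8, 10), (8, 13)}
READS = [1, 3, 6, 10, 13, 8]


def conflicts(a, b):
    """True if reads a and b are marked inconsistent (X) in the table."""
    return (a, b) in INCONSISTENT or (b, a) in INCONSISTENT


def greedy_groups(reads):
    """Place each read in the first group it is consistent with.

    Assumption: a greedy colouring stands in for the minimal
    chromatic number solution; it is not guaranteed to be minimal.
    """
    groups = []
    for r in reads:
        for g in groups:
            if not any(conflicts(r, other) for other in g):
                g.append(r)
                break
        else:
            # No existing group can take this read: open a new one.
            groups.append([r])
    return groups


groups = greedy_groups(READS)
print(len(groups), groups)
```

For these six reads the sketch yields two groups, the same number as the two fused strings in the last rows of the short-read table. The split itself is not unique, since reads 1 and 6 are consistent with every other read.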
Figure 2. Length distributions in the prevalent gene set. Histograms of length distributions of putative proteins of EC = 6.1.1.9, 6.1.1.3 and 6.1.1.18 among the prevalent gene set. Upper figures: sequences carrying the leading SP in all three cases. Lower figures: sequences carrying all other SPs. Vertical dashed lines delineate the range of large sequences. The numbers of the latter are written to the right of the dashed lines. These numbers are added to those of the upper figures in order to form lower bound estimates of taxa counts.
64 taxa analyzed by conventional 16S rRNA BLAST analysis and by the new peptide-based approach.
| Kegg | Taxon Name | Blast 16S | Peptides S61 | Specific Peptide |
|---|---|---|---|---|
| ban | Bacillus anthracis Ames | 100% | 100% | ISRQLWWGH |
| bar | Bacillus anthracis Ames 0581 | | | ISRQLWWGH |
| cgb | Corynebacterium glutamicum ATCC 13032 Bielefeld | 100% | 100% | ISRQLWWGH |
| cgl | Corynebacterium glutamicum ATCC 13032 Kyowa Hakko | | | ISRQLWWGH |
| sar | Staphylococcus aureus MRSA252 | 100% | > 99% | ISRQLWWGH |
| sas | Staphylococcus aureus MSSA476 | | | ISRQLWWGH |
| sbl | Shewanella baltica OS155 | 100% | > 99% | ISRQLWWGH |
| sbm | Shewanella baltica OS185 | | | ISRQLWWGH |
| ypa | Yersinia pestis Antiqua | 100% | 100% | ISRQLWWGH |
| ypg | Yersinia pestis Angola | | | ISRQLWWGH |
| ypp | Yersinia pestis Pestoides | | | ISRQLWWGH |
| bfr | Bacteroides fragilis YCH46 | > 99% | 100% | ISRQLWWGH |
| bfs | Bacteroides fragilis NCTC9343 | | | ISRQLWWGH |
| cbf | Clostridium botulinum F | > 99% | 100% | ISRQLWWGH |
| cbo | Clostridium botulinum A | | | ISRQLWWGH |
| cpa | Chlamydophila pneumoniae AR39 | > 99% | 100% | ISRQLWWGH |
| cpj | Chlamydophila pneumoniae J138 | | | ISRQLWWGH |
| cta | Chlamydia trachomatis serovar A | > 99% | 100% | ISRQLWWGH |
| ctr | Chlamydia trachomatis serovar D | | | ISRQLWWGH |
| eci | Escherichia coli UTI89 UPEC | > 99% | > 99% | ISRQLWWGH |
| eco | Escherichia coli K-12 MG1655 | | | ISRQLWWGH |
| llc | Lactococcus lactis subsp cremoris SK11 | > 99% | > 99% | ISRQLWWGH |
| llm | Lactococcus lactis subsp cremoris MG1363 | | | ISRQLWWGH |
| mtc | Mycobacterium tuberculosis CDC1551 | > 99% | 100% | ISRQLWWGH |
| mtf | Mycobacterium tuberculosis F11 | | | ISRQLWWGH |
| sag | Streptococcus agalactiae 2603 serotype V | > 99% | > 99% | ISRQLWWGH |
| san | Streptococcus agalactiae NEM316 serotype III | | | ISRQLWWGH |
| stc | Streptococcus thermophilus CNRZ1066 | > 99% | > 99% | ISRQLWWGH |
| ste | Streptococcus thermophilus LMD-9 | | | ISRQLWWGH |
| syd | Synechococcus sp CC9605 | > 99% | V | ISRQLWWGH |
| sye | Synechococcus sp CC9902 | | V | ISRQLWWGH |
| bbr | Bordetella bronchiseptica | > 99% | V | ISRQLWWGH |
| bpe | Bordetella pertussis | | V | ISRQLWWGH |
| pmf | Prochlorococcus marinus MIT 9303 | V | V | ISRQLWWGH |
| pmh | Prochlorococcus marinus MIT 9215 | V | V | ISRQLWWGH |
| tle | Thermotoga lettingae | V | V | ISRQLWWGH |
| tma | Thermotoga maritima | V | V | ISRQLWWGH |
| bth | Bacteroides thetaiotaomicron | V | V | ISRQLWWGH |
| cau | Chloroflexus aurantiacus | V | V | ISRQLWWGH |
| cdi | Corynebacterium diphtheriae | V | V | ISRQLWWGH |
| cef | Corynebacterium efficiens | V | V | ISRQLWWGH |
| cha | Campylobacter hominis ATCC BAA-381 | V | V | ISRQLWWGH |
| cje | Campylobacter jejuni NCTC11168 | V | V | ISRQLWWGH |
| cmu | Chlamydia muridarum | V | V | ISRQLWWGH |
| cph | Chlorobium phaeobacteroides | V | V | ISRQLWWGH |
| det | Dehalococcoides ethenogenes | V | V | ISRQLWWGH |
| dge | Deinococcus geothermalis | V | V | ISRQLWWGH |
| dra | Deinococcus radiodurans | V | V | ISRQLWWGH |
| gfo | Gramella forsetii | V | V | ISRQLWWGH |
| rpd | Rhodopseudomonas palustris BisB5 | V | V | ISRQLWWGH |
| aav | Acidovorax avenae | V | V | ISRQLWWGH |
| abu | Arcobacter butzleri | V | V | ISRQLWWGH |
| ade | Anaeromyxobacter dehalogenans | V | V | ISRQLWWGH |
| atu | Agrobacterium tumefaciens C58 UWashDupont | V | V | ISRQLWWGH |
| rpa | Rhodopseudomonas palustris CGA009 | V | V | ISRQLWWGH |
| aae | Aquifex aeolicus | V | V | FFWVARMIM |
| cch | Chlorobium chlorochromatii | V | V | FFWVARMIM |
| chu | Cytophaga hutchinsonii | V | V | FFWVARMIM |
| fnu | Fusobacterium nucleatum | V | V | FFWVARMIM |
| rba | Rhodopirellula baltica | V | V | DTWFSSALWP |
| gme | Geobacter metallireducens | V | V | DTWFSSALWP |
| aau | Arthrobacter aurescens | V | V | DDNGLPTER |
| mga | Mycoplasma gallisepticum | V | V | DTWFSSALWP |
| mpe | Mycoplasma penetrans | V | V | ISRQLWWGH |
KEGG codes and taxon names are given in the first two columns. The last three columns display the results of the two methods. Blast 16S analysis (3rd column): strains of the same species that were both fully matched (100%) to the same 16S rRNA query and hence cannot be distinguished. Additional taxa that cannot be distinguished at an identity threshold of 99% are also indicated. 'V' represents taxa distinguished from each other with an identity level > 99%. S61 peptide analysis (4th column): strains of the same species that were fully matched (100%), i.e. no difference exists between the corresponding two sequences within the tested range. Additional strains that cannot be distinguished if up to 4 amino-acid differences are allowed between the sequences (99%) are also indicated. 'V' represents taxa distinguished from each other with an identity level > 99%. Last column: SP identification of the sequence.
Figure 3. Length distributions of putative proteins. Histogram of the length of sections of enzymes. Lengths derived from the 1488 PPs that contain a hit of SP = ISRQLWWGH (EC = 6.1.1.9) before (A) and after (B) obtaining a solution of the minimal number of mutually inconsistent sequences by the taxa-counting algorithm. Lengths derived from the 1961 PPs that contain a hit of SP = TRFPPEPNGYLH (EC = 6.1.1.18), before (C) and after (D) the algorithm.
Figure 4. Hamming distances among enzymes in UniProt data. Statistics of Hamming distances between 6.1.1.9 sequences (A-C) and between 6.1.1.18 sequences (D-F) in UniProtKB data. Top: differences between strains of the same species (A, D). Middle: differences between species of the same genus (B, E). Bottom: differences between genera in the same family (C, F). Insets in (A, B) display results for low distances and demonstrate the lack of a clear cut-off between species and strains for 6.1.1.9, in contrast with the clear cut-off at distance 2 for 6.1.1.18 enzymes.
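For readers unfamiliar with the measure, the Hamming distance between two equal-length sequences is the number of positions at which they differ. A minimal sketch follows; the function name `hamming` and the second example string (a made-up single-residue variant of the SP) are illustrative assumptions, not data from the study.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


# Identical windows have distance 0; one substitution gives distance 1.
same = hamming("ISRQLWWGH", "ISRQLWWGH")
one_off = hamming("ISRQLWWGH", "ISRQLWYGH")  # hypothetical variant
print(same, one_off)
```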
Figure 5. Taxa counting. Taxa counting for leading SPs as a function of minimal distance d for (A) SYV 6.1.1.9 and (B) SYQ 6.1.1.18 for the fused strings that are full protein candidates. Based on the statistics of Swiss-Prot displayed in Fig. 4, we estimate from (B) that 400 of the total count may be due to different strains, exhibiting distance ≤ 2. In comparing with Fig. 4, note that the latter starts with a bin of zero difference between sequences, whereas here the first bin refers to differences larger than or equal to 1.
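One way to picture a count as a function of the minimal distance d is to keep only sequences that lie at Hamming distance ≥ d from every sequence already kept. The sketch below assumes this greedy selection purely for illustration; the caption does not specify the procedure actually used, and `count_at_min_distance` and the toy sequences are hypothetical.

```python
def hamming(a, b):
    """Number of differing positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def count_at_min_distance(seqs, d):
    """Greedy count of sequences that are pairwise at distance >= d.

    Assumption: a sequence is counted only if it differs from every
    previously counted sequence in at least d positions.
    """
    kept = []
    for s in seqs:
        if all(hamming(s, k) >= d for k in kept):
            kept.append(s)
    return len(kept)


toy = ["AAAA", "AAAT", "AATT", "TTTT"]  # made-up equal-length strings
counts = {d: count_at_min_distance(toy, d) for d in (1, 2, 3)}
print(counts)
```

On the toy input the count drops from 4 at d = 1 to 3 at d = 2 and 2 at d = 3, so raising d merges close variants in this example. The greedy result depends on input order.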
Figure 6. Taxa counting from short read data. Taxa count analysis for SYV (EC: 6.1.1.9) based on the 1780 Illumina short reads sharing the common leading SP. A) Counts are displayed as a function of the minimal Hamming distance between fused strings. Different curves represent different sample sizes varying (from bottom to top) from S = 200 to 1600. Mean values and errors on the mean are calculated from 20 random realizations at each sample size. B) Mean counts as a function of sample size S, for d ≥ 1-10 (top to bottom). C) Mean counts as a function of sample size for d ≥ 1-5 for the data in B (solid) and for artificial data (dashed) constructed from the real data into which artificial errors were introduced with probability of 1% per amino acid.
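The artificial data in panel C can be pictured as reads in which each amino acid is independently replaced with a small probability. The following is a minimal sketch under stated assumptions: substitutions are drawn uniformly from the other 19 standard amino acids, which the caption does not specify, and `inject_errors` is a hypothetical name.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def inject_errors(seq, p=0.01, rng=None):
    """Replace each residue with a different one with probability p.

    Assumption: the replacement is uniform over the other residues.
    """
    rng = rng or random.Random()
    out = []
    for aa in seq:
        if rng.random() < p:
            out.append(rng.choice([c for c in AMINO_ACIDS if c != aa]))
        else:
            out.append(aa)
    return "".join(out)


read = "ISRQLWWGH"
noisy = inject_errors(read, p=0.01, rng=random.Random(0))
print(read, noisy)
```

Passing a seeded `random.Random` makes a realization reproducible, which is convenient when averaging over repeated random samples.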
Figure 7. Matching of raw short reads to fused contigs and UniProt databases. All short reads of length 24 amino acids containing the leading SP of 6.1.1.9 were matched to the set of all 1009 fused contigs as well as to the enzyme databases of Swiss-Prot and UniProt. Values are presented for different numbers of allowed mismatches.
Analysis of SP hits on the E. coli genome.
| L | FP | TP | error | EFP |
|---|---|---|---|---|
| 7 | 965 | 7,485 | 0.11 | 413 |
| 8 | 237 | 4,235 | 0.053 | 49 |
| 9 | 100 | 2,346 | 0.041 | |
| 10 | 68 | 1,361 | 0.048 | 8 |
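The column headers are not defined in this excerpt. Assuming FP and TP denote false-positive and true-positive hits for SPs of length L, the tabulated error values are numerically consistent with FP / (FP + TP). The short check below illustrates this reading; it is an inference from the numbers, not a definition given by the source.

```python
# (FP, TP, tabulated error) per SP length L, copied from the table.
ROWS = {
    7: (965, 7485, 0.11),
    8: (237, 4235, 0.053),
    9: (100, 2346, 0.041),
    10: (68, 1361, 0.048),
}


def error_rate(fp, tp):
    """Assumed definition: fraction of all hits that are false."""
    return fp / (fp + tp)


computed = {L: error_rate(fp, tp) for L, (fp, tp, _) in ROWS.items()}
for L, (fp, tp, tabulated) in ROWS.items():
    print(L, round(computed[L], 3), tabulated)
```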