Yi-Lin Chen, Chuan-Chun Lee, Ya-Lan Lin, Kai-Min Yin, Chung-Liang Ho, Tsunglin Liu.
Abstract
BACKGROUND: Next-generation sequencing (NGS) technology has transformed metagenomics because the high-throughput data allow an in-depth exploration of a complex microbial community. However, accurate species identification with NGS data is challenging because NGS sequences are relatively short. Assembling 16S rDNA segments into longer sequences has been proposed for improving species identification. Current approaches, however, suffer either from amplification bias due to the use of a single primer or from insufficient 16S rDNA reads in whole-genome sequencing data.
Year: 2015 PMID: 26681335 PMCID: PMC4682383 DOI: 10.1186/1471-2105-16-S18-S13
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Pipeline for obtaining long 16S rDNA for species identification.
Statistics of reads, genera, and contig lengths of ten samples.
| Sample | Raw reads | Trimmed reads | Confident reads | Total genus | Confident genus | No. of contigs | lcl/mrl | | lc identity |
|---|---|---|---|---|---|---|---|---|---|
| S1 | 31,035 | 29,170 | 27,423 | 8 | 4 | 4 | 4 | 3 | 4 |
| S2 | 38,100 | 35,007 | 30,242 | 144 | 28 | 23 | 15 | 6 | 11 |
| S3 | 39,881 | 35,456 | 10,909 | 191 | 62 | 53 | 28 | 12 | 27 |
| S4 | 9,144 | 4,590 | 478 | 65 | 9 | 9 | 4 | 3 | 3 |
| S5 | 25,314 | 16,857 | 3,059 | 218 | 38 | 34 | 24 | 12 | 14 |
| S6 | 14,709 | 8,347 | 854 | 117 | 18 | 13 | 5 | 2 | 2 |
| S7 | 13,810 | 8,262 | 983 | 128 | 20 | 17 | 4 | 3 | 4 |
| S8 | 11,379 | 9,214 | 2,145 | 230 | 47 | 37 | 18 | 5 | 18 |
| S9 | 9,924 | 8,277 | 1,374 | 193 | 33 | 30 | 13 | 5 | 16 |
| S10 | 12,084 | 8,216 | 1,685 | 124 | 23 | 20 | 11 | 4 | 7 |
A confident read is one with a classification score ≥80. A genus is counted when it has at least one confident read. A confident genus is one with ≥10 confident reads. Confident reads of each confident genus were assembled, and the longest contig (lc) was analyzed. The last column shows the number of longest contigs with ≥97% alignment identity to the 16S rDNA references. Abbreviations: longest contig length (lcl), mean read length (mrl).
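The counting rules in the note above can be written as a short filter. This is a minimal sketch, not the authors' pipeline: the `(genus, score)` input format and the function name `summarize_genera` are assumptions made for illustration, and only the two cut-offs (score ≥80, ≥10 confident reads) come from the table note.

```python
from collections import Counter

SCORE_CUTOFF = 80          # table note: a confident read has a score >= 80
MIN_CONFIDENT_READS = 10   # table note: a confident genus has >= 10 confident reads


def summarize_genera(classified_reads):
    """Count confident reads, genera, and confident genera.

    classified_reads: iterable of (genus, score) pairs. This input format
    is hypothetical; a real classifier output would need parsing first.
    """
    # Keep only confident reads and count them per genus.
    per_genus = Counter(
        genus for genus, score in classified_reads if score >= SCORE_CUTOFF
    )
    # A genus is counted when it has at least one confident read.
    total_genus = len(per_genus)
    # A confident genus needs at least MIN_CONFIDENT_READS confident reads.
    confident_genus = sum(
        1 for n in per_genus.values() if n >= MIN_CONFIDENT_READS
    )
    return {
        "confident_reads": sum(per_genus.values()),
        "total_genus": total_genus,
        "confident_genus": confident_genus,
    }
```

Applied to one sample's classified reads, the three returned values would correspond to the "Confident reads", "Total genus", and "Confident genus" columns.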
Statistics of contigs, Sanger sequences, and their alignments.
| S1 | 780 | 1483 | 1468 | 479 |
| S2 | 1143 | 1052 | 1193 | N.A. |
| S3 | 1487 | N.A. | N.A. | 845 |
| 832 | 99.8; 0; 1 | 99.6; 0; 2 | 99.6; 0; 2 | |
| 1126 | 100; 0; 0 | 100; 0; 0 | N.A. | |
| 1529 | 100; 0; 0 | 99.8; 1; 1 | N.A. | |
Primer bias, i.e., percentage of confident genera that would be missed by each of the six primers.
| Primer | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 | S9 | S10 |
|---|---|---|---|---|---|---|---|---|---|---|
| A | 0% | 36% | 42% | 45% | 39% | 50% | 32% | 33% | 30% | |
| B | 0% | 54% | 47% | 44% | 58% | 50% | 80% | 55% | 58% | |
| C | 0% | 7% | 15% | |||||||
| D | 0% | 39% | 29% | 44% | 30% | 21% | 30% | |||
| E | 0% | 75% | 35% | 44% | 42% | 56% | 40% | 47% | 36% | 52% |
| F | 0% | 21% | 35% | 56% | 21% | |||||
| No. of genus | 4 | 28 | 62 | 9 | 38 | 18 | 20 | 47 | 33 | 23 |
For real samples, the two least biased primers are shown in bold.
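Primer bias as defined in the table caption (the percentage of confident genera that a primer would miss) reduces to a set difference. The sketch below assumes that, for each primer, the set of genera it can amplify is already known; the function name and the genus labels in the usage note are hypothetical and are not taken from the study.

```python
def primer_bias_percent(confident_genera, amplifiable_genera):
    """Percentage of confident genera that a single primer would miss.

    confident_genera: genera with enough confident reads in a sample.
    amplifiable_genera: genera the primer can amplify (assumed known).
    """
    confident = set(confident_genera)
    if not confident:
        raise ValueError("sample has no confident genera")
    # Genera present in the sample but not recovered by this primer.
    missed = confident - set(amplifiable_genera)
    return 100.0 * len(missed) / len(confident)
```

For example, with four confident genera `{"g1", "g2", "g3", "g4"}` and a primer that amplifies only the first three, the function returns 25.0. A primer that amplifies all four gives 0.0, as every primer does for S1 in the table.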
Figure 2. Distances between real samples and their clustering. (a) Principal components of distances between seven microbial communities in real samples by UniFrac. (b) Clustering of the seven communities using full distances.
Alignment identity of nine contigs in S4 to corresponding contigs in samples S5-S10.
| Genus in S4 | S5 | S6 | S7 | S8 | S9 | S10 | Candidate species |
|---|---|---|---|---|---|---|---|
| 99.8% | - | - | 98.1% | - | - | ||
| 99.5% | - | - | - | - | - | ||
| 90.6% | - | - | 92.1% | 93.1% | - | N.A.* | |
| 100.0% | - | - | - | - | - | ||
| 96.2% | - | - | - | - | - | ||
| 100.0% | - | - | 100.0% | - | 99.9% | ||
| 93.3% | - | - | - | - | - | ||
| 98.89% | 99.41% | - | - | - | - | ||
| - | - | - | - | - | - | ||
The last column shows candidate species of the seven genera in S4. *Alignment identity <97%.
In-silico sensitivity of the six primers.
| Primer | No. of 16S rDNA sequences covering the primer position | No. of amplifiable sequences | Percentage |
|---|---|---|---|
| A | 468049 | 411649 | 87.9% |
| B | 2140848 | 1873932 | 87.5% |
| C | 1877993 | 1796209 | 95.6% |
| D | 1787993 | 1424757 | 79.7% |
| E | 1003626 | 915216 | 91.2% |
| F | 61634 | 47650 | 77.3% |
| At least one | 2472276 | 2455930 | 99.3% |
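The percentage column can be recomputed from the two count columns as amplifiable sequences divided by sequences covering the primer position. The sketch below copies the counts from the table; the dictionary layout and the function name are illustrative choices, not part of the original analysis.

```python
# (sequences covering the primer position, amplifiable sequences),
# copied from the table above.
PRIMER_COUNTS = {
    "A": (468049, 411649),
    "B": (2140848, 1873932),
    "C": (1877993, 1796209),
    "D": (1787993, 1424757),
    "E": (1003626, 915216),
    "F": (61634, 47650),
    "At least one": (2472276, 2455930),
}


def sensitivity_percent(covering, amplifiable):
    """In-silico sensitivity as a percentage, rounded to one decimal."""
    return round(100.0 * amplifiable / covering, 1)


if __name__ == "__main__":
    for primer, (covering, amplifiable) in PRIMER_COUNTS.items():
        print(f"{primer}: {sensitivity_percent(covering, amplifiable)}%")
```

Recomputing the column this way is a quick consistency check on the counts and percentages as tabulated.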