Donghyeok Seol, So Yun Jhang, Hyaekang Kim, Se-Young Kim, Hyo-Sun Kwak, Soon Han Kim, Woojung Lee, Sewook Park, Heebal Kim, Seoae Cho, Woori Kwak.
Abstract
Identifying the microbes present in probiotic products is an important issue in product quality control and public health. The most common methods used to identify genera containing species that produce lactic acid are matrix-assisted laser desorption/ionization-time of flight mass spectrometry (MALDI-TOF MS) and 16S rRNA sequence analysis. However, the high cost of operation, difficulty in distinguishing between similar species, and limitations of the current sequencing technologies have made it difficult to obtain accurate results using these tools. To overcome these problems, a whole-genome shotgun sequencing approach has been developed along with various metagenomic classification tools. Widely used tools include the marker gene and k-mer methods, but their inevitable false positives (FPs) hamper accurate analysis. We therefore designed a coverage-based pipeline to reduce the FP problem and to achieve a more reliable identification of species. The coverage-based pipeline described here not only shows higher accuracy for species detection and for proportion analysis based on mapping depth, but can also be applied regardless of the sequencing platform. We believe that the coverage-based pipeline described in this study can provide appropriate support for probiotic quality control, addressing current labeling issues.
Keywords: NGS; identification; lactic acid bacteria; mapping coverage; metagenomics; probiotics; whole genome shotgun sequencing
Year: 2019 PMID: 31440213 PMCID: PMC6693478 DOI: 10.3389/fmicb.2019.01683
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 5.640
FIGURE 1. Pipeline overview. First, the complete genome data set was downloaded from the NCBI and filtered based on 95% ANI to obtain 126 species and 597 strains. Then, the one-to-one pairwise coverage of all strains of each species was calculated, and the strain with the highest minimum coverage value for a given species was selected as the representative strain for that species. For example, if there are four strains of a species, each strain in turn is treated as the reference strain, and the coverage is calculated by aligning sequence data of the remaining three strains to that reference using Bowtie 2. Afterward, the minimum coverage of each strain is compared (Orange: strain_1: 0.79, strain_2: 0.86, strain_3: 0.87, strain_4: 0.83) and the strain with the highest value is selected (Red: strain_3: 0.87). Thus, the strain with the highest minimum coverage is the representative strain of that species. A reference database was constructed using the representative strains, and whole-metagenome shotgun sequencing data of probiotic products were aligned to it. Only species exceeding 0.7137 coverage were judged to be present in the probiotic product.
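The representative-strain rule in the Figure 1 caption (take the minimum coverage for each candidate reference, then keep the reference whose minimum is highest) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `pick_representative` and the individual pairwise values are hypothetical, and only the per-reference minima (0.79, 0.86, 0.87, 0.83) are taken from the caption.

```python
from typing import Dict, Tuple


def pick_representative(pairwise: Dict[str, Dict[str, float]]) -> Tuple[str, float]:
    """Return the reference strain with the highest minimum coverage.

    pairwise[ref][query] is the fraction of the reference genome covered
    when reads from the query strain are aligned to it.
    """
    # Worst-case coverage for each candidate reference strain.
    min_cov = {ref: min(covs.values()) for ref, covs in pairwise.items()}
    # Keep the reference whose worst case is best (max of the minima).
    best = max(min_cov, key=min_cov.get)
    return best, min_cov[best]


# Hypothetical pairwise coverages; only the row minima match Figure 1.
pairwise = {
    "strain_1": {"strain_2": 0.90, "strain_3": 0.88, "strain_4": 0.79},
    "strain_2": {"strain_1": 0.91, "strain_3": 0.86, "strain_4": 0.89},
    "strain_3": {"strain_1": 0.92, "strain_2": 0.87, "strain_4": 0.90},
    "strain_4": {"strain_1": 0.83, "strain_2": 0.88, "strain_3": 0.85},
}

representative, coverage = pick_representative(pairwise)
print(representative, coverage)  # strain_3 0.87
```

Choosing by the maximum of the minima favors the strain that remains a usable reference even for its most divergent relative, which is presumably why the caption compares minimum rather than mean coverage.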
FIGURE 2. Distribution of coverage values. A positive correlation between the ANI and coverage was observed when the representative strain of each species was used as the reference sequence. The lowest minimum coverage value was 0.7137 at 96.47% ANI in B. longum.
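The lowest minimum coverage in Figure 2 (0.7137) is the cutoff used to decide whether a species is present. A minimal sketch of that decision is shown below, assuming per-species coverage values have already been computed for a sample. The function name, species labels, and sample values are hypothetical; the comparison is strict because the pipeline description says "exceeding".

```python
COVERAGE_THRESHOLD = 0.7137  # lowest minimum coverage reported in Figure 2 (B. longum)


def call_present_species(sample_coverage, threshold=COVERAGE_THRESHOLD):
    """Return, in sorted order, the species whose coverage exceeds the threshold."""
    return sorted(sp for sp, cov in sample_coverage.items() if cov > threshold)


# Hypothetical coverage values for one sequenced product (not from the paper).
sample_coverage = {
    "species_A": 0.98,
    "species_B": 0.7137,  # exactly at the cutoff, so not called present
    "species_C": 0.12,
}

present = call_present_species(sample_coverage)
print(present)  # ['species_A']
```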
ANI values for closely related pairs of GSLA species.
| 98.87 | |||
| 98.21 | |||
| 96.56 | |||
| 98.28 | |||
| 97.22 | |||
| 98.39 | |||
| 97.93 |
The results of single species data from the SRA.
| ERX231530 | 1 | 1 + 3 | 1 + 3 | 1 + 294 | 1 + 639 | 1 + 991 | 1 + 9 | |
| SRX2610845 | 1 | 1 | 1 | 1 + 122 | 1 + 54 | 1 + 207 | 1 + 5 | |
| SRX2610831 | 1 | 1 | 1 | 1 + 89 | 1 + 38 | 1 + 160 | 1 + 2 | |
| ERX1625346 | 1 | 1 + 1 | 1 + 2 | 1 + 245 | 1 + 62 | 1 + 291 | 1 + 2 | |
| ERX2085159 | 1 | 1 | 1 | 1 + 149 | 1 + 86 | 1 + 809 | 1 + 2 | |
| ERX1960389 | 1 | 1 + 1 | 1 | 1 + 258 | 1 + 149 | 1 + 626 | 1 + 1 | |
| SRX2610848 | 1 | 1 | 1 | 1 + 109 | 1 + 28 | 1 + 314 | 1 + 5 | |
| SRX2610844 | 1 | 1 + 2 | 1 | 1 + 39 | 1 + 22 | 1 + 108 | 1 + 6 | |
| ERX231531 | 1 | 1 + 4 | 1 + 4 | 1 + 346 | 1 + 257 | 1 + 1193 | 1 + 11 | |
| ERX2102726 | 1 | 1 + 1 | 1 + 1 | 1 + 65 | 1 + 44 | 1 + 258 | 1 + 1 | |
| SRX2610827 | 1 | 1 + 2 | 1 + 1 | 1 + 53 | 1 + 31 | 1 + 87 | 1 + 5 | |
| SRX2268576 | 1 | 1 + 1 | 1 + 2 | 1 + 169 | 1 + 83 | 1 + 353 | 1 + 7 | |
| ERX980028 | 1 | 1 + 2 | 1 + 1 | 1 + 112 | 1 + 74 | 1 + 251 | 1 + 3 | |
| SRX2268579 | 1 | 1 + 1 | 1 + 1 | 1 + 91 | 1 + 34 | 1 + 358 | 1 + 1 | |
| SRX2268582 | 1 | 1 + 1 | 1 + 1 | 1 + 79 | 1 + 32 | 1 + 206 | 1 + 4 | |
| ERX1101269 | 1 | 1 | 1 | 1 + 125 | 1 + 77 | 1 + 569 | 1 + 3 | |
| SRX1433289 | 1 | 1 + 1 | 1 | 1 + 276 | 1 + 96 | 1 + 490 | 1 + 3 | |
| ERX178725 | 1 | 1 + 9 | 1 + 7 | 1 + 281 | 1 + 122 | 1 + 575 | 1 + 10 | |
| SRX2268585 | 1 | 1 + 1 | 1 + 3 | 1 + 184 | 1 + 53 | 1 + 206 | 1 + 7 |
FIGURE 3. Results of metagenomic analysis of probiotic products. Panel (A) shows simulated data, (B) shows real data obtained with the Illumina platform, and (C–G) show real data from the Ion Torrent platform. Green indicates the number of species correctly detected, yellow indicates FNs, and red indicates FPs. The blue line represents the precision of each classification.
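The quantities plotted in Figure 3 can be derived from two sets of species names: those expected in the product and those reported by a classifier. The sketch below assumes the standard definition of precision, TP / (TP + FP); the function name and species labels are hypothetical.

```python
def score_classification(expected, detected):
    """Return (TP, FP, FN, precision) for one classifier on one sample."""
    tp = len(expected & detected)  # correctly detected species
    fp = len(detected - expected)  # reported but not actually present
    fn = len(expected - detected)  # present but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return tp, fp, fn, precision


# Hypothetical example: three labeled species, one missed, one spurious call.
expected = {"sp1", "sp2", "sp3"}
detected = {"sp1", "sp2", "sp9"}

tp, fp, fn, precision = score_classification(expected, detected)
print(tp, fp, fn, round(precision, 3))  # 2 1 1 0.667
```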
FIGURE 4. Species proportion analysis of simulated and SRA data. The number of reads was determined in proportion to the genome size of each species, and a total of 10 species were combined. Panel (A) shows simulated data, (B) shows data from the NCBI SRA. The closer the classification value of each species is to 10, the more accurately the proportions are determined.
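When reads are drawn in proportion to genome size, every species ends up with the same mean mapping depth, so a depth-based estimate should return 10% for each of the 10 species. The sketch below illustrates that reasoning under one plausible reading of "proportion analysis based on mapping depth" (mean depth per species, normalized to 100%); the exact formula used in the study is not given in this record, and the genome sizes and 30x depth are hypothetical.

```python
def depth_proportions(mapped_bases, genome_sizes):
    """Estimate species proportions (%) from mean mapping depth."""
    # Mean depth = total aligned bases divided by reference genome length.
    depth = {sp: mapped_bases[sp] / genome_sizes[sp] for sp in genome_sizes}
    total = sum(depth.values())
    return {sp: 100.0 * d / total for sp, d in depth.items()}


# Hypothetical genomes of different sizes, each sequenced to 30x,
# i.e. read counts proportional to genome size as in the Figure 4 design.
genome_sizes = {f"species_{i}": 1_800_000 + 150_000 * i for i in range(10)}
mapped_bases = {sp: 30 * size for sp, size in genome_sizes.items()}

proportions = depth_proportions(mapped_bases, genome_sizes)
for sp, pct in proportions.items():
    print(sp, round(pct, 2))
```

Normalizing by genome length is what keeps a large genome from appearing more abundant merely because it attracts more reads.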
FIGURE 5. Changes in species coverage values when the Illumina data set (30 Gb × 2) was reduced to 50, 25, 10, and 5% of its original size. Only species that changed to FN results are shown. L. fermentum was no longer detected when the data set was reduced to 50% (30 Gb × 2 → 15 Gb × 2), and B. longum, L. paracasei, and L. reuteri were no longer detected when it was reduced to 5% (3 Gb × 2 → 1.5 Gb × 2). The red dotted line shows 0.7137 coverage. The number in parentheses in the legend is the proportion analysis result.
Data processing time (min) required when the data set size was reduced.
| Step | Tool (version) | 100% | 50% | 25% | 10% | 5% |
| Alignment | Bowtie 2 (2.3.3.1) | 383 | 105 | 60 | 20 | 10 |
| BAM file sorting | SAMtools (1.3.1) | 40 | 15 | 8 | 3 | 2 |
| Genomecov | Bedtools (2.20.1) | 28 | 10 | 5 | 2 | 1 |
| Sum | | 451 | 130 | 73 | 25 | 13 |
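The last step in the table, genomecov, yields the depth information from which a coverage value can be computed. The sketch below assumes per-base output in three tab-separated columns (sequence name, 1-based position, depth), which is what `bedtools genomecov -d` writes, and assumes that "coverage" means the fraction of reference positions with depth of at least 1. Neither assumption is confirmed by this record, and the function name and sample lines are hypothetical.

```python
from collections import defaultdict


def breadth_of_coverage(lines, min_depth=1):
    """Return {sequence: fraction of positions with depth >= min_depth}."""
    total = defaultdict(int)    # positions seen per reference sequence
    covered = defaultdict(int)  # positions meeting the depth requirement
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, _position, depth = line.split("\t")
        total[name] += 1
        if int(depth) >= min_depth:
            covered[name] += 1
    return {name: covered[name] / total[name] for name in total}


# Hypothetical per-base depth records for two tiny reference sequences.
records = [
    "genome_A\t1\t12", "genome_A\t2\t9", "genome_A\t3\t0", "genome_A\t4\t3",
    "genome_B\t1\t0", "genome_B\t2\t5",
]

breadth = breadth_of_coverage(records)
print(breadth)  # {'genome_A': 0.75, 'genome_B': 0.5}
```

For a multi-contig reference, the per-sequence counts would need to be summed per species before comparing the result with the 0.7137 cutoff.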