Grégoire Siekaniec1,2, Emeline Roux1,3, Téo Lemane1, Eric Guédon2, Jacques Nicolas1.
Abstract
This study aimed to provide efficient recognition of bacterial strains on personal computers from MinION (Nanopore) long read data. Thanks to the fall in sequencing costs, the identification of bacteria can now proceed by whole genome sequencing. MinION is a fast but highly error-prone sequencing device, and it is a challenge to successfully identify the strain content of unknown simple or complex microbial samples. The task is heavily constrained by memory management and by the need for fast access to the reads and genome fragments. Our strategy involves three steps: indexing of known genomic sequences for one or several bacterial species; a request process to assign a read to a strain by matching it to the closest reference genomes; and a final step looking for a minimum set of strains that best explains the observed reads. We have applied our method, called ORI, to 77 strains of Streptococcus thermophilus. We worked on several genomic distances and obtained a detailed classification of the strains, together with a criterion that allows merging of what we termed 'sibling' strains, which are separated by only a few mutations. Overall, isolated strains can be safely recognized from MinION data. For mixtures of several non-sibling strains, results depend on strain abundance.
Keywords: MinION; Streptococcus thermophilus; bacterial strain identification; bloom filters; long read; strain classification
Year: 2021 PMID: 34812718 PMCID: PMC8743539 DOI: 10.1099/mgen.0.000654
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
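The three steps outlined in the abstract (indexing, read assignment, and selection of a small explanatory set of strains) can be illustrated with a toy sketch. This is not the ORI implementation: plain Python sets stand in for Bloom filters, shared k-mer counts stand in for the matching score, and a greedy set cover only approximates a minimum set. The function names, the k-mer size `K` and the toy genomes are all hypothetical.

```python
from collections import Counter

K = 5  # hypothetical k-mer size, kept small for the toy data below


def kmers(seq, k=K):
    """Return the set of k-mers contained in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(genomes, k=K):
    """Step 1: map each strain name to its k-mer set (stand-in for a Bloom filter)."""
    return {name: kmers(seq, k) for name, seq in genomes.items()}


def assign_read(read, index, k=K):
    """Step 2: return the strains that share the most k-mers with the read."""
    read_kmers = kmers(read, k)
    scores = {name: len(read_kmers & ref) for name, ref in index.items()}
    best = max(scores.values(), default=0)
    if best == 0:
        return frozenset()  # the read matches no indexed strain
    return frozenset(name for name, s in scores.items() if s == best)


def explain_reads(assignments):
    """Step 3: greedy set cover over the reads that were assigned to a strain."""
    uncovered = [a for a in assignments if a]
    chosen = []
    while uncovered:
        counts = Counter(s for a in uncovered for s in a)
        # Pick the strain explaining the most uncovered reads (ties: alphabetical).
        strain = max(sorted(counts), key=counts.get)
        chosen.append(strain)
        uncovered = [a for a in uncovered if strain not in a]
    return chosen


if __name__ == "__main__":
    toy_genomes = {
        "s1": "ACGTACGGTTCAAGT",
        "s2": "ACGTACGGTTCTTGA",
        "s3": "GGGCCCAAATTTGGG",
    }
    toy_index = build_index(toy_genomes)
    toy_reads = ["ACGTACGGTT", "GTTCAAGT", "CCCAAATTTG"]
    print(explain_reads([assign_read(r, toy_index) for r in toy_reads]))
```

In this toy run the first read is ambiguous between `s1` and `s2`, and the covering step resolves it in favour of `s1` because `s1` also explains the second read.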
Average percentage error rate in the sequences (calculated from an alignment of the reads against the reference genomes performed with Minimap2).
The filters retain only sequences with a quality greater than 9 and a size greater than 2000 bp.

| Errors | Mismatches | Deletions | Insertions | Total |
|---|---|---|---|---|
| All sequences | 1.50 % | 2.16 % | 1.40 % | 5.06 % |
| With filters | 1.44 % | 2.08 % | 1.37 % | 4.89 % |
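The read filter described with this table (quality greater than 9, size greater than 2000 bp) can be sketched as follows. The source does not say how read quality was computed or which tool applied the filter, so this sketch assumes a mean Phred score obtained by averaging per-base error probabilities, with Phred+33 encoding. The function names are hypothetical.

```python
import math


def mean_phred(quality_string, offset=33):
    """Mean read quality, averaged over error probabilities (assumed definition)."""
    probs = [10 ** (-(ord(c) - offset) / 10) for c in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def passes_filters(sequence, quality_string, min_quality=9, min_length=2000):
    """Keep a read only if it is longer than min_length and above min_quality."""
    if len(sequence) <= min_length:
        return False
    return mean_phred(quality_string) > min_quality
```

Averaging error probabilities rather than raw Phred scores penalises reads that contain a few very poor bases, which is why the two definitions can disagree on borderline reads.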
Fig. 1. Biclusters in a strain × gene matrix and associated labelling of nodes in a classification tree.
Fig. 2. Overview of the ORI method in three steps: (1) genome indexing, (2) query the index from filtered reads, and (3) identification of strains.
Fig. 3. Heatmap of the Jaccard distance for 28 strains + ACA-DC 198 + subsp. ATCC 11842.
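For readers unfamiliar with the Jaccard distance shown in the Fig. 3 heatmap, a minimal sketch follows. It assumes each strain is summarised as a set of features (for example k-mers or genes); which features the figure actually uses is not stated here, and the function names are hypothetical.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |a ∩ b| / |a ∪ b|."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here: two empty sets are identical
    return 1.0 - len(a & b) / len(union)


def distance_matrix(feature_sets):
    """Pairwise Jaccard distances for a dict mapping strain name to feature set."""
    names = sorted(feature_sets)
    return {
        (x, y): jaccard_distance(feature_sets[x], feature_sets[y])
        for x in names
        for y in names
    }
```

A distance of 0 means the two strains share all their features, and 1 means they share none.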
Identification of reads from strain CIRM-BIA67 from various numbers of reads

| Method | 100 reads | 1000 reads | 10 000 reads | 20 000 reads |
|---|---|---|---|---|
| | C67 2 % | C67 1.70 % | C67 1.31 % | C67 1.23 % |
| | One group of many indistinguishable strains (C67 and 66 other strains) 100 % | C67 100 % | C67 13.13 % + 4 groups (other strains) | C67 2.2 % + C65 1.86 % + 22 other strains + 2 groups |
| | C67 100 % | C67 100 % | C65 100 % | C65 100 % |

C67, CIRM-BIA67; C65, CIRM-BIA65.
Fig. 4. Identification results on a balanced mix of strains. The Hamming distance between observed and expected strains, on the y-axis, has been multiplied by 10 000 (in blue for ORI, orange for StrainSeeker and green for Kraken 2). Stars represent mean values. Matthews correlation coefficient (MCC) values are given on the first line just above the x-axis at the bottom of the diagrams, followed by the ambiguity ratio (number of strains identified/number of strains present).
Fig. 5. Identification of subdominant strains in a mixture of strains using various numbers of reads. The Hamming distance between observed and expected strains, on the y-axis, has been multiplied by 10 000 (in blue for ORI, orange for StrainSeeker and green for Kraken 2). Matthews correlation coefficient (MCC) values are given on the first line just above the x-axis at the bottom of the diagrams, followed by the ambiguity ratio (number of strains identified/number of strains present).
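The three evaluation measures named in these captions can be sketched on toy data. The sketch assumes that expected and observed results are encoded as binary presence/absence vectors over the indexed strains; the exact encoding and any normalisation behind the plotted Hamming distances are not given here, so the raw count below is only illustrative. The function names are hypothetical.

```python
import math


def hamming(expected, observed):
    """Number of strains whose presence/absence call differs."""
    return sum(e != o for e, o in zip(expected, observed))


def mcc(expected, observed):
    """Matthews correlation coefficient for binary presence/absence calls."""
    pairs = list(zip(expected, observed))
    tp = sum(1 for e, o in pairs if e == 1 and o == 1)
    tn = sum(1 for e, o in pairs if e == 0 and o == 0)
    fp = sum(1 for e, o in pairs if e == 0 and o == 1)
    fn = sum(1 for e, o in pairs if e == 1 and o == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: return 0 when the coefficient is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def ambiguity_ratio(expected, observed):
    """Number of strains identified divided by number of strains present."""
    return sum(observed) / sum(expected)
```

For example, with four strains present out of eight indexed, a result that finds two of them plus one wrong strain gives a Hamming distance of 3, an ambiguity ratio of 0.75 and an MCC of about 0.26.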
Subdominant strain identification by ORI, without/with merge, in a mixture of four or six strains, by using 1000, 4000 or 16 000 Nanopore sequencing reads
Best results are in bold type. Values of Hamming distance: in all experiments, minimum value is 0 (perfect identification); MCC: Matthews correlation coefficient (1=perfect correlation); Ambiguity ratio: number of strains identified/number of strains present; sd: standard deviation.

| No. of strains | 4 (ORI/ORI_merge) | | | 6 (ORI/ORI_merge) | | |
|---|---|---|---|---|---|---|
| Number of reads | 1000 | 4000 | 16 000 | 1000 | 4000 | 16 000 |
| Distance | 19.8/19.8 | 9.9/ | | 30.3/ | 26.7/ | |
| MCC | 0.28/ | 0.42/ | 0.57/ | 0.22/ | 0.34/ | 0.63/ |
| Ambiguity | 0.4/0.4 | 0.5/ | 0.7/ | 0.2/ | 0.2/ | 0.53/ |
Strain identification by ORI, with and without merge index, in a balanced mixture of four or six strains more or less genetically close, by using 1000, 4000 or 16 000 sequencing reads
Best results are in bold type. Values of Hamming distance (0=perfect identification); MCC: Matthews correlation coefficient (1=perfect correlation); Ambiguity: number of strains identified/number of strains present.

| (a) Global identification results (mean over all 90 experiments): | | | | | | |
|---|---|---|---|---|---|---|
| Method | ORI | ORI_merge | | | | |
| Distance | 0.52 | | | | | |
| (MCC/Ambiguity) | 0.66/0.63 | | | | | |
| | | | | | | |
| Number of strains | 4 | 6 | 4 | 6 | | |
| Distance | 0.73 | 0.31 | 0.53 | | | |
| (MCC/Ambiguity) | 0.70/0.65 | 0.65/0.56 | 0.94/0.93 | | | |
| | | | | | | |
| Number of reads | 1000 | 4000 | 16 000 | 1000 | 4000 | 16 000 |
| Distance | 0.17 | 0.8 | | | | 0.8 |
| (MCC/Ambiguity) | 0.55/0.44 | 0.64/0.64 | 0.78/0.80 | | | |
| | | | | | | |
| Proximity | Distant | Medium | Close | Distant | Medium | Close |
| Distance | 0.10 | 1.17 | | | | 0.33 |
| (MCC/Ambiguity) | 0.75/0.73 | 0.61/0.68 | 0.6/0.47 | | | |