Rigbe G Weldatsadik1, Jingwen Wang2, Kai Puhakainen3,4, Hong Jiao2, Jari Jalava3, Kati Räisänen3, Neeta Datta1, Tiina Skoog2, Jaana Vuopio3,4, T Sakari Jokiranta1, Juha Kere2,5,6.
Abstract
Knowledge of the genomic variation among different strains of a pathogenic microbial species can help in selecting optimal candidates for diagnostic assays and vaccine development. Pooled sequencing (Pool-seq) is a cost-effective approach for population-level genetic studies that require large numbers of samples, such as various strains of a microbe. To test the use of Pool-seq in identifying variation, we pooled DNA of 100 Streptococcus pyogenes strains of different emm types in two pools, each containing 50 strains. We used four variant calling tools (Freebayes, UnifiedGenotyper, SNVer, and SAMtools) and one emm1 strain, SF370, as a reference genome. In total, 63,719 SNPs and 164 INDELs were identified in the two pools concordantly by at least two of the tools. The majority of the variants (93.4%) from six individually sequenced strains used in the pools could be identified from the two pools, and 72.3% and 97.4% of the variants in the pools could be mined from the analysis of the 44 complete Str. pyogenes genomes and 3407 sequence runs deposited in the European Nucleotide Archive, respectively. We conclude that DNA sequencing of pooled samples of large numbers of bacterial strains is a robust, rapid and cost-efficient way to discover sequence variation.
Year: 2017 PMID: 28361960 PMCID: PMC5374712 DOI: 10.1038/srep45771
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Sequencing depth of coverage for the pools and the public ENA runs.
(a) The base coverage distribution from aligned data of the ENA runs and the two Pool-seq datasets. For the 3407 ENA runs the average is shown with ±1 s.d. In the two Pool-seq datasets, ~90% of the bases had a depth of coverage of ~10,000X, whereas in the ENA runs the depth was ~100X on average. (b) The aligned average depth of coverage of the two pools per 100 bases of the GAS genome. The positions of the prophages of the SF370 strain are indicated by the gray fill and the prophage numbers (370.1 etc.). (c) The aligned average depth of coverage of the 3407 ENA runs per 100 bases of the GAS genome. The positions of the prophages of the SF370 strain are indicated by the gray fill.
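The per-100-base averaging used in panels (b) and (c) can be illustrated with a minimal sketch. It assumes the per-base depths are already available as a plain list; the function name `windowed_mean_depth` and the toy values are hypothetical and are not taken from the study.

```python
from typing import List


def windowed_mean_depth(depths: List[int], window: int = 100) -> List[float]:
    """Average per-base depth over consecutive, non-overlapping windows.

    The final window may be shorter than `window` if the genome length
    is not an exact multiple of the window size.
    """
    means = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means


# Toy example (assumed values): 250 positions, so the last window
# covers only 50 bases.
toy_depths = [10] * 100 + [20] * 100 + [40] * 50
print(windowed_mean_depth(toy_depths))
```

Non-overlapping windows keep one value per 100 bases, which matches the resolution described in the caption.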
Percentage of the variants identified from the six individual strains that were also mined from the pools using different variant calling tools and methods.
| Variant calling tools | strain1 (%) | strain2 (%) | strain3 (%) | strain4 (%) | strain5 (%) | strain6 (%) |
|---|---|---|---|---|---|---|
| SAMtools (with duplicate removal) | 48.1 | 52.2 | 50.9 | 47.6 | 47.4 | 51.4 |
| SAMtools (without duplicate removal) | 67.2 | 71.0 | 71.2 | 69.9 | 63.4 | 70.0 |
| Freebayes | 96.9 | 97.0 | 96.9 | 96.8 | 95.7 | 96.5 |
| GATK | 95.0 | 94.9 | 94.9 | 94.9 | 92.2 | 94.3 |
| SNVer | 97.1 | 97.0 | 97.0 | 96.9 | 95.6 | 96.5 |
| Union of the four tools | 97.7 | 97.7 | 97.8 | 97.8 | 96.6 | 97.4 |
| Freebayes + GATK + SNVer concordant | 93.5 | 93.3 | 93.1 | 93.2 | 90.4 | 92.4 |
| Concordant in 2 or more tools | 94.1 | 94.1 | 93.9 | 93.9 | 91.2 | 93.4 |
Variants from strain1 were compared against pool 1, while the remaining five strains were compared against pool 2. Unless stated otherwise, duplicates were not removed.
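The "concordant in 2 or more tools" and "union" rows can be sketched as simple set operations. This is an illustration under assumptions, not the authors' pipeline: a variant is represented here as a hypothetical `(position, ref, alt)` tuple, and the call sets are toy data.

```python
from collections import Counter
from typing import Dict, Set, Tuple

# Assumed representation: (1-based position, reference allele, alternate allele)
Variant = Tuple[int, str, str]


def concordant_variants(calls: Dict[str, Set[Variant]], min_tools: int = 2) -> Set[Variant]:
    """Return the variants reported by at least `min_tools` of the callers."""
    counts = Counter(v for tool_calls in calls.values() for v in tool_calls)
    return {v for v, n in counts.items() if n >= min_tools}


def percent_recovered(individual: Set[Variant], pool: Set[Variant]) -> float:
    """Percentage of an individual strain's variants that also appear in the pool."""
    if not individual:
        return 0.0
    return 100.0 * len(individual & pool) / len(individual)


# Toy pool call sets per tool (assumed values)
pool_calls = {
    "Freebayes": {(10, "A", "G"), (25, "C", "T"), (40, "G", "A")},
    "GATK": {(10, "A", "G"), (25, "C", "T")},
    "SNVer": {(10, "A", "G"), (40, "G", "A"), (77, "T", "C")},
    "SAMtools": {(10, "A", "G")},
}
strain = {(10, "A", "G"), (25, "C", "T"), (77, "T", "C"), (90, "A", "C")}

concordant = concordant_variants(pool_calls, min_tools=2)
union = concordant_variants(pool_calls, min_tools=1)
print(percent_recovered(strain, concordant))  # 50.0
print(percent_recovered(strain, union))       # 75.0
```

Setting `min_tools=1` gives the union of all tools, so the union can never recover fewer variants than a stricter concordance threshold, which is the pattern seen in the table.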
Total number of variants identified in the two pools, the 44 annotated genomes and 3407 runs from ENA.
| | Two pools | 44 GAS genomes | 3407 runs from ENA |
|---|---|---|---|
| Total base count | 40,226,329,793 | 82,600,584 | 1,529,716,686,794 |
| Total number of variants | 63,883 | 90,321 | 286,502 |
| SNPs | 63,719 | 84,312 | 270,212 |
| INDELs | 164 | 3,159 | 16,290 |
Figure 2. Distribution of the variants in the pools.
(a) Distribution of the SNPs identified from the pools per 100 bases. (b) Distribution of the INDELs identified from the pools per 100 bases. The SNPs and INDELs are mostly uniformly distributed across the reference genome.
Figure 3. Variants from one of the pools compared to 20 of the publicly available complete GAS genomes.
(a) Number of SNPs and (b) number of INDELs identified in one of the pools when each of 20 publicly available complete GAS genomes was used as the reference genome. Both the updated version (version 2) and the previous version (version 1) of the SF370 genome are included.
Figure 4. Hierarchical clustering of the 45 publicly available complete GAS genomes.
The currently available complete GAS genomes are clustered based on the number of variants among them. The genome used as the reference genome in this study is marked with *.
Alignment statistics of sequence reads from the two Pool-seq datasets and the ENA dataset formed from 3407 ENA runs.
| | Two pools | 3407 runs from ENA |
|---|---|---|
| Total number of reads | 8.0 × 10⁸ | 1.4 × 10¹⁰ |
| Mapped (%) | 91.7 | 93.13 (83.9–100) |
| Unmapped (%) | 8.2 | 6.8 (0–16) |
| Properly paired (%) | 86.9 | 91.0 (78.0–99.6) |
For the ENA dataset, the minimum and maximum values are given in parentheses after the averages for the proportions of mapped, unmapped, and properly paired reads.
Percentage of variants identified from the two pools, the 44 complete GAS genomes and the ENA dataset containing 3407 runs (rows) that were also found in the two other datasets (columns).
| | Two pools | 44 GAS genomes | ENA dataset |
|---|---|---|---|
| Two pools | - | 72.3 | 97.4 |
| 44 GAS genomes | 53.0 | - | 90.1 |
| ENA dataset | 21.9 | 27.6 | - |
The reference strain SF370 was used in all three sets to identify the variants.
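The table is asymmetric because each row uses its own variant count as the denominator. A minimal sketch of that row-wise calculation follows; the function name `overlap_matrix`, the tuple representation of a variant, and the toy sets are assumptions made for illustration only.

```python
from typing import Dict, Optional, Set, Tuple

# Assumed representation: (1-based position, reference allele, alternate allele)
Variant = Tuple[int, str, str]


def overlap_matrix(
    datasets: Dict[str, Set[Variant]]
) -> Dict[str, Dict[str, Optional[float]]]:
    """For each row dataset, the percentage of its variants found in each column dataset.

    Diagonal cells (and rows with no variants) are left as None.
    """
    matrix: Dict[str, Dict[str, Optional[float]]] = {}
    for row_name, row_set in datasets.items():
        matrix[row_name] = {}
        for col_name, col_set in datasets.items():
            if row_name == col_name or not row_set:
                matrix[row_name][col_name] = None
            else:
                shared = len(row_set & col_set)
                matrix[row_name][col_name] = 100.0 * shared / len(row_set)
    return matrix


# Toy data (assumed): a small set largely contained in a larger one
small = {(p, "A", "G") for p in (1, 2, 3, 4)}
large = {(p, "A", "G") for p in range(2, 12)}
result = overlap_matrix({"small": small, "large": large})
print(result["small"]["large"])  # 75.0
print(result["large"]["small"])  # 30.0
```

The same three shared variants account for 75% of the small set but only 30% of the large one, which mirrors how the pools can be mostly covered by the ENA dataset while the reverse percentage is much lower.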
Figure 5. Relative variability of 10 kb regions in the three datasets.
(a) Proportion of variants in 10 kb regions. The proportions were calculated with a Python script. Black solid lines represent the pools, red dotted lines the 44 genomes and blue dashed lines the ENA data. The positions of the prophages of the SF370 strain are shown with gray fills. (b) Identification of the regions that show a statistically significant difference in the proportion of variants between the datasets. The values are −log10 p-values from Fisher's exact test with Bonferroni correction for multiple testing. Black dots indicate the p-values for the comparison between the pools and the ENA data, blue dots between the pools and the 44 genomes, and red dots between the ENA data and the 44 genomes. The positions of the prophages of the SF370 strain are shown with gray fills.
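The windowed comparison can be sketched as follows. The authors' script is not shown in this record, so this is an assumed reconstruction: variants are counted per 10 kb window, each window is tested with a 2x2 table of (variants inside the window, variants outside the window) for two datasets, and the p-values are Bonferroni-adjusted by the number of windows. The two-sided Fisher's exact test is implemented directly from the hypergeometric distribution to keep the example dependency-free; all function names are hypothetical.

```python
from math import comb
from typing import List


def window_counts(positions: List[int], genome_length: int, window: int = 10_000) -> List[int]:
    """Count variants (1-based positions) in consecutive non-overlapping windows."""
    n_windows = (genome_length + window - 1) // window
    counts = [0] * n_windows
    for pos in positions:
        counts[(pos - 1) // window] += 1
    return counts


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def compare_windows(counts_a: List[int], counts_b: List[int]) -> List[float]:
    """Bonferroni-adjusted p-values, one per window, comparing two datasets."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    n_tests = len(counts_a)
    adjusted = []
    for in_a, in_b in zip(counts_a, counts_b):
        p = fisher_exact_two_sided(in_a, total_a - in_a, in_b, total_b - in_b)
        adjusted.append(min(1.0, p * n_tests))
    return adjusted


# Toy data (assumed): a 30 kb genome and two small variant lists
counts_a = window_counts([1, 10_000, 10_001, 25_000], genome_length=30_000)
counts_b = window_counts([5, 50, 500, 5_000, 9_000, 15_000, 29_000], genome_length=30_000)
print(counts_a, counts_b)
print(compare_windows(counts_a, counts_b))
```

For plotting as in panel (b), the adjusted p-values would be transformed with −log10, taking care to handle values that underflow to zero. The enumeration in `fisher_exact_two_sided` is fine for a sketch but is slow for counts in the tens of thousands, where a statistics library would be the practical choice.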