Tobias P Loka1, Simon H Tausch1,2,3, Bernhard Y Renard4.
Abstract
The sequential paradigm of data acquisition and analysis in next-generation sequencing leads to high turnaround times for the generation of interpretable results. We combined a novel real-time read mapping algorithm with fast variant calling to obtain reliable variant calls while sequencing is still in progress. Our new algorithm delivers accurate read mapping results for intermediate cycles and supports large reference genomes such as the complete human reference. This enables the combination of real-time read mapping results with complex follow-up analysis. In this study, we showed the accuracy and scalability of our approach by applying real-time read mapping and variant calling to seven publicly available human whole exome sequencing datasets. Up to 89% of all detected SNPs were already identified after 40 sequencing cycles, with precision similar to that at the end of sequencing. Final results showed similar accuracy to those of conventional post-hoc analysis methods. When compared to standard routines, our live approach enables considerably faster interventions in clinical applications and infectious disease outbreaks. Besides variant calling, our approach can be adapted for a plethora of other mapping-based analyses.
Year: 2019 PMID: 31712740 PMCID: PMC6848508 DOI: 10.1038/s41598-019-52991-z
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Summary of data sets evaluated in this study. Information about sequencing platform, exome capture and coverage was adopted from Hwang et al.[18].
| Accession No. | Platform | Exome capture | Exome coverage | Reads^a | Read length |
|---|---|---|---|---|---|
| SRR098401 | HiSeq2000 | SureSelect v2 | 116.84× | 114 M | 2 × 76 bp |
| SRR292250 | HiSeq2000 | SeqCap EZ v2 | 116.06× | 85 M | 2 × 50 bp |
| SRR515199 | HiSeq2000 | SureSelect v4 | 298.45× | 167 M | 2 × 100 bp |
| SRR1611178 | HiSeq2000 | SeqCap EZ v3 | 79.93× | 45 M | 2 × 100 bp |
| SRR1611179 | HiSeq2000 | SeqCap EZ v3 | 79.84× | 45 M | 2 × 100 bp |
| SRR1611183 | HiSeq2500 | SeqCap EZ v3 | 129.94× | 74 M | 2 × 100 bp |
| SRR1611184 | HiSeq2500 | SeqCap EZ v3 | 111.90× | 64 M | 2 × 100 bp |
^a M = millions.
Figure 1. Area under the precision-recall curve (APR) for SNP calling in seven data sets at different sequencing cycles. SNP calling was performed with xAtlas using real-time read mapping results of HiLive2. Results for the samples SRR1611178, SRR1611179, SRR1611183 and SRR1611184 were combined into a single data series due to their high similarity (SRR1611178-84). Error bars for this data series show the standard deviation. Reads of SRR292250 and SRR098401 were shorter than 2 × 100 bp, which leads to missing data points. The vertical, ticked line in the middle of the plot separates the first and second read. (a) The gray columns show APR values using Bowtie 2 for read mapping and xAtlas (left) and GATK (right) for variant calling. The data for Bowtie 2 + GATK were taken from Hwang et al.[18]. The real-time workflow with HiLive2 and xAtlas provides first results after 40 sequencing cycles (30 cycles for SRR292250). An APR greater than 0.9 is reached after 75 cycles for all data sets with a minimal read length of 75 bp. From then until the end of sequencing, the APR increases moderately. (b) Precision with a quality threshold of 1 for variant calling with xAtlas. Precision is never lower than 0.89 at any sequencing cycle and, in general, increases only slightly over time. This indicates that results in early sequencing cycles are already reliable. (c) Recall with a quality threshold of 1 for variant calling with xAtlas. Recall improves strongly from the first available results until the end of the first read. The progression of all curves is similar to that of the APR curves (cf. a), indicating that the two measures are correlated. *Cycle 50 for SRR292250, cycle 55 for all other data sets. **Cycle 76 for SRR098401, cycle 75 for all other data sets.
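The three measures reported in Figure 1 can be illustrated with a minimal Python sketch. It assumes that precision and recall are computed from true-positive, false-positive and false-negative SNP counts, and that the APR is approximated by trapezoidal integration over (recall, precision) points obtained at different quality thresholds. The function names and the example points are hypothetical, and the record does not state how the APR values in the study were actually computed, so the exact procedure may differ.

```python
from typing import Iterable, Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (precision, recall) for one set of variant calls.

    tp: called SNPs that are in the truth set
    fp: called SNPs that are not in the truth set
    fn: truth-set SNPs that were not called
    """
    # Convention (an assumption): with no calls at all, precision is 1.0.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


def area_under_pr_curve(points: Iterable[Tuple[float, float]]) -> float:
    """Approximate the area under a precision-recall curve.

    points: (recall, precision) pairs, e.g. one pair per quality threshold.
    Uses the trapezoidal rule over the points sorted by recall.
    """
    pts = sorted(points)
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


if __name__ == "__main__":
    # Hypothetical counts, not taken from the study.
    print(precision_recall(tp=90, fp=10, fn=10))
    # Hypothetical curve points (recall, precision).
    print(area_under_pr_curve([(0.0, 1.0), (0.5, 1.0), (1.0, 0.0)]))
```

Under this sketch, a curve that keeps high precision while recall grows yields an APR close to 1, which matches the reading of panels (a) to (c): precision stays nearly constant over the cycles, so the APR progression is driven mainly by recall.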
Figure 2. Turnaround time of our workflow for data sets SRR1611184 (a) and SRR515199 (b). For each cycle, the first vertical line indicates the time point when the data for the respective cycle was completely written. The second line shows when the alignment output of HiLive2 is written. The third line indicates the end of our workflow resulting in the output of variant calls for the respective cycle. Vertical lines with the same vertical position belong to the same output cycle.
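The three time points marked per output cycle in Figure 2 translate into delays as in the following minimal sketch. The `CycleTimes` class, its field names and the example values are hypothetical and serve only to illustrate the arithmetic; they are not measurements from the study.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CycleTimes:
    """Time points for one output cycle, in minutes since sequencing start."""
    cycle: int
    data_written: float       # first line: sequencing data for the cycle complete
    alignment_written: float  # second line: HiLive2 alignment output written
    variants_written: float   # third line: variant calls written


def turnaround(t: CycleTimes) -> Dict[str, float]:
    """Split the turnaround time of one cycle into its two stages."""
    return {
        "alignment_delay": t.alignment_written - t.data_written,
        "calling_delay": t.variants_written - t.alignment_written,
        "total_turnaround": t.variants_written - t.data_written,
    }


if __name__ == "__main__":
    # Hypothetical time points for illustration only.
    print(turnaround(CycleTimes(cycle=40, data_written=100.0,
                                alignment_written=112.0,
                                variants_written=130.0)))
```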
List of software used in this study. Software with source Bioconda was installed with the environment management software conda (https://conda.io) and obtained from the Bioconda channel[13].
| Name | Version | Source | Used for |
|---|---|---|---|
| Bedtools | 2.21.0 | | VCF and BED file processing |
| Bowtie 2 | 2.3.4.1 | Bioconda | Read alignment |
| GATK | 3.8 | Bioconda | Alignment file processing |
| HiLive2 | 2.0 | | Real-time read alignment |
| RTG Tools | 3.9 | Bioconda | Benchmark of variant calls |
| Samtools | 1.8 | Bioconda | SAM/BAM file processing |
| VCFLIB | 1.0.0_rc1 | Bioconda | VCF file processing |
| VCFtools | 0.1.12 | | VCF file processing |
| xAtlas | 0.1 | Bioconda | Fast variant calling |