Shifu Chen, Tanxiao Huang, Tiexiang Wen, Hong Li, Mingyan Xu, Jia Gu.
Abstract
BACKGROUND: Some types of clinical genetic tests, such as cancer testing using circulating tumor DNA (ctDNA), require sensitive detection of known target mutations. However, conventional next-generation sequencing (NGS) data analysis pipelines typically involve several filtering steps, which may cause key low-frequency mutations to be missed. Key mutations detected by bioinformatics pipelines also need to be validated. Typically, this validation is done with alignment visualization tools such as IGV or GenomeBrowse. However, these tools are too heavyweight to be practical for validating mutations in ultra-deep sequencing data. RESULT: We developed MutScan to address the problems of sensitive detection and efficient validation of target mutations. MutScan uses highly optimized string-searching algorithms to scan input FASTQ files and collect all reads that support target mutations. The supporting reads collected for each target mutation are piled up and visualized using web technologies such as HTML and JavaScript. Algorithms such as rolling hash and Bloom filter are applied to accelerate scanning, allowing MutScan to detect or visualize target mutations quickly.
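The abstract names rolling hash and Bloom filter as the techniques that accelerate scanning. The sketch below shows how the two ideas can fit together; it is an illustration only and not MutScan's code. The 2-bit base encoding, the `BloomFilter` class, its sizes, and the use of BLAKE2 for the filter positions are all assumptions introduced here, and the encoding assumes k ≤ 32 so that a key fits in 8 bytes.

```python
import hashlib

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def rolling_kmer_keys(read, k):
    """Yield (start, key) for each k-mer, where key packs the k-mer at 2 bits per base.

    Each key is derived from the previous one in O(1) by shifting in one base,
    instead of re-encoding all k bases. Windows containing a base outside
    A/C/G/T (for example N) are skipped.
    """
    mask = (1 << (2 * k)) - 1
    key = 0
    valid = 0  # number of consecutive valid bases ending at the current position
    for i, base in enumerate(read):
        code = BASE_CODE.get(base)
        if code is None:
            key, valid = 0, 0
            continue
        key = ((key << 2) | code) & mask
        valid += 1
        if valid >= k:
            yield i - k + 1, key


class BloomFilter:
    """Hypothetical minimal Bloom filter over integer k-mer keys (k <= 32)."""

    def __init__(self, n_bits=1 << 20, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, key):
        # One salted BLAKE2 digest per hash function.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                key.to_bytes(8, "little"), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, key):
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(key))


# Toy usage with k = 4: register the k-mers of a made-up target sequence,
# then pre-filter the k-mers of a made-up read.
k = 4
bloom = BloomFilter()
for _, key in rolling_kmer_keys("ACGTT", k):
    bloom.add(key)
candidates = [start for start, key in rolling_kmer_keys("GGACGTNACGTT", k) if key in bloom]
```

Here `candidates` is guaranteed to contain the read offsets 2, 7 and 8, because a Bloom filter never produces false negatives. It may occasionally contain other offsets as false positives, so a real scanner would confirm each candidate with an exact lookup.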
Keywords: Fast detection; MutScan; Mutation scan; Variant visualization
Year: 2018 PMID: 29357822 PMCID: PMC5778627 DOI: 10.1186/s12859-018-2024-6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The overall design of MutScan. Three steps are presented: indexing, matching, and reporting. In the indexing step, a hashmap from k-mers (all possible substrings of length k; k = 16 in MutScan’s implementation) to mutations is computed; in the matching step, reads are associated with mutations by looking up the indexed hashmap; in the reporting step, the detected mutations are validated, and the supporting reads for each mutation are piled up and rendered to an HTML page. The input and output files are highlighted in grey
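The indexing and matching steps described in Fig. 1 can be illustrated with a minimal Python sketch. It is not MutScan's implementation: the function names, the toy sequences, and the rule that one shared k-mer is enough to associate a read with a mutation are assumptions made for illustration. Only k = 16 is taken from the caption.

```python
from collections import defaultdict

K = 16  # k-mer length stated in the Fig. 1 caption


def build_index(mutations, k=K):
    """Map every k-mer that overlaps the mutated bases to the mutation names.

    `mutations` maps a name to a (left, middle, right) tuple of sequences,
    where `middle` holds the mutated base(s).
    """
    index = defaultdict(set)
    for name, (left, middle, right) in mutations.items():
        seq = left + middle + right
        # First k-mer start whose window still reaches the mutated bases.
        first = max(0, len(left) - k + 1)
        # Last k-mer start that lies within the mutated bases and fits in seq.
        last = min(len(seq) - k, len(left) + len(middle) - 1)
        for start in range(first, last + 1):
            index[seq[start:start + k]].add(name)
    return index


def match_read(read, index, k=K):
    """Return the names of all mutations sharing at least one indexed k-mer with the read."""
    hits = set()
    for start in range(len(read) - k + 1):
        hits |= index.get(read[start:start + k], set())
    return hits


# Toy data (invented sequences, not a real locus): a single C > T substitution.
mutations = {"toy_snv": ("ACGTACGTACGTACGTACGT", "T", "GGCCTTAAGGCCTTAAGGCC")}
index = build_index(mutations)

supporting_read = "GTACGTACGTTGGCCTTAAGGCCTT"  # carries the mutated base T
reference_read = "GTACGTACGTCGGCCTTAAGGCCTT"   # carries the reference base C

print(match_read(supporting_read, index))  # {'toy_snv'}
print(match_read(reference_read, index))   # set()
```

Indexing only the k-mers that overlap the mutated bases keeps reads that merely cover the flanking reference sequence from being counted as support. A production tool would also need to handle sequencing errors and reverse-complement reads, which this sketch ignores.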
Fig. 2 Screenshot of a MutScan pile-up result. The mutation shown is EGFR p.T790M (hg19 chr7:55,249,071 C > T), which is an important druggable target in lung cancer. The (L, M, R) sequences of this mutation are given at the top of the figure, where M is the mutated base (C > T). The color of the bases indicates the quality score (green and blue indicate high quality, red indicates low quality). This screenshot is partial; the complete report can be found at http://opengene.org/MutScan/report.html
Fig. 3 Comparison of MutScan and a conventional NGS pipeline. The conventional pipeline is a tumor variant calling pipeline using AfterQC + BWA + Samtools + VarScan2, which can be found at https://github.com/sfchen/tumor-pipeline. Mutations are given in columns and samples in rows. Mutations detected by the tumor pipeline are highlighted in shades of red, and mutations detected by MutScan are highlighted in shades of green. The depth of the color reflects the number of unique supporting reads, which is also shown in the table cells
Execution time comparison of MutScan and conventional pipelines
| Sample ID | Base number | GATK pipeline called variants | Tumor pipeline called variants | GATK pipeline time | Tumor pipeline time | MutScan time (GATK VCF) | MutScan time (tumor pipeline VCF) | MutScan time (built-in mutation) |
|---|---|---|---|---|---|---|---|---|
| S001 | 3.07 G | 376 | 1163 | 166m01s | 84m26s | 3m09s | 4m31s | 1m28s |
| S002 | 2.70 G | 376 | 1438 | 158m40s | 64m09s | 3m03s | 4m50s | 1m11s |
| S003 | 4.98 G | 531 | 949 | 236m57s | 135m47s | 4m57s | 6m59s | 2m19s |
| S004 | 3.51 G | 375 | 798 | 186m48s | 100m14s | 3m16s | 4m26s | 1m34s |
| S005 | 3.50 G | 385 | 751 | 191m29s | 84m24s | 3m26s | 4m20s | 1m34s |
| S006 | 3.67 G | 359 | 1303 | 182m42s | 96m50s | 3m24s | 5m57s | 1m36s |
| S007 | 6.08 G | 380 | 2055 | 200m22s | 142m30s | 4m20s | 11m17s | 2m29s |
| S008 | 3.33 G | 383 | 873 | 175m16s | 90m27s | 2m52s | 4m37s | 1m20s |
The input files are gzip-compressed paired-end sequencing FASTQ files, and the base number is the sum over both paired files. Because the tumor and GATK pipelines used different variant detection and filtering strategies, the tumor pipeline detected more variants than the GATK pipeline. The column MutScan time (GATK VCF) gives the execution time of MutScan when run with the VCF (INDEL + SNV) called by the GATK pipeline, and likewise for the column MutScan time (tumor pipeline VCF). When MutScan was run with a VCF, its execution time was predominantly determined by the size of the FASTQ file and the number of variants
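To compare the timings in this table numerically, the duration strings first have to be converted to seconds. The helper below is a small sketch for that arithmetic; `to_seconds` is a name introduced here, and the example values are copied from the S001 row. Note that the resulting ratio compares wall-clock times only: in this column MutScan consumes a VCF that the GATK pipeline had already produced, so it is not a like-for-like replacement.

```python
import re


def to_seconds(text):
    """Convert a duration such as '166m01s' or '1 m 34 s' to seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*m\s*(\d+)\s*s\s*", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = map(int, match.groups())
    return 60 * minutes + seconds


# Sample S001: GATK pipeline time versus MutScan time (GATK VCF).
gatk = to_seconds("166m01s")             # 9961 seconds
mutscan_gatk_vcf = to_seconds("3m09s")   # 189 seconds
print(f"{gatk / mutscan_gatk_vcf:.1f}x")  # 52.7x
```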
Time (in seconds) and memory (in megabytes) used by MutScan for processing FASTQs and mutation sets of different sizes
| FASTQ size | 5 K mutations (simplified) | 5 K mutations (normal) | 10 K mutations (simplified) | 10 K mutations (normal) | 50 K mutations (simplified) | 50 K mutations (normal) |
|---|---|---|---|---|---|---|
| 5 Gbp | 255 s, 672 M | 370 s, 1542 M | 289 s, 683 M | 428 s, 2110 M | 380 s, 943 M | 537 s, 9113 M |
| 10 Gbp | 402 s, 692 M | 621 s, 2260 M | 447 s, 714 M | 648 s, 3201 M | 624 s, 1113 M | 2279 s, 15,569 M |
| 50 Gbp | 1622 s, 769 M | 2956 s, 8352 M | 1897 s, 929 M | 3469 s, 12,440 M | 2729 s, 2601 M | 10,927 s, 69,389 M |
The input was paired-end data, and the base number was the sum of read1 and read2. Both the simplified mode and the normal mode were evaluated and shown in the table