Wanding Zhou, Hao Zhao, Zechen Chong, Routbort J Mark, Agda K Eterovic, Funda Meric-Bernstam, Ken Chen.
Abstract
Applying genomics to patient care demands sensitive, unambiguous and rapid characterization of a known set of clinically relevant variants in patients' samples. This objective differs substantially from the standard discovery process, in which every base in every sequenced read must be examined. Further, the approach must be robust enough to detect multiple and potentially rare variants in heterogeneous samples. To meet this objective, we developed a novel variant characterization framework, ClinSeK, which performs targeted analysis of relevant reads from high-throughput sequencing data. ClinSeK is designed for efficient targeted short-read alignment and can characterize a wide spectrum of genetic variants, from single nucleotide variations to large-scale genomic rearrangement breakpoints. Applied to over a thousand cancer patients, ClinSeK performed substantively better for clinical applications than existing variant discovery tools in terms of accuracy, runtime and disk storage. ClinSeK is freely available for academic use at http://bioinformatics.mdanderson.org/main/clinsek.
Year: 2015 PMID: 25918555 PMCID: PMC4410453 DOI: 10.1186/s13073-015-0155-1
Source DB: PubMed Journal: Genome Med ISSN: 1756-994X Impact factor: 11.117
Figure 1. Schematic overview of ClinSeK. (A) The four major steps of the ClinSeK workflow for analyzing single nucleotide variants (SNVs) and insertions and deletions (indels) from DNA-sequencing data. (B) Illustration of k-mer screening, targeted alignment and variant calling. Sequencing reads (blue arrows) in raw FASTQ files are screened for the presence of k-mers created from target sites of interest (dark, vertical dashed lines), which are predefined based on variant databases such as ClinVar and COSMIC. Reads that do not contain any target k-mers (grey arrows) are discarded. Reads associated with a target site (red vertical bar) are aligned against the corresponding local reference sequences (grey horizontal bars), and potential variants (red dots) are identified. Reads are then realigned with their mates (arrows in opposite directions) and against paralogous sites (green vertical bars) from other chromosomes. Variants are finally called from reads of high mapping quality (dark blue arrows). (C) Illustration of ClinSeK targeted breakpoint analysis. DNA or RNA sequencing reads are screened for the presence of k-mers in the reference and in the variant alleles near the breakpoints or fusion junctions. Reads that do not contain any target k-mers are discarded. The remaining reads are preferentially aligned to the wild-type reference sequence (orange arrows) and to the fusion breakpoint (magenta bar) sequence (red arrows), and are then counted and compared. (D) ClinSeK output. Reads and their alignments at the target sites are output in BAM files. Variants are output in VCF format and are further included in the clinical report.
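The k-mer screening step in panel (B) can be illustrated with a minimal Python sketch. It is not ClinSeK's implementation: the function names (`build_target_index`, `screen_reads`), the toy sequences and the choice of k = 5 are assumptions made for illustration only, and a practical screen would also need to handle reverse-complemented reads and FASTQ parsing, which are omitted here.

```python
from typing import Dict, Iterator, List, Set, Tuple


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every overlapping substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_target_index(targets: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map each k-mer to the set of target sites whose local sequence contains it.

    `targets` maps a (hypothetical) site name to the reference sequence
    flanking that site.
    """
    index: Dict[str, Set[str]] = {}
    for site, ref_seq in targets.items():
        for km in kmers(ref_seq.upper(), k):
            index.setdefault(km, set()).add(site)
    return index


def screen_reads(
    reads: List[Tuple[str, str]], index: Dict[str, Set[str]], k: int
) -> List[Tuple[str, Set[str]]]:
    """Keep reads sharing at least one k-mer with a target; discard the rest.

    Returns (read name, matching sites) pairs, so that each kept read can
    later be aligned only against the local references of those sites.
    """
    kept = []
    for name, seq in reads:
        hits: Set[str] = set()
        for km in kmers(seq.upper(), k):
            hits |= index.get(km, set())
        if hits:
            kept.append((name, hits))
    return kept


# Toy data (assumed, not from the paper).
targets = {"siteA": "ACGTACGTTAGC", "siteB": "TTTTGGGGCCCC"}
reads = [("r1", "GGACGTACGG"), ("r2", "AAAAAAAAAA"), ("r3", "TGGGGCCA")]

index = build_target_index(targets, k=5)
kept = screen_reads(reads, index, k=5)
print(kept)
```

In this toy run, `r2` shares no 5-mer with either target and is dropped, while `r1` and `r3` are retained together with the site each one matches. Because only the retained reads go on to alignment, the cost of the later steps depends on the number of target sites rather than on the total number of sequenced reads.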
Figure 2. Comparison of ClinSeK alignment with BWA. (A) Overlap between ClinSeK and BWA alignments. The numbers are reported from 1,000 sites randomly chosen from the ClinVar database and from 700 samples. Brown indicates reads aligned to the target site by both ClinSeK and BWA; green indicates reads aligned to the target site only by ClinSeK; pink indicates reads aligned only by BWA. (B) Alignment score distribution of reads aligned only by ClinSeK and not by BWA aln. The alignment score is calculated by BLAT, with a maximum score of 200 for reads as long as 100 bp.
Figure 3. ClinSeK performance in analyzing DNA-seq and RNA-seq data. (A) Comparison of ClinSeK, VarScan2 and MuTect sensitivity in characterizing somatic mutations from 1,024 targeted exome-sequenced tumor and normal pairs. The text box lists CLIA-validated somatic mutations detected only by ClinSeK. (B) Comparison of ClinSeK and the base-to-base pipeline in runtime (blue dots) and data storage (green dots). Dashed lines correspond to 80× and 200× reductions in runtime and storage, respectively. Dot sizes are proportional to the number of reads sequenced from each sample. (C) Illustration of ClinSeK gene fusion detection results on The Cancer Genome Atlas (TCGA) samples. Black horizontal bars indicate breakpoints. Text boxes list the names of the TCGA samples in which the corresponding fusion is detected.