Prashanthi Dharanipragada, Sampreeth Reddy Seelam, Nita Parekh.
Abstract
The current trend in clinical data analysis is to understand how individuals respond to therapies and drug interactions based on their genetic makeup. This has led to a paradigm shift in healthcare; caring for patients is now 99% information and 1% intervention. The falling cost of next generation sequencing (NGS) technologies has made it possible to take genetic profiling to the clinical setting. This requires not just fast and accurate algorithms for variant detection, but also a knowledge-base for variant annotation and prioritization to facilitate tailored therapeutics based on an individual's genetic profile. Here we show that it is possible to provide fast and easy access to all available information about a variant and its impact on the gene, its protein product, associated pathways and drug-variant interactions by integrating previously reported knowledge from various databases. With this objective, we have developed a pipeline, Sequence Variants Identification and Annotation (SeqVItA), that provides an end-to-end solution for detection, annotation and prioritization of small sequence variants on a single platform. With a parallelized variant detection step and numerous resources incorporated to infer functional impact, clinical relevance and drug-variant associations, SeqVItA will benefit the clinical and research communities alike. Its open-source platform and modular framework allow easy customization of the workflow depending on the data type (single, paired, or pooled samples), variant type (germline and somatic), and variant annotation and prioritization. We compare the performance of SeqVItA with other tools on simulated data, and carry out detection, interpretation and analysis of somatic variants on real data (24 liver cancer patients). We demonstrate the efficacy of the annotation module in facilitating personalized medicine based on a patient's mutational landscape. SeqVItA is freely available at https://bioinf.iiit.ac.in/seqvita.
Keywords: INDELs; NGS; SNPs; annotation; personalized medicine; platform; sequence variants
Year: 2018 PMID: 30487811 PMCID: PMC6247818 DOI: 10.3389/fgene.2018.00537
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Feature comparison of SeqVItA with various popular and recent tools for sequence variant calling, annotation and prioritization in NGS data.
| Single | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Paired (case-control) | ✓ | – | ✓ | – | ✓ | – | – | – | – | – | ✓ | ✓ | ✓ | ✓ |
| Multiple | ✓ | ✓ | ✓ | – | ✓ | – | – | – | ✓ | – | – | – | ✓ | |
| Read alignment | – | – | – | – | – | – | – | – | – | ✓ | – | ✓ | ✓ | – |
| Preprocessing post-alignment | ✓ | ✓ | – | ✓ | – | – | – | – | – | ✓ | – | – | – | ✓ |
| Germline | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | – | – | – | ✓ | ✓ | ✓ | ✓ | ✓ |
| Somatic | ✓ | – | ✓ | – | ✓ | – | – | – | – | – | ✓ | ✓ | ✓ | ✓ |
| Multi-allelic F | ✓ | – | – | – | – | ✓ | – | – | – | – | – | – | – | ✓ |
| Parallelization | ✓ | ✓ | ✓ | ✓ | – | – | – | – | – | ✓ | – | – | – | ✓ |
| Gene-locus | – | – | – | – | – | – | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Conservation score | – | – | – | – | – | – | ✓ | ✓ | ✓ | ✓ | – | ✓ | ✓ | ✓ |
| Population alleles | – | – | – | – | – | – | ✓ | ✓ | ✓ | ✓ | ✓ | – | ✓ | – |
| Clinical associations | – | – | – | – | – | – | ✓ | ✓ | ✓ | – | ✓ | ✓ | ✓ | ✓ |
| Drug associations | – | – | – | – | – | – | – | – | – | – | – | – | – | ✓ |
| Prioritization | – | – | – | – | – | – | – | ✓ | ✓ | ✓ | ✓ | ✓ | – | ✓ |
| Reference | McKenna et al., | Li, | Koboldt et al., | Poplin et al., | Liu et al., | Hu et al., | Wang et al., | Kircher et al., | Kumar et al., | Chiara et al., | Magi et al., | Doig et al., | Brouwer et al., | |
Figure 1. Workflow of SeqVItA for identification, annotation and prioritization of sequence variants in WGS, WES, or TS data. Het, Heterozygous; Homo, Homozygous; LOH, Loss of heterozygosity; MAF, Minor allele frequency.
Various steps and corresponding parameters in the detection of SNVs and INDELs in SeqVItA are summarized.
| Step | Description | Parameter | Value |
| Pre-processing (aligned file in BAM format) | Mapping quality correction using Equation (1) | –Mqcorr | 0 |
| | Filter reads based on mapping quality cut-off | –Mqread | 20 |
| Variant calling | Base quality cut-off (Phred score) | –Qbase | 15 |
| | If read depth at a site ≥ cut-off, site is considered for variant calling | –RD_th | 10 |
| | If no. of reads supporting alternate allele ≥ cut-off, site is considered for variant calling | –VAR_th | 2 |
| | Compute variant allele frequency (VAF) for variant calling | –VAF_th | 0.20 |
| | Check for strand bias (discard if ≥ 90% or ≤ 10% of support comes from the same strand) | –Strand_Bias | 1 |
| | | –p-value | 0.01 |
| | For germline SVs, if VAF > cut-off, variant is homozygous, else heterozygous | –VAF_homo | 0.75 |
| | | –somatic-p-value | 0.05 |
| Variant annotation | Drug association based on gene/variant in PharmGKB | -d | – |
FET: Fisher Exact test.
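The threshold checks listed in the table can be sketched as a single per-site filter. This is a minimal illustration under stated assumptions, not SeqVItA's implementation: the `SiteCounts` class and `call_site` function are hypothetical names, the counts are assumed to be taken after the mapping-quality and base-quality filters, the VAF cut-off is assumed to be inclusive, and the Fisher exact test step is left out.

```python
from dataclasses import dataclass


@dataclass
class SiteCounts:
    """Hypothetical per-site summary, counted after read and base quality filtering."""
    depth: int    # total reads covering the site
    alt_fwd: int  # alternate-allele reads on the forward strand
    alt_rev: int  # alternate-allele reads on the reverse strand


def call_site(site, rd_th=10, var_th=2, vaf_th=0.20, vaf_homo=0.75, strand_bias=True):
    """Return 'homozygous', 'heterozygous', or None if the site fails a filter.

    Default values mirror the table; the order of the checks is an assumption.
    """
    alt = site.alt_fwd + site.alt_rev
    if alt == 0 or site.depth < rd_th or alt < var_th:
        return None
    vaf = alt / site.depth  # variant allele frequency
    if vaf < vaf_th:
        return None
    if strand_bias:
        fwd_fraction = site.alt_fwd / alt
        # Discard if nearly all supporting reads come from one strand.
        if fwd_fraction >= 0.90 or fwd_fraction <= 0.10:
            return None
    return "homozygous" if vaf > vaf_homo else "heterozygous"


if __name__ == "__main__":
    print(call_site(SiteCounts(depth=40, alt_fwd=6, alt_rev=6)))
    print(call_site(SiteCounts(depth=40, alt_fwd=12, alt_rev=0)))
```

Keeping each threshold as a keyword argument mirrors the way the table exposes one command-line parameter per check.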
The 2 × 2 contingency table for computing the p-value using Fisher exact test.
| | Observed | Expected | Row total |
| Variant | N(obs, var) | N(exp, var) | N(var) |
| Reference | N(obs, ref) | N(exp, ref) | N(ref) |
| Column total | N(obs) | N(exp) | N |
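A p-value can be computed from a table of this shape as follows. This is a minimal sketch of a two-sided Fisher exact test built directly on the hypergeometric distribution, not SeqVItA's own code. The function name and the example counts are hypothetical, and because the way the expected counts are derived is not given here, they are simply passed in as inputs.

```python
from math import comb


def fisher_exact_two_sided(obs_var, exp_var, obs_ref, exp_ref):
    """Two-sided Fisher exact p-value for the table

        Variant   | obs_var | exp_var
        Reference | obs_ref | exp_ref

    The p-value sums the probabilities of all tables with the same margins
    that are no more likely than the observed one.
    """
    row_var = obs_var + exp_var   # N(var)
    row_ref = obs_ref + exp_ref   # N(ref)
    col_obs = obs_var + obs_ref   # N(obs)
    denom = comb(row_var + row_ref, col_obs)

    lo = max(0, col_obs - row_ref)
    hi = min(row_var, col_obs)
    probs = {
        x: comb(row_var, x) * comb(row_ref, col_obs - x) / denom
        for x in range(lo, hi + 1)
    }
    p_obs = probs[obs_var]
    # Small relative tolerance so ties are not lost to floating-point rounding.
    return min(1.0, sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9)))


if __name__ == "__main__":
    # Hypothetical site: 12 of 40 observed reads carry the variant,
    # while 0 of 40 are expected to.
    p = fisher_exact_two_sided(obs_var=12, exp_var=0, obs_ref=28, exp_ref=40)
    print(p, p < 0.01)
```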
The 2 × 2 contingency table for computing p-value using Fisher exact test to predict somatic, germline and LOH mutations.
| | Reference | Variant | Row total |
| Tumor | N(tum, ref) | N(tum, var) | N(tum) |
| Normal | N(nor, ref) | N(nor, var) | N(nor) |
| Column total | N(ref) | N(var) | N |
Total number of reads, N = N(ref) + N(var) = N(tum) + N(nor), where N(ref) and N(var) are the numbers of reads supporting the reference and the variant, respectively, and N(tum) and N(nor) are the numbers of reads in the tumor and normal samples, respectively.
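The tumor-versus-normal comparison can be sketched in the same way. The example below is a hedged illustration, not SeqVItA's decision procedure: the decision rules follow a commonly used scheme (somatic when only the tumor carries the variant, LOH when both carry it but at significantly different frequencies, germline otherwise) and are an assumption, as are the function names, the one-sided tests, and the reuse of the 0.20 VAF and 0.05 p-value cut-offs. Both samples are assumed to have non-zero depth.

```python
from math import comb


def enrichment_p(a_var, a_total, b_var, b_total):
    """One-sided Fisher exact p-value: probability that sample A holds at
    least a_var of the variant reads, given the table margins."""
    n_var = a_var + b_var
    denom = comb(a_total + b_total, n_var)
    hi = min(a_total, n_var)
    num = sum(comb(a_total, x) * comb(b_total, n_var - x) for x in range(a_var, hi + 1))
    return num / denom


def classify(tum_ref, tum_var, nor_ref, nor_var, p_cut=0.05, vaf_th=0.20):
    """Label a site 'somatic', 'germline', 'LOH' or 'no call' (assumed rules)."""
    n_tum, n_nor = tum_ref + tum_var, nor_ref + nor_var
    tum_vaf, nor_vaf = tum_var / n_tum, nor_var / n_nor
    p_tum = enrichment_p(tum_var, n_tum, nor_var, n_nor)  # variant enriched in tumor
    p_nor = enrichment_p(nor_var, n_nor, tum_var, n_tum)  # variant enriched in normal

    if nor_vaf < vaf_th:
        # Normal looks like the reference: somatic only if the tumor carries
        # the variant at a significantly higher frequency.
        if tum_vaf >= vaf_th and p_tum < p_cut:
            return "somatic"
        return "no call"
    # Normal carries the variant: a significant shift in either direction
    # is treated as loss of heterozygosity, otherwise germline.
    if min(p_tum, p_nor) < p_cut:
        return "LOH"
    return "germline"


if __name__ == "__main__":
    print(classify(tum_ref=25, tum_var=15, nor_ref=40, nor_var=0))
    print(classify(tum_ref=20, tum_var=20, nor_ref=21, nor_var=19))
```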
Figure 2. Performance of SeqVItA in detecting SNVs in simulated data. F-score values for detecting homozygous (“triangle”) and heterozygous (“square”) SNVs with read lengths of 50 bp (empty symbols) and 100 bp (filled symbols). Minimum coverage threshold = 10 and base quality ≥ 15.
Figure 3. Performance of SeqVItA in detecting INDELs in simulated data. F-score values for predicting (A) homozygous (Homo) and (B) heterozygous (Het) INDELs of various sizes: 1 bp (“diamond”), 2 bp (“square”), 5 bp (“triangle”) and 10 bp (“circle”) for two read lengths, 50 bp (empty symbols) and 100 bp (filled symbols). Minimum coverage threshold = 10 and base quality ≥ 15.
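The F-score reported in Figures 2 and 3 is the harmonic mean of precision and recall. A minimal sketch of how it could be computed against a simulated truth set is shown here; the function name, the `(chromosome, position, ref, alt)` tuple used as a variant key, and the example variants are all assumptions made for illustration.

```python
def f_score(called, truth):
    """F-score of a call set against a truth set.

    Both arguments are sets of hashable variant keys, for example
    (chromosome, position, ref, alt) tuples.
    """
    true_positives = len(called & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(called)  # fraction of calls that are real
    recall = true_positives / len(truth)      # fraction of real variants found
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical simulated variants and calls.
    truth = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
             ("chr2", 40, "G", "GA"), ("chr2", 900, "TC", "T")}
    called = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
              ("chr2", 40, "G", "GA"), ("chr3", 7, "A", "C")}
    print(f_score(called, truth))
```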
Various parameters considered for performance evaluation of SeqVItA on simulated data with three other tools.
| Parameter | SeqVItA | BCFtools | VarScan2 | GATK |
| Mapping quality bias correction | Recalibration of mapping quality: | Mann–Whitney | – | Wilcoxon rank sum test |
| Mapping quality cut-off | 20 | 20 | 20 | – |
| Minimum coverage cut-off | 10 | – | 8 | – |
| Base quality cut-off | 15 | 15 | 15 | 15 |
| Germline | 0.01 | – | 0.01 | – |
| Strand bias | Discard sites with < 10% or > 90% strandedness | Mann–Whitney | Discard sites with < 10% or > 90% strandedness | Fisher exact test |
Figure 4. Performance comparison of SeqVItA with BCFtools, VarScan2 and GATK on simulated data at three sequencing depths (20×, 40×, and 60×) in detecting homozygous (Homo) and heterozygous (Het) (A) SNVs, (B,C) insertions (Ins), and (D,E) deletions (Del). Read length = 100 bp and base quality threshold ≥ 15.
Figure 5. Mutational landscape of somatic sequence variants identified in 24 HCC patient samples (intronic and intergenic variants excluded). Each column corresponds to a patient sample and each row represents a gene.
Figure 6. Clustering of HCC patients based on somatically mutated genes.
Summary of recurrent genes exhibiting somatic mutations in at least 25% of liver cancer patients.
| TP53 | 10 | 1,2,5,6,7,9,10,12,18,24 | 10 | 8 | 2 |
| FGFR1 | 7 | 1,7,11,13,21,23,24 | 5 | 5 | 0 |
| FANCD2 | 7 | 2,7,8,9,15,20,21 | 5 | 4 | 1 |
| MIR1278 | 7 | 1,6,7,9,10,17,21 | 7 | 7 | 0 |
| JAK1 | 6 | 2,5,8,9,11,23 | 5 | 5 | 0 |
| NCOR1 | 5 | 1,2,16,23,24 | 9 | 9 | 0 |
| NUP93 | 5 | 2,6,16,23,24 | 4 | 2 | 2 |
| XPO1 | 5 | 2,5,9,11,17 | 3 | 2 | 1 |
| SDHC | 5 | 7,9,12,18,20 | 3 | 3 | 0 |
| TSC2 | 5 | 7,11,13,18,23 | 4 | 4 | 0 |
Figure 7. Interaction between recurrently mutated genes from the STRING database. Pathway enrichment analysis of these mutated genes indicates that the cell cycle (shown in red) and PI3K/AKT (shown in blue) pathways are affected.
Summary of sequence variants and their functional role identified in HCC patient 9 (39 yr old, male), patient 19 (56 yr old, male), and patient 22 (60 yr old, male).
| | Patient 9 | Patient 19 | Patient 22 |
| Number of mutated genes | 44 | 77 | 69 |
| High/Medium priority genes | TP53, FANCD2, SDHC, MTOR, TPMT, DNMT1, GSTA1, CYP2B6 (8) | CTNNB1, MSH1, FBXO11, MSH6, MAP3K1, KMT2C, BRCA2, NCOR1, BCR, ARID2, CARD11, ERG, GRM3 (13) | SDHC, ATR, FCGR2B, ESR1, HGF, CDKN2A, ATM, TSHR, TP53, NCOR1, NF1, KEAP1, BCR, MSH2 (14) |
| Known oncogenes | MTOR | CTNNB1, BCR, CARD11, ERG | CDKN2A, BCR |
| Known tumor suppressors | TP53, DNMT1 | NCOR1, BRCA2, KMT2C, ARID2 | SDHC, ATR, ESR1, ATM, TP53, NCOR1, NF1, KEAP1 |
| Key pathways affected | TP53: cell cycle MTOR: Upregulation frequently observed in HCC, MTOR –| PTEN, IGF and EGF pathways DNMT1: Methylates PTEN promoter –| PTEN activation of PI3K/AKT/mTOR pathway | CTNNB1: Wnt pathway → proliferation and survival | KEAP1: Oxidative stress → proliferation and survival in HCC |
| Variant-Drug associations (PharmGKB) | Mutations in GSTA1 & CYP2B6 affect enzymatic activity of drugs → lower efficacy | CTNNB1(A>G): Ethnic-specific, wild-type (AA) associated with better response to CTD therapy, not significant in heterozygous condition | ESR1: Increased risk of azoospermia in childhood cancer survivors when treated with alkylating agents and cisplatin |
Figure 8. (A) Total number of somatic variants called and (B) pair-wise agreement (0–1 scores) between SNVs and INDELs predicted by SeqVItA, Mutect2, and VarScan2.