Milica Krunic, Peter Venhuizen, Leonhard Müllauer, Bettina Kaserer, Arndt von Haeseler.
Abstract
Fast and affordable benchtop sequencers are becoming more important in improving personalized medical treatment. Still, distinguishing genetic variants between healthy and diseased individuals from sequencing errors remains a challenge. Here we present VARIFI, a pipeline for finding reliable genetic variants (single nucleotide polymorphisms (SNPs) and insertions and deletions (indels)). We optimized parameters in VARIFI by analyzing more than 170 amplicon-sequenced cancer samples produced on the Personal Genome Machine (PGM). In contrast to existing pipelines, VARIFI combines different analysis methods and, based on their concordance, assigns a confidence score to each identified variant. Furthermore, VARIFI applies variant filters for biases associated with the sequencing technologies (e.g., incorrectly identified homopolymer-associated indels with Ion Torrent). VARIFI automatically extracts variant information from publicly available databases and incorporates methods for variant effect prediction. VARIFI requires little computational experience and no in-house compute power since the analyses are conducted on our server. VARIFI is a web-based tool available at varifi.cibiv.univie.ac.at.
Keywords: amplicon sequencing; cancer; personalized medicine; pipeline; variant finding
Year: 2019 PMID: 30717290 PMCID: PMC6463100 DOI: 10.3390/jpm9010010
Source DB: PubMed Journal: J Pers Med ISSN: 2075-4426
Figure 1. Complete variant analysis workflow, from sample preparation to variant validation. Processes are shown as rectangles: VARIFI processes are in red, and the laboratory experiments that we used to develop and optimize VARIFI are in blue. Parallelograms show the input/output data of these processes. Arrows show the flow of processing and data generation. After sample preparation and sequencing, VARIFI starts by aligning the reads against the human reference genome and continues with variant identification in the alignment results. Detected variants are then filtered, annotated and prepared for visualization. The VARIFI output is a list of filtered and annotated variants displayed in a final report. We used variant validation to optimize and evaluate the VARIFI processes.
Figure 2. Distribution of the variance of the gap/insert length (var) and the frequency of the most common gap/insert length (frmode) at the potential indel site. Green circles represent the values of var and frmode for 18 true positive (TP) training indels, and red circles represent var and frmode for 90 false positive (FP) training indels. We used the training indels to define threshold values for the filtering parameters var and frmode. maxVar is the maximum of all var values for the training TPs. minFrmode is the minimum of all frmode values for the training TPs. We filtered out a potential indel x if var(x) > maxVar and frmode(x) < minFrmode (red shaded area). In this way, we retained all training TPs and filtered out 60% of the training FPs. To test the filter settings, we used five test indels from two samples (S1 and S2) that Sanger sequencing showed to be FPs. The parameters for each test indel are represented by a combination of a symbol (indel genomic position) and a color (aligner); e.g., the magenta “+” symbol represents var and frmode for an indel at position chr17:7579419, with the parameters calculated from the bwa-aligned file. Because, for each test indel, the parameters calculated for at least one aligner fall in the area from which indels were filtered out, we filtered out all five FP test indels. The small plot in the bottom right corner of the figure is an enlargement of the upper left corner of the main figure.
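The filtering rule described in this caption can be sketched in a few lines of Python. This is an illustrative sketch, not the VARIFI implementation: the function names are hypothetical, var is assumed to be the population variance of the observed gap/insert lengths, and frmode is assumed to be the relative frequency (between 0 and 1) of the most common length. The caption does not specify either detail.

```python
from collections import Counter
from statistics import pvariance


def indel_stats(lengths):
    """Return (var, frmode) for the gap/insert lengths observed at one site.

    Assumptions: var is the population variance, frmode is the share of
    observations that carry the most common length.
    """
    var = pvariance(lengths)
    mode_count = Counter(lengths).most_common(1)[0][1]
    frmode = mode_count / len(lengths)
    return var, frmode


def fit_thresholds(training_tps):
    """Derive (maxVar, minFrmode) from (var, frmode) pairs of training TPs."""
    max_var = max(var for var, _ in training_tps)
    min_frmode = min(frmode for _, frmode in training_tps)
    return max_var, min_frmode


def in_rejected_area(var, frmode, max_var, min_frmode):
    """True if the point lies in the shaded area: var > maxVar and frmode < minFrmode."""
    return var > max_var and frmode < min_frmode


def filter_out_indel(stats_by_aligner, max_var, min_frmode):
    """Filter the indel if the stats from at least one aligner are rejected.

    stats_by_aligner maps an aligner name to its (var, frmode) pair.
    """
    return any(
        in_rejected_area(var, frmode, max_var, min_frmode)
        for var, frmode in stats_by_aligner.values()
    )
```

By construction, no training TP can fall in the rejected area, because each one satisfies var <= maxVar and frmode >= minFrmode. A candidate indel is dropped as soon as the alignment from any one aligner places it in that area, which mirrors how the five test indels are handled in the caption.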
Figure 3. An example of VARIFI output plots. (a) Variant distribution across genes. In this example, the exonic function was not available (“N/A”) for most variants because they were found in intronic and UTR3 regions. The highest number of variants was found in the PIK3CA and KDR genes, with three variants in each gene. Five variants were synonymous single-nucleotide variants (SNVs), two were nonsynonymous SNVs and one was a frameshift insertion. (b) Distribution of variants by confidence score. Nine variants had a confidence score of 3, seven had a confidence score of 6 and two had a confidence score of 2.