Literature DB >> 25170027

RVboost: RNA-seq variants prioritization using a boosting method.

Chen Wang¹, Jaime I Davila¹, Saurabh Baheti¹, Aditya V Bhagwate¹, Xue Wang¹, Jean-Pierre A Kocher¹, Susan L Slager¹, Andrew L Feldman¹, Anne J Novak¹, James R Cerhan¹, E Aubrey Thompson¹, Yan W Asmann¹.

Abstract

MOTIVATION: RNA-seq has become the method of choice to quantify genes and exons, discover novel transcripts and detect fusion genes. However, reliable variant identification from RNA-seq data remains challenging because of the complexities of the transcriptome, the challenges of accurately mapping exon boundary spanning reads and the bias introduced during the sequencing library preparation.
METHOD: We developed RVboost, a novel method specific for RNA variant prioritization. RVboost uses several attributes unique in the process of RNA library preparation, sequencing and RNA-seq data analyses. It uses a boosting method to train a model of 'good quality' variants using common variants from HapMap, and prioritizes and calls the RNA variants based on the trained model. We packaged RVboost in a comprehensive workflow, which integrates tools of variant calling, annotation and filtering.
RESULTS: RVboost consistently outperforms the variant quality score recalibration from the Genome Analysis Tool Kit and the RNA-seq variant-calling pipeline SNPiR in 12 RNA-seq samples using ground-truth variants from paired exome sequencing data. Several RNA-seq-specific attributes were identified as critical to differentiate true and false variants, including the distance of the variant positions to exon boundaries, and the percent of the reads supporting the variant in the first six base pairs. The latter identifies false variants introduced by the random hexamer priming during the library construction.
AVAILABILITY AND IMPLEMENTATION: The RVboost package is implemented to readily run in Mac or Linux environments. The software and user manual are available at http://bioinformaticstools.mayo.edu/research/rvboost/.

Entities: CellLine Chemical Disease Species

Mesh：

Year: 2014 PMID： 25170027 PMCID： PMC4296157 DOI： 10.1093/bioinformatics/btu577

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 INTRODUCTION

RNA-seq has become popular with the decreasing cost and its potential to quantify exon/transcript levels over a large dynamic range, discover novel transcripts, identify various splicing mechanisms and detect fusion genes (Costa ) (Asmann ). However, while the variant identification from DNA sequencing is becoming a routine practice, the variant detection from RNA-seq remains challenging because of the complexity of the transcriptome, the ambiguities in mapping exon boundary spanning reads and the artifacts introduced in RNA-seq library protocols (Piskol ). Because expressed genetic variants have more immediate impact on the protein function compared with the DNA variants, we were motivated to develop a reliable RNA-seq variant prioritization method. In general, variant detection from massive parallel sequencing data involves two steps. First is variant calling, which outputs all positions with any evidence of alternative alleles compared with reference. An essential next step is variant prioritization and filtering to obtain reliable variants of high confidence. For DNA sequencing data, the most widely used variant prioritization method is a mixture model-based classifier, variant quality score recalibration (VQSR), within the Genome Analysis Toolkit (GATK) (DePristo ). VQSR integrates multiple attributes/annotations of the variants, all of which are based on features of sequencing, including the depth of coverage, strand bias, mapping qualities and variant position bias toward the end of the reads. VQSR uses variants reported in HapMap as the training source to calculate a filtering criterion, and then predicts true ‘novel’ variants. Another method SNPiR proposes a series of arbitrary hard thresholds to filter and reduce the number of false variants (Piskol ). After careful examination of the RNA-seq variant detection process, we proposed to include RNA-specific attributes/annotations for the variant prioritization model in addition to the features included in GATK. Furthermore, we observed that the Gaussian mixture model and the parameter selection used in VQSR are not ideal for modeling these features and proposed to use a boosting method that uses generalized linear models as its base learners. This method, called the RNA Variant Boosting (RVboost), is a ranking machine to (a) train a model based on common variants in HapMap and (b) rank the variants accordingly. We implemented a comprehensive workflow using the framework of GATK that incorporates RVboost to facilitate reliable RNA-seq variant prioritization (Fig. 1). This workflow outputs called variants with detailed annotations in both the standard variant calling format (VCF) and the tab-delimited text format. We have shown that RVboost outperforms VQSR in 12 RNA-seq samples with paired exome sequencing data. We also introduced a key concept of ‘train-set quantile score’, or the train-Q score, to help users determine their preferred precision/recall trade-off.

Fig. 1.

The overall workflow of RVboost

2 FEATURES

2.1 Attributes selection

After testing, six attributes were included in RVboost. We kept three attributes from GATK’s Unified Genotyper, which are routinely used in VQSR: (i) Quality score over depth, (quality by depth, QD); (ii) Positional bias (ReadPosRankSum); and (iii) Fisher’s exact test-based Strand bias (FS). In addition, we added three novel attributes that are specific and unique for RNA-seq: (i) the percent of variant-supporting reads with variant positions in the first six bases of the reads (PctExtPos). This is to model the false variants introduced during the random hexamer priming of the cDNA synthesis step during the RNA-seq library protocol. The mismatches allowed between the hexamer primers and the RNA templates resulted in substantial amount of false variants (Fig. 1 of the Supplementary Material); (ii) distance to the exon–intron boundary (DJ); and (iii) the uniqueness of the read mapping in the genome and transcriptome (ED). More details are available in Supplementary Section 1.1.

2.2 Input, output and major modules

RVboost takes an aligned RNA-seq BAM file [e.g. the BAM file generated by TopHat (Trapnell )] and processes it through three major components (Fig. 1): (i) Unified Genotyper from GATK for raw variant calling in the target region and generation of the annotations including all GATK classic annotations and the three novel attributes described above in Section 2.1; (ii) annotation of each variant with additional attributes, including all functional annotations from SnpEff (Cingolani ), and whether the variant position is a known RNA-editing site according to a RNA-editing database (Ramaswami and Li, 2014); (iii) Variant prioritization and ranking using RVboost. This module includes two components: a novel boosting method to train the variant classifying model using high-confidence variants (e.g. common variants in HapMap as GATK recommended); and ranking of the likely true variants using a confidence score (details in the method section below). The output is a VCF file of all variants, with full annotations. Users can also generate a text file with selected attributes.

2.3 Description of the variant prioritization methods

With the selection of six attributes, we formulated the variant prioritization as a ranking problem where only likely true variants (e.g. common variants from HapMap) are used for model training. We describe it as a mathematical process to find a good F(.), which outputs ranking score y from data X with minimum error defined by a loss function L(.): where elements of y is 1 or 0 to indicate likely true or false variants, respectively, X is a variant by attribute matrix used to rank true variants and ‘arg min’ stands for the argument of the minimum error of the loss function. Different from mixture model-based VQSR, in which the construction of F(.) requires explicitly the number of Gaussian kernels and the percent of worst variants that are used as negative sets, we proposed to use a more flexible boosting method to rank variants. Boosting methods construct such a function F(X) by additive combinations of M ‘base’ learners h(X) (e.g. linear regression models) , with corresponding combination coefficients and parameters of m-th learner (Friedman, 2001; Bühlmann and Hothorn, 2007). By leveraging implemented boosting methods in R package ‘gbm’, we chose three boosting options for which response variable values range from 0 to 1: ‘adaboost’ with AdaBoost exponential loss function, ‘bernoulli’ with logistic regression loss function, and ‘huberized’ with modified Huber loss function (Ridgeway, 2005). It is often found that these three distribution options lead to similar performance and converge well before 20 000 iterations. Hence, we chose AdaBoost model and 20 000 iterations as default settings for RVboost.

2.4 Expected recall rate and train-set quantile score

After training, the user needs to choose a prioritization score as the threshold to call ‘true variants’. In practice, it is often difficult to interpret a prioritization score derived from a complex computational model and its implications for precision/recall trade-off. To address this problem, we explicitly defined a monotonic transformation independent of ranking methods, which depends on training set of likely true variants where score[j] is score generated by method from high to low indicating the likelihood of the j-th variant to be a true variant. is the empirical cumulative density function learnt from the training dataset. The train-Q score is intuitive to users, as it directly uses the expected recall rate from the provided training dataset. For example, a cutoff of the train-Q scores of 20% means that using this cutoff, 80% of the variants within the training set will be retained, i.e. we have a 80% expected recall. Through our comparison studies, we suggest a moderate expected recall rate, e.g. 90 or 95%, instead of an aggressive 99%, which is the recommended default by VQSR.

2.5 Comparison studies

We compared specificity and sensitivity of RVboost to VQSR (GATK version 1.6.9) and SNPiR (Piskol ), using the concordance between RNA and DNA variants from eight follicular lymphoma tumor samples and four replicates of MCF-7 cell lines (details described in Supplementary Material 1.4). To make unbiased comparisons, we evaluated the recall/precision on a subset of novel variants that (i) are not in the positive training set and (ii) have at least 10-fold coverage in both RNA-seq and exome-seq. We regarded the genotype calls from exome-seq as the ground truth and computed precision/recall accordingly, under the assumption that RNA-editing sites are a small percentage of RNA-seq variants (Piskol ). Overall, RVboost consistently outperforms both VQSR and SNPiR in all the tested samples in terms of AUC (Area Under the Curve) of precision/recall curves, and demonstrates superior precision in low train-Q score cutoffs, or equivalently, with high expected recall rates (details in Supplementary Material 2.2). We also investigated the contribution of individual attribute to distinguish true versus false variants, suggesting that the percent of reads supporting the variants in the first six base pairs and QD are the most informative features (details in Supplementary Material 2.3).

3 CONCLUSIONS

We developed RVboost, a software package designed to reliably prioritize and call variants from RNA-seq data. The output of our workflow provides comprehensive annotations to facilitate biological understanding. Variant prioritization is based on a proposed boosting method, which not only outperforms two other methods (SNPiR and VQSR) in overall performance, but also provides great flexibility to users for adjusting of the precision/recall trade-off, and it is superior to ad hoc hard-threshold approaches, such as SNPiR. The major modules are wrapped as a comprehensive package. Funding: Support for this work was provided by gift from Everett and Jane Hauck to the Center for Individualized Medicine at Mayo Clinic Jacksonville Florida, funds from the 26.2 with Donna Foundation and the proceeds of the National Marathon to Fight Breast Cancer and NIH P50 CA097274. Conflict of interest: none declared.

8 in total

Review 1. RNA-Seq and human complex diseases: recent accomplishments and future perspectives.

Authors: Valerio Costa; Marianna Aprile; Roberta Esposito; Alfredo Ciccodicola
Journal: Eur J Hum Genet Date: 2012-06-27 Impact factor: 4.246

2. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3.

Authors: Pablo Cingolani; Adrian Platts; Le Lily Wang; Melissa Coon; Tung Nguyen; Luan Wang; Susan J Land; Xiangyi Lu; Douglas M Ruden
Journal: Fly (Austin) Date: 2012 Apr-Jun Impact factor: 2.160

3. Reliable identification of genomic variants from RNA-seq data.

Authors: Robert Piskol; Gokul Ramaswami; Jin Billy Li
Journal: Am J Hum Genet Date: 2013-09-26 Impact factor: 11.025

4. Lack of evidence for existence of noncanonical RNA editing.

Authors: Robert Piskol; Zhiyu Peng; Jun Wang; Jin Billy Li
Journal: Nat Biotechnol Date: 2013-01 Impact factor: 54.908

5. A framework for variation discovery and genotyping using next-generation DNA sequencing data.

Authors: Mark A DePristo; Eric Banks; Ryan Poplin; Kiran V Garimella; Jared R Maguire; Christopher Hartl; Anthony A Philippakis; Guillermo del Angel; Manuel A Rivas; Matt Hanna; Aaron McKenna; Tim J Fennell; Andrew M Kernytsky; Andrey Y Sivachenko; Kristian Cibulskis; Stacey B Gabriel; David Altshuler; Mark J Daly
Journal: Nat Genet Date: 2011-04-10 Impact factor: 38.330

6. A novel bioinformatics pipeline for identification and characterization of fusion transcripts in breast cancer and normal cell lines.

Authors: Yan W Asmann; Asif Hossain; Brian M Necela; Sumit Middha; Krishna R Kalari; Zhifu Sun; High-Seng Chai; David W Williamson; Derek Radisky; Gary P Schroth; Jean-Pierre A Kocher; Edith A Perez; E Aubrey Thompson
Journal: Nucleic Acids Res Date: 2011-05-27 Impact factor: 16.971

7. TopHat: discovering splice junctions with RNA-Seq.

Authors: Cole Trapnell; Lior Pachter; Steven L Salzberg
Journal: Bioinformatics Date: 2009-03-16 Impact factor: 6.937

8. RADAR: a rigorously annotated database of A-to-I RNA editing.

Authors: Gokul Ramaswami; Jin Billy Li
Journal: Nucleic Acids Res Date: 2013-10-25 Impact factor: 16.971

8 in total

17 in total

1. Cohort Profile: The Lymphoma Specialized Program of Research Excellence (SPORE) Molecular Epidemiology Resource (MER) Cohort Study.

Authors: James R Cerhan; Brian K Link; Thomas M Habermann; Matthew J Maurer; Andrew L Feldman; Sergei I Syrbu; Carrie A Thompson; Umar Farooq; Anne J Novak; Susan L Slager; Cristine Allmer; Julianne J Lunde; William R Macon; David J Inwards; Patrick B Johnston; Ivana N M Micallef; Grzegorz S Nowakowski; Stephen M Ansell; Neil E Kay; George J Weiner; Thomas E Witzig
Journal: Int J Epidemiol Date: 2017-12-01 Impact factor: 7.196

2. RNAIndel: discovering somatic coding indels from tumor RNA-Seq data.

Authors: Kohei Hagiwara; Liang Ding; Michael N Edmonson; Stephen V Rice; Scott Newman; John Easton; Juncheng Dai; Soheil Meshinchi; Rhonda E Ries; Michael Rusch; Jinghui Zhang
Journal: Bioinformatics Date: 2020-03-01 Impact factor: 6.937

3. VaDiR: an integrated approach to Variant Detection in RNA.

Authors: Lisa Neums; Seiji Suenaga; Peter Beyerlein; Sara Anders; Devin Koestler; Andrea Mariani; Jeremy Chien
Journal: Gigascience Date: 2018-02-01 Impact factor: 6.524

4. Changes in ion channel expression and function associated with cardiac arrhythmogenic remodeling by Sorbs2.

Authors: Ling-Ling Qian; Xiaojing Sun; Jingchun Yang; Xiao-Li Wang; Michael J Ackerman; Ru-Xing Wang; Xiaolei Xu; Hon-Chi Lee; Tong Lu
Journal: Biochim Biophys Acta Mol Basis Dis Date: 2021-09-04 Impact factor: 5.187

5. Cardiac micro-RNA and transcriptomic profile of a novel swine model of chronic kidney disease and left ventricular diastolic dysfunction.

Authors: Alejandro R Chade; Alfonso Eirin
Journal: Am J Physiol Heart Circ Physiol Date: 2022-08-26 Impact factor: 5.125

6. Bone Marrow Adiposity in Models of Radiation- and Aging-Related Bone Loss Is Dependent on Cellular Senescence.

Authors: Abhishek Chandra; Anthony B Lagnado; Joshua N Farr; Megan Schleusner; David G Monroe; Dominik Saul; João F Passos; Sundeep Khosla; Robert J Pignolo
Journal: J Bone Miner Res Date: 2022-03-29 Impact factor: 6.390

7. Whole-Transcriptome Analysis by RNA Sequencing for Genetic Diagnosis of Mendelian Skin Disorders in the Context of Consanguinity.

Authors: Leila Youssefian; Amir Hossein Saeidian; Fahimeh Palizban; Atefeh Bagherieh; Fahimeh Abdollahimajd; Soheila Sotoudeh; Nikoo Mozafari; Rahele A Farahani; Hamidreza Mahmoudi; Sadegh Babashah; Masoud Zabihi; Sirous Zeinali; Paolo Fortina; Julio C Salas-Alanis; Andrew P South; Hassan Vahidnezhad; Jouni Uitto
Journal: Clin Chem Date: 2021-06-01 Impact factor: 8.327

8. Impact of RNA degradation on fusion detection by RNA-seq.

Authors: Jaime I Davila; Numrah M Fadra; Xiaoke Wang; Amber M McDonald; Asha A Nair; Barbara R Crusan; Xianglin Wu; Joseph H Blommel; Jin Jen; Kandelaria M Rumilla; Robert B Jenkins; Umut Aypar; Eric W Klee; Benjamin R Kipp; Kevin C Halling
Journal: BMC Genomics Date: 2016-10-20 Impact factor: 3.969

9. Aberrant epigenetic regulation in clear cell sarcoma of the kidney featuring distinct DNA hypermethylation and EZH2 overexpression.

Authors: Jenny Karlsson; Anders Valind; Caroline Jansson; Maureen J O'Sullivan; Linda Holmquist Mengelbier; David Gisselsson
Journal: Oncotarget Date: 2016-03-08

10. Frequent POLE-driven hypermutation in ovarian endometrioid cancer revealed by mutational signatures in RNA sequencing.

Authors: Jaime I Davila; Pritha Chanana; Vivekananda Sarangi; Zachary C Fogarty; S John Weroha; Ruifeng Guo; Ellen L Goode; Yajue Huang; Chen Wang
Journal: BMC Med Genomics Date: 2021-06-22 Impact factor: 3.063