Amie J Radenbaugh1, Singer Ma1, Adam Ewing1, Joshua M Stuart1, Eric A Collisson2, Jingchun Zhu1, David Haussler3. 1. University of California Santa Cruz Genomics Institute, Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, California, United States of America. 2. Division of Hematology/Oncology, University of California San Francisco, San Francisco, California, United States of America. 3. University of California Santa Cruz Genomics Institute, Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, California, United States of America; Howard Hughes Medical Institute, Chevy Chase, Maryland, United States of America.
Abstract
The detection of somatic single nucleotide variants is a crucial component to the characterization of the cancer genome. Mutation calling algorithms thus far have focused on comparing the normal and tumor genomes from the same individual. In recent years, it has become routine for projects like The Cancer Genome Atlas (TCGA) to also sequence the tumor RNA. Here we present RADIA (RNA and DNA Integrated Analysis), a novel computational method combining the patient-matched normal and tumor DNA with the tumor RNA to detect somatic mutations. The inclusion of the RNA increases the power to detect somatic mutations, especially at low DNA allelic frequencies. By integrating an individual's DNA and RNA, we are able to detect mutations that would otherwise be missed by traditional algorithms that examine only the DNA. We demonstrate high sensitivity (84%) and very high precision (98% and 99%) for RADIA in patient data from endometrial carcinoma and lung adenocarcinoma from TCGA. Mutations with both high DNA and RNA read support have the highest validation rate of over 99%. We also introduce a simulation package that spikes in artificial mutations to patient data, rather than simulating sequencing data from a reference genome. We evaluate sensitivity on the simulation data and demonstrate our ability to rescue back mutations at low DNA allelic frequencies by including the RNA. Finally, we highlight mutations in important cancer genes that were rescued due to the incorporation of the RNA.
The detection of somatic single nucleotide variants is a crucial component to the characterization of the cancer genome. Mutation calling algorithms thus far have focused on comparing the normal and tumor genomes from the same individual. In recent years, it has become routine for projects like The Cancer Genome Atlas (TCGA) to also sequence the tumor RNA. Here we present RADIA (RNA and DNA Integrated Analysis), a novel computational method combining the patient-matched normal and tumor DNA with the tumor RNA to detect somatic mutations. The inclusion of the RNA increases the power to detect somatic mutations, especially at low DNA allelic frequencies. By integrating an individual's DNA and RNA, we are able to detect mutations that would otherwise be missed by traditional algorithms that examine only the DNA. We demonstrate high sensitivity (84%) and very high precision (98% and 99%) for RADIA in patient data from endometrial carcinoma and lung adenocarcinoma from TCGA. Mutations with both high DNA and RNA read support have the highest validation rate of over 99%. We also introduce a simulation package that spikes in artificial mutations to patient data, rather than simulating sequencing data from a reference genome. We evaluate sensitivity on the simulation data and demonstrate our ability to rescue back mutations at low DNA allelic frequencies by including the RNA. Finally, we highlight mutations in important cancer genes that were rescued due to the incorporation of the RNA.
Much of our current understanding of cancer has come from investigating how normal cells are transformed into cancerous cells through the stepwise acquisition of somatic genomic abnormalities. These events include point mutations, insertions and deletions (INDELs), chromosomal rearrangements, and changes to the copy number of segments of DNA. Transforming a normal human cell into a malignant, immortal cancer cell line requires an estimated five to seven genetic alterations in key genes and pathways [1], [2]. Not surprisingly, much research has been devoted to determining how cancer cells are able to acquire their abilities through the accumulation of somatic mutations.The Cancer Genome Atlas (TCGA) project has produced exome-wide data from thousands of tumors and patient-matched normal tissues. With the development of RNA Sequencing (RNA-Seq) [3], TCGA began providing an additional high-throughput tumor sequence dataset. These three datasets consisting of tumor and patient-matched normal DNA and tumor RNA have become a new standard in cancer genomics. RNA-Seq enables one to investigate the consequences of genomic changes in the RNA transcripts they encode to better characterize 1) germline variants, 2) somatic mutations, and 3) variants in the RNA that are not found in the DNA that could be the result of RNA editing [4].Over the next few years, many more whole-genome and exome-capture DNA and RNA-Seq BAM (the binary version of Sequence Alignment/Map [5]) files will become available. TCGA has collected over 10,000 tissue samples from more than 20 types of cancer. There is a clear need for an efficient method for the combined analysis of patient-matched tumor DNA, normal DNA, and tumor RNA. Here we present a method called RADIA to identify and characterize alterations in cancer using DNA and RNA obtained by high-throughput sequencing data.Somatic mutation calling is traditionally performed on patient-matched pairs of tumor and normal genomes/exomes [6]–[11]. The ability to accurately detect somatic mutations is hindered by both biological and technical artifacts that make it difficult to obtain both high sensitivity and high specificity. Different mutation calling algorithms often disagree about putative mutations in the same source data, and frequently have discernible systematic differences due to the trade-off between sensitivity and specificity [12]. This is especially true for somatic mutations with low variant allele frequencies (VAFs). By creating an algorithm that utilizes both DNA and RNA, we have increased the power to detect somatic mutations, especially at low variant allele frequencies.RADIA combines patient-matched tumor and normal DNA with the tumor RNA to detect somatic mutations. The DNA Only Method (DOM) (Figure 1) uses just the tumor/normal pairs of DNA (ignoring the RNA), while the Triple BAM Method (TBM) (Figure 1) uses all three datasets from the same patient to detect somatic mutations. The mutations from the TBM are further categorized into two sub-groups: RNA Confirmation and RNA Rescue mutations (Figure S1). RNA Confirmation mutations are those that are made by both the DOM and the TBM due to the strong variant read support in both the DNA and RNA. RNA Rescue mutations are those that had very little DNA support, hence not called by the DOM, but strong RNA support, and thus called by the TBM. RNA Rescue mutations are typically missed by traditional methods that only interrogate the DNA.
Figure 1
Overview of the RADIA work-flow for identifying somatic mutations.
The normal DNA, tumor DNA, and tumor RNA BAMs are processed in parallel and initial low-level variants are identified. The variants are filtered by the DNA Only Method using the pairs of normal and tumor DNA and by the Triple BAM Method using all three datasets. The mutations from the two methods are merged and output in VCF format.
Overview of the RADIA work-flow for identifying somatic mutations.
The normal DNA, tumor DNA, and tumor RNA BAMs are processed in parallel and initial low-level variants are identified. The variants are filtered by the DNA Only Method using the pairs of normal and tumor DNA and by the Triple BAM Method using all three datasets. The mutations from the two methods are merged and output in VCF format.We have applied RADIA to data derived from over 3,300 patients representing 15 different cancer types from TCGA (Table S1). Overall, the RNA Rescue mutations that are made possible by the incorporation of the RNA-Seq data provide a two to seven percent increase in somatic mutations compared to the DOM (Table S1). Many of these mutations were new discoveries that were not previously found by other mutation calling algorithms in TCGA. Of these new discoveries, some mutations were found in well-known cancer genes that were heavily mutated in a specific cohort. We also find mutations in new samples where the same gene has already been identified as harboring mutations in other samples from the cohort. When these RNA Rescue mutations are added to the DNA Only mutations, these genes achieve a statistically significant overall mutation rate for the cohort.Here we specifically focus on results from 177 endometrial carcinoma [13] and 230 lung adenocarcinoma [14] patients from TCGA. To demonstrate the increase in sensitivity from including the tumor RNA-Seq dataset, we simulated mutations by spiking them into the tumor DNA and tumor RNA of a breast cancerpatient using bamsurgeon (https://github.com/adamewing/bamsurgeon). We also evaluated sensitivity and precision on the endometrial carcinoma and lung adenocarcinoma data using validation data that was generated by TCGA. We highlight RNA Rescue mutations found by the TBM in tumor suppressor genes such as TP53, STK11, and CDKN2A in lung adenocarcinoma.
Methods
RADIA operates on two or more BAM files, producing somatic mutation calls through a series of steps outlined in Figure 1. Each step in this process is described in detail, beginning with the initial selection of sites for further processing and ending with a description of filters used to eliminate false positives while maintaining true positives.
2.1 Variant Detection with RADIA
RADIA is typically run on three BAM [5] files consisting of a pair of patient-matched tumor and normal genomes and a tumor transcriptome and outputs germline (inherited) variants and somatic Single Nucleotide Variants (SNVs). Here we focus specifically on the detection of somatic SNVs with RADIA. The DOM is run on the pairs of tumor and matched-normal DNA while the TBM is applied to the DNA and RNA triplets. After the DOM and TBM specific filters, the results are merged and run through a final read support filter (Figure 1). If RNA-Seq data is not available, RADIA can utilize paired tumor and normal DNA genomes using the DOM to detect germline variants and somatic SNVs.Internally, RADIA uses the samtools [5] mpileup command (version 0.1.18) to examine the pileups of bases in each sample in parallel. A heuristic algorithm determines the existence and type of variant at any given position based on the user-configurable minimum thresholds for overall depth, variant depth, Base Alignment Quality (BAQ) [15], and mapping quality. Initially, RADIA requires a minimum overall depth of four bases, minimum variant depth of two bases, minimum phred BAQ of 10, and minimum phred mapping quality of 10. These initial calls are lenient in coverage and provide a good baseline set of calls for further filtering.RADIA scans pileups of reads across the reference genome and outputs variants in Variant Call Format (VCF) (https://github.com/samtools/hts-specs). For each position, summary information such as the overall depth, allele specific depth and frequency, average BAQ base quality, average mapping quality, and the fraction of reads on the plus strand are calculated for both the DNA and RNA. All of this information is used during the filtering process.
2.2 Variant Filtering
After the initial variants are detected, a number of filters are applied to remove false positive variants that result from biological and technical artifacts. Each filter is described here in detail.
2.2.1 Filtering Around INDELs
Many current mutation calling algorithms have a pre-processing step to account for misaligned reads around INDELs. This realignment step is computationally expensive and relies on accurately predicting the location of INDELs which itself is not a trivial problem. Base Alignment Quality (BAQ) is an alternative option for dealing with alignment ambiguity around INDELs. It calculates the probability that a base has been misaligned and returns the minimum of the original base quality and the base alignment quality. BAQ is run by default when executing a samtools mpileup command and has been shown to improve SNP calling accuracy [15]. We use the extended version of BAQ (option –E) that is activated by default in the latest version of samtools (0.1.19) for increased sensitivity and slightly lower specificity [5].
2.2.2 1000 Genomes Blacklist Filter
The 1000 Genomes Project coined the term “accessible genome” to be the part of the reference genome that is reliable for accurate variant calling after removing ambiguous or highly repetitive regions [16]. Since the reference genome is incomplete, repetitive in places, and does not represent human genetic variation comprehensively, reads often get mapped incorrectly in locations outside the accessible genome (inaccessible sites), leading to false positive variant calls. Over 97% of inaccessible sites are due to high copy repeats or segmental duplications. In the pilot, the 1000 Genomes Project determined that 85% of the reference sequence and 93% of the coding region was accessible. Due to longer read lengths (75–100 bp) and improvements to both paired end protocols and sequence alignment algorithms, the accessible genome increased in Phase I to 94% of the reference and 98% of the coding region [17]. We filter variants that are not in the accessible genome using the Phase I mapping quality and depth blacklists (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase1/analysis_results/supporting/accessible_genome_masks/).
2.2.3 Strand-Bias Filter
It has recently been shown that variant allele reads that occur exclusively on one strand are largely associated with false positives [8]. In order to account for this technical artifact, we filter based on the variant allele strand bias. If we have at least four total reads supporting the variant allele, then we apply the strand bias filter if more than 90% of the reads are on the forward strand or more than 90% are on the reverse strand.
2.2.4 Filtering by mpileup Support
RADIA can be executed on patient-matched pairs of tumor and normal DNA samples using the DOM to identify germline variants and somatic mutations. We first compare the matched normal DNA to the human reference genome. We require the normal DNA to pass the mpileup support filters described in Table 1 for all germline variants.
Table 1
DNA Only Method mpileup Support Filters.
Filter
Germline
Somatic
Normal DNA
Normal DNA
Tumor DNA
Min Total Depth
10
10
10
Min Alt. Depth
4
NA
4
Min Alt. Percent
10%
NA
10%
Min Avg. Alt. BAQ
20
NA
20
Max Alt. Strand Bias
90%
NA
90%
Max Alt. Percent
NA
2%
NA
Max Other Percent
2%
2%
2%
The germline variants and somatic mutations from the DOM are filtered according to the parameters described here. The minimum average alternative read BAQ filter uses the phred scale. The maximum other percent restricts the percentage of reads that are allowed to support an additional alternative allele.
The germline variants and somatic mutations from the DOM are filtered according to the parameters described here. The minimum average alternative read BAQ filter uses the phred scale. The maximum other percent restricts the percentage of reads that are allowed to support an additional alternative allele.If no germline variant is found, we compare the tumor DNA to the matched normal DNA and the reference genome to search for somatic mutations. We require the normal DNA and tumor DNA to pass the mpileup support filters shown in Table 1 for all somatic variants. To ensure that we have the power to detect a possible germline variant at this site, we require that the germline DNA depth is 10 or more.We use the Triple BAM Method to augment our somatic mutation calls using both the pairs of DNA and the RNA-Seq data. The normal DNA, tumor DNA, and tumor RNA must pass the mpileup support filters shown in Table 2 for all somatic mutations. We require at least one read with a minimum BAQ phred score of 15 in the tumor DNA. To rule out possible germline variants, we again require that the normal DNA depth is 10 or more. In addition, we filter out calls that overlap with common SNPs that are not flagged as clinically relevant and found in at least one percent of the samples in dbSNP [18]. We downloaded this subset of dbSNP from the “Common SNPs” track on the UCSC human genome browser [19], [20]. We found that many false positive variants overlapped with earlier versions of dbSNP. These variants were due to technical artifacts and were removed from subsequent versions of dbSNP [21]. Therefore, we filter out all variants that overlap with dbSNP versions 130, 132 or 135 (ftp://ftp.ncbi.nih.gov/snp/). The TBM calls are subjected to further filtering procedures as shown in Figure 1 and described below.
Table 2
Triple BAM mpileup Support Filters.
Filter
Somatic
Normal DNA
Tumor DNA
Tumor RNA
Min Total Depth
10
1
10
Min Alt. Depth
NA
1
4
Min Alt. Percent
NA
NA
10%
Min Avg. Alt. BAQ
NA
15
15
Max Alt. Strand Bias
NA
90%
90%
Max Alt. Percent
10%
NA
NA
Max Other Percent
10%
10%
2%
The somatic mutations from the TBM are filtered according to the parameters shown here.
The somatic mutations from the TBM are filtered according to the parameters shown here.
2.2.5 Pseudogene Filter
We noticed that many of our TBM mutations overlapped with predicted pseudogenes. Although expressed pseudogenes have recently been reported to be significant contributors to the transcriptional landscape and shown to play a role in cancer progression [22], mutations that overlap with predicted pseudogenes have a high false positive rate. Sequence similarity of pseudogene copies to their parent genes leads to uncertainty in alignment within these regions. Because of these technical artifacts, we remove TBM mutations that overlap with pseudogenes annotated in GENCODE by the ENCODE project (version 19) [23] and predicted by RetroFinder (version 5) [23], [24]. We downloaded the pseudogene annotations from the following tracks on the UCSC human genome browser [19], [25]: Gene Annotations from ENCODE/GENCODE and Retroposed Genes. The predicted pseudogenes occupy 1.5% of the total genome.
2.2.6 Highly Variable Genes Filter
We remove TBM mutations that overlap with families of genes that have high sequence similarity. Some examples of these gene families are Human Leukocyte Antigens (HLAs), Ribosomal Proteins (RPLs), and immunoglobulins. While mutations in these genes may exist, special processing would be needed to distinguish them from false positive calls due to misaligned reads. We annotate the mutations using SnpEff [26] and filter out the following five gene families: RPLs, RP11s, HLAs, IGHVs and IGHCs.
2.2.7 Positional Bias Filter
False positive calls are associated with misaligned reads where the alternative allele is consistently within a certain distance from the start or end of the read. The positional bias filter is applied when 95% or more of the reads that have an alternative allele are such that the alternate allele falls in the first third or last third of the read.
2.2.8 BLAT Filter
We observed multiple instances where RNA-Seq reads appeared to be incorrectly mapped due to the added difficulties in aligning RNA-Seq data, such as dealing with hard to identify splice junctions and multiple gene isoforms. To guarantee that the RNA-Seq reads that support a variant do not map better to another location in the genome, we created a BLAT filter. All of the RNA-Seq reads that support a variant are extracted from the BAM file and aligned to the human genome using BLAT [27]. If the read maps to another location with a better score, the read is rejected. After using BLAT on each read, we again require that there are at least four valid reads that support the variant and that 10% or more of the reads support the variant.
2.2.9 Read Support Filter
We merge the calls from the DOM and the TBM and apply one final filter. We require that each somatic mutation be supported by at least four “perfect” reads. We define a perfect read as follows:Minimum mapping quality of read is 10Minimum base quality of alternative allele in read is 10Minimum base qualities of the five bases up- and down-stream of the alternative allele are 10Read is properly pairedRead has fewer than four mismatches across its entirety when compared to the referenceRead does not require an insertion or deletion to be mappedAfter determining the number of perfect reads that support the reference and the alternative at a coordinate, we re-apply the strand bias filter to guarantee that no more than 90% of the total perfect reads are from one strand.
Results
We evaluate the sensitivity of RADIA using simulation data that was generated from patient data. We also measure the sensitivity and precision of RADIA using patient and validation data generated by TCGA. All patients in this study provided written informed consent to genomic studies in accordance with local Institutional Review Boards (Table S2) and the policies and guidelines outlined by the Ethics, Law and Policy Group from TCGA. All patient data is anonymous and was originally collected for routine therapeutic purposes.
3.1 Sensitivity on Simulation Data
In order to evaluate sensitivity and demonstrate the increase in power from including the RNA-Seq data, we simulated somatic mutations starting from patient data. We spiked mutations into a pair of breast cancer tumor DNA and tumor RNA samples using bamsurgeon (https://github.com/adamewing/bamsurgeon), a tool we developed to generate simulation data that closely mimics actual experimental data from high-throughput sequencing datasets. Bamsurgeon first determines the loci that have an appropriate DNA and RNA depth to spike in mutations. It then extracts the reads at the loci, adjusts the VAF according to the user-defined VAF distribution, and then re-maps the reads (Figure S2). This simulation strategy is more sophisticated than simply generating simulated reads from a reference genome, as it retains the biological and technical artifacts that are inherently present in next generation sequencing data. We performed two spike-in experiments: one varying the DNA VAF while holding the RNA VAF constant, and one varying the RNA VAF while holding the DNA VAF to 10% or less.
3.1.1 Sensitivity on Variable DNA-Constant RNA Simulation Data
To evaluate the sensitivity of RADIA, we spiked in 1,594 mutations to the tumor DNA sequence with a variant allele frequency ranging from 1–50% and to the tumor RNA sequence at a constant frequency of 25%. The overall sensitivity rate averaged across all VAFs is 85% consisting of 1,351 out of 1,594 spiked in mutations (Figure 2A). Of the 243 calls that were filtered out, over 50% are removed because they failed to meet the minimum variant allele frequency, more than 20% land in blacklist regions that the method ignores, and nearly 20% are discarded due to the BLAT filter. The number of mutations that are rejected by the full list of filters can be found in Figure S3.
Figure 2
Sensitivity of RADIA on simulation data.
Artificial mutations were spiked into the tumor DNA and RNA BAM files of a breast cancer patient using bamsurgeon. (A) Mutations were spiked into the DNA at variant allele frequencies distributed from 1–50% and into the RNA at a constant 25%. The overall sensitivity of RADIA was 85%. RNA Rescue calls from the Triple BAM method detected the mutations that had a DNA VAF less than 10%. (B) Mutations were spiked into the DNA at 10% or less and into the RNA distributed from 1–50%. Most of the DOM mutations are filtered due to the low DNA allelic frequency. The mutations that have adequate RNA read support are rescued back at these low DNA allelic frequencies.
Sensitivity of RADIA on simulation data.
Artificial mutations were spiked into the tumor DNA and RNA BAM files of a breast cancerpatient using bamsurgeon. (A) Mutations were spiked into the DNA at variant allele frequencies distributed from 1–50% and into the RNA at a constant 25%. The overall sensitivity of RADIA was 85%. RNA Rescue calls from the Triple BAM method detected the mutations that had a DNA VAF less than 10%. (B) Mutations were spiked into the DNA at 10% or less and into the RNA distributed from 1–50%. Most of the DOM mutations are filtered due to the low DNA allelic frequency. The mutations that have adequate RNA read support are rescued back at these low DNA allelic frequencies.
3.1.2 Sensitivity on Low Frequency DNA-Variable RNA Simulation Data
To demonstrate the ability of the TBM to rescue calls at low DNA VAFs, we spiked in 1,761 mutations to the tumor RNA sequence with a variant allele frequency ranging from 1–50% and to the tumor DNA sequence at a frequency of 10% or less. Most of the mutations by the DOM are filtered out due to the low allelic frequency in the DNA (Figure S4). For the mutations that have sufficient read support in the RNA, these low DNA VAFs are rescued back (Figure 2B).
3.2 Precision and Sensitivity on Patient Data
We made somatic mutation calls on 177 non-hypermutated TCGA endometrial carcinoma samples [13]. All 177 tumor and matched normal whole exome sequencing and RNA-Seq alignments in BAM [5] format were downloaded from TCGA at the Cancer Genomics Hub (CGHub, https://cghub.ucsc.edu, Table S2). The exomes were sequenced using the Illumina Genome Analyzer II, and the paired-end sequencing reads were aligned by BWA [28]. The RNA was sequenced using the Illumina Genome Analyzer II, and the single-end sequencing reads were aligned by MapSplice (V2) [29].
3.2.1 RADIA Precision on Endometrial Carcinoma Patient Data
For the study on endometrial carcinoma by TCGA [13], mutations were submitted by three independent TCGA Genomic Data Analysis Centers (GDACs). These mutations were merged and targeted for custom recapture and resequencing using new cDNA libraries from the tumor and normal DNA samples [13]. We downloaded the validation BAMs containing the results of the hybrid capture and resequencing of targeted mutations from CGHub (https://cghub.ucsc.edu, Table S2). We utilized the identical validation criteria used by the TCGA Endometrial Analysis Working Group to validate the somatic mutations detected by RADIA [13]. For each somatic mutation, we queried the patient-matched tumor and normal validation data. We required at least 10 reads in both the tumor and normal data in order to determine if a call validated, otherwise we classified it as ambiguous. If the variant was present at low levels in both datasets, we also classified it as ambiguous. Otherwise, we determined whether a mutation validated as germline/LOH, somatic, or neither according to Table 3. In addition, any RNA Rescue call in the “Not Validated” group that overlapped with a COSMIC somatic mutation that was confirmed in another study was considered as validated.
Table 3
Validation Criteria for Endometrial Carcinoma Data.
Normal VAF
Tumor VAF
0%
<8%
≥8%, <20%
≥20%
= 0%
Not Validated
Somatic Low
Somatic Med
Somatic High
<3%
Not Validated
Ambiguous
Somatic Med
Somatic High
≥3%
Germline/LOH
Germline/LOH
Germline/LOH
Germline/LOH
Validation BAMs were used to determine the validation status for somatic mutations as shown here. A mutation is considered validated in the Somatic Low, Med, or High groups (bold), not validated in the “Not Validated” (italics) and Germline/LOH groups (italics), and Ambiguous when there was low read depth (<10 reads) or low VAFs in both the normal (<3%) and tumor (<8%) validation BAMs.
Validation BAMs were used to determine the validation status for somatic mutations as shown here. A mutation is considered validated in the Somatic Low, Med, or High groups (bold), not validated in the “Not Validated” (italics) and Germline/LOH groups (italics), and Ambiguous when there was low read depth (<10 reads) or low VAFs in both the normal (<3%) and tumor (<8%) validation BAMs.We made a total of 27,900 somatic mutation calls over 177 endometrial samples, of which the DOM and TBM made 27,390 and 6,325 calls respectively. Of the 6,325 TBM calls, there were 5,815 RNA Confirmation mutations that were made by both the DOM and TBM signifying high DNA and RNA support, and importantly, a total of 510 RNA Rescue mutations that were missed by the DOM.Using the validation strategy described above, we demonstrate that the overall precision for RADIA is 98% (Figure 3A). Due to lack of coverage or uncertainty in the tumor and normal validation BAMs, a total of 1,825 calls were considered to be ambiguous. Of the remaining 26,075 mutations called by RADIA, 25,520 validated as somatic, 271 validated as germline/LOH variants and 284 did not validate. The precision of calls made by the DOM and the TBM was 98% and 98.5% respectively. For the RNA Confirmation mutations made by both the DOM and the TBM, the precision was 99.3%. There were 510 RNA Rescue mutations made only by the TBM, and even though most of these calls were not targeted for validation, the precision was 74%. For the 510 RNA Rescue calls, 251 were classified as ambiguous, 6 validated as Germline/LOH, and 61 did not validate. Of the remaining 192 RNA Rescue mutations that validated, 178 (93%) were verified using the validation BAMs and 14 (7%) were confirmed as somatic mutations in COSMIC.
Figure 3
Precision and sensitivity of RADIA on 177 non-hypermutated endometrial carcinoma samples.
Mutations are considered validated in the Somatic Low, Med, or High groups (blue), not validated in the “Not Validated” (green) and Germline/LOH (red) groups, and Ambiguous (orange) when there was low read depth (<10 reads) or ambiguity in the validation data. (A) An overall precision of 98% was demonstrated. RNA Confirmation mutations with strong DNA and RNA support validated over 99%. RNA Rescue mutations validated at 74%. (B) The union of all mutations submitted by TCGA GDACs that validated as somatic was considered as the truth set. RADIA demonstrated an overall sensitivity rate of 84%. Of the mutations that were missed, 33% occurred at low variant allele frequencies (<8%) and 23% occurred in blacklist regions that were ignored.
Precision and sensitivity of RADIA on 177 non-hypermutated endometrial carcinoma samples.
Mutations are considered validated in the Somatic Low, Med, or High groups (blue), not validated in the “Not Validated” (green) and Germline/LOH (red) groups, and Ambiguous (orange) when there was low read depth (<10 reads) or ambiguity in the validation data. (A) An overall precision of 98% was demonstrated. RNA Confirmation mutations with strong DNA and RNA support validated over 99%. RNA Rescue mutations validated at 74%. (B) The union of all mutations submitted by TCGA GDACs that validated as somatic was considered as the truth set. RADIA demonstrated an overall sensitivity rate of 84%. Of the mutations that were missed, 33% occurred at low variant allele frequencies (<8%) and 23% occurred in blacklist regions that were ignored.We next examined the precision of the DOM with varying RNA-Seq reads supporting the variant allele as well as the precision of RNA Rescue mutations with differing levels of DNA supporting reads. Sixty-two percent of the DOM mutations were covered by reads in the RNA-Seq data, and 29% had at least 10 RNA-Seq reads covering the mutation. Nearly half (44%) had at least one RNA read supporting the DNA variant allele, while 25% of the DOM mutations had at least four supporting RNA reads. The precision of the DOM is lowest (92%) with no RNA-Seq support, increases to 95% with weak RNA-Seq support (at least one but less than five supporting reads), and increases to 99.3% for RNA Confirmation mutations. Overall, mutations that are detected by the DOM validate above 92%, regardless of the RNA-Seq support, and the precision increases as the RNA-Seq support increases.On the other hand, RNA Rescue mutations weakly supported by the DNA validate at low levels. For RNA Rescue mutations, we require at least one variant supporting read in the DNA in order to distinguish between RNA Rescue mutations and possible RNA editing events. The precision of RNA Rescue mutations with only one read supporting the variant in the DNA was 11%, with two supporting reads in the DNA 23%, with three supporting reads in the DNA 43%, and with four or more supporting reads in the DNA 94%.
3.2.2 RADIA Sensitivity on Endometrial Carcinoma Patient Data
In order to measure the sensitivity of RADIA, we considered the union of all mutations submitted by TCGA GDACs that validated as somatic as our truth set. There were 30,239 mutations that validated as somatic from TCGA. We compared our somatic mutations to this truth set and demonstrated an overall sensitivity of 84% (Figure 3B, Figure S5). Of the 4,751 calls that were missed, 1,539 (33%) were filtered by RADIA because they had a variant allele frequency less than 8% (Figure S6). In addition, 1,072 (23%) landed in blacklist regions that were not considered (Figure S6).
3.2.3 RADIA Precision on Lung Adenocarcinoma Patient Data
Finally, RADIA somatic mutations were analyzed during the course of our participation in the TCGA Lung Adenocarcinoma Analysis Working Group [14]. We ran RADIA on 230 TCGA lung adenocarcinoma triplets that we downloaded from CGHub (https://cghub.ucsc.edu, Table S2). The exomes were sequenced using the Illumina HiSeq platform, and the paired-end sequencing reads were aligned by BWA [28]. The RNA was sequenced using the Illumina HiSeq platform, and the paired-end sequencing reads were aligned by MapSplice (V2) [29]. Validation was performed by the Broad Institute on 74 genes of interest along with an additional 1,150 somatic SNVs. Validation was attempted on 2,404 RADIA somatic mutations and 2,395 (99.63%) were verified. From the DOM, 2,336 of the 2,345 mutations (99.62%) validated. Importantly, 469/469 (100%) of the TBM mutations consisting of 410 RNA Confirmation and 59 RNA Rescue mutations validated.
3.3 Somatic Mutations in Specific Lung Adenocarcinoma Genes
Mutations in the tumor suppressor gene TP53 are common in the majority of humancancers. Most of the mutations occur in the DNA-Binding Domain (DBD) and are considered change-of-function mutations that alter activity of TP53, sometimes acting in a dominant negative manner to sequester wildtype tp53 protein in trans
[30]. As such, many p53 mutant proteins endow cells with oncogenic characteristics by promoting cell proliferation, survival, and metastasis [31].We ran RADIA on 230 TCGA lung adenocarcinoma triplets [14] and discovered two non-synonymous TP53 mutations that were below the detection threshold for other mutation calling algorithms used by TCGA (Table 4). Both of the mutations were validated by the deep-sequencing validation data and confirmed as somatic in COSMIC by other studies. One of the mutations (G266E) was confirmed as somatic in another lung cancer study [32] as well as in prostate [33], pancreas [34], urinary tract [35], and hematopoietic and lymphoid [36] cancer studies. The G266E mutation occurs in the TP53 DBD mutation hotspot frequently resulting in pathological effects [37]–[39]. This mutation has also been described as a gain-of-function mutation in a melanoma cell line [40]. The other TP53 mutation (G199V) was confirmed as somatic in breast [41], ovarian [42], and medulloblastoma [43] studies. It is a known anti-apoptotic gain-of-function mutation that promotes cell survival through the signal transducer and activator of transcription-3 (STAT3) pathway [44]. Knockdown experiments of G199Vp53 mutants demonstrated a level of anti-tumor activity similar to high doses of chemotherapeutic agents, suggesting that inhibition of G199Vp53 mutants may be beneficial for cancer treatment [44].
Table 4
RNA Rescue Mutations in Lung Adenocarcinoma not Detected by Other Methods in TCGA.
Gene
Mutation
DNA VAF
RNA VAF
Validation DNA VAF
TP53
G266E
1/7 (13%)
6/10 (60%)
47/183 (26%)
TP53
G199V
4/64 (6%)
8/57 (14%)
17/380 (4%)
CDKN2A
R131H
3/45 (7%)
22/62 (35%)
9/149 (6%)
CDKN2A
R122*/R163*
2/16 (13%)
31/34 (91%)
20/92 (22%)
STK11
W239*
1/13 (7%)
20/40 (50%)
NA
These mutations were below the detection threshold for other mutation calling algorithms used by TCGA. The ratio of reads supporting the mutations along with the variant allele frequencies are shown for both the DNA and RNA. Validation was done on four of the mutations, and the resulting validation DNA variant allele frequencies are shown.
These mutations were below the detection threshold for other mutation calling algorithms used by TCGA. The ratio of reads supporting the mutations along with the variant allele frequencies are shown for both the DNA and RNA. Validation was done on four of the mutations, and the resulting validation DNA variant allele frequencies are shown.Additionally, we found mutations in other well-known tumor suppressor genes such as STK11 and CDKN2A. In the lung adenocarcinoma manuscript from TCGA, mutations in STK11 and CDKN2A were reported in 17% and 4% of all patients, respectively [14]. STK11 was the fourth most mutated gene and CDKN2A was the sixteenth [14]. The proximal-proliferative subtype in lung adenocarcinoma is characterized by an enrichment of mutations in KRAS along with inactivation mutations in STK11
[14]. In the STK11 gene, we discovered a nonsense mutation at W239* in the structurally conserved protein kinase domain that was below the detection threshold for other mutation algorithms used by TCGA. This mutation introduces an early stop codon in exon five (of ten) leading to a truncated protein. This site is in COSMIC and was previously reported to be part of a 398 nucleotide deletion in a lung cancer study [45].In the CDKN2A gene, we found one nonsense mutation at R122*, R163* and one missense mutation at R131H, R80H that were both validated by TCGA and found in COSMIC. CDKN2A is silenced in many CpG island methylator phenotype-high (CIMP-High) tumors by DNA methylation [14], but mutations and deletions in CDKN2A also result in loss of function. The nonsense mutation at R122*, R163* results in an early stop codon in exon two (of three or four, isoform dependent) leading to a truncated protein. Previous lung cancer studies [46]–[48] have reported frameshifts and deletions at this site. The missense mutation at R131H was also found in colon cancer [49], clear cell sarcoma [50], and chronic myeloid leukemia [51] and confirmed as somatic in biliary tract cancer [52].
Discussion
Identifying somatic mutations is a key step in characterizing the cancer genome. Until now, algorithms for mutation detection have concentrated on comparing just the normal and tumor genomes within the same individual. In the past few years, it has become common to also sequence the tumor transcriptome using RNA-Seq technologies. Large genomics studies, such as those conducted by TCGA, primarily use the RNA-Seq for gene expression, gene fusion, and splicing analyses. With the cost of sequencing steadily decreasing and the wealth of information that can by obtained from RNA-Seq, we predict that the sequencing of the tumor RNA will continue to be routine in large cancer profiling projects. We have developed a novel method called RADIA that combines the normal DNA, tumor DNA, and tumor RNA from the same individual to increase sensitivity to detect somatic mutations without compromising specificity. Here we have focused on the ability of RADIA to detect germline variants and somatic single nucleotide variants. In the future, we plan to include other classes of somatic mutations such as small insertions and deletions (INDELs), loss of heterozygosity events (LOHs) and RNA editing events.The accurate detection of somatic mutations is complicated by biological and technical artifacts such as tumor purity and subclonality, varying allele frequencies, sequencing depths, and copy-number variation. There is a trade-off between high sensitivity and high specificity, such that it is difficult to achieve both. By including an additional dataset, we are increasing our ability to reliably detect mutations, especially at low variant allele frequencies (Figure S7) where the signal to noise ratio becomes unfavorable.Many widely used mutation calling algorithms see a large decrease in precision as the DNA variant allele frequency declines [6], [8], [9], [11], [12]. We found that a DNA VAF of 10% gives us the best balance between sensitivity and precision. To demonstrate this point, we lowered the DNA VAF to 5% and reran RADIA on the endometrial carcinoma data from Section 3.2. We used the same validation strategy as described in Section 3.2 and compared the results to the ones with a DNA VAF of 10%. We found a slight 1% increase in overall sensitivity from 84% (at 10% VAF) to 85% (at 5% VAF) but an 8% decrease in overall precision from 97% (at 10% VAF) to 89% (at 5% VAF).By combining the RNA with the DNA, we are able to confirm the expression of a mutation, providing insight into its likely functional effect. Confirming mutations through RNA-Seq is also advantageous for large genomic studies in providing a means for weak validation for mutations without costly resequencing for validation (Figure S8). We find that over 99% of mutations that have both strong DNA and RNA support validate upon resequencing, suggesting that if one is not using mutations in clinical practice but rather estimating overall frequencies of specific mutations in a research cohort, the extreme expense in validating every mutation may not be warranted. While the integration of RNA and DNA provides an important but limited use as a DNA variant validation technique, studying the impacts on gene expression levels may lead to a deeper understanding of the functional impact of DNA-originating variants.Here we have outlined some of the strengths of RADIA, but approaches that use RNA-Seq for detecting variants have clear limitations [53], [54]. Only expressed alleles can be evaluated, which reduces the number of genes that can be assessed. In addition, several classes of mutations, such as the introduction of premature stop codons that lead to nonsense mediated decay, cannot be verified. Expression levels can also confound the ability to detect an imbalance in the genomic VAF as influences due to feedback control to rebalance gene dosage are currently unknown.With RADIA, we are able to detect mutations in important cancer genes such as TP53 that were previously not identified by other algorithms because the signal was lost in the noise. Somatic mutations are commonly used to group patients into subtypes that are critical for diagnosis and treatment of the disease. Our ability to rescue back mutations for individual patients will assist in correctly identifying each patient’s specific subtype and consequently their treatment options.Schematic of mutations detected by the DNA Only Method (DOM) and Triple BAM Method (TBM). In the first and middle columns, there is enough DNA read support for the DOM and other algorithms acting on DNA pairs to detect a mutation. In the middle and last columns, there is sufficient RNA read support for the TBM to detect a mutation. The middle column illustrates “RNA Confirmation” mutations that are detected by both the DOM and the TBM due to high read support in both the DNA and RNA. The last column represents the “RNA Rescue” mutations that have some support in the DNA and strong evidence in the RNA. The RNA Rescue mutations are typically missed by traditional mutation calling algorithms that only investigate the pairs of DNA.(PDF)Click here for additional data file.Diagram of bamsurgeon methodology. Mutations are spiked into BAM files by selecting locations with adequate coverage, extracting the reads, and adjusting the VAF according to the desirable VAF distribution. Once the bases in the reads are changed, the reads are remapped to the genome, replacing the reads in the original BAM file.(PDF)Click here for additional data file.Filters applied in the Variable DNA-Constant RNA bamsurgeon simulation experiment. The DNA variant allele frequencies were distributed from 1–50% and the RNA was held constant at 25%. Most of the DOM mutations were filtered because of the low variant allele frequency and tumor strand bias. In the TBM, most of the mutations were filtered due to the minimum number of alternative alleles required to make a call (n = 4) and strand bias in the tumor DNA and RNA.(PDF)Click here for additional data file.Filters applied in the Low Frequency DNA-Variable RNA bamsurgeon simulation experiment. The RNA variant allele frequencies were distributed from 1–50% and the DNA was held at 10% or less. Most of the DOM mutations were filtered because of the low DNA variant allele frequency and tumor strand bias. In the TBM, most of the mutations were filtered due to the minimum number of alternative alleles required to make a call (n = 4) and the low RNA variant allele frequency.(PDF)Click here for additional data file.Distribution of overlaps between RADIA and the endometrial TCGA MAF file. The distribution of the overlaps between RADIA and the validated somatic mutations from the endometrial TCGA network MAF file.(PDF)Click here for additional data file.Filters applied to the RADIA mutations that validated as somatic in the endometrial TCGA MAF file. Thirty-three percent of the mutations had a DNA VAF of eight percent or less while 23% landed in blacklist regions that were ignored.(PDF)Click here for additional data file.RNA Rescue mutations are primarily at low DNA VAFs. RNA Rescue mutations are primarily found at low DNA variant allele frequencies, but they also occur at higher frequencies where they were filtered due to non-depth related artifacts (e.g. strand-bias).(PDF)Click here for additional data file.Distribution of RNA Confirmation Calls. The total number of mutations (blue) that are covered by at least one RNA read (yellow), one RNA read supporting the alternative allele (orange), and RNA Confirmation mutations with high support in both the DNA and RNA (purple).(PDF)Click here for additional data file.Summary of TCGA samples analyzed by RADIA. RADIA has been run on over 3,300 TCGA samples across 15 different types of cancer. The RNA Rescue mutations make up two to seven percent of the total somatic mutations across the 15 types of cancer. Variant Call Format (VCF) and Mutation Annotation Format (MAF) files can be downloaded from the TCGA Data Portal (https://tcga-data.nci.nih.gov/tcga/). Open-access somatic MAFs can be visualized and downloaded via the UCSC Cancer Browser (https://genome-cancer.ucsc.edu/).(PDF)Click here for additional data file.TCGA barcodes and Universally Unique Identifiers (UUIDs) for the TCGA samples used in this study. All patients provided written informed consent in accordance with TCGA guidelines and local Institutional Review Boards (IRBs).(XLSX)Click here for additional data file.
Authors: Elizabeth T Cirulli; Abanish Singh; Kevin V Shianna; Dongliang Ge; Jason P Smith; Jessica M Maia; Erin L Heinzen; James J Goedert; David B Goldstein Journal: Genome Biol Date: 2010-05-28 Impact factor: 13.583
Authors: Marcin Imielinski; Alice H Berger; Peter S Hammerman; Bryan Hernandez; Trevor J Pugh; Eran Hodis; Jeonghee Cho; James Suh; Marzia Capelletti; Andrey Sivachenko; Carrie Sougnez; Daniel Auclair; Michael S Lawrence; Petar Stojanov; Kristian Cibulskis; Kyusam Choi; Luc de Waal; Tanaz Sharifnia; Angela Brooks; Heidi Greulich; Shantanu Banerji; Thomas Zander; Danila Seidel; Frauke Leenders; Sascha Ansén; Corinna Ludwig; Walburga Engel-Riedel; Erich Stoelben; Jürgen Wolf; Chandra Goparju; Kristin Thompson; Wendy Winckler; David Kwiatkowski; Bruce E Johnson; Pasi A Jänne; Vincent A Miller; William Pao; William D Travis; Harvey I Pass; Stacey B Gabriel; Eric S Lander; Roman K Thomas; Levi A Garraway; Gad Getz; Matthew Meyerson Journal: Cell Date: 2012-09-14 Impact factor: 41.582
Authors: Lynnette Fernández-Cuesta; Catherine Oakman; Priscila Falagan-Lotsch; Ke-Seay Smoth; Emmanuel Quinaux; Marc Buyse; M Stella Dolci; Evandro De Azambuja; Pierre Hainaut; Patrizia Dell'orto; Denis Larsimont; Prudence A Francis; John Crown; Martine Piccart-Gebhart; Giuseppe Viale; Angelo Di Leo; Magali Olivier Journal: Breast Cancer Res Date: 2012-05-02 Impact factor: 6.466
Authors: Goncalo R Abecasis; Adam Auton; Lisa D Brooks; Mark A DePristo; Richard M Durbin; Robert E Handsaker; Hyun Min Kang; Gabor T Marth; Gil A McVean Journal: Nature Date: 2012-11-01 Impact factor: 49.962
Authors: Nicola D Roberts; R Daniel Kortschak; Wendy T Parker; Andreas W Schreiber; Susan Branford; Hamish S Scott; Garique Glonek; David L Adelson Journal: Bioinformatics Date: 2013-07-09 Impact factor: 6.937
Authors: Donna Karolchik; Galt P Barber; Jonathan Casper; Hiram Clawson; Melissa S Cline; Mark Diekhans; Timothy R Dreszer; Pauline A Fujita; Luvina Guruvadoo; Maximilian Haeussler; Rachel A Harte; Steve Heitner; Angie S Hinrichs; Katrina Learned; Brian T Lee; Chin H Li; Brian J Raney; Brooke Rhead; Kate R Rosenbloom; Cricket A Sloan; Matthew L Speir; Ann S Zweig; David Haussler; Robert M Kuhn; W James Kent Journal: Nucleic Acids Res Date: 2013-11-21 Impact factor: 16.971
Authors: Jacob J Adashek; Shumei Kato; Rahul Parulkar; Christopher W Szeto; J Zachary Sanborn; Charles J Vaske; Stephen C Benz; Sandeep K Reddy; Razelle Kurzrock Journal: JCI Insight Date: 2020-06-04
Authors: Milan Radovich; Curtis R Pickering; Ina Felau; Gavin Ha; Hailei Zhang; Heejoon Jo; Katherine A Hoadley; Pavana Anur; Jiexin Zhang; Mike McLellan; Reanne Bowlby; Thomas Matthew; Ludmila Danilova; Apurva M Hegde; Jaegil Kim; Mark D M Leiserson; Geetika Sethi; Charles Lu; Michael Ryan; Xiaoping Su; Andrew D Cherniack; Gordon Robertson; Rehan Akbani; Paul Spellman; John N Weinstein; D Neil Hayes; Ben Raphael; Tara Lichtenberg; Kristen Leraas; Jean Claude Zenklusen; Junya Fujimoto; Cristovam Scapulatempo-Neto; Andre L Moreira; David Hwang; James Huang; Mirella Marino; Robert Korst; Giuseppe Giaccone; Yesim Gokmen-Polar; Sunil Badve; Arun Rajan; Philipp Ströbel; Nicolas Girard; Ming S Tsao; Alexander Marx; Anne S Tsao; Patrick J Loehrer Journal: Cancer Cell Date: 2018-02-12 Impact factor: 31.743
Authors: Lisa Neums; Seiji Suenaga; Peter Beyerlein; Sara Anders; Devin Koestler; Andrea Mariani; Jeremy Chien Journal: Gigascience Date: 2018-02-01 Impact factor: 6.524
Authors: Arjun A Rao; Ada A Madejska; Jacob Pfeil; Benedict Paten; Sofie R Salama; David Haussler Journal: Front Immunol Date: 2020-11-10 Impact factor: 7.561