Anna Quaglieri, Christoffer Flensburg, Terence P Speed, Ian J Majewski.
Abstract
BACKGROUND: RNA sequencing allows the study of both gene expression changes and transcribed mutations, providing a highly effective way to gain insight into cancer biology. When planning the sequencing of a large cohort of samples, library size is a fundamental factor affecting both the overall cost and the quality of the results. Here we specifically address how overall library size influences the detection of somatic mutations in RNA-Seq data in two acute myeloid leukaemia datasets.
RESULTS: We simulated shallower sequencing depths by downsampling 45 acute myeloid leukaemia samples (100 bp paired-end, PE) from the Leucegene project, which were originally sequenced at high depth. We compared the sensitivity of six methods in recovering validated mutations on the same samples. The six methods combine three popular callers (MuTect, VarScan, and VarDict) with two filtering strategies. We observed an incremental loss in sensitivity when simulating libraries of 80M, 50M, 40M, 30M and 20M fragments, with the largest loss detected with fewer than 30M fragments (below 90%, average loss of 7%). The sensitivity in recovering insertions and deletions varied markedly between callers, with VarDict showing the highest sensitivity (60%). Single nucleotide variant sensitivity was relatively consistent across methods, apart from MuTect, whose default filters need adjustment when applied to RNA-Seq. We also analysed 136 RNA-Seq samples from the TCGA-LAML cohort (50 bp PE) and assessed the change in sensitivity between the initial libraries (average 59M fragments) and the same libraries downsampled to 40M fragments. When considering single nucleotide variants in recurrently mutated myeloid genes, we found a comparable performance, with a 6% average loss in sensitivity using 40M fragments.
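The downsampling step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes fragments (read pairs) are identified by an ID and kept or dropped as a unit, and the function names `downsample_fraction` and `downsample_fragments` are hypothetical.

```python
import random


def downsample_fraction(total_fragments, target_fragments):
    """Fraction of fragments to keep so the library has about target_fragments."""
    if total_fragments <= 0:
        raise ValueError("total_fragments must be positive")
    # A library that is already smaller than the target is kept whole.
    return min(1.0, target_fragments / total_fragments)


def downsample_fragments(fragment_ids, target_fragments, seed):
    """Randomly keep target_fragments fragment IDs, without replacement.

    Sampling fragment IDs (rather than single reads) keeps both mates of a
    paired-end fragment together. A different seed gives a different
    random downsampling run of the same library.
    """
    ids = list(fragment_ids)
    if target_fragments >= len(ids):
        return ids
    rng = random.Random(seed)
    return rng.sample(ids, target_fragments)


# Example: an average TCGA-LAML library (59M fragments) reduced to 40M
# means keeping roughly two thirds of the fragments.
fraction = downsample_fraction(59_000_000, 40_000_000)
```

Repeating the call with several seeds gives replicate downsampled libraries at the same size, which is what allows a spread of sensitivities to be reported per library size.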
Keywords: Cancer RNA-Seq; Library size; Sequencing depth; Variant calling
Year: 2020 PMID: 33261552 PMCID: PMC7708150 DOI: 10.1186/s12859-020-03860-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Table 1 Types of variants in the Leucegene truth sets
| Mutation type | Min VAF | Mean VAF | Max VAF | N |
|---|---|---|---|---|
| Composite indel | 0.06 | 0.25 | 0.56 | 9 |
| Long insertion | 0.41 | 0.50 | 0.64 | 3 |
| Short deletion | 0.09 | 0.24 | 0.38 | 2 |
| Short insertion | 0.07 | 0.33 | 0.84 | 15 |
| SNVs | 0.05 | 0.37 | 0.97 | 58 |
| Indel-not reported | 0.84 | 0.84 | 0.84 | 1 |
Variants used as the truth set were previously validated in a set of 45 CBF-AML RNA-Seq samples [9]. Variant types are inferred from the information in the published study and from the variant calls performed on the initial samples. A short indel (insertion/deletion) is an indel < 10 bp long; composite indels are mutations including both inserted and deleted nucleotides; SNVs are single nucleotide variants.
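The variant-type definitions above can be expressed as a small classifier. This is a sketch under assumptions not stated in the source: variants are given as VCF-style reference/alternative allele strings, shared flanking bases are trimmed before classification, and multi-base substitutions are counted as composite indels. The names `trim_shared_bases` and `classify_variant` are hypothetical.

```python
SHORT_INDEL_MAX_BP = 10  # from the table note: a short indel is < 10 bp long


def trim_shared_bases(ref, alt):
    """Remove the bases shared by both alleles at the start, then at the end."""
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    return ref, alt


def classify_variant(ref, alt):
    """Assign one of the mutation types used in the truth-set table."""
    deleted, inserted = trim_shared_bases(ref.upper(), alt.upper())
    if not deleted and not inserted:
        raise ValueError("ref and alt alleles are identical")
    if len(deleted) == 1 and len(inserted) == 1:
        return "SNV"
    if deleted and inserted:
        # Both inserted and deleted nucleotides (assumption: this also
        # covers multi-base substitutions).
        return "Composite indel"
    length = len(deleted or inserted)
    size = "Short" if length < SHORT_INDEL_MAX_BP else "Long"
    kind = "insertion" if inserted else "deletion"
    return f"{size} {kind}"
```

For example, `classify_variant("A", "ATG")` is a two-base insertion and falls in the short insertion class, while `classify_variant("ACG", "ATTG")` both removes and adds bases and is treated as a composite indel.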
Fig. 1 Sensitivity in recovering the variants in the truth set using the Leucegene RNA-Seq samples. (a) Median sensitivity, with maximum and minimum (vertical bars), for recovering the SNVs (left plot) and indels (right plot) in Table 1 across random downsampling runs at different library sizes. Each median is taken across 5 random downsamplings (only 3 for 80M libraries) of the initial RNA-Seq libraries at a specific library size. Solid lines show the sensitivities obtained with the default-filters and dotted lines those obtained with the annotation-filters. (b) Average VAF (top plot) and alternative depth (bottom plot), on the log scale, at a variant site for the variants in the truth set at different library sizes. Each line represents one mutation in the truth set. Each dot is coloured according to the average number of times a variant was called by one caller using the annotation-filters across replicated downsampling runs at one specific library size. (c) Heatmap showing the sensitivity in detecting the Leucegene mutations using default-filters, across intervals of the VAF at a variant site and intervals of the total log2 gene counts, at different library sizes. Every coloured square represents how often, on average, the Leucegene variants within an interval were detected. The average is over mutations found within an interval and over callers.
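The summary plotted in panel a (median, minimum and maximum sensitivity across downsampling runs) can be computed along these lines. This is a minimal sketch, assuming each run yields a set of called variant identifiers that can be compared directly with the truth set; `sensitivity` and `summarise_runs` are hypothetical names, and the variant IDs in the example are made up.

```python
from statistics import median


def sensitivity(called, truth):
    """Proportion of truth-set variants that appear among the called variants."""
    truth = set(truth)
    if not truth:
        raise ValueError("truth set is empty")
    return len(truth & set(called)) / len(truth)


def summarise_runs(calls_per_run, truth):
    """Median, minimum and maximum sensitivity across downsampling runs."""
    values = [sensitivity(called, truth) for called in calls_per_run]
    return {"median": median(values), "min": min(values), "max": max(values)}


# Toy example: three downsampling runs at one library size.
truth_set = {"v1", "v2", "v3", "v4"}
runs = [{"v1", "v2", "v3"}, {"v1", "v2"}, {"v1", "v2", "v3", "v4"}]
summary = summarise_runs(runs, truth_set)
```

Calling `summarise_runs` once per caller, filter strategy and library size would give one point with its vertical bar per combination.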
Fig. 2 Sensitivity using the TCGA-LAML truth sets. (a) 2D density plots of the RNA VAF (x-axis) against the total depth on a logarithmic scale (y-axis) of the SNVs in the TCGA-LAML validated sets of variants: Set1 (all SNVs after removing intronic and intergenic variants) on the left and Set2 (SNVs in recurrently mutated AML genes) on the right. (b) Violin plots of the log2 RPKM of the genes with variants detected in Set1 and Set2. The gene expression distributions are shown for the TCGA-LAML and the Leucegene libraries using the initial samples, before downsampling. (c) Sensitivity in recovering SNVs from the TCGA-LAML cohort in the initial and downsampled libraries, for each combination of the two truth sets, callers and filtering strategies.
Fig. 3 SNV sensitivity and expression of the recurrently mutated AML genes using the TCGA-LAML cohort. (a) Sensitivity in recovering mutations in recurrently mutated AML genes (Set2) using the TCGA-LAML cohort with the callers' default-filters. The size of each dot is proportional to the number of times a gene is mutated, and genes are ordered by mutation load, with the most mutated genes at the top. Red dots correspond to the results obtained with the initial library sizes and cyan dots to those obtained with the downsampled libraries. (b) Expression distribution of recurrently mutated myeloid genes across the TCGA-LAML RNA-Seq samples used. The genes are reported in the same order as in panel a. Each dot corresponds to a sample, and dots are coloured by the percentage of times the variants detected in a sample are called by the callers using default-filters and the 40M libraries. A patient can harbour more than one mutation per gene. Horizontal violin plots are drawn below the dots.
Fig. 4 Sensitivity by total depth at a variant site. (a) Sensitivity as a function of the total depth at a variant site for the TCGA-LAML SNVs in Set1 (all SNVs after removing intronic and intergenic variants), combining the initial and 40M libraries and using the callers with default-filters. (b) Sensitivity as a function of the total depth at a variant site using the Leucegene samples. The sensitivity is computed using the variants in the truth set, combining the calls from all downsampling runs and using both types of filters. (c) Median sensitivity, with maximum and minimum, in recovering the SNVs in the truth set using the Leucegene samples. Only SNVs with total depth are considered as called. The sensitivity by depth is computed for each starting library size (colours) using annotation-filters. Each estimated median sensitivity (and minimum and maximum) is taken across random downsampling runs at the same library size. The red dotted lines represent the 80%, 90% and 95% sensitivity thresholds.
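Sensitivity as a function of total depth, as plotted in this figure, amounts to grouping truth-set variants by the depth at their site and computing the proportion called within each group. The sketch below illustrates the idea only: the bin edges, the input layout (one `(total_depth, was_called)` pair per variant and run) and the name `sensitivity_by_depth` are assumptions, not taken from the source.

```python
from bisect import bisect_right


def sensitivity_by_depth(records, bin_edges):
    """Proportion of variants called within each total-depth bin.

    records:   iterable of (total_depth, was_called) pairs.
    bin_edges: ascending lower edges; bin i covers
               bin_edges[i] <= depth < bin_edges[i + 1], and the last bin
               is open-ended. Bins without variants map to None.
    """
    totals = [0] * len(bin_edges)
    hits = [0] * len(bin_edges)
    for depth, was_called in records:
        i = bisect_right(bin_edges, depth) - 1
        if i < 0:
            continue  # depth is below the first edge
        totals[i] += 1
        hits[i] += 1 if was_called else 0
    return {
        edge: (hits[i] / totals[i] if totals[i] else None)
        for i, edge in enumerate(bin_edges)
    }


# Toy example with hypothetical depth bins starting at 0, 10 and 100 reads.
example = sensitivity_by_depth(
    [(5, False), (12, True), (15, False), (120, True)],
    bin_edges=[0, 10, 100],
)
```

Comparing the per-bin values with the 80%, 90% and 95% thresholds then shows from which depth onwards a given sensitivity is reached.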