Arkarachai Fungtammasan, Marta Tomaszkiewicz, Rebeca Campos-Sánchez, Kristin A Eckert, Michael DeGiorgio, Kateryna D Makova.
Abstract
Transcript variation has important implications for organismal function in health and disease. Most transcriptome studies focus on assessing variation in gene expression levels and isoform representation. Variation at the level of transcript sequence is caused by RNA editing and transcription errors, and leads to nongenetically encoded transcript variants, or RNA-DNA differences (RDDs). Such variation has been understudied, in part because its detection is obscured by reverse transcription (RT) and sequencing errors. It has only been evaluated for intertranscript base substitution differences. Here, we investigated transcript sequence variation for short tandem repeats (STRs). We developed the first maximum-likelihood estimator (MLE) to infer RT error and RDD rates, taking next-generation sequencing error rates into account. Using the MLE, we empirically evaluated RT error and RDD rates for STRs in a large-scale DNA and RNA replicated sequencing experiment conducted in a primate species. The RT error rates increased exponentially with STR length and were biased toward expansions. The RDD rates were approximately 1 order of magnitude lower than the RT error rates. The RT error rates estimated with the MLE from a primate data set were concordant with those estimated with an independent method, barcoded RNA sequencing, from a Caenorhabditis elegans data set. Our results have important implications for medical genomics, as STR allelic variation is associated with >40 diseases. STR nonallelic transcript variation can also contribute to disease phenotype. The MLE and empirical rates presented here can be used to evaluate the probability of disease-associated transcripts arising due to RDD.
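To make the idea of estimating an RT error rate while accounting for sequencing error concrete, the following is a minimal toy sketch, not the estimator developed in the paper. It assumes a deliberately simplified model that the abstract does not specify: DNA reads carry only sequencing errors at rate `s`, RNA reads carry sequencing errors plus an independent RT error at rate `r`, and RDDs are ignored. The function names and the grid search are hypothetical choices made for illustration.

```python
import math


def log_binom_lik(k, n, p):
    """Binomial log-likelihood of k erroneous reads out of n (constant term dropped)."""
    p = min(max(p, 1e-12), 1 - 1e-12)  # keep log() finite at the boundaries
    return k * math.log(p) + (n - k) * math.log(1 - p)


def estimate_rt_error(k_dna, n_dna, k_rna, n_rna, grid=10000):
    """Two-step toy estimate of the sequencing error s and the RT error r.

    Assumed model (an illustration only):
      P(error in a DNA read) = s
      P(error in an RNA read) = 1 - (1 - s) * (1 - r)
    """
    s_hat = k_dna / n_dna  # sequencing error estimated from DNA reads alone
    best_r, best_ll = 0.0, -math.inf
    for i in range(grid + 1):
        r = i / grid
        p_rna = 1 - (1 - s_hat) * (1 - r)
        ll = log_binom_lik(k_rna, n_rna, p_rna)
        if ll > best_ll:  # keep the grid point with the highest likelihood
            best_r, best_ll = r, ll
    return s_hat, best_r


# Hypothetical read counts, invented purely to exercise the function.
s_hat, r_hat = estimate_rt_error(k_dna=10, n_dna=1000, k_rna=60, n_rna=1000)
```

Under this toy model, the RNA error excess over the DNA error rate is attributed entirely to RT; when RNA reads show no excess, the estimate sits at the boundary `r = 0`.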
Keywords: RNA sequencing; RNA–DNA differences; error correction model; microsatellites; reverse transcription errors; sequencing errors; tandem repeats; transcription errors
Year: 2016 PMID: 27413049 PMCID: PMC5026258 DOI: 10.1093/molbev/msw139
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
Fig. 1. A schematic representation of the experimental design.
Fig. 2. A comparison of RT error rates and RT expansion probabilities as a function of repeat number for motif (A) between sequencing batches A (blue) and B (red). (A) RT error rates for the bin size of 2; (B) RT error rates for the bin size of 5; (C) RT error rates for the bin size of 40; (D) RT expansion probabilities for the bin size of 2; (E) RT expansion probabilities for the bin size of 5; (F) RT expansion probabilities for the bin size of 40. Repeat numbers between 5 and 10 were chosen due to their high abundance. Median values across 100 empirical bootstrap replicates (bootstrapped across loci) are plotted with open circles, whereas point estimates are plotted with stars. Solid lines connect the median bootstrap estimates. The 95% confidence intervals were calculated from the 100 bootstrap replicates. Each estimate was based on five sets of random initial parameters to minimize the possibility of reaching local maxima, and the set of parameters that had the maximal likelihood was taken as the estimate for a given bootstrap replicate. The estimations for the bin size of 2 were performed using the full MLE, whereas the estimations for the bin sizes of 5 and 40 were performed using the lumping MLE. The number of loci analyzed for each bin size is listed in supplementary tables S4 and S5, Supplementary Material online.
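The bootstrap across loci described in the caption can be sketched as follows. This is an illustrative outline only: the per-locus data layout of `(error_reads, total_reads)` pairs, the `pooled_rate` stand-in estimator, and the percentile rule for the interval are assumptions, and the actual analysis refits the MLE (with five random restarts) on each replicate instead of computing a pooled rate.

```python
import math
import random
import statistics


def pooled_rate(loci):
    """Stand-in estimator: pooled error rate over (error_reads, total_reads) pairs."""
    errors = sum(e for e, _ in loci)
    reads = sum(n for _, n in loci)
    return errors / reads if reads else 0.0


def bootstrap_across_loci(loci, estimator=pooled_rate, n_boot=100, seed=0):
    """Resample loci with replacement; return the median and a 95% percentile interval."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    estimates = []
    for _ in range(n_boot):
        resample = [rng.choice(loci) for _ in loci]  # same number of loci as the input
        estimates.append(estimator(resample))
    estimates.sort()
    lo = estimates[math.floor(0.025 * (n_boot - 1))]
    hi = estimates[math.ceil(0.975 * (n_boot - 1))]
    return statistics.median(estimates), (lo, hi)


# Hypothetical loci, invented purely to exercise the function.
loci = [(1, 100), (2, 100), (0, 100), (3, 100)]
median, (ci_low, ci_high) = bootstrap_across_loci(loci)
```

Resampling whole loci, not individual reads, keeps reads from the same locus together, so the interval reflects locus-to-locus variability.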
RDD Rates for the (A) Motif.
| Batch A | Batch B | Batch A | Batch B | Batch A | Batch B |
|---|---|---|---|---|---|
| <1.0e-9 [<1.0e-9, <1.0e-9] | <1.0e-9 [<1.0e-9, 2.76e-4] | <1.0e-9 [<1.0e-9, 3.29e-9] | <1.0e-9 [<1.0e-9, 6.86e-5] | 1.87e-4 [<1.0e-9, 5.70e-4] | <1.0e-9 [<1.0e-9, 4.29e-9] |
| 1.87e-3 [2.45e-4, 3.38e-3] | 5.13e-4 [<1.0e-9, 2.01e-3] | 6.45e-4 [<1.0e-9, <2.26e-3] | 6.80e-4 [<1.0e-9, 2.17e-3] | 1.89e-3 [<1.0e-9, 3.42e-3] | 1.81e-3 [<1.0e-9, 4.97e-3] |
| <1.0e-9 [<1.0e-9, 2.28e-3] | 2.59e-3 [<1.0e-9, 7.59e-3] | 3.26e-3 [<1.0e-9, 2.34e-3] | 3.9e-3 [<1.0e-9, 2.83e-3] | 7.36e-4 [<1.0e-9, 5.36e-3] | 5.90e-3 [<1.0e-9, 1.33e-2] |
| 3.57e-3 [<1.0e-9, 1.48e-2] | 2.68e-3 [<1.0e-9, 1.72e-2] | 7.53e-3 [3.80e-3, <1.0e-9] | 7.46e-3 [<1.0e-9, 1.14e-2] | 3.76e-3 [<1.0e-9, 1.36e-2] | <1.0e-9 [<1.0e-9, 8.12e-3] |
| 8.94e-3 [<1.0e-9, 2.28e-2] | <1.0e-9 [<1.0e-9, <1.0e-9] | 2.40e-2 [<1.0e-9, 1.90e-2] | 1.57e-2 [<1.0e-9, 1.67e-2] | 2.33e-2 [<1.0e-9, 7.34e-2] | <1.0e-9 [<1.0e-9, 1.49e-2] |
| <1.0e-9 [<1.0e-9, 6.55e-2] | 1.15e-2 [<1.0e-9, 8.22e-2] | 7.52e-2 [<1.0e-9, 6.85e-2] | 4.95e-3 [<1.0e-9, 3.54e-2] | 7.54e-3 [<1.0e-9, 5.93e-2] | <1.0e-9 [<1.0e-9, 1.18e-2] |
Note: In each cell, the number outside the brackets is the point estimate, whereas the numbers inside the brackets give the 95% confidence interval.
Fig. 3. A comparison of RT error rates estimated using the full MLE (orangutan data) versus barcoded RNA sequencing (Caenorhabditis elegans data). The 95% confidence intervals for the rates estimated with the full MLE were generated from 100 empirical bootstrap replicates (bootstrapped across loci), whereas the 95% confidence intervals for the barcoded RNA sequencing were generated from 1,000 bootstrap replicates of inferred cDNA molecules with at least two cDNA molecules in that family. The lower bounds of the RT error rate confidence intervals for the barcoded RNA sequencing are zero and thus are outside the plotting area.
Fig. 4. A comparison among STR RT error rates (this study), STR RDD rates (this study), STR germ-line mutation rates (Fungtammasan et al. 2015), STR sequencing error rates (Fungtammasan et al. 2015), base-substitution germ-line mutation rates (Kong et al. 2012), and base-substitution RDD rates (Gout et al. 2013 [lower line], Traverse & Ochman 2016 [upper line]).