Huanle Liu, Oguzhan Begik, Morghan C Lucas, Jose Miguel Ramirez, Christopher E Mason, David Wiener, Schraga Schwartz, John S Mattick, Martin A Smith, Eva Maria Novoa.
Abstract
The epitranscriptomics field has undergone an enormous expansion in the last few years; however, a major limitation is the lack of generic methods to map RNA modifications transcriptome-wide. Here, we show that using direct RNA sequencing, N6-methyladenosine (m6A) RNA modifications can be detected with high accuracy, in the form of systematic errors and decreased base-calling qualities. Specifically, we find that our algorithm, trained with m6A-modified and unmodified synthetic sequences, can predict m6A RNA modifications with ~90% accuracy. We then extend our findings to yeast data sets, finding that our method can identify m6A RNA modifications in vivo with an accuracy of 87%. Moreover, we further validate our method by showing that these 'errors' are typically not observed in yeast ime4-knockout strains, which lack m6A modifications. Our results open avenues to investigate the biological roles of RNA modifications in their native RNA context.
Year: 2019 PMID: 31501426 PMCID: PMC6734003 DOI: 10.1038/s41467-019-11713-9
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 Base-calling “errors” can be used as a proxy to identify RNA modifications in direct RNA sequencing reads. a Schematic overview of the strategy used in this work to train and test an m6A RNA base-calling algorithm. b IGV snapshot of one of the four transcripts used in this work. In the upper panel, in vitro transcribed reads containing m6A have been mapped, whereas in the lower panel the unmodified counterpart is shown. Nucleotides with mismatch frequencies >0.05 have been colored. c Comparison of m6A and A positions, at the level of per-base quality scores (left panel), mismatch frequencies (middle left panel), deletion frequency (middle right panel), and mean current intensity (right panel). All possible k-mers (computed as a sliding window along the transcripts) have been included for these comparisons (n = 9974). d–g Replicability of each individual feature, namely base quality (d), deletion frequency (f), mismatch frequency (e), and current intensity (g), across biological replicates, for both unmodified (“A”) and m6A-modified (“m6A”) data sets. Comparison of unmodified and m6A-modified (“A vs m6A”) is also shown for each feature. Correlation values shown correspond to Spearman’s rho. Error bars indicate s.d. Source data are provided in the Source Data file.
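The base-called features compared in Fig. 1c can be derived from simple per-position pileup counts. The following is a minimal sketch, not the authors' implementation: the `PositionCounts` record, its field names, and the choice of denominator (matches + mismatches + deletions) are assumptions made for illustration. Only the 0.05 mismatch-frequency threshold is taken from the caption of panel b.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class PositionCounts:
    """Hypothetical per-position pileup summary (layout assumed, not from the paper)."""
    matches: int
    mismatches: int
    deletions: int
    qualities: list = field(default_factory=list)  # Phred scores of aligned bases


def position_features(c: PositionCounts) -> dict:
    """Return mismatch frequency, deletion frequency, and mean base quality."""
    # Assumed denominator: every read spanning the position, deletions included.
    covered = c.matches + c.mismatches + c.deletions
    if covered == 0:
        return {"mismatch_freq": 0.0, "deletion_freq": 0.0, "mean_quality": 0.0}
    return {
        "mismatch_freq": c.mismatches / covered,
        "deletion_freq": c.deletions / covered,
        "mean_quality": mean(c.qualities) if c.qualities else 0.0,
    }


def is_colored(c: PositionCounts, threshold: float = 0.05) -> bool:
    """Flag positions whose mismatch frequency exceeds the panel-b threshold."""
    return position_features(c)["mismatch_freq"] > threshold
```

For example, a position covered by 100 reads with 15 mismatches and 5 deletions has a mismatch frequency of 0.15 and would be colored in an IGV-style view under this threshold.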
Fig. 2 Base-calling “errors” alone can accurately identify m6A RNA modifications. a Base-called features (base quality, insertion frequency, and deletion frequency) of m6A motif 5-mers, and for each position of the 5-mer, are shown. The features of the m6A-modified transcripts (“m6A”) are shown in red, whereas the features of the unmodified transcripts (“unm”) are shown in blue. b, c Principal component analysis (PCA) scores plot of the first two principal components, using 15 features (base quality, mismatch frequency, deletion frequency, for each of the five positions of the k-mer) as input. The logos of the k-mers used in the m6A-motif RRACH set (left) and control set (right) are also shown. Each dot represents a specific k-mer in the synthetic sequence, and has been colored depending on whether the k-mer belongs to the m6A-modified transcripts (red) or the unmodified transcripts (black). The contribution of each principal component is shown on each axis. d–g ROC curves of the SVM predictions using: (i) each individual feature separately to train and test each model, at m6A sites (d); (ii) combined features at m6A sites, relative to the individual features (e); (iii) combined features at m6A sites relative to control sites, where the base-called “errors” information of neighboring nucleotides has been included in the model (f); and (iv) different mixtures of methylated and unmethylated reads, using the combined features model (g). Error bars indicate s.d. Source data are provided in the Source Data file.
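The 15-feature input described for panels b and c (three features at each of the five positions of a k-mer) can be assembled with a sliding window. This is a hedged sketch under stated assumptions: the function name `kmer_feature_vectors`, the `per_base` input layout, and the position-major ordering of the features are hypothetical, and the source does not specify how the authors ordered or scaled the features before PCA or SVM training.

```python
def kmer_feature_vectors(sequence, per_base, k=5, center="A"):
    """Collect (start, kmer, features) for every k-mer whose central base is `center`.

    per_base is an assumed layout: one (base_quality, mismatch_freq, deletion_freq)
    tuple per position of `sequence`.
    """
    if len(sequence) != len(per_base):
        raise ValueError("sequence and per_base must have the same length")
    half = k // 2
    vectors = []
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if kmer[half] != center:
            continue  # keep only windows centered on the candidate base
        # Flatten position by position: 3 features x k positions (15 for k = 5).
        features = [value for pos in per_base[start:start + k] for value in pos]
        vectors.append((start, kmer, features))
    return vectors
```

Each resulting vector could then be passed to any classifier or dimensionality-reduction routine; including the neighboring positions is what lets a model use the “errors” of nucleotides adjacent to the modified base, as in panel f.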
Fig. 3 Yeast wild-type and ime4∆ strains show distinct base-called features at known m6A-modified RRACH sites. a Overview of the direct RNA sequencing library preparation using in vivo polyA(+) RNA from S. cerevisiae cultures. b Replicability of per-gene counts using direct RNA sequencing across wild-type yeast strains (top) and ime4∆ strains (middle). The correlation between wild-type and ime4∆ strains is also shown (bottom). c Comparison of the observed mismatch frequencies in the 100%-modified in vitro transcribed sequences (blue), unmodified sequences (red), yeast ime4∆ knockout (green), and yeast wild type (cyan). Values for each biological replicate are shown. d Base-called features (base quality, insertion frequency, and deletion frequency) of RRACH 5-mers known to contain m6A modifications. Only features corresponding to the modified nucleotide (position 0) are shown. Features extracted from wild-type yeast reads (m6A-modified) are shown in blue, whereas those from ime4∆ (unmodified) for the same set of k-mers are shown in red. f Genomic tracks of previously reported m6A-modified RRACH sites in yeast, identified using Illumina sequencing. The m6A-modified nucleotide is highlighted with a green asterisk. In these positions, wild-type yeast strains show increased mismatch frequencies, as well as decreased coverage (reflecting increased deletion frequency), in all three biological replicates, whereas these features are not observed in any of the three ime4∆ replicates. g m6A modification scores predicted by the trained SVM at known m6A-modified (n = 363) and unknown (n = 60,794) RRACH sites, both for yeast wild-type and ime4∆ data sets. P-values have been computed using the Kruskal–Wallis test. A site was included in the analysis if there were mapped reads present in all six yeast samples. Sites with more than one “A” in the 5-mer were excluded from the analysis. h ROC curve depicting the performance of EpiNano in yeast data sets (n = 61,363 sites). Error bars indicate s.d. Source data are provided in the Source Data file.
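The site-selection rule stated for panel g (RRACH 5-mers, excluding those with more than one “A”) can be sketched as follows. The function names are hypothetical, and the sketch assumes the standard IUPAC meanings R = A/G and H = A/C/U, 0-based coordinates, and that a “T” in DNA-style input should be read as “U”. The additional requirement of mapped reads in all six samples is not modeled here.

```python
R_BASES = set("AG")   # IUPAC R: purine
H_BASES = set("ACU")  # IUPAC H: not G


def is_rrach(kmer: str) -> bool:
    """True if the 5-mer matches the RRACH motif (central base A, then C)."""
    kmer = kmer.upper().replace("T", "U")
    return (
        len(kmer) == 5
        and kmer[0] in R_BASES
        and kmer[1] in R_BASES
        and kmer[2] == "A"
        and kmer[3] == "C"
        and kmer[4] in H_BASES
    )


def has_single_a(kmer: str) -> bool:
    """True if the 5-mer contains exactly one A."""
    return kmer.upper().count("A") == 1


def candidate_sites(sequence: str) -> list:
    """Return (position_of_central_A, kmer) for RRACH 5-mers with a single A."""
    seq = sequence.upper().replace("T", "U")
    sites = []
    for start in range(len(seq) - 4):
        kmer = seq[start:start + 5]
        if is_rrach(kmer) and has_single_a(kmer):
            sites.append((start + 2, kmer))  # 0-based index of the central A
    return sites
```

A 5-mer such as AAACA matches RRACH but is dropped by the single-A rule, which avoids ambiguity about which adenosine the base-called features should be attributed to.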