Literature DB >> 34348892

Repurposing RNA sequencing for discovery of RNA modifications in clinical cohorts.

Kar-Tong Tan1,2,3, Ling-Wen Ding4, Chan-Shuo Wu1, Daniel G Tenen5,6, Henry Yang5,7.   

Abstract

The study of RNA modifications in large clinical cohorts can reveal relationships between the epitranscriptome and human diseases, although this is especially challenging. We developed ModTect (https://github.com/ktan8/ModTect), a statistical framework to identify RNA modifications de novo by standard RNA-sequencing with deletion and mis-incorporation signals. We show that ModTect can identify both known (N 1-methyladenosine) and previously unknown types of mRNA modifications (N 2,N 2-dimethylguanosine) at nucleotide-resolution. Applying ModTect to 11,371 patient samples and 934 cell lines across 33 cancer types, we show that the epitranscriptome was dysregulated in patients across multiple cancer types and was additionally associated with cancer progression and survival outcomes. Some types of RNA modification were also more disrupted than others in patients with cancer. Moreover, RNA modifications contribute to multiple types of RNA-DNA sequence differences, which unexpectedly escape detection by Sanger sequencing. ModTect can thus be used to discover associations between RNA modifications and clinical outcomes in patient cohorts.
Copyright © 2021 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original U.S. Government Works. Distributed under a Creative Commons Attribution NonCommercial License 4.0 (CC BY-NC).

Entities:  

Mesh:

Substances:

Year:  2021        PMID: 34348892      PMCID: PMC8336963          DOI: 10.1126/sciadv.abd2605

Source DB:  PubMed          Journal:  Sci Adv        ISSN: 2375-2548            Impact factor:   14.136


INTRODUCTION

Association studies in large clinical cohorts prove to be highly effective in identifying possible relationships between human diseases and variations on the genetic, epigenetic, transcriptomic, and proteomic layers. Given that RNA modifications have emerged as important regulators of molecular processes such as protein translation, RNA stability, and pre-mRNA splicing (, ) and are thought to be involved in a variety of human diseases (–), the study of the epitranscriptome in large clinical cohorts can similarly reveal unexpected relationships between different types of RNA modifications and a multitude of human diseases. Nonetheless, the study of RNA modifications in large patient cohorts is greatly impeded by a number of challenges: (i) technical complexities of approaches used to detect RNA modifications that require either an RNA pull-down step (–) or chemical treatment with difficult to access reagents (–); (ii) the requirement of a large abundance of total RNA, ranging from tens of micrograms to milligrams (, ), which is particularly challenging to obtain in the context of rare and highly precious clinical samples; (iii) the limitation of these methods to probe only for a single type of RNA modification at a time out of the >170+ possible types of RNA modifications; and (iv) difficulties in collecting a large collection of patient samples corresponding to different human diseases. Thus, because of these technical limitations and limited accessibility to clinically relevant samples, it has been particularly challenging to establish relationships between the epitranscriptome and various human diseases. As standard RNA sequencing (RNA-seq) has now been widely performed for a large panel of clinical samples, these datasets present an invaluable opportunity to study RNA modifications that were unintentionally captured. To identify RNA modification sites using standard RNA-seq, two main tools were previously developed. First, the HAMR (high-throughput annotation of modified ribonucleotides) software predicts RNA modification sites by identifying candidates that deviate from a homozygous genotype or 1 of 10 possible biallelic genotypes using a binomial test on the A,T,G,C nucleotide counts of each site (). A k-nearest neighbor approach is then used to classify candidate sites into seven classes of RNA modification based on a database of mismatch profiles detected at 303 tRNA modification sites. Second, the random forest classifier makes use of a random forest model trained on tRNAs to predict potential RNA modifications based on the arrest rate, mismatch rate, and jump rate profiles captured by RNA-seq (, ). Notably, both approaches apply a base quality cutoff (Q30 in HAMR) to generate the nucleotide counts of each site for the prediction of RNA modifications. However, a stringent base quality cutoff can lead to loss of sensitivity, while one that is too lenient may cause sequencing errors to be falsely predicted as RNA modifications. Further, the database of mismatch profiles used for prediction or classification of RNA modifications may also differ because of the library preparation strategy or the sequence context (, ) and may affect the prediction of RNA modifications in libraries generated by different protocols. To address these issues, we developed ModTect, a computational method to discover and profile multiple types of RNA modifications in large clinical cohorts. Notably, ModTect only requires standard RNA-seq libraries as an input and can thus be widely applied to RNA-seq datasets that were generated as part of most studies without requiring a database of modification signatures and profiles. In addition, ModTect models the nucleotide counts of each site in combination with their base qualities and can effectively identify multinucleotide sites while distinguishing them from sequencing noise. Further, we also introduced a series of filters to remove potential bioinformatics artifacts. With this approach, we analyzed 11,371 cancer patients and 934 cell line RNA-seq datasets across 33 cancer types for signals corresponding to multiple types of RNA modifications that disrupt base pairing (Fig. 1A and fig. S1). Our study indicates that some cancer types showed tumor type–specific differences in RNA modification levels, reveals that certain types of RNA modifications may be more strongly associated with some human diseases than others, and identified associations between RNA modifications and clinical outcomes.
Fig. 1

ModTect enables the discovery of multiple types of RNA modifications by standard RNA-seq.

(A) Schematic depicting how a type of base pair–disrupting RNA modification [3-methyluridine (m3U)] with an added chemical moiety disrupts Watson-Crick base pairing. (B) RNA modifications that disrupt base pairing cause the misincorporation of nucleotides, thus generating a multinucleotide mismatch pattern, and cause skipping of the modified base to produce a deletion signature during reverse transcription. (C) Detection of a multinucleotide mismatch and deletion signal at three different types of base pair–disrupting rRNA modifications. Top: Screenshot depicting the multinucleotide mismatch and deletion signature detected by RNA-seq but not by whole genome DNA sequencing (DNA-seq) at the m3U site. Middle: Percentage of each type of nucleotide and deletions observed from DNA sequencing and RNA-seq at three different types of base pair–disrupting RNA modification sites, N1-methyladenosine (m1A) at 28S:1322 rRNA, m3U at 28S:4530 rRNA, and 3-(3-amino-3-carboxypropyl) pseudo-uridine (m1acp3Ψ) at 18S:1248 rRNA. Depth of sequencing is indicated at the top of the chart. Bottom: Mismatch rate, deletion rate, and the type of mismatches observed at sites corresponding to each type of base pair–disrupting RNA modification. (D) Performance of ModTect in identifying base pair–disrupting RNA modifications on ribosomal RNAs (rRNAs). Left: ModTect allows effective extraction of the multinucleotide mismatch and deletion signals at an RNA modification site from RNA-seq. Multinucleotide mismatch signal represented by the modification score was extracted using a statistical model we designed, without the need for DNA sequencing. The deletion signal around each RNA modification is also depicted. Right: Precision-Recall curve, generated on the basis of rRNA modification sites from 934 RNA-seq datasets. The area under the precision-recall curve for each approach is indicated in the table.

ModTect enables the discovery of multiple types of RNA modifications by standard RNA-seq.

(A) Schematic depicting how a type of base pair–disrupting RNA modification [3-methyluridine (m3U)] with an added chemical moiety disrupts Watson-Crick base pairing. (B) RNA modifications that disrupt base pairing cause the misincorporation of nucleotides, thus generating a multinucleotide mismatch pattern, and cause skipping of the modified base to produce a deletion signature during reverse transcription. (C) Detection of a multinucleotide mismatch and deletion signal at three different types of base pair–disrupting rRNA modifications. Top: Screenshot depicting the multinucleotide mismatch and deletion signature detected by RNA-seq but not by whole genome DNA sequencing (DNA-seq) at the m3U site. Middle: Percentage of each type of nucleotide and deletions observed from DNA sequencing and RNA-seq at three different types of base pair–disrupting RNA modification sites, N1-methyladenosine (m1A) at 28S:1322 rRNA, m3U at 28S:4530 rRNA, and 3-(3-amino-3-carboxypropyl) pseudo-uridine (m1acp3Ψ) at 18S:1248 rRNA. Depth of sequencing is indicated at the top of the chart. Bottom: Mismatch rate, deletion rate, and the type of mismatches observed at sites corresponding to each type of base pair–disrupting RNA modification. (D) Performance of ModTect in identifying base pair–disrupting RNA modifications on ribosomal RNAs (rRNAs). Left: ModTect allows effective extraction of the multinucleotide mismatch and deletion signals at an RNA modification site from RNA-seq. Multinucleotide mismatch signal represented by the modification score was extracted using a statistical model we designed, without the need for DNA sequencing. The deletion signal around each RNA modification is also depicted. Right: Precision-Recall curve, generated on the basis of rRNA modification sites from 934 RNA-seq datasets. The area under the precision-recall curve for each approach is indicated in the table.

RESULTS

Multiple types of RNA modifications that disrupt base pairing can be simultaneously identified from standard RNA-seq libraries using ModTect

To simultaneously profile a wide variety of RNA modifications in a large panel of cancer patient–derived standard RNA-seq datasets, we focused on various types of RNA modifications that disrupt complementary Watson-Crick base pairing (Fig. 1A and fig. S1). Specifically, these RNA modifications, which we term base pair–disrupting RNA modifications, carry a chemical moiety (e.g., methyl group) on the Watson-Crick face and are unable to base pair with a complementary nucleotide (Fig. 1A and fig. S1). Thus, during reverse transcription, random nucleotides would be incorporated at these sites () or skipped by reverse transcriptase (RT), leading respectively to a multinucleotide mismatch signal and a deletion signal that is detectable by RNA-seq (Fig. 1B). Crucially, these signals are also distinct from mononucleotide mismatch signals induced by RNA editing and germline single-nucleotide polymorphisms (SNPs) (fig. S2) and are not known to be induced by other processes. Furthermore, noting that the reverse transcriptional step accompanies essentially all RNA-seq library preparation protocols, these two key-defining signals can potentially enable us to discover and profile the wide variety of RNA modifications that disrupt base pairing within the many thousands of patient-derived RNA-seq datasets from human cancers and various other human diseases and to explore possible dysregulation of the epitranscriptome in these highly precious patient samples. Thus, these two hallmark signals can potentially be used to identify and profile RNA modifications in large panels of cancer patient–derived RNA-seq. To first confirm that the two hallmark signals, the multinucleotide mismatch and deletion signals, can indeed be induced by various types of base pair–disrupting RNA modifications and also captured by standard RNA-seq, we used residual ribosomal RNA (rRNA) reads, present in most RNA-seq libraries (fig. S3), as an internal positive control. rRNAs are known to carry three types of base pair–disrupting RNA modifications, N1-methyladenosine (m1A), 3-methyluridine (m3U), and 3-(3-amino-3-carboxypropyl) pseudo-uridine (m1acp3Ψ) (, ), which can serve as positive control sites in our analysis. Upon examination of a matched whole-genome sequencing (WGS) and RNA-seq dataset (TCGA-D7-6521-01A), generated via standard RNA-seq library preparation protocols, we observed two or more types of nucleotide mismatches (i.e., multinucleotide mismatches) and deletions (~0.4 to 5% of reads) simultaneously at sites corresponding to each of these three types of base pair–disrupting RNA modifications (Fig. 1C), concordant with a prior study on modified bases in synthetic RNAs (). These mismatches were also only observed at the RNA level, but not at the DNA level at all three sites, suggesting that they were not induced by variations at the DNA level. They also do not show characteristics that are typical of artifacts (see Supplementary Text and fig. S4) and are detectable by three different sequencing platforms (fig. S5). Thus, these two hallmark signals are strong predictive indicators for the presence of RNA modifications. Noting that RNA modifications that disrupt base pairing are characterized by multinucleotide mismatches and deletions, we developed a computational framework named ModTect to capture these signals and detect base pair–disrupting RNA modification from typical RNA-seq libraries (see Materials and Methods, fig. S6A, and Supplementary Text). Notably, ModTect effectively captures the multinucleotide mismatch and deletion signals that characterize these RNA modification sites on rRNAs (Fig. 1D, left, and fig. S6B). Across 934 RNA-seq libraries, ModTect effectively identified these RNA modifications de novo (Fig. 1D, right; fig. S6, C and D; and Supplementary Text) and distinguished them from non–base pair–disrupting RNA modifications (fig. S6C). Together, these results indicate that ModTect is highly effective at detecting RNA modifications that disrupt base pairing from RNA-seq.

ModTect enables the discovery of multiple types of mRNA modifications with standard RNA-seq

The identification of mRNA modifications directly with standard RNA-seq is potentially susceptible to artifacts (–). To cope with potential pitfalls, we devised an overall strategy for identifying base pair–disrupting mRNA modifications in the whole transcriptome from RNA-seq (Fig. 2A, fig. S7, and Supplementary Text). We used the multinucleotide mismatch and deletion signals as captured by ModTect and coupled it to a range of stringent post-filters to remove possible sources of artifacts (fig. S7). When applied to 934 RNA-seq from the Cancer Cell Line Encyclopedia (CCLE) that covers a large range of tissue types, we identified 15,332 possible mRNA modifications that disrupt base pairing in the whole transcriptome and 9823 sites with more aggressive filters (Fig. 2A, Materials and Methods, fig. S7, and tables S1 and S2). Among the 9823 sites found by ModTect, 1350 “A”-type, 854 “C”-type, 5546 “G”-type, and 682 “T”-type sites were recorded, while 1391 sites could not be unambiguously assigned to a nucleotide type (Supplementary Text). Notably, the strong compositional bias of A versus T (1.98 times, P < 10−16) and G versus C (6.49 times, P < 10−16) sites also indicates that these sites were likely signals at the RNA rather than the DNA level. To prioritize sites of higher confidence for study, we defined a ranking system and assigned a rank to each of these sites (Fig. 2A, fig. S7, and Supplementary Text). We also manually curated a list of top-ranking sites with high mismatch rates across cell lines that are suitable for further experimental studies (table S3). Overall, we have successfully implemented a widely applicable framework for detecting potential sites of mRNA modifications from standard RNA-seq libraries while dealing with potential sources of artifacts.
Fig. 2

Validation of mRNA modification sites found by ModTect.

(A) Overview of method for discovering base pair–disrupting mRNA modifications from 934 standard RNA-seq datasets from the CCLE. ModTect identified 9823 base pair–disrupting mRNA modifications of all four possible nucleotide types. (B) Top ranking mRNA modification sites found by ModTect are concordant with known and previously unidentified types of mRNA modifications. Plot depicts the number of samples in which the site was identified versus their corresponding ModTect rank. A smaller ModTect rank represents a site of higher confidence. Known m1A sites, unknown m1A sites, a previously unknown type of mRNA modification identified by ModTect (m22G), and the gene symbols of the sites are as indicated. (C) A-type mRNA modification sites identified by ModTect show strong increase in modification rate following m1A immunoprecipitation (m1A IP) relative to control samples. Sequencing data were from GSE73914. (D) Proportion of variant nucleotides detected for RNA modification sites on DNA and RNA. Left: Plot depicts the average proportion of variant nucleotides at each RNA modification site identified by ModTect in the CCLE cohort. Right: Boxplot representing the variant proportions for all 9823 sites. (E) An example of an m1A site (11:65273630) on MALAT1 that was found independently and ranked highly by ModTect. A representative dataset (CCLE-A3/KAW) is depicted with a screenshot of read mapping, the modification score for multinucleotide mismatches, and the deletion signal. (F) Distribution of ModTect-identified sites in different genic regions. (G) Modification rate of RNA modification sites in different genic regions for each nucleotide type. A-to-I RNA editing sites obtained from the RADAR database are indicated as a reference.

Validation of mRNA modification sites found by ModTect.

(A) Overview of method for discovering base pair–disrupting mRNA modifications from 934 standard RNA-seq datasets from the CCLE. ModTect identified 9823 base pair–disrupting mRNA modifications of all four possible nucleotide types. (B) Top ranking mRNA modification sites found by ModTect are concordant with known and previously unidentified types of mRNA modifications. Plot depicts the number of samples in which the site was identified versus their corresponding ModTect rank. A smaller ModTect rank represents a site of higher confidence. Known m1A sites, unknown m1A sites, a previously unknown type of mRNA modification identified by ModTect (m22G), and the gene symbols of the sites are as indicated. (C) A-type mRNA modification sites identified by ModTect show strong increase in modification rate following m1A immunoprecipitation (m1A IP) relative to control samples. Sequencing data were from GSE73914. (D) Proportion of variant nucleotides detected for RNA modification sites on DNA and RNA. Left: Plot depicts the average proportion of variant nucleotides at each RNA modification site identified by ModTect in the CCLE cohort. Right: Boxplot representing the variant proportions for all 9823 sites. (E) An example of an m1A site (11:65273630) on MALAT1 that was found independently and ranked highly by ModTect. A representative dataset (CCLE-A3/KAW) is depicted with a screenshot of read mapping, the modification score for multinucleotide mismatches, and the deletion signal. (F) Distribution of ModTect-identified sites in different genic regions. (G) Modification rate of RNA modification sites in different genic regions for each nucleotide type. A-to-I RNA editing sites obtained from the RADAR database are indicated as a reference.

Validation of mRNA modification sites identified by ModTect

To determine whether sites identified by ModTect are genuine base pair–disrupting mRNA modification sites, we compared sites found by ModTect with m1A mRNA modification sites identified previously by m1A sequencing (m1A-seq) and m1A-MAP (misincorporation-assisted profiling of m1A) (Fig. 2B and fig. S8A) (, ). Notably, sites that were independently found by ModTect showed a good degree of overlap with sites that were identified by these experiment-based approaches and ranked highly (Fig. 2B). Among A-type sites identified by ModTect, seven sites were concordant with those previously identified by m1A-seq and m1A-MAP. These sites ranked between 1 and 19 among 9823 sites identified by ModTect (fig. S8C), reflecting the general effectiveness of our ranking-based approach in prioritizing m1A sites that are recovered using other experimental approaches. Further, A-type modification sites identified by ModTect showed an increase in modification rate following m1A immunoprecipitation (IP), suggesting that these sites are m1A sites (Fig. 2C). While 4 A-type sites among the list of top 20 sites were found to be nonconcordant with m1A sites identified by these studies, these sites may represent unknown classes of A-type base pair–disrupting mRNA modifications. Seven m1A sites that were concordantly identified by m1A-seq and m1A-MAP but missed by ModTect had low mismatch and deletion rate (Supplementary Text and fig. S8D), suggesting that sites missed by ModTect tended to be those with very low modification rate (<1% modification rate). Together, these results validate the ModTect approach and indicate that it can capture known mRNA modification sites from RNA-seq even in the absence of prior information. To assess whether the mismatch signals we observed by RNA-seq might correspond to variations at the DNA level, we used a panel of 327 WGS datasets generated from the CCLE samples (). Notably, 326 of these cell lines were part of the CCLE RNA-seq libraries analyzed in this study (99.7%). Across the 9823 putative RNA modification sites identified in our study, less than 2.7% of the sites across all samples (267 of 9823) were found to have a mismatch rate of ≥1% at the DNA level. Overall, we also observed that the mismatch rate observed at the RNA level was also much higher than that detected at the DNA level (Fig. 2D), with the mismatch rate at the DNA level for all sites found to be almost zero. Together, these results indicate that RNA modification sites found by ModTect were not caused by variations at the DNA level. Further, given that these RNA modification sites were identified without the use of a matched DNA sequencing dataset, our results further demonstrate that ModTect can discover RNA modification sites using RNA-seq datasets alone. To then determine whether sites that were ranked highly by ModTect have characteristics that were consistent with an m1A-type base pair–disrupting RNA modification, we focused on two m1A sites that were ranked highly by ModTect and were also identified by m1A-seq and m1A-MAP. These putative m1A sites are found on MALAT1 (chr11:65273630) and ND5 (NADH Dehydrogenase Subunit 5) (MT:13710) and were ranked 1 and 2, respectively. At the MALAT1 m1A site, we can clearly observe the multinucleotide mismatch and deletion signature in three representative cell lines from the CCLE (Fig. 2E and fig. S9, A to D), a strong dip in sequencing depth around the modification site concordant with a previous report () (Fig. 2E and fig. S9D), an enrichment of mismatches at this site following m1A IP (fig. S9E), clear deletion and multinucleotide mismatch signals across a large panel of CCLE RNA-seq (fig. S9, F and G), and detect multinucleotide mismatch and deletion signals in both the Illumina and BGI-500 sequencing platforms (fig. S9, H and I). Notably, similar results were also observed at the m1A site on ND5 (fig. S10). Together, these results indicate that sites identified by ModTect are genuine RNA modification sites rather than spurious noise, which further validates the ModTect approach.

Characteristics of base pair–disrupting mRNA modification sites

We then asked whether some types of mRNA modifications might be more readily installed in some regions of the transcriptome. To address this, we then examined the distribution of base pair–disrupting mRNA modifications sites, across nucleotide types, genic regions, and their corresponding modification rate. G-type sites were ~3 times more commonly observed than sites corresponding to all other nucleotide types across all 15,332 sites detected by ModTect (fig. S8B). With regard to the distribution of sites of each nucleotide type in different genic regions, all four types of sites were found fairly extensively in regions upstream of annotated transcripts and in the coding region (CDS), in contrast to sites found in the RADAR A-to-I RNA editing database (Fig. 2F) (). Notably, a greater proportion of G-type sites were found in the CDS, and a greater proportion of C-type sites were found in the 5′ untranslated region (5′UTR). Thus, the results indicate that RNA modification sites detected by our approach display region-specific distribution in the transcriptome. We next asked whether the distribution of modification sites might differ from their observed modification rate. To do so, we examined the modification rate of these sites at different regions of the transcriptome. The number of sites found in each region differed from patterns in modification rate of these sites. While more sites were found in upstream regions and CDSs, the modification rate of sites in these regions was among the lowest. Sites that were found in introns and downstream of an annotated transcript tended to have a higher mismatch rate, while mismatch rates across sites of all four nucleotide types were fairly similar (Fig. 2G). Overall, G-type RNA modification sites also tended to have a lower level of modification than other types of sites (Fig. 2G). Notably, numerous sites were also found on the MALAT1 and NEAT1 long noncoding RNAs (lncRNAs), indicating that they may be hypermodified (Supplementary Text and fig. S11). Together, the results indicate that sites of RNA modifications that disrupt base pairing are installed at lower modification levels in protein CDSs but introduced at higher modification levels in nonprotein CDSs of a transcript such as the introns, 3′UTR, and on lncRNAs.

ModTect leads to the discovery of a previously unidentified type of mRNA modification, m22G

Noting the ability of ModTect to identify genuine m1A mRNA modification sites in the human transcriptome, we then asked whether ModTect might also aid in the discovery of previously unidentified types of mRNA modifications. To do so, we further examined sites of other nucleotide types to determine whether these might correspond to previously unknown classes of base pair–disrupting mRNA modifications. We focused on three G-type modification sites found using ModTect, which were all found within the 3′UTR of the CTC1 (CST Telomere Replication Complex Component 1) gene, together with two other predicted A-type base pair–disrupting RNA modification sites (Fig. 3A and fig. S12A). These G-type sites were ranked 13,988 and 6676 among the list of 9823 candidate sites, while the two A-type sites were ranked 3 and 8, respectively. Curiously, all these sites were found on three tRNA-like elements predicted by the GtRNAdb database () and present in the CTC1 gene that we label as CTC1-tRNA1, CTC1-tRNA2, and CTC1-tRNA3 (Fig. 3A and fig. S12, B and C). Via manual inspection, we also identified an A>G mismatch site at each of the three tRNA-like elements and a C nucleotide–type mismatch in CTC1-tRNA2 (CTC1-tRNA2 site 1-2). Notably, a clear multinucleotide mismatch and deletion signal was noted at each of the five predicted base pair–disrupting RNA modification sites (CTC1-tRNA sites 1 and 3), which induces two or more types of mismatches, but not at the three A>G mismatch sites (CTC1-tRNA site 2) where a single type of mismatch was observed (Fig. 3B). Thus, these G-type modification sites found by ModTect might potentially correspond to previously unknown types of mRNA modifications.
Fig. 3

Discovery of previously unidentified classes of mRNA modifications with ModTect.

(A) Predicted base pair–disrupting mRNA modifications localize to tRNA-like elements within the 3′UTR of the CTC1 gene. Top: Schematic depicting predicted RNA modification sites at three tRNA-like elements within CTC1 3′UTR (CTC1-tRNA1, CTC1-tRNA2, and CTC1-tRNA3). Bottom: Screenshots depicting mismatches at candidate sites in three CCLE RNA-seq datasets. These sites are site 1 (17:8129984), site 1-2 (17:8129978), site 2 (17:8129976), and site 3 (17:8129943). (B) The modification score and the deletion signal across CCLE RNA-seq samples at CTC1-tRNA2 are as indicated. A high modification score was only recorded when multiple types of mismatches were detected (sites 1 and 3) but not for sites with a single type of mismatches (site 2). (C) Candidate RNA modification sites on MALAT1 and CTC1-tRNA2 are found at sites analogous to the RNA modification sites found on a standard tRNA. Structure and RNA modifications sites on a typical tRNA, tRNA-like structures on MALAT1, and CTC1-tRNA2 were generated based on prior publications (see the main text). (D) Modification rate at the inosine, m22G, and m1A RNA modification sites on CTC1-tRNA1, CTC1-tRNA2, and CTC1-tRNA3 across 934 CCLE RNA-seq samples. m1A site on CTC1-tRNA1 (17:8130324) was inferred from its structure. (E) RNA modifications are found on long mRNA transcripts, rather than short tRNA fragments. Plot depicts how far read pairs with RNA modifications extend from an RNA modification site. Most reads extend far beyond the 5′ and 3′ cut sites of a typical tRNA (m1A on MALAT1). (F) Most readpairs with mismatches at RNA modification sites extend well beyond the predicted 5′ and 3′ cut site of a standard tRNA. Bar plots for CTC1 represent CTC1-tRNA2.

Discovery of previously unidentified classes of mRNA modifications with ModTect.

(A) Predicted base pair–disrupting mRNA modifications localize to tRNA-like elements within the 3′UTR of the CTC1 gene. Top: Schematic depicting predicted RNA modification sites at three tRNA-like elements within CTC1 3′UTR (CTC1-tRNA1, CTC1-tRNA2, and CTC1-tRNA3). Bottom: Screenshots depicting mismatches at candidate sites in three CCLE RNA-seq datasets. These sites are site 1 (17:8129984), site 1-2 (17:8129978), site 2 (17:8129976), and site 3 (17:8129943). (B) The modification score and the deletion signal across CCLE RNA-seq samples at CTC1-tRNA2 are as indicated. A high modification score was only recorded when multiple types of mismatches were detected (sites 1 and 3) but not for sites with a single type of mismatches (site 2). (C) Candidate RNA modification sites on MALAT1 and CTC1-tRNA2 are found at sites analogous to the RNA modification sites found on a standard tRNA. Structure and RNA modifications sites on a typical tRNA, tRNA-like structures on MALAT1, and CTC1-tRNA2 were generated based on prior publications (see the main text). (D) Modification rate at the inosine, m22G, and m1A RNA modification sites on CTC1-tRNA1, CTC1-tRNA2, and CTC1-tRNA3 across 934 CCLE RNA-seq samples. m1A site on CTC1-tRNA1 (17:8130324) was inferred from its structure. (E) RNA modifications are found on long mRNA transcripts, rather than short tRNA fragments. Plot depicts how far read pairs with RNA modifications extend from an RNA modification site. Most reads extend far beyond the 5′ and 3′ cut sites of a typical tRNA (m1A on MALAT1). (F) Most readpairs with mismatches at RNA modification sites extend well beyond the predicted 5′ and 3′ cut site of a standard tRNA. Bar plots for CTC1 represent CTC1-tRNA2. However, it is unclear what type of RNA modifications these G-type sites are. To identify what type of modifications these sites might potentially be, we mapped these putative sites of modifications found on the CTC1-tRNA–like elements toward their predicted structures (–). Notably, we found that these sites mapped to locations on their structures that were analogous to the sites of RNA modifications found on a standard tRNA (Fig. 3C). The detected m1A site found on MALAT1 and the two A-type modification sites on the tRNA-like elements on CTC1 (CTC1-tRNA2 and CTC1-tRNA3) all mapped to positions that are analogous to the m1A site typically found on the T-loop of a tRNA (Fig. 3C and fig. S13). Sites of G-type modifications on the three CTC1-tRNA–like elements also mapped to positions analogous to the m22G RNA modification site found on a standard tRNA, in the region between the anticodon loop and D-loop (Fig. 3C and fig. S13). Notably, m1A and m22G are both base pair–disrupting RNA modifications (fig. S1), and their occurrence at these sites is consistent with the observed multinucleotide mismatch and deletion signals observed. The sites where A>G mismatches (site 2) and C-type mismatches (site 1-2) were noted on CTC1-tRNA–like elements also respectively localized toward inosine and 3-methylcytidine (m3C) modification sites of a standard tRNA (Fig. 3C and fig. S13). The observation of an m3C site here is consistent with a recent report demonstrating the presence of m3C RNA modification in bulk mRNAs by mass spectrometry (). Curiously, there also seems to be some relationship between the levels of modifications detected at these sites with the relative positioning of these tRNA-like elements in the 3′UTR and from the protein CDS (Fig. 3D), which is possibly related to their ability to fold into tRNA-like structures. Overall, these results indicate that ModTect, which is widely applicable to all types of standard RNA-seq datasets, empowers the discovery of previously unknown types of mRNA modifications. We then questioned whether these previously unidentified types of mRNA modifications we observed might simply be tRNA modifications rather than mRNA modifications. To address this, we analyzed reads with mismatches at the RNA modification sites we identified on CTC1 and note that most reads extend past the 5′ or 3′ cut site of a standard tRNA (fig. S14). By BLAT (BLAST-like alignment tool) alignment of reads with mismatches at the RNA modification sites found on CTC1-tRNA–like elements, we also found that these alignments were extremely unique and had very high BLAT scores (>900) in comparison to the next best hit (~400 to 500), suggesting that these alignments were not caused by mismapping. To obtain longer range information of these reads, we further made use of the mate-pair information and extracted read pairs with mismatches at these sites. We then asked how far to the left and to the right each of these read pairs can extend from these modification sites (fig. S15A). Notably, if most of these reads extend well beyond the 5′ and 3′ cut site of a standard tRNA, this will indicate that these modifications are present on the CTC1 mRNA and MALAT1 (Metastasis Associated Lung Adenocarcinoma Transcript 1) lncRNA rather than on tRNAs. We found that most reads carrying these mismatch signals extend well beyond the cut sites of a typical tRNA (Fig. 3, E and F), indicating that these signals likely originate from full-length CTC1 mRNAs rather that processed tRNAs (Supplementary Text and figs. S15 and S16). These results indicate that these previously unidentified types of mRNA modifications that we have identified (m22G) are found on mRNAs rather than tRNAs. More broadly, our results highlight the ability of ModTect to uncover the existence of a previously unknown class of mRNA modifications, m22G, and emphasize how signals corresponding to previously unidentified types of mRNA modification (m22G and m3C) are existent and detectable via standard RNA-seq.

The epitranscriptome contributes to multiple types of RNA-DNA sequence differences during RNA-seq

The extensive occurrence of mRNA modifications in the transcriptome was hinted upon, before the emergence of the epitranscriptome, by a previous study by Li et al. (), which reported the widespread occurrence of RNA-DNA sequence differences of all nucleotide types. Nonetheless, this study was widely disputed by many other groups due to technical issues associated with the study (–). To date, it is unclear whether these observations correspond to artifacts or to biological signals and whether they might be related to the base pair–disrupting RNA modifications assessed in our study. We then asked whether some of these previously reported but disputed RNA-DNA sequence differences might in fact be RNA modification signals. To do so, we focused on mRNA modification sites identified in our study and assessed whether RNA-DNA sequence differences can be observed at these sites. Specifically, we examined the mismatch patterns at the m1A, m22G, and m3C mRNA modification sites on CTC1-tRNA1 across 934 RNA-seq libraries in the CCLE. While two or more types of mismatches were observed at each of these mRNA modification sites (Fig. 4A), a single type of mismatch was exclusively observed in some samples. Specifically, m1A sites generated not only multinucleotide mismatch signals (A>T,G,C) but also A>T, A>G, or A>C mononucleotide mismatch signals in some samples. At the m22G site, we observed G>T,A multinucleotide mismatch signals in some samples, while only G>T or G>A signals in others. Similarly, at the m3C mRNA modification site, C>T,A, C>T, or C>A mismatch signals were observed. Comparatively, at the inosine site, only A>G mononucleotide mismatches were observed. Notably, similar observations were made at the m1A site on MALAT1 (fig. S17A), and the m1A, m22G, and inosine mRNA modification sites on CTC1-tRNA1 and CTC1-tRNA3 (fig. S17, B to E). The number of different types of mismatches in these samples also seems to be correlated to the sequencing depth, proportion of mismatches, and number of variant reads at the m1A and m22G mRNA modification sites (Fig. 4B). A single type of mismatches was possibly observed at some of these sites due to stochastic random sampling and modification of these sites at lower levels. Together, these results indicate that multiple types of RNA-DNA sequence differences can be induced by mRNA modifications. More broadly, although the study by Li et al. () was widely disputed by many other groups (–) and is still widely disregarded to date, our results indicate that there was a degree of conceptual validity in its claim that the widespread observation of RNA-DNA differences in the human transcriptome represents a yet unexplored aspect of genome variation, which we have found in this study to be caused by the epitranscriptome.
Fig. 4

The epitranscriptome contributes to multiple types of RNA-DNA sequence differences, which escape detection by Sanger sequencing.

(A) mRNA modifications induce multiple types of RNA-DNA sequence differences observable by standard RNA-seq. Mismatches (base quality ≥Q30) at three types of base pair–disrupting mRNA modifications (m1A, m22G, and m3C) and the inosine site (which does not disrupt base pairing) on CTC1-tRNA–like element 2 in the CCLE RNA-seq datasets are as indicated. The symbols “*” and “#” highlight mismatches typically interpreted as A-to-I and C-to-U RNA editing, respectively. (B) Relationship between types of mismatches at mRNA modification sites and the depth, number of variant reads, and mismatch proportion. (C) Base pair–disrupting RNA modifications induce deletions in RNA-seq across a large proportion of CCLE samples. (D) Mismatches at RNA modification sites are readily detected by RNA-seq but escape detection by Sanger sequencing. Mismatch rates of base pair–disrupting RNA modifications assessed by RNA-seq and Sanger sequencing in two cell lines are indicated. RNA-seq data were from CCLE, while Sanger sequencing was performed using independently extracted RNA from the same cell line. (E) Model explaining why RNA modifications that disrupt Watson-Crick base pairing escape detection by Sanger sequencing when a typical MMLV RT was used for cDNA synthesis. Conversely, mismatches should be readily observed when TGIRT RT is used. (F) Base pair–disrupting RNA modifications are undetectable by Sanger sequencing with typical RT (MMLV) but can be readily observed with TGIRT RT. No mismatches were detected with MMLV but were clearly observed with TGIRT at one of the sites. RT was performed on the same RNA sample using two different RT enzymes.

The epitranscriptome contributes to multiple types of RNA-DNA sequence differences, which escape detection by Sanger sequencing.

(A) mRNA modifications induce multiple types of RNA-DNA sequence differences observable by standard RNA-seq. Mismatches (base quality ≥Q30) at three types of base pair–disrupting mRNA modifications (m1A, m22G, and m3C) and the inosine site (which does not disrupt base pairing) on CTC1-tRNA–like element 2 in the CCLE RNA-seq datasets are as indicated. The symbols “*” and “#” highlight mismatches typically interpreted as A-to-I and C-to-U RNA editing, respectively. (B) Relationship between types of mismatches at mRNA modification sites and the depth, number of variant reads, and mismatch proportion. (C) Base pair–disrupting RNA modifications induce deletions in RNA-seq across a large proportion of CCLE samples. (D) Mismatches at RNA modification sites are readily detected by RNA-seq but escape detection by Sanger sequencing. Mismatch rates of base pair–disrupting RNA modifications assessed by RNA-seq and Sanger sequencing in two cell lines are indicated. RNA-seq data were from CCLE, while Sanger sequencing was performed using independently extracted RNA from the same cell line. (E) Model explaining why RNA modifications that disrupt Watson-Crick base pairing escape detection by Sanger sequencing when a typical MMLV RT was used for cDNA synthesis. Conversely, mismatches should be readily observed when TGIRT RT is used. (F) Base pair–disrupting RNA modifications are undetectable by Sanger sequencing with typical RT (MMLV) but can be readily observed with TGIRT RT. No mismatches were detected with MMLV but were clearly observed with TGIRT at one of the sites. RT was performed on the same RNA sample using two different RT enzymes. Noting that mRNA modifications can induce multiple types of RNA-DNA sequence differences, including those that are typically generated by RNA editing processes, we then asked whether RNA modification sites might have been misinterpreted as RNA editing sites. Specifically, processes mediated by m1A and m3C mRNA modifications can also generate A>G and C>T RNA-DNA sequences (Fig. 4A), which are typically attributed to A-to-I and C-to-U RNA editing. We note that m1A mRNA/lncRNA modification sites on MALAT1 and CTC1 (CTC1-tRNA site 3), highlighted in our study, were misannotated as sites of A-to-I editing by the widely used RADAR and REDIportal databases (, ). Furthermore, we also noticed that deletions can be detected in >70% of the samples at some of these sites of mRNA modifications from RNA-seq, but not from DNA sequencing (Fig. 4C), suggesting that deletions observed from RNA-seq may not always be indicative of a corresponding presence of genomic deletions. Together, our results put forth mRNA modifications as a previously poorly appreciated cause for observed single-nucleotide variations and deletions observed from RNA-seq and also draw caution to RNA-seq–based approaches for variant detection.

RNA-DNA sequence differences induced by RNA modifications escape detection by Sanger sequencing

Although our study indicates that noncanonical RNA-DNA sequence differences induced by mRNA modifications are widely observed by RNA-seq, previous attempts to validate such signals by Sanger sequencing had largely failed (). We thus questioned whether it might be possible to detect RNA-DNA sequence differences induced by RNA modifications using Sanger sequencing. As such, we subjected several well-characterized and highly modified base pair–disrupting RNA modification sites found on rRNAs to Sanger sequencing validation. Unexpectedly, while the m1A site at 28S rRNA:1322, the m3U site at 28S rRNA:4530, and the m1acp3Ψ site at 18S rRNA:1248 are well-established base pair–disrupting RNA modification sites with high level of mismatches when assessed by RNA-seq, mismatches were predominantly not detected at these sites via Sanger sequencing (Fig. 4D). Notably, these sites were highly confident sites with very high mismatch rates across 934 CCLE RNA-seq samples (fig. S6C), and our failure to observe mismatches at these sites from Sanger sequencing seems particularly enigmatic. Thus, RNA-DNA sequence differences induced by mRNA modifications seem to escape detection by Sanger sequencing. There is, however, no clear explanation for why these mRNA modifications might escape detection by Sanger sequencing, despite causing clear RNA-DNA sequence difference signals by RNA-seq. Prior studies had highlighted how RNA modifications such as m1A can cause the commonly used MMLV (Moloney Murine Leukemia Virus) RT to truncate and stall at these sites (, , ). We then questioned whether this observation might be caused by the stalling of the commonly used RT (MMLV) at these RNA modification sites. Specifically, we hypothesized that because these truncated cDNAs derived from modified RNA cannot be amplified by flanking polymerase chain reaction (PCR) primers (Fig. 4E), RNAs carrying these RNA modifications will be underrepresented during Sanger sequencing validation. To address this, we then use a recently invented thermostable group II intron reverse transcriptases (TGIRT) RT () for Sanger sequencing validation. Notably, this RT enzyme is able to bypass these modification sites during cDNA synthesis to generate full-length cDNAs (, ). In which case, mismatches at sites of base pair–disrupting RNA modifications, which were originally undetectable via Sanger sequencing with typically used RT (MMLV), should now be observed (Fig. 4E). We thus used two different RTs, MMLV and TGIRT, to generate cDNA from the same RNA sample and subjected them to Sanger validation. Consistent with this model, no mismatches were observed at the m1acp3Ψ base pair–disrupting RNA modification site when MMLV was used for cDNA synthesis (Fig. 4F), but a 21% mismatch rate can be observed when TGIRT was used for cDNA synthesis. At the same time, at another base pair–disrupting RNA modification site, m3U, a mismatch rate of 28% was noted when MMLV was used for cDNA synthesis, but a grossly larger mismatch rate of 82% was now detectable after TGIRT was used for cDNA synthesis. As such, these results showing the different propensity of the MMLV and TGIRT RT to generate mismatches during Sanger sequencing provide an explanation for how RNA modifications have escaped detection by Sanger sequencing for so many years, despite causing clear and widespread RNA-DNA sequence differences during RNA-seq.

Sanger sequencing validation of mRNA and lncRNA modification sites identified by ModTect

Having established the importance of using the TGIRT RT for Sanger sequencing validation experiments for base pair–disrupting RNA modification, we then attempted to validate the mRNA and lncRNA modification sites identified by ModTect with Sanger sequencing. Using the CCLE RNA-seq dataset, we identified two cancer cell lines (KMS-12-BM and OKV-18) with high modification levels at the candidate RNA modification sites. RNA samples from these cell lines were obtained, and complete digestion of genomic DNA (gDNA) was ensured. cDNA in these two cell lines was then generated from RNA using the TGIRT RT, and flanking primers were then designed to amplify gDNA and cDNA for Sanger sequencing (see Materials and Methods). Clear mismatches (A>T) with a mismatch rate of 76 to 80% could be observed at the m1A site on MALAT1 on the cDNA but not on the gDNA in both the KMS-12-BM and OKV-18 cell lines (Fig. 5A). A moderate level of A>G mismatches (9.4%) was also detected at the m1A site on prune exopolyphosphatase (PRUNE) in the KMS-12-BM cell line on the cDNA, but not on the gDNA by Sanger sequencing (Fig. 5B). We then designed primers to validate RNA modification sites we identified on tRNA-like elements within CTC1 (Fig. 5C). Notably, primers were designed to sit outside each tRNA-like element to ensure amplification of CTC1 mRNA rather than tRNAs. Further, using BLAST (), we also confirmed that each primer set mapped uniquely to the CTC1 locus at the positions we designed. With these primer sets, Sanger sequencing validation of RNA modifications found on the CTC1 locus was then performed.
Fig. 5

Sanger sequencing validation of mRNA and lncRNA modification sites identified by ModTect.

Sanger sequencing validation was performed on gDNA and on cDNA generated using TGIRT. Experiments were performed using the KMS-12-BM (left) and OKV-18 (right) cancer cell lines. The mismatch type and mismatch rate observed at the RNA modification site with Sanger sequencing are as indicated. Experiments were performed for (A) the m1A site on MALAT1 (chr11:65273630) and (B) the m1A site on PRUNE (chr1:150980982). (C) Schematic depicting the location of primers used to perform Sanger sequencing experiments of RNA modification sites found on tRNA-like elements 2 and 3 on the CTC1 gene. Sanger sequencing experiments were similarly performed for (D) the m1A (chr17:8129943) and m22G (chr17:8129984) sites on CTC1-tRNA3 and (E) the m1A (chr17:8129568) and m22G (chr17:8129600) sites on CTC1-tRNA2. Note that the CTC1 gene is transcribed from right to left as indicated in (C) and Fig. 3A and Sanger sequencing results represent the reverse complement of the actual CTC1 mRNA. In some cases, an inosine site (A>G) was also detected, and these sites are highlighted in purple.

Sanger sequencing validation of mRNA and lncRNA modification sites identified by ModTect.

Sanger sequencing validation was performed on gDNA and on cDNA generated using TGIRT. Experiments were performed using the KMS-12-BM (left) and OKV-18 (right) cancer cell lines. The mismatch type and mismatch rate observed at the RNA modification site with Sanger sequencing are as indicated. Experiments were performed for (A) the m1A site on MALAT1 (chr11:65273630) and (B) the m1A site on PRUNE (chr1:150980982). (C) Schematic depicting the location of primers used to perform Sanger sequencing experiments of RNA modification sites found on tRNA-like elements 2 and 3 on the CTC1 gene. Sanger sequencing experiments were similarly performed for (D) the m1A (chr17:8129943) and m22G (chr17:8129984) sites on CTC1-tRNA3 and (E) the m1A (chr17:8129568) and m22G (chr17:8129600) sites on CTC1-tRNA2. Note that the CTC1 gene is transcribed from right to left as indicated in (C) and Fig. 3A and Sanger sequencing results represent the reverse complement of the actual CTC1 mRNA. In some cases, an inosine site (A>G) was also detected, and these sites are highlighted in purple. Notably, clear mismatches could be observed on the m1A and m22G mRNA modification sites found on both CTC1-tRNA–like elements 2 and 3 on the cDNA level, but not on the gDNA level (Fig. 5, D and E). At the m1A site on CTC1-tRNA3, A>G mismatches could be observed with a relatively high mismatch rate of 82 and 63% in the KMS-12-BM and OKV-18 cancer cell lines, respectively (Fig. 5D). High levels of G>A mismatches were similarly found on the m22G site on CTC1-tRNA3 in both KMS-12-BM (69%) and OKV-18 (95%) (Fig. 5D). On CTC1-tRNA2, a lower but relatively high level of A>G mismatches was also observed on the m1A site in KMS-12-BM (39%) and OKV-18 (23%). G>A mismatches were similarly identified on the m22G RNA modification site on CTC1-tRNA2 in KMS-12-BM (51%) and OKV-18 (42%). Notably, using two different pairs of primers targeting CTC1-tRNA2 and which are also situated outside the tRNA-like elements, we could similarly observe mismatches at each of these sites (fig. S18), suggesting that RNA modifications found on the CTC1 locus were indeed present on the CTC1 mRNA rather than tRNAs. The lower mismatch rate at RNA modification sites on CTC1-tRNA2 than on CTC1-tRNA3 we observed with Sanger sequencing is concordant with our findings using RNA-seq (Fig. 3D). Together, these results confirm that RNA modification sites found within the CTC1 locus by ModTect are real and are present on CTC1 mRNA.

Benchmarking of ModTect with other approaches for prediction of RNA modifications

HAMR and the random forest classifier are two approaches that were previously developed for the detection of RNA modifications from RNA-seq datasets (, ). However, it is unclear how these approaches compare to ModTect. We then obtained these software tools and sought to benchmark them against ModTect. We could readily implement HAMR and apply it to the 935 CCLE RNA-seq datasets used for our evaluation. On the other hand, because the random forest classifier was implemented as a website using the Galaxy platform, we were unable to execute it on our compute cluster and apply it to the full CCLE RNA-seq dataset. As such, benchmarking experiments were only performed between ModTect and HAMR. To benchmark ModTect with HAMR, we selected 10 representative RNA modification sites consisting of those that have been reported in prior literature and those that we have found and validated in our current study (Fig. 6 and table S4). ModTect and HAMR analysis were then performed on the ±100–base pair (bp) region of each site on 935 CCLE RNA-seq datasets to establish (i) whether these approaches were able to capture these RNA modifications in the CCLE dataset and (ii) to what extent the neighboring regions are also identified as RNA modifications. The flanking ±100-bp regions were selected as “background” sites since no RNA modifications were presumably present in these regions. As such, these sites can serve as “negative controls” and give an indication of the precision of the calls of both approaches.
Fig. 6

Benchmarking of ModTect against HAMR at 10 representative RNA modification sites.

ModTect and HAMR were assessed for their ability to capture base pair–disrupting RNA modifications from standard RNA-seq libraries at 10 representative RNA modification sites. A total of 935 CCLE RNA-seq datasets were used for the evaluation, and the number of samples detected by each approach at each of the 10 sites is indicated in the bar plots.

Benchmarking of ModTect against HAMR at 10 representative RNA modification sites.

ModTect and HAMR were assessed for their ability to capture base pair–disrupting RNA modifications from standard RNA-seq libraries at 10 representative RNA modification sites. A total of 935 CCLE RNA-seq datasets were used for the evaluation, and the number of samples detected by each approach at each of the 10 sites is indicated in the bar plots. Using a modification score cutoff of ≥1 in ModTect, as per our initial discovery for RNA modifications, 4030 true RNA modification sites were recovered from the 935 RNA-seq datasets (table S4). On the other hand, using the “H4 hypothesis” and the recommended parameters in HAMR (see Materials and Methods), only 978 true-positive sites could be recovered (table S4), suggesting that ModTect is able to recover more sites than HAMR (4.1× more true sites). We then assessed whether ModTect and HAMR might be able to capture some of these sites better than other. We see that ModTect performed much better than HAMR in capturing sites found on mRNAs and lncRNAs (MALAT1, ND5, PRUNE, PQLC2, and CTC1) (Fig. 6). On the other hand, both ModTect and HAMR performed similarly in the recovery of the mitochondrial rRNA m1A site at MT:2617 (ModTect, 935 sites; HAMR, 904 sites) (table S4). Considering only nine mRNA/lncRNA sites, we see that ModTect was able to identify 41.8× more true sites than HAMR. On further evaluation, we note that the mitochondrial rRNA site that both ModTect and HAMR identified had a high coverage and mismatch rate (mean coverage, 25,130.1×; mean mismatch rate, 46.4%). On the other hand, sites identified by ModTect but not by HAMR had low sequencing coverage or low mismatch rate. For instance, the ND5 site missed by HAMR had a high coverage (mean, 14,234.3×) but a low mismatch rate (mean, 4.6%). Similarly, the MALAT1 site that is frequently omitted by HAMR had a high mismatch rate (mean, 48.8%) but a low coverage (mean, 46.7×). Assessing the sensitivity and precision of ModTect and HAMR at these sites, we note that ModTect was more sensitive than HAMR, while having better or comparable precision in calls for RNA modifications even when the parameters in HAMR were varied (table S5). Together, this suggests that ModTect is better than HAMR at identifying mRNA/lncRNA modification sites, which typically have lower coverages or mismatch rates.

Some types of RNA modifications are more strongly associated with cancer in large cohorts of patients with cancer

More than 170 different types of RNA modifications are known to exist in the epitranscriptome, though which of these modifications are associated with cancer in cancer patients remains a largely unaddressed question. As TCGA consortium has generated a huge resource of 11,371 cancer patient RNA-seq datasets representing 33 tumor types, this represents a useful resource for us to identify potential dysregulation in RNA modification levels. To establish whether the dataset is suitable for the analysis of RNA modification levels, we first assessed the dataset for potential batch effects. By examining the relevant publications, we found that essentially all samples were generated using the SuperScript II MMLV RT (table S6). Further, 87.4% of all samples were all generated at the same sequencing center (table S7). Further evaluation of a few reference RNA modification sites in the samples indicates little technical variability within each tumor type (Supplementary Text). Some technical variability was observed between tumor types and is related to the sequencing center that generated these datasets. However, these technical variabilities were limited and affected only a small fraction of tumor types (5 of 33) and samples (between 3 and 24%) (Supplementary Text). Thus, after accounting for the sequencing center responsible for data generation, the TCGA dataset can be applied to identify potential dysregulation in RNA modification levels in patients with cancer. We then proceeded to analyze and identify potential dysregulation in RNA modification levels in the TCGA cohort. Specifically, we argue that if a type of RNA modification is more extensively dysregulated in patients with cancer than another, it is more strongly associated with cancer and can possibly play a more important role in tumorigenesis than another. As such, we examined the degree of dysregulation of different types of RNA modifications across 11,371 cancer patient RNA-seq datasets generated by TCGA at each of the 9823 sites identified in our study (table S8). Furthermore, noting that residual rRNAs can be readily detected in all libraries (figs. S19 to S21 and tables S9 and S10), we further exploited these reads to assess the modification levels at the three base pair–disrupting rRNA modification sites, 28S:1322 (m1A), 28S:4530 (m3U), and 18S:1248 (m1acp3Ψ) (tables S11 to S13). Notably, some tumor type–specific differences in rRNA ratios, which were unrelated to the sequencing center, were observed (fig. S22). Modification levels were highest at rRNA modification sites (~40 to 80%) and much lower in mRNA/lncRNA modification sites (~5 to 30%) (Fig. 7A and tables S14 and S15), suggesting that rRNA sites are more highly modified than mRNA/lncRNA sites.
Fig. 7

Different types of RNA modifications show different propensities to be dysregulated among 33 different cancer types and across 11,371 cancer patient samples.

(A) Some modification types and sites are much more variable than others and are thus more greatly dysregulated. Boxplots depict modification rate at base pair–disrupting mRNA and rRNA modification sites in 11,371 TCGA samples. Variability of modification rate at each site is indicated by the size of the boxplot. Sites that correspond to mRNA modification sites on the CTC1 gene are labeled in red text. Sites on rRNAs are labeled in blue text. Sites indicated by an asterisk (*) correspond to m1A sites. The heatmap depicts the mean, variance, and coefficient of variation of the modification rate of each site in the TCGA cohort. (B) Variability of RNA modification levels of each type of RNA modification in 11,371 cancer patient samples. Variability of each type of RNA modification is calculated on the basis of the average of the coefficient of variation of each type of RNA modification. (C) RNA modification level at the m1A lncRNA modification site on MALAT1 (chr11:65273630) across 9537 patient tumor RNA-seq datasets from the TCGA of different cancer types. Related to figs. S24 and S25. Labels represent TCGA disease cohorts’ abbreviations. Detailed sample types and number of samples are as indicated in the Supplementary Materials. Sequencing center is as indicated. (D) Heatmap depicting RNA modification levels of base pair–disrupting RNA modifications across different cancer types. RNA modification levels are represented as Z scores for each site. UNC, University of North Carolina; BCGSC, Canada’s Michael Smith Genome Sciences Centre. (E) t-SNE plot of RNA modification enzyme expression of 11,000+ TCGA cancer patient samples. Labels represent TCGA disease cohorts’ abbreviations.

Different types of RNA modifications show different propensities to be dysregulated among 33 different cancer types and across 11,371 cancer patient samples.

(A) Some modification types and sites are much more variable than others and are thus more greatly dysregulated. Boxplots depict modification rate at base pair–disrupting mRNA and rRNA modification sites in 11,371 TCGA samples. Variability of modification rate at each site is indicated by the size of the boxplot. Sites that correspond to mRNA modification sites on the CTC1 gene are labeled in red text. Sites on rRNAs are labeled in blue text. Sites indicated by an asterisk (*) correspond to m1A sites. The heatmap depicts the mean, variance, and coefficient of variation of the modification rate of each site in the TCGA cohort. (B) Variability of RNA modification levels of each type of RNA modification in 11,371 cancer patient samples. Variability of each type of RNA modification is calculated on the basis of the average of the coefficient of variation of each type of RNA modification. (C) RNA modification level at the m1A lncRNA modification site on MALAT1 (chr11:65273630) across 9537 patient tumor RNA-seq datasets from the TCGA of different cancer types. Related to figs. S24 and S25. Labels represent TCGA disease cohorts’ abbreviations. Detailed sample types and number of samples are as indicated in the Supplementary Materials. Sequencing center is as indicated. (D) Heatmap depicting RNA modification levels of base pair–disrupting RNA modifications across different cancer types. RNA modification levels are represented as Z scores for each site. UNC, University of North Carolina; BCGSC, Canada’s Michael Smith Genome Sciences Centre. (E) t-SNE plot of RNA modification enzyme expression of 11,000+ TCGA cancer patient samples. Labels represent TCGA disease cohorts’ abbreviations. RNA modification levels at sites that are dysregulated in patients with cancer are likely to show a greater degree of variation. To assess whether some RNA modification sites are more variable than others and thus more likely to be dysregulated, we calculated the variance of modification levels at each of these sites. Notably, some modification types and modification sites were found to be much more variable than others (Figs. 7A and 7B and fig. S23A). For instance, on rRNAs, the modification levels at the 18S:1248 (m1acp3Ψ) and MT:2617 (m1A) sites were highly variable across the different samples (Fig. 7A). Conversely, a more or less consistent modification rate was observed across these samples for the 28S:4530 (m3U) and 28S:1322 (m1A) rRNA sites (Fig. 7A and fig. S24). In general, mRNA/lncRNA modification sites also tended to be more variable (Fig. 7A). For instance, high degree of variation in modification levels was observed at the m1A sites on MALAT1 (11:65273630), ND5 (MT:13710), and CTC1 (17:8129568). Notably, sites that are more poorly modified also tended to be much more variable (Fig. 7A). Overall, across these patient samples, we found that m3U and m1acp3Ψ RNA modifications tend to be more stable, while those of m1A and m22G tend to be much more variable and dynamic (Fig. 7B and fig. S23). Together, this suggests that modification levels at some modification sites and types tend to be much more dynamic and may be more dysregulated than others in patients with cancer. We then asked whether the same type of RNA modification might be modified to different degrees in different contexts. To address this, we examined levels of modification at each of the m1A sites in our study. Among all m1A sites assessed, modification levels were highest among the two m1A modification sites on 28S rRNA (28S:1322) and mitochondria rRNA (MT:2617), with modification levels as high as 60 to 80% (Fig. 7A). Conversely, the modification levels on the m1A sites on mRNAs tended to be lower (0 to 40%). The m1A sites on MALAT1 (11:65273630), CTC1 (17:8129568), and ND5 (MT:13710) were the most highly modified sites on mRNAs and lncRNAs with modification levels of ~10 to 40%. Conversely, all other m1A sites (e.g., MT:14597 and MT:7882) were modified at much lower levels in the TCGA cohort (<10%) (Fig. 7A). These results indicate that the same type of RNA modifications can be installed at markedly different levels in different contexts in patients with cancer, with much higher levels of modification more likely to be found at a small set of m1A sites as compared to others. We further asked whether this phenomenon is restricted to m1A RNA modifications or might more broadly represent other RNA modification types. To do so, we assessed levels of modifications at sites corresponding to a wide variety of modifications encompassing m3U, m1acp3Ψ, m22G, and other undetermined modifications. Notably, the m3U site on 28S rRNA (28S:4530) and m1acp3Ψ RNA modification sites on 18S rRNA (18S:1248) were both found to have a relatively high level of modification of ~40 to 60% across the TCGA cohort (Fig. 7A). Fairly high level of modifications of ~20 to 40% were also found on three sites of undetermined modification types (10:14610664, 9:138455224, and 21:45567606). Conversely, m22G sites identified in our study tended to be relatively poorly modified, with the two m22G mRNA modification sites on CTC1 (17:8129984 and 17:8129600) found to have a fairly low level of modification of only ~5 to 10% in the TCGA samples (Fig. 7A). Low levels of modifications were also identified on many other base pair–disrupting RNA modification sites of unknown modification types. Thus, similar to what we observed with m1A, most sites corresponding to other modification types had low levels of modification (<5 to 10%) in the TCGA samples, apart from those on rRNAs and some “hotspot modification” sites on mRNAs. Different modification sites may potentially play different contributory roles in different types of cancer. To assess whether RNA modification sites might show different patterns of modification in patients of different cancer types, we further assessed the modification levels of these patient samples by tumor types. At the m1A RNA modification site on MALAT1, strong tumor type–specific differences in modification levels were noted, with tumor types such as prostate adenocarcinoma (PRAD) and pheochromocytoma and paraganglioma (PCPG) showing modification levels as low as ~20%, while tumor types such as acute myeloid leukemia (LAML) and diffuse large B cell lymphoma (DLBC) were found to have modification levels as high as 50 to 60% at this site (Fig. 7C). Comparatively, the RNA modification levels at 28S:1322 (m1A) were more or less consistent across different patient samples of different tumor types, with glioblastoma multiforme (GBM) cancer patient samples found to have a much lower mismatch level in comparison to other tumor types (Fig. 7D and fig. S24A), although this might be a technical issue caused by differences in the sequencing center. Modification levels at the m3U rRNA modification site at 28S:4530 were notably also highly consistent across multiple cancer types (Fig. 7D and fig. S24B). In contrast, modification levels at the m1acp3Ψ site at 18S:1248 showed a high degree of variability, between tumor types and within tumor types (Fig. 7D and fig. S25A). Tumor type–specific differences in modification levels at the m1A modification site on ND5 site could also be observed, with modification levels varying from as low as ~2% in the cancer types uveal melanoma (UVM) and PCPG, to as high as ~20% in thyroid carcinoma (THCA) and adrenocortical carcinoma (ACC) (Fig. 7D and fig. S25B). Curiously, the modification levels of mitochondria m1A RNA modification sites (MT:7882, MT:7375, MT:14597, and MT:13710) tended to cluster together, while the m1A RNA modification sites on rRNA sites and MALAT1 clustered separately (28S:1322, MT:2617, and 11:65273630) (Fig. 7D), suggesting that modification of these two groups of sites might be installed by two distinct m1A modification enzymes. Together, these findings indicate that RNA modification levels can vary drastically across patients of different cancer types. To assess whether these tumor type–specific differences in RNA modification levels might be related to differences in expression of RNA modification enzymes in different cancer types, we further assessed the expression of all known RNA modification enzymes (n = 74) in these samples (table S16). The gene expression of RNA modification enzymes in the tumor types lower-grade glioma (LGG), GBM, THCA, PCPG, liver hepatocellular carcinoma (LIHC), and ovarian serous cystadenocarcinoma cancer (OV) showed a highly tumor type–specific pattern of expression across the TCGA samples (Fig. 7E). Notably, the expression of these RNA modification enzymes was also highly distinct from other tumor types. At the same time, tumor type–specific expression of RNA modification enzymes could also be observed in tumor types such as uterine corpus endometrial carcinoma (UCEC), breast invasive carcinoma (BRCA), and colon adenocarcinoma (COAD) (Fig. 7E), although these tumor types tend to have more similar RNA modification enzyme expression with other tumor types as well. Conversely, tumor type–specific expression of RNA modification enzymes was not observed in tumor types such as lung squamous cell carcinoma (LUSC), lung adenocarcinoma (LUAD), cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC), and pancreatic adenocarcinoma (PAAD) (Fig. 7E). Furthermore, to explore whether it might be possible to identify RNA modification enzymes responsible for driving modification at each site, we correlated the expression of RNA modification enzymes with the modification levels at each site. Curiously, the N6-methyladenosine (m6A) RNA methyltransferase methyltransferase like 3 (METTL3) showed fairly strong correlation (0.52) with modification levels of the m1A site on MALAT1 (fig. S26 and table S17). In contrast, weaker correlation (0.27) was observed between this site and the expression of TRMT61A, a tRNA m1A methyltransferase. Together, this suggests that RNA modification enzymes show tumor type–specific differences in gene expression for some cancer types, and may help drive different RNA modification landscapes.

Dysregulation of RNA modification levels is clinically associated with disease progression and survival outcomes

We then asked whether the dysregulation of RNA modification levels might be related to disease progression and survival outcomes of patients with cancer. To assess whether levels of RNA modifications might be different between tumor and normal samples collected from the same patients with cancer, we then compared the modification rate at RNA modification sites between tumor and non-tumor (“normal”) samples to determine possible involvement of RNA modifications in the tumorigenesis process (Fig. 8, A to D). The 18S:1248 (m1acp3Ψ) site, which sits at the decoding center of the ribosome, typically shows ~10 to 15% lower level of modification in tumor as compared to normal patient samples, across multiple tumor types (Fig. 8A and fig. S27). These differences in modification levels between tumor and non-tumor samples were especially prominent for the tumor types rectum adenocarcinoma (READ), UCEC, and COAD. Comparatively, no significant differences were noted between tumor and non-tumor samples for the rRNA modification sites 28S:4530 (m1A) and 28S:4530 (m3U) (figs. S28 and S29). Notably, tumor- versus non–tumor-specific differences in RNA modification levels were also detected for LUSC, esophageal carcinoma (ESCA), and BRCA at the m1A site on CTC1 (chr17:8129568); for LUSC, kidney renal papillary cell carcinoma (KIRP), and BRCA at the m1A site on MALAT1 (chr11:65273630); and an A-type modification site on NDUFS1 (chr2:206986978) (Fig. 8, B to D). Together, these results suggest that levels of various types of RNA modifications are different between tumor and normal samples collected from patients with cancer and may be dysregulated in multiple types of human cancers.
Fig. 8

The epitranscriptome is dysregulated across multiple types of human cancers and is associated with clinical outcomes.

(A to D) Modification levels of RNA modification sites in cancer and matched normal patient samples of different cancer types. The modification rate of the adjacent normal tissue and primary tumor samples is as indicated. The genomic position, gene corresponding to the site, and type of modification are indicated at the top of each plot. Cancer type information is also indicated at the top of each pair of boxplots. Specifically, the different boxplots represent different disease cohorts from the TCGA. Cohorts represented in the plots are BRCA, COAD, ESCA, head and neck squamous cell carcinoma (HNSC), KIRP, LUAD, LUSC, READ, and UCEC. (E) Levels of RNA modification at the m1A mRNA modification site on the ND5 gene correlates with survival outcome. Plot depicts the kidney renal clear cell carcinoma (KIRC) cancer type. Note that the modification score is better than the modification rate in segregating patients by survival outcome, indicating that it is a better prognostic marker. A representative plot is depicted for the m1A mRNA modification site on the ND5 gene. (F and G) The level of modification at RNA modification sites, as represented by the modification score, is correlated with survival outcome and stages of disease progression for (F) KIRC and (G) LIHC.

The epitranscriptome is dysregulated across multiple types of human cancers and is associated with clinical outcomes.

(A to D) Modification levels of RNA modification sites in cancer and matched normal patient samples of different cancer types. The modification rate of the adjacent normal tissue and primary tumor samples is as indicated. The genomic position, gene corresponding to the site, and type of modification are indicated at the top of each plot. Cancer type information is also indicated at the top of each pair of boxplots. Specifically, the different boxplots represent different disease cohorts from the TCGA. Cohorts represented in the plots are BRCA, COAD, ESCA, head and neck squamous cell carcinoma (HNSC), KIRP, LUAD, LUSC, READ, and UCEC. (E) Levels of RNA modification at the m1A mRNA modification site on the ND5 gene correlates with survival outcome. Plot depicts the kidney renal clear cell carcinoma (KIRC) cancer type. Note that the modification score is better than the modification rate in segregating patients by survival outcome, indicating that it is a better prognostic marker. A representative plot is depicted for the m1A mRNA modification site on the ND5 gene. (F and G) The level of modification at RNA modification sites, as represented by the modification score, is correlated with survival outcome and stages of disease progression for (F) KIRC and (G) LIHC. To determine whether RNA modification levels might be related to survival outcomes, we compared levels of RNA modifications at specific sites with survival outcomes. Notably, we found that lower modification levels at the m1A site on ND5 (MT:13710) were associated with better survival outcomes in patients suffering from kidney renal clear cell carcinoma (KIRC) (P = 0.00025) (Fig. 8E). Moreover, we also found that the modification score was a better metric than the modification rate in segregating the patients by survival outcome (P = 4.2 × 107) (Fig. 8E). Differences in survival outcomes were also noted for the m1A site on MALAT1 (11:65273630) and mitochondria rRNA (MT:2617) for patients with KIRC and LIHC, respectively (Fig. 8, F and G). Differences in modification levels as represented by the modification score were also associated with diseased stage (Fig. 8, F and G). Similar observations were also made for other modification sites and tumor types (fig. S30). Our work thus links differences in RNA modification levels at specific RNA modification sites to tumorigenesis and to clinical outcomes in patients with cancer, which suggests that RNA modifications can serve as a prognostic biomarker. More broadly, our results also highlight the possible dysregulation of the epitranscriptome in a wide range of human cancers and its possible involvement in cancer progression of patients with cancer, which will necessitate further exploration experimentally.

DISCUSSION

The study of RNA modifications in large clinical cohorts can lead to the discovery of potentially interesting relationships between the epitranscriptome and human diseases, although this has been technically challenging to date. In this study, we developed ModTect to enable the study of RNA modifications in large clinical cohorts using standard RNA-seq datasets that are typically generated as part of these studies. ModTect relies on two basic principles for the detection of RNA modifications from RNA-seq datasets: (i) the presence of a multinucleotide mismatch signal (i.e., two or more types of nucleotide mismatches) and (ii) the presence of a deletion signal. Notably, across thousands of patient-derived RNA-seq datasets from TCGA, these two features seem to be a universal phenomenon of multiple types of RNA modifications that disrupts Watson-Crick base pairing (tables S11 to S13). Noting that these multinucleotide mismatch signals can also be induced by sequencing errors, we developed a robust statistical model to distinguish RNA modification sites from sequencing noise. The use of a deletion signal on top of the multinucleotide mismatch signal, coupled with a large number of stringent filters, also greatly increased the precision of our calls. In combination, our approach enables multiple types of RNA modifications to be confidently discovered and profiled in clinical cohorts without prior knowledge of where these sites are as shown in this study. In contrast to previous approaches for detecting RNA modifications, ModTect can more reliably detect mRNA modifications from standard RNA-seq libraries and is more suited for the study of large clinical cohorts. Previous computational approaches developed for the study of RNA modifications relied upon a database of misincorporation signatures corresponding to different types of RNA modifications to either identify potential RNA modification sites () or assign these sites to RNA modification types (). However, these signals can be sensitive to the experimental conditions applied and to the modification of interest (). To our understanding, these approaches have not been successfully applied to the discovery of mRNA modifications in the human transcriptome with standard RNA-seq datasets (). In contrast to the random forest classifier (, ), ModTect does not rely upon a database of misincorporation profiles for the prediction of candidate RNA modifications and is able to identify potential RNA modification sites with misincorporation profiles that differ drastically from those that were previously recorded. Further, in contrast to HAMR, ModTect does not make use of a binomial model on the nucleotide counts but rather a specialized statistical model meant to detect the multinucleotide mismatch signal while simultaneously accounting for the base qualities. As such, ModTect can identify potential RNA modification sites more sensitively while distinguishing them from sequencing noise. Here, the ModTect statistical model only requires the presence of two or more types of variant nucleotides and a deletion signal, although the downside of this approach is that it is not possible to determine the modification type at a site. Nonetheless, ModTect still proves to be an invaluable method for determining whether a site is a putative RNA modification site. Overall, ModTect is likely to be more reliable and predictive under the highly varied experimental conditions applied for large clinical cohorts like those analyzed in this study. Here, it is also important to distinguish ModTect from algorithms that were previously developed for the study of RNA editing (–), which are not suitable for the study of RNA modifications. While these methods have been extensively validated and have proved to be invaluable for the study of RNA editing in clinical cohorts and human samples (–), these approaches identify RNA editing sites by the presence of the “A>G” mismatch signal alone, rather than the multinucleotide mismatch signals that are characteristic of RNA modifications. Thus, these previous methods are unable to distinguish RNA modifications from other sources of RNA variation and are thus not suitable for the study of RNA modifications. Noting the difficulty in studying the epitranscriptome in clinical cohorts, we then applied ModTect to >11,000 clinically derived RNA-seq datasets for the study of multiple types of RNA modifications, which revealed the following relationships: (i) Levels of RNA modifications were potentially dysregulated in some cancer types (e.g., colon and rectal cancers); (ii) the epitranscriptome was associated with cancer progression and survival outcomes (e.g., m1A in liver and kidney cancers); and (iii) some types of RNA modifications such as m1A and m1acp3Ψ differed more than others (e.g., m3U) between tumor and normal samples collected from patients with cancer (Fig. 9A). Further, we also found that levels of RNA modifications tend to decrease the further the RNA modification site is from the CDS of a transcript in the CTC1 gene (Fig. 9B). We also observed higher modification rates on rRNAs than mRNAs. As is the case of m6A RNA modification that is introduced in rRNAs by METTL5 () and in mRNAs by METTL3 (–), such differences in RNA modification rates may be caused in part by different RNA modification enzymes used for these processes. Notably, although these relationships we identified in this study are associative in nature, they are nonetheless invaluable for defining directions for experimentation and in-depth study and may also prove useful for identifying biomarkers to stratify clinical outcomes for patients with cancer. Together, these findings highlight the value of studying the epitranscriptome in large clinical cohorts, which was empowered by the development of ModTect (fig. S31).
Fig. 9

Summary of findings made in this study.

(A) Analysis of 9823 base pair–disrupting mRNA and rRNA modification sites found by ModTect across 11,371 TCGA cancer patient samples revealed sites with RNA modifications levels that are strongly associated with tumor types and diseased states. (B) Model depicting the relationship between modification levels at tRNA-like elements and their relative positioning from the CDS. The different types of mRNA modifications (m1A, m22G, inosine, and m3C) that are detectable at tRNA-like elements embedded within mRNAs/lncRNAs are also indicated. Model is supported by figs. S15 and S16. (C) Summary of the types of RNA-DNA differences that are induced by RNA modifications. The 10 types of RNA-DNA differences that are generally interpreted as artifacts are as indicated. RNA-DNA differences that are conclusively shown in our study to be induced by mRNA modifications are also highlighted. Note that both A-to-G and C-to-U RNA-DNA sequence differences can be induced by both RNA editing and RNA modifications. m6,6A, N6,N6-dimethyladenosine; m1I, 1-methylinosine; m1G, 1-methylguanosine. In addition, note that these sites can be observed as both mononucleotide mismatch and multinucleotide mismatch sites

Summary of findings made in this study.

(A) Analysis of 9823 base pair–disrupting mRNA and rRNA modification sites found by ModTect across 11,371 TCGA cancer patient samples revealed sites with RNA modifications levels that are strongly associated with tumor types and diseased states. (B) Model depicting the relationship between modification levels at tRNA-like elements and their relative positioning from the CDS. The different types of mRNA modifications (m1A, m22G, inosine, and m3C) that are detectable at tRNA-like elements embedded within mRNAs/lncRNAs are also indicated. Model is supported by figs. S15 and S16. (C) Summary of the types of RNA-DNA differences that are induced by RNA modifications. The 10 types of RNA-DNA differences that are generally interpreted as artifacts are as indicated. RNA-DNA differences that are conclusively shown in our study to be induced by mRNA modifications are also highlighted. Note that both A-to-G and C-to-U RNA-DNA sequence differences can be induced by both RNA editing and RNA modifications. m6,6A, N6,N6-dimethyladenosine; m1I, 1-methylinosine; m1G, 1-methylguanosine. In addition, note that these sites can be observed as both mononucleotide mismatch and multinucleotide mismatch sites Unexpectedly, we also found that RNA modifications contribute to multiple types of RNA-DNA sequence differences during RNA-seq but, for some reason, escaped detection by Sanger sequencing. RNA-DNA sequence differences, beyond those mediated by RNA editing processes (i.e., noncanonical RNA-DNA sequence differences), were first reported in a widely scrutinized study in 2011 (), before the emergence of the epitranscriptome. However, because there were no established biological mechanisms at that time that could account for these observations and because of technical issues associated with this study (–), such signals are generally discounted as artifacts and typically ignored from most analyses despite being widely observed by us and others (–, , –). In contrast to the general line of thought in the field, our study provides concrete support for the existence of multiple types of noncanonical RNA-DNA sequence differences and established that the epitranscriptome is an important source of previously observed mononucleotide RNA-DNA sequence differences in mRNAs/lncRNAs (Fig. 9C). More specifically, we found that multiple types of RNA-DNA sequence differences were induced by RNA modifications that disrupt Watson-Crick base pairing during cDNA synthesis. Further, noting that prior studies had highlighted how the commonly used MMLV RT truncates at RNA modification sites (, , ), we show that the truncation of RT at these RNA modifications sites was what prevented these sites from being validated by Sanger sequencing previously (). Thus, our findings not only challenge the presumption that all types of RNA-DNA sequence differences beyond those mediated by RNA editing processes are artifacts but also highlight the importance of digging deeper into these observations, since they might represent disease driving RNA modifications. There were, however, a few limitations to our study. First, as the TCGA study was a multiyear project spanning thousands of patient samples, batch effects in the datasets can potentially confound some of our observations. Nonetheless, limited technical variability in RNA modification levels was observed in each tumor type (Supplementary Text). Further, while we note that the sequencing center responsible for data generation can confound the RNA modification levels, a large majority of the TCGA datasets used in our study (87.4% of all datasets representing 28 of 33 tumor types) were generated at the same site and were thus cross-comparable (Supplementary Text). Next, because of random incorporation of both correct and incorrect nucleotides at an RNA modification site during cDNA synthesis, the mismatch rate observed during RNA-seq is likely an underestimate of the actual modification ratio. Nevertheless, as the mismatch rate and the actual modification ratio are positively correlated with each other (), changes in the mismatch rate at the same site between different samples is likely to still indicate a change in the actual modification ratio. Last, while we successfully validated some of the RNA modification sites captured by ModTect, we were unable to perform comprehensive experimental validation of all sites. Of particular note, the lower ranking RNA modification sites (i.e., higher ModTect rank number) were significantly more difficult to validate as these occurred in fewer samples, had lower modification rates, and were thus ascribed a lower confidence. Further work is thus needed to validate and establish the identity of these sites. In conclusion, we developed ModTect for discovery and profiling of RNA modifications in large clinical cohorts by RNA-seq. The application of ModTect to large cohorts of patients readily identified relationships between the epitranscriptome and clinical outcomes. Together, our study represents an example of how relationships between the epitranscriptome and human diseases can be readily unraveled in the context of a wide range of human diseases with ModTect.

MATERIALS AND METHODS

Data used in this study

WGS and RNA-seq data used for the illustration of the multinucleotide mismatch and deletion signature in rRNA were obtained for the tumor sample TCGA-D7-6521-01A from TCGA (). RNA-seq datasets for TCGA cohort were collected for normal and tumor patient samples of various tumor types as described above (n = 10,725). RNA-seq data of various human cancer cell lines used in this study were downloaded from the CCLE (n = 934) (). These datasets were previously obtained from the Cancer Genomics Hub but can now be obtained from the Genomics Data Common Legacy Archive (https://portal.gdc.cancer.gov/legacy-archive/search/f). The analysis ID that can be used to uniquely identify the exact bam/fastq files we downloaded for the TCGA and CCLE cohorts used in this study can be found in column A of tables S9 and S10, respectively. Where necessary (because only the .bam file was available for download), downloaded sequencing data in bam file format were converted back into fastq format. Sequencing data for m1A IP used in this study were obtained from two previous studies on m1A by Dominissini et al. () (SRR2086044, SRR2086045, SRR2086046, SRR2086047, SRR3000784, SRR300785, SRR3000774, SRR3000775, SRR3000776, SRR3000777, SRR3000778, and SRR3000779) and Li et al. () (SRR2636905, SRR2636907, SRR2636909, SRR2636906, SRR2636908, and SRR2636910).

Sample types and size for TCGA cohort

The cancer types and the number of tumor samples of each type analyzed in this study is as follows: ACC (n = 79), bladder urothelial carcinoma (BLCA; n = 414), BRCA (n = 1135), cervical squamous cell carcinoma and endocervical adenocarcinoma CESC (n = 304), cholangiocarcinoma (CHOL; n = 36), COAD (n = 311), DLBC (n = 48), ESCA (n = 184), GBM (n = 157), head and neck squamous cell carcinoma (HNSC; n = 520), kidney chromophobe (KICH; n = 66), kidney renal clear cell carcinoma KIRC (n = 545), KIRP (n = 290), LAML (n = 178), LGG (n = 516), LIHC (n = 371), LUAD (n = 540), LUSC (n = 504), mesothelioma (MESO; n = 87), ovarian serous cystadenocarcinoma OV (n = 422), PAAD (n = 178), PCPG (n = 179), PRAD (n = 505), READ (n = 94), sarcoma (SARC; n = 259), skin cutaneous melanoma (SKCM; n = 103), stomach adenocarcinoma (STAD; n = 416), testicular germ cell tumors (TGCT; n = 150), THCA (n = 505), thymoma (THYM; n = 120), UCEC (n = 184), uterine carcinosarcoma (UCS; n = 57), and UVM (n = 80).

Mapping of RNA-seq data

RNA-seq data from the CCLE and TCGA were aligned toward the Broad variant of the hg19 reference genome (b37) (www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta) that was indexed using the National Center for Biotechnology Information RefSeq as a reference transcript annotation and with a “readlength -1” bp overhang (--sjdbOverhang parameter) using STAR (version 2.5.2a). Note that the mitochondrial reference sequence in the Broad variant of the hg19 reference genome differs from that in the standard hg19 reference genome, and is equivalent to that found in hg38.

Mapping of ribosomal reads

RNA-seq data from the CCLE and TCGA were mapped to a reference set of transcripts corresponding toward human rRNAs (NR_003287.2, NR_003286.2, NR_003285.2, and V00589.1) with Burrows-Wheeler Alignment-maximal exact match (BWA-MEM) (version 0.7.5a-r405) using default parameters.

Detection of multinucleotide signature and modification score

To determine whether a site is a putative RNA modification site, we introduced an RNA modification score in a subsequent section, inspired in part by SAMtools and MuTect (, ). The modification score is based on the likelihood of a site being tetranucleotide (three types of mismatches), trinucleotide (two types of mismatches), or binucleotide (one type of mismatch). Specifically, for each site, we calculated the likelihood of the site being a binucleotide site, a trinucleotide site, or a tetranucleotide site. Next, we will describe our variant models for the three different cases before introduction of the modification score.

Binucleotide case

If the site is assumed to be binucleotide, the site would follow a variant model Mε in which the site is assumed to contain a single true variant nucleotide m (non–reference sequence matching nucleotide) with variant nucleotide fraction ε. The likelihood is then calculated as follows on the basis of the definition formulated by the MuTect algorithm ()where b is the observed nucleotide at this site in the ith read with base sequencing error e, r is the reference nucleotide sequence at this position with r ∈ {A, T, G, C}, m is the variant nucleotide sequence at this position with m ∈ {A, T, G, C}, r ≠ m, and the variant nucleotide proportion of m nucleotide is labeled as ε with 0 ≤ ε ≤ 1. Under the model of Mε, the probability of each observed base can be calculated as follows Maximizing the log-likelihood of the following function representing the binucleotide casegives the maximum likelihood estimates (MLEs) of the two parameters m and ε

Trinucleotide case

Extending the binucleotide case to the trinucleotide case, the likelihood of the trinucleotide case can be defined as followswhere b, e, and r are defined the same as in the binucleotide case, while m1 and m2 are the two variant nucleotide sequences at the position with m1, m2 ∈ {A, T, G, C}, r ≠ m1, r ≠ m2, and m1 ≠ m2. Furthermore, ε1 and ε2 represent their corresponding variant proportions with ε1 ≤ ε2 and 0 ≤ ε1 + ε2 ≤ 1. Extending the model to trinucleotide cases, the probability of each observed base is calculated in the following manner Maximizing the log-likelihood yields the MLE of the four parameters

Tetranucleotide case

The likelihood of the tetranucleotide case can be further extended as followswhere b, e, and r are defined the same as in the binucleotide and trinucleotide cases, while m1, m2, and m3 are the three variant nucleotide sequences at the position with m ∈ {A, T, G, C}, r ≠ m, k = 1, 2, 3, and m ≠ m. Furthermore, ε1, ε2, and ε3 represent their corresponding variant nucleotide proportions with ε1 ≤ ε2 ≤ ε3 and 0 ≤ ε1 + ε2 + ε3 ≤ 1. Extending the model further to tetranucleotide cases, the probability of each observed base can be calculated in the following manner Maximizing the log-likelihood results in MLE for the six parameters as

Modification score

With the statistical models that we have developed, we can then test whether a site of interest displays the multinucleotide mismatch signature. More specifically, this is performed by looking at the relative log-likelihood of the site of interest being trinucleotide versus binucleotide. Thus, the modification score (SM) can be defined as follows Replacing the parameters with the MLEs , , , , , and gives us the following equation for the modification score of a site It should be noted that we have chosen to calculate the modification score by taking the relative log-likelihood of the trinucleotide case to the binucleotide case rather than to the mononucleotide case because germline variant single-nucleotide polymorphism and RNA editing sites would also be expected to show a high modification score if defined as such. Thus, defining the modification score as shown above would allow us to properly differentiate true multinucleotide mismatch sites from germline SNP sites and RNA editing sites. While we could also have calculated the modification score by considering the relative log-likelihood of the tetranucleotide case to the binucleotide case, we note that the proportion of the third mutant nucleotide (i.e., ε3) from the three sites we have presented is considerably smaller than the proportion of the second mutant nucleotide (ε2). Thus, while additional information can be gained from using the tetranucleotide model, the trinucleotide model is sufficient for most purposes. An important feature of the defined equations is that in the event that a site shows only the reference and a single type of variant nucleotide, ε2 would take the value of zero. Under such a situation, would reduce to , and these sites would give a modification score of zero. Similarly, if the site only carries nucleotides corresponding to the reference nucleotide, both the and terms would reduce to L(M0) with = . Thus, a modification score of zero would similarly be obtained in this context. In this way, the modification score is able to effectively separate true multinucleotide mismatch sites from germline variant sites and nonvariant sites given that such sites would give a modification score of zero.

Approximation of the maximum log-likelihood value

Solving for the maximum log-likelihood value by expectation maximization is computationally very demanding. For reasons of computational efficiency, we then approximated the parameters m and ε for the binucleotide model and m1, m2, ε1, and ε2 for the trinucleotide model. Each site carries one reference nucleotide as identified from the human genome and three possible variant nucleotides termed v1, v2, and v3. The fraction of each variant nucleotide was then calculated as follows These approximated values were then used for the calculation of the maximum log-likelihood values.

Detection of deletion signature

Custom-written Python scripts were used to process the output from SAMtools mpileup with the parameter –Q 0 to retain the asterisk placeholder character representing a deleted base at each position. The proportion of reads carrying a deletion at each site is then calculated by dividing the number of “*” characters by the corresponding sequencing depth for the site.

Collection of rRNA modification sites

Sites of rRNA modifications were collected from Modomics () and from the 3D rRNA Modification Maps Database (). A consensus position and rRNA modification were then defined. The positions of these sites were then transposed to the 28S and 18S reference sequences used in this study (NR_003287.2 and NR_003286.2, respectively). In all, 5 base pair–disrupting RNA modification sites {28S:1322 (m1A), 28S:4530 (m3U), 18S:1248 (m1acp3Ψ), 18S:1850 [N6,N6-dimethyladenosine (m6,6A)], and 18S:1851 (m6,6A)} and 188 non–base pair–disrupting RNA modification sites [Am (2′-O-methyladenosine) (29 sites), Cm (2′-O-methylcytidine) (22 sites), Gm (2′-O-methylguanosine) (24 sites), m5C (5-methylcytidine) (2 sites), m6A (2 sites), m7G (7-methylguanosine) (1 site), Um (2′-O-methyluridine) (16 sites), Ψm (2′-O-methylpseudouridine) (2 sites), and Ψ (pseudouridine) (90 sites)] were noted on 28S and 18S rRNA.

Performance assessment of method

Precision-recall curves and receiver-operator characteristic curves (ROC) were generated to assess the performance of our methodology based on the residual rRNA reads from the 934 CCLE RNA-seq samples. The base pair–disrupting RNA modifications, 28S:1322 (m1A), 28S:4530 (m3U), and 18S:1248 (m1acp3Ψ), from these 934 RNA-seq datasets were defined as the true-positive set, while all other sites on 28S and 18S rRNA were defined as the true-negative set. Although the two m6,6A RNA modification sites [18S:1850 (m6,6A) and 18S:1851 (m6,6A)] are also expected to disrupt base pairing, they were excluded from our analysis as they localized toward neighboring positions and might strongly affect the ability of the RT to incorporate a random nucleotide during RT. In addition, we have also noticed that the deletion signature induced by a base pair–disrupting RNA modification can sometimes extend additional nucleotides upstream of the RNA, presumably due to further slippage of the RT after encountering a base pair–disrupting RNA modification during cDNA synthesis. As a result, it is difficult to determine from which of the neighboring RNA modification sites the deletion signal was induced from for a fair analysis. Given the above, these two sites were excluded when assessing the performance of the different methodologies. In particular, we tested four different methodologies: (i) total fraction of all variant nucleotides (ε), (ii) deletion rate (rd), (iii) modification score (SM), and (iv) combinatory score (SC) by combining both SM and rd for each site. The combination algorithm can be given as followswhere rd,c is the deletion rate cutoff parameter. The deletion rate cutoff (rd,c) was set to be rd,c = 0.001 because it is similar to the deletion rate we observe in the rRNA modification sites. With these scores calculated, we then generated the precision-recall and ROC curves to assess each of the four methodologies for calling RNA modifications based on the rRNA reads found in the CCLE libraries.

Discovery of putative RNA modification sites

BAM files corresponding to each RNA-seq dataset were processed using the SAMtools mpileup function together with custom Python scripts. Briefly, the bases observed at each position together with their base sequencing qualities from all the reads aligning at the position were obtained. At the same time, the reference nucleotide sequence was also extracted from the reference genome. Then, the likelihoods for the position can be estimated and are further used to calculate the modification score (SM), as defined above. It should be noted that bases within six nucleotides from the terminal edge of a read were excluded from analysis to remove the possibility that identified variants were induced during random hexamer priming. This was done by extracting the base position on the read with the “-O” option in SAMtools mpileup and by filtering out the terminal six nucleotides from the edge of the read based on the known length of the read using custom Python scripts. Sites with a modification score (SM) ≥1 were then extracted from each RNA-seq dataset. The list of sites identified across the CCLE samples was then compiled together. Specifically, the final list of RNA modification sites was obtained by filtering for sites with a modification score ≥1 in ≥6 samples and a deletion proportion ≥0.001. In addition, sites that localized to the immunoglobulins (abParts) were excluded from the final list of sites. To exclude variants from spurious mapping, we only retained variants with a mappability of 0.8 and above (wgEncodeCrgMapabilityAlign100mer). In addition, we extracted ±50 bp from each candidate site and looked for possible hits using BLAT. Only sites with a score ≥20 higher than the next best hit were retained. We then removed all sites that were also annotated as common SNPs (at least one 1000Genomes population with a minor allele of frequency ≥1%) in the dbSNP150 database. Sites residing in homopolymeric regions (≥5 nucleotides) were also removed. A further layer of more aggressive filters was also introduced to further ensure the precision of our calls and to generate a cleaner list of sites (table S1). Specifically, sites were discarded if (i) >80% of the supporting reads are spanning an unannotated splicing junction, (ii) they are in or next to a homopolymer ≥6 nucleotides, (iii) almost all mismatches on the supporting reads disappear if substituted with an indel, or (iv) mismatches are from the last 20 bp of the supporting reads as was previously applied ().

Analysis of variant proportion from WGS datasets

CCLE WGS datasets were obtained from a previous study () and mapped to the hg38 reference genome with BWA-MEM (0.7.17-r1188). To assess the mismatch rate at RNA modification sites identified in this study, RNA modification sites (hg19 coordinates) were first converted to hg38 coordinates using the UCSC liftover tool (https://genome.ucsc.edu/cgi-bin/hgLiftOver). Each of these RNA modification sites was then assessed for their mismatch rate at the DNA level with ModTect, with the following parameters: “--minBaseQual 20 --basesToTrimFromEdge 0 --scoreCutoff -1000000000”. The average mismatch rate for each site across all samples was then calculated and indicated in the plot.

Annotation of discovered variants

Discovered sites were annotated using custom scripts written in Perl based on annotations derived from the UCSC knownGene table. All possible transcripts overlapping with the variants were identified, and the location of the variant (i.e., 5′UTR, CDS, 3′UTR, intron, etc.) on the transcript was identified. In the event that the variant location identified from different transcripts is different, the variant location was assigned according to the following priority: CDS>5′UTR>3′UTR>intron>upstream>downstream>ncRNA>intergenic. The strand corresponding to each transcript was also noted during the annotation process. The identified variant was then indicated as arising from the positive or negative strand based on the transcript annotations and as “ambiguous” if the corresponding transcript annotations support both strands. Sites that originated from an unannotated region of the genome were defined as “intergenic,” and no strand of origin or nucleotide was noted. The reference nucleotide and the observed nucleotide conversion were then determined accordingly from the annotated strand information. Note that some sites could not be assigned to the four nucleotide categories as the transcript annotation indicates that they originated from intergenic regions or from regions where the strand of origin was ambiguous.

Nucleotide substitution pattern

Samples were classified as carrying “no mismatch” or “one,” “two,” or “three” types of mismatches based on observed number of types of reference and mismatch bases (≥Q30) at the site of interest. The corresponding mismatch nucleotide (i.e., non-reference base matching nucleotide) was then noted to obtain the mismatch pattern for the site in each sample. A similar plot was also generated for the inosine sites in our study, to serve as a negative control as it is known to cause only A>G mismatches.

Proportion of samples where deletion signal was detected

The entire 934 CCLE RNA-seq and 349 DNA sequencing samples were used for the study. Only samples that are covered by ≥10 reads at each of the sites were used for the analysis. A deletion was deemed to be present at the site if it has at least one read with a deletion and could be detected in ≥0.001 proportion of all reads.

Benchmarking of ModTect with HAMR

HAMR software (Commit: 0a04208eae481137e0081dd90b9c1a40bc49c9f3) was obtained from the HAMR github repository (https://github.com/GregoryLab/HAMR). HAMR was then performed on 935 CCLE RNA-seq datasets using the recommended parameters of min_read_qual = 30, min_read_coverage = 10, seq_error_rate = "0.05", hypothesis = H4, max_p = 0.01, max_fdr = 0.05, and min_ref_percent = 0.05. The “euk_trna_mods.Rdata” model file provided by HAMR was also specified as the prediction model for RNA modification classification. HAMR analysis was restricted to the ±100-bp region of the 10 representative sites by specifying a target bed file representing these regions. We also assessed HAMR using the “H1 hypothesis” and the raw outputs generated by HAMR. The number of sites that localized to the modification sites (true positives) and to neighboring regions (false positives) was then enumerated. ModTect was similarly applied to the 935 CCLE RNA-seq datasets for the ±100-bp region of the 10 representative sites. ModTect run represented in the main figure was performed using the default modification score cutoff of ≥1. A more stringent modification score cutoff of ≥5 was also evaluated. Notably, we did not apply the deletion signal filter for the evaluation although it can enhance the precision of our calls as the deletion signal can only be reliably observed for each sample at high sequencing coverages or when reads from multiple samples are aggregated. Similarly, sites that were identified at the true modification sites and at neighboring regions were enumerated. To assess the sensitivity and precision of the calls by both approaches, we had assumed that there was a theoretical maximum of 9350 true sites (935 samples × 10 sites assessed), although it should be noted that there was no sequencing coverage observed in some samples at some of these sites. Further, the flanking ±100-bp region of each site was defined as false sites, although some as yet characterized RNA modifications could potentially reside in these regions. There were a total of 1671 nonoverlapping “negative” sites in each sample, representing a total of 1,562,385 sites across 935 CCLE samples.

Sanger sequencing validation of base pair–disrupting RNA modifications

For Sanger sequencing experiments comparing MMLV and TGIRT, cells were grown in Dulbecco’s modified Eagle’s medium (DMEM) with 10% fetal bovine serum (FBS) and 1% penicillin-streptomycin. RNA was extracted using ReliaPrep RNA Miniprep Systems (Promega). Reverse transcription was performed using TGIRT-III Enzyme or ProtoScript II Reverse Transcriptase (MMLV) with the following conditions: 1 μg of RNA sample, 4 μl of 5× buffer, 2 μl of 10× dithiothreitol, 1 μl of 10 mM deoxynucleotide triphosphate (dNTP), 2 μl of random primer mix, 1 μl of enzyme, and nuclease-free water to a total volume of 20 μl. The reaction tube was incubated at 60°C (TGIRT) or 42°C (ProtoScript II Reverse Transcriptase) for 1 hour. PCR was performed using the 1:50 diluted cDNA using GoTaq Green Master Mix taq enzyme with indicated primers. The PCR products were gel-purified and Sanger-sequenced. For Sanger sequencing experiments validating mRNA/lncRNA modifications, OVK18 cells were grown in DMEM with 10% FBS and 1% penicillin-streptomycin. KMS-12-BM cells were grown in RPMI 1640 medium with 10% FBS and 1% penicillin-streptomycin. Total RNA was extracted using the RNeasy Mini Kit (QIAGEN). gDNA was removed by in-column digestion with deoxyribonuclease I (DNase I), following the protocol of the RNeasy and RNase-Free DNase Set (except that the DNase I digestion time was extended from 15 to 40 min, to completely remove residual gDNA). Reverse transcription was performed using TGIRT-III Enzyme with the following conditions: 0.5 μg of RNA sample (4 μl), 1 μl of 10× T4 ligase buffer, 2 μl of 5 M NaCl, 1 μl of 25 mM dNTP, 1 μl of random primer mix (hexamer), 1 μl of oligo(dT), 1 μl of ribonuclease (RNase) inhibitor (Roche), 1 μl of TGIRT enzyme, and 8 μl of nuclease-free water to a total volume of 20 μl. The reaction tube was incubated at 60°C (TGIRT) for 1.5 hours. PCR was performed using the 1:25 diluted cDNA using GoTaq Green Master Mix taq enzyme with indicated primers. The PCR products were gel-purified and Sanger-sequenced. Primers used for PCR amplification are as follows: Primer set 1 (28 DLW1411CCCCGAGTGTTACAGCCC DLW1412CGCCCCTATACCCAGGTC Primer set 2 (28 DLW1413GTGCCAGGTGGGGAGTTTG DLW1414CCGTTTCCCAGGACGAAGG Primer set 3 (18 DLW1415GACGGACCAGAGCGAAAG DLW1416TGAGCCAGTCAGTGTAGCG Primer set 4 (MALAT1 m MALAT1 forwardGGTAACGATGGTGTCGAGGT MALAT1 reverseACACCAGCTGCAGGCTATTA Primer set 5 (PRUNE m PRUNE forwardCCGATTCGCCGTGTGG PRUNE reverseCCTCCCCACAACTCAGTGTTC Primer set 6 (m CTC1 8129568 forwardGTGGGGTGTGGCATCAGTTA CTC1 8129568 reverseGGATGGGATGATCCAGTCGG Primer set 7 (m CTC1 m22g forwardATGAGAGGTTTGGAGGGTGC CTC1 m22g reverseTTGGAGGGAGGAGTCTCTGG Primer set 8 (m CTC1 P2 forwardCGACTGGATCATCCCATCCTG CTC1 P2 reverseGAATCACTAGCGCTCCCGAA Primer set 9 (m CTC1 P9 forwardGCATCACTTGTCCATTCCTCG CTC1 P9 reverseAAGGCTTGGAGGGAGGAGTC

Distance of modified site from 5′ and 3′ terminus of read pairs

The 5′ and 3′ positions of the modified read pair were extracted using SAMtools and custom-written Perl scripts. Briefly, reads that overlapped with a site of interest were extracted, and their read names were noted to extract the corresponding read in the read pair. At the same time, the observed nucleotide and base sequencing quality at the site were also noted. Only reads where the base quality was ≥Q20 and where the observed nucleotide differed from the reference nucleotide (i.e., mismatch nucleotide at the site of interest) at the site of interest were retained for subsequent analysis. The corresponding read pairs could then be extracted and their 3′ or 5′ positions noted for type 1 and type 2 reads, respectively. The distance from the site of interest is then calculated on the basis of the positions noted.

Composition bias of detected candidate sites

The chi-squared test was performed using R to determine whether there was a composition bias in the number of A/T and C/G sites.

List of m1A sites from the literature

The list of 15 m1A sites detected using m1A-seq by Safra et al. was obtained directly from the publication. As there were slight differences in the reference mitochondria sequences used in both studies, the mitochondrial sites reported in the study were converted to the same genomic coordinate before comparison. The list of m1A sites detected using m1A-MAP by Li et al. was obtained from the publication. As the sites were reported on the basis of transcript coordinates, these sites were converted to genomic coordinates based on the reference genome used in our study using custom Perl scripts and the corresponding transcript annotations. Duplicated sites reported in the study and sites not carrying an A based on the reference transcript were removed.

Processing of TCGA data

The TCGA dataset was processed as per the CCLE dataset to generate the BAM files. Modification levels at each of the 9823 sites found from the panel of CCLE RNA-seq dataset were then assessed by ModTect. In addition, the modification levels at m1A RNA modification sites that were concordantly identified by previous studies were also assessed.

Analysis of RNA modification levels in TCGA cohort and heatmap of modification levels

A set of RNA modification sites among the 9823 sites that were found to be of higher confidence were selected for more detailed analysis. First, each of the 10,000+ TCGA samples were analyzed separately at each of the 9823 sites found from the CCLE dataset. The site was then assessed to determine whether it had a variant proportion ≥0.01, a depth ≥10, and a modification score ≥0.1. The number of samples that passed this threshold across the TCGA cohort was then enumerated. We then retained sites that had an average modification rate of more than 0.1 and were observed in ≥50 samples. Sites that had a modification rate ≥80% were omitted after manual curation as they were found to be germline SNP variants (14 out of 9823 sites). A list of base pair–disrupting rRNA modifications and m1A mRNA modification sitesthat were found concordantly by m1A-seq and m1A-MAP was also included. This filtered and finalized list of sites is then presented in the final boxplot in Figure 7A (45 sites). The heatmap of RNA modification levels across different tumor types, of each modification site, was generated as follows for the sites that were presented in the boxplot. For each site, the average modification level in each tumor cohort was calculated (meansite_tumortype). A corresponding Z score was then calculated by the mean-of-mean method. This is represented by the following equation: Z score = (meansite_tumortype − mean-of-meansite + 0.001)/mean-of-meansite, where 0.001 is a pseudo-count and mean-of-meansite is the mean of the meansite_tumortype value across all tumor types. Sites that were not represented in more than three tumor types were discarded (e.g., when the gene covering the RNA modification site is not expressed in a tumor type, the modification site will not be represented in our result). The calculated values were then visualized on a heatmap with the heatmap.2 package in R.

Expression of RNA modification enzymes

The list of RNA modification enzymes was extracted from a previous study () and was manually curated. The list of human RNA modification enzymes assessed in this study and their corresponding gene symbols is included as part of table S16. Gene expression data for TCGA samples were downloaded from the GDC (genomic data commons) data portal from the National Cancer Institute (HTSeq-FPKM). As the expression data were indicated in ensemble geneID format, these labels were correspondingly converted to the corresponding gene symbols. Gene expression data of each of the RNA modification enzymes were subsequently extracted for downstream analysis.

t-distributed stochastic neighbor embedding (t-SNE) plot of RNA modification enzymes

Expression of RNA modification enzymes was converted to the log2 scale and then Z score–normalized by the mean-of-mean method. A t-SNE plot was then generated on the basis of the expression of the RNA modification enzymes with the Rtsne package in R (version 3.5.0) with the following parameters: perplexity = 30, max_iter = 500.

Survival analysis

Before survival analysis, the samples in each TCGA cohort were clustered into two groups: MS (modification score) high and MS low. Only those samples with sufficient sequencing depth (≥20) were considered in the above grouping. The Kaplan-Meier method was then used for depicting the survival difference between the MS high and MS low groups and the log-rank test was used for determining the statistical significance.

Correlation for identification of the potential modification enzymes

Expression of each gene/enzyme was correlated with the modification score across all RNA modification sites based on Spearman rank correlation. To exclude those genes/enzymes that had high expression correlation with the host gene of an RNA modification site, we also profiled their linear (Pearson) expression correlation with the host gene. The enzymes with gene expression highly correlated with the modification score but not with the expression of the host gene could be identified as potential candidates for the modification enzymes.

Validation of m1A sites detected by ModTect with m1A IP

Previously published m1A IP data were downloaded from GSE73941 (). This corresponded to two sets of m1A IP and two sets of control datasets. These datasets were then mapped to hg19 using STAR, with the GENCODE v32 annotations, and ENCODE parameters. We further removed nonunique alignments by removing reads that had a mapping quality of <20. Subsequent to that, we then counted the number of A, G, C, and T nucleotides at each candidate A sites. Sites were regarded as m1A sites if the variant allele frequency of the IP sample is ≥30% and ≥3× that observed in the control.
  108 in total

1.  Comment on "Widespread RNA and DNA sequence differences in the human transcriptome".

Authors:  Claudia L Kleinman; Jacek Majewski
Journal:  Science       Date:  2012-03-16       Impact factor: 47.728

2.  Comment on "Widespread RNA and DNA sequence differences in the human transcriptome".

Authors:  Joseph K Pickrell; Yoav Gilad; Jonathan K Pritchard
Journal:  Science       Date:  2012-03-16       Impact factor: 47.728

3.  Comprehensive Pan-Genomic Characterization of Adrenocortical Carcinoma.

Authors:  Siyuan Zheng; Andrew D Cherniack; Ninad Dewal; Richard A Moffitt; Ludmila Danilova; Bradley A Murray; Antonio M Lerario; Tobias Else; Theo A Knijnenburg; Giovanni Ciriello; Seungchan Kim; Guillaume Assie; Olena Morozova; Rehan Akbani; Juliann Shih; Katherine A Hoadley; Toni K Choueiri; Jens Waldmann; Ozgur Mete; A Gordon Robertson; Hsin-Ta Wu; Benjamin J Raphael; Lina Shao; Matthew Meyerson; Michael J Demeure; Felix Beuschlein; Anthony J Gill; Stan B Sidhu; Madson Q Almeida; Maria C B V Fragoso; Leslie M Cope; Electron Kebebew; Mouhammed A Habra; Timothy G Whitsett; Kimberly J Bussey; William E Rainey; Sylvia L Asa; Jérôme Bertherat; Martin Fassnacht; David A Wheeler; Gary D Hammer; Thomas J Giordano; Roel G W Verhaak
Journal:  Cancer Cell       Date:  2016-05-09       Impact factor: 31.743

4.  Genomic Classification of Cutaneous Melanoma.

Authors: 
Journal:  Cell       Date:  2015-06-18       Impact factor: 41.582

5.  Integrated genomic characterization of papillary thyroid carcinoma.

Authors: 
Journal:  Cell       Date:  2014-10-23       Impact factor: 41.582

6.  Next-generation characterization of the Cancer Cell Line Encyclopedia.

Authors:  Mahmoud Ghandi; Franklin W Huang; Judit Jané-Valbuena; Gregory V Kryukov; Christopher C Lo; E Robert McDonald; Jordi Barretina; Ellen T Gelfand; Craig M Bielski; Haoxin Li; Kevin Hu; Alexander Y Andreev-Drakhlin; Jaegil Kim; Julian M Hess; Brian J Haas; François Aguet; Barbara A Weir; Michael V Rothberg; Brenton R Paolella; Michael S Lawrence; Rehan Akbani; Yiling Lu; Hong L Tiv; Prafulla C Gokhale; Antoine de Weck; Ali Amin Mansour; Coyin Oh; Juliann Shih; Kevin Hadi; Yanay Rosen; Jonathan Bistline; Kavitha Venkatesan; Anupama Reddy; Dmitriy Sonkin; Manway Liu; Joseph Lehar; Joshua M Korn; Dale A Porter; Michael D Jones; Javad Golji; Giordano Caponigro; Jordan E Taylor; Caitlin M Dunning; Amanda L Creech; Allison C Warren; James M McFarland; Mahdi Zamanighomi; Audrey Kauffmann; Nicolas Stransky; Marcin Imielinski; Yosef E Maruvka; Andrew D Cherniack; Aviad Tsherniak; Francisca Vazquez; Jacob D Jaffe; Andrew A Lane; David M Weinstock; Cory M Johannessen; Michael P Morrissey; Frank Stegmeier; Robert Schlegel; William C Hahn; Gad Getz; Gordon B Mills; Jesse S Boehm; Todd R Golub; Levi A Garraway; William R Sellers
Journal:  Nature       Date:  2019-05-08       Impact factor: 49.962

7.  Comprehensive Molecular Characterization of Papillary Renal-Cell Carcinoma.

Authors:  W Marston Linehan; Paul T Spellman; Christopher J Ricketts; Chad J Creighton; Suzanne S Fei; Caleb Davis; David A Wheeler; Bradley A Murray; Laura Schmidt; Cathy D Vocke; Myron Peto; Abu Amar M Al Mamun; Eve Shinbrot; Anurag Sethi; Samira Brooks; W Kimryn Rathmell; Angela N Brooks; Katherine A Hoadley; A Gordon Robertson; Denise Brooks; Reanne Bowlby; Sara Sadeghi; Hui Shen; Daniel J Weisenberger; Moiz Bootwalla; Stephen B Baylin; Peter W Laird; Andrew D Cherniack; Gordon Saksena; Scott Haake; Jun Li; Han Liang; Yiling Lu; Gordon B Mills; Rehan Akbani; Mark D M Leiserson; Benjamin J Raphael; Pavana Anur; Donald Bottaro; Laurence Albiges; Nandita Barnabas; Toni K Choueiri; Bogdan Czerniak; Andrew K Godwin; A Ari Hakimi; Thai H Ho; James Hsieh; Michael Ittmann; William Y Kim; Bhavani Krishnan; Maria J Merino; Kenna R Mills Shaw; Victor E Reuter; Ed Reznik; Carl S Shelley; Brian Shuch; Sabina Signoretti; Ramaprasad Srinivasan; Pheroze Tamboli; George Thomas; Satish Tickoo; Kenneth Burnett; Daniel Crain; Johanna Gardner; Kevin Lau; David Mallery; Scott Morris; Joseph D Paulauskis; Robert J Penny; Candace Shelton; W Troy Shelton; Mark Sherman; Eric Thompson; Peggy Yena; Melissa T Avedon; Jay Bowen; Julie M Gastier-Foster; Mark Gerken; Kristen M Leraas; Tara M Lichtenberg; Nilsa C Ramirez; Tracie Santos; Lisa Wise; Erik Zmuda; John A Demchok; Ina Felau; Carolyn M Hutter; Margi Sheth; Heidi J Sofia; Roy Tarnuzzer; Zhining Wang; Liming Yang; Jean C Zenklusen; Jiashan Zhang; Brenda Ayala; Julien Baboud; Sudha Chudamani; Jia Liu; Laxmi Lolla; Rashi Naresh; Todd Pihl; Qiang Sun; Yunhu Wan; Ye Wu; Adrian Ally; Miruna Balasundaram; Saianand Balu; Rameen Beroukhim; Tom Bodenheimer; Christian Buhay; Yaron S N Butterfield; Rebecca Carlsen; Scott L Carter; Hsu Chao; Eric Chuah; Amanda Clarke; Kyle R Covington; Mahmoud Dahdouli; Ninad Dewal; Noreen Dhalla; Harsha V Doddapaneni; Jennifer A Drummond; Stacey B Gabriel; Richard A Gibbs; Ranabir Guin; Walker Hale; Alicia Hawes; D Neil Hayes; Robert A Holt; Alan P Hoyle; Stuart R Jefferys; Steven J M Jones; Corbin D Jones; Divya Kalra; Christie Kovar; Lora Lewis; Jie Li; Yussanne Ma; Marco A Marra; Michael Mayo; Shaowu Meng; Matthew Meyerson; Piotr A Mieczkowski; Richard A Moore; Donna Morton; Lisle E Mose; Andrew J Mungall; Donna Muzny; Joel S Parker; Charles M Perou; Jeffrey Roach; Jacqueline E Schein; Steven E Schumacher; Yan Shi; Janae V Simons; Payal Sipahimalani; Tara Skelly; Matthew G Soloway; Carrie Sougnez; Angela Tam; Donghui Tan; Nina Thiessen; Umadevi Veluvolu; Min Wang; Matthew D Wilkerson; Tina Wong; Junyuan Wu; Liu Xi; Jane Zhou; Jason Bedford; Fengju Chen; Yao Fu; Mark Gerstein; David Haussler; Katayoon Kasaian; Phillip Lai; Shiyun Ling; Amie Radenbaugh; David Van Den Berg; John N Weinstein; Jingchun Zhu; Monique Albert; Iakovina Alexopoulou; Jeremiah J Andersen; J Todd Auman; John Bartlett; Sheldon Bastacky; Julie Bergsten; Michael L Blute; Lori Boice; Roni J Bollag; Jeff Boyd; Erik Castle; Ying-Bei Chen; John C Cheville; Erin Curley; Benjamin Davies; April DeVolk; Rajiv Dhir; Laura Dike; John Eckman; Jay Engel; Jodi Harr; Ronald Hrebinko; Mei Huang; Lori Huelsenbeck-Dill; Mary Iacocca; Bruce Jacobs; Michael Lobis; Jodi K Maranchie; Scott McMeekin; Jerome Myers; Joel Nelson; Jeremy Parfitt; Anil Parwani; Nicholas Petrelli; Brenda Rabeno; Somak Roy; Andrew L Salner; Joel Slaton; Melissa Stanton; R Houston Thompson; Leigh Thorne; Kelinda Tucker; Paul M Weinberger; Cynthia Winemiller; Leigh Anne Zach; Rosemary Zuna
Journal:  N Engl J Med       Date:  2015-11-04       Impact factor: 91.245

8.  Single-nucleotide-resolution mapping of m6A and m6Am throughout the transcriptome.

Authors:  Bastian Linder; Anya V Grozhik; Anthony O Olarerin-George; Cem Meydan; Christopher E Mason; Samie R Jaffrey
Journal:  Nat Methods       Date:  2015-06-29       Impact factor: 28.547

9.  RADAR: a rigorously annotated database of A-to-I RNA editing.

Authors:  Gokul Ramaswami; Jin Billy Li
Journal:  Nucleic Acids Res       Date:  2013-10-25       Impact factor: 16.971

10.  Integrated Molecular Characterization of Testicular Germ Cell Tumors.

Authors:  Hui Shen; Juliann Shih; Daniel P Hollern; Linghua Wang; Reanne Bowlby; Satish K Tickoo; Vésteinn Thorsson; Andrew J Mungall; Yulia Newton; Apurva M Hegde; Joshua Armenia; Francisco Sánchez-Vega; John Pluta; Louise C Pyle; Rohit Mehra; Victor E Reuter; Guilherme Godoy; Jeffrey Jones; Carl S Shelley; Darren R Feldman; Daniel O Vidal; Davor Lessel; Tomislav Kulis; Flavio M Cárcano; Kristen M Leraas; Tara M Lichtenberg; Denise Brooks; Andrew D Cherniack; Juok Cho; David I Heiman; Katayoon Kasaian; Minwei Liu; Michael S Noble; Liu Xi; Hailei Zhang; Wanding Zhou; Jean C ZenKlusen; Carolyn M Hutter; Ina Felau; Jiashan Zhang; Nikolaus Schultz; Gad Getz; Matthew Meyerson; Joshua M Stuart; Rehan Akbani; David A Wheeler; Peter W Laird; Katherine L Nathanson; Victoria K Cortessis; Katherine A Hoadley
Journal:  Cell Rep       Date:  2018-06-12       Impact factor: 9.423

View more
  2 in total

1.  Single-molecule long-read sequencing reveals the potential impact of posttranscriptional regulation on gene dosage effects on the avian Z chromosome.

Authors:  Jianmei Wang; Yang Xi; Shengchao Ma; Jingjing Qi; Junpeng Li; Rongping Zhang; Chunchun Han; Liang Li; Jiwen Wang; Hehe Liu
Journal:  BMC Genomics       Date:  2022-02-11       Impact factor: 3.969

2.  Characterization of novel small non-coding RNAs and their modifications in bladder cancer using an updated small RNA-seq workflow.

Authors:  Zhangli Su; Ida Monshaugen; Arne Klungland; Rune Ougland; Anindya Dutta
Journal:  Front Mol Biosci       Date:  2022-07-18
  2 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.