Literature DB >> 27834909

CoverageAnalyzer (CAn): A Tool for Inspection of Modification Signatures in RNA Sequencing Profiles.

Ralf Hauenschild¹, Stephan Werner², Lyudmil Tserovski³, Andreas Hildebrandt⁴, Yuri Motorin⁵, Mark Helm⁶.

Abstract

Combination of reverse transcription (RT) and deep sequencing has emerged as a powerful instrument for the detection of RNA modifications, a field that has seen a recent surge in activity because of its importance in gene regulation. Recent studies yielded high-resolution RT signatures of modified ribonucleotides relying on both sequence-dependent mismatch patterns and reverse transcription arrests. Common alignment viewers lack specialized functionality, such as filtering, tailored visualization, image export and differential analysis. Consequently, the community will profit from a platform seamlessly connecting detailed visual inspection of RT signatures and automated screening for modification candidates. CoverageAnalyzer (CAn) was developed in response to the demand for a powerful inspection tool. It is freely available for all three main operating systems. With SAM file format as standard input, CAn is an intuitive and user-friendly tool that is generally applicable to the large community of biomedical users, starting from simple visualization of RNA sequencing (RNA-Seq) data, up to sophisticated modification analysis with significance-based modification candidate calling.

Entities: Chemical Disease Gene Species

Keywords: Next-Generation Sequencing (NGS); RNA modifications; RNA sequencing (RNA-Seq); alignment viewer; candidate screening; reverse transcription; reverse transcription (RT) signature

Mesh：

Year: 2016 PMID： 27834909 PMCID： PMC5197952 DOI： 10.3390/biom6040042

Source DB: PubMed Journal: Biomolecules ISSN： 2218-273X

1. Introduction

The detection of RNA modifications has recently re-emerged as a very timely topic of current research. Coupled to new detection methods came new insights into the function of RNA modifications in the regulation of RNA stability [1], regulation of gene expression [2,3,4,5], and immunity [6]. RNA modifications are structurally highly diverse, and among the approximately 150 chemically different structures in the Modomics database [7], all major classes of natural product compounds can be found [8,9]. Furthermore, there is evidence that the diversity may yet increase with the discovery of more modifications [10]. Despite this high diversity, some common denominators apply to both function and detection. Here, two important features for detection are reverse transcription (RT) arrest and misincorporation during complementary DNA (cDNA) synthesis. Before the advent of methods that are nowadays subsumed as deep sequencing, RT reverse transcription arrest was traditionally analyzed by gel or capillary electrophoresis [11]. A model modification for misincorporation, inosine, the product of an A-to-I deamination, is reliably reverse transcribed into a cytidine rather than a thymidine residue in the resulting cDNA. This misincorporation has led to the first transcriptome-wide mapping of an RNA modification [12]. The combined appearance of both RT arrest and misincorporation at modification sites was analyzed in early work [13,14]. Detailed analysis showed correlation between modification type and the relative composition of misincorporated nucleotides [15]. Also, chemical treatments that selectively alter the properties of a given modification [16,17] may therefore be exploited as an additional layer of information in single RNA species or in transcriptome-wide mapping [18,19,20]. Collection [7] and curation [21] of RNA sequences containing modifications underline a central problem in the field, arising from the vast number of candidate sites in large datasets. Because of these vast numbers, experimental verification of candidate sites by independent methods must typically be restricted to a small subset. Before engaging in such an endeavor, the experimentalist, and potential user of the software presented here, may want to assess the significance of an identification event, and visually inspect parameters at a given site. In principle, a variety of so-called alignment viewers like IGV, Tablet, Savant, UGENE and Persephone provide more or less detailed graphical representations of mapping results, typically resolving the base composition and orientation of reads covering a reference sequence. However, our recent application of machine learning approaches to the identification of modification sites has uncovered an unmet need for particular features in said tools. Specifically, the combination of mismatch patterns and a newly defined RT arrest rate has emerged as the central feature allowing efficient identification of 1-methyladenosine residues [22]. In response, CoverageAnalyzer (CAn) was specifically created for analysis of modification signatures in deep sequencing data. Distinct from variant caller and single nucleotide polymorphisms (SNP) identification tools, it allows the definition of a highly detailed query, based on combinations of arrest rates and mismatch composition, as well as a Context Sensitive Arrest rate (CSA). A differential visualization tool is particularly useful to compare signatures upon differential chemical treatment, or between wild-type and knockout mutants e.g., of a methyltransferase [22]. CAn combines a data processing pipeline with flexible controls for independent or differential visualization and automated screening for modification candidates based on complex RT signatures.

2. Results

CAn was optimized to allow rapid pre-selection and convenient visualization of such sites in transcriptome data, which display conspicuous RT signatures and are therefore potential candidates for further scrutiny, e.g., by visual inspection. The RT signatures in question may comprise nucleotide misincorporation or transcription arrest, and frequently originate from nucleotide modification at the position of interest. Several library preparation protocols have been published that capture cDNA from abortive RT [16,20,22,23] and can therefore be fully exploited by CAn. However, even preparation methods that do not capture abortive cDNA may provide useful information by providing misincorporation signals that may be analyzed by CAn. It is hence recommended that the user familiarizes himself with details of the various preparations beforehand ([24]). A typical CAn-session, conceived to identify, highlight, and visually inspect modification candidates, is depicted in Figure 1. The user is required to input a dataset in SAM format, containing RNA sequencing (RNA Seq) reads mapped to a genome or transcriptome. This is converted to the internally used Profile format by an automated pipeline (Figure 1a). To optimally detect stalled RT events, a parameter called CSA was introduced, which queries a local background arrest rate near the inspection site and takes it into account. CSA was defined as the fold change of a site i’s arrest rate A [22] with respect to the median A of its sequence environment of r bases up- and r bases downstream (here r = 5):

Figure 1

Workflow for a typical CAn session (a) Input SAM files are processed to a positional profile; (b) Sorting and filtering of data by various statistical criteria. From the depicted result table, users select sequences for visualization; (c) Visualization tab. Independent plots and differential comparison for mismatch and/or arrest parameters with marked above-threshold sites (yellow triangle). Display of base sequences is enabled automatically depending on the horizontal plot dimension; (d) Candidate casting tab; (i) Formula editor: Specification of screening thresholds. Conditions are combined with Boolean operators AND, OR and XOR and can be parenthesized; (ii) Control panel for serial plotting of a resulting candidate batch.

The CSA feature, since it maps cDNA from abortive RT events, can only be meaningfully applied to data from library preparation protocols that specifically include such reads. Whether or not a protocol does so, typically hinges upon the incorporation step that introduces the second primer binding site. The installation package provides test datasets which were obtained by a library preparation protocol that captures abortive RT reads by ligation of a second adapter to the cDNA, as described in [22]. After selection of an RNA sequence of interest (Figure 1b), the software displays the sequence for visual inspection in a window (Figure 1c(i)). Events are labeled (yellow triangles), where values for misincorporation or peaks of CSA exceed adjustable thresholds. Additional profiles of the same sequence can be loaded and displayed in parallel plots (Figure 1c(ii)) for comparison of samples of variegated modification status, for example a wild-type RNA preparation versus one from a knockout organism lacking a certain modification activity [22]. Another application of interest is exemplified by the included test dataset, namely a chemical treatment suspected to alter the profile of certain modifications. With test data 1 as the naive sample in window (i) and test data 2 treated with an agent causing partial deamination of 5-methylcytidines in window (ii); a differential plot was generated in window (iii), where differences are displayed according to self-defined threshold criteria, and combinations thereof. The candidate casting tab (snippet shown in Figure 1d) offers a formula editor to generate filter rules of arbitrary complexity using thresholds combined by Boolean expressions and brackets. The resulting candidates files can be submitted to batch plotting for fast visual inspection of many candidate positions. With high flexibility in image dimensions, parameters, and legend details, these data can be exported as publication-ready images.

3. Discussion

CAn is a tool that allows the visualization and assisted inspection of deep sequencing data in the search for RNA modifications. Perusal of vast amounts of data is facilitated by a toolbox that allows to automatically highlight sites, where noticeably unusual combinations of RT arrest and misincorporation hint at the potential presence of modifications. Of note, there are no predefined thresholds that the program uses to flag unusual instances. Rather, it is up to the user to define threshold values for different parameters, and to combine them by Boolean operators. CAn is not meant to predict a modification event, or even to decipher the chemical structure of a potential modification. The program is rather designed to point attention to special candidate sites for its visual inspection. Inspection of large datasets automatically increases the statistical likelihood of the occurrence of conspicuous signals without a biochemical cause. Therefore, it is prudent to increase the stringency in such a case. While it is left to the user to decide how p-values are used to gauge the significance of findings, we recommend to use techniques like the Bonferroni correction [25] in order to account for the number of tested positions. In addition, the False Discovery Rate (FDR) can be controlled in the manner of Benjamini and Hochberg [26]. Outside these two approaches, which are rooted in statistics, a number of experimental approaches are open to the user to improve confidence by experimentally validating the candidate sites proposed by CAn. We urgently propose to call and treat these sites as “candidates” until validated by further experiments, e.g., by biochemical interrogation of a suggested site. In this context, we again emphasize the comparison feature, through which CAn specifically provides the possibility to inspect profiles before and after treatment with specific chemicals known to alter the RT-profile of a given modification. These may include, e.g., a Dimroth rearrangement of 1-methyladenosine (m1A) by alkaline treatment [27], acrylonitrile treatment for the detection of inosine [28], and others [16].

4. Materials and Methods

4.1. Implementation

The graphical user interface (GUI) and the core of CAn are written in Java. The Miniconda Python based plotting component uses Matplotlib [29], Numpy [30] and Scipy. The software is distributed as self-extracting archive (~100 MB) for Windows (64-bit) and as zip files installed via included script setup routines for Linux and Mac OS X. Dependencies are downloaded automatically. On Mac OS X, latest Homebrew is installed to setup SAMtools [31]. Java 1.7+ is expected to be installed by the user, whereas Linux version installs dependencies via apt-get. Test data, a getting-started screencast and a user manual are included.

4.2. Workflow

From unseen SAM input data files from N user-specified samples and the original FASTA mapping reference, CAn creates sorted and indexed Binary Alignment/Map (BAM) and finally the Pileup format. Users may replace the generated results on the hard drive with own files if they prefer different SAMtools parameters. In Pileup, periods and commas indicate matches, As, Gs, Ts and Cs mismatches and the arrest rate A of position i can be calculated as quotient of circumflexes at i + 1 and coverage at i + 1. Thus, a tabular Profile format is created, listing sequence positions line-wise with columns providing information on: position, reference base, coverage, mismatch rate M, number of (#) As, #Gs, #Ts, #Cs, and arrest rate A. Profile is divided into subfiles named by an x_y.txt tag, where x represents the reference number and y the yth file of a 1 kb block of subsequence of reference x. For example, a file named 3_7.txt contains data for positions 6001–6430, if the third reference has 6430 nt. Hence, hashing allows fast access to a query region without reading or memorizing leading positions, when accessing ends of long reference sequences. Thus, although the scope is on short sequences of RNA, chromosomes can be handled, too. In parallel, statistics are gathered for reference sequences (Figure 1b): ID, file path, length, sequence (first 100 nucleotides (nt)), coverage peak, number of high-arrest sites (S), high mismatch sites (S), heterogeneous mismatch sites (S) and mapped reads. This facilitates manual sorting and filtering by the user for visualization. Let c be the coverage at position i of reference f of length n. Let R be the reference base at i. Let where be the observed frequency of base type b covering i in f. Thus, is the set of mismatching . All i with contribute to , if two or more mismatch types exhibit a minimum mismatch rate of 0.1: S and S are calculated similarly, for arrest or mismatch rates exceeding a threshold normalized with coverage c, such that low arrest rates are considered insignificant at low c, but captured if c is high.

5. Conclusions

CAn was developed as a cross-platform open-source software running on most current computers. It allows efficient inspection of RNA Seq profiles for RT signatures of modifications, such as m1A [22]. The user is provided with assistance to identify unusual patterns, to compare different datasets containing the same sequence, and to perform significance-based candidate calling. Important to the field is the implementation of both misincorporation patterns and RT arrest, including also the CSA format as defined during our recent extraction of RT signatures by machine learning [22]. CAn is highly conductive to the extraction of complete RT signatures, by providing full control of all thresholds for visualization, identification and discrimination to the user.

27 in total

1. Traces of post-transcriptional RNA modifications in deep sequencing data.

Authors: Sven Findeiss; David Langenberger; Peter F Stadler; Steve Hoffmann
Journal: Biol Chem Date: 2011-02-24 Impact factor: 3.915

Review 2. RNA nucleotide methylation.

Authors: Yuri Motorin; Mark Helm
Journal: Wiley Interdiscip Rev RNA Date: 2011-03-23 Impact factor: 9.957

Review 3. Pseudouridine: still mysterious, but never a fake (uridine)!

Authors: Felix Spenkuch; Yuri Motorin; Mark Helm
Journal: RNA Biol Date: 2014 Impact factor: 4.652

4. Pseudo-Seq: Genome-Wide Detection of Pseudouridine Modifications in RNA.

Authors: Thomas M Carlile; Maria F Rojas-Duran; Wendy V Gilbert
Journal: Methods Enzymol Date: 2015-05-28 Impact factor: 1.600

5. Transcriptome-wide mapping reveals widespread dynamic-regulated pseudouridylation of ncRNA and mRNA.

Authors: Schraga Schwartz; Douglas A Bernstein; Maxwell R Mumbach; Marko Jovanovic; Rebecca H Herbst; Brian X León-Ricardo; Jesse M Engreitz; Mitchell Guttman; Rahul Satija; Eric S Lander; Gerald Fink; Aviv Regev
Journal: Cell Date: 2014-09-11 Impact factor: 41.582

Review 6. RNA modifications: what have we learned and where are we headed?

Authors: Michaela Frye; Samie R Jaffrey; Tao Pan; Gideon Rechavi; Tsutomu Suzuki
Journal: Nat Rev Genet Date: 2016-05-03 Impact factor: 53.242

7. RNA cytosine methylation analysis by bisulfite sequencing.

Authors: Matthias Schaefer; Tim Pollex; Katharina Hanna; Frank Lyko
Journal: Nucleic Acids Res Date: 2008-12-05 Impact factor: 16.971

8. Transcriptome-wide mapping of pseudouridines: pseudouridine synthases modify specific mRNAs in S. cerevisiae.

Authors: Alexander F Lovejoy; Daniel P Riordan; Patrick O Brown
Journal: PLoS One Date: 2014-10-29 Impact factor: 3.240

9. RMBase: a resource for decoding the landscape of RNA modifications from high-throughput sequencing data.

Authors: Wen-Ju Sun; Jun-Hao Li; Shun Liu; Jie Wu; Hui Zhou; Liang-Hu Qu; Jian-Hua Yang
Journal: Nucleic Acids Res Date: 2015-10-12 Impact factor: 16.971

10. HAMR: high-throughput annotation of modified ribonucleotides.

Authors: Paul Ryvkin; Yuk Yee Leung; Ian M Silverman; Micah Childress; Otto Valladares; Isabelle Dragomir; Brian D Gregory; Li-San Wang
Journal: RNA Date: 2013-10-22 Impact factor: 4.942

7 in total

Review 1. Detecting RNA modifications in the epitranscriptome: predict and validate.

Authors: Mark Helm; Yuri Motorin
Journal: Nat Rev Genet Date: 2017-02-20 Impact factor: 53.242

2. Structure-seq2: sensitive and accurate genome-wide profiling of RNA structure in vivo.

Authors: Laura E Ritchey; Zhao Su; Yin Tang; David C Tack; Sarah M Assmann; Philip C Bevilacqua
Journal: Nucleic Acids Res Date: 2017-08-21 Impact factor: 16.971

3. Manganese Ions Individually Alter the Reverse Transcription Signature of Modified Ribonucleosides.

Authors: Marco Kristen; Johanna Plehn; Virginie Marchand; Kristina Friedland; Yuri Motorin; Mark Helm; Stephan Werner
Journal: Genes (Basel) Date: 2020-08-18 Impact factor: 4.096

4. Graphical Workflow System for Modification Calling by Machine Learning of Reverse Transcription Signatures.

Authors: Lukas Schmidt; Stephan Werner; Thomas Kemmer; Stefan Niebler; Marco Kristen; Lilia Ayadi; Patrick Johe; Virginie Marchand; Tanja Schirmeister; Yuri Motorin; Andreas Hildebrandt; Bertil Schmidt; Mark Helm
Journal: Front Genet Date: 2019-09-25 Impact factor: 4.599

Review 5. Analysis of RNA Modifications by Second- and Third-Generation Deep Sequencing: 2020 Update.

Authors: Yuri Motorin; Virginie Marchand
Journal: Genes (Basel) Date: 2021-02-16 Impact factor: 4.096

Review 6. RNA Modification in Inflammatory Bowel Diseases.

Authors: Mika Nakayama; Yuki Ozato; Yoshiko Tsuji; Yasuko Arao; Chihiro Otsuka; Yumiko Hamano; Genzo Sumi; Ken Ofusa; Shizuka Uchida; Andrea Vecchione; Hideshi Ishii
Journal: Biomedicines Date: 2022-07-13

7. Machine learning of reverse transcription signatures of variegated polymerases allows mapping and discrimination of methylated purines in limited transcriptomes.

Authors: Stephan Werner; Lukas Schmidt; Virginie Marchand; Thomas Kemmer; Christoph Falschlunger; Maksim V Sednev; Guillaume Bec; Eric Ennifar; Claudia Höbartner; Ronald Micura; Yuri Motorin; Andreas Hildebrandt; Mark Helm
Journal: Nucleic Acids Res Date: 2020-04-17 Impact factor: 16.971

7 in total