Literature DB >> 23434047

MeRIP-PF: an easy-to-use pipeline for high-resolution peak-finding in MeRIP-Seq data.

Yuli Li¹, Shuhui Song, Cuiping Li, Jun Yu.

Abstract

RNA modifications, especially methylation of the N(6) position of adenosine (A)-m(6)A, represent an emerging research frontier in RNA biology. With the rapid development of high-throughput sequencing technology, in-depth study of m(6)A distribution and function relevance becomes feasible. However, a robust method to effectively identify m(6)A-modified regions has not been available yet. Here, we present a novel high-efficiency and user-friendly analysis pipeline called MeRIP-PF for the signal identification of MeRIP-Seq data in reference to controls. MeRIP-PF provides a statistical P-value for each identified m(6)A region based on the difference of read distribution when compared to the controls and also calculates false discovery rate (FDR) as a cut off to differentiate reliable m(6)A regions from the background. Furthermore, MeRIP-PF also achieves gene annotation of m(6)A signals or peaks and produce outputs in both XLS and graphical format, which are useful for further study. MeRIP-PF is implemented in Perl and is freely available at http://software.big.ac.cn/MeRIP-PF.html.

Entities: Chemical Disease Gene Species

Mesh：

Substances：

Year: 2013 PMID： 23434047 PMCID： PMC4357668 DOI： 10.1016/j.gpb.2013.01.002

Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN： 1672-0229 Impact factor: 7.691

Introduction

Multiple layers of epigenetic regulation including modification of DNA, RNA and proteins have been intensively explored. One of the major RNA modifications, m6A—methylation of the N6 position of adenosine (A)—represents an emerging research frontier in RNA biology and medicine [1]. Post transcriptionally added, m6A is an enzymatic modification of RNAs and the most common form found in the internal sequences of mRNAs in eukaryotes, as well as RNAs in nuclear-replicating viruses [2]. Unlike A-to-I RNA editing, m6A is nonstoichiometric and does not alter the coding capacity of transcripts [3,4] and may play roles in regulating RNA expression [1]. Along with the development of analytical and detection methods, such as MeRIP-Seq or m6A-seq [4,5], researchers are now able to carry out in-depth studies on m6A distribution and function of related genes. Next-generation sequencing (NGS) technologies render the identification of m6A-specific methylation of mRNAs possible through target enrichment, including immunoprecipitation [4,5]. As this strategy becomes popular, challenges are yet to be met for efficient data analysis, especially in peak finding. Due to fundamental technical differences of experimental protocols and the nature of the raw data, the existing peak-finding software packages, such as MACS [6] for ChIP-Seq and PARalyzer [7] for PAR-CLIP, are not suitable for MeRIP-Seq peak-finding. More importantly, there have not been tools or algorithms publically available for MeRIP-Seq data analysis yet. Here, we present a highly-efficient and easy-to-use analysis pipeline, named MeRIP-PF (MeRIP-Seq data peak finding), which is a publicly-available tool and specially developed for the peak-calling of MeRIP-Seq data. MeRIP-PF has a powerful graphical display and can be readily used for the identification of m6A-modified regions and gene annotation.

Method

Formulation

If an mRNA contains m6A modification sites, the RNA fragment containing the sites can be pulled down by the anti-m6A antibody and sequenced subsequently. The m6A modified regions can be identified by mapping the sequencing reads to the reference sequence of a genome through comparison of read counts between the sample and the control (Figure 1A).

Figure 1

MeRIP-Seq, MeRIP-PF and result processing A. Schematic of MeRIP-Seq and MeRIP-PF pipelines (red dots indicate m6A sites). B. Distribution of m6A peaks in different regions of transcripts. Pie chart showing the percentage of m6A peaks within distinct regions of RNA. NP, non-protein-coding genes; PR, protein-coding genes. C. Distribution of m6A peaks along mRNAs. 5′ UTRs, CDSs and 3′ UTRs of every transcript are separately binned into regions spanning 1% of their total lengths. Y-coordinates represent percentage of m6A peaks located in every bin. D. An example of transcripts in wig plot. Y-coordinates show the read coverage of every position in transcripts. NM_009484 was taken as an example. Different regions of transcripts are color coded with UTR indicated in blue and CDS in pink, and intronic regions are indicated with blank space. Red triangle indicates the position of m6A peaks.

Input data

The pipeline requires two Fastq-formatted data. One is sequencing reads from m6A-containing RNA immunoprecipitated (MeRIP) sample, and the other is the corresponding reads from non-IP transcriptome, which serves as the background control. Additionally, users also need to prepare the genome reference sequence of the species and several annotated BED files with gene structure information, which we recommend downloading from the UCSC database (http://genome.ucsc.edu/cgi-bin/hgTables) directly.

Processing

Figure 1A shows the process of MeRIP-Seq and the method for detecting m6A peaks (see detailed steps in Figure S1). We integrated four modules, including mapping, testing, annotating and plotting, into one command program to complete the analysis. The first module, i.e., sequence mapping, specifically yields two datasets in SAM-format, one from the MeRIP sample and the other from the control, by using the BWA software [8]. The uniquely-mapped reads (MAPQ ⩾ 20) are converted into BED files using SAMtools [9] and BEDTools [10]. The second module identifies m6A-modified regions in high resolution in reference to read coverage. The m6A signal peaks are defined based on comparison of read counts between the MeRIP data and the control data with a fixed 25-bp window across the genome. Based on one-tailed Fisher’s exact test and Benjamini–Hochberg method [11], both P-value and adjusted P-value (FDR) for each 25-bp window are calculated continuously to define significant sequence windows (FDR ⩽ 0.05). The significant and adjacent (no gaps present) windows are concatenated, while only those with appropriate sizes are considered as reliable and real peaks. The third module is used to annotate each peak in terms of peaks and genes (see Tables S1 and S2 where we just presented partial results derived from the published data [5]), and an enrichment score for each peak is also calculated. The fourth module is used to analyze m6A peak distribution at transcriptomic level and display read distribution along transcripts graphically. The peak counts are based on non-protein-coding and protein-coding genes, which are further divided into coding sequences (CDSs), introns, and untranslated regions (UTRs).

Output

MeRIP-PF generates 4 output files that contain the complete information of m6A modification profile. “Reads_Overview.txt” file supplies the basic status of the two sequencing datasets; “Peak_All.xls” file provides the absolute positions of m6A peaks in genome and mRNA regions; “Gene_List.xls” file presents the annotating information of m6A peaks at the gene level; and “Plot_Fig.pdf” file shows read distribution in the control and peak distribution in the MeRIP sample (Figure 1B), the distribution of m6A peaks along mRNAs (Figure 1C), and wig plots for transcripts with m6A peaks (Figure 1D).

Implementation

The pipeline program is written in Perl and runs in a Linux machine cluster with each node consisting of 8 cores with 2.00 GHz processor and 16G RAM. MeRIP-PF requires the installation of Perl and R language program, BWA, SAMtools and BEDTools. MeRIP-PF, along with an implementation file of the described method and Demo Datasets can be downloaded from the project website http://software.big.ac.cn/MeRIP-PF.html. The package requires the programming environment R. The R software is available at the website “The R Project for Statistical Computing” (http://www.r-project.org/).

Results and discussion

We tested MeRIP-PF performance using published adult mouse brains data (GSM854223 and GSM854224) with more than 30 millions reads for each sample [5]. By MeRIP-PF, we caught all high-confidence peaks in the previous study [5], which we named as “HC peaks”, but there are some differences between these two reports, which were shown in Table 1. We classified all peaks into four groups according to the fraction of peak region overlapping. In group A, 9555 peaks were “identical” between MeRIP-PF and “HC peaks”, which showed more than 75% overlap for each peak region. In group B, 3719 peaks were “similar”, which had 50–75% overlap for each peak region. In group C, 1772 peaks of MeRIP-PF had 0–50% overlap with 147 peaks in “HC peaks”, and we defined this group as “different”. In group D, there were 5153 peaks that had no overlap with “HC peaks” and we defined this group as “new”.

Table 1

Peak-finding performance of MeRIP-PF

Methods	No. of peaks	Group A	Group B	Group C	Group D
Methods	No. of peaks	Identical peaks (overlap ⩾ 75%)	Similar peaks (50% ⩽ overlap < 75%)	Different peaks (overlap < 50%)	New peaks (overlap = 0%)
Ref. [5]	13,471	9555	3719	197	⧹
MeRIP-PF	20,199	9555	3719	1772	5153

Note: The “overlap” we mean is reciprocal for peak M and peak N. In other words, if “overlap” is more than 75%, it requires that peak N overlaps at least 75% of peak M and that peak M also overlaps at least 75% of peak N.

We detected all the high-confidence peaks in the previous report [5], but there remained some differences in some peak regions in group A, B and C. Since the two methods were similar, the peak region differences might be due to the following two reasons: (1) although we both used default parameters of BWA software for reads mapping, use of different versions may contribute to difference of mapping results (we used bwa-0.6.2); (2) although uniquely-mapped reads were taken for further peaks calling in both reports, we regarded reads with MAPQ (mapping quality) ⩾ 20 as uniquely-mapped, while their criterion of filtering was not very clear. The differences of total uniquely-mapped read count and read number falling in each 25-bp window resulted in shift of peak regions. About those “new” peaks in group D, besides the aforementioned explanations, more importantly, they used multiple replicates samples to improve the confidence of m6A peaks [5]. We only used one replicate for analysis, so the “new” peaks are probably the low-confidence peaks. The distribution of time spent on processing GSM854223 and GSM854224 datasets is presented in Table 2. Apparently, the total time required depends much on mapping procedure which is closely related to sequencing quantity.

Table 2

Time spent on processing GSM854223/GSM854224 datasets

Process	Timing (min)
Mapping	∼200
Statistical test	∼15
Annotation	∼10
Plotting	∼10

Conclusion

As the first specific and easy-to-use pipeline for MeRIP-Seq data, MeRIP-PF provides outputs in both XLS and graphical format, which are important for deep analysis of m6A modification. We integrated four modules, including mapping, testing, annotating and plotting, into one command program to complete the analysis. The pipeline applies to MeRIP-Seq data with corresponding control samples.

Authors’ contributions

YL drafted the manuscript and developed the software. CL participated in the software design. SS proposed the idea of the software and revised the manuscript. JY revised the manuscript. All authors have read and approved the final manuscript.

Competing interests

The authors have no competing interests to declare.

9 in total

1. Grand challenge commentary: RNA epigenetics?

Authors: Chuan He
Journal: Nat Chem Biol Date: 2010-12 Impact factor: 15.040

2. Comprehensive analysis of mRNA methylation reveals enrichment in 3' UTRs and near stop codons.

Authors: Kate D Meyer; Yogesh Saletore; Paul Zumbo; Olivier Elemento; Christopher E Mason; Samie R Jaffrey
Journal: Cell Date: 2012-05-17 Impact factor: 41.582

3. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

4. BEDTools: a flexible suite of utilities for comparing genomic features.

Authors: Aaron R Quinlan; Ira M Hall
Journal: Bioinformatics Date: 2010-01-28 Impact factor: 6.937

5. PARalyzer: definition of RNA binding sites from PAR-CLIP short-read sequence data.

Authors: David L Corcoran; Stoyan Georgiev; Neelanjan Mukherjee; Eva Gottwein; Rebecca L Skalsky; Jack D Keene; Uwe Ohler
Journal: Genome Biol Date: 2011-08-18 Impact factor: 13.583

6. Identification of recognition residues for ligation-based detection and quantitation of pseudouridine and N6-methyladenosine.

Authors: Qing Dai; Robert Fong; Mridusmita Saikia; David Stephenson; Yi-tao Yu; Tao Pan; Joseph A Piccirilli
Journal: Nucleic Acids Res Date: 2007-09-18 Impact factor: 16.971

7. Fast and accurate short read alignment with Burrows-Wheeler transform.

Authors: Heng Li; Richard Durbin
Journal: Bioinformatics Date: 2009-05-18 Impact factor: 6.937

8. Model-based analysis of ChIP-Seq (MACS).

Authors: Yong Zhang; Tao Liu; Clifford A Meyer; Jérôme Eeckhoute; David S Johnson; Bradley E Bernstein; Chad Nusbaum; Richard M Myers; Myles Brown; Wei Li; X Shirley Liu
Journal: Genome Biol Date: 2008-09-17 Impact factor: 13.583

9. Topology of the human and mouse m6A RNA methylomes revealed by m6A-seq.

Authors: Dan Dominissini; Sharon Moshitch-Moshkovitz; Schraga Schwartz; Mali Salmon-Divon; Lior Ungar; Sivan Osenberg; Karen Cesarkas; Jasmine Jacob-Hirsch; Ninette Amariglio; Martin Kupiec; Rotem Sorek; Gideon Rechavi
Journal: Nature Date: 2012-04-29 Impact factor: 49.962

9 in total

7 in total

1. Decomposition of RNA methylome reveals co-methylation patterns induced by latent enzymatic regulators of the epitranscriptome.

Authors: Lian Liu; Shao-Wu Zhang; Yu-Chen Zhang; Hui Liu; Lin Zhang; Runsheng Chen; Yufei Huang; Jia Meng
Journal: Mol Biosyst Date: 2014-11-05

2. RNA Framework: an all-in-one toolkit for the analysis of RNA structures and post-transcriptional modifications.

Authors: Danny Incarnato; Edoardo Morandi; Lisa Marie Simon; Salvatore Oliviero
Journal: Nucleic Acids Res Date: 2018-09-19 Impact factor: 16.971

3. Comprehensive Analysis of Expression Regulation for RNA m6A Regulators With Clinical Significance in Human Cancers.

Authors: Xiaonan Liu; Pei Wang; Xufei Teng; Zhang Zhang; Shuhui Song
Journal: Front Oncol Date: 2021-02-23 Impact factor: 6.244

4. m6A-Driver: Identifying Context-Specific mRNA m6A Methylation-Driven Gene Interaction Networks.

Authors: Song-Yao Zhang; Shao-Wu Zhang; Lian Liu; Jia Meng; Yufei Huang
Journal: PLoS Comput Biol Date: 2016-12-27 Impact factor: 4.475

5. Spatially Enhanced Differential RNA Methylation Analysis from Affinity-Based Sequencing Data with Hidden Markov Model.

Authors: Yu-Chen Zhang; Shao-Wu Zhang; Lian Liu; Hui Liu; Lin Zhang; Xiaodong Cui; Yufei Huang; Jia Meng
Journal: Biomed Res Int Date: 2015-08-02 Impact factor: 3.411

6. Transcriptome-wide N⁶-methyladenosine profiling of rice callus and leaf reveals the presence of tissue-specific competitors involved in selective mRNA modification.

Authors: Yuli Li; Xiliang Wang; Cuiping Li; Songnian Hu; Jun Yu; Shuhui Song
Journal: RNA Biol Date: 2014 Impact factor: 4.652

7. m6aViewer: software for the detection, analysis, and visualization of N⁶-methyladenosine peaks from m⁶A-seq/ME-RIP sequencing data.

Authors: Agne Antanaviciute; Belinda Baquero-Perez; Christopher M Watson; Sally M Harrison; Carolina Lascelles; Laura Crinnion; Alexander F Markham; David T Bonthron; Adrian Whitehouse; Ian M Carr
Journal: RNA Date: 2017-07-19 Impact factor: 4.942

7 in total