Alternative polyadenylation analysis in animals and plants: newly developed strategies for profiling, processing and validation.

Yunqi Zhang¹, Shane A Carrion¹, Yangzi Zhang¹, Xiaohui Zhang¹, Amy L Zinski¹, Jennifer J Michal¹, Zhihua Jiang¹.

Abstract

Alternative polyadenylation is an essential RNA processing event that contributes significantly to regulation of transcriptome diversity and functional dynamics in both animals and plants. Here we review newly developed next generation sequencing methods for genome-wide profiling of alternative polyadenylation (APA) sites, bioinformatics pipelines for data processing and both wet and dry laboratory approaches for APA validation. The library construction methods LITE-Seq (Low-Input 3'-Terminal sequencing) and PAC-seq (PolyA Click sequencing) tag polyA+ cDNA, while BAT-seq (BArcoded, three-prime specific sequencing) and PAPERCLIP (Poly(A) binding Protein-mediated mRNA 3'End Retrieval by CrossLinking ImmunoPrecipitation) enrich polyA+ RNA. Interestingly, only WTTS-seq (Whole Transcriptome Termini Site sequencing) targets both polyA+ RNA and polyA+ cDNA. A variety of well-established bioinformatics pipelines are available for read quality control, mapping, clustering, characterization and pathway analysis. The RHAPA (RNase H alternative polyadenylation assay) and 3'RACE-seq (3' rapid amplification of cDNA end sequencing) methods directly validate APA sites, while WTSS-seq (whole transcriptome start site sequencing), RNA-seq (RNA sequencing) and public APA databases can serve as indirect validation methods. We hope that these tools, pipelines and resources will stimulate broad interest in the research community in investigating APA events underlying physiological, pathological and psychological changes, and thus in understanding the information transfer events from genome to phenome relevant to economically important traits in both animals and plants.

Keywords: alternative polyadenylation; genome function; processing pipelines; profiling tools; validation approaches

Year: 2018 PMID: 30416385 PMCID: PMC6216028 DOI: 10.7150/ijbs.27168

Source DB: PubMed Journal: Int J Biol Sci ISSN: 1449-2288 Impact factor: 6.580

Introduction

Alternative polyadenylation, which allows a single gene to produce multiple RNA transcripts, is an evolutionarily conserved phenomenon in both animals and plants 1-4. Use of alternative polyadenylation (APA) sites generates 3' untranslated regions (3'UTRs) that differ in sequence composition or nucleotide length 5. Because 3'UTRs often harbor cis-acting regulatory elements and binding sites for trans-acting factors, alternative polyadenylation plays essential roles in regulation of RNA stability, localization, translation and degradation 6. Consequently, the same gene can function quantitatively, qualitatively or epigenetically, depending on APA position, 3'UTR features and regulatory modes. In the quantitative mode, alternative transcripts encode the same protein but carry different 3'UTRs. For example, a transcript with a short 3'UTR can escape negative regulation by microRNAs, but may lose the stability provided by RNA-binding proteins in comparison to a transcript with a long 3'UTR. This regulatory mode contributes quantitatively to gene function 7. Qualitative modes occur when APA sites from the same gene yield transcript isoforms that encode distinct proteins, potentially with distinct properties 8. When a switch in APA usage converts a protein-coding transcript into a non-coding transcript, or into one that encodes a truncated or unstable protein, the regulatory mode is called an epigenetic effect, which ultimately silences the target gene 9. These findings clearly indicate that alternative polyadenylation plays essential roles in coordination of genetic information transfer from genome to phenome. This has triggered a great wave of interest in the research community to develop methods and techniques that can comprehensively capture the 3'-ends of transcripts and thus thoroughly characterize how APA sites influence various physiological, pathological and psychological processes 1, 10-11.
In 2015, Jiang and colleagues 5 reviewed 15 next generation sequencing methods and technologies specifically designed to profile the 3' termini of RNAs with or without restriction enzyme digestion. In the present review, we evaluate methods advanced during the last 2-3 years, summarize data processing strategies and discuss ways to validate the functional significance of APA sites.

Newly developed methods for APA profiling

Here we review five recently developed methods for capturing the 3'-ends of transcripts associated with polyA+ tails: 1) LITE-Seq (Low-Input 3'-Terminal sequencing method) 4; 2) PAC-seq (PolyA Click sequencing method) 12; 3) BAT-seq (BArcoded, Three-Prime specific sequencing method) 13; 4) WTTS-seq (Whole Transcriptome Termini Site sequencing method) 14 and 5) PAPERCLIP (Poly(A) binding Protein-mediated mRNA 3′End Retrieval by CrossLinking ImmunoPrecipitation) 15. Although all of these methods use total RNA as their starting materials, their strategies to enrich the polyA+ ends of transcripts are different. Broadly, the first two methods enrich polyA+ cDNA (complementary DNA), while the last three methods enrich polyA+ RNA to complete construction of the next generation sequencing libraries (Figures 1A and 1B).

Figure 1

Outline of library construction procedures involved in five newly developed methods. (A) LITE-seq and PAC-seq enrich polyA+ cDNA. (B) BAT-seq, WTTS-seq and PAPERCLIP-seq target polyA+ RNAs.

The LITE-Seq library preparation method does not deplete rRNA molecules. It begins with synthesis of full-length cDNA, targeting polyA+ RNAs by reverse transcription with oligo(dT) primers that contain a hairpin structure. Next, a polyA tail is added to the first-strand cDNA, and second-strand synthesis is completed by PCR, which integrates biotinylated half-hairpin primers at the 3' ends (Figure 1A). These full-length, double-stranded cDNA molecules are then fragmented, and polyA+ cDNAs are enriched using streptavidin beads. As in conventional RNA-seq, the remaining steps include end repair, dA tailing, adaptor ligation and PCR amplification using primers compatible with the sequencing platform. In contrast, PAC-seq synthesizes partial cDNAs because the reverse transcription reaction includes azido-nucleotides, which terminate cDNA synthesis once incorporated (Figure 1A). Oligo(dT) primers containing 3' partial adaptors prime first-strand cDNA synthesis so that only polyA-associated fragments are enriched. The click ligation reaction, which is catalyzed by vitamin C and Cu-TBTA at room temperature, then joins the azido-terminated cDNA to 5'-hexynyl-functionalized DNA oligos. The chemically ligated products are purified, amplified by PCR and size-selected for sequencing. While the PAC-seq method involves relatively few steps, the efficiency of the chemical ligation step can be very low 12.
Fragmentation of total RNA is the first step in the BAT-seq, WTTS-seq and PAPERCLIP-seq library preparation methods 13-15 (Figure 1B). However, the subsequent procedures differ considerably, especially between the first and the last two methods (Figure 1B). After targeting polyA+ RNAs with oligo(dT) primers, BAT-seq builds constructs that carry promoters, so that in vitro transcription expresses the polyA+ RNA fragments one more time for enrichment. These multiple amplification steps are required because BAT-seq is designed to capture the 3' ends of transcripts in a single cell. In contrast, WTTS-seq and PAPERCLIP-seq capture polyA+ RNAs immediately after fragmentation, using oligo(dT) beads or polyA binding protein, respectively. Adaptors compatible with the sequencing platforms are then added to the polyA+ RNA fragments by reverse transcription. Regardless of method, size selection is required to obtain appropriately sized products for sequencing.
Among the five methods described above, only our WTTS-seq method was designed for the Ion Torrent sequencing platform 14. WTTS-seq uses a strand-specific sequencing approach in which each read begins with a polyT stretch complementary to the polyA tail. The assay can be redesigned for the Illumina sequencing platform; however, a “low-diversity library” issue may occur if the library is run alone, because the Illumina platform requires equal proportions of the four nucleotides at each read position. Furthermore, we use RNases H and I to destroy all RNA molecules after reverse transcription so that WTTS-seq enriches both polyA+ RNA and polyA+ cDNA 14.
Oligo(dT) primers are used in all five methods to synthesize first-strand cDNA. Two-base anchors are included in the oligo(dT) primers for WTTS-seq and PAPERCLIP-seq, a one-base anchor is included in the primer for LITE-Seq, and no anchor is used for BAT-seq or PAC-seq. The amount of total RNA required for successful library preparation varies among the five methods, from 125 ng (PAC-seq) to 5 µg (WTTS-seq). The time required for library construction also varies; for example, it takes a technician at least 40 hours to create a library using the PAPERCLIP-seq protocol 15.
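To make the notion of primer anchors concrete, the following minimal Python sketch expands an anchored oligo(dT) primer into the concrete sequences it represents. It assumes the anchors are written with the standard IUPAC degenerate codes V (A, C or G) and N (any base); the function name, the parameter names and the T-stretch length of 12 are illustrative choices, not details taken from any of the five protocols.

```python
from itertools import product

# Standard IUPAC degenerate codes commonly used for primer anchors.
IUPAC = {"V": "ACG", "N": "ACGT"}


def anchored_oligo_dt(t_length, anchor):
    """Expand an anchored oligo(dT) primer into its concrete sequences.

    t_length: number of T residues in the primer (illustrative parameter).
    anchor:   IUPAC codes at the 3' end, e.g. "" (no anchor), "V" or "VN".
    """
    choices = [IUPAC[code] for code in anchor]
    # product() with no arguments yields one empty tuple, so an empty
    # anchor returns the plain oligo(dT) stretch.
    return ["T" * t_length + "".join(combo) for combo in product(*choices)]


# Compare how many distinct primers each anchor design represents.
for label, anchor in [("no anchor", ""), ("one-base", "V"), ("two-base", "VN")]:
    print(label, len(anchored_oligo_dt(12, anchor)))
```

Because V excludes T, an anchored primer can only anneal where the polyA tail meets the transcript body, whereas an unanchored primer can prime anywhere along the tail.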

Bioinformatics pipelines for APA characterization

In general, data processing includes raw read quality control, 5' or 3' end trimming, genome mapping, APA clustering and characterization, assessment of differential expression and pathway enrichment. Once libraries are separated by their individual barcodes, raw reads are checked for quality and sequencing scores using, for example, CIMS (crosslinking induced mutation sites) (http://zhanglab.c2b2.columbia.edu/index.php/CIMS), FastQC (a quality control tool for high throughput sequence data) (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) or FASTX Toolkit version 0.0.13.1 (http://hannonlab.cshl.edu/fastx_toolkit/). Depending on the library construction protocol and sequencing platform, reads that contain APA sites should have either poly(T) at their 5'-ends or poly(A) at their 3'-ends. To improve the read mapping success rate, it is best to trim off these poly(T) or poly(A) stretches because they do not exist at the DNA level. Software such as Cutadapt 16 or a Perl script 14 can be used to complete the trimming process. For read mapping, selection of a well-assembled genome is essential; assemblies can be downloaded from either the UCSC Genome Browser (https://genome.ucsc.edu/) or the NCBI ftp site (https://ftp.ncbi.nih.gov/genomes/). NovoAlign (http://www.novocraft.com/products/novoalign/), GSNAP (genome short-read nucleotide alignment program) 17, Bowtie (http://bowtie-bio.sourceforge.net/index.shtml) 18 and TopHat2 (https://ccb.jhu.edu/software/tophat/index.shtml) 19 are frequently used to map reads to different genome/gene regions. CIMS (http://zhanglab.c2b2.columbia.edu/index.php/CIMS), F-seq 20, PAcluster 21 and PlantAPA 22 can be used to call APA clusters, usually within a 20-30 bp window. There are several ways to characterize APA sites by type or category. 3'UTR and CDS-APA sites can be separated using closestBed from the BEDtools suite 23. This simple classification allows examination of APA switching events within 3'UTRs or between coding regions and 3'UTRs 24. Genomic features, such as 5'UTR, intron, exon, 3'UTR and intergenic regions, can also be used to classify APA sites 25. Differentially expressed APA (DE-APA) sites can be determined using DEXSeq, DESeq2, edgeR, HTSeq and XBSeq2 26-30. Differentially expressed genes (DEGs) associated with DE-APA sites can be tested for enrichment of GO terms and pathways using, for example, PantherDB 31, KEGG (Kyoto Encyclopedia of Genes and Genomes, http://www.genome.jp/kegg/) and DAVID (the database for annotation, visualization and integrated discovery, https://david.ncifcrf.gov/). The bioinformatics pipelines for APA processing described above have been well tested by the scientific community 24, 32-35.
Figure 2 illustrates the bioinformatics pipeline we have developed to analyze our WTTS-seq datasets 1, 14, 36-37. First, we use TMAP (version 3.4.1, https://github.com/iontorrent/TMAP) to map reads to genomes because the package is well suited to libraries sequenced on an Ion PGM™ Sequencer. Second, we explore gene biotypes, such as protein-coding genes, long non-coding RNA (lncRNA) genes, microRNAs, pseudogenes and small RNAs, annotations for which can be downloaded from NCBI databases for the species of interest. APA usage differs significantly among gene biotypes: average APA usage per gene is extremely high in protein-coding genes, moderate in lncRNAs and pseudogenes and low in small RNAs and miRNAs 1, 36-37. Third, we use the Cuffcompare (v2.2.1) program 38 to classify APA sites into 1) genic regions with class codes c (or cAPA sites, confined to exonic regions), e (or eAPA sites, extending from exonic regions into intronic regions by at least 10 bp), i (or iAPA sites, located entirely within intronic regions), o (or oAPA sites, exonic regions with extension), p (or pAPA sites, located within 2 kb downstream of reference transcripts) and x (or xAPA sites, exonic regions, but in the opposite direction); and 2) intergenic regions with class code u (or uAPA sites, which remain unknown). Different gene biotypes tend to favor particular class codes. Lastly, we employ the Metascape program 39 for pathway enrichment. A distinctive feature of this program is that it accepts multiple gene lists and performs their pathway analyses simultaneously.
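The trimming step described in this section can be sketched in a few lines of Python. This is a simplified illustration, not the Cutadapt algorithm or the Perl script cited above: it only removes an uninterrupted poly(T) run at the 5' end or poly(A) run at the 3' end, and the function name, the returned labels and the minimum run length of 8 are assumptions made for the example.

```python
import re

LEADING_POLY_T = re.compile(r"^T+")
TRAILING_POLY_A = re.compile(r"A+$")


def trim_homopolymer(read, min_run=8):
    """Return (trimmed_read, label) for a single read sequence.

    A leading poly(T) run is removed first; if none qualifies, a trailing
    poly(A) run is removed. Runs shorter than min_run (an assumed
    threshold) are left untouched and the label is "none".
    """
    seq = read.upper()
    match = LEADING_POLY_T.match(seq)
    if match and match.end() >= min_run:
        return seq[match.end():], "polyT_5prime"
    match = TRAILING_POLY_A.search(seq)
    if match and match.end() - match.start() >= min_run:
        return seq[:match.start()], "polyA_3prime"
    return seq, "none"


# Hypothetical reads: one starting with poly(T), one ending with poly(A).
for read in ["TTTTTTTTTTACGTACGT", "ACGTACGTAAAAAAAAAA", "TTTACGT"]:
    print(trim_homopolymer(read))
```

The label records which end carried the homopolymer, which is useful later because it indicates on which side of the trimmed read the polyadenylation site lies.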

Figure 2

WTTS-seq raw data processing and bioinformatics pipeline. Data analysis usually involves quality control, read mapping, APA clustering, location assignment and classification for characterization.
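The APA clustering step can be illustrated with a simple greedy, window-based grouping of mapped read ends. The sketch below is an assumption-laden simplification and does not reproduce the algorithms of CIMS, F-seq, PAcluster or PlantAPA: the input format (chromosome, strand, position), the function names and the default 25 bp window, chosen from the 20-30 bp range mentioned in the text, are all hypothetical.

```python
from collections import Counter, defaultdict


def _summarize(chrom, strand, group, pos_counts):
    # Representative site: the most frequently used position in the cluster.
    peak = max(group, key=lambda p: pos_counts[p])
    return {
        "chrom": chrom,
        "strand": strand,
        "start": group[0],
        "end": group[-1],
        "peak": peak,
        "reads": sum(pos_counts[p] for p in group),
    }


def cluster_apa_sites(sites, window=25):
    """Group read 3'-end positions into APA clusters.

    sites:  iterable of (chrom, strand, position) tuples, one per read.
    window: maximum span of a cluster in bp (assumed default).
    """
    counts = defaultdict(Counter)
    for chrom, strand, pos in sites:
        counts[(chrom, strand)][pos] += 1

    clusters = []
    for (chrom, strand), pos_counts in sorted(counts.items()):
        group = []  # sorted positions of the cluster being built
        for pos in sorted(pos_counts):
            # Close the cluster once the next site falls outside the window.
            if group and pos - group[0] >= window:
                clusters.append(_summarize(chrom, strand, group, pos_counts))
                group = []
            group.append(pos)
        if group:
            clusters.append(_summarize(chrom, strand, group, pos_counts))
    return clusters


# Hypothetical read ends on one chromosome, both strands.
example = [("chr1", "+", 100), ("chr1", "+", 102), ("chr1", "+", 102),
           ("chr1", "+", 110), ("chr1", "+", 200), ("chr1", "-", 105)]
for cluster in cluster_apa_sites(example):
    print(cluster)
```

Clusters are built separately for each strand, so sites at nearby coordinates on opposite strands are never merged.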

Wet and dry laboratory approaches for APA validation

It is preferable to use multiple methods to validate APA sites and their expression abundances. Here we focus on two wet laboratory approaches designed to directly validate APA sites: RHAPA (RNase H alternative polyadenylation assay) 40 and 3'RACE-seq (3' rapid amplification of cDNA end sequencing) 41. In RHAPA, gene-specific oligonucleotides are synthesized and hybridized to all alternative polyadenylation transcripts to be validated, followed by RNase H digestion. The digested products are then used to synthesize the first cDNA strand, only for RNA fragments that contain polyA tails. This procedure avoids synthesis of overlapping cDNAs among alternative transcripts. Finally, qRT-PCR is carried out using alternative transcript-specific primers to directly measure and quantify each transcript. This method does not effectively distinguish APA sites that are less than 100 bp apart. 3'RACE-seq combines conventional RACE with high-throughput sequencing 41. The authors used the commercial 3'RACE adaptor 5'-GCGAGCACAGAATTAATACGACTCACTATAGGT12VN-3' (Promega, Madison, WI, USA) for cDNA synthesis, followed by RACE amplification using oligonucleotides adjacent to the initiation codon as forward primers. The amplified products are then sequenced on high-throughput platforms, and reads are mapped to genes for APA counts. A potential drawback of this method is that some APA sites can be missed if their transcripts use alternative transcriptional start sites. For this reason, we recently began using WTSS-seq (whole transcriptome start site sequencing) to indirectly validate WTTS-seq results. Interestingly, the two methods agree well in terms of functional pathways 36-37.
Dry laboratory approaches can be used to generate additional evidence for indirect validation of APA sites. For example, QAPA (Quantification of alternative polyadenylation) 42 and APAtrap 43 are recently released software packages that can be used to systematically retrieve APA sites from RNA-seq data. In addition, at least three APA databases, PolyA_DB (http://www.polya-db.org/v3), APASdb (http://mosas.sysu.edu.cn/utr) and APADB (http://tools.genxpro.net/apadb/) 44-46, have been established to provide information on APA variants, location, usage and signals. PolyA_DB covers four species (human, mouse, rat and chicken); APASdb holds APA information on humans, mice and zebrafish; and APADB includes APA sites for humans, chickens and mice. We plan to establish our own APA resources for cattle, chicken, mouse, rat and Xenopus tropicalis in the near future.
In summary, tools, pipelines and resources to characterize alternative polyadenylation events in cells, tissues and even whole organisms derived from both animals and plants are well developed. Our recent studies indicate that APA sites are sensitive and powerful biomarkers that illustrate information flows from genome to phenome under unique internal and external environments. As such, we believe that characterization of alternative polyadenylation events will provide novel insights into genome function related to genetic complexity of economically important traits in animals and plants.

44 in total

Review 1. Understanding Neurodevelopmental Disorders: The Promise of Regulatory Variation in the 3'UTRome.

Authors: Kai A Wanke; Paolo Devanna; Sonja C Vernes
Journal: Biol Psychiatry Date: 2017-11-14 Impact factor: 13.382

2. F-Seq: a feature density estimator for high-throughput sequence tags.

Authors: Alan P Boyle; Justin Guinney; Gregory E Crawford; Terrence S Furey
Journal: Bioinformatics Date: 2008-09-10 Impact factor: 6.937

3. HTSeq--a Python framework to work with high-throughput sequencing data.

Authors: Simon Anders; Paul Theodor Pyl; Wolfgang Huber
Journal: Bioinformatics Date: 2014-09-25 Impact factor: 6.937

4. PolyA_DB 3 catalogs cleavage and polyadenylation sites identified by deep sequencing in multiple genomes.

Authors: Ruijia Wang; Ram Nambiar; Dinghai Zheng; Bin Tian
Journal: Nucleic Acids Res Date: 2018-01-04 Impact factor: 16.971

5. Activity-Dependent Regulation of Alternative Cleavage and Polyadenylation During Hippocampal Long-Term Potentiation.

Authors: Mariana M Fontes; Aysegul Guvenek; Riki Kawaguchi; Dinghai Zheng; Alden Huang; Victoria M Ho; Patrick B Chen; Xiaochuan Liu; Thomas J O'Dell; Giovanni Coppola; Bin Tian; Kelsey C Martin
Journal: Sci Rep Date: 2017-12-12 Impact factor: 4.379

6. A novel method for genome-wide profiling of dynamic host-pathogen interactions using 3' end enriched RNA-seq.

Authors: Jie Li; Liangliang He; Yun Zhang; Chunyi Xue; Yongchang Cao
Journal: Sci Rep Date: 2017-08-17 Impact factor: 4.379

7. QAPA: a new method for the systematic analysis of alternative polyadenylation from RNA-seq data.

Authors: Kevin C H Ha; Benjamin J Blencowe; Quaid Morris
Journal: Genome Biol Date: 2018-03-28 Impact factor: 13.583

8. Capturing the Alternative Cleavage and Polyadenylation Sites of 14 NAC Genes in Populus Using a Combination of 3'-RACE and High-Throughput Sequencing.

Authors: Haoran Wang; Mingxiu Wang; Qiang Cheng
Journal: Molecules Date: 2018-03-08 Impact factor: 4.411

9. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.

Authors: Mark D Robinson; Davis J McCarthy; Gordon K Smyth
Journal: Bioinformatics Date: 2009-11-11 Impact factor: 6.937

10. APADB: a database for alternative polyadenylation and microRNA regulation events.

Authors: Sören Müller; Lukas Rycak; Fabian Afonso-Grunz; Peter Winter; Adam M Zawada; Ewa Damrath; Jessica Scheider; Juliane Schmäh; Ina Koch; Günter Kahl; Björn Rotter
Journal: Database (Oxford) Date: 2014-07-22 Impact factor: 3.451
