Runxuan Zhang, Richard Kuo, Max Coulter, Cristiane P G Calixto, Juan Carlos Entizne, Wenbin Guo, Yamile Marquez, Linda Milne, Stefan Riegler, Akihiro Matsui, Maho Tanaka, Sarah Harvey, Yubang Gao, Theresa Wießner-Kroh, Alejandro Paniagua, Martin Crespi, Katherine Denby, Asa Ben Hur, Enamul Huq, Michael Jantsch, Artur Jarmolowski, Tino Koester, Sascha Laubinger, Qingshun Quinn Li, Lianfeng Gu, Motoaki Seki, Dorothee Staiger, Ramanjulu Sunkar, Zofia Szweykowska-Kulinska, Shih-Long Tu, Andreas Wachter, Robbie Waugh, Liming Xiong, Xiao-Ning Zhang, Ana Conesa, Anireddy S N Reddy, Andrea Barta, Maria Kalyna, John W S Brown.
Abstract
BACKGROUND: Accurate and comprehensive annotation of transcript sequences is essential for transcript quantification and differential gene and transcript expression analysis. Single-molecule long-read sequencing technologies provide improved integrity of transcript structures including alternative splicing, and transcription start and polyadenylation sites. However, accuracy is significantly affected by sequencing errors, mRNA degradation, or incomplete cDNA synthesis.
Keywords: Alternative polyadenylation; Alternative splicing; Arabidopsis; Iso-seq; Reference transcript dataset; Splice junction; Transcription start and end sites
Year: 2022 PMID: 35799267 PMCID: PMC9264592 DOI: 10.1186/s13059-022-02711-0
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 17.906
Fig. 1 Workflow of analysis of PacBio Iso-sequencing. A Raw reads are analyzed using the PacBio Iso-seq 3 pipeline to generate full-length non-chimeric reads (FLNCs), which are mapped to the genome (blue boxes). B Mapped FLNCs are collapsed and merged using TAMA to generate transcripts (pink boxes). C Transcripts are quality controlled using datasets of high-confidence (HC) splice junctions (SJs) and transcript start and end sites (TSS/TES). Transcripts with unsupported splice junctions, where reads contain mismatches within ±10 nt of an SJ, are removed. Transcripts with both high-confidence TSS and TES (determined by binomial probability for highly expressed genes and by end support with > 2 reads for lowly expressed genes) are retained as HC transcripts. The remaining transcripts, which have partial or no TSS and/or TES support, are removed unless they overlap with annotated gene loci. These transcripts, from genes with low coverage by Iso-seq, are combined with the HC transcripts to form AtIso (Arabidopsis Iso-seq based transcriptome).
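The quality-control decisions in panel C can be sketched as a small classifier. This is only an illustrative sketch, not the authors' pipeline: the `Junction` class, the function names, and the returned labels are hypothetical, and the TSS/TES support flags are assumed to have been computed beforehand (for example by the binomial test or the > 2 reads rule described in the caption).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Junction:
    """Hypothetical record for one splice junction of a transcript."""
    position: int                                   # coordinate of the junction
    mismatch_positions: List[int] = field(default_factory=list)  # read mismatches nearby


def junction_supported(junction: Junction, window: int = 10) -> bool:
    """Return True if no read mismatch lies within +/- `window` nt of the junction."""
    return all(abs(m - junction.position) > window
               for m in junction.mismatch_positions)


def classify_transcript(junctions: List[Junction],
                        has_hc_tss: bool,
                        has_hc_tes: bool,
                        overlaps_annotated_gene: bool,
                        window: int = 10) -> str:
    """Apply the caption's filtering order; labels are illustrative only."""
    # Step 1: any junction with a nearby mismatch removes the transcript.
    if not all(junction_supported(j, window) for j in junctions):
        return "removed"
    # Step 2: both ends supported -> high-confidence transcript.
    if has_hc_tss and has_hc_tes:
        return "HC"
    # Step 3: partial or no end support is kept only at annotated gene loci.
    if overlaps_annotated_gene:
        return "retained_low_coverage"
    return "removed"
```

Under these assumptions, AtIso would correspond to the union of transcripts labeled `HC` and `retained_low_coverage`.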
Fig. 2 Impact of mismatches around splice junctions on the accuracy of their determination. A Splice junctions (SJs) shared by AtRTD2 and Iso-seq (LDE_30; sjt_30) and unique to each. B Position weight matrix (PWM) scores for splice sites unique to Iso-seq transcripts and shared with AtRTD2. PWM scores for 5′ and 3′ splice site sequences from SJs shared between AtRTD2 and Iso-seq transcripts (high confidence) are significantly higher (t-test, p < 2.26e−16) than those unique to Iso-seq (low confidence). C, D Distribution of the number of mismatches in each position 30 nt upstream (C) and 30 nt downstream (D) of SJs unique to Iso-seq (low confidence) and shared with AtRTD2 (high confidence) (see Additional File 1: Tables S3A,B). E Filtering of SJs: the graph shows the number of SJs remaining (expressed as a percentage) after the cumulative removal of SJs with mismatches in the first n positions (1, 2, 3, etc.) flanking SJs (see Additional File 1: Tables S5A,B).
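The cumulative filtering plotted in panel E can be illustrated with a short calculation. This is a minimal sketch under an assumed input format: each SJ is represented only by the distance (in nt, counted from 1) of its closest flanking mismatch, or `None` if no mismatch was observed. The function name and input layout are hypothetical.

```python
from typing import List, Optional


def percent_remaining(closest_mismatch: List[Optional[int]],
                      max_n: int = 30) -> List[float]:
    """For n = 1..max_n, return the percentage of SJs left after removing
    every SJ that has a mismatch within the first n flanking positions."""
    total = len(closest_mismatch)
    if total == 0:
        raise ValueError("at least one splice junction is required")
    percentages = []
    for n in range(1, max_n + 1):
        # An SJ survives if it has no mismatch at all, or its closest
        # mismatch lies beyond position n.
        kept = sum(1 for d in closest_mismatch if d is None or d > n)
        percentages.append(100.0 * kept / total)
    return percentages


# Four toy SJs: closest mismatches at positions 1, 3 and 5, and one clean SJ.
curve = percent_remaining([1, 3, None, 5], max_n=5)
```

Because removal is cumulative, the resulting curve can only stay flat or decrease as n grows.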
Fig. 3 Enrichment of sequence motifs associated with TSS and TES sites. A–D TSS sites: A TATA box, B Initiator (Inr), C Y-patch, D Kozak translation start site consensus motif. E, F TES sites: E CFIm binding site and F PAS. Lines indicate the number of motifs found in relation to start and end sites from Iso-seq (blue); from Morton et al. [41] (A–D) and Sherstnev et al. [44] (E, F) (red); and from a random control (gray).
Fig. 4 Gene and transcript characteristics of AtRTD3. A Protein-coding and non-protein-coding genes. B Mono-exonic and multi-exonic genes. C Mono- and multi-exonic genes with single/multiple transcript isoforms for all genes and D for protein-coding genes. E Distribution of transcripts from protein-coding genes (protein-coding and unproductive isoforms) and from non-protein-coding genes. F Protein-coding transcripts with little or no impact on coding sequence (NAGNAG/AS in UTR) and protein-coding variants. G Distribution of transcripts with NAGNAG, AS in 5′ UTR, and AS in 3′ UTR. H Distribution of NMD features among unproductive transcripts from protein-coding genes. DSSJ: downstream splice junction; OUORF: overlapping upstream open reading frame.
Fig. 5 Correlation of splicing ratios calculated from RNA-seq data using different RTDs with those from HR RT-PCR data. Splicing ratios for 226 AS events from 71 Arabidopsis genes (three biological replicates of the time-points T5 and T20) generated 1349 data points in total. The splicing ratio of individual AS transcripts to the cognate fully spliced (FS) transcript was calculated from TPMs generated by Salmon and A Araport11, B AtRTD2-QUASI, C AtIso, and D AtRTD3 and compared to the ratio from HR RT-PCR. E Correlation coefficients are given for each plot. Note that, for clarity of the figures, data points with values that lie substantially outside the range of the graphs are not included in A–D but are included in the correlation values and shown in Additional File 2: Fig. S11.
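The comparison in this figure can be sketched in a few lines. This is a hedged illustration, not the authors' code: the ratio is taken here as TPM of the AS transcript divided by TPM of the FS transcript, following the caption's wording, and Pearson correlation is an assumption, since the caption does not name the coefficient. All function names are hypothetical.

```python
import math
from typing import List, Optional


def splicing_ratio(tpm_as: float, tpm_fs: float) -> Optional[float]:
    """Ratio of an AS transcript to its cognate fully spliced transcript.
    Returns None when the FS transcript has zero TPM (ratio undefined)."""
    if tpm_fs == 0:
        return None
    return tpm_as / tpm_fs


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

In use, one list would hold the RNA-seq ratios obtained with a given RTD and the other the matching HR RT-PCR ratios, after dropping events for which either ratio is undefined.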
Fig. 6 Differential TSS and TES usage. Pairs of transcript isoforms with significant isoform switches and different TSS (A–D) and TES (E, F). A AT1G11280: the shorter .6 transcript is cold-responsive. B AT3G13110: single-exon gene with different TSS where the .1 transcript has rapid cold-induced expression compared to the .2 transcript. C AT1G55960: both transcripts peak at dusk but have different expression behavior, with the .11 isoform showing large increases of expression at 20 °C and on day 1 at 4 °C, declining with continued cold exposure. D AT5G53420: isoforms with very different TSS; the .7 isoform is expressed rhythmically, peaking during the day (light-responsive) at 20 °C before declining rapidly in the cold, while the .12 transcript has increased expression in the cold, peaking during the dark. E AT4G14400: the isoforms differ only in their TES but are expressed rhythmically with different phase (3 h offset) at 20 °C and reduced at 4 °C. F AT3G56860: very different TES and expression behavior, antiphasic at 20 °C with a cold-induced switch to the shorter .12 isoform. Error bars on points are standard errors of the mean.