
TGStools: A Bioinformatics Suit to Facilitate Transcriptome Analysis of Long Reads from Third Generation Sequencing Platform.

Danze Chen¹, Qianqian Zhao¹,², Leiming Jiang¹, Shuaiyuan Liao³, Zhigang Meng³, Jianzhen Xu⁴.

Abstract

Recent analyses show that transcriptome sequencing can be used as a diagnostic tool for rare Mendelian diseases. Third generation sequencing detects long reads of thousands of base pairs de novo, greatly expanding isoform discovery and the identification of novel long noncoding RNAs (lncRNAs). In this study, we developed TGStools, a bioinformatics suite that facilitates routine tasks in transcriptome analysis, such as characterizing full-length transcripts, detecting shifts in alternative splicing types, and identifying lncRNAs. It also prioritizes transcripts with a visualization framework that automatically integrates rich annotation of known genomic features. TGStools is a Python package freely available on GitHub.


Keywords: alternative splicing; noncoding RNAs; rare disease; third generation sequencing; transcriptome analysis

Substances: RNA, Long Noncoding

Year: 2019 PMID: 31295871 PMCID: PMC6678717 DOI: 10.3390/genes10070519

Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096

1. Introduction

Gene-panel and whole-exome sequencing have revolutionized mutation detection in rare Mendelian diseases during the past decade. More recently, a growing number of analyses have demonstrated that transcriptome analysis also significantly improves the diagnostic yield in genetically unresolved cases of rare diseases [1,2,3]. Commercially available third generation sequencing (TGS) platforms, such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), have developed novel methods to directly capture long nucleotide sequences from single molecules [4,5]. Compared to canonical second generation sequencing (i.e., RNA-seq), TGS offers great potential for isoform discovery and the characterization of novel long noncoding RNAs (lncRNAs), both of which are essential aspects of rare disease diagnostics [6,7]. However, the main drawback of TGS is its higher sequencing error rate, which may produce spurious transcripts [8]. Full-length transcripts can be identified by comparing them with known genomic annotations that are associated with actively transcribed regions [9,10]. To the best of our knowledge, no bioinformatics tool currently exists that automatically finds nearby genomic features in order to filter transcripts. In this study, we present TGStools, a package that implements multiple tools to facilitate routine transcriptome analysis tasks, such as isoform comparison, detection of alternative splicing (AS) patterns, and lncRNA identification.

2. Materials and Methods

TGStools is a Python package that can be freely obtained from its GitHub project (https://github.com/BioinformaticsSTU/TGStools). Test data from both the PacBio and ONT platforms, as well as detailed tutorials for each function, are also available online. TGStools includes a set of applications that are classified into three categories (Figure 1). In the ‘Transcripts’ category, ‘TransDisp’ compares the isoforms of the queried gene and displays the sequenced transcripts along with multiple genomic annotations; ‘StaDist’ automatically finds the nearby genomic features and calculates the distances to them; and ‘TransFilt’ filters out transcripts according to a user-defined distance cutoff. In the ‘LncRNA’ category, ‘LncPred’ and ‘LncExt’ are used to identify non-coding transcripts, and ‘LncExtTiss’ extracts tissue-specific lncRNAs. Finally, in the ‘Alternative splicing’ category, ‘StaAS’ identifies the alternative splicing events and detects differences in each alternative splicing event among samples; ‘CalScoreD’ selects the most spliced genes; and ‘GOEnrich’ selects the top-ranked Gene Ontology terms that are enriched among the most spliced genes.

Figure 1

Overview of TGStools. TGStools includes a set of applications to facilitate transcriptome analysis.

Among the various types of figures TGStools can produce, the transcripts overview plot and the alternative splicing plot are illustrated here (Figure 2). Demonstrations of the other plots can be seen in the Supplementary Material.

Figure 2

Visualization in TGStools. (a) Example of isoform comparison with known genes and auxiliary annotation. Red track: novel isoforms from the TGS platform; black track: known isoforms identified from the TGS platform; blue track: known transcript annotation. The numbers of long reads detected are shown in brackets. Red arrows: known Cap Analysis of Gene Expression (CAGE) promoters identified from FANTOM5 data; in the Roadmap track, red, blue and green arrows indicate known H3K4me1, H3K4me3 and H3K27ac marks, respectively. (b) Percentage of splicing events in each sample. The χ² test is used to find significant differences among samples. Colors indicate different types of AS events. A3: alternative 3′ splice site; A5: alternative 5′ splice site; AF: alternative first exon; AL: alternative last exon; MX: mutually exclusive exons; RI: retained intron; SE: skipped exon.

3. Results

3.1. Isoforms Comparison with Known Annotations

Users can import data from the most widely used TGS platforms, such as PacBio and ONT, after alignment. TGStools includes the latest gene model annotation files from Ensembl (http://grch37.ensembl.org/index.html), the epigenetic marks downloaded from the Roadmap Epigenomics project (http://www.roadmapepigenomics.org/), and the transcription start site (TSS) peak data generated by CAGE experiments in the FANTOM5 project (http://fantom.gsc.riken.jp/5/). Since these annotations are typically associated with actively transcribed promoters, users can identify bona fide full-length reads by overlapping the transcripts produced by the TGS platform with this auxiliary information. TGStools automatically finds the nearby genomic features and produces a summary report. Given a gene of interest, TGStools also shows the transcript comparison with multiple annotations, from which users can easily identify spurious transcripts. The transcript overview plot gives a genome-scale summary along the chromosome location together with known annotation features (Figure 2a). The genomic coordinates of the sequenced transcripts are shown in the bottom part of the plot. This is followed by the track that indicates known transcript annotations, whereas known isoforms identified from the TGS platform (i.e., Single Molecule Real Time (SMRT) data and ONT data) are shown in black. The numbers of long reads detected are shown in brackets. A comparison of the TSSs detected in long reads with CAGE promoters and active epigenetic marks is also illustrated in the bottom part of the plot. This figure makes it possible to evaluate whether regulatory elements near long transcripts can be detected in other genomic data, in order to eliminate false discoveries. Users can discard spurious transcripts according to a user-defined cutoff. For example, a transcript can be discarded if no genomic features are found within 1 kbp upstream or downstream of its first nucleotide.
For an overview of all sequenced transcripts, TGStools also generates distributions of the distance from the TSS of each full-length transcript to the closest epigenetic marks and CAGE tags (Supplementary Material, Figures S1 and S2). These plots can be used to assess the overall quality of the sequencing data.
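The distance-based filtering described in this section can be sketched in a few lines of Python. This is a minimal illustration and not the TGStools implementation: the function names, the dictionary layout of a transcript, and the representation of each genomic feature as a single coordinate (for example, a CAGE peak position) are assumptions made for this example. Only the default cutoff of 1000 bp mirrors the 1 kbp example given in the text.

```python
from bisect import bisect_left


def nearest_feature_distance(tss, feature_positions):
    """Return the distance (bp) from a TSS to the closest feature.

    `feature_positions` must be a sorted list of coordinates on the same
    chromosome as the TSS. Returns None when the list is empty.
    """
    if not feature_positions:
        return None
    i = bisect_left(feature_positions, tss)
    # Only the neighbours just left and right of the insertion point
    # can be the closest feature.
    candidates = feature_positions[max(i - 1, 0):i + 1]
    return min(abs(tss - p) for p in candidates)


def filter_transcripts(transcripts, features_by_chrom, cutoff=1000):
    """Keep the IDs of transcripts whose TSS lies within `cutoff` bp of a feature."""
    kept = []
    for t in transcripts:
        # The first nucleotide is the start on the + strand, the end on the - strand.
        tss = t["start"] if t["strand"] == "+" else t["end"]
        positions = features_by_chrom.get(t["chrom"], [])
        dist = nearest_feature_distance(tss, positions)
        if dist is not None and dist <= cutoff:
            kept.append(t["id"])
    return kept


# Hypothetical example data (not taken from the paper).
features = {"chr1": [1000, 5000, 20000]}
transcripts = [
    {"id": "tx1", "chrom": "chr1", "start": 1200, "end": 3000, "strand": "+"},
    {"id": "tx2", "chrom": "chr1", "start": 8000, "end": 12000, "strand": "-"},
    {"id": "tx3", "chrom": "chr2", "start": 100, "end": 900, "strand": "+"},
]
print(filter_transcripts(transcripts, features))  # ['tx1']
```

Collecting the values returned by `nearest_feature_distance` for every transcript, instead of thresholding them, gives the raw data for a distance distribution of the kind used to judge overall data quality.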

3.2. Comparing and Detecting the Shifted Types of Alternative Splicing

Using TGStools, alternative splicing events can be categorized and illustrated for each sample based on the SUPPA2 algorithm [11]. Users can compare the alternative splicing patterns among different samples with the built-in statistical test. In the alternative splicing plot, different colors indicate the seven AS types. The percentages and event counts of AS types in each sample are illustrated and compared using the χ² test (Figure 2b and Supplementary Material, Figures S3 and S4). Furthermore, a diversity score was developed to quantitatively measure isoform usage in each sample (see Supplementary Material and online tutorial). According to user-defined cutoffs, the most differentially spliced genes are used to find significantly enriched functional terms from Gene Ontology. Illustrative plots are also produced automatically for the enriched functional terms (Supplementary Material, Figures S5 and S6).
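To make the comparison of AS-type counts concrete, the following sketch computes a Pearson χ² statistic on a samples × event-types count table. It is an illustration under stated assumptions, not the code used inside TGStools: the function names and the event counts are hypothetical, and the p-value helper uses a closed form that holds only for an even number of degrees of freedom (which is the case for seven AS types in two samples, where the degrees of freedom equal 6).

```python
import math

AS_TYPES = ["A3", "A5", "AF", "AL", "MX", "RI", "SE"]


def chi_square_independence(table):
    """Pearson chi-square statistic for a table of counts.

    `table` is a list of rows (one per sample), each a list of counts
    (one per AS type). Every AS type is assumed to occur at least once
    overall, so that no expected count is zero.
    Returns (statistic, degrees_of_freedom).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


def chi_square_sf_even_dof(stat, dof):
    """Upper-tail p-value of the chi-square distribution for even `dof`."""
    if dof <= 0 or dof % 2:
        raise ValueError("closed form requires a positive, even dof")
    half = stat / 2.0
    term, total = 1.0, 1.0
    # Sum of (stat/2)^i / i! for i = 0 .. dof/2 - 1.
    for i in range(1, dof // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Hypothetical event counts per AS type for two samples.
sample_a = [120, 95, 210, 80, 15, 60, 340]
sample_b = [100, 110, 180, 95, 20, 140, 300]
stat, dof = chi_square_independence([sample_a, sample_b])
print(f"chi2 = {stat:.2f}, dof = {dof}, p = {chi_square_sf_even_dof(stat, dof):.3g}")
```

For tables with an odd number of degrees of freedom, a general-purpose routine such as `scipy.stats.chi2_contingency` would be the more practical choice.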

3.3. Finding Tissue Specific Novel Isoforms or lncRNAs

Full-length transcripts often represent novel lncRNAs, which may be tissue specific. To assist lncRNA analysis, TGStools can predict the protein-coding potential of transcripts using the PLEK and CNCI algorithms, which are commonly used for lncRNA identification [12,13]. Our empirical comparison indicated that combining the two tools improves the identification of known lncRNAs across the reference catalog (Supplementary Material, Table S1). TGStools outputs the separate PLEK and CNCI predictions as well as their intersection and union, so that users can choose which set to use. Furthermore, TGStools can compare novel transcripts with the lncRNA reference catalog across human tissues, thus finding tissue-specific novel lncRNAs or isoforms [14]. From the lncRNA Venn plot, users can compare the numbers of lncRNAs identified by the different bioinformatics tools (Supplementary Material, Figure S7).
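The four prediction sets mentioned above (each tool separately, their intersection, and their union) amount to simple set operations on transcript identifiers. The sketch below assumes that the non-coding calls of each tool have already been parsed into lists of IDs; the function name and the IDs are hypothetical, and the parsing of actual PLEK or CNCI output files is deliberately left out.

```python
def combine_predictions(plek_noncoding, cnci_noncoding):
    """Combine the non-coding calls of two predictors.

    Both arguments are iterables of transcript IDs that the respective
    tool classified as non-coding. Returns sorted lists for each tool,
    for the intersection (called by both) and for the union (called by
    at least one).
    """
    plek, cnci = set(plek_noncoding), set(cnci_noncoding)
    return {
        "plek": sorted(plek),
        "cnci": sorted(cnci),
        "intersection": sorted(plek & cnci),
        "union": sorted(plek | cnci),
    }


# Hypothetical transcript IDs.
result = combine_predictions(["t1", "t2", "t3"], ["t2", "t3", "t4"])
print(result["intersection"])  # ['t2', 't3']
print(result["union"])         # ['t1', 't2', 't3', 't4']
```

The intersection is the more conservative choice, while the union favors sensitivity; the sizes of these sets are also what a two-set Venn plot displays.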

4. Discussion

Several large cohort studies have revealed that splicing patterns, altered expression, and non-coding variants contribute to the identification of causal genes, especially in genetically unresolved cases of rare diseases [1,2,3]. We have developed TGStools, which takes input from commonly used long-read platforms, creates visualizations to illustrate full-length transcripts and their expression, and applies functions for analyzing candidate transcripts. TGStools can help researchers explore the full-length human transcriptome based on TGS platforms. In the future, we will continue to update TGStools to include a user-friendly GUI and more functionality, such as sample classification procedures, so that it can also be applied to patient stratification when analyzing clinical datasets [15,16].

References

1. Characterization of the human ESC transcriptome by hybrid sequencing.

Authors: Kin Fai Au; Vittorio Sebastiano; Pegah Tootoonchi Afshar; Jens Durruthy Durruthy; Lawrence Lee; Brian A Williams; Harm van Bakel; Eric E Schadt; Renee A Reijo-Pera; Jason G Underwood; Wing Hung Wong
Journal: Proc Natl Acad Sci U S A Date: 2013-11-26 Impact factor: 11.205

2. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses.

Authors: Moran N Cabili; Cole Trapnell; Loyal Goff; Magdalena Koziol; Barbara Tazon-Vega; Aviv Regev; John L Rinn
Journal: Genes Dev Date: 2011-09-02 Impact factor: 11.361

3. Evolutionary Multiobjective Clustering and Its Applications to Patient Stratification.

Authors: Xiangtao Li; Ka-Chun Wong
Journal: IEEE Trans Cybern Date: 2018-04-02 Impact factor: 11.448

4. Identification of rare-disease genes using blood transcriptome sequencing and large control cohorts.

Authors: Laure Frésard; Craig Smail; Nicole M Ferraro; Nicole A Teran; Xin Li; Kevin S Smith; Devon Bonner; Kristin D Kernohan; Shruti Marwaha; Zachary Zappala; Brunilda Balliu; Joe R Davis; Boxiang Liu; Cameron J Prybol; Jennefer N Kohler; Diane B Zastrow; Chloe M Reuter; Dianna G Fisk; Megan E Grove; Jean M Davidson; Taila Hartley; Ruchi Joshi; Benjamin J Strober; Sowmithri Utiramerur; Lars Lind; Erik Ingelsson; Alexis Battle; Gill Bejerano; Jonathan A Bernstein; Euan A Ashley; Kym M Boycott; Jason D Merker; Matthew T Wheeler; Stephen B Montgomery
Journal: Nat Med Date: 2019-06-03 Impact factor: 53.440

5. PLEK: a tool for predicting long non-coding RNAs and messenger RNAs based on an improved k-mer scheme.

Authors: Aimin Li; Junying Zhang; Zhongyin Zhou
Journal: BMC Bioinformatics Date: 2014-09-19 Impact factor: 3.169

6. SUPPA2: fast, accurate, and uncertainty-aware differential splicing analysis across multiple conditions.

Authors: Juan L Trincado; Juan C Entizne; Gerald Hysenaj; Babita Singh; Miha Skalic; David J Elliott; Eduardo Eyras
Journal: Genome Biol Date: 2018-03-23 Impact factor: 13.583

7. Expanding the Boundaries of RNA Sequencing as a Diagnostic Tool for Rare Mendelian Disease.

Authors: Hernan D Gonorazky; Sergey Naumenko; Arun K Ramani; Viswateja Nelakuditi; Pouria Mashouri; Peiqui Wang; Dennis Kao; Krish Ohri; Senthuri Viththiyapaskaran; Mark A Tarnopolsky; Katherine D Mathews; Steven A Moore; Andres N Osorio; David Villanova; Dwi U Kemaladewi; Ronald D Cohn; Michael Brudno; James J Dowling
Journal: Am J Hum Genet Date: 2019-02-28 Impact factor: 11.025

8. Targeted, High-Resolution RNA Sequencing of Non-coding Genomic Regions Associated With Neuropsychiatric Functions.

Authors: Simon A Hardwick; Samuel D Bassett; Dominik Kaczorowski; James Blackburn; Kirston Barton; Nenad Bartonicek; Shaun L Carswell; Hagen U Tilgner; Clement Loy; Glenda Halliday; Tim R Mercer; Martin A Smith; John S Mattick
Journal: Front Genet Date: 2019-04-12 Impact factor: 4.599

9. Utilizing sequence intrinsic composition to classify protein-coding and long non-coding transcripts.

Authors: Liang Sun; Haitao Luo; Dechao Bu; Guoguang Zhao; Kuntao Yu; Changhai Zhang; Yuanning Liu; Runsheng Chen; Yi Zhao
Journal: Nucleic Acids Res Date: 2013-07-27 Impact factor: 16.971

10. Comprehensive comparison of Pacific Biosciences and Oxford Nanopore Technologies and their applications to transcriptome analysis.

Authors: Jason L Weirather; Mariateresa de Cesare; Yunhao Wang; Paolo Piazza; Vittorio Sebastiano; Xiu-Jie Wang; David Buck; Kin Fai Au
Journal: F1000Res Date: 2017-02-03
