
Ranked Choice Voting for Representative Transcripts with TRaCE.

Abstract

SUMMARY: Genome sequencing projects annotate protein-coding gene models with multiple transcripts, aiming to represent all of the available transcript evidence. However, downstream analyses often operate on only one representative transcript per gene locus, sometimes known as the canonical transcript. To choose canonical transcripts, TRaCE (Transcript Ranking and Canonical Election) holds an 'election' in which a set of RNA-seq samples rank transcripts by annotation edit distance. These sample-specific votes are tallied along with other criteria such as protein length and InterPro domain coverage. The winner is selected as the canonical transcript, but the election proceeds through multiple rounds of voting to order all the transcripts by relevance. Based on the set of expression data provided, TRaCE can identify the most common isoforms from a broad expression atlas or prioritize alternative transcripts expressed in specific contexts.
AVAILABILITY AND IMPLEMENTATION: Transcript ranking code can be found on GitHub at https://github.com/warelab/TRaCE.
SUPPLEMENTARY INFORMATION: Additional data are available in the GitHub repository.


Year: 2021 PMID: 34297055 PMCID: PMC8696091 DOI: 10.1093/bioinformatics/btab542

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Genome sequencing projects often use complex, automated annotation pipelines to build reference sets of gene models. These pipelines mask repeats in the assembled genome, align protein and transcript evidence, and build gene models by aggregating overlapping alignments that adhere to known or inferred splice site patterns (Campbell ; Haas ; Hoff ). Before a project releases a set of high-confidence gene models, additional filtering steps may remove transcript models that lack homology or are subject to nonsense-mediated decay. Alternative splicing contributes to the functional diversity of a genome (Black, 2003), and new sequencing technologies such as PacBio IsoSeq can capture splice variants at an unprecedented scale (Bruijnesteijn ; Wang ; Zhang ). However, this heightened sensitivity can also detect transcriptional noise, which gene builders can misreport as biologically relevant splice variants. Furthermore, partially processed transcripts with retained introns that neither disrupt the reading frame nor introduce stop codons can be promoted to canonical transcripts (Fig. 1).

Fig. 1.

(A) The complex set of transcript models for the Zea mays B73 gene sbe4 (starch branching enzyme4). Red blocks show the predicted coding regions, and orange blocks are untranslated regions. The longest translation contains a retained intron and was selected as the canonical transcript for Compara gene tree analysis. (B) The left side shows a portion of the gene tree focused on this maize gene and displaying homologs from Sorghum bicolor, Setaria italica, Brachypodium distachyon and Oryza sativa Japonica. The right side shows regions of protein sequences participating in the multiple sequence alignment, color coded by InterPro domain. The first row shows a unique region relative to other species that derives from the retained intron.

Comparative gene tree analysis platforms such as Ensembl Compara (Herrero ) operate on a single canonical transcript for each gene locus. In the absence of a curated canonical transcript, this is usually defined as the longest transcript with the longest translation, but this definition does not necessarily select the best representative transcript for a gene locus. Subsequently developed techniques have defined canonical isoforms based on expression level, sequence conservation, annotation of functional domains or some combination of these features (Li ; Pruitt ; Rodriguez ; The UniProt Consortium ). For example, NCBI’s RefSeq Select dataset uses an evidence hierarchy to identify a transcript in each protein-coding human and mouse gene model. The Matched Annotation from NCBI and EMBL-EBI (MANE) project has the goal of providing a unified set of human protein-coding gene annotations, but it is not known if and when such efforts will be applied to other species. We developed Transcript Ranking and Canonical Election (TRaCE) to choose canonical transcripts based on data typically available at the time of a new genome annotation.
In this approach, transcripts are ranked by length, domain coverage, and how well they represent a diverse population of transcriptome RNA-seq data. An ‘election’ based on ranked-choice voting selects a canonical transcript that is the first- or second-choice transcript for the majority of samples. The election proceeds through multiple rounds, effectively sorting all transcripts by relevance. Here we present the TRaCE algorithm and results obtained by running TRaCE on Zea mays and Homo sapiens gene annotations. In addition, we describe validation of TRaCE predictions by manual curation (Tello-Ruiz ) and compare TRaCE to RefSeq/MANE Select and APPRIS (Rodriguez ) human transcript classifications.

2 Materials and methods

The first step in preparing to run TRaCE is to gather a diverse set of RNA-seq expression data covering a wide variety of tissues or conditions to act as ‘voters’ in the upcoming elections. The next step is to align the reads, assemble sample-specific transcripts, and quantify their expression. Each reference gene model with multiple transcripts (candidates) will hold an election to sort the reference transcripts by relevance (Fig. 2).

Fig. 2.

Flowchart of preparation of TRaCE inputs and a schematic of the ranked-choice voting (RCV) approach to select transcripts for an example gene with three transcripts (blue, red, gray). Exon thickness corresponds to non-coding, coding and functional regions with Pfam domains. Voters are represented by rectangles, and rank transcripts by length criteria (9, 6 or 3 votes) or AED (1 vote per sample). Eight of the samples rank the red and blue transcripts equally (blue-red gradient), so both get tallied in round 1. RCV selects the blue transcript first with 24 rank 1 votes. After removing the blue votes from consideration, the red and gray transcripts tie with 10 rank 1 votes, but the red transcript is elected with 14 rank 2 votes.

In each election, samples rank the candidate transcripts based on the annotation edit distance (AED) to the most highly expressed overlapping sample-specific transcripts (Eilbeck ). AED scores range from 0 (perfect agreement) to 1 (no overlap) and are calculated from the pairwise similarity of reference transcripts and aligned evidence based on the proportion of exonic overlap. Because there may be insufficient data to assemble full-length transcripts from samples in which the gene is expressed at low levels, the AED score calculation is restricted to overlapping portions of candidate transcripts. A maximum AED score cutoff (default, 0.5) prevents samples from voting for candidate transcripts with very little similarity. There are also cutoff parameters for minimum expression level (default TPM, 0.5) and proportion overlapping (default, 0.5) to filter out some noise in the sample transcriptome data. The election includes additional voters that rank transcripts based on domain coverage, protein length and transcript length. To avoid overwhelming the length-based voters when running TRaCE with many samples, sample votes are weighted to balance the electorate.
Default weights were selected to prioritize functional domain coverage over protein length and total transcript length. Once each sample voter and the length-based voters have ranked the transcripts, the election proceeds in multiple rounds selecting winners until no candidates remain. In each round, TRaCE tallies votes for top-ranked candidates; and so long as there is a tie for first place, votes for the subsequent rankings are added to the tally.
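The voting procedure described above can be illustrated with a minimal Python sketch. It is not the TRaCE implementation. The function names (`aed`, `sample_ballot`, `rank_transcripts`), the ballot representation, the example weights (9, 6 and 3, borrowed from the Fig. 2 caption and assigned to domain coverage, protein length and transcript length as an assumption) and the rule for breaking residual ties by input order are all hypothetical. AED is computed here as one minus the mean of the two exonic overlap proportions, without the restriction to overlapping portions or the expression and overlap cutoffs mentioned in the text.

```python
def exon_bases(exons):
    """Positions covered by half-open (start, end) exon intervals."""
    return {pos for start, end in exons for pos in range(start, end)}


def aed(reference_exons, evidence_exons):
    """Annotation edit distance: 0 is perfect agreement, 1 is no overlap."""
    ref, ev = exon_bases(reference_exons), exon_bases(evidence_exons)
    shared = len(ref & ev)
    if shared == 0:
        return 1.0
    # Average the two overlap proportions, then convert similarity to distance.
    return 1.0 - (shared / len(ref) + shared / len(ev)) / 2.0


def sample_ballot(candidates, evidence_exons, max_aed=0.5, weight=1):
    """Rank candidates for one sample by AED; equal scores share a tier."""
    scores = {name: aed(exons, evidence_exons) for name, exons in candidates.items()}
    eligible = {n: s for n, s in scores.items() if s <= max_aed}
    tiers = [[n for n in eligible if eligible[n] == s]
             for s in sorted(set(eligible.values()))]
    return (weight, tiers)


def rank_transcripts(ballots, candidates):
    """Elect candidates one at a time until none remain.

    ballots: list of (weight, tiers); tiers is a list of lists of candidate
    names, best tier first. Returns candidates in order of election.
    """
    remaining = list(candidates)
    elected = []
    while remaining:
        # Drop already-elected candidates from every ballot, keeping tier order.
        active = []
        for weight, tiers in ballots:
            kept = [[c for c in tier if c in remaining] for tier in tiers]
            active.append((weight, [tier for tier in kept if tier]))
        tally = {c: 0.0 for c in remaining}
        max_depth = max((len(tiers) for _, tiers in active), default=0)
        depth = 0
        while True:
            # Add the votes at the current rank to the running tally.
            for weight, tiers in active:
                if depth < len(tiers):
                    for c in tiers[depth]:
                        tally[c] += weight
            top = max(tally.values())
            leaders = [c for c in remaining if tally[c] == top]
            depth += 1
            # Stop once there is a single leader or no lower ranks are left.
            if len(leaders) == 1 or depth >= max_depth:
                break
        winner = leaders[0]  # assumption: residual ties broken by input order
        elected.append(winner)
        remaining.remove(winner)
    return elected


# Hypothetical electorate: three weighted length-based voters plus ten samples.
ballots = [
    (9, [["blue"], ["red"], ["gray"]]),  # domain coverage voter
    (6, [["blue"], ["gray"], ["red"]]),  # protein length voter
    (3, [["gray"], ["red"], ["blue"]]),  # transcript length voter
] + [(1, [["blue", "red"], ["gray"]])] * 8 + [(1, [["gray"], ["red"], ["blue"]])] * 2

print(rank_transcripts(ballots, ["blue", "red", "gray"]))  # ['blue', 'red', 'gray']
```

In this toy electorate the blue transcript wins the first round with 23 rank 1 votes, and the red transcript wins the second round once blue has been removed from every ballot. Representing each ballot as ordered tiers lets a sample that scores two transcripts equally contribute its vote to both in the same round, as in the Fig. 2 example.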

3 Results

We ran TRaCE on a pre-release set of Zea mays B73 gene models with the set of 10 RNA-seq samples that had already been aligned to the genome as part of the evidence-based gene annotation pipeline (Hufford ). The samples were derived from shoot, root, embryo, endosperm, ear, tassel, anther and three leaf sections (base, middle and tip). StringTie version 1.3.5 (with the --rf flag) was used for transcript assembly and quantification (Pertea ) and InterProScan version 5.38-76.0 was run to identify Pfam domains (Mulder and Apweiler, 2007). The Zea mays B73 V5 annotation set (Zm00001eb) has 15 162 multi-transcript protein-coding gene models; for 5616 of these (37%), the canonical transcript chosen by TRaCE was not the longest isoform. TRaCE selected canonical transcripts for the genome annotations of 25 additional maize accessions, 33–38% of which were not the longest isoform (Supplementary Table S1).

We used two approaches to validate TRaCE’s predictions on maize genes. First, we modified an interactive gene tree viewer, designed to flag problematic gene models by visual inspection of the multiple sequence alignment and domain annotations (Tello-Ruiz ). We used this interface to compare maize B73 V5 canonical transcripts (Zm00001eb) selected by TRaCE with the prior set of maize V4 canonical transcripts (Zm00001d) selected by length criteria alone. A random selection of 173 pairs of genes for which the TRaCE canonical was not the longest transcript were evaluated in the gene tree viewer and flagged if the alignment was inconsistent with outgroup orthologs. Genes were flagged if there was a relative gain or loss of conserved sequence within the transcript or at either end. Of these gene pairs, 32% were flagged as problematic in Zm00001d only, 4% in Zm00001eb only and 5% in both versions (Supplementary Table S2). The most common issue in the flagged Zm00001d gene models was gain of sequence due to an intron retention.
Thus, according to this approach, TRaCE was selecting better-conserved isoforms than the prior length-based algorithm.

In the second approach, TRaCE predictions were validated by student curators who were given a subset of 48 gene models with two to five transcripts, for which TRaCE’s top-ranked isoform was not the longest isoform. The students, who were not aware of TRaCE’s output, were asked to rate transcripts as best, good or poor, based on viewing the gene structure and expression evidence in the Apollo genome browser (Dunn ). Each gene model was curated by at least three different students. The transcript ratings were mapped to a score (best 2, good 1, poor -1). Transcript rankings from TRaCE and rankings based on length alone were compared to rankings based on curator scores. For each rank (1–5), we calculated the sum of the curator scores for the associated transcripts. The correlation of these sums between the length-based ranking and the curator-based ranking was 0.917, whereas the TRaCE and curator ranking sums had a higher correlation coefficient of 0.985 (Supplementary Table S3).

We also ran TRaCE on human GRCh38 annotations (Frankish ) with a diverse panel of 127 samples of human RNA-seq data covering the development of seven major organs (brain, cerebellum, heart, kidney, liver, ovary and testis) from 4 weeks post-conception to adulthood (https://www.ebi.ac.uk/gxa/experiments/E-MTAB-6814/Results). Reads were aligned with HISAT2 version 2.1.0 (--dta --reorder), transcripts were assembled and quantified with StringTie version 2.1.4 (--conservative) and protein-coding reference transcripts were annotated with Pfam domains using InterProScan version 5.38-76.0 (Mulder and Apweiler, 2007; Pertea ). The GRCh38 annotation set has 13 848 multi-transcript protein-coding gene models that were classified by both APPRIS and MANE Select. The TRaCE canonical was not the longest isoform in 3717 (27%) of these gene models.
For comparison, the principal isoform according to APPRIS and the MANE Select transcript were not the longest isoform in 3061 (22%) and 4292 (31%) of gene models, respectively. There are 1202 gene models where APPRIS and MANE Select disagree. In these cases, TRaCE agrees with APPRIS on 408 (34%) genes, MANE Select on 597 (50%) genes and neither on 197 (16%) genes. On the 12 646 multi-transcript gene models where APPRIS and MANE Select agree, TRaCE gives 10 677 (84%) transcripts rank 1, 1470 (12%) rank 2, 351 (3%) rank 3 and 148 (1%) rank 4 or higher. To assess TRaCE’s performance on gene models with many transcripts, we compared TRaCE to APPRIS and MANE Select on the 90% of genes with 2–10 transcripts and the remaining 10% of human protein-coding gene models with 11–151 transcripts. There are 1399 genes with many transcripts where APPRIS and MANE Select agree. In these cases, TRaCE selects 1021 (73%) of these as the canonical transcript, 215 (15%) have rank 2, 92 (7%) have rank 3 and 71 (5%) have rank 4 or higher. On the 11 247 genes with fewer transcripts where APPRIS and MANE Select agree, TRaCE assigns 9656 (86%) rank 1, 1255 (11%) rank 2, 259 (2%) rank 3 and 84 (1%) rank 4 or higher. For the initial release of TRaCE, we manually tuned the weights on TRaCE’s length-based votes, but future versions may benefit from an automated parameter sweep to minimize these differences.

19 in total

1. InterPro and InterProScan: tools for protein sequence classification and comparison.

Authors: Nicola Mulder; Rolf Apweiler
Journal: Methods Mol Biol Date: 2007

2. Revisiting the identification of canonical splice isoforms through integration of functional genomics and proteomics evidence.

Authors: Hong-Dong Li; Rajasree Menon; Gilbert S Omenn; Yuanfang Guan
Journal: Proteomics Date: 2014-11-17 Impact factor: 3.984

3. Gramene 2021: harnessing the power of comparative genomics and pathways for plant research.

Authors: Marcela K Tello-Ruiz; Sushma Naithani; Parul Gupta; Andrew Olson; Sharon Wei; Justin Preece; Yinping Jiao; Bo Wang; Kapeel Chougule; Priyanka Garg; Justin Elser; Sunita Kumari; Vivek Kumar; Bruno Contreras-Moreira; Guy Naamati; Nancy George; Justin Cook; Daniel Bolser; Peter D'Eustachio; Lincoln D Stein; Amit Gupta; Weijia Xu; Jennifer Regala; Irene Papatheodorou; Paul J Kersey; Paul Flicek; Crispin Taylor; Pankaj Jaiswal; Doreen Ware
Journal: Nucleic Acids Res Date: 2020-11-10 Impact factor: 16.971

4. Whole-Genome Annotation with BRAKER.

Authors: Katharina J Hoff; Alexandre Lomsadze; Mark Borodovsky; Mario Stanke
Journal: Methods Mol Biol Date: 2019

5. Human and Rhesus Macaque KIR Haplotypes Defined by Their Transcriptomes.

Authors: Jesse Bruijnesteijn; Marit K H van der Wiel; Wendy T N Swelsen; Nel Otting; Annemiek J M de Vos-Rouweler; Diënne Elferink; Gaby G Doxiadis; Frans H J Claas; Neubury M Lardy; Natasja G de Groot; Ronald E Bontrop
Journal: J Immunol Date: 2018-01-22 Impact factor: 5.422

6. Quantitative measures for the management and comparison of annotated genomes.

Authors: Karen Eilbeck; Barry Moore; Carson Holt; Mark Yandell
Journal: BMC Bioinformatics Date: 2009-02-23 Impact factor: 3.169

7. Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown.

Authors: Mihaela Pertea; Daehwan Kim; Geo M Pertea; Jeffrey T Leek; Steven L Salzberg
Journal: Nat Protoc Date: 2016-08-11 Impact factor: 13.491

8. APPRIS 2017: principal isoforms for multiple gene sets.

Authors: Jose Manuel Rodriguez; Juan Rodriguez-Rivas; Tomás Di Domenico; Jesús Vázquez; Alfonso Valencia; Michael L Tress
Journal: Nucleic Acids Res Date: 2018-01-04 Impact factor: 16.971

9. Apollo: Democratizing genome annotation.

Authors: Nathan A Dunn; Deepak R Unni; Colin Diesh; Monica Munoz-Torres; Nomi L Harris; Eric Yao; Helena Rasche; Ian H Holmes; Christine G Elsik; Suzanna E Lewis
Journal: PLoS Comput Biol Date: 2019-02-06 Impact factor: 4.475

10. De novo assembly, annotation, and comparative analysis of 26 diverse maize genomes.

Authors: Matthew B Hufford; Arun S Seetharam; Margaret R Woodhouse; Kapeel M Chougule; Shujun Ou; Jianing Liu; William A Ricci; Tingting Guo; Andrew Olson; Yinjie Qiu; Rafael Della Coletta; Silas Tittes; Asher I Hudson; Alexandre P Marand; Sharon Wei; Zhenyuan Lu; Bo Wang; Marcela K Tello-Ruiz; Rebecca D Piri; Na Wang; Dong Won Kim; Yibing Zeng; Christine H O'Connor; Xianran Li; Amanda M Gilbert; Erin Baggs; Ksenia V Krasileva; John L Portwood; Ethalinda K S Cannon; Carson M Andorf; Nancy Manchanda; Samantha J Snodgrass; David E Hufnagel; Qiuhan Jiang; Sarah Pedersen; Michael L Syring; David A Kudrna; Victor Llaca; Kevin Fengler; Robert J Schmitz; Jeffrey Ross-Ibarra; Jianming Yu; Jonathan I Gent; Candice N Hirsch; Doreen Ware; R Kelly Dawe
Journal: Science Date: 2021-08-06 Impact factor: 47.728
