Proteotranscriptomics - A facilitator in omics research.

Abstract

Applications in omics research, such as comparative transcriptomics and proteomics, require knowledge of the species-specific gene sequence and benefit from a comprehensive high-quality annotation of the coding genes to achieve high coverage. While protein-coding genes can in simple cases be detected by scanning the genome for open reading frames, in more complex genomes exonic sequences are separated by introns. Despite advances in sequencing technologies that allow for ever-growing numbers of genomes, the quality of many of the provided genome assemblies does not reach reference quality. These non-contiguous assemblies with gaps and the necessity to predict splice sites limit accurate gene annotation from solely genomic data. In contrast, the transcriptome only contains transcribed gene regions, is devoid of introns and thus provides the optimal basis for the identification of open reading frames. The additional integration of proteomics data to validate predicted protein-coding genes further enriches for accurate gene models. This review outlines the principles of the proteotranscriptomics approach, discusses common challenges and suggests methods for improvement.

Keywords: Gene annotation; Proteomics; Proteotranscriptomics; Transcriptomics

Year: 2022 PMID: 35891789 PMCID: PMC9293588 DOI: 10.1016/j.csbj.2022.07.007

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155

Introduction

A gene is a sequence of DNA nucleotides that encodes the synthesis of a gene product, which can be either RNA or protein. The annotation of protein-coding functional elements in a complex genome is a challenging task that requires not only a highly accurate genome assembly but also the modeling of common gene features such as start and stop codons and splicing signals. The development of highly efficient sequencing technologies enables the sequencing of an ever-growing number of genomes. This necessitates automated gene annotations that are usually highly dependent on the transfer of gene models from related species. While this trend accommodates the need of an increasingly omics-oriented science to study gene regulation on a global level in any species of interest, in reality imprecisions in annotations can propagate across new assemblies and genes can be missed. In addition, genome assembly and annotation are still highly challenging endeavors and thus remain reserved to highly specialized research groups or big consortia. Thus, an alternative approach in which gene predictions are based not on genomic sequences but rather on contigs assembled from RNA-Seq data of poly(A)-enriched mRNA has advantages. As the transcribed part of the genome is devoid of introns and other non-coding sequences, the resulting coding-gene predictions are likely more accurate. By adding further evidence in the form of peptide information obtained by mass spectrometry, a technique broadly used for the identification and characterization of proteins, a Proteo-Transcriptomics Assembly (PTA) workflow can yield high-confidence annotation of protein-coding genes without the need for genome assembly. RNA-Seq and mass spectrometry data can nowadays be easily produced via on-demand services or simply downloaded from RNA-Seq and proteomics repositories, and hence the protocols we present and discuss here can even be applied by research groups without such high-throughput equipment.
In the following, we outline the concepts of the approach, discuss common challenges and suggest methods for advancement.

Genome annotation remains challenging

Since its development in the late 1970s, DNA sequencing has become one of the most pivotal tools in biomedical research [1]. It initially facilitated the sequencing of whole genomes of phages in the late 1970s [2], followed by several prokaryotic organisms and the first eukaryote, i.e. the yeast Saccharomyces cerevisiae, in the mid-1990s [3]. Multicellular eukaryotic genome sequencing was achieved soon after, starting with the roundworm Caenorhabditis elegans [4] and the plant Arabidopsis thaliana [5]. As of March 2021, the International Nucleotide Sequence Database Collaboration (INSDC) contained whole-genome DNA sequence information for 6,480 unique eukaryote species, of which only 583 (9%) represented reference-quality chromosome-scale assemblies [6]. The rest remain draft assemblies that do not pass the now generally accepted quality requirements [6], [7]. Although the number of reference-quality genomes is planned to be increased extensively, this will still require concerted efforts from the community or larger consortia [6], [7]. Further, this will not encompass every species of interest and may still require some time. Accurate and contiguous genomic sequences are important foundations for the identification of functional elements, a process called genome annotation. While this procedure was performed in a highly curated manner with very intense efforts for the first sequenced organisms, the sheer number of sequenced genomes nowadays requires fully automated processes for annotation. In these processes the genome sequence is screened for features of open reading frames that potentially code for proteins. While these measures efficiently enable prediction of possible open reading frames (Fig. 1), the accuracy has been described to suffer in many cases [8], [9].

Fig. 1

Paradigm of proteome assembly releases: The number of entries per UniProt Knowledgebase (UniProtKB) [15] release increases extensively with time (upper panel). The vast majority of these entries, however (greater than 99%), is merely inferred by homology or predicted and has no biological evidence at the transcript or protein level (lower panel). All data presented were extracted from the release notes of the respective UniProt release.

The main challenges of genome annotation can be divided into two categories. The automated annotation of large, fragmented draft genomes remains very difficult, as open reading frames at the edge of contigs as well as in non-assembled genomic regions are lost. This is reflected in the loss of ORFs when compared to fully assembled genomes [10], [11], [12], [13]. In addition, draft genome assemblies are known to be frequently contaminated with common bacteria, sequencing vectors, or even human DNA, all of which are ubiquitously present in most labs [14]. These contaminations and any other errors in existing annotations, e.g. wrongly assigned gene names or a non-genic sequence being annotated as protein-coding, lead to errors in annotation that tend to propagate across species (Fig. 2). For eukaryotic genomes the challenges are even more complex, as genes can be far apart and are usually interrupted by introns. That might explain why only 34% of animals with genome assemblies in GenBank also have corresponding annotations [9]. In addition, automated genome annotation mostly provides predictions based on sequence alone, without further evidence, and is therefore unable to control for overprediction (Fig. 1). Hence, while genome sequencing technology has continuously improved, genome annotation has become less accurate in general [8].

Fig. 2

Main genome annotation steps. Many steps such as repeat masking, protein homology prediction and the alignment of open reading frames from other species include implementing data from other assemblies and annotations and hence mistakes are transferred resulting in impaired precision.

Transcriptome assembly enables gene prediction with reduced complexity

One approach to overcome the obstacles of genome sequence-based gene annotation for protein-coding genes is to start the annotation effort with much less complex underlying data, namely the transcriptome. The transcriptome, as the intermediate level of information between genome and proteome, is devoid of complex features such as introns and other non-coding sequences and theoretically constitutes the perfect basis for the identification of open reading frames. The large drop in the cost of sequencing also led to the expansion of investigations of transcriptomes of a large range of organisms [16]. This is accomplished by extracting total RNA from the organism of interest, enriching for poly-adenylated transcripts and reverse transcribing them to create a cDNA library. The cDNA can then be fragmented into various lengths depending on the platform used for sequencing. 454 Sequencing, Illumina, and SOLiD platforms utilize different types of technologies to sequence millions of short reads [17]. Similar to genome assembly, the cDNA sequence reads can then be assembled into transcripts. However, established genome assemblers cannot be used directly for transcriptome assembly for several reasons. (1) Sequencing depth is assumed to be uniform across the genome, while the depth varies between transcripts. (2) In genome sequencing both strands are sequenced, while RNA-Seq is normally strand-specific. (3) Transcript variants of the same locus share exons, so it can be difficult to reconstruct and tease apart all splicing isoforms. Thus, transcriptome assembly has its own challenges. The approach was, however, strongly advanced by the development of dedicated transcriptome assembly programs [18]. Transcriptome assembly can be performed in two different modes: de novo assembly (i.e. assembly of reads without the usage of any reference genome or transcriptome) and genome-guided transcriptome assembly (i.e. reads are mapped to a related reference genome to identify transcript models, which are then assembled into transcripts). While genome-guided assembly provides better results when a well-assembled genome is available, the de novo approach enables transcriptome assemblies in cases where a genome assembly is absent or in an unsatisfactory state, i.e. highly gapped or fragmented. Short-read de novo transcriptome assemblers generally use one of two basic algorithms: (1) overlap graphs or (2) de Bruijn graphs. Overlap graphs are utilized by most assemblers designed for Sanger-sequenced reads. The overlap between each pair of reads is computed and compiled into a graph, in which each node represents a single sequence read. This algorithm is more computationally intensive than de Bruijn graphs, and most effective in assembling fewer reads with a high degree of overlap. De Bruijn graphs link overlapping k-mers (sub-sequences of a read with a length of k, usually 25–50 bp) to create contigs. The de Bruijn graph approach bypasses the challenge of all-against-all overlap consensus assembly using the full-length reads. While building the graph, each read is represented as a path through the k-mers; as the k-mers are shorter than the reads, this allows fast hashing, so operations on de Bruijn graphs are generally less computationally intensive. The following short-read assemblers were specially designed for working with RNA-Seq data and are based on de Bruijn graphs: Trans-ABySS [19], Trinity [20], [21], Oases [22], IDBA-Tran [23], SOAPdenovo-Trans [24], and Shannon [25]. Bridger [26] and BinPacker [27] are two assembly tools that rely on splicing graphs [26] instead of de Bruijn graphs. SPAdes v3.13.0 [28] is a widely used de novo genome assembler based on de Bruijn graphs and multiple k-mer (MK) values.
These assemblers have been used to provide transcriptomes for chickpea [29], planarians [30], Parhyale hawaiensis [31], as well as the Nile crocodile, the corn snake, the bearded dragon, and the red-eared slider [32], to name just a few. Although the de novo mode facilitates the inference of many valid and precise transcripts, the approach also bears some potential issues, namely: possible assembly errors in paralogs and multigene families; production of erroneous chimeras; problems reaching full transcript length; and misestimation of allelic diversity [33], [34], [35]. Transcriptome assembly from short-read sequences sometimes suffers from low accuracy, especially for transcripts from eukaryotes that contain complex isoforms [35], [36]. This can be partially tackled by using long-read sequences, which span longer parts of the original transcript and hence allow for more precise assembly [37]. A downside to long-read sequencing is that the accuracy per read can be much lower than that of short-read sequencing, introducing other errors into assembled contigs. Lately, hybrid approaches integrating both short- and long-read sequences in the assembly process have been proposed [38]. To benchmark the transcriptome assemblies overall and the assembled contigs individually, control measures and programs that can characterize these have been developed. Some of the more established measures and tools are: the mapping rate (the re-mapping rate can give preliminary insights into the quality of a transcriptome assembly), Ex90N50 statistics (expression-informed ExN50 statistic), the rate of full-length protein-coding transcript reconstruction, rnaQUAST [39] (completeness and correctness levels of the assembled transcripts), TransRate [40] (confidence and completeness measures based on the reads used for the assembly only), DETONATE [41] (compactness of the assembly and its support from the RNA-Seq reads) and BUSCO [42] (abundance of single-copy orthologs in the assembly).
Using these tools the general quality of the assembly can be measured, and some of the algorithms (TransRate, DETONATE) also provide per-transcript measures that allow further filtering of assembled transcripts to keep only high confidence transcripts.
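The de Bruijn graph idea described above can be illustrated with a minimal sketch. It is a toy under stated assumptions (error-free reads, a single transcript, no coverage information, and the hypothetical helper names `build_de_bruijn` and `walk_unambiguous`), not the algorithm of any of the assemblers listed.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the (k-1)-mer nodes that follow it."""
    graph = defaultdict(list)
    seen_kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen_kmers:
                continue  # each distinct k-mer contributes one edge
            seen_kmers.add(kmer)
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Extend a contig from `start` while every node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, [])) == 1:
        node = graph[node][0]
        if node in visited:  # stop on cycles
            break
        visited.add(node)
        contig += node[-1]
    return contig


# Three overlapping toy reads (hypothetical data).
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous(graph, "ATG")
print(contig)
```

Because reads are only decomposed into k-mers and looked up in a hash table, no all-against-all read comparison is needed. Real assemblers additionally handle sequencing errors, branching caused by isoforms and paralogs, and uneven coverage.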

Integration of peptide evidence increases confidence in protein predictions

Although transcriptome assemblers are becoming faster and more precise, transcript isoform variation contributes to transcriptome complexity and ultimately affects the quality of transcriptome assemblies. The consequence is overprediction and misassembly, especially in loci with high levels of alternative splice forms, allelic variants, close paralogs, close homologs, and close homeologs. Using assembly quality assessment tools as mentioned above, some of these misassembled transcripts can be identified and filtered out. However, incorrect frame-shifted open reading frames can only be detected either by comparison to known, well-evidenced proteins from other species or by using evidence at the protein sequence level. This can be accomplished by cross-checking the pool of open reading frames predicted from the transcriptome assembly against mass spectrometry peptide identifications. Using this approach, open reading frames stemming from misassembled transcripts can be eliminated by establishing evidence at the protein level, thereby strengthening the confidence in the predictions. Similar approaches have been used to identify non-canonical proteins and novel alternative splicing isoforms which would be lost when working with predetermined annotation databases [43], [44], [45]. The principle is simple: RNA-Seq data are assembled into possible transcripts, which are the basis for the prediction of potential open reading frames, which in turn are used as the search space for mass spectrometry peptide information (Fig. 3). This Proteo-Transcriptomics Assembly (PTA) approach enables unbiased proteome annotation without the need for genome information. The ultimate result of the PTA process is a set of transcript contigs bearing open reading frames that are validated by the presence of peptides and hence represent high-confidence protein-coding transcripts.
Preferably, the same sample is used for preparing RNA and protein extracts; however, any available raw data (RNA-Seq or mass spectrometry data) can also be downloaded from official repositories like GEO [46], SRA [47] or PRIDE [48] and be combined retrospectively. This opens avenues for the annotation of high-confidence open reading frames, facilitating research with a cost-effective approach to improve previous gene models or generate new ones. As the technique can be based on de novo transcriptome assembly, it provides the possibility to study any species, also in the absence of genome sequence information, enabling gene discovery, comparative analyses, estimation of expression abundances, and identification of sequence variants.
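The cross-check between predicted open reading frames and peptide identifications can be sketched as follows. This is a deliberately simplified illustration: it assumes peptides have already been identified, uses exact string matching against an in-silico tryptic digest (cleavage after K or R unless followed by P, no missed cleavages), and all sequences and function names are hypothetical. Actual search engines match fragmentation spectra against the ORF-derived search space and control the false discovery rate.

```python
def tryptic_peptides(protein, min_len=6):
    """In-silico digest: cleave after K or R unless the next residue is P."""
    peptides = set()
    start = 0
    for i, aa in enumerate(protein):
        is_last = i == len(protein) - 1
        if is_last or (aa in "KR" and protein[i + 1] != "P"):
            peptide = protein[start:i + 1]
            if len(peptide) >= min_len:
                peptides.add(peptide)
            start = i + 1
    return peptides


def orfs_with_peptide_evidence(orf_proteins, identified_peptides, min_peptides=1):
    """Keep ORFs supported by at least `min_peptides` identified peptides."""
    identified = set(identified_peptides)
    supported = {}
    for orf_id, protein in orf_proteins.items():
        hits = tryptic_peptides(protein) & identified
        if len(hits) >= min_peptides:
            supported[orf_id] = sorted(hits)
    return supported


# Hypothetical predicted ORF translations and peptide identifications.
orf_proteins = {
    "contig1_orf1": "MASTEELKGLPVDDRAAPLLSWTK",
    "contig2_orf1": "MGGHHWLLPQRSSTTAVVK",
}
identified_peptides = ["GLPVDDR", "AAPLLSWTK", "TTDDLLNNK"]

supported = orfs_with_peptide_evidence(orf_proteins, identified_peptides)
print(supported)
```

In this toy input only the first ORF is retained, because none of the identified peptides maps to the second one.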

Fig. 3

General outline of the PTA (Proteo-Transcriptomics Assembly) approach. RNA-sequencing data of all poly-adenylated RNA molecules of any species and any cell origin are used for transcript assembly, in which individual reads are merged into potential full-length transcript contigs. The predicted transcript contigs are then translated in silico into predicted protein sequences in all possible frames. These predictions are used to find potential open reading frames, taking important features of common protein-coding transcripts (such as a methionine start and an in-frame stop codon) into consideration. In parallel, the proteome of the same sample used for RNA-sequencing is measured with a high-resolution mass spectrometer. The mass spectrometer first records the mass/charge (m/z) of each peptide ion and then selects the peptide ions individually to obtain sequence information via MS/MS. Peptide fragmentation spectra are matched to in silico generated peptide fragmentation patterns. The ultimate result of the process is a set of transcript contigs that were validated by the presence of peptides and hence represent high-confidence protein-coding transcripts.
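The in-silico translation step of the caption above can be sketched in a few lines. The sketch assumes the standard genetic code, reports only complete ORFs (methionine start to in-frame stop), and uses a toy contig and an arbitrary minimum length; dedicated tools such as those listed in Fig. 4 also consider partial ORFs and coding potential. The helper names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def find_orfs(contig, min_aa=5):
    """Return (strand, frame, protein) for ORFs with a Met start and in-frame stop."""
    orfs = []
    for strand, seq in (("+", contig), ("-", reverse_complement(contig))):
        for frame in range(3):
            protein = translate(seq[frame:])
            # The piece after the last '*' lacks a stop codon and is skipped.
            for segment in protein.split("*")[:-1]:
                start = segment.find("M")
                if start != -1 and len(segment) - start >= min_aa:
                    orfs.append((strand, frame, segment[start:]))
    return orfs


# Toy contig (hypothetical): one complete ORF in forward frame 2.
contig = "CCATGGCTAAAGGTGAACTGTAAGC"
print(find_orfs(contig))
```

For strand-specific RNA-Seq libraries the reverse-strand frames could be skipped; all six frames are scanned here to match "all possible frames" in the caption.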

Proteo-Transcriptomics assembly – Challenges

Computational complexity

While PTA delivers promising results, the implementation of the various programs, tools and custom scripts is not yet a straightforward endeavor. A possible workflow (outlined in Fig. 4) for the full analysis including QC requires the implementation of at least 10 different programs. As there is no pipeline available yet, the application of the approach remains reserved to computationally experienced researchers, which restricts its use. One possible solution to this issue would be the use of a workflow framework that eases the writing of data-intensive computational pipelines, e.g. Nextflow [49], Snakemake [50] or bpipe [51], to automate all relevant programming steps in a parallelized and preferably portable pipeline. This would enable a scalable and reproducible analysis also for research groups with less computational experience.

Fig. 4

General outline of the PTA workflow. In blue: RNA-Seq data preparation steps include 1. validation of sufficient quality of the sequencing data (FastQC [52], fastqp [53], fastq-stats [54]); 2. raw RNA-Seq read correction and adapter removal (Rcorrector [55], QuorUM [56], specialized scripts from TranscriptomeAssemblyTools (FilterUncorrectablePEfastq.py), TrimGalore (a wrapper around Cutadapt [57] and FastQC [52])); 3. mapping of reads to a reference genome for the genome-guided mode (STAR [58], Bowtie2 [59], BWA [60], Hisat2 [61], TopHat2 [62]); 4. transcriptome assembly (Trinity [20], [21], Oases [22], Trans-ABySS [19], SOAPdenovo-Trans [24], IDBA-Tran [23], Bridger [26], BinPacker [27], Shannon [25], SPAdes-sc [28], SPAdes-rna [28]); 5. identification of candidate coding regions within reconstructed transcript sequences from the previous step (TransDecoder [21], FrameD [63], GeneMarkS [64]). In green: mass spectrometry spectra processing and filtering (MaxQuant [65], ProteomeDiscoverer (Thermo Scientific), FragPipe [66], MS-GF+ [67]). In red: The predicted ORF protein sequences will be used as search space for the identified peptides extracted from MS/MS spectra. In yellow: ORFs with peptide evidence can be functionally annotated (Trinotate [68], blast2GO [69], annot8r [70], Annoscript2 [71]). Newly established annotations can be compared with current annotations, e.g. from UniProt and Ensembl (blastp [72], DIAMOND [73]), checked for assembly quality standards (TransRate [40], rnaQUAST [39], Detonate [41]) and examined for proteome completeness (BUSCO [42]). Programs that can be used for the individual steps are listed, while the ones that were tested to work well and deliver satisfactory results in our hands are bolded. The list, though comprehensive, is not intended to be complete. Beyond the tools listed, alternative tools that work equally well may exist or be in development.
The right panel depicts the computation times of the different steps compared between High-Performance-Computing machines and strong tabletop PCs. The times are only representative, based on the tools marked bold, and depend on the amount of raw data processed and the underlying computing architecture. Execution time may vary for alternative tools used for the individual steps.

Transcriptome assembly accuracy

A known issue with all transcriptome assembly programs is a more or less severe level of fragmented contig assembly. Such fragmented contigs lack a start codon, a stop codon, or both, and hence represent only partial open reading frames and lead to noisy results. The main reasons for partially assembled contigs are low read coverage at a locus, repetitive regions, differential expression of different exons, polymorphism, and sequencing errors, which might potentially lead to local assembly errors. The most efficient way to clean assemblies of these false contigs would be to use measures that detect any of the underlying causes and then to filter out contigs with a high chance of being misassembled, keeping only high-confidence full-length contigs. There are two programs that facilitate the detection of such features. Both TransRate [40] and Detonate [41] provide metrics which take the mapping of reads against the contigs into account in assessing the assembly quality. In addition to an overall score for a given assembly, TransRate [40] and Detonate [41] provide a score for each contig within the assembly that assesses how well that contig is supported by the RNA-Seq data and that can be used to filter suspicious contigs. While using these measures can help to enrich for high-confidence predictions, we observed that in the pool of predicted proteins for which peptide evidence can be detected, the overall completeness of the assembled transcripts is considerably higher (Fig. 5). This also emphasizes the importance of adding peptide evidence to predictions, a step most current genome annotations lack (Fig. 1).
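Filtering on a per-contig support score, as described above, amounts to a simple threshold step. The sketch below assumes the scores have already been parsed into a dictionary; the contig names, score values and cutoff are hypothetical and do not reflect the output format or recommended thresholds of TransRate or Detonate.

```python
def filter_contigs(contigs, scores, min_score):
    """Keep contigs whose per-contig support score reaches `min_score`.

    Contigs without a score are treated as unsupported and dropped.
    """
    kept = {}
    for name, sequence in contigs.items():
        if scores.get(name, 0.0) >= min_score:
            kept[name] = sequence
    return kept


# Hypothetical assembly and per-contig scores.
contigs = {"c1": "ATGGCC", "c2": "ATGAAA", "c3": "GGGTTT"}
scores = {"c1": 0.62, "c2": 0.05}

high_confidence = filter_contigs(contigs, scores, min_score=0.2)
print(sorted(high_confidence))
```

A suitable cutoff depends on the data set and the scoring tool, so it is best chosen by inspecting the score distribution rather than fixed in advance.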

Fig. 5

Open reading frames can be predicted from the assembled transcripts. A known issue of transcriptome assembly is that under certain circumstances (see details in main text) the assembler is not able to assemble the complete transcript but the assembled transcript rather represents a fragment of the actual transcript. The completeness can be measured by comparing the assembled transcripts to current annotations. Depicted are the proportions of assembled transcripts in our previously published transcriptome assembly of the silkworm Bombyx mori [74] with different levels of completeness when compared to the genome-based annotation of the silkworm from SilkBase [75]. The left panel represents the distributions in all raw transcript assemblies. Only around 62% of the transcripts show completeness of more than 80%. However, in the pool of predicted open reading frames that could be verified at the protein level (depicted in the right panel) the proportion of near complete transcripts increases to 82%. These gene annotations with additional peptide evidence are enriched for full-length transcripts and thereby increase accuracy.

Considerations of proteome coverage

Addition of protein data will increase confidence in the existence of an assembled transcript. Most proteomic data is available as peptide identifications from bottom-up experiments and can be accessed in databases like PRIDE [48] and MassIVE [76]. However, while high-confidence peptide identification has been aided by ever more accurate mass spectrometers in the last decade, currently, even for in-depth proteomes, we only measure peptides of the more abundant proteins in a sample [77]. This naturally limits the PTA approach, as only a fraction of the predicted open reading frames can be supported by peptide evidence. Despite this current limit, increases in proteome coverage can further enhance the comprehensiveness of PTA. Advances in mass spectrometry instrumentation [78], [79], [80] and acquisition methods [81], [82], [83] enable increasing measurement depth. Specific methodology such as the removal of high-abundance proteins or fractionation approaches can spread sample complexity across several measurements and is readily implementable [84]. In addition, the use of samples from different developmental stages, tissues or treatments can modulate and increase the pool of expressed proteins, allowing more peptide evidence to be obtained [85].

Summary and outlook

Identifying all coding regions in a genome is crucial for any study at the level of molecular biology, ranging from single-gene cloning to genome-wide measurements using RNA-Seq or mass spectrometry. While satisfactory annotation has been made feasible for well-studied model organisms through great efforts of big consortia, for many species this kind of data is either absent or not adequately precise. We here reviewed an approach that seeks to overcome many of the bottlenecks of detecting protein-coding regions in the genome. We previously showed that by combining in-depth transcriptome sequencing and high-resolution mass spectrometry through proteotranscriptomics we achieved improved annotation of protein-coding genes in the Bombyx mori cell line BmN4, which is an increasingly used tool for the analysis of piRNA biogenesis and function [74]. Using the PTA approach, we provided the exact coding sequence and evidence on the protein level for more than six thousand expressed genes. This approach outperformed current Bombyx mori gene annotation efforts from 4 different sources in terms of accuracy and coverage [74]. Similar approaches were also successfully applied by other groups in various species and fields, such as in human placental samples [86] and leukemia cells [87], for the detection of microproteins in human [88], in rat [89], pigs [90], mosquitos [91], [92], in a combined analysis of human and adenovirus [93], and in plants such as the opium poppy [94] and Michelia maudiae [95], demonstrating that proteotranscriptomics is widely applicable. The presented PTA approach can in principle be applied by any individual lab and without prior genomic information. Although most labs do not have their own next-generation sequencer or high-resolution mass spectrometer, access to these services from different in-house or external providers is easily available.
In principle, already existing data from different RNA-Seq and proteome repositories can be incorporated equally well for PTA. As mentioned above, a significant bottleneck of PTA is the computational complexity of the different bioinformatic analysis steps, which also need considerable computing resources. These obstacles can be overcome by building a computational pipeline that executes the different processes in a highly parallelized and streamlined manner on an HPC platform or in the cloud. Indeed, we are currently developing a workflow that will be deployable in cloud computing infrastructure and will make benchmarked PTA feasible for anyone interested. Another common issue, fragmented transcript assemblies, has prompted the development of quality control programs that provide quality measures for assembled contigs. In the future, we envision integrating this QC information with a machine learning algorithm to facilitate identifying potentially fragmented transcript assemblies even more precisely. In summary, proteotranscriptomics is an efficient, cost-effective and accurate approach to improve previous gene annotations or generate completely new gene models. As this technique is based on de novo transcriptome assembly, it provides the possibility to study any species, also in the absence of genome sequence information, for which proteogenomics in its stricter meaning is impossible. Easier computational access and solving major bottlenecks such as program application, efficient transcriptome assembly and automatic quality controls are the next steps to make this approach feasible and reproducible for the broader scientific community.

CRediT authorship contribution statement

Michal Levin: Conceptualization, Data curation, Writing – original draft, Writing – review & editing. Falk Butter: Conceptualization, Writing – original draft, Writing – review & editing.

Declaration of Competing Interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Falk Butter reports administrative support was provided by German Research Foundation.

90 in total

1. Bpipe: a tool for running and managing bioinformatics pipelines.

Authors: Simon P Sadedin; Bernard Pope; Alicia Oshlack
Journal: Bioinformatics Date: 2012-04-12 Impact factor: 6.937

2. De novo assembly and analysis of RNA-seq data.

Authors: Gordon Robertson; Jacqueline Schein; Readman Chiu; Richard Corbett; Matthew Field; Shaun D Jackman; Karen Mungall; Sam Lee; Hisanaga Mark Okada; Jenny Q Qian; Malachi Griffith; Anthony Raymond; Nina Thiessen; Timothee Cezard; Yaron S Butterfield; Richard Newsome; Simon K Chan; Rong She; Richard Varhol; Baljit Kamoh; Anna-Liisa Prabhu; Angela Tam; YongJun Zhao; Richard A Moore; Martin Hirst; Marco A Marra; Steven J M Jones; Pamela A Hoodless; Inanc Birol
Journal: Nat Methods Date: 2010-10-10 Impact factor: 28.547

Review 3. Data-Independent Acquisition Mass Spectrometry-based Proteomics and Software Tools: A Glimpse in 2020.

4. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana.

5. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype.

6. MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics.

7. A time-resolved proteotranscriptomics atlas of the human placenta reveals pan-cancer immunomodulators.

8. Fast and accurate short read alignment with Burrows-Wheeler transform.

9. Splice-Junction-Based Mapping of Alternative Isoforms in the Human Proteome.

10. An analysis of tissue-specific alternative splicing at the protein level.