Abstract
Applications in omics research, such as comparative transcriptomics and proteomics, require knowledge of the species-specific gene sequences and benefit from a comprehensive, high-quality annotation of the coding genes to achieve high coverage. While protein-coding genes can in simple cases be detected by scanning the genome for open reading frames, in more complex genomes exonic sequences are separated by introns. Despite advances in sequencing technologies that allow for ever-growing numbers of sequenced genomes, the quality of many of the provided genome assemblies does not reach reference quality. These non-contiguous assemblies with gaps, together with the necessity to predict splice sites, limit accurate gene annotation from genomic data alone. In contrast, the transcriptome contains only transcribed gene regions, is devoid of introns and thus provides the optimal basis for the identification of open reading frames. The additional integration of proteomics data to validate predicted protein-coding genes further enriches for accurate gene models. This review outlines the principles of the proteotranscriptomics approach, discusses common challenges and suggests methods for improvement.
Keywords: Gene annotation; Proteomics; Proteotranscriptomics; Transcriptomics
Year: 2022 PMID: 35891789 PMCID: PMC9293588 DOI: 10.1016/j.csbj.2022.07.007
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155
Fig. 1. Paradigm of proteome assembly releases: The number of entries per UniProt Knowledgebase (UniProtKB) [15] release increases extensively with time (upper panel). The vast majority of these entries (greater than 99%), however, are merely inferred by homology or predicted and have no biological evidence at the transcript or protein level (lower panel). All data presented were extracted from the release notes of the respective UniProt release.
Fig. 2. Main genome annotation steps. Many steps, such as repeat masking, protein homology prediction and the alignment of open reading frames from other species, incorporate data from other assemblies and annotations. Mistakes in those sources are therefore transferred to the new annotation, resulting in impaired precision.
Fig. 3. General outline of the PTA (Proteo-Transcriptomics Assembly) approach. RNA-sequencing data of all poly-adenylated RNA molecules of any species and any cell origin are used for transcript assembly, in which individual reads are concatenated into potential full-length transcript contigs. The predicted transcript contigs are then translated in silico into predicted protein sequences in all possible frames. These predictions are used to find potential open reading frames, taking important features of common protein-coding transcripts (such as a methionine start and an in-frame stop codon) into consideration. In parallel, the proteome of the same sample used for RNA-sequencing is measured with a high-resolution mass spectrometer. The mass spectrometer first records the mass/charge (m/z) of each peptide ion and then selects the peptide ions individually to obtain sequence information via MS/MS. Peptide fragmentation spectra are matched to in silico generated peptide fragmentation patterns. The ultimate result of the process is a set of transcript contigs that were validated by the presence of peptides and hence represent high-confidence protein-coding transcripts.
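The in silico translation and ORF detection step described in this caption can be illustrated with a minimal Python sketch. It assumes the standard genetic code and reports only complete ORFs (methionine start through an in-frame stop). The function names and the `min_aa` length cut-off are hypothetical choices for illustration; this is not how TransDecoder or the other cited tools are implemented.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: no
# alternative translation table is needed for the organism of interest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate codon by codon; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq):
    """Translate a contig in all three offsets on both strands."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[(strand, offset)] = translate(s[offset:])
    return frames


def find_orfs(seq, min_aa=3):
    """Collect complete ORFs: first M in a frame up to the next in-frame stop."""
    orfs = []
    for (strand, offset), protein in six_frame_translations(seq).items():
        start = None
        for i, aa in enumerate(protein):
            if aa == "M" and start is None:
                start = i
            elif aa == "*" and start is not None:
                if i - start >= min_aa:
                    orfs.append((strand, offset, protein[start:i]))
                start = None
    return orfs


# Toy contig (made up for illustration): ATG GCT AAA TGA sits at offset 2.
print(find_orfs("CCATGGCTAAATGACC"))  # [('+', 2, 'MAK')]
```

Real transcript contigs are often fragments, so dedicated ORF predictors also consider partial ORFs that lack a start or stop codon; this sketch deliberately ignores them.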
Fig. 4. General outline of the PTA workflow. In blue: RNA-Seq data preparation steps include 1. the validation of sufficient quality of the sequencing data (FastQC [52], fastqp [53], fastq-stats [54]); 2. raw RNA-Seq read correction and adapter removal (Rcorrector [55], QuorUM [56], specialized scripts from TranscriptomeAssemblyTools (FilterUncorrectablePEfastq.py), TrimGalore (a wrapper around Cutadapt [57] and FastQC [52])); 3. mapping of reads to a reference genome for the genome-guided mode (STAR [58], Bowtie2 [59], BWA [60], Hisat2 [61], TopHat2 [62]); 4. transcriptome assembly (Trinity [20], [21], Oases [22], Trans-ABySS [19], SOAPdenovo-Trans [24], IDBA-Tran [23], Bridger [26], BinPacker [27], Shannon [25], SPAdes-sc [28], SPAdes-rna [28]); 5. identification of candidate coding regions within the transcript sequences reconstructed in the previous step (TransDecoder [21], FrameD [63], GeneMarkS [64]). In green: mass spectrometry spectra processing and filtering (MaxQuant [65], ProteomeDiscoverer (Thermo Scientific), FragPipe [66], MS-GF+ [67]). In red: The predicted ORF protein sequences are used as the search space for the identified peptides extracted from MS/MS spectra. In yellow: ORFs with peptide evidence can be functionally annotated (Trinotate [68], blast2GO [69], annot8r [70], Annoscript2 [71]). Newly established annotations can be compared with current annotations, e.g., from UniProt and Ensembl (blastp [72], DIAMOND [73]), checked for assembly quality standards (TransRate [40], rnaQUAST [39], Detonate [41]) and examined for proteome completeness (BUSCO [42]). Programs that can be used for the individual steps are listed, while the ones that were tested to work well and deliver satisfactory results in our hands are bolded. The list, though comprehensive, is not intended to be complete. Beyond the tools listed, alternative tools that work equally well may exist or be under development.
The right panel depicts the computation times of the different steps compared between High-Performance-Computing machines and strong tabletop PCs. The times are only representative, based on the tools marked bold, and depend on the amount of raw data processed and the underlying computing architecture. Execution time may vary for alternative tools used for the individual steps.
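The "red" step of the workflow, in which predicted ORF protein sequences serve as the search space for identified peptides, can be sketched in Python as a simple set lookup. This is a toy illustration under strong assumptions: fully tryptic peptides (cleavage after K or R, not before P), no missed cleavages, no modifications, and exact string matching. Search engines such as MaxQuant or MS-GF+ match fragmentation spectra and control false discovery rates, which this sketch does not attempt. All function names, ORF identifiers and sequences below are hypothetical.

```python
import re


def tryptic_peptides(protein, min_len=6):
    """In silico tryptic digest: cleave after K or R unless followed by P.

    Relies on re.split accepting zero-width matches (Python 3.7+).
    """
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    return {p for p in pieces if len(p) >= min_len}


def orfs_with_peptide_evidence(orfs, identified_peptides, min_peptides=1):
    """Keep ORFs whose digest shares at least `min_peptides` identified peptides."""
    identified = set(identified_peptides)
    supported = {}
    for orf_id, protein in orfs.items():
        hits = tryptic_peptides(protein) & identified
        if len(hits) >= min_peptides:
            supported[orf_id] = sorted(hits)
    return supported


# Made-up ORF predictions and peptide identifications for illustration only.
predicted_orfs = {
    "contig1_orf1": "MAGSLKTPEPTIDERAAAAAAK",
    "contig2_orf1": "MSSSSSSKLLLLLLR",
}
identified = ["TPEPTIDER", "NOTINDB"]

print(orfs_with_peptide_evidence(predicted_orfs, identified))
```

Raising `min_peptides` trades sensitivity for confidence: requiring two or more distinct peptides per ORF is a common way to be stricter, at the cost of losing short or low-abundance proteins.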
Fig. 5. Open reading frames can be predicted from the assembled transcripts. A known issue of transcriptome assembly is that under certain circumstances (see details in main text) the assembler cannot reconstruct the complete transcript, so that the assembled transcript represents only a fragment of the actual transcript. Completeness can be measured by comparing the assembled transcripts to current annotations. Depicted are the proportions of assembled transcripts in our previously published transcriptome assembly of the silkworm Bombyx mori [74] with different levels of completeness when compared to the genome-based annotation of the silkworm from SilkBase [75]. The left panel represents the distribution across all raw transcript assemblies. Only around 62% of the transcripts show a completeness of more than 80%. However, in the pool of predicted open reading frames that could be verified at the protein level (depicted in the right panel), the proportion of near-complete transcripts increases to 82%. These gene annotations with additional peptide evidence are enriched for full-length transcripts and thereby increase accuracy.
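The proportions reported in this caption can be computed along the following lines. The caption does not define the completeness metric, so this Python sketch assumes one plausible definition: the aligned length of an assembled transcript divided by the length of its best-matching reference annotation, capped at 1. The function names and the example numbers are hypothetical and are not the Bombyx mori data.

```python
def completeness(aligned_len, reference_len):
    """Assumed metric: fraction of the reference sequence covered, capped at 1."""
    return min(aligned_len / reference_len, 1.0)


def fraction_near_complete(records, threshold=0.8):
    """Share of transcripts whose completeness exceeds `threshold`.

    `records` is an iterable of (aligned_len, reference_len) pairs.
    """
    scores = [completeness(a, r) for a, r in records]
    return sum(s > threshold for s in scores) / len(scores)


# Made-up (aligned length, reference length) pairs for illustration only.
all_transcripts = [(900, 1000), (500, 1000), (1000, 1000), (810, 1000)]
print(fraction_near_complete(all_transcripts))
```

Applying the same function once to all assembled transcripts and once to the subset with peptide evidence would give the two proportions that the left and right panels compare.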