Laetitia Guillot1,2, Ludovic Delage3, Alain Viari4, Yves Vandenbrouck5, Emmanuelle Com1,2, Andrés Ritter3,6, Régis Lavigne1,2, Dominique Marie3, Pierre Peterlongo7, Philippe Potin3, Charles Pineau8,9.
Abstract
BACKGROUND: Accurate structural annotation of genomes is still a challenge, despite the progress made over the past decade. The prediction of gene structure remains difficult, especially for eukaryotic species, and is often erroneous and incomplete. We used a proteogenomics strategy, taking advantage of the combination of proteomics datasets and bioinformatics tools, to identify novel protein-coding genes and splice isoforms, assign correct start sites, and validate predicted exons and genes.
Keywords: Bioinformatics; Genome annotation; Peptide sequence tag; Proteogenomics; Proteomics; Tandem mass spectrometry
Year: 2019 PMID: 30654742 PMCID: PMC6337836 DOI: 10.1186/s12864-019-5431-9
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1a Project workflow: Samples corresponding to various stages of the life cycle (sporophyte, gametophyte, and gametes) and sub-cellular compartments of Ectocarpus sp. were prepared for MS analysis. PSTs were generated from the MS/MS data, mapped against the genome, and clustered. We classified each cluster according to its genomic annotation (whether one hit overlapped with or included at least one CDS of an identified protein) and the number of typical spectra, and verified whether all hits were included in a CDS or not. The locations of clusters of interest were written into GFF files to be integrated into a genome viewer. The results of cluster qualification and visualization along the contig allowed us to select clusters for experimental characterization. b Illustration of PSTs obtained from an MS/MS spectrum. MS/MS spectra are composed of ions resulting from the fragmentation of peptides at their peptide bonds during tandem mass spectrometry analysis. The generated fragment ions differ in mass by the masses of their adjacent amino acids within the peptide sequence. A PST is a partial element of information deduced from an MS/MS spectrum, defined as a small sequence of several probable adjacent amino acids from the original peptide and the masses of the flanking N- and C-terminal fragment ions of this small sequence
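The idea behind panel b, that mass differences between neighbouring fragment ions point to amino-acid residues, can be sketched in a few lines. This is only an illustration under simplifying assumptions and not the tag generator used in the study: the function names `residue_for` and `extract_pst`, the 0.02 Da tolerance, the reduced residue table, and the example ion ladder are all hypothetical, and the flanking masses are simply taken as the lightest and heaviest ions of the ladder.

```python
# Monoisotopic residue masses in Da (standard values; subset for illustration).
# Leucine and isoleucine are isobaric, so only "L" is listed.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "D": 115.02694,
    "E": 129.04259, "F": 147.06841, "R": 156.10111,
}


def residue_for(delta, tol=0.02):
    """Return the residue whose mass matches delta within tol, else None."""
    for aa, mass in RESIDUE_MASS.items():
        if abs(mass - delta) <= tol:
            return aa
    return None


def extract_pst(ladder, tol=0.02):
    """Turn consecutive fragment-ion masses into (low_mass, tag, high_mass).

    ladder is a sorted list of masses from a single ion series (assumed).
    Returns None if the ladder is too short or a gap matches no residue.
    """
    if len(ladder) < 2:
        return None
    tag = []
    for lighter, heavier in zip(ladder, ladder[1:]):
        aa = residue_for(heavier - lighter, tol)
        if aa is None:
            return None
        tag.append(aa)
    return ladder[0], "".join(tag), ladder[-1]


# Hypothetical ladder: 200.0 Da followed by gaps matching A, G and D.
ladder = [200.0, 271.03711, 328.05857, 443.08551]
print(extract_pst(ladder))
```

A real tag generator also has to cope with noise peaks, mixed ion series, and modified residues, which this sketch ignores.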
Fig. 2 Steps of the bioinformatics workflow
Fig. 3a Mapping step testing method. Predicted proteins matched by PSTs were compared to the expected proteins, identified by conventional database spectral identifications, using the same spectra and protein sequences database (ORCAE). b Definition of the cluster exon-mapping categories according to their mRNA locations (Ectocarpus sp. GFF3 files, ORCAE): IN when clusters were located inside an mRNA feature, OUT when clusters were located outside an mRNA feature, and CROSS when clusters were located across an mRNA feature. c Clustering step testing method. We only used the green dataset. The number of predicted proteins (matched by IN and CROSS clusters) was compared to that of expected proteins identified with Proteome Discoverer™ using the Ectocarpus sp. database (ORCAE)
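The exon-mapping categories of panel b can be expressed as a simple interval comparison. The sketch below is an assumption-laden illustration rather than the pipeline's own code: `classify_cluster` is a hypothetical name, coordinates are treated as 1-based inclusive (as in GFF3), strand is ignored, and a cluster that spans an mRNA boundary or fully contains an mRNA is counted as CROSS.

```python
def classify_cluster(start, end, mrnas):
    """Label a cluster IN, CROSS or OUT relative to mRNA features.

    mrnas is a list of (start, end) tuples on the same contig (assumed).
    """
    label = "OUT"
    for m_start, m_end in mrnas:
        if m_start <= start and end <= m_end:
            return "IN"        # cluster lies entirely inside one mRNA
        if start <= m_end and end >= m_start:
            label = "CROSS"    # partial overlap with an mRNA boundary
    return label


# Hypothetical mRNA features and clusters on one contig.
mrnas = [(100, 500), (800, 1200)]
for cluster in [(150, 300), (450, 600), (600, 700)]:
    print(cluster, classify_cluster(*cluster, mrnas))
```

Checking every mRNA before settling on CROSS ensures that a cluster nested inside one feature is labelled IN even if it also overlaps another.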
Comparison of tag quality between Peaks and PepNovo+, using Proteome Discoverer™ peptide identifications as the expected sequences
| Peaks ID | Peaks tag sequence | Peaks significant score (>60%) | Peaks rank | Proteome Discoverer protein AC | Proteome Discoverer reference sequence | Proteome Discoverer e-value | PepNovo+ ID | PepNovo+ tag sequence | PepNovo+ significant score (>1.5) |
|---|---|---|---|---|---|---|---|---|---|
| 271 | WVQAAGAGASR | 23 | 0 | Esi0085_0010 | | 5.65568E-06 | Spectrum271_scans | | 6.346 |
| 271 | WVQAAGAGWK | 21.5 | 1 | | | | | | |
| 271 | WVQAAGAGTGR | 18.5 | 2 | | | | | | |
| 552 | (CamC)VGVSEETTTRHR | 31 | 0 | Esi0091_0058 | | 2.36909E-07 | Spectrum552_scans | GVSEDD | 4.237 |
| 552 | QMGVSEETTTRHR | 29 | 1 | | | | | | |
| 552 | V(CamC)GVSEETTTRHR | 17 | 2 | | | | | | |
| 577 | QFAGDDAPR | 43 | 0 | Esi0203_0038 | | 1.69937E-06 | Spectrum578_scans | | 7.539 |
| 577 | K(MetOxM)AGDDAPR | 41 | 1 | | | | | | |
| 577 | | 11.5 | 2 | | | | | | |
| 642 | KAENPMSKR | 100 | 0 | Esi0349_0012 | | 1.01272E-05 | Spectrum643_scans | DLDTLR | 4.727 |
| 642 | KAQDPMSKR | 0.1 | 1 | | | | | | |
| 642 | QAKDPMSKR | 0.02 | 2 | | | | | | |
| 1118 | DGLVYGK(MetOxM)NEPPGAR | 38 | 0 | Esi0327_0021 | | 2.42457E-06 | Spectrum1119_scans | | 1.396 |
| 1118 | DGLVYGQFNEPPGAR | 37 | 1 | | | | | | |
| 1118 | DGLVYGQ(MetOxM)NEPPGVK | 8 | 2 | | | | | | |
| 2462 | | 87.5 | 0 | Esi0091_0058 | | 1.64304E-05 | Spectrum2463_scans | | 4.78 |
| 2462 | DESAAVFAGTR | 8 | 1 | | | | | | |
| 2462 | LMSAAV(MetOxM)AGEK | 1.5 | 2 | | | | | | |
| 2479 | MDDLTNNALARK | 58 | 0 | Esi0888_0002 | | 4.70366E-07 | Spectrum2480_scans | | 1.943 |
| 2479 | (MetOxM)VDLTNNALAAGR | 39.5 | 1 | | | | | | |
| 2479 | MDDLTNGGALARK | 1 | 2 | | | | | | |
| 2558 | ED(MetOxM)ETE(CamC)AVNYDNLYQVMK | 48 | 0 | Esi0349_0012 | | 1.01272E-05 | Spectrum2559_scans | | 2.289 |
| 2558 | DE(MetOxM)ETE(CamC)AVNYDNLYQVMK | 8.5 | 1 | | | | | | |
| 2558 | ED(MetOxM)ETE(CamC)AVNYDNLYQMVK | 8 | 2 | | | | | | |
| 2610 | TFQAGEVASALLGR | 32 | 0 | Esi0327_0021 | | 5.99619E-08 | Spectrum2610_scans | | 4.356 |
| 2610 | | 25 | 1 | | | | | | |
| 2610 | TFKAGAGNGSALLGR | 10 | 2 | | | | | | |
| 3191 | | 53 | 0 | Esi0327_0021 | | | Spectrum3196_scans | | 4.471 |
| 3191 | GLLTGLTLAEYFR | 25.5 | 1 | | | | | | |
| 3191 | AVLTGLTLAEYFR | 11 | 2 | | | | | | |
The Peaks and PepNovo+ results are shown in the first and last column groups, with the Proteome Discoverer™ identifications in between as reference sequences. We ran Peaks, PepNovo+, and Proteome Discoverer™ on the top 10 high-quality selected spectra. Each spectrum corresponds to one row in the table. Sequences shown in bold are the correct sequences, thus four for Peaks and eight for PepNovo+
Fig. 4 Sensitivity and selectivity were calculated for each dataset. Sensitivity measures the proportion of positives that are correctly identified by clusters and selectivity is the proportion of positives that are correctly identified among all proteins matched by clusters. a Selectivity and sensitivity for each reference dataset with MINHIT = 1. b Selectivity and sensitivity for each reference dataset with MINHIT = 2. c Selectivity and sensitivity for each reference dataset with MINHIT = 2 and one amino acid modification allowed
Fig. 5a Percentage of clusters qualified as IN, OUT, or CROSS with respect to predicted gene locations from Ectocarpus sp. GFF3 files (ORCAE). b The number of proteins (matched by IN and CROSS clusters) relative to the number of expected proteins (identified with Proteome Discoverer™) allowed the calculation of sensitivity and selectivity. Sensitivity measures the proportion of positives that are correctly identified by IN or CROSS clusters. Selectivity corresponds to the proportion of positives that are correctly identified among all proteins matched by IN or CROSS clusters
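The two metrics defined in these captions can be written down compactly. The following is a minimal sketch, assuming that the "predicted" set holds the proteins matched by clusters and the "expected" set holds the proteins identified with Proteome Discoverer™; the function name and the accession lists are hypothetical placeholders, not data from the study.

```python
def sensitivity_selectivity(predicted, expected):
    """Return (sensitivity, selectivity) for two collections of protein IDs.

    sensitivity = correctly matched proteins / expected proteins
    selectivity = correctly matched proteins / all proteins matched by clusters
    """
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    sensitivity = true_positives / len(expected) if expected else 0.0
    selectivity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, selectivity


# Made-up accessions purely for illustration.
predicted = ["P1", "P2", "P3", "P9"]
expected = ["P1", "P2", "P3", "P4", "P5"]
print(sensitivity_selectivity(predicted, expected))  # (0.6, 0.75)
```

Under these definitions, selectivity is what is often called precision, and sensitivity is recall.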
Fig. 6a Distribution of all ANNOTATED/UNANNOTATED Ectocarpus sp. clusters. b Distribution of ANNOTATED clusters according to category. c Distribution of UNANNOTATED clusters according to the category
Fig. 7a Artemis view of cluster “A” for validation. Ectocarpus sp. genes close to these clusters are represented by linear exons colored in yellow. Another representation of the same exons is shown in light blue along the six reading frames (+1, +2, +3 above and −1, −2, −3 below). The cluster is indicated by black bold rectangles around the small rectangles of the PSTs. b Data for cluster “A”. c Experimental validation conditions. d RT-PCR validation: agarose gel electrophoresis of RT-PCR products (lane 1), RNA-PCR negative controls (lane 2), and DNA size marker (lane 3) for PST cluster “A”
Fig. 8a Artemis view of cluster “B” for validation. Ectocarpus sp. genes close to these clusters are represented by linear exons colored in yellow. Another representation of the same exons is shown in light blue along the six reading frames (+1, +2, +3 above and −1, −2, −3 below). The cluster is indicated by black bold rectangles around the small rectangles of the PSTs. b Data for cluster “B”. c Translation view of the 5’ cDNA sequence corresponding to the region defined by the cluster
Fig. 9a Artemis view of cluster “C” for validation. Ectocarpus sp. genes close to these clusters are represented by linear exons colored in yellow. Another representation of the same exons is shown in light blue along the six reading frames (+1, +2, +3 above and −1, −2, −3 below). The cluster is indicated by black bold rectangles around the small rectangles of the PSTs. b Data for cluster “C”. c Experimental validation conditions. d RT-PCR validation: agarose gel electrophoresis of RT-PCR products (lane 1), RNA-PCR negative controls (lane 2), and DNA size marker (lane 3) for PST cluster “C”
Fig. 10a Artemis view of cluster “D” for validation. Ectocarpus sp. genes close to these clusters are represented by linear exons colored in yellow. Another representation of the same exons is shown in light blue along the six reading frames (+1, +2, +3 above and −1, −2, −3 below). The cluster is indicated by black bold rectangles around the small rectangles of the PSTs. b Data for cluster “D”. c Experimental validation conditions. d RT-PCR and EST validation: agarose gel electrophoresis of RT-PCR products (lane 1), RNA-PCR negative controls (lane 2), and DNA size marker (lane 3) for PST cluster “D”
Eukaryotic proteogenomics pipelines and Galaxy workflows
| Name | Pipeline interface | Database-driven for peptide identification | De novo peptide interpretation | User-friendly for biologists | Results curation | Results visualization | Description | Relevance |
|---|---|---|---|---|---|---|---|---|
| Peptimapper (released in 2018) | Command line, Docker image, Galaxy tools | – | √ | √ | √ | √ | Peptide Sequence Tags (PSTs) obtained from partial interpretation of ion trap mass spectra are mapped onto the six-frame translation of genomic sequences giving hits. Hits are then clustered to detect potential coding regions. Clusters are evaluated and further compared to existing gene predictions. Clusters are available as GFF file to be uploaded into a genome viewer. | Improves genome annotation |
| IPAW (2018) | Command line | √ | – | – | √ | – | This is an Integrated Proteomics Analysis Workflow: i) Peptide spectra are searched in two different databases in parallel: VarDB filtered by class-specific FDR for SAAV peptides and 6FT of the human genome filtered by peptide pI. ii) SAAV candidates are curated by SpectrumAI and potential novel proteins are blasted onto public databases. iii) Curated results are validated by different controls. | Identification of pseudogenes, lncRNAs, nsSNPs and somatic mutations |
| JUMPg (2016) | Command line | √ | – | – | √ | √ | This pipeline includes construction of multiple customized databases, tag-based database search, peptide-spectrum match filtering, and data visualization. | Improves genome annotation |
| PGMiner (2016) | Command line | √ | – | – | √ | √ | This workflow allows acquisition of mass spectrometric data, peptide identification against preprocessed sequence databases, assignment of statistical confidence to identified peptides, and mapping confident peptides to gene models. | Improves genome annotation |
| PROTEO-FORMER (2015) | Command line, Virtual machine, Galaxy tools | √ | – | √ | √ | √ | RIBO-seq NGS data are processed to delineate proteoforms. RIBO-seq-derived sequences are then translated and mapped to a public database, creating a custom search database for peptide-to-MS/MS matching. | Identification of novel translation products |
| PGTools (2015) | Command line | √ | – | – | √ | √ | The software is divided into 2 phases: Phase 1 contains 8 modules to analyse MS/MS data using known protein databases. Phase 2 contains 5 modules and 7 customized databases that allow MS/MS data to be analysed against the genome. The software includes applications, libraries, customized databases and visualization tools. | Improves genome annotation |
| NextSearch (2015) | Command line | – | – | – | √ | √ | Nucleotide EXon-graph Transcriptome Search identifies peptides by directly searching the nucleotide exon graph against tandem mass spectra. NextSearch outputs a proteome-genome/transcriptome mapping that can be visualized using public tools. | Improves genome annotation |
| ProteoAnnotator (2014) | Command line, Stand alone application | √ | – | √ | √ | √ | MS spectra are queried by one or several proteomics database search engines (MASCOT, OMSSA, X!Tandem or MSGF+) and results are converted into GFF, adding genome coordinates and statistical confidence values. It exports mzIdentML files. | Improves genome annotation |
| Peppy (2013) | Command line, Stand alone application | √ | – | N/A | √ | – | This workflow generates a peptide database from a genome, tracks peptide loci, matches peptides to MS/MS spectra and assigns FDR confidence values to those matches. | Improves genome annotation |
| Protk (released in 2012) | Command line, Galaxy tools | √ | – | √ | – | √ | It is a suite of tools for proteomics providing the following analysis tasks: (i) MS/MS data search with X!Tandem, Mascot, OMSSA and MS-GF+; (ii) peptide and protein inference with Peptide Prophet, iProphet and Protein Prophet; (iii) conversion of pepXML or protXML to tabular format, and (iv) mapping of peptides to genomic coordinates | Improves genome annotation |
| IggyPep (2010) | Web interface | √ | √ | N/A | – | – | The pipeline is based on a database system with an advanced indexing and querying strategy, which holds the translated genome in all six reading frames. It can be queried with de novo sequences or partial peptide sequence tags (PSTs). It determines the ORF amino acid sequences comprising these tags and compiles a FASTA-formatted sequence file for a database-driven search. | Improves genome annotation |
| PepLine (2008) | Command line | – | √ | N/A | √ | – | Peptide Sequence Tags (PSTs) obtained from partial interpretation of QTOF mass spectra are mapped onto the six-frame translation of genomic sequences giving hits. Hits are then clustered to detect potential coding regions. | Improves genome annotation |
| Workflows for Proteomics Informed by Transcriptomics (2015) | Galaxy tools | √ | – | √ | √ | √ | Galaxy Integrated Omics (GIO) provides workflows for 4 common use cases: i) a standard search against a reference proteome; ii) PIT protein identification without a reference genome; iii) PIT protein identification using a genome guide; and iv) PIT genome annotation. | Improves genome annotation |
| Workflows for proteogenomics studies using Galaxy-P (2014–2018) | Galaxy tools | √ | – | √ | √ | √ | These modular workflows incorporate both established and customized software tools that improve the depth and quality of proteogenomic results. | Improves genome annotation |
Available Eukaryotic Proteogenomics pipelines are listed in https://omictools.com/proteogenomics-category. We only selected software types “pipeline/workflow” or “Toolkit/Suite” for comparison to our pipeline. Proteogenomics Galaxy workflows [49, 50] are added at the end of the table
Additional clusters currently under investigation
| Cluster ID | Contig | Strand | From | To | Strain | Tot pep | RNA data | Genetic map | Action | Identification |
|---|---|---|---|---|---|---|---|---|---|---|
| 113 | sctg_117 | D | 265421 | 281797 | EC494 | 10 | ESTs+RNAseq | LGUn | probable new gene | Esi0117_0046 similar sequence |
| 179 | sctg_136 | D | 16590 | 18641 | EC494 | 3 | ESTs+RNAseq | LG16 | Esi0136_0001 model correction | Ferredoxin |
| 750 | sctg_346 | D | 52096 | 53724 | EC494 | 3 | RNAseq | LG15 | Esi0346_0010 model correction | Esi0003_0041 similar sequence |
| 1034 | sctg_6 | D | 824608 | 831877 | EC494 | 3 | RNAseq | LG04 | Esi0006_0137 model correction | Conserved unknown protein |
| 1056 | sctg_62 | D | 30800 | 38368 | EC494 | 9 | RNAseq | LG16 | Esi0062_0006 model correction | Hypothetical protein |
| 1072 | sctg_634 | D | 21444 | 27984 | EC494 | 3 | ESTs+RNAseq | LGUn | probable new gene | none |
| 1154 | sctg_77 | D | 414499 | 420533 | EC494 | 3 | No data | LGUn | probable new gene | none |
| 120 | sctg_123 | R | 77652 | 82101 | Ec32 | 4 | RNAseq | LG08 | probable new gene | none |
| 220 | sctg_150 | D | 399071 | 404913 | Ec32 | 3 | No data | LG11 | probable new gene | none |
| 777 | sctg_43 | R | 146476 | 147692 | Ec32 | 3 | RNAseq | LG03 | Esi0043_0035 model correction | Catalase |
| 822 | sctg_48 | R | 267223 | 267911 | Ec32 | 5 | RNAseq | LG23 | Esi0048_0051 model correction | Hypothetical protein |
| 492 | sctg_253 | D | 215846 | 216917 | Ec32 | 3 | RNAseq | LGUn | probable new gene | none |
| 567 | sctg_291 | R | 60813 | 65620 | Ec32 | 3 | RNAseq | LG03 | Esi0291_0011 model correction | mTERF domain-containing protein |
| 618 | sctg_310 | D | 63974 | 71773 | Ec32 | 3 | RNAseq | LGUn | probable new gene | none |
| 697 | sctg_365 | R | 137829 | 143627 | Ec32 | 3 | RNAseq | LGUn | probable new gene | none |
| 218 | sctg_87 | R | 471174 | 478804 | Ec410 | 3 | No data | LG26 | probable new gene | Retrotransposon integrase-like protein |
Identification of the clusters was obtained by Blast analysis. The contig and genetic map data correspond to the Ectocarpus sp. v1 genome annotation, showing supercontigs (sctg) and linkage groups (LG), respectively. Strain refers to the Ectocarpus sp. strain that was the origin of the protein samples. Action refers to the proposed correction of the current gene annotation according to the newly incorporated RNA data in the browser (RNA sequencing and ESTs); see Additional file 6