Ritesh Krishna1,2, Dong Xia2, Sanya Sanderson2, Achchuthan Shanmugasundram1,2, Sarah Vermont2, Axel Bernal3, Gianluca Daniel-Naguib1, Fawaz Ghali1, Brian P Brunk3, David S Roos3, Jonathan M Wastling2, Andrew R Jones1.
Abstract
Proteomics data can supplement genome annotation efforts, for example by confirming gene models or correcting gene annotation errors. Here, we present a large-scale proteogenomics study of two important apicomplexan pathogens: Toxoplasma gondii and Neospora caninum. We queried proteomics data against a panel of official gene models and alternate gene models generated directly from RNA-Seq data, using several newly generated and some previously published MS datasets for this meta-analysis. We identified a total of 201 996 and 39 953 peptide-spectrum matches for T. gondii and N. caninum, respectively, at a 1% peptide FDR threshold. This equated to the identification of 30 494 distinct peptide sequences and 2921 proteins (matches to official gene models) for T. gondii, and 8911 peptides/1273 proteins for N. caninum following stringent protein-level thresholding. We have also identified 289 and 140 loci for T. gondii and N. caninum, respectively, which mapped to RNA-Seq-derived gene models used in our analysis and are apparently absent from the official annotation (release 10 from EuPathDB) of these species. We present several examples in our study where the RNA-Seq evidence can help correct the current gene model and discover potential new genes. The findings of this study have been integrated into EuPathDB. The data have been deposited to ProteomeXchange with identifiers PXD000297 and PXD000298.
Keywords: Gene annotation; MS/MS; Microbiology; N. caninum; Proteogenomics; T. gondii
Year: 2015 PMID: 25867681 PMCID: PMC4692086 DOI: 10.1002/pmic.201400553
Source DB: PubMed Journal: Proteomics ISSN: 1615-9853 Impact factor: 3.984
The composition of gene models and the number of representative proteins identified in total across our full data collection
| Species | Gene model | Total database entries | Total amino acid count | Representative proteins identified as group leaders | Alternate loci with |
|---|---|---|---|---|---|
| T. gondii | Official | 8322 | 6 669 204 | 2921 | 0 |
| T. gondii | RNA-Seq | 86 699 | 37 847 722 | 289 | 191 |
| N. caninum | Official | 7122 | 6 054 032 | 1273 | 0 |
| N. caninum | RNA-Seq | 13 777 | 8 158 875 | 140 | 101 |
A “representative protein” can encompass more than one record (protein) from the protein database: it incorporates the set of proteins that share the same set or subset of peptide identifications, which avoids double counting proteins that have no independent evidence.
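The grouping idea in this note can be illustrated with a minimal sketch. It is not the pipeline used in the study: the function name `group_proteins`, the greedy largest-first strategy, and the example accessions and peptides are all hypothetical. It only shows how proteins whose peptide set equals, or is a subset of, another protein's peptide set can be collapsed under a single group leader.

```python
def group_proteins(protein_peptides):
    """Collapse proteins with the same set or a subset of a leader's peptides.

    protein_peptides: dict mapping a protein accession to a set of peptides.
    Returns a list of groups, each a dict with 'leader', 'peptides', 'members'.
    This is a simple greedy heuristic, assumed here for illustration only.
    """
    # Visit proteins with the most peptides first (ties broken by name),
    # so that potential leaders are seen before the proteins they subsume.
    ordered = sorted(protein_peptides.items(),
                     key=lambda kv: (-len(kv[1]), kv[0]))
    groups = []
    for name, peptides in ordered:
        peptides = frozenset(peptides)
        if not peptides:
            continue  # no peptide evidence, so nothing to group
        for group in groups:
            # Same set or subset: no independent evidence for this protein.
            if peptides <= group["peptides"]:
                group["members"].append(name)
                break
        else:
            groups.append({"leader": name,
                           "peptides": peptides,
                           "members": [name]})
    return groups


# Hypothetical accessions and peptide sequences.
example = {
    "protA": {"PEPTIDEK", "SAMPLER", "ANOTHERK"},
    "protB": {"PEPTIDEK", "SAMPLER"},              # subset of protA
    "protC": {"PEPTIDEK", "SAMPLER", "ANOTHERK"},  # same set as protA
    "protD": {"UNIQUEK"},                          # independent evidence
}
groups = group_proteins(example)
leaders = [g["leader"] for g in groups]
print(leaders)
```

In this toy input, four database records collapse into two representative proteins, which is the sense in which the counts in the table are smaller than a raw count of matched records would be.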
Figure 1. Peptide evidence indicating an alternate start codon. RNA-Seq-derived gene model tgondii-rna.Saeij_Jeroen_strains.CEPdelta.gene1305 was identified in our analysis, which has a different starting site to the official gene model TGME49_269442. (A) GBrowse screenshot of the alignment of the official gene model, RNA-Seq-derived gene model, and peptides identified. (B) Sequence alignment of official and RNA-Seq-derived gene model. (C) Comparison of InterProScan results between official and RNA-Seq-derived models where a complete EF-hand domain pair was identified in the RNA-Seq-derived model.
Figure 2. Peptide evidence indicating a suggested extension to the official gene model. RNA-Seq-derived gene model tgondii-rna.Saeij_Jeroen_strains.TgCATBr5.gene3708 was identified in our analysis that has a different starting site to the official gene model TGME49_324800. (A) Sequence alignment of official and RNA-Seq-derived gene model where peptides identified in both models are colored in blue and peptides identified only in RNA-Seq-derived model are colored in red. (B) Results from InterProScan indicate relevant domains detected in the RNA-Seq-derived gene model that are missing in the official gene model that has been annotated as tryptophanyl-tRNA synthetase.
Figure 3. Peptide evidence indicating a different splicing site to the official gene model. RNA-Seq-derived gene model tgondii-rna.Saeij_Jeroen_strains.VEG.gene444 was identified in our analysis that has a different starting site to the official gene model TGVEG_295125. (A) GBrowse screenshot showing ORF KI544509-5-1423859-1422096 expands the intron region of the official gene model. (B) Sequence alignment of official and RNA-Seq-derived gene model where peptides identified in both models are colored in blue and peptides identified in the intron region of official gene model are colored in red.