James C Wright, Deana Sugden, Sue Francis-McIntyre, Isabel Riba-Garcia, Simon J Gaskell, Igor V Grigoriev, Scott E Baker, Robert J Beynon, Simon J Hubbard.
Abstract
BACKGROUND: Proteomic data is a potentially rich, but arguably unexploited, data source for genome annotation. Peptide identifications from tandem mass spectrometry provide prima facie evidence for gene predictions and can discriminate among a set of candidate gene models. Here we apply this to the recently sequenced Aspergillus niger fungal genome from the Joint Genome Institute (JGI) and to a second predicted protein set from another A. niger sequence. Tandem mass spectra (MS/MS) were acquired from 1D gel electrophoresis bands and searched against all available gene models using Average Peptide Scoring (APS) and reverse database searching to produce confident identifications at an acceptable false discovery rate (FDR).
Year: 2009 PMID: 19193216 PMCID: PMC2644712 DOI: 10.1186/1471-2164-10-61
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Overview of JGI and DSM A. niger genome data
| JGI | Genome Size | 37.1 Mb |
| JGI | Number of Gene Models generated | 87,287 |
| JGI | Number of filtered "best" Gene Models | 11,200 |
| DSM | Genome Size | |
| DSM | Number of annotated proteins | 14,165 |
Figure 1. Schematic of the Average Peptide Scoring (APS) pipeline using reversed database searching. The final APS threshold is established iteratively for each pkl file searched against a given database, calculated over a range of peptide quality filters. Mascot was used to conduct an initial database search of both forward and reverse databases, and the resulting peptide scores were then used to calculate an average peptide score for matching proteins.
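The scoring step in the caption above can be illustrated with a minimal sketch. It assumes that the APS of a protein is the plain mean of its peptide scores after a quality filter, and that the FDR at a given cut-off is estimated as reverse-database hits divided by forward-database hits. Both are simplifying assumptions; the function names, the input format and the 5% bound are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def average_peptide_scores(peptide_hits, min_peptide_score=0.0):
    """Return {protein_id: mean peptide score} for peptides passing the filter.

    peptide_hits is an iterable of (protein_id, peptide_score) pairs
    (a hypothetical input format, not a Mascot export format).
    """
    scores = defaultdict(list)
    for protein_id, score in peptide_hits:
        if score >= min_peptide_score:
            scores[protein_id].append(score)
    return {protein: sum(vals) / len(vals) for protein, vals in scores.items()}


def aps_threshold(forward_aps, reverse_aps, max_fdr=0.05):
    """Return the lowest APS cut-off whose estimated FDR is <= max_fdr.

    FDR is estimated here as (reverse hits >= cut-off) / (forward hits >= cut-off),
    which is an assumed estimator. Returns None if no cut-off qualifies.
    """
    best = None
    # Candidate cut-offs are the observed forward scores, scanned high to low.
    for cutoff in sorted(set(forward_aps.values()), reverse=True):
        targets = sum(1 for v in forward_aps.values() if v >= cutoff)
        decoys = sum(1 for v in reverse_aps.values() if v >= cutoff)
        if decoys / targets <= max_fdr:
            best = cutoff
    return best


# Toy data: three forward proteins and two reverse (decoy) proteins.
forward = average_peptide_scores([("P1", 60), ("P1", 40), ("P2", 30), ("P3", 10)])
reverse = average_peptide_scores([("R1", 12), ("R2", 8)])
threshold = aps_threshold(forward, reverse)
```

In the pipeline described in the caption this calculation would be repeated for each peptide quality filter (here the `min_peptide_score` argument) and for each pkl file.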
Figure 2. Overview of the relationship between gene models, clusters and predicted proteins. Aspergillus niger proteomics data is mapped via clusters of gene models, which are in turn mapped back to the genomic scaffold using EXONERATE. This allows the gene models located at a particular genomic locus to be assessed and evaluated against the peptides consistent with the proposed gene model structures.
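One simple way to group gene models into per-locus clusters is to merge models whose genomic spans overlap on the same scaffold. The sketch below illustrates that idea only; the overlap rule, the `GeneModel` record and the coordinates are assumptions made for illustration, and the paper's actual clustering criteria are not stated in this section.

```python
from typing import NamedTuple


class GeneModel(NamedTuple):
    model_id: str
    scaffold: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def cluster_models(models):
    """Group gene models whose spans overlap on the same scaffold (assumed rule)."""
    clusters = []
    current, current_scaffold, current_end = [], None, None
    for model in sorted(models, key=lambda m: (m.scaffold, m.start, m.end)):
        overlaps = (
            current
            and model.scaffold == current_scaffold
            and model.start <= current_end
        )
        if overlaps:
            # Extend the open cluster and its right-hand boundary.
            current.append(model)
            current_end = max(current_end, model.end)
        else:
            if current:
                clusters.append(current)
            current, current_scaffold, current_end = [model], model.scaffold, model.end
    if current:
        clusters.append(current)
    return clusters


# Hypothetical gene models on two scaffolds.
models = [
    GeneModel("a", "s1", 100, 500),
    GeneModel("b", "s1", 400, 900),
    GeneModel("c", "s1", 1000, 1200),
    GeneModel("d", "s2", 100, 500),
]
cluster_ids = [[m.model_id for m in cluster] for cluster in cluster_models(models)]
```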
Protein-level identifications obtained over three search databases.
| Gel | Method | JGI all gene models^a | JGI "best" gene models^b | DSM proteins |
| Gel01 – 12% SDS (partial), 8 bands | APS hits | 638 | 42 | 40 |
| | Mascot hits (1)^c | 461 | 38 | 36 |
| | Mascot hits (2)^d | 448 | 38 | 36 |
| Gel02 – 15% SDS (partial), 33 bands | APS hits | 1443 | 109 | 111 |
| | Mascot hits (1)^c | 1062 | 102 | 102 |
| | Mascot hits (2)^d | 1047 | 102 | 102 |
| Gel03 – 10% SDS (full), 110 bands | APS hits | 2349 | 153 | 156 |
| | Mascot hits (1)^c | 1572 | 146 | 140 |
| | Mascot hits (2)^d | 1572 | 146 | 139 |
^a Searches performed against 87,287 protein sequences translated from 87,287 predicted gene models, which cluster at 8709 loci on the genome.
^b Searches performed against the 11,200 "best" gene models, selected from the full 87,287 set on the basis of homology to known proteins. Note that some clusters have more than one "best" model and some have none, hence there are more "best" models than clusters.
^c Protein hits are reported with one or more peptide identifications with significant Mascot scores (p < 0.05).
^d Protein hits are reported with one or more peptide identifications with significant scores using the Mascot MudPIT score.
Figure 3. Gel band protein identifications and their relationship with theoretical molecular weight. The right-hand bar chart shows the number of protein identifications made from each gel slice using the average peptide scoring method when searching the DSM A. niger database. Dark green sections represent protein identifications unique to that gel band. Subsequent colours, shown in the key at the top of the plot, indicate the distance, in consecutive gel bands, over which a particular set of protein identifications is shared; for example, red shows protein identifications that are also found more than 10 bands from the current position. The left-hand bar chart shows the mean protein mass in kDa for proteins identified in each gel band. As expected, the average protein mass decreases from the top to the bottom of the gel.
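The colour coding in the caption above depends on how far from each band a protein is identified again. A minimal sketch of one possible calculation follows. It assumes the distance for a protein in a band is the largest gap, in band positions, to any other band containing the same protein, with 0 meaning the identification is unique to that band. That definition and the input format are assumptions, not the authors' stated method.

```python
def sharing_distance(bands):
    """Return {band_index: {protein: distance}}.

    bands maps a band index to the set of proteins identified in that band
    (hypothetical format). Distance is the largest gap to another band with
    the same protein; 0 means the protein was found in this band only.
    """
    seen_in = {}
    for index, proteins in bands.items():
        for protein in proteins:
            seen_in.setdefault(protein, []).append(index)
    return {
        index: {
            protein: max(abs(index - other) for other in seen_in[protein])
            for protein in proteins
        }
        for index, proteins in bands.items()
    }


# Toy example: protein B appears in bands 1, 2 and 13.
distances = sharing_distance({1: {"A", "B"}, 2: {"B"}, 13: {"B", "C"}})
```

Under this definition, protein B in band 1 would fall in the "more than 10 bands" category, while A and C would count as unique to their bands.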
Gene cluster and proteome peptide identification results.
| | Number of APS matches (to Gene clusters, Proteins, or Peptides) | i) "Best" filtered model consistent with peptide data | ii) Gene cluster does not contain a "Best" filtered model, but does have APS matches | iii) "Best" filtered model in cluster is inconsistent with peptide data |
| Gene Clusters | 214 | 201 | 9 | 4 |
| Total | 2872 | 2729 | 56 | 87 |
| Single peptide | 1791 | 1698 | 28 | 65 |
| Multi-peptide | 1081 | 1031 | 28 | 22 |
| Unique | 405 | 379 | 13 | 13 |
| (single peptide hits) | (149) | | | |
| (multi-peptide hits) | (256) | | | |
(Identified peptides validating exon/intron boundaries: 54 peptides covering 54 introns)
Of the 8709 gene model clusters generated, only 214 contained individual gene models with significant APS scores. These 214 clusters were then classified into three distinct types: (a) those where all peptide identifications were consistent with the filtered gene model selected as the best model for that particular genomic locus; (b) those where the cluster did not include a "best" filtered model but still contained models with matching peptide identifications; and (c) those where the "best" filtered model either did not match, or other models in the cluster were more parsimonious with the proteomics data, suggesting that the filtered model was not the most likely model for that particular gene locus.
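The three cluster types can be expressed as a small decision rule. The sketch below is illustrative only: it assumes each gene model is summarised by a flag saying whether it belongs to the "best" filtered set and by the set of peptides matched to it, and it treats a "best" model as consistent only if it accounts for every peptide matched anywhere in the cluster. The function name and that consistency test are assumptions, not the paper's implementation.

```python
def classify_cluster(models):
    """Classify one gene cluster as 'a', 'b' or 'c'.

    models is a list of (is_best, matched_peptides) pairs, one per gene model,
    where matched_peptides is a set of peptide sequences (hypothetical format).
    Returns None when no model in the cluster has a matched peptide.
    """
    all_peptides = set().union(*(peptides for _, peptides in models))
    if not all_peptides:
        return None
    best_models = [peptides for is_best, peptides in models if is_best]
    if not best_models:
        return "b"  # peptide evidence exists, but no "best" model in the cluster
    if any(peptides == all_peptides for peptides in best_models):
        return "a"  # a "best" model explains all peptides seen in the cluster
    return "c"      # the "best" model misses peptides that other models explain


type_a = classify_cluster([(True, {"PEPTIDEK", "ANOTHERK"}), (False, {"PEPTIDEK"})])
type_b = classify_cluster([(False, {"PEPTIDEK"})])
type_c = classify_cluster([(True, {"PEPTIDEK"}), (False, {"PEPTIDEK", "ANOTHERK"})])
```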
Gene cluster statistics where "best" filtered model is inconsistent with proteome peptide data
| Gene cluster | E-value (DSM) | DSM protein | DSM annotation | E-value (UniRef) | UniRef accession | UniRef annotation |
| 68_S6 | 2E-64 | An15g00690 | strong similarity to 14.8 kD subunit of NADH:ubiquinone reductase – Neurospora crassa | 2E-54 | Q1E404 | Hypothetical protein; n = 1; Coccidioides immitis RS; Rep: Hypothetical protein – Coccidioides immitis RS |
| 229_S11 | 5E-41 | An09g03480 | strong similarity to snRNA-associated sm-like protein Lsm2 – Saccharomyces cerevisiae | 8E-35 | Q1E7Q6 | Hypothetical protein; n = 1; Coccidioides immitis RS; Rep: Hypothetical protein – Coccidioides immitis RS |
| 626_S3 | 4E-71 | An08g03600 | similarity to hypothetical protein CAE47874.1/AfA24A6.130c – Aspergillus fumigatus | 3E-47 | Q0D146 | Predicted protein; n = 1; Aspergillus terreus NIH2624; Rep: Predicted protein – Aspergillus terreus NIH2624 |
| 25_S5 | 0 | An07g09990 | strong similarity to heat shock protein 70 hsp70 – Ajellomyces capsulatus [putative frameshift] | 0 | Q56G95 | Heat shock protein 70; n = 2; mitosporic Trichocomaceae; Rep: Heat shock protein 70 – Penicillium marneffei |
| 523_S5 | 1E-75 | An07g01640 | strong similarity to calmodulin 6 CaM6 – Arabidopsis thaliana | 7E-70 | Q4WGR4 | EF-hand protein; n = 2; Aspergillus; Rep: EF-hand protein – Aspergillus fumigatus (Sartorya fumigata) |
| 303_S2 | 1E-37 | An02g05240 | strong similarity to histone 4 from patent WO9919502-A1 – Homo sapiens | 8E-38 | UPI00005A5829 | PREDICTED: similar to germinal histone H4 gene; n = 1; Canis familiaris; Rep: PREDICTED: similar to germinal histone H4 gene – Canis familiaris |
| 117_S3 | 2E-131 | An06g00990 | strong similarity to soluble cytoplasmic fumarate reductase YEL047c – Saccharomyces cerevisiae | 2E-71 | Q0CC76 | Hypothetical protein; n = 1; Aspergillus terreus NIH2624; Rep: Hypothetical protein – Aspergillus terreus NIH2624 |
| 54_S17 | 9E-139 | An04g06870 | similarity to hypothetical protein CAD21072.1 – Neurospora crassa | 2E-121 | Q4WPR6 | Transcription factor RfeF, putative; n = 1; Aspergillus fumigatus; Rep: Transcription factor RfeF, putative – Aspergillus fumigatus (Sartorya fumigata) |
| 717_S2 | 0 | An02g11680 | strong similarity to translation initiation factor eIF-4A – Schizosaccharomyces pombe | 0 | Q5B948 | ATP-dependent RNA helicase eIF4A; n = 1; Emericella nidulans; Rep: ATP-dependent RNA helicase eIF4A – Emericella nidulans (Aspergillus nidulans) |
| 39_S9 | 5E-72 | An12g09130 | similarity to glucanase ZmGnsN3 from patent WO200073470-A2 – Zea mays | 7E-130 | Q4WCP3 | Hypothetical protein; n = 1; Aspergillus fumigatus; Rep: Hypothetical protein – Aspergillus fumigatus (Sartorya fumigata) |
| 373_S9 | No Match | | | 0 | Q2TZ90 | Ca2+binding actin-bundling protein; n = 2; Aspergillus; Rep: Ca2+binding actin-bundling protein – Aspergillus oryzae |
| 256_S9 | 3E-125 | An12g04870 | strong similarity to cytoplasmic ribosomal protein of the large subunit L10 – Saccharomyces cerevisiae | 1E-117 | Q2TZP5 | RIB40 genomic DNA, SC011; n = 2; Aspergillus; Rep: RIB40 genomic DNA, SC011 – Aspergillus oryzae |
| 238_S10 | 2E-124 | An18g04220 | strong similarity to mitochondrial ADP/ATP carrier anc1p – Schizosaccharomyces pombe | 4E-115 | Q4WJN2 | Mitochondrial ADP, ATP carrier protein (Ant), putative; n = 1; Aspergillus fumigatus; Rep: Mitochondrial ADP, ATP carrier protein (Ant), putative – Aspergillus fumigatus (Sartorya fumigata) |
Figure 4. Example gene clusters with associated peptide identifications. Three example mappings of proteomics data to clusters via the gene models are shown. The images, generated using the BioPerl Bio::Graphics module, are split into tracks, with the top ruler representing the genomic scaffold and the first track, coloured black, highlighting the region covered by the cluster. The tracks below this represent all the gene models mapped to this cluster, regardless of whether they have proteomics evidence. Yellow models are those included in the filtered "best" gene model set; models not in the filtered set are coloured green. The final, bottommost track shows the peptides as mapped to the genome, coloured blue. Each example shows how proteomics evidence, in the form of significantly matched peptides, can support or refute the filtered gene model set.
Figure 5. Example gene cluster alignments. Selected sections from the examples in Figure 4 are shown as alignments illustrating peptide mappings to the gene models. Peptide identifications are shown in shaded boxes, corresponding to those in Figure 4, with intron-spanning peptides linked by a solid line. The "best" filtered models are shown in bold text, the corresponding aligned DSM proteins in purple, and gene models consistent with the proteomics data are preceded by an asterisk.