Damian Fermin, Baxter B Allen, Thomas W Blackwell, Rajasree Menon, Marcin Adamski, Yin Xu, Peter Ulintz, Gilbert S Omenn, David J States.
Abstract
BACKGROUND: Defining the location of genes and the precise nature of gene products remains a fundamental challenge in genome annotation. Interrogating tandem mass spectrometry data using genomic sequence provides an unbiased method to identify novel translation products. A six-frame translation of the entire human genome was used as the query database to search for novel blood proteins in the data from the Human Proteome Organization Plasma Proteome Project. Because this target database is orders of magnitude larger than the databases traditionally employed in tandem mass spectra analysis, careful attention to significance testing is required. Confidence of identification is assessed using our previously described Poisson statistic, which estimates the significance of multi-peptide identifications incorporating the length of the matching sequence, number of spectra searched and size of the target sequence database.
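As a rough illustration of what a six-frame translation involves, the sketch below translates a DNA string in the three forward and three reverse-complement reading frames using the standard genetic code. It is a minimal example, not the authors' pipeline: the function names (`six_frame_translation`, `translate`, `reverse_complement`) and the frame labels (`+1` to `-3`) are assumptions made here, and a genome-scale database build would also need to split translations at stop codons and handle ambiguous bases.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna: str) -> dict:
    """Map frame labels '+1'..'+3' and '-1'..'-3' to their translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translation("ATGGCCTAA").items():
        print(label, protein)
```

Each frame yields a candidate protein sequence, so the search database grows to roughly six times the translated length of the genome, which is why the abstract stresses significance testing.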
Year: 2006 PMID: 16646984 PMCID: PMC1557991 DOI: 10.1186/gb-2006-7-4-r35
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Figure 1. Selection of candidate high confidence ORFs. The flowchart shows how high confidence ORFs were identified. The workflow starts with raw spectra analyzed by X!Tandem against our six-frame genome translation and ends with our set of high confidence ORFs and the peptides contained within them. The dashed line indicates the switch from discussion of spectra/peptides to ORFs.
Figure 2. Selection and classification of diagnostic peptides. The flowchart outlines how diagnostic peptides found in high-confidence ORFs were classified into four categories: perfect match (PM), intra-exonic (IE), overlapping exon (OE), and non-exonic (NE).
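The four categories can be illustrated with a small interval-based classifier. This is a hedged sketch built only from the category names used in this record (PM as intra-exonic in the annotated frame, IE as intra-exonic in a different reading frame, OE as overlapping an exon boundary, NE as outside all exons). The `Exon` type, the `classify_peptide` function, the inclusive coordinates and the single-strand simplification are all assumptions for illustration and do not reproduce the authors' actual procedure, which matches PM peptides against known protein products.

```python
from typing import NamedTuple, Sequence


class Exon(NamedTuple):
    """Hypothetical exon record: inclusive genomic coordinates plus reading frame."""
    start: int
    end: int
    frame: int


def classify_peptide(start: int, end: int, frame: int, exons: Sequence[Exon]) -> str:
    """Return 'PM', 'IE', 'OE' or 'NE' for a peptide mapped to [start, end]."""
    contained_same_frame = False
    contained_other_frame = False
    partial_overlap = False
    for exon in exons:
        if exon.start <= start and end <= exon.end:
            # Peptide lies entirely inside this exon.
            if frame == exon.frame:
                contained_same_frame = True
            else:
                contained_other_frame = True
        elif start <= exon.end and end >= exon.start:
            # Peptide crosses an exon boundary.
            partial_overlap = True
    if contained_same_frame:
        return "PM"
    if contained_other_frame:
        return "IE"
    if partial_overlap:
        return "OE"
    return "NE"


if __name__ == "__main__":
    exons = [Exon(100, 200, 0), Exon(300, 400, 1)]
    for peptide in [(110, 140, 0), (110, 140, 2), (190, 220, 0), (220, 250, 0)]:
        print(peptide, classify_peptide(*peptide, exons))
```

Checking containment before partial overlap means a peptide that sits fully inside one annotated exon is never reported as OE merely because it also touches a neighbouring annotation.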
Diagnostic peptides
| | Number |
| All peptide matches above hyperscore 35 | 105,065 |
| + spectra match only a single peptide | 66,711 |
| + peptide maps to a unique location in the genome | 38,906 |
| ORFs with confidence > 95%* | 427 |
| Intra-exonic perfect match (PM) | 1,682 |
| Intra-exonic different reading frame (IE) | 47 |
| Overlapping exon (OE) | 90 |
| Non-exonic (NE) | 490 |
Peptides are categorized based upon where they align in relation to the annotated start/stop boundaries of genes. *Based on Poisson statistic with correction for multiple hypothesis testing.
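To make the footnote concrete, the sketch below shows one way a Poisson tail probability with a multiple-testing correction could be turned into an ORF confidence value. It is not the authors' published statistic: the rate `n_spectra * orf_length / db_length` (each searched spectrum landing at a uniformly random position in the database), the Bonferroni correction, and the names `poisson_tail` and `orf_confidence` are all assumptions chosen only because they use the same three ingredients the abstract names (matching sequence length, number of spectra searched, database size).

```python
import math


def poisson_tail(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam), summed directly over the upper tail."""
    if k <= 0:
        return 1.0
    if lam <= 0.0:
        return 0.0
    # First tail term P(X = k), computed in log space to avoid overflow.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    while term > 0.0 and (total == 0.0 or term > total * 1e-17):
        total += term
        i += 1
        term *= lam / i  # P(X = i) from P(X = i - 1)
    return min(1.0, total)


def orf_confidence(n_peptides: int, orf_length: int, n_spectra: int,
                   db_length: float, n_orfs_tested: int) -> float:
    """Confidence = 1 - Bonferroni-corrected chance of >= n_peptides random hits."""
    # ASSUMPTION: expected random hits in this ORF under a uniform null model.
    lam = n_spectra * orf_length / db_length
    p_value = poisson_tail(n_peptides, lam)
    p_corrected = min(1.0, p_value * n_orfs_tested)
    return 1.0 - p_corrected


if __name__ == "__main__":
    # Illustrative numbers only; none of these values come from the study.
    for k in (1, 2, 3):
        print(k, orf_confidence(k, 300, 10_000, 6e9, 1_000))
```

Under this simplified model a single peptide hit rarely survives the correction, whereas several independent peptides in the same ORF quickly push the confidence above a 95% threshold.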
A representative set of peptide-containing genes
| HUGO gene ID | Ensembl gene ID | PT | PM | IE | OE | NE | Gene description |
| - | ENSG00000198209 | 28 | 26 | 0 | 2 | 0 | Complement component 4B preproprotein |
| A1BG | ENSG00000121410 | 19 | 19 | 0 | 0 | 0 | Alpha-1B-glycoprotein precursor (Alpha-1-B glycoprotein) |
| A2M | ENSG00000175899 | 47 | 46 | 0 | 0 | 1 | Alpha-2-macroglobulin precursor (Alpha-2-M) |
| AFM | ENSG00000079557 | 12 | 11 | 0 | 0 | 1 | Afamin precursor (Alpha-albumin; Alpha-Alb) |
| AGT | ENSG00000135744 | 21 | 21 | 0 | 0 | 0 | Angiotensinogen precursor (contains angiotensin I (Ang I); angiotensin II (Ang II); angiotensin III (Ang III) (Des-Asp[1]-angiotensin II)). |
| AHSG | ENSG00000145192 | 24 | 24 | 0 | 0 | 0 | Alpha-2-HS-glycoprotein precursor (Fetuin-A; Alpha-2-Z-globulin; Ba-alpha-2-glycoprotein) |
| ALB | ENSG00000163631 | 111 | 108 | 0 | 3 | 0 | Serum albumin precursor |
| ANKRD24 | ENSG00000089847 | 7 | 0 | 5 | 2 | 0 | F20887_1, partial CDS (fragment) |
| APC2 | ENSG00000115266 | 9 | 0 | 5 | 0 | 4 | Adenomatosis polyposis coli 2 |
| APCS | ENSG00000132703 | 11 | 11 | 0 | 0 | 0 | Serum amyloid P-component precursor (SAP; 9.5S alpha-1-glycoprotein; contains serum amyloid P-component(1-203)) |
| APOA1 | ENSG00000118137 | 53 | 52 | 0 | 1 | 0 | Apolipoprotein A-I precursor (Apo-AI; ApoA-I; contains apolipoprotein A-I(1-242)) |
| APOA2 | ENSG00000158874 | 17 | 15 | 0 | 2 | 0 | Apolipoprotein A-II precursor (Apo-AII; ApoA-II; contains apolipoprotein A-II(1-76)) |
| APOB | ENSG00000084674 | 112 | 110 | 0 | 2 | 0 | Apolipoprotein B-100 precursor (Apo B-100; contains apolipoprotein B-48 (Apo B-48)) |
| APOC3 | ENSG00000110245 | 4 | 4 | 0 | 0 | 0 | Apolipoprotein C-III precursor (Apo-CIII; ApoC-III) |
| APOE | ENSG00000130203 | 13 | 13 | 0 | 0 | 0 | Apolipoprotein E precursor (Apo-E) |
| APOF | ENSG00000175336 | 4 | 4 | 0 | 0 | 0 | Apolipoprotein F precursor (Apo-F) |
| APOH | ENSG00000091583 | 15 | 15 | 0 | 0 | 0 | Beta-2-glycoprotein I precursor (apolipoprotein H; Apo-H; B2GPI; Beta(2)GPI; activated protein C-binding protein; APC inhibitor; anticardiolipin cofactor) |
| APOL1 | ENSG00000100342 | 5 | 5 | 0 | 0 | 0 | Apolipoprotein-L1 precursor (apolipoprotein L-I; apolipoprotein L; ApoL-I; Apo-L; ApoL) |
| AZGP1 | ENSG00000160862 | 9 | 9 | 0 | 0 | 0 | Zinc-alpha-2-glycoprotein precursor (Zn-alpha-2-glycoprotein; Zn-alpha-2-GP) |
| AZI1 | ENSG00000141577 | 3 | 0 | 0 | 3 | 0 | 5-azacytidine induced 1 isoform a |
| BF | ENSG00000166285 | 9 | 7 | 0 | 2 | 0 | Complement factor B precursor (EC 3.4.21.47; C3/C5 convertase; properdin factor B; glycine-rich beta glycoprotein; GBG; PBF2) |
A breakdown of the distribution of diagnostic peptides among the 128 parent genes they occur in. HUGO gene ID, HUGO gene identifier; Ensembl gene ID, the Ensembl identifier for the gene containing the diagnostic peptides; PT, the total number of diagnostic peptides found within the coding boundaries of this gene; PM, number of perfect-matching peptides to a protein product of this gene; IE, number of intra-exonic peptides associated with this gene; OE, number of exon overlapping peptides associated with this gene; NE, number of non-exonic peptides associated with this gene; Gene description, the name given to the gene according to the Ensembl Genome Browser database. A complete list is available in Additional data file 1.
EST library matches to diagnostic peptides
| | PM | IE | NE | OE | Total |
| EST + | 615 (72%) | 24 (62%) | 36 (17%) | 36 (65%) | 711 |
| EST - | 241 (28%) | 15 (38%) | 216 (83%) | 19 (35%) | 491 |
| Total | 856 | 39 | 252 | 55 | 1,202 |
A breakdown of EST hits to peptides in each of the four categories. EST +, number of peptides in each category that had at least one EST hit. EST -, number of peptides in each category that did not match an EST. Percentages of each category total are given in parentheses. Totals are given in the final column and row. Only the longest representative peptide for a set of overlapping peptides was used in this analysis. PM, perfect matching peptide; IE, intra-exonic peptide; NE, non-exonic peptide; OE, overlapping exon peptide.
Representative distribution of the ESTs across diagnostic peptides
| HUGO gene ID | Ensembl gene ID | PT | ALL | PM | IE | OE | NE | Gene description |
| - | ENSG00000198209 | 17 | 17 | 15 | 0 | 2 | 0 | Complement component 4B preproprotein |
| A1BG | ENSG00000121410 | 10 | 10 | 10 | 0 | 0 | 0 | Alpha-1B-glycoprotein precursor (alpha-1-B glycoprotein) |
| A2M | ENSG00000175899 | 20 | 20 | 19 | 0 | 0 | 1 | Alpha-2-macroglobulin precursor (alpha-2-M) |
| AFM | ENSG00000079557 | 4 | 3 | 3 | 0 | 0 | 0 | Afamin precursor (alpha-albumin; alpha-Alb) |
| AGT | ENSG00000135744 | 13 | 13 | 13 | 0 | 0 | 0 | Angiotensinogen precursor (contains angiotensin I (Ang I); angiotensin II (Ang II); angiotensin III (Ang III) (Des-Asp[1]-angiotensin II)). |
| AHSG | ENSG00000145192 | 9 | 9 | 9 | 0 | 0 | 0 | Alpha-2-HS-glycoprotein precursor (fetuin-A; alpha-2-Z-globulin; Ba-alpha-2-glycoprotein) |
| ALB | ENSG00000163631 | 30 | 30 | 30 | 0 | 0 | 0 | Serum albumin precursor |
| ANKRD24 | ENSG00000089847 | 3 | 3 | 0 | 2 | 1 | 0 | F20887_1, partial CDS (fragment) |
| APC2 | ENSG00000115266 | 9 | 6 | 0 | 3 | 0 | 3 | Adenomatosis polyposis coli 2 |
| APCS | ENSG00000132703 | 7 | 7 | 7 | 0 | 0 | 0 | Serum amyloid P-component precursor (SAP; 9.5S alpha-1-glycoprotein; contains serum amyloid P-component(1-203)) |
| APOA1 | ENSG00000118137 | 18 | 18 | 18 | 0 | 0 | 0 | Apolipoprotein A-I precursor (Apo-AI; ApoA-I; contains apolipoprotein A-I(1-242)) |
| APOA2 | ENSG00000158874 | 5 | 5 | 4 | 0 | 1 | 0 | Apolipoprotein A-II precursor (Apo-AII; ApoA-II; contains apolipoprotein A-II(1-76)) |
| APOB | ENSG00000084674 | 95 | 95 | 94 | 0 | 1 | 0 | Apolipoprotein B-100 precursor (Apo B-100; contains apolipoprotein B-48 (Apo B-48)) |
| APOC3 | ENSG00000110245 | 2 | 2 | 2 | 0 | 0 | 0 | Apolipoprotein C-III precursor (Apo-CIII; ApoC-III) |
| APOE | ENSG00000130203 | 10 | 10 | 10 | 0 | 0 | 0 | Apolipoprotein E precursor (Apo-E) |
| APOF | ENSG00000175336 | 4 | 4 | 4 | 0 | 0 | 0 | Apolipoprotein F precursor (Apo-F) |
| APOH | ENSG00000091583 | 6 | 6 | 6 | 0 | 0 | 0 | Beta-2-glycoprotein I precursor (apolipoprotein H; Apo-H; B2GPI; Beta(2)GPI; activated protein C-binding protein; APC inhibitor; anticardiolipin cofactor) |
| APOL1 | ENSG00000100342 | 5 | 5 | 5 | 0 | 0 | 0 | Apolipoprotein-L1 precursor (apolipoprotein L-I; apolipoprotein L; ApoL-I; Apo-L; ApoL) |
| AZGP1 | ENSG00000160862 | 6 | 6 | 6 | 0 | 0 | 0 | Zinc-alpha-2-glycoprotein precursor (Zn-alpha-2-glycoprotein; Zn-alpha-2-GP) |
| AZI1 | ENSG00000141577 | 2 | 2 | 0 | 0 | 2 | 0 | 5-azacytidine induced 1 isoform a |
| BF | ENSG00000166285 | 5 | 5 | 4 | 0 | 1 | 0 | Complement factor B precursor (EC 3.4.21.47; C3/C5 convertase; properdin factor B; glycine-rich beta glycoprotein; GBG; PBF2) |
A representative sampling of the total number of ESTs matched to diagnostic peptides as well as the parent gene that contains the peptide. PT, total number of non-redundant (NR) peptides associated with this gene; ALL, number of peptides with EST hits; PM, number of PM peptides with EST hits; IE, number of IE peptides with EST hits; OE, number of OE peptides with EST hits; NE, number of NE peptides with EST hits. A complete list is given in Additional data file 2.
Features of proteins from genes with novel coding regions
| HUGO gene ID | Ensembl gene ID | AAs in domain | Domain ID | Domain name | Gene name |
| PLG | ENSG00000122194 | 23 | P00747 | Kringle | Plasminogen precursor |
| BF | ENSG00000166285 | 28 | P00751 | Peptidase S1, trypsin | Complement factor B precursor |
| APOB | ENSG00000084674 | 21 | Q13787 | Vitellogenin | Apolipoprotein B-100 precursor |
| C4BPA | ENSG00000123838 | 29 | P04003 | Sushi | C4b-binding protein alpha chain precursor |
| HPX | ENSG00000110169 | 15 | P02790 | Hemopexin-like | Hemopexin precursor |
| GC | ENSG00000145321 | 17 | P02774 | Albumin | Vitamin D-binding protein precursor |
| PLEKHA4 | ENSG00000105559 | 7 | PS50003 | PH_DOMAIN | Pleckstrin homology domain-containing protein family A member-4 |
| IGLC1, IGLC2, IGLC3, IGLV1-40, IGLV3-25, IGLV4-3 | ENSG00000100208 | 12 | PS50835 | IG-LIKE | Ig lambda chain C region |
| IGHA1, IGHG3, IGHM | ENSG00000130076 | 11 | PS50835 | IG-LIKE | Ig alpha-1 chain C region |
| - | ENSG00000142082 | 51 | PS50305 | SIRTUIN | NAD-dependent deacetylase sirtuin-3 mitochondrial precursor |
| TF | ENSG00000091513 | 11 | PS00207 | TRANSFERRIN | Serotransferrin precursor |
A list of the protein domains that the novel OE peptides overlapped. HUGO gene ID, HUGO gene identifiers; Ensembl gene ID, Ensembl gene identifier; AAs in domain, number of amino acids from the peptide that are part of the domain; Domain ID, the Uniprot or Prosite identifier for the domain (Prosite identifiers begin with the letters 'PS'); Domain name, the common name assigned to the domain in either Uniprot or Prosite; Gene name, the name of the gene containing the domain.
Figure 3. Receiver operating characteristic (ROC) curve for X!Tandem hyperscores. The ROC curve was used to select the hyperscore cut-off value for candidate peptides. Numbers represent the first instance of the hyperscore values 24, 30, 35, 40, and 45 as they occur among the data points.
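For readers unfamiliar with how such a curve supports a cut-off choice, the following sketch computes ROC points (false positive rate, true positive rate) at a set of hyperscore thresholds. It assumes each peptide match carries a known true/false label; how those labels were obtained in the study is not stated in this record, so the `roc_points` function, the label source and the example data are purely hypothetical.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[bool],
               thresholds: Sequence[float]) -> List[Tuple[float, float, float]]:
    """Return (threshold, false_positive_rate, true_positive_rate) per threshold.

    A match is called positive when its score is >= the threshold.
    """
    positives = sum(1 for is_true in labels if is_true)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one true and one false match")
    points = []
    for t in thresholds:
        tp = sum(1 for s, is_true in zip(scores, labels) if is_true and s >= t)
        fp = sum(1 for s, is_true in zip(scores, labels) if not is_true and s >= t)
        points.append((t, fp / negatives, tp / positives))
    return points


if __name__ == "__main__":
    # Made-up hyperscores and labels, for illustration only.
    scores = [20, 28, 33, 37, 42, 50]
    labels = [False, False, True, False, True, True]
    for t, fpr, tpr in roc_points(scores, labels, [24, 30, 35, 40, 45]):
        print(f"hyperscore >= {t}: FPR={fpr:.2f} TPR={tpr:.2f}")
```

Raising the threshold moves along the curve toward fewer false positives at the cost of fewer true positives, which is the trade-off a cut-off such as the hyperscore of 35 used in the diagnostic peptide table has to balance.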