Fahim Mohammad, Robert M Flight, Benjamin J Harrison, Jeffrey C Petruska, Eric C Rouchka.
Abstract
BACKGROUND: High-throughput molecular biology techniques yield vast amounts of data, often by detecting small portions of ribonucleotides corresponding to specific identifiers. Existing bioinformatic methodologies categorize and compare these elements using descriptive annotation inferred from this sequence information, even though the detected portion may not be representative of the identifier as a whole.
Year: 2012 PMID: 22967011 PMCID: PMC3554462 DOI: 10.1186/1471-2105-13-229
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Granularity of annotations.
Feature comparison of different conversion tools (as of April 2012)
| probes, genes, prots. | batch | select one | select one | S, F | custom generated | html, txt | NA | web, API, EASE | Sep, 2009 |
| probes, genes, prots. | batch | select one | select multiple | S, F | custom generated | html, txt | 11 org. | web | Sep, 2009 |
| genes, prots. and probes | batch | select one | select one | S, F | Ensembl | html, txt, xls | H, M, R, O | web | Jun, 2011 |
| genes, prots. and bio. molecules | batch | select one | select one | S, F | corr. files | html, txt | H, M, O | web | current |
| probes, genes and prots. | batch | select one | select one | S | Peg/custom generated | html, xls | H, M, R, O | web, API | May, 2011 |
| genes and prots. | batch | select one | select multiple | S, F | Ensembl | html, txt, xls | H, M, R | web | Apr, 2008 |
| probes, genes, trans. | batch | NA | select multiple | S, F | MADGene link | html, xls | H, M, R, O (17 org.) | web, open source | Aug, 2009 |
| Affy expression & exon arrays | single | probes | genes, trans. | S | Ensembl | html | H, M, R | web | Mar, 2010 |
| genes, prots., probes, other | batch | select one | select one | S, F | UniGene, LocusLink | html, txt | H, M, R, O | web | CND |
| Affy expression arrays | single | Affy, Hugo, Ens. | Affy, Hugo, Ens. | S | RefSeq | html | H | web | May, 2009 |
| probes, cDNA, EST, gene, prots. | batch | select one | select one | S | UniGene, Homologene | html | H, M, R | web | 2006 |
| genes and prots. | batch | genes or prots. | prots. or gene | S, F | UniProt ID | html | NA | web, API | Jul, 2011 |
| Affy, UniGene clusters, Acc num | batch | select one | select one | S, F | RefSeq, Entrez | html, email | H, M, R, O (58 org.) | web | May, 2009 |
| Affy, genes, prots. | batch | select one | select multiple | S | custom generated | html, txt | Not Available | Not Available | CND |
| Affy, genes | batch | select one | choose from | S | custom generated | email (txt, xls) | H, M | web | Sep, 2006 |
| genes and prots. | batch | select one | NA | S, F | corr. files | html | 5 org. | web | Apr, 2007 |
| genes, prots., probes, other | NA | select one | select multiple | S, F | NA | html, txt, xls | H, M, R, O | web, API | depends on DB |
| probes, genes, prots., metabolites | NA | NA | NA | NA | NA | S, F | Ensembl, other | NA | 36 org. | open source | May, 2011 |
| genes, trans., seqs., probes | batch | select one | select multiple | S | Genomic Sequence | html, txt | H, M, R, O (53 org.) | web | Dec, 2011 |
Abbreviations: Annot. View: Custom Annotation view, Annot.: Annotation (S: Structural annotation, F: Functional annotation), org.: Organisms (H: Human, M: Mouse, R: Rat, O: others), prots: proteins, Affy: Affymetrix®, trans: transcripts, seqs: sequences, Ens.: Ensembl, corr: correspondence, acc: accession, bio: biological, NA: Not Applicable, CND: Could not determine, ⇓: download Knowledgebase, VM: Virtual Machine.
ID converter tools, data sources and availability
| DAVID | GenBank, RefSeq, KEGG, OMIM, UniGene | |
| Babelomics | GO, KEGG, Ensembl and others | |
| g:Convert | GO, KEGG, Ensembl, TRANSFAC, Reactome | |
| HMS and IC | Ensembl, GO, KEGG and others | |
| Synergizer | Ensembl, NCBI, RGD, SGD, KEGG, WormBase and EcoCyc | |
| Clone/Gene ID Converter | Ensembl, NCBI, PubMed, UCSC, KEGG, Reactome | |
| MADGene | GEO, UniGene, Entrez and others | |
| GATExplorer | Ensembl, Affymetrix® | |
| NetAffx™ | NCBI, GO, KEGG and others | |
| PLANdbAffy | Affymetrix®, UCSC, NCBI | |
| probeMatchDB | UniGene, HomoloGene | |
| Uniprot | GenBank, RefSeq, GO and others | |
| Onto-Translate | Ensembl, GO, KEGG and others | |
| AliasServer | Ensembl, EMBL, NCBI, SGD and others | |
| MatchMiner | Affymetrix®, UCSC, UniGene, Entrez, OMIM | |
| GeneMerge | GO, KEGG | |
| BioMart | NCBI, GO, KEGG and others | |
| BridgeDb | Ensembl and others | |
| AbsIDconvert | UCSC, NCBI, Ensembl, Agilent, Affymetrix® and others |
Figure 2. ID conversion: a two-step process. Step 1: ID A is converted into ID C. Step 2: ID C is converted into ID B.
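The two-step conversion in Figure 2 can be illustrated with a minimal sketch that chains two lookup tables through an intermediate identifier. The tables `A_TO_C` and `C_TO_B`, the function `convert_two_step`, and every identifier below are hypothetical placeholders chosen for illustration; they are not taken from any of the tools compared here.

```python
# Hypothetical mapping tables: ID type A -> intermediate type C -> target type B.
# All identifiers are made-up placeholders, not real database entries.
A_TO_C = {"probe_1": ["tx_1", "tx_2"], "probe_2": ["tx_3"]}
C_TO_B = {"tx_1": ["gene_X"], "tx_2": ["gene_X"], "tx_3": ["gene_Y", "gene_Z"]}


def convert_two_step(ids, first, second):
    """Convert each input ID to the target type via the intermediate type."""
    result = {}
    for source_id in ids:
        targets = set()
        # Step 1: look up the intermediate identifiers (empty if unmapped).
        for intermediate in first.get(source_id, []):
            # Step 2: look up the target identifiers for each intermediate.
            targets.update(second.get(intermediate, []))
        result[source_id] = sorted(targets)
    return result


converted = convert_two_step(["probe_1", "probe_2", "probe_3"], A_TO_C, C_TO_B)
```

In this sketch an input that is missing from the first table (`probe_3`) simply maps to an empty list, which makes unconverted identifiers easy to count afterwards.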
Figure 3. AbsIDconvert technique. Absolute ID conversion is a two-step process whereby probe A can be converted to identifiers at the transcript level by first converting the probe to its genomic coordinates (step 1) and then determining the transcripts that overlap those coordinate positions (step 2).
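A rough sketch of the coordinate-based idea in Figure 3 follows. It assumes 1-based inclusive coordinates and ignores strand, and it scans all transcripts linearly for clarity. The `Region` type, the `PROBE_COORDS` and `TRANSCRIPTS` tables, and all coordinates are hypothetical illustrations rather than AbsIDconvert's actual data or code.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive (assumed convention)
    end: int    # inclusive


# Hypothetical probe and transcript coordinates, for illustration only.
PROBE_COORDS = {"probe_A": Region("chr1", 1050, 1074)}
TRANSCRIPTS = {
    "tx_1": Region("chr1", 1000, 2000),
    "tx_2": Region("chr1", 1060, 1500),
    "tx_3": Region("chr1", 3000, 4000),
    "tx_4": Region("chr2", 1000, 2000),
}


def overlaps(a: Region, b: Region) -> bool:
    """Two closed intervals overlap when each starts before the other ends."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def probe_to_transcripts(probe_id: str) -> list:
    # Step 1: convert the probe to its genomic coordinates.
    region = PROBE_COORDS.get(probe_id)
    if region is None:
        return []
    # Step 2: collect every transcript overlapping those coordinates.
    return sorted(t for t, r in TRANSCRIPTS.items() if overlaps(region, r))


matches = probe_to_transcripts("probe_A")
```

Because the lookup goes through coordinates rather than through a shared identifier, the same second step works for any source type that can be placed on the genome.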
Figure 4. Steps involved in the construction of AbsIDconvert.
Figure 5. Example of interval overlaps. The reference region is 10 bases in length, with database annotations s1–s4. Queries q1–q4 are used to obtain the corresponding annotations.
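The kind of overlap query shown in Figure 5 can be sketched as below. The coordinates assigned to s1–s4 and q1–q4 are assumptions, since the figure's actual positions are not reproduced in the text. The lookup uses a start-sorted list with binary search as a simpler stand-in for an interval tree, so it should not be read as the tool's own data structure.

```python
from bisect import bisect_right

# Hypothetical 1-based inclusive coordinates on a 10-base reference region.
ANNOTATIONS = [("s1", 1, 3), ("s2", 2, 6), ("s3", 5, 8), ("s4", 9, 10)]
QUERIES = {"q1": (1, 1), "q2": (4, 5), "q3": (7, 9), "q4": (10, 10)}

# Sort annotations by start once so each query can skip late-starting ones.
_SORTED = sorted(ANNOTATIONS, key=lambda a: a[1])
_STARTS = [a[1] for a in _SORTED]


def find_overlaps(q_start, q_end):
    """Return names of annotations overlapping the closed interval [q_start, q_end]."""
    # Only annotations that start at or before q_end can overlap the query.
    upper = bisect_right(_STARTS, q_end)
    # Of those, keep the ones that end at or after q_start.
    return [name for name, _start, end in _SORTED[:upper] if end >= q_start]


results = {q: find_overlaps(*span) for q, span in QUERIES.items()}
```

The binary search prunes annotations that start after the query ends, but the remaining candidates are still scanned linearly; a true interval tree would also prune on the end coordinate.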
Figure 6. Runtime comparison between MySQL and the interval-tree approach while converting EST IDs into Entrez Gene IDs.
Run time (sec.) to convert 1,000 IDs from one type to another using web-based AbsIDconvert
| | ||||||||
|---|---|---|---|---|---|---|---|---|
| Affymetrix Rat230_2 | 5.6 | 3.2 | 4.1 | 7.6 | 3.2 | 3.3 | 33 | 47.6 |
| Agilent Cgh105a | 5.1 | 3.9 | 2.5 | 2.7 | 2.92 | 3.05 | 31.3 | 55.6 |
| RefSeq | 4.5 | 3.1 | 3.6 | 3.6 | 2.3 | 2.2 | 31.9 | 34.5 |
| Ensembl transcript | 2.9 | 3.8 | 3.1 | 4 | 2.47 | 3.02 | 34.6 | 47.1 |
| Entrez gene | 2.7 | 2.9 | 2.8 | 3 | 7.5 | 7.1 | 18.4 | 35.3 |
| Gene symbol | 2.9 | 2.8 | 2.7 | 2.9 | 8.5 | 7.5 | 16.6 | 38.2 |
| EST sequences | 18.6 | 17.6 | 31 | 30.3 | 28.3 | 29.3 | 64.1 | 73.7 |
Figure 7. Runtime comparison of AbsIDconvert with other conversion tools. (a) Tools give comparable run times with a small input (i.e., ≤5,000 identifiers). (b) The number of inputs was gradually increased to 100,000 and run times for each tool were determined. Most of the tools could not produce a result for large inputs; only MADGene and AbsIDconvert did. Note that DAVID limits user input to 30,000 identifiers.
Entrez ID to gene symbol conversion accuracy
| AbsIDconvert | 885 | 866 | 19 | 109 | 6 | 88.82 | 76.00 | 2.15 | 93.12 | |
| DAVID | 853 | 790 | 63 | 146 | 1 | 84.40 | 98.44 | 7.39 | 88.32 | |
| MADGene | 854 | 730 | 124 | 145 | 1 | 83.43 | 99.20 | 14.52 | 84.44 | |
| HMS & IC | 724 | 723 | 1 | 270 | 6 | 72.81 | 14.29 | 0.14 | 84.22 | |
| Onto-Translate | 823 | 722 | 101 | 176 | 1 | 80.40 | 99.02 | 12.27 | 83.90 | |
| MatchMiner | 539 | 457 | 82 | 458 | 3 | 49.95 | 96.47 | 15.21 | 62.86 | |
| Clone/Gene ID converter | 537 | 441 | 96 | 457 | 6 | 49.11 | 94.12 | 17.88 | 61.46 | |
| g:Convert | 445 | 433 | 12 | 549 | 6 | 44.09 | 66.67 | 2.70 | 60.69 | |
| Synergizer | 445 | 433 | 12 | 549 | 6 | 44.09 | 66.67 | 2.70 | 60.69 | |
| Babelomics | 486 | 421 | 65 | 508 | 6 | 45.32 | 91.55 | 13.37 | 59.51 |
Figure 8. Venn diagram showing conversion results for the top-performing conversion tools. (a) Entrez IDs converted to official gene symbols. (b) Entrez IDs converted to RefSeq IDs. (c) Entrez IDs converted to RefSeq IDs using cumulative bootstrap. (d) Affymetrix® HG_U133Plus2.0 probesets converted to Agilent Cgh44b probes.
Entrez ID to RefSeq ID conversion accuracy
| AbsIDconvert | 586 | 362 | 224 | 20 | 394 | 94.76 | 36.25 | 38.23 | 74.79 | |
| MADGene | 551 | 335 | 216 | 49 | 400 | 87.24 | 35.06 | 39.20 | 71.66 | |
| Onto-Translate | 501 | 291 | 210 | 99 | 400 | 74.62 | 34.43 | 41.92 | 65.32 | |
| DAVID | 549 | 311 | 238 | 72 | 379 | 81.20 | 38.57 | 43.35 | 66.74 | |
| Synergizer | 482 | 278 | 204 | 121 | 397 | 69.67 | 33.94 | 42.32 | 63.11 | |
| g:Convert | 482 | 278 | 204 | 121 | 397 | 69.67 | 33.94 | 42.32 | 63.11 | |
| MatchMiner | 474 | 268 | 206 | 126 | 400 | 68.02 | 33.99 | 43.46 | 61.75 | |
| Babelomics | 501 | 267 | 234 | 128 | 371 | 67.59 | 38.68 | 46.71 | 59.60 | |
| Clone/Gene ID converter | 421 | 219 | 202 | 195 | 384 | 52.90 | 34.47 | 47.98 | 52.46 | |
| HMS & IC | 461 | 227 | 430 | 181 | 162 | 55.64 | 72.64 | 65.45 | 42.63 |
Figure 9. Case study 1 - Comparative genomics study using AbsIDconvert. (a) Venn diagram showing the number of gene fragments from P. falciparum (PF) and P. vivax (PV) that overlap with at least one gene from Anopheles gambiae and Homo sapiens. (b) Corresponding genes in Anopheles gambiae (AnoGam2) and Homo sapiens (hg19) that were mapped by gene fragments from P. falciparum and P. vivax. Only genes whose sequence exactly matched the gene fragments were considered.
Significantly enriched (p-value < 0.001, number of genes ≥ 2) Gene Ontology biological processes for the P. falciparum (pFal) and P. vivax (pViv) genes
| GO:0048639 | positive regulation of developmental growth | pFal | 0.00023 | 0.078421 |
| GO:0051865 | protein autoubiquitination | pFal | 0.000611 | 0.310842 |
| GO:0007417 | central nervous system development | pFal | 0.000749 | 0.052751 |
| GO:0010559 | regulation of glycoprotein biosynthetic process | pFal | 0.000534 | 0.189699 |
| GO:0043062 | extracellular structure organization | pFal | 0.000896 | 0.056366 |
| GO:0031290 | retinal ganglion cell axon guidance | pFal | 0.000729 | 0.020543 |
| GO:0050772 | positive regulation of axonogenesis | pFal | 0.000671 | 0.108078 |
| GO:0007268 | synaptic transmission | pFal | 9.63E-005 | 0.004437 |
| GO:0007156 | homophilic cell adhesion | pFal | 2.90E-005 | 0.00181 |
| GO:0048745 | smooth muscle tissue development | pFal | 0.00097 | 0.215514 |
| GO:0008038 | neuron recognition | pFal,pViv | 0.000611 | 2.71E-005 |
| GO:0071702 | organic substance transport | pViv | 0.358064 | 0.000932 |
| GO:0010827 | regulation of glucose transport | pViv | 0.15634 | 0.000705 |
| GO:0016337 | cell-cell adhesion | pViv | 0.002316 | 0.000615 |
| GO:0045725 | positive regulation of glycogen biosynthetic process | pViv | 0.316458 | 0.000806 |
| GO:0008037 | cell recognition | pViv | 0.041274 | 0.000425 |
| GO:0010907 | positive regulation of glucose metabolic process | pViv | 0.486254 | 0.000312 |
| GO:0045913 | positive regulation of carbohydrate metabolic process | pViv | 0.561654 | 0.000731 |
| GO:0010676 | positive regulation of cellular carbohydrate metabolic process | pViv | 0.561654 | 0.000731 |
| GO:0030036 | actin cytoskeleton organization | pViv | 0.133792 | 8.55E-005 |
| GO:0030029 | actin filament-based process | pViv | 0.099308 | 2.74E-005 |
Figure 10. Case study 2 - Reinterpretation of prior datasets using AbsIDconvert. (a) Number of Incyte IDs (from a total of 8,392) mapping to the human, mouse and rat genomes within 5% of the maximum alignment score. (b) Incyte IDs with at least one Entrez ID found using AbsIDconvert.
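The "within 5% of the maximum alignment score" criterion can be sketched as a simple filter. The sketch assumes the criterion means a score of at least 95% of the best score for that ID, which is one plausible reading of the caption. The function name `hits_near_best` and the scores below are hypothetical.

```python
def hits_near_best(hits, tolerance=0.05):
    """Keep (genome, score) hits scoring within `tolerance` of the best hit.

    Assumes "within 5%" means score >= (1 - tolerance) * best score.
    """
    if not hits:
        return []
    best = max(score for _genome, score in hits)
    cutoff = best * (1 - tolerance)
    return [(genome, score) for genome, score in hits if score >= cutoff]


# Hypothetical alignment scores for a single Incyte ID against three genomes.
example_hits = [("human", 980), ("mouse", 940), ("rat", 700)]
retained = hits_near_best(example_hits)
```

With these made-up scores the cutoff is about 931, so the human and mouse alignments are kept and the rat alignment is dropped.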
Comparison of Homologene and sequence based homologs
| Human | 7095 | 4155 | 3854 | – | 3648 (88%) | 3401 (82%) | – | 1002 (24%) | 806 (19%) |
| Mouse | 3368 | 2081 | 1872 | 1794 (86%) | – | 1715 (82%) | 1002 (48%) | – | 1064 (51%) |
| Rat | 2776 | 1438 | 1263 | 1210 (84%) | 1222 (85%) | – | 806 (56%) | 1064 (74%) | – |
mapped†: Number of probes mapped to genome; Entrez‡: Mapped probes with Entrez ID; Homol§: Probes with Entrez ID as well as Homologene ID; Hom: Homologene-based homologs; Seq: Sequence-based homologs determined using AbsIDconvert.
Figure 11. Case study 3 - Cross-platform meta-analysis study using AbsIDconvert. Ensembl transcripts mapped by Agilent Cgh 105a (purple) and Affymetrix® HG_U133Plus2.0 (green) probes.