Igor A Sidorov, Denis A Reshetov, Alexander E Gorbalenya.
Abstract
BACKGROUND: A growing diversity of biological data is tagged with unique identifiers (UIDs) associated with polynucleotides and proteins to ensure efficient computer-mediated data storage, maintenance, and processing. These identifiers, which are not informative for most people, are often substituted by biologically meaningful names in various presentations to facilitate the utilization and dissemination of sequence-based knowledge. This substitution is commonly done manually, which can be a tedious exercise prone to mistakes and omissions.
Year: 2009 PMID: 19682364 PMCID: PMC2739203 DOI: 10.1186/1471-2105-10-251
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Unique and non-unique identifiers in databases used by SNAD
| Database | Identifier | Unique | Examples |
| GenBank (1) | gi number, primary ID | + | |
| | version number = accession.version | + (9) | |
| | accession number | - | |
| | locus/sequence name | - | MTVCG, SCU49845 |
| UniProt (2) | accession number | + | Q10AA9, P47123, A2BC19 |
| | entry name | - | INSR_HUMAN |
| EMBL (3,4,5) | accession number | + | |
| | entry name | - | HSINSR, BUM |
| SeqHound (6,7) | gi number | + | 1234567 |
| | accession number | - | AY123456, NC_006558, NP_000483 |
| | locus/sequence name | - | BTACHRE |
| EnsEMBL | gene ID, transcript ID | + | ENSG00000133103, ENST00000222982 |
1)
2)
3) [28]
4)
5)
6) [29]
7)
8) Gene and Transcript part of EnsEMBL,
9) Some replaced or removed records in GenBank have no accession number but they can be retrieved from GenBank with the primary ID.
Figure 1. SNAD dataflow. The user submits sequence UIDs and defines the conversion format by choosing a template. SNAD analyzes the input, extracts the UIDs from it, and queries a user-defined database to locate the cognate entries. It then extracts annotation from the entries and parses the user-selected template into separate parts. These parts are mapped onto the annotation to produce name parts, which are combined into names according to the template. Finally, SNAD substitutes the submitted UIDs in the list, alignment, or tree with the designed names and returns the results to the user.
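The dataflow in Figure 1 can be illustrated with a minimal Python sketch. It is not SNAD's implementation (the footnote to the cross-database table mentions BioPerl drivers). The in-memory `FAKE_DB`, the digit-run UID pattern, and the `{product} [{organism}]` template syntax are all assumptions made for illustration, and the annotation values are invented placeholders.

```python
import re

# Hypothetical stand-in for a sequence database lookup. The UIDs and the
# annotation values below are invented placeholders, not real records.
FAKE_DB = {
    "1234567": {"organism": "Genus alpha", "product": "protein A"},
    "7654321": {"organism": "Genus beta", "product": "protein B"},
}

# Assumption: UIDs are gi-like runs of five or more digits that are not
# part of a longer word or a decimal number (e.g. a branch length).
UID_PATTERN = re.compile(r"(?<![\w.])\d{5,}(?![\w.])")


def extract_uids(text):
    """Return the distinct UIDs found in the input, in order of appearance."""
    seen = []
    for uid in UID_PATTERN.findall(text):
        if uid not in seen:
            seen.append(uid)
    return seen


def substitute_uids(text, db, template):
    """Replace every UID that has an entry in db with a templated name."""
    def replace(match):
        annotation = db.get(match.group(0))
        if annotation is None:
            # No cognate entry was found: leave the UID untouched.
            return match.group(0)
        return template.format(**annotation)

    return UID_PATTERN.sub(replace, text)


if __name__ == "__main__":
    tree = "(1234567:0.1,(7654321:0.25,99999999:0.3):0.05);"
    print(extract_uids(tree))
    print(substitute_uids(tree, FAKE_DB, "{product} [{organism}]"))
```

In this sketch a UID with no database entry is left in place, so a partially annotated tree or alignment still round-trips. How SNAD itself handles such cases is not stated here.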
Figure 2. Unique ID conversion with SNAD. Nine IDs from GenBank (section "Before UID conversion") are converted using a pre-compiled four-characteristic template, "Complex protein name", that includes the following GenBank characteristics: "Primary ID" (gi number); organism name formatted as "G[enus].species"; protein product; and gene locus tag. Three delimiters are used: space, " [" and "]", and none of the characteristics is abbreviated. The first three characteristics are limited in size to 9, 6, and 17 symbols, respectively. The results of the conversion are shown in the "After UID conversion" section. Each arrow links a delimiter to a part of the converted names. {...} indicates that more than one organism name characteristic was found for a submitted ID (see [GenBank/Protein: 126224], features source/organism).
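The template described in the Figure 2 caption can be approximated as follows. This is a hedged sketch, not SNAD's code: the helper names `abbreviate_organism` and `compose_name` are hypothetical, the input values are invented, and it assumes the three delimiters sit between consecutive characteristics in the order the caption lists them.

```python
def abbreviate_organism(name):
    """Format 'Genus species' as 'G.species' (first letter of the genus)."""
    genus, _, species = name.partition(" ")
    if not species:
        return genus
    return f"{genus[0]}.{species}"


def compose_name(parts, limits, delimiters):
    """Truncate each part to its size limit and join with the delimiters.

    parts      -- characteristic values, in template order
    limits     -- maximum length per part, or None for no limit
    delimiters -- one delimiter between each pair of consecutive parts
    """
    truncated = [
        part if limit is None else part[:limit]
        for part, limit in zip(parts, limits)
    ]
    name = truncated[0]
    for delimiter, part in zip(delimiters, truncated[1:]):
        name += delimiter + part
    return name


if __name__ == "__main__":
    # Invented example values; the limits 9, 6, 17 and the delimiters
    # come from the figure caption, the fourth part is left unlimited.
    parts = [
        "1234567",
        abbreviate_organism("Escherichia coli"),
        "hypothetical protein example",
        "tag_0001",
    ]
    print(compose_name(parts, [9, 6, 17, None], [" ", " [", "]"]))
```

Keeping the limits and delimiters as plain lists mirrors the idea of a pre-compiled template: the same function can serve other templates by changing only its arguments.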
Cross-database recognition of UIDs in databases by SNAD*
| | GenBank | UniProt | EMBL | SeqHound | EnsEMBL |
| GenBank | UID | NUID | NUID | UID | NR |
| UniProt | NR | UID | NR | NR | NR |
| EMBL | NR | NR | UID | NR | NR |
| SeqHound | UID** | NR | NUID | UID | NR |
| EnsEMBL | NR | NR | NR | NR | UID |
*With the standard BioPerl drivers for GenBank, UniProt, EMBL and SeqHound and the API drivers from the EnsEMBL web site, UIDs from a source DB can be used for searching in a target DB. In some target DBs they are recognized as UIDs, in others as non-unique IDs (NUIDs), and in others they are not recognized as identifiers (NR). Although the same five DBs make up the source and target DB lists, the table is not symmetrical.
**Only gi numbers from GenBank are recognized as UIDs in SeqHound.
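To make the asymmetry of the recognition table concrete, the sketch below encodes it as a nested dictionary and lists the database pairs whose two cells differ. It keeps the row and column labels exactly as printed and does not assume which axis is the source DB. The names `RECOGNITION` and `asymmetric_pairs` are hypothetical, and the `UID**` cell is stored as plain `UID`.

```python
DATABASES = ["GenBank", "UniProt", "EMBL", "SeqHound", "EnsEMBL"]

# Rows of the table as printed, keyed by row label; each row lists the
# cells in column order (GenBank, UniProt, EMBL, SeqHound, EnsEMBL).
ROWS = {
    "GenBank": ["UID", "NUID", "NUID", "UID", "NR"],
    "UniProt": ["NR", "UID", "NR", "NR", "NR"],
    "EMBL": ["NR", "NR", "UID", "NR", "NR"],
    "SeqHound": ["UID", "NR", "NUID", "UID", "NR"],  # first cell is UID** in the table
    "EnsEMBL": ["NR", "NR", "NR", "NR", "UID"],
}

# RECOGNITION[row][column] -> "UID", "NUID" or "NR"
RECOGNITION = {row: dict(zip(DATABASES, cells)) for row, cells in ROWS.items()}


def asymmetric_pairs(table, names):
    """Return (a, b) pairs whose cell [a][b] differs from cell [b][a]."""
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if table[a][b] != table[b][a]:
                pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    for a, b in asymmetric_pairs(RECOGNITION, DATABASES):
        print(a, b, RECOGNITION[a][b], RECOGNITION[b][a])
```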