Sohrab P Shah, Yong Huang, Tao Xu, Macaire M S Yuen, John Ling, B F Francis Ouellette.
Abstract
BACKGROUND: We present a biological data warehouse called Atlas that locally stores and integrates biological sequences, molecular interactions, homology information, functional annotations of genes, and biological ontologies. The goal of the system is to provide data, as well as a software infrastructure, for bioinformatics research and development.
DESCRIPTION: The Atlas system is based on relational data models that we developed for each of the source data types. Data stored within these relational models are managed through Structured Query Language (SQL) calls that are implemented in a set of Application Programming Interfaces (APIs). The APIs are available in three languages: C++, Java, and Perl. The methods in these API libraries are used to construct a set of loader applications, which parse and load the source datasets into the Atlas database, and a set of toolbox applications, which facilitate data retrieval. Atlas stores and integrates local instances of GenBank, RefSeq, UniProt, Human Protein Reference Database (HPRD), Biomolecular Interaction Network Database (BIND), Database of Interacting Proteins (DIP), Molecular Interactions Database (MINT), IntAct, NCBI Taxonomy, Gene Ontology (GO), Online Mendelian Inheritance in Man (OMIM), LocusLink, Entrez Gene, and HomoloGene. The retrieval APIs and toolbox applications are critical components that offer end users flexible, easy, integrated access to these data. We present use cases in which Atlas integrates these sources for genome annotation, inference of molecular interactions across species, and gene-disease associations.
Year: 2005 PMID: 15723693 PMCID: PMC554782 DOI: 10.1186/1471-2105-6-34
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Architecture of the Atlas data warehouse. The data integrated in Atlas are first downloaded as data files from the public repositories shown in the Data Source panel. These data files are then parsed and loaded into the MySQL relational databases using the Atlas loaders. The Atlas Databases panel shows the databases grouped by biological theme: sequences (green); molecular interactions (yellow); genes and functional categorization (blue); and ontologies (orange). For each database, the available data retrieval methods are marked as SQL (S), C++ Atlas API (C), Java Atlas API (J), and Perl Atlas API (P). The Retrieval panel shows the flexible, layered architecture of the interfaces to the databases. Data can be accessed directly using the MySQL client with SQL statements, through the APIs in C++, Java, and Perl, and through the end-user applications implemented in the Toolbox. The APIs can also be used to implement web-based tools or standalone applications.
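The two lower retrieval layers described in this caption (direct SQL versus an API call that wraps the SQL) can be illustrated with a minimal sketch. It uses Python's built-in `sqlite3` module instead of MySQL so that it runs without a server. The table `seq_id`, its columns, the demo identifiers, and the function `ac2gi` are hypothetical names chosen for illustration; they are not the actual Atlas schema or API.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical identifier table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE seq_id ("
        "gi INTEGER PRIMARY KEY, accession TEXT UNIQUE, taxon_id INTEGER)"
    )
    # Made-up demo rows, not real GenBank records.
    conn.executemany(
        "INSERT INTO seq_id (gi, accession, taxon_id) VALUES (?, ?, ?)",
        [(1001, "DEMO_0001", 9606), (1002, "DEMO_0002", 10090)],
    )
    conn.commit()
    return conn


def ac2gi(conn, accession):
    """API layer: hide the SQL statement behind a function call.

    Returns the GI number for an accession, or None if it is unknown.
    """
    row = conn.execute(
        "SELECT gi FROM seq_id WHERE accession = ?", (accession,)
    ).fetchone()
    return row[0] if row else None


conn = build_demo_db()

# Layer 1: direct SQL, as one would type it into a database client.
direct = conn.execute(
    "SELECT gi FROM seq_id WHERE accession = 'DEMO_0001'"
).fetchone()[0]

# Layer 2: the same lookup through the API wrapper.
via_api = ac2gi(conn, "DEMO_0001")

print(direct, via_api)
```

Both layers return the same value; the API layer only spares the caller from knowing table and column names, which is the role the caption assigns to the C++, Java, and Perl APIs.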
Data sources included in Atlas.
| Atlas Data Source Summary Table * | | | |
| Data source | Format | Update frequency | Update type |
| GenBank Sequence | ASN.1 | Daily | Incremental |
| GenBank Sequence | ASN.1 | Release | Reload |
| GenBank RefSeq | ASN.1 | Daily | Incremental |
| GenBank RefSeq | ASN.1 | Release | Reload |
| NCBI Taxonomy | Delimited Text | Release | Reload |
| HomoloGene | Delimited Text | Daily | Reload |
| OMIM | Delimited Text | Daily | Reload |
| Gene | Delimited Text | Daily | Reload |
| LocusLink | Delimited Text | Daily | Reload |
| UniProt | XML | Bi-weekly | Reload |
| HPRD | XML | Release | Reload |
| MINT | XML | Release | Reload |
| DIP | XML | Release | Reload |
| BIND | Delimited Text | Release | Reload |
| GO | MySQL dump | Release | Reload |
* Up-to-date information about data sources and statistics is available from the Atlas website.
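The table distinguishes two ways of applying an update, "Reload" and "Incremental". The sketch below shows one plausible reading of that distinction for a delimited-text source: a reload discards the existing rows and loads the full release, while an incremental update only adds or overwrites the records present in the update file. The table `record`, the two-column file layout, and all function names are assumptions made for illustration, and `sqlite3` stands in for MySQL; this is not the Atlas loader code.

```python
import csv
import io
import sqlite3


def make_table(conn):
    """Create a hypothetical two-column table keyed by record identifier."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS record (id TEXT PRIMARY KEY, value TEXT)"
    )


def parse_delimited(text, delimiter="\t"):
    """Parse delimited text into (id, value) tuples, skipping blank lines."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [(row[0], row[1]) for row in reader if row]


def reload(conn, rows):
    """Reload: drop the current contents, then load the complete release."""
    with conn:  # one transaction: either everything is replaced or nothing
        conn.execute("DELETE FROM record")
        conn.executemany("INSERT INTO record (id, value) VALUES (?, ?)", rows)


def incremental(conn, rows):
    """Incremental: insert new records, overwrite changed ones, keep the rest."""
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO record (id, value) VALUES (?, ?)", rows
        )


def snapshot(conn):
    """Return the table contents as a dict for inspection."""
    return dict(conn.execute("SELECT id, value FROM record ORDER BY id"))


conn = sqlite3.connect(":memory:")
make_table(conn)

reload(conn, parse_delimited("A\tv1\nB\tv1\n"))        # initial release
incremental(conn, parse_delimited("B\tv2\nC\tv1\n"))   # daily update
after_incremental = snapshot(conn)

reload(conn, parse_delimited("C\tv3\n"))               # next full release
after_reload = snapshot(conn)

print(after_incremental)
print(after_reload)
```

After the incremental step, record `A` survives untouched while `B` is overwritten and `C` is added; after the second reload only the records in the new release remain.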
Figure 2. Atlas database schema. There are four major functional groups. Biological Sequences: includes instances of GenBank sequences, RefSeq sequences, and UniProt sequences; Molecular Interactions: includes instances of BIND, HPRD, DIP, IntAct, and MINT; Gene Related Resources: includes instances of OMIM, Entrez Gene, LocusLink, and HomoloGene; and Ontology: includes instances of Taxonomy, Atlas internal ontologies, Gene Ontology, and PSI-MI ontologies.
Figure 3. Atlas API architecture. MySqlDb, Seq, SeqGet and SeqLoad classes/modules (grey) are available in all three languages: C++, Java, and Perl. The SeqLoad and Seqloader modules are created in C++ only as these are tightly coupled to the NCBI C++ Toolkit. All other classes are available in Java. Applications share the common modules SeqLoad, SeqGet, InteractionLoad, and InteractionGet, which provide the methods necessary for loading data into, and retrieving data from, the databases. These modules employ additional classes (not shown) that represent the major data model components, such as Sequence, Interaction, Interactor, and Dbxref.
Atlas toolbox applications.
| Application | Description | Input | Output |
| Sequence | | | |
| ac2seq | Retrieve sequences given an accession | Nucleic acid or Protein Accession Number(s) | Sequences in Fasta format |
| feat2seq | Retrieve sub-sequences that span features | Feature type (and qualifier) | Sequences in Fasta format |
| gi2seq | Retrieve sequences given a GenInfo identifier | GenInfo Identifier(s) (GI Number(s)) | Sequences in Fasta format |
| gi2seqentry | Retrieve sequences given a GenInfo identifier | GenInfo Identifier(s) (GI Number(s)) | GBFF, EMBL, GFF, FTABLE, ASN.1, GBSEQ |
| tax2seq | Retrieve sequences by taxonomy | NCBI taxon identifier or scientific name of taxon | Sequences in Fasta format |
| tech2seq | Retrieve sequences by sequencing technique | Sequencing technique (e.g., EST, GSS) | Sequences in Fasta format |
| techtax2seq | Retrieve sequences by taxonomy and sequencing technique | Sequencing technique and NCBI taxon identifier or scientific name of taxon | Sequences in Fasta format |
| Loader | | | |
| fastaloader | Fasta sequence data loader | Sequences in Fasta format | |
| seqloader | ASN.1 sequence data loader | GenBank/RefSeq ASN.1 records | |
| Feature | | | |
| ac2feat | Retrieve features | GenBank Accession number(s) | Features in GFF or FTABLE format |
| gi2feat | Retrieve features | GenInfo Identifier(s) (GI Number(s)) | Features in GFF or FTABLE format |
| Taxonomy | | | |
| ac2tax | Retrieve taxonomy given an accession number | GenBank Accession number (string) | NCBI taxon identifier (integer) |
| gi2tax | Retrieve taxonomy given a GenInfo identifier | GenInfo Identifier (integer) | NCBI taxon identifier (integer) |
| ID Converters | | | |
| ac2gi | Convert an accession number to a GenInfo identifier | GenBank Accession number (string) | GenInfo Identifier (integer) |
| gi2ac | Convert a GenInfo identifier to an accession number | GenInfo Identifier (integer) | Accession number (string) |
| tax2gi | Retrieve GenInfo identifiers associated with taxon identifier | NCBI taxon identifier (integer) | GenInfo Identifier (integer) |
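Most of the sequence tools in this table follow the same pattern: select records by some key, then write them out in Fasta format. The sketch below imitates that pattern for a taxon-based lookup. It borrows the name `tax2seq` from the table only as a label; the in-memory `RECORDS` list, the demo identifiers, the header layout, and the function signatures are all assumptions made for illustration and do not reproduce the actual Atlas tool or its output.

```python
# Hypothetical records standing in for rows of the Atlas sequence tables.
RECORDS = [
    {"gi": 1001, "accession": "DEMO_0001", "taxon_id": 9606, "seq": "ACGTACGTACGTACG"},
    {"gi": 1002, "accession": "DEMO_0002", "taxon_id": 10090, "seq": "TTGACA"},
    {"gi": 1003, "accession": "DEMO_0003", "taxon_id": 9606, "seq": "GGCC"},
]


def to_fasta(record, width=60):
    """Format one record as a Fasta entry with fixed-width sequence lines.

    The header layout here is an assumption, not the Atlas convention.
    """
    header = f">{record['accession']} gi={record['gi']} taxon={record['taxon_id']}"
    seq = record["seq"]
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + lines)


def tax2seq(records, taxon_id, width=60):
    """Return all sequences for an NCBI taxon identifier as Fasta text."""
    return "\n".join(
        to_fasta(r, width) for r in records if r["taxon_id"] == taxon_id
    )


# Retrieve every demo sequence tagged with taxon 9606 (Homo sapiens).
print(tax2seq(RECORDS, 9606, width=10))
```

Swapping the filter condition (accession, GI number, sequencing technique) while keeping `to_fasta` unchanged gives the shape of the other retrieval tools listed in the Sequence group.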
Figure 4. Using Atlas in genome annotation. Atlas facilitates genome annotation at multiple levels: creation of data reagents, storage of annotations, and data transformation for submission. Here we show a schema of our genome annotation process, which integrates Pegasys, Apollo, NCBI tools, and Atlas into a comprehensive platform. Data reagents for sequence alignment are compiled using the Atlas toolbox applications. Computational analyses are run through the Pegasys system, which outputs GAME XML for import into Apollo. Annotations are saved as GAME XML, which is then imported into Atlas using the GameLoader. At this step, the biological features created in the annotation process are stored in the Atlas Feature tables, in exactly the same way as the annotations of a GenBank sequence record. These annotations can then be retrieved using the Atlas toolbox application ac2feat and exported in GFF2 or Sequin Feature Table format for import into the NCBI submission tools for validation and submission to GenBank.
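The final export step in this workflow turns stored features into GFF2 lines. As a rough sketch of what such a transformation involves, the code below serializes a feature into the nine tab-separated GFF2 columns (seqname, source, feature, start, end, score, strand, frame, attributes). The `Feature` class, its field names, the default source label, and the demo values are hypothetical; this is not the ac2feat implementation, and the Atlas Feature tables may be organized differently.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """Hypothetical in-memory view of one stored annotation."""
    seqname: str
    feature_type: str
    start: int              # 1-based, inclusive
    end: int                # 1-based, inclusive
    strand: str = "."       # "+", "-", or "." when not applicable
    frame: str = "."        # "0", "1", "2", or "." when not applicable
    source: str = "Atlas"   # assumed label for the source column
    score: str = "."        # "." means no score
    attributes: dict = field(default_factory=dict)


def to_gff2(feat):
    """Serialize a feature as one tab-separated GFF2 line."""
    if feat.start > feat.end:
        raise ValueError("start must not exceed end")
    columns = [
        feat.seqname,
        feat.source,
        feat.feature_type,
        str(feat.start),
        str(feat.end),
        feat.score,
        feat.strand,
        feat.frame,
    ]
    # GFF2 attributes are written as: tag "value" ; tag "value"
    attrs = " ; ".join(f'{tag} "{value}"' for tag, value in feat.attributes.items())
    if attrs:
        columns.append(attrs)
    return "\t".join(columns)


cds = Feature("DEMO_0001", "CDS", 10, 99, strand="+", frame="0",
              attributes={"gene": "demoA"})
print(to_gff2(cds))
```

Keeping the serializer separate from the stored representation is what would let the same features be written as either GFF2 or a Sequin feature table, as the caption describes.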