Andreas Wilke, Travis Harrison, Jared Wilkening, Dawn Field, Elizabeth M Glass, Nikos Kyrpides, Konstantinos Mavrommatis, Folker Meyer.
Abstract
BACKGROUND: Computing sequence similarity results is becoming a limiting factor in metagenome analysis. Sequence similarity search results encoded in an open, exchangeable format have the potential to reduce the need for computational reanalysis of these data sets. A prerequisite for sharing similarity results is a common reference. DESCRIPTION: We introduce a mechanism for automatically maintaining a comprehensive, non-redundant protein database and for creating a quarterly release of this resource. In addition, we present tools for translating similarity searches into many annotation namespaces, e.g. KEGG or NCBI's GenBank.
Year: 2012 PMID: 22720753 PMCID: PMC3410781 DOI: 10.1186/1471-2105-13-141
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. A simplified view of the internal representation of the M5nr. Sequences are stored in a single FASTA file using md5 sequence identifiers. In addition, a number of tables are stored in an SQL database management system to allow rapid queries. The tables link md5 identifiers with IDs, functions and organisms provided by a number of data sources.
Figure 2. M5nr Databases. Databases currently included in the M5nr database as presented in the online overview page provided as part of the M5nr web site.
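The layout described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the M5nr implementation: the table name `annotation`, its columns, the function names, and the normalization step (stripping whitespace and upper-casing before hashing) are all assumptions made for illustration, and an in-memory SQLite database stands in for the SQL database management system.

```python
import hashlib
import sqlite3


def md5_id(sequence: str) -> str:
    """Return the md5 hex digest used as the sequence identifier.

    Assumption: whitespace is stripped and the sequence is upper-cased
    before hashing; the real normalization rules may differ.
    """
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def build_index(records):
    """Build the two stores from (source, source_id, function, organism, sequence) tuples.

    Returns an SQLite connection holding the annotation table and a dict
    mapping each md5 identifier to its single, non-redundant sequence.
    """
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per (md5, source) annotation.
    conn.execute(
        "CREATE TABLE annotation ("
        "md5 TEXT, source TEXT, source_id TEXT, function TEXT, organism TEXT)"
    )
    conn.execute("CREATE INDEX idx_annotation_md5 ON annotation (md5)")
    sequences = {}
    for source, source_id, function, organism, seq in records:
        key = md5_id(seq)
        # Identical sequences from different sources collapse to one entry.
        sequences[key] = "".join(seq.split()).upper()
        conn.execute(
            "INSERT INTO annotation VALUES (?, ?, ?, ?, ?)",
            (key, source, source_id, function, organism),
        )
    conn.commit()
    return conn, sequences


def to_fasta(sequences):
    """Serialize the non-redundant sequences as FASTA with md5 headers."""
    return "".join(f">{key}\n{seq}\n" for key, seq in sorted(sequences.items()))


def translate(conn, md5, source):
    """Look up the annotations one source provides for an md5 identifier."""
    cursor = conn.execute(
        "SELECT source_id, function, organism FROM annotation "
        "WHERE md5 = ? AND source = ? ORDER BY source_id",
        (md5, source),
    )
    return cursor.fetchall()
```

Because a similarity hit against the FASTA file carries only the md5 identifier, a call such as `translate(conn, hit_md5, "KEGG")` would be enough to express that hit in one source's namespace without repeating the search.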
Figure 3. Database Statistics. Statistics on the M5nr databases showing the total number of source databases, IDs, sequences and other key annotations. We show the number of unique elements added by each database included in the M5nr, looking at identifiers, sequences, functional annotations and organisms. For each item (IDs, sequences, functions, organisms) there is a total and a percent. The total is the count of unique representations of that item; it matters because sequences, functional names, and organism names are duplicated within each source database.
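The per-source totals described in the Figure 3 caption can be sketched as a count of distinct values. The caption does not say what the percent is taken relative to; the sketch below assumes it is the share of all distinct values across every source, and the function name `unique_counts` is hypothetical.

```python
from collections import defaultdict


def unique_counts(records):
    """Count distinct values per source from (source, value) pairs.

    Returns {source: (unique_count, percent)}. Assumption: percent is
    relative to the number of distinct values across all sources.
    """
    per_source = defaultdict(set)
    for source, value in records:
        # A set drops duplicates that occur within a single source.
        per_source[source].add(value)
    all_values = set().union(*per_source.values())
    total = len(all_values)
    return {
        source: (len(values), 100.0 * len(values) / total)
        for source, values in per_source.items()
    }
```

The same function applies to any of the four items (IDs, sequences, functions, organisms) by passing the matching column as `value`.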