Tanya Barrett, Karen Clark, Robert Gevorgyan, Vyacheslav Gorelenkov, Eugene Gribov, Ilene Karsch-Mizrachi, Michael Kimelman, Kim D Pruitt, Sergei Resenchuk, Tatiana Tatusova, Eugene Yaschenko, James Ostell.
Abstract
As the volume and complexity of data sets archived at NCBI grow rapidly, so does the need to gather and organize the associated metadata. Although metadata has been collected for some archival databases, there was previously no centralized approach at NCBI for collecting this information and using it across databases. The BioProject database was recently established to facilitate organization and classification of project data submitted to NCBI, EBI and DDBJ databases. It captures descriptive information about research projects that result in high volume submissions to archival databases, ties together related data across multiple archives and serves as a central portal by which to inform users of data availability. Concomitantly, the BioSample database is being developed to capture descriptive information about the biological samples investigated in projects. BioProject and BioSample records link to corresponding data stored in archival repositories. Submissions are supported by a web-based Submission Portal that guides users through a series of forms for input of rich metadata describing their projects and samples. Together, these databases offer improved ways for users to query, locate, integrate and interpret the masses of data held in NCBI's archival repositories. The BioProject and BioSample databases are available at http://www.ncbi.nlm.nih.gov/bioproject and http://www.ncbi.nlm.nih.gov/biosample, respectively.
Year: 2011 PMID: 22139929 PMCID: PMC3245069 DOI: 10.1093/nar/gkr1163
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Schematic depicting how BioProject, BioSample and data objects can be organized and linked. This example is composed of one umbrella project that encompasses three subprojects, each of which generated data derived from two BioSample records. Users can query either the BioProject or the BioSample database to retrieve the relevant records, and then navigate through links to the corresponding experimental data which continue to be stored in NCBI's primary data archives, including GenBank, SRA, dbGaP and GEO. This schematic depicts direct links that can be applied between objects; it does not depict links to corresponding records in other NCBI databases, including PubMed, Gene, Genome and Taxonomy.
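The arrangement described in the Figure 1 caption can be sketched as a small data structure. This is only an illustrative model, not NCBI's schema: the class names, field names and placeholder accessions (`PRJ_UMBRELLA`, `SAMPLE_1_1`, and so on) are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BioSample:
    accession: str  # hypothetical placeholder, not a real accession
    # Links to records held in the primary archives (e.g. GenBank, SRA).
    data_links: List[str] = field(default_factory=list)


@dataclass
class BioProject:
    accession: str  # hypothetical placeholder, not a real accession
    title: str
    samples: List[BioSample] = field(default_factory=list)
    subprojects: List["BioProject"] = field(default_factory=list)

    def all_samples(self) -> List[BioSample]:
        """Collect samples from this project and every nested subproject."""
        found = list(self.samples)
        for sub in self.subprojects:
            found.extend(sub.all_samples())
        return found


def build_example() -> BioProject:
    """One umbrella project, three subprojects, two samples per subproject."""
    umbrella = BioProject("PRJ_UMBRELLA", "Umbrella project")
    for i in range(1, 4):
        sub = BioProject(f"PRJ_SUB{i}", f"Subproject {i}")
        for j in range(1, 3):
            sub.samples.append(
                BioSample(f"SAMPLE_{i}_{j}", [f"archive-record-{i}-{j}"])
            )
        umbrella.subprojects.append(sub)
    return umbrella


umbrella = build_example()
print(len(umbrella.subprojects), len(umbrella.all_samples()))  # 3 6
```

Walking the tree from the umbrella project reaches all six samples, which mirrors navigating from a project record to its linked sample and data records.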
Figure 2. Screenshot of a Genome Sequencing project that is a component of an umbrella project that encompasses data generated from an E. coli pathogen outbreak (upper panel) (17) and a corresponding sample record (lower panel). The records display the project title, summary, data type, locus_tag prefix and various project attributes including the scope and capture method (A). The Project Data section (B) lists the availability of corresponding sequence and assembly data in the Nucleotide and SRA databases where the data can be downloaded. Navigation panels help users link to Genome-level resources for that organism (C), ‘Navigate Up’ to the parent umbrella project, or ‘Navigate across’ to sibling projects that are part of that umbrella project, as well as to any additional projects related by organism (D). The ‘Related Information’ panel (E) contains the full list of linkages for that record; clicking the BioSample link directs the user to the sample record shown in the lower panel, which lists the attributes that were collected for that sample, including the collection date, isolation source, country, strain and serovar (F).
Types of projects in BioProject as of 5 October 2011
| Project data type | Number |
|---|---|
| Umbrella | 216 |
| Primary submission (total of the data types from Assembly to Variation below) | 10 712 |
| Assembly | 6 |
| Clone ends | 65 |
| Epigenomics | 9 |
| Exome | 33 |
| Genome sequencing | 9265 |
| Map | 134 |
| Metagenome | 400 |
| Metagenomic assembly | 40 |
| Other | 12 |
| Phenotype or genotype | 3 |
| Proteome | 1 |
| Random survey | 5 |
| Targeted locus (loci) | 90 |
| Transcriptome or gene expression | 588 |
| Variation | 61 |
| RefSeq | 3418 |
As the intra-NCBI database cross-connections are established and enhanced, the counts for some project types are expected to increase. For example, the number of ‘Phenotype or genotype’ projects will greatly increase when BioProject is populated with the studies from dbGaP.
Selected search fields and example queries
| Database | Find by … | Example search ‘term[field]’ |
|---|---|---|
| BioProject and BioSample | organism or taxonomic class | insecta[Organism] |
| BioProject | project data type | metagenome[Project Data Type] |
| BioProject | publication | 10473380[Pubmed ID] |
| BioProject | submitter organization or consortium | JGI[Submitter Organization] |
| BioProject | sample scope | scope environment[Properties] |
| BioProject | material used | material transcriptome[Properties] |
| BioProject and BioSample | database identifier | PRJNA33823 or PRJNA33823[bioproject] or 33823[uid] or 33823[bioproject] |
| BioSample | attribute name | age[Attribute Name] |
| BioSample | attribute and value | cell line GM10847[Attribute] |
| BioSample | data source | source coriell[Properties] |
Multi-part queries may be constructed by specifying the search terms, their fields and the Boolean operations AND, OR, NOT.
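As a minimal sketch of how such multi-part queries might be assembled programmatically, the helpers below compose `term[field]` clauses with a Boolean operator. The function names are hypothetical. The use of NCBI's E-utilities `esearch.fcgi` endpoint with `db=bioproject` is an assumption that goes beyond this text, which describes only the web search interface; the code builds the URL and does not send a request.

```python
from urllib.parse import urlencode

# Assumption: queries are sent to the NCBI E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def fielded(term: str, field: str) -> str:
    """Format one clause in the 'term[field]' style used in the table."""
    return f"{term}[{field}]"


def combine(operator: str, *clauses: str) -> str:
    """Join clauses with a Boolean operator and wrap them in parentheses."""
    if operator not in BOOLEAN_OPERATORS:
        raise ValueError(f"unsupported operator: {operator}")
    return "(" + f" {operator} ".join(clauses) + ")"


def esearch_url(db: str, query: str) -> str:
    """Build (but do not send) a search URL for the given database."""
    return ESEARCH + "?" + urlencode({"db": db, "term": query})


# Insect metagenome projects, using two fields from the table above.
query = combine(
    "AND",
    fielded("insecta", "Organism"),
    fielded("metagenome", "Project Data Type"),
)
url = esearch_url("bioproject", query)
print(query)
print(url)
```

The same `query` string can also be pasted directly into the BioProject search box; the parentheses make the grouping explicit when clauses are nested.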