Rafael S Gonçalves, Mark A Musen.
Abstract
We present an analytical study of the quality of metadata about samples used in biomedical experiments. The metadata under analysis are stored in two well-known databases: BioSample, a repository managed by the National Center for Biotechnology Information (NCBI), and BioSamples, a repository managed by the European Bioinformatics Institute (EBI). We tested whether 11.4 million sample metadata records in the two repositories are populated with values that fulfill the stated requirements for such values. Our study revealed multiple anomalies in the metadata. Most metadata field names and their values are not standardized or controlled. Even simple binary or numeric fields are often populated with inadequate values of different data types. By clustering metadata field names, we discovered that there are often many distinct ways to represent the same aspect of a sample. Overall, the metadata we analyzed reveal a lack of principled mechanisms to enforce and validate metadata requirements. The significant aberrancies that we found in the metadata are likely to impede search and secondary use of the associated datasets.
Year: 2019 PMID: 30778255 PMCID: PMC6380228 DOI: 10.1038/sdata.2019.21
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Figure 1. Example metadata record from the NCBI BioSample.
An NCBI BioSample metadata record has a title, potentially multiple identifiers associated with it, an organism, a package specification (explained in Section 2.1), multiple attributes in the form of name-value pairs, a description with keywords associated with it, information about the record submitter, and finally accession details.
Figure 2. Mention of metadata packages in NCBI BioSample.
The chart shows the package names followed by the number (and percentage) of metadata records that use that package. The Generic package does not specify any required or optional attributes.
Figure 3. Metadata submissions to NCBI BioSample from 2009–2017.
The columns represent the total number of metadata record submissions to NCBI BioSample in a year, split between Generic and non-Generic records. The Non-Generic metadata records column contains data labels with the absolute number of records. Generic records make up nearly all the submissions in the early years of BioSample, and the bulk of the submissions even in recent years.
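The per-year split described in this caption can be reproduced with a simple tally. The following is a minimal sketch, assuming each record is a dictionary with hypothetical `submission_date` and `package` fields and that Generic packages can be recognized by a `Generic` name prefix; the sample records are made up for illustration and are not data from the study.

```python
from collections import Counter


def tally_by_year(records):
    """Count records per (year, kind), where kind is 'Generic' or 'Non-Generic'."""
    counts = Counter()
    for rec in records:
        # Assumption: dates are ISO-like strings starting with a 4-digit year.
        year = int(rec["submission_date"][:4])
        # Assumption: Generic packages have names that start with "Generic".
        kind = "Generic" if rec["package"].startswith("Generic") else "Non-Generic"
        counts[(year, kind)] += 1
    return counts


# Made-up records for illustration only.
sample = [
    {"submission_date": "2009-05-01", "package": "Generic.1.0"},
    {"submission_date": "2009-07-12", "package": "Generic.1.0"},
    {"submission_date": "2016-03-30", "package": "Pathogen.cl.1.0"},
]
counts = tally_by_year(sample)
```

Keying the counter on a `(year, kind)` pair keeps both series of the chart in one structure, so either column can be read off without a second pass over the records.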
Figure 4. Quality of dictionary attributes in NCBI BioSample according to their type.
The columns show the number and percentage of attributes whose values are well-specified or invalid.
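To make the well-specified versus invalid distinction concrete, the sketch below checks attribute values against a declared type and tallies the outcome per type. It is an illustration under stated assumptions, not the study's validation procedure: the two types shown, the accepted boolean literals, and the input layout of `(type, value)` pairs are all hypothetical choices.

```python
import re
from collections import Counter

# Assumption: only two illustrative types; the accepted literals are a guess.
VALIDATORS = {
    "boolean": lambda v: v.strip().lower() in {"true", "false"},
    "integer": lambda v: re.fullmatch(r"[+-]?[0-9]+", v.strip()) is not None,
}


def summarize(attributes):
    """Tally (type, status) pairs for an iterable of (type, value) tuples."""
    summary = Counter()
    for attr_type, value in attributes:
        status = "well-specified" if VALIDATORS[attr_type](value) else "invalid"
        summary[(attr_type, status)] += 1
    return summary


# Made-up values for illustration only.
summary = summarize([
    ("boolean", "true"),
    ("boolean", "Yes, partially"),
    ("integer", "42"),
    ("integer", "3.5 cm"),
])
```

A regular expression is used for integers rather than `int()`, because `int()` also accepts forms such as `"1_000"` that a repository would probably not consider a plain integer.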
Figure 5. Quality of attributes in packaged metadata records in NCBI BioSample.
The columns represent the metadata attribute types. Each column shows the number and percentage of metadata attributes whose values are either well-specified or invalid.
Figure 6. Quality of attributes in metadata that co-exist in EBI and NCBI repositories.
The columns represent the metadata attribute types. Each column shows the number and percentage of metadata attributes whose values are either well-specified or invalid.
Figure 7. Metadata submissions to EBI BioSamples from 2009–2017.
The columns represent the total number of metadata record submissions to EBI BioSamples per year.
Figure 8. Mention of metadata packages in EBI BioSamples.
The chart shows the package names (or “Unpackaged” for records that do not specify a package) followed by the number and percentage of metadata records that specify that package name.
Figure 9. Quality of named attributes in EBI BioSamples.
The columns represent the metadata attribute types. Each column shows the number and percentage of metadata attributes whose values are either well-specified or invalid.
Examples of clusters of metadata attribute names. The left column contains the exemplar attribute name computed by the clustering algorithm; the right column contains the cluster of attribute names formed around that exemplar.

| Exemplar | Attribute names in cluster |
|---|---|
| atm pressure | atm press, atmospheric pressure |
| Disease stage | disease_stage, disease_stage, DiseaseStaging, disease staging, tfc_disease_stage, disease/status, diease_stat, DiseaseLocati |
| embryonic stage | embryo age, embryo stage, embryogenesis stage, embryonic age, embryonic day, embryonic stages, embryonic zone, meiotic stage, pollen embryo sac stage |
| environmental history | EnvironmentalHistory, Host Environmental History, environemental history, environmental history colony |
| experimental condition | Experiment condition, Experimental or control, enviromental conditions, environmental condition, environmental conditions, experimental conditions |
| genetic background | cytogenetic background, genetic and mutant background, genetic background cultivar, genetic backround, genetick background, genotype variation background |
| genotype variation | genotype variaion, cmv genotype variation, geneotype variation, genome variation, genoptype variation, genotype varation, genotype variaion, genotype variarion, genotype variataion, genotype variatation, genotype variaton, gentotype variation |
| geo_loc_name | geo_lac_name, geo_loc_name2, geo_loc_name_coord, geo_log_name, geooc_name, go_loc_name |
| nucleic acid extraction | Nucleic acid preparation, Nucleic_acid_extraction, nucleic acid amplification, nucleic acid extraction method |
| Sampling days | Sample Time days, Sampling Time days approx, Sampling Year, Sampling days, sampling day |
| Submitted by | Submitter, Submitters |
| Time point | Time local, Time weeks, TimePointC, TimePointF, time point, time points, timepoint, timepoints |
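
The clusters above group names that differ only in case, separators, or small misspellings. The sketch below shows one simple way to obtain such groupings: normalize each name, then greedily attach it to the first cluster whose representative is similar enough. This is a stand-in for illustration, not the exemplar-based clustering algorithm used in the study; the similarity measure (`difflib.SequenceMatcher`), the `0.85` threshold, and the function names are assumptions.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase a name, split camelCase, and turn separators into spaces."""
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)  # split camelCase
    spaced = re.sub(r"[_\-./]+", " ", spaced)           # unify separators
    return " ".join(spaced.lower().split())


def greedy_cluster(names, threshold=0.85):
    """Group names whose normalized form is close to a cluster's first member."""
    clusters = []  # list of (normalized representative, [original names])
    for name in names:
        norm = normalize(name)
        for rep, members in clusters:
            if SequenceMatcher(None, rep, norm).ratio() >= threshold:
                members.append(name)
                break
        else:
            # No sufficiently similar cluster: start a new one.
            clusters.append((norm, [name]))
    return [members for _, members in clusters]


groups = greedy_cluster([
    "geo_loc_name", "geo_lac_name", "geo_log_name",
    "Time point", "timepoint", "time points",
])
```

Unlike an exemplar-based method, this greedy pass depends on input order and always uses the first member as the representative, so it is best read as a way to see why the listed variants end up together rather than as a reproduction of the table.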
Categories of attribute names according to the concept they represent. The left column shows the category; the right column shows the attribute names in that category.

| Category | Attribute names |
|---|---|
| Biomedical characteristic | breed, ethnicity, host, sample_type, organism, tissue, species, strain, sex, body site, cell type, genotype, disease state, … |
| Date | collection date, collection timestamp, time point |
| Geographic location | geo_loc_name, geographic location, lat_lon, country, latitude and longitude, grographic location (country and/or sea) |
| Measurement | depth, elevation, age, altitude, host_age |
| Identifier | sample id, package, model, gap_accession, gap_sample_id, … |
| Textual description | Sample_title, project name, Sample Name, label, title, study name, common name, secondary description, source name, … |
Groups of attribute names seemingly used to describe the same concept. From left to right, each row shows: the concept that the metadata attributes presumably represent, the number of attribute names found to represent that concept, example attribute names found using our clustering method, and the numbers of metadata records in the NCBI BioSample and EBI BioSamples that contain attributes using one of the attribute names in the cluster. The standard attribute names specified in the NCBI BioSample documentation are shown in bold.

| Concept | #Attribute names | Example attribute names | #NCBI records | #EBI records |
|---|---|---|---|---|
| Geographic location | 32 | | 1,056,519 | 442,950 |
| Height | 31 | | 23,170 | 23,641 |
| Elevation | 13 | | 119,477 | 157,778 |
| Age | 33 | | 553,523 | 711,747 |
| Weight | 26 | | 16,330 | 11,966 |
| Birth date | 18 | | 22,785 | 19,684 |
| Time point | 62 | Timepoint, Time.point, time point, time points, time-point, time_point, Timepoints, time-point in minutes, timepoint in minutes, time-window, time, time_period, time period, time_point_days, time_point_months | 76,561 | 105,083 |
| Country or Region | 24 | | 190,718 | 201,655 |
| Collection date/time | 32 | | 136,819 | 139,231 |
| Ethnicity | 4 | | 41,007 | 73,997 |
| Sample type | 31 | | 260,708 | 299,868 |
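
The record counts in the last two columns correspond to the number of records that use at least one attribute name from a concept's group. A minimal sketch of that count follows, assuming each record exposes a hypothetical `attributes` dictionary of name-value pairs and that names are compared case-insensitively; the records shown are made up for illustration.

```python
def count_records_for_concept(records, concept_names):
    """Count records that use at least one attribute name from the group."""
    wanted = {name.lower() for name in concept_names}
    return sum(
        1
        for rec in records
        if any(attr.lower() in wanted for attr in rec["attributes"])
    )


# A few of the "Time point" names from the table above.
time_point_names = ["Timepoint", "time point", "time_point"]

# Made-up records for illustration only.
records = [
    {"attributes": {"timepoint": "24h", "tissue": "liver"}},
    {"attributes": {"time_point": "3", "Time point": "3"}},
    {"attributes": {"tissue": "lung"}},
]
n_time_point = count_records_for_concept(records, time_point_names)
```

Each record is counted once even when it carries several names from the same group, which matches a per-record rather than per-attribute count.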