Konstantinos Liolios, Lynn Schriml, Lynette Hirschman, Ioanna Pagani, Bahador Nosrat, Peter Sterk, Owen White, Philippe Rocca-Serra, Susanna-Assunta Sansone, Chris Taylor, Nikos C Kyrpides, Dawn Field.
Abstract
Variability in the extent of the descriptions of data ('metadata') held in public repositories forces users to assess the quality of records individually, which rapidly becomes impractical. The scoring of records on the richness of their description provides a simple, objective proxy measure for quality that enables filtering that supports downstream analysis. Pivotally, such descriptions should spur on improvements. Here, we introduce such a measure - the 'Metadata Coverage Index' (MCI): the percentage of available fields actually filled in a record or description. MCI scores can be calculated across a database, for individual records or for their component parts (e.g., fields of interest). There are many potential uses for this simple metric, for example: to filter, rank or search for records; to assess the metadata availability of an ad hoc collection; to determine the frequency with which fields in a particular record type are filled, especially with respect to standards compliance; to assess the utility of specific tools and resources, and of data capture practice more generally; to prioritize records for further curation; to serve as performance metrics of funded projects; or to quantify the value added by curation. Here we demonstrate the utility of MCI scores using metadata from the Genomes Online Database (GOLD), including records compliant with the 'Minimum Information about a Genome Sequence' (MIGS) standard developed by the Genomic Standards Consortium. We discuss challenges and address the further application of MCI scores: to show improvements in annotation quality over time, to inform the work of standards bodies and repository providers on the usability and popularity of their products, and to assess and credit the work of curators. Such an index provides a step towards putting metadata capture practices and, in the future, standards compliance into a quantitative and objective framework.
Year: 2012 PMID: 23409217 PMCID: PMC3558968 DOI: 10.4056/sigs.2675953
Source DB: PubMed Journal: Stand Genomic Sci ISSN: 1944-3277
Figure 1. Schematic representation of the MCI calculation procedure.
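The calculation can be sketched in a few lines of Python. This is a minimal illustration, not the GOLD implementation: the helper names (`is_filled`, `record_mci`, `field_mci`, `collection_mci`), the rule that `None` and blank strings count as unfilled, and the two sample records are all assumptions made for this example. Only the field names are taken from the GOLD field list.

```python
from typing import Mapping, Sequence


def is_filled(value: object) -> bool:
    """Assumed rule: None and blank strings count as unfilled."""
    if value is None:
        return False
    if isinstance(value, str) and not value.strip():
        return False
    return True


def record_mci(record: Mapping[str, object], fields: Sequence[str]) -> float:
    """MCI of one record: percentage of the available fields that are filled."""
    if not fields:
        raise ValueError("fields must not be empty")
    filled = sum(1 for f in fields if is_filled(record.get(f)))
    return 100.0 * filled / len(fields)


def field_mci(records: Sequence[Mapping[str, object]], field: str) -> float:
    """MCI of one field: percentage of records in which that field is filled."""
    if not records:
        raise ValueError("records must not be empty")
    filled = sum(1 for r in records if is_filled(r.get(field)))
    return 100.0 * filled / len(records)


def collection_mci(records: Sequence[Mapping[str, object]],
                   fields: Sequence[str]) -> float:
    """MCI of a collection: filled cells over (records x fields)."""
    if not records or not fields:
        raise ValueError("records and fields must not be empty")
    total = len(records) * len(fields)
    filled = sum(1 for r in records for f in fields if is_filled(r.get(f)))
    return 100.0 * filled / total


# Hypothetical sample data; the values are invented for illustration.
FIELDS = ["DOMAIN", "STRAIN", "LATITUDE", "LONGITUDE"]
RECORDS = [
    {"DOMAIN": "Bacteria", "STRAIN": "example-1", "LATITUDE": None, "LONGITUDE": None},
    {"DOMAIN": "Archaea", "STRAIN": "", "LATITUDE": "36.1", "LONGITUDE": "-115.2"},
]

if __name__ == "__main__":
    for rec in RECORDS:
        print(record_mci(rec, FIELDS))
    print(field_mci(RECORDS, "STRAIN"))
    print(collection_mci(RECORDS, FIELDS))
```

The same per-record score could then drive a filter such as `[r for r in RECORDS if record_mci(r, FIELDS) > 50]`. What counts as "filled" (placeholders such as "unknown", for instance) is a policy decision that this sketch does not settle.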
The list of all selected metadata fields in GOLD (columns 2 and 6)1
| Field | Records | MCI % | | Field | Records | MCI % | |
|---|---|---|---|---|---|---|---|
| GOLD STAMP ID | 13,786 | 100 | | HMP FINISHING GOAL2 | 2,472 | 17.93 | |
| DISPLAY NAME | 13,786 | 100 | | ENERGY SOURCES | 2,467 | 17.89 | |
| NCBI TAXON ID | 13,786 | 100 | | ASSEMBLY METHOD | 2,235 | 16.21 | |
| DOMAIN | 13,786 | 100 | | HMP ISOLATION BODY SITE2 | 2,169 | 15.73 | |
| AVAILABILITY | 13,786 | 100 | | GREENGENES ID | 2,146 | 15.57 | |
| GOLD GENUS | 13,785 | 99.99 | | PROJECT DESCRIPTION | 2,122 | 15.39 | |
| PROJECT TYPE | 13,784 | 99.99 | | PUBLICATION LINK | 2,062 | 14.96 | |
| PROJECT STATUS | 13,784 | 99.99 | | HMP NCBI SUBMISSION STATUS2 | 1,948 | 14.13 | |
| NCBI SUPERKINGDOM | 13,782 | 99.97 | | HMP PROJECT STATUS2 | 1,948 | 14.13 | |
| GOLD PHYLUM | 13,778 | 99.94 | | HMP ID2 | 1,946 | 14.12 | |
| PROPOSAL NAME | 13,761 | 99.82 | | ISOLATION SOURCE | 1,884 | 13.67 | |
| GOLD SPECIES | 13,734 | 99.62 | | SEQUENCING STATUS LINK | 1,849 | 13.41 | |
| NCBI PHYLUM | 13,526 | 98.11 | | GENE CALLING METHOD | 1,811 | 13.14 | |
| NCBI GENUS | 13,506 | 97.97 | | LONGITUDE | 1,631 | 11.83 | |
| NCBI ORDER | 13,435 | 97.45 | | LATITUDE | 1,629 | 11.82 | |
| NCBI SPECIES | 13,359 | 96.90 | | HMP ISOLATE SOURCE2 | 1,482 | 10.75 | |
| NCBI FAMILY | 13,135 | 95.28 | | BEI STATUS2 | 1,355 | 9.83 | |
| NCBI CLASS | 13,063 | 94.76 | | BODY SAMPLE SUBSITES | 1,236 | 8.97 | |
| SEQUENCING STATUS | 12,498 | 90.66 | | 16S ID | 1,195 | 8.67 | |
| STRAIN | 12,480 | 90.53 | | BIOSAFETY LEVEL | 1,154 | 8.37 | |
| SEQUENCING COUNTRY | 12,326 | 89.41 | | ISOLATION DATE | 1,080 | 7.83 | |
| SEQUENCING CENTER | 11,837 | 85.86 | | HMP ISOLATION COMMENTS2 | 1,052 | 7.63 | |
| NCBI PROJECT ID | 10,358 | 75.13 | | NUMBER OF READS | 1,048 | 7.60 | |
| UPDATE DATE | 10,247 | 74.33 | | ORGANISM COMMENTS | 948 | 6.88 | |
| RELEVANCE | 9,993 | 72.49 | | METABOLISM | 947 | 6.87 | |
| CONTACT NAME | 8,413 | 61.03 | | ISOLATION COMMENTS | 874 | 6.34 | |
| HABITATS | 7,979 | 57.88 | | LIBRARY METHOD | 778 | 5.64 | |
| TEMPERATURE RANGE | 7,673 | 55.66 | | SEROVAR | 774 | 5.61 | |
| GRAM STAIN | 7,341 | 53.25 | | BODY PRODUCTS | 723 | 5.24 | |
| BIOTIC RELATIONSHIP | 7,147 | 51.84 | | HOST HEALTH | 712 | 5.16 | |
| CONTACT EMAIL | 7,037 | 51.04 | | STRAIN INFO ID | 691 | 5.01 | |
| OXYGEN REQUIREMENT | 7,028 | 50.98 | | HMP ISOLATION COMMENTS2 | 690 | 5.01 | |
| CELL SHAPE | 6,748 | 48.95 | | HMP ISOLATION BODY SUBSITE2 | 681 | 4.94 | |
| DISEASES | 6,661 | 48.32 | | SYMBIOTIC RELATIONSHIP | 493 | 3.58 | |
| MOTILITY | 6,275 | 45.52 | | SHORT READ ARCHIVE ID | 475 | 3.45 | |
| HOST NAME | 5,807 | 42.12 | | INFORMATION URL | 465 | 3.37 | |
| SEQUENCING METHODS | 5,636 | 40.88 | | PH | 441 | 3.20 | |
| ISOLATION SITE | 5,388 | 39.08 | | IMAGE URL | 415 | 3.01 | |
| SPORULATION | 5,187 | 37.63 | | VECTOR | 380 | 2.76 | |
| HOST TAXON ID | 5,131 | 37.22 | | SYMBIONT | 348 | 2.52 | |
| GENOME SIZE | 4,706 | 34.14 | | SYMBIOTIC INTERACTION | 344 | 2.50 | |
| COMPLETION DATE | 4,585 | 33.26 | | ISOLATION PUBMED ID | 339 | 2.46 | |
| CULTURE COLLECTION | 4,212 | 30.55 | | HOST GENDER | 323 | 2.34 | |
| CELL ARRANGEMENTS | 4,126 | 29.93 | | DEPTH | 308 | 2.23 | |
| PHENOTYPES | 4,045 | 29.34 | | SALINITY | 281 | 2.04 | |
| GC PERC | 3,693 | 26.79 | | HOST AGE | 250 | 1.81 | |
| GENE COUNT | 3,556 | 25.79 | | ISOLATION METHOD | 238 | 1.73 | |
| IN IMG DATABASE | 3,453 | 25.05 | | CELL DIAMETER | 233 | 1.69 | |
| PUBLICATION JOURNAL | 3,395 | 24.63 | | CELL LENGTH | 189 | 1.37 | |
| SEQUENCING QUALITY | 3,286 | 23.84 | | COLOR | 157 | 1.14 | |
| GEO LOCATION | 3,265 | 23.68 | | ALTITUDE | 94 | 0.68 | |
| TYPE STRAIN | 3,248 | 23.56 | | HOST RACE | 72 | 0.52 | |
| COVERAGE | 3,246 | 23.55 | | HOST COMMENTS | 50 | 0.36 | |
| BODY SAMPLE SITES | 3,225 | 23.39 | | PROJECT COMMENTS | 38 | 0.28 | |
| ISOLATION COUNTRY | 3,140 | 22.78 | | SYMBIONT TAXON ID | 36 | 0.26 | |
| TEMPERATURE OPTIMUM | 2,712 | 19.67 | | NCBI ARCHIVE ID | 10 | 0.07 | |
| CONTIG COUNT | 2,472 | 17.93 | | | | | |
1 with the number of records for each of them (columns 3 and 7), and the MCI % (columns 4 and 8), ordered by the field with highest MCI. Rows in gray belong to the MIGS minimum information checklist that extends what is captured by the INSDC [4] (i.e. full taxonomy is not captured since a reference to a valid NCBI taxid is expected).
Comparison of MCI scores from the GOLD database. 1
| | Field set | Fields | Records | Total Fields | Filled Fields | MCI % |
|---|---|---|---|---|---|---|
| | CORE | 103 | 256 | 26,368 | 14,287 | 54.18 |
| | CORE | 103 | 2,040 | 211,253 | 109,532 | 52.00 |
| | CORE | 103 | 2,096 | 215,888 | 87,007 | 39.91 |
| | CORE | 103 | 13,790 | 1,420,370 | 522,850 | 37.00 |
| | CORE | 103 | 340 | 35,020 | 16,767 | 48.00 |
| | CORE | 103 | 11,233 | 1,156,999 | 443,474 | 38.00 |
| | CORE | 103 | 2,217 | 228,351 | 62,609 | 27.00 |
| | MIGS | 12 | 256 | 3,072 | 2,102 | 68.43 |
| | MIGS | 12 | 2,040 | 24,612 | 14,667 | 59.59 |
| | MIGS | 12 | 2,096 | 25,152 | 9,642 | 38.34 |
| | MIGS | 12 | 13,790 | 165,480 | 62,564 | 37.81 |
| | HMP | 10 | 2,096 | 20,960 | 14,673 | 70.00 |
| | 2008 | 33 | 2,905 | 95,865 | 59,097 | 61.65 |
| | 2008 | 33 | 5,843 | 192,819 | 119,881 | 62.17 |
| | 2008 | 33 | 13,790 | 455,070 | 273,805 | 60.17 |
1 Note that if all variables in a database or collection apply to all records, then ‘total fields’ is equal to records multiplied by fields. If some variables are specific to a subset of records then the total number of possible fields will be smaller.
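As a worked check of this note, the first CORE row of the table (103 fields, 256 records, 14,287 filled fields) can be reproduced directly. The symbols $N_{\text{records}}$, $N_{\text{fields}}$ and $N_{\text{filled}}$ are shorthand introduced here for illustration, not notation from the paper, and the formula assumes every field applies to every record.

$$
\mathrm{MCI} = 100 \times \frac{N_{\text{filled}}}{N_{\text{records}} \times N_{\text{fields}}}
= 100 \times \frac{14{,}287}{256 \times 103}
= 100 \times \frac{14{,}287}{26{,}368}
\approx 54.18\%
$$

When some fields apply only to a subset of records, the denominator is the smaller count of applicable fields rather than the full product.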
Figure 2. MCI scores are implemented in the GOLD database. MCI scores can be seen on the GOLDCARDS for each entry and are included in the advanced search option. For example, all entries with an MCI score > 50 are shown on the map below.
The list of the genome projects in GOLD with the top 10 MCI scores
| GOLD ID | | Group | MCI % |
|---|---|---|---|
| Gi05215 | | HMP | 66.95 |
| Gi02825 | | HMP | 66.10 |
| Gc00590 | | RNB | 65.25 |
| Gc00870 | | RNB | 65.25 |
| Gi02071 | | HMP | 64.41 |
| Gi02072 | | HMP | 64.41 |
| Gi02680 | | HMP | 64.41 |
| Gi01716 | | HMP | 64.41 |
| Gc01039 | | RNB | 64.41 |
| Gi02147 | | RNB | 63.56 |
As the discussion above indicates, MCI scores are most useful when applied to large datasets: they give both the average score across all records and the distribution of scores across those records. To demonstrate this, we plot the distribution of per-record MCI scores for the HMP and GEBA datasets. As shown in Figure 3, the HMP dataset currently has a larger number of records with lower MCI scores than the GEBA dataset.
Figure 3. Distribution of the MCI percentages for the GEBA and HMP groups.
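A distribution of this kind can be tabulated from per-record scores with a simple binning step. The sketch below is illustrative only: the function name `mci_histogram`, the 10-point bin width, and the two score lists are assumptions made for the example and are not the GEBA or HMP data.

```python
from collections import Counter
from statistics import mean
from typing import Dict, Iterable


def mci_histogram(scores: Iterable[float], bin_width: float = 10.0) -> Dict[float, int]:
    """Count MCI scores per bin; keys are the lower bound of each bin.

    A score of exactly 100 is placed in the last bin rather than a bin of its own.
    """
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    counts: Counter = Counter()
    for score in scores:
        if not 0.0 <= score <= 100.0:
            raise ValueError(f"MCI score out of range: {score}")
        lower = int(score // bin_width) * bin_width
        if lower >= 100.0:
            lower -= bin_width
        counts[lower] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Hypothetical per-record scores for two groups (invented values).
    group_a = [62.0, 58.5, 64.4, 51.0, 47.5]
    group_b = [22.0, 35.5, 64.4, 18.0, 41.0]
    for name, scores in (("group A", group_a), ("group B", group_b)):
        print(name, round(mean(scores), 2), mci_histogram(scores))
```

Reporting the histogram alongside the mean matters because two collections with similar averages can differ sharply in how many poorly described records they contain.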