Matthias Lange, Blaise T F Alako, Guy Cochrane, Mehmood Ghaffar, Martin Mascher, Pia-Katharina Habekost, Upneet Hillebrand, Uwe Scholz, Florian Schorch, Jens Freitag, Amber Hartman Scholz.
Abstract
BACKGROUND: Linking nucleotide sequence data (NSD) to scientific publication citations can enhance understanding of NSD provenance, scientific use, and reuse in the community. By connecting publications with NSD records, NSD geographical provenance information, and author geographical information, it becomes possible to assess the contribution of NSD and to infer trends in scientific knowledge gain at the global level.
Keywords: Convention on Biological Diversity; Europe PMC; European Nucleotide Archive; data citation; digital sequence information; nucleotide sequence data; text mining
Year: 2021 PMID: 34966925 PMCID: PMC8716361 DOI: 10.1093/gigascience/giab084
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Figure 1. Schematic overview of the data flow of the extract-load-transform (ELT) process used to build the data warehouse from ENA and ePMC datasets. ENA records are parsed (A1), filtered for valid country tags, and fed into the ePMC RESTful API to extract matching secondary publications (B1) by ENA accession or project accession number. Primary publications are linked by ENA record (A2) to the DOI, PMCID, or PMID. The resulting datasets are normalized as the tables ENA_SEQUENCES and PMC_REFERENCES and loaded into the data warehouse (A3, B2). This is complemented by a manually ingested list of the world's countries and economic groups, loaded into the tables COUNTRIES and COUNTRY2GRP, respectively (C1). Finally, SQL queries are applied to generate charts and reports in the Web application.
Figure 2. Table schema of the WiLDSI data warehouse. The table ENA_SEQUENCES comprises metadata of a sequence stored in the EBI ENA database. The attributes “accession” and “project accession” are used to join secondary literature that cites sequences. The attribute “country” refers to the country table to resolve and group country-tagged ENA sequences. The table PMC_REFERENCES consists of all ePMC-published papers that reference an ENA sequence by either accession or project accession, together with references from ENA records to primary publications by DOI, PMID, or PMCID.
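The join described in the Figure 2 caption can be illustrated with a small in-memory SQLite sketch. Only the table names and the attributes "accession", "project accession", and "country" come from the caption; every other column name (`pub_id`, `cited_accession`, `grp`), the helper functions, and all rows are hypothetical placeholders, and the actual WiLDSI schema and database engine may differ.

```python
import sqlite3


def build_demo_warehouse():
    """Create an in-memory database with a simplified, assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE COUNTRIES (country TEXT PRIMARY KEY);
        CREATE TABLE COUNTRY2GRP (country TEXT, grp TEXT);
        CREATE TABLE ENA_SEQUENCES (
            accession         TEXT PRIMARY KEY,
            project_accession TEXT,
            country           TEXT REFERENCES COUNTRIES(country)
        );
        -- pub_id and cited_accession are assumed column names.
        CREATE TABLE PMC_REFERENCES (pub_id TEXT, cited_accession TEXT);
        """
    )
    # Placeholder rows only; none of these identifiers are real records.
    conn.executemany("INSERT INTO COUNTRIES VALUES (?)",
                     [("Country A",), ("Country B",)])
    conn.executemany("INSERT INTO COUNTRY2GRP VALUES (?, ?)",
                     [("Country A", "Group X"), ("Country B", "Group X")])
    conn.executemany("INSERT INTO ENA_SEQUENCES VALUES (?, ?, ?)", [
        ("SEQ0001", "PRJ0001", "Country A"),
        ("SEQ0002", "PRJ0001", "Country A"),
        ("SEQ0003", "PRJ0002", "Country B"),
    ])
    conn.executemany("INSERT INTO PMC_REFERENCES VALUES (?, ?)", [
        ("PUB1", "SEQ0001"),   # cites a sequence accession
        ("PUB2", "PRJ0001"),   # cites a project accession
        ("PUB1", "SEQ0002"),
        ("PUB3", "SEQ0003"),
    ])
    return conn


def publications_per_country(conn):
    """Count distinct citing publications per country tag.

    A publication matches a sequence if it cites either the sequence
    accession or the project accession, as the caption describes.
    """
    rows = conn.execute(
        """
        SELECT s.country, COUNT(DISTINCT r.pub_id)
        FROM ENA_SEQUENCES AS s
        JOIN PMC_REFERENCES AS r
          ON r.cited_accession IN (s.accession, s.project_accession)
        GROUP BY s.country
        ORDER BY s.country
        """
    )
    return rows.fetchall()


if __name__ == "__main__":
    print(publications_per_country(build_demo_warehouse()))
```

Counting distinct publication identifiers avoids double counting a paper that cites both a sequence and its parent project.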
Figure 3. Example of range notation for ENA accession references. Within the selected part of Crabtree et al. [39], the actual number of cited ENA accessions is 35, but the ePMC API matched only 8.
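One way to narrow the gap that Figure 3 illustrates would be to expand range notation into individual accessions before matching. The following is a minimal sketch, not the method used by ePMC or by the authors. It assumes a simplified accession pattern (uppercase letter prefix followed by digits) and a hyphen or en dash as the range separator; the function name and the example range are hypothetical.

```python
import re

# Simplified, assumed pattern: uppercase prefix followed by a digit run.
_ACCESSION = re.compile(r"^([A-Z]+)(\d+)$")


def expand_accession_range(text):
    """Expand 'XX000001-XX000003' into a list of individual accessions.

    A single accession is returned as a one-element list. Ranges whose
    endpoints differ in prefix or digit width are rejected, because
    expanding them would require guessing.
    """
    parts = re.split(r"\s*[-\u2013]\s*", text.strip())
    if len(parts) == 1:
        return [parts[0]]
    if len(parts) != 2:
        raise ValueError(f"not a simple range: {text!r}")

    start, end = _ACCESSION.match(parts[0]), _ACCESSION.match(parts[1])
    if not start or not end:
        raise ValueError(f"unrecognized accession in: {text!r}")
    if start.group(1) != end.group(1) or len(start.group(2)) != len(end.group(2)):
        raise ValueError(f"endpoints do not share a format: {text!r}")

    low, high = int(start.group(2)), int(end.group(2))
    if high < low:
        raise ValueError(f"descending range: {text!r}")

    prefix, width = start.group(1), len(start.group(2))
    # Zero-pad so that expanded accessions keep the original digit width.
    return [f"{prefix}{n:0{width}d}" for n in range(low, high + 1)]


if __name__ == "__main__":
    # Hypothetical range, not taken from Crabtree et al.
    print(expand_accession_range("XX000001-XX000003"))
```

Each expanded accession could then be queried individually, at the cost of more API calls per publication.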
Comparison of ENA accession number query performance of APIs of EBI ePMC, Dimensions, and Lens
| ENA Accession | Hits in ePMC | Hits in Dimensions | Hits in Lens | Overlap Dimensions and ePMC | Overlap Lens and ePMC |
|---|---|---|---|---|---|
| AB076935 | 0 | 6 | 0 | 0 | 0 |
| AB076941 | 0 | 1 | 0 | 0 | 0 |
| EU257628 | 3 | 5 | 0 | 2 | 0 |
| AB326609 | 0 | 1 | 0 | 0 | 0 |
| AM262332 | 0 | 2 | 0 | 0 | 0 |
| EU575854 | 1 | 1 | 0 | 1 | 0 |
| CP039348 | 0 | 1 | 0 | 0 | 0 |
| DQ410599 | 1 | 1 | 0 | 1 | 0 |
| EU293114 | 12 | 19 | 1 | 6 | 1 |
| AY924392 | 10 | 7 | 2 | 6 | 2 |
| EF607913 | 0 | 1 | 0 | 0 | 0 |
| AY768827 | 0 | 1 | 0 | 0 | 0 |
Dimensions queries used the URL pattern https://app.dimensions.ai/discover/publication?search_text=AY924392&search_type=kws&search_field=full_search, whereas Lens queries used the URL pattern https://www.lens.org/lens/scholar/search/results?q=AY924392&preview=true.
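The two URL patterns above can be generated per accession with the standard library. This sketch only builds the search URLs; it does not fetch or parse results, and the function names are hypothetical. The parameter names and values are copied from the patterns quoted above.

```python
from urllib.parse import urlencode

DIMENSIONS_BASE = "https://app.dimensions.ai/discover/publication"
LENS_BASE = "https://www.lens.org/lens/scholar/search/results"


def dimensions_url(accession):
    """Build a Dimensions full-text keyword search URL for one accession."""
    params = {
        "search_text": accession,
        "search_type": "kws",
        "search_field": "full_search",
    }
    return f"{DIMENSIONS_BASE}?{urlencode(params)}"


def lens_url(accession):
    """Build a Lens scholarly search URL for one accession."""
    return f"{LENS_BASE}?{urlencode({'q': accession, 'preview': 'true'})}"


if __name__ == "__main__":
    for acc in ["EU293114", "AY924392"]:
        print(dimensions_url(acc))
        print(lens_url(acc))
```

Using `urlencode` keeps the query string correctly escaped if a search term ever contains characters such as spaces or ampersands.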
Figure 4. Screenshots of the WiLDSI Web Application. It consists of pages for (A) detailed data reports with integrated (B) drill-down to sources, (C) charts of DSI usage scenarios, (D) per-country DSI use and contribution, and so forth.