The NIH Open Citation Collection: A public access, broad coverage resource.

B Ian Hutchins¹, Kirk L Baker¹, Matthew T Davis¹, Mario A Diwersy², Ehsanul Haque¹, Robert M Harriman¹, Travis A Hoppe¹, Stephen A Leicht², Payam Meyer¹, George M Santangelo¹.

Abstract

Citation data have remained hidden behind proprietary, restrictive licensing agreements, which raises barriers to entry for analysts wishing to use the data, increases the expense of performing large-scale analyses, and reduces the robustness and reproducibility of the conclusions. For the past several years, the National Institutes of Health (NIH) Office of Portfolio Analysis (OPA) has been aggregating and enhancing citation data that can be shared publicly. Here, we describe the NIH Open Citation Collection (NIH-OCC), a public access database for biomedical research that is made freely available to the community. This dataset, which has been carefully generated from unrestricted data sources such as MedLine, PubMed Central (PMC), and CrossRef, now underlies the citation statistics delivered in the NIH iCite analytic platform. We have also included data from a machine learning pipeline that identifies, extracts, resolves, and disambiguates references from full-text articles available on the internet. Open citation links are available to the public in a major update of iCite (https://icite.od.nih.gov).

Year: 2019; PMID: 31600197; PMCID: PMC6786512; DOI: 10.1371/journal.pbio.3000385

Source DB: PubMed; Journal: PLoS Biol; ISSN: 1544-9173; Impact factor: 8.029

Background

“If I have seen farther, it is by standing on the shoulders of giants,” wrote Isaac Newton [1]. Science advances the frontier of knowledge by building on the discoveries described in the literature, and the provenance and spread of scientific discoveries are documented in the directed graph of the embedded citations [2,3]. Meta-research, frequently drawing on the historical citation record, seeks to apply scientific methods to further accelerate research by identifying potential improvements to research practices and organization [4,5]. Paywalled citation data remain locked behind restrictive licensing agreements, raising an unnecessary barrier to entry for investigators and blocking the increasingly common practice of data sharing in scientific articles that use this information. This state of affairs prevents many scientists from using comprehensive citation graphs in their research, reduces the robustness and reproducibility of the analyses that do use them, and hinders research in bibliometrics [6]. The Initiative for Open Citations [7,8], which has so far made public an estimated 55% of the reference links between documents indexed in CrossRef, was a crucial step toward opening public access to this structured information. Here, we describe a comprehensive, public domain citation graph for biomedical research made freely available to the community. This citation database is not a static snapshot; as a part of our bibliometrics web service, iCite [9], it will be updated monthly. While developing new science-of-science methodologies [10,11], the NIH Office of Portfolio Analysis (OPA) has carefully pursued unencumbered citation data resources that can be shared publicly. The NIH Open Citation Collection (NIH-OCC) described here underpins the iCite database that distributes citation metrics worldwide. 
Data sources include federally funded resources such as PubMed Central (PMC) [12], MedLine [13], and Entrez [14] from the National Library of Medicine (NLM); the community resource CrossRef [15]; and reference data harvested from full-text scientific papers that have been made freely available on the internet. These open access articles were identified either through explicit journal policies or through third-party aggregators such as Unpaywall [16]. Citation data can be visualized and downloaded through the iCite website [9]; the data can also be accessed via a machine-readable Application Programming Interface (API) or bulk downloads. In a call for open citation data from the scientometrics community [17,18], signatories noted the capacity of open data to improve transparency and reproducibility of analyses. Transparency is also an important goal for the NIH; link-level citation data have been disseminated through PMC, making transparent the flow of knowledge from earlier work to NIH-supported publications. With the release of the NIH-OCC for biomedicine at large, the subsequent work that draws upon NIH-supported discoveries is now visible as well, and barriers to entry for scientometric studies will be reduced. The science-of-science community has illustrated the high value of link-level citation data (as opposed to aggregated citation measures), e.g., using such information to discover principles of citation dynamics [19-21], quantify the influence of model organism research on human studies [22], and predict the transmission of knowledge from basic research into clinical studies [23]. Thus, comprehensive open citation data can both enable the attribution of scientific progress and convey foreknowledge that research discoveries will culminate in downstream applications.

Description

iCite currently draws on PubMed for crucial article metadata, and this information is augmented with citation data from multiple sources. The NLM resolves citations from PMC to PubMed articles, disseminating these through a few resources (Entrez eLink, PMC full-text XML, and MedLine XML). We augment these citations with CrossRef citation data, which are processed by a citation resolver to identify additional PubMed-to-PubMed citations. For publications since 2010, the NIH-OCC has more citation links and is therefore more comprehensive than leading proprietary sources. Among articles published before 2010, however, a subset (typically those published during or before the early 2000s) has not been assigned DOIs and is therefore not captured in the CrossRef dataset. For this reason, we have further augmented these data sources with information from full-text articles that have been made freely available on the internet. We developed a prototype machine learning pipeline, described below, to identify, parse, and resolve references from such full-text articles for inclusion in the NIH-OCC. Finally, once citations are resolved, they are entered into our data processing pipelines for calculating downstream metrics like the Relative Citation Ratio [10] and the Approximate Potential to Translate [23]. At the time of writing (July 2019), the NIH-OCC comprises over 420,000,000 citation links between articles published in PubMed (Fig 1A). The major limitation of the NIH-OCC is that, as part of iCite, it has been developed with a biomedical focus; at present, its citation universe is restricted to PubMed-to-PubMed citation links. The largest contribution comes from CrossRef, followed by the NLM, and finally our prototype machine learning pipeline that extracts references from full-text articles (Fig 1B). Although references from the machine learning pipeline represent a small fraction of the total at present, we expect this to increase over time as new papers are identified and processed. 
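To make the idea of combining several citation sources concrete, the following is a minimal sketch of merging PubMed-to-PubMed links into one de-duplicated set while recording which source supplied each link first. It is an illustration only, not the NIH-OCC implementation: the function names, the toy PMIDs, and the source priority order are all assumptions introduced here.

```python
from collections import Counter

# Assumed priority: when several sources report the same link, the link is
# attributed to the earliest source in this list. The order is illustrative.
SOURCE_PRIORITY = ["CrossRef", "NLM", "ML"]


def merge_citation_links(links_by_source):
    """Merge (citing_pmid, cited_pmid) pairs from several sources.

    Returns a dict mapping each unique link to the source credited with it.
    """
    merged = {}
    for source in SOURCE_PRIORITY:
        for link in links_by_source.get(source, []):
            # setdefault keeps the first (highest-priority) attribution.
            merged.setdefault(link, source)
    return merged


def links_per_source(merged):
    """Count how many unique links are credited to each source."""
    return Counter(merged.values())


# Toy data with made-up PMIDs; some links are reported by several sources.
example = {
    "CrossRef": [(101, 55), (101, 60), (102, 55)],
    "NLM": [(101, 55), (103, 60)],
    "ML": [(103, 60), (104, 12)],
}
merged = merge_citation_links(example)
counts = links_per_source(merged)
print(len(merged), dict(counts))
```

Keeping the attribution alongside each link is what allows a per-source breakdown of the kind summarized in Fig 1B to be recomputed after every update.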
Data can be accessed through the iCite web interface (https://icite.od.nih.gov/; Fig 2), the iCite API (https://icite.od.nih.gov/api), or through bulk downloads (DOI: 10.35092/yhjc.c.4586573).
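As a rough illustration of programmatic access, the sketch below builds a request URL under the API root given above and turns a JSON response into link-level citation pairs. Only the API root comes from the text; the `/pubs` path, the `pmids` query parameter, and the `data`, `pmid`, `references`, and `cited_by` field names are assumptions that should be checked against the current iCite API documentation. The sample payload is invented so that the parsing step can be run without network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ROOT = "https://icite.od.nih.gov/api"  # API root as given in the text


def build_pubs_url(pmids):
    """Build a query URL for a batch of PMIDs (path and parameter assumed)."""
    query = urlencode({"pmids": ",".join(str(p) for p in pmids)}, safe=",")
    return f"{API_ROOT}/pubs?{query}"


def fetch_pubs(pmids, timeout=30):
    """Fetch and decode the JSON response. Requires network access."""
    with urlopen(build_pubs_url(pmids), timeout=timeout) as response:
        return json.load(response)


def citation_links(payload):
    """Extract (citing_pmid, cited_pmid) pairs from a response payload.

    Assumes each record lists its references and citing articles as PMIDs.
    """
    links = set()
    for record in payload.get("data", []):
        pmid = record["pmid"]
        for cited in record.get("references") or []:
            links.add((pmid, cited))
        for citing in record.get("cited_by") or []:
            links.add((citing, pmid))
    return links


# Invented payload in the assumed response shape (toy PMIDs).
SAMPLE_PAYLOAD = {
    "data": [
        {"pmid": 10, "references": [1, 2], "cited_by": [20]},
        {"pmid": 20, "references": [10], "cited_by": None},
    ]
}
links = citation_links(SAMPLE_PAYLOAD)
print(build_pubs_url([31600197]))
print(sorted(links))
```

With network access, `citation_links(fetch_pubs([31600197]))` would be the corresponding live call, assuming the endpoint and field names match.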

Fig 1

Citations in the NIH-OCC.

(A) Citations per year. (B) Citation source by time period. ML, Machine Learning; NLM, National Library of Medicine; OCC, Open Citation Collection.

Fig 2

Screen capture of the iCite web interface to open citation data.

The Open Citations module of iCite displays portfolio-level data in a summary table (top) and charts beneath the table. Charts provide visualization of publications over time (left), total citations per year by the publication year of the referenced article (center left), total citations per year by the publication year of the citing article (center right), and average citations per article in each publication year (right). Article-level information is shown on bottom and includes links to the PubMed records of the citing and referenced papers.

Machine learning pipeline for full-text articles

Our data pipeline starts with the identification of full-text articles that have been made freely available on the internet and do not require an institutional library subscription to access. This was first accomplished by identifying journals with open access policies after an embargo period. We also leveraged the recent efforts of Unpaywall [16], which has identified freely available full text at scale, and included these publications in our dataset. Central to our pipeline is our Citation Resolution Service, which accepts unstructured citation text through an API and returns a matched article along with information about which fields from that paper were present (e.g., journal name, author name, title words). The service takes each citation and tokenizes the input to query the search index and find the publications with the highest percentage of matched terms. The scoring algorithm is then run on the candidates to find the best matches by checking fields such as title, authors, and journal name against the input text. Although a general-purpose pipeline is in development, we initially developed a workflow that trained machine learning models on individual journals in order to take advantage of regularities in Portable Document Format (PDF) formatting. For each journal, our workflow was as follows:

1. Harvest PDFs from open sources and convert to structured XML with the open source Cermine tool [24].
2. For papers in the journal that were NIH funded, generate positive training data from text that resolves to previously matched citations in PMC.
3. Combine this with negative training data sampled from other parts of the PDF to train a Long Short-Term Memory (LSTM) recurrent neural network model that discriminates between reference text and other text in the scientific article.
4. Pass LSTM-identified references to the Citation Resolution Service.
5. To filter out any remaining false positives, use PMC data as gold standards to train a Random Forest on the feedback received from the Citation Resolution Service.

This prototype approach yielded excellent precision and recall (both 0.98) in extracting and resolving references when the models were trained on papers from the same journal. Because we used references previously indexed in PMC as gold standards, any false negatives in that dataset would be flagged as false positives in ours; manual inspection of our false positives indicated that over 90% of these were actually false negatives in the gold standards or resolution to a transient duplicate entry of an article (duplicates are typically later identified and removed by PubMed). For papers identified through Unpaywall, which come from a variety of journals, recall across different batches dropped to between 0.78 and 0.89 while precision remained constant. This occurred because more text was filtered by the LSTM, perhaps indicating additional uncertainty about what references look like in a corpus of papers from a variety of journal formats. Whether identified through aggregation services or via permissive journal policies, however, the rapidly increasing number of freely available full-text articles [16] promises to be a rich source of open citation data going forward.
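The term-matching idea behind the Citation Resolution Service, and the way resolved references are scored against a gold standard, can be sketched in a few lines. This is a simplified stand-in and not the service itself: the toy index, the article IDs, the equal weighting of fields, and both thresholds are assumptions chosen for illustration, and the real scoring algorithm is not described in enough detail here to reproduce.

```python
import re

# Toy search index keyed by hypothetical article IDs (not real PMIDs).
INDEX = {
    1: {"title": "Relative Citation Ratio a new metric that uses citation rates",
        "authors": "Hutchins Yuan Anderson Santangelo",
        "journal": "PLoS Biol"},
    2: {"title": "Science of science",
        "authors": "Fortunato Bergstrom Borner Evans",
        "journal": "Science"},
    3: {"title": "The state of OA",
        "authors": "Piwowar Priem Lariviere",
        "journal": "PeerJ"},
}


def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def field_overlap(citation_tokens, record):
    """Per field, the fraction of the field's tokens found in the citation."""
    overlap = {}
    for field, value in record.items():
        tokens = tokenize(value)
        overlap[field] = len(tokens & citation_tokens) / len(tokens) if tokens else 0.0
    return overlap


def resolve(citation, index=INDEX, threshold=0.6):
    """Return (best article ID, fields judged present), or (None, [])."""
    citation_tokens = tokenize(citation)
    best_id, best_score, best_overlap = None, 0.0, {}
    for article_id, record in index.items():
        overlap = field_overlap(citation_tokens, record)
        score = sum(overlap.values()) / len(overlap)  # assumed equal weights
        if score > best_score:
            best_id, best_score, best_overlap = article_id, score, overlap
    if best_score < threshold:  # assumed cut-off for rejecting weak matches
        return None, []
    return best_id, [f for f, v in best_overlap.items() if v >= 0.5]


def precision_recall(predicted, gold):
    """Precision and recall of predicted links against gold-standard links."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


CITATIONS = [
    "Hutchins BI, Yuan X, Anderson JM, Santangelo GM. Relative Citation Ratio "
    "(RCR): a new metric that uses citation rates. PLoS Biol. 2016",
    "Fortunato S, et al. Science of science. Science 2018",
    "Smith J. An unrelated book chapter. Academic Press, 1999",
]
resolved = [resolve(text) for text in CITATIONS]
# Links from a hypothetical citing article 99 to each resolved reference.
predicted = {(99, article_id) for article_id, _ in resolved if article_id is not None}
gold = {(99, 1), (99, 3)}  # invented gold standard that omits the link to 2
precision, recall = precision_recall(predicted, gold)
print(resolved, precision, recall)
```

In this toy run the link to article 2 is resolved correctly but is missing from the invented gold standard, so it is counted against precision. This mirrors the situation described above, in which many apparent false positives turned out to be false negatives in the gold-standard data.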

Future directions

Our prototype machine learning pipeline currently re-trains for each data source. We are building a general-purpose, deep learning reference parser that can both take advantage of recognizable formatting regularities and gracefully handle unknown full-text formats. The comprehensiveness of citation coverage in iCite will also benefit from planned future updates that will incorporate reference data from expanded open data sources such as preprint servers and fields of research not indexed in PubMed. We are also engaged in research to predict, in the absence of full text information, which articles are likely to be referenced, based on information present in the local network structure. Finally, we are using the NIH-OCC to develop new artificial intelligence (AI) approaches to generate a high-resolution map of the biomedical research landscape, identify emerging areas, and improve the effectiveness of data-driven decision-making at the NIH. Outside the NIH, open citations may help to power tools and services that do not yet exist, such as next-generation literature recommendation engines. Using the NIH-OCC as the source of citation data means that this work will be transparent and reproducible.
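One generic way to guess missing references from local network structure is a common-neighbour heuristic: articles that share known references with a target article are likely to share others. The sketch below illustrates that heuristic only; it is not the approach under development at OPA, which is not described here, and the function name and toy links are assumptions.

```python
from collections import Counter, defaultdict


def likely_references(article, links, top_n=3):
    """Rank candidate references for `article` by co-citation evidence.

    `links` is an iterable of (citing, cited) pairs. A candidate scores one
    point each time another article that cites one of `article`'s known
    references also cites the candidate.
    """
    refs_of = defaultdict(set)
    citers_of = defaultdict(set)
    for citing, cited in links:
        refs_of[citing].add(cited)
        citers_of[cited].add(citing)

    known = refs_of[article]
    scores = Counter()
    for ref in known:
        for neighbour in citers_of[ref]:
            if neighbour == article:
                continue
            for candidate in refs_of[neighbour]:
                if candidate not in known and candidate != article:
                    scores[candidate] += 1
    return scores.most_common(top_n)


# Toy graph with made-up IDs: article 10 is known to cite 1 and 2.
LINKS = [
    (10, 1), (10, 2),
    (11, 1), (11, 2), (11, 3),
    (12, 1), (12, 3),
    (13, 2), (13, 4),
]
ranking = likely_references(10, LINKS)
print(ranking)  # [(3, 3), (4, 1)]
```

Article 3 ranks first because it is cited by both articles that overlap most with article 10's known references, whereas article 4 is supported by a single weaker neighbour.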

12 in total

1. Quantifying the evolution of individual scientific impact.

Authors: Roberta Sinatra; Dashun Wang; Pierre Deville; Chaoming Song; Albert-László Barabási
Journal: Science Date: 2016-11-04 Impact factor: 47.728

2. Funders should mandate open citations.

Authors: David Shotton
Journal: Nature Date: 2018-01-11 Impact factor: 49.962

Review 3. Fundamental science behind today's important medicines.

Authors: Jonathan M Spector; Rosemary S Harrison; Mark C Fishman
Journal: Sci Transl Med Date: 2018-04-25 Impact factor: 17.956

Review 4. Science of science.

Authors: Santo Fortunato; Carl T Bergstrom; Katy Börner; James A Evans; Dirk Helbing; Staša Milojević; Alexander M Petersen; Filippo Radicchi; Roberta Sinatra; Brian Uzzi; Alessandro Vespignani; Ludo Waltman; Dashun Wang; Albert-László Barabási
Journal: Science Date: 2018-03-02 Impact factor: 47.728

5. Meta-research: Evaluation and Improvement of Research Methods and Practices.

Authors: John P A Ioannidis; Daniele Fanelli; Debbie Drake Dunne; Steven N Goodman
Journal: PLoS Biol Date: 2015-10-02 Impact factor: 8.029

6. Relative Citation Ratio (RCR): A New Metric That Uses Citation Rates to Measure Influence at the Article Level.

Authors: B Ian Hutchins; Xin Yuan; James M Anderson; George M Santangelo
Journal: PLoS Biol Date: 2016-09-06 Impact factor: 8.029

7. Additional support for RCR: A validated article-level measure of scientific influence.

Authors: B Ian Hutchins; Travis A Hoppe; Rebecca A Meseroll; James M Anderson; George M Santangelo
Journal: PLoS Biol Date: 2017-10-02 Impact factor: 8.029

8. The state of OA: a large-scale analysis of the prevalence and impact of Open Access articles.

Authors: Heather Piwowar; Jason Priem; Vincent Larivière; Juan Pablo Alperin; Lisa Matthias; Bree Norlander; Ashley Farley; Jevin West; Stefanie Haustein
Journal: PeerJ Date: 2018-02-13 Impact factor: 2.984

9. Predicting translational progress in biomedical research.

Authors: B Ian Hutchins; Matthew T Davis; Rebecca A Meseroll; George M Santangelo
Journal: PLoS Biol Date: 2019-10-10 Impact factor: 8.029

10. Large-scale investigation of the reasons why potentially important genes are ignored.

Authors: Thomas Stoeger; Martin Gerlach; Richard I Morimoto; Luís A Nunes Amaral
Journal: PLoS Biol Date: 2018-09-18 Impact factor: 8.029
