Jian Xu1, Sunkyu Kim2, Min Song3, Minbyul Jeong2, Donghyeon Kim2, Jaewoo Kang2, Justin F Rousseau4, Xin Li5, Weijia Xu6, Vetle I Torvik7, Yi Bu8, Chongyan Chen5, Islam Akef Ebeid5, Daifeng Li9, Ying Ding10,11.
Abstract
PubMed® is an essential resource for the medical domain, but useful concepts are either difficult to extract or are ambiguous, which has significantly hindered knowledge discovery. To address this issue, we constructed a PubMed knowledge graph (PKG) by extracting bio-entities from 29 million PubMed abstracts, disambiguating author names, integrating funding data through the National Institutes of Health (NIH) ExPORTER, collecting affiliation history and educational background of authors from ORCID®, and identifying fine-grained affiliation data from MapAffil. Through the integration of these credible multi-source data, we could create connections among the bio-entities, authors, articles, affiliations, and funding. Data validation revealed that the BioBERT deep learning method of bio-entity extraction significantly outperformed the state-of-the-art models based on the F1 score (by 0.51%), with the author name disambiguation (AND) achieving an F1 score of 98.09%. PKG can trigger broader innovations, not only enabling us to measure scholarly impact, knowledge usage, and knowledge transfer, but also assisting us in profiling authors and organizations based on their connections with bio-entities.
Year: 2020 PMID: 32591513 PMCID: PMC7320186 DOI: 10.1038/s41597-020-0543-2
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Fig. 1 Bio-entity integration framework for PKG.
Multi-type normalization model and dictionaries.
| Entity types | Normalization models | Dictionaries | # of IDs | # of names | Avg. # of names per ID |
|---|---|---|---|---|---|
| Gene/Protein | GNormPlus | Entrez Gene | 139,375 | 248,581 | 1.8 |
| Disease | Sieve-based entity linking | MeSH | 32,954 | 172,650 | 5.2 |
| Drug/Chemical | tmChem without Ab3P | MeSH | 518,223 | 2,571,570 | 5.0 |
| Species | Dictionary lookup | NCBI Taxonomy | 398,037 | 3,119,005 | 7.8 |
| Mutation | tmVar 2.0 | dbSNP | 208,474 | 302,498 | 1.5 |
| Total | | | 1,297,063 | 6,414,304 | 4.9 |
Fig. 2 Entities and relationships in PKG.
Date coverage and version information of data sources.
| Data Source | Start Year | End Year | Version Information |
|---|---|---|---|
| PubMed 2019 baseline files | 1781 | 2018 | The PubMed 2019 baseline files were released in December 2018. They also include 13,097 papers published after 2018, the majority of which are preprints. |
| Author-ity dataset | 1865 | 2008 | The dataset was generated from the PubMed 2009 baseline files. It also includes AND results for 93,228 papers published after 2008, the majority of which are preprints. |
| Semantic Scholar dataset | 1786 | 2019 | The dataset was released on January 31, 2019. |
| NIH ExPORTER dataset | 1985 | 2018 | The articles linked to projects span 1981 to 2018, and project details cover 1985 to 2018. The dataset was downloaded in June 2018. |
| Employment History Data from ORCID | 1913 | 2018 | The dataset was released on October 22, 2018. ORCID publishes the data once per year. |
| Educational Background Data from ORCID | 1913 | 2018 | The dataset was released on October 22, 2018. ORCID publishes the data once per year. |
| MapAffil 2016 dataset | 1975 | 2017 | The dataset is based on a snapshot of PubMed taken in the first week of October 2016, and was released on April 5, 2018. |
| Affiliation Parser Library | 1786 | 2019 | A fast and simple parser for MEDLINE and PubMed Open-Access affiliation strings, published on March 15, 2018. We applied it to parse multiple fields from the affiliation string, including department, institution, zip code, location, and country. |
Dataset details.
| File | # of Lines | # of Distinct PMIDs | # of Distinct AND_IDs | Short description |
|---|---|---|---|---|
| Author_List | 114,345,178 | 28,510,300 | 14,830,461 | CSV file containing PubMed authors and AND_IDs. |
| Bio-entities_Main | 330,394,494 | 18,361,409 | — | CSV file containing all types of extracted bio-entities by BioBERT. |
| Bio-entities_Mutation | 1,388,341 | 312,099 | — | CSV file containing additional items of mutations from Bio-entities_Main file. |
| Affiliations | 46,065,099 | 19,601,383 | 8,300,984 | CSV file containing affiliations and their extracted fine-grained items. |
| Researcher_Employment | 532,356 | — | 276,483 | CSV file containing employment history from ORCID. |
| Researcher_Education | 512,267 | — | 268,610 | CSV file containing educational background from ORCID. |
| NIH_Projects | 12,340,431 | 1,790,949 | 102,070 | CSV file containing projects from NIH ExPORTER and the mapping between PI_ID, PMID, and AND_ID. |
Note: In file Author_List, about 1.3 million (1.15%) author instances cannot be disambiguated because they do not exist in Author-ity or Semantic Scholar dataset. Therefore, their AND_ID field values were set to zero.
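
The note above means that an AND_ID of zero marks an author instance that was not disambiguated, so it should not be counted as a real author. The following is a minimal sketch of how a reader might separate the two groups, assuming a comma-separated file with a header row that uses the column names from the Author_List record description. The sample rows are invented for illustration, and treating an empty AND_ID the same as zero is an additional assumption.

```python
import csv
import io

# Hypothetical sample rows (not taken from PKG); column names follow the
# Author_List record description.
SAMPLE = """id,PMID,AND_ID,AuOrder,LastName
1,100,501,1,Smith
2,100,0,2,Lee
3,101,501,1,Smith
4,102,,1,Garcia
"""


def split_by_disambiguation(rows):
    """Separate rows with a real AND_ID from rows whose AND_ID is 0 or empty."""
    resolved, unresolved = [], []
    for row in rows:
        and_id = row["AND_ID"].strip()
        # Assumption: both "0" and an empty field mean "not disambiguated".
        if and_id in ("", "0"):
            unresolved.append(row)
        else:
            resolved.append(row)
    return resolved, unresolved


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
resolved, unresolved = split_by_disambiguation(rows)

# Count distinct authors only among disambiguated instances.
distinct_authors = {int(r["AND_ID"]) for r in resolved}
print(len(resolved), len(unresolved), len(distinct_authors))
```

For the full file, the same function can be fed directly from `csv.DictReader(open(path, newline=""))` so that rows are streamed rather than held in memory; the path and delimiter would need to match the actual download.
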
Statistics of extracted entities.
| | Species | Disease | Gene/Protein | Drug/Chemical | Mutation |
|---|---|---|---|---|---|
| Total number of extracted entities | 65,737,425 | 98,865,897 | 81,035,640 | 83,367,191 | 1,388,341 |
| Distinct PMIDs for each type | 13,717,884 | 12,708,292 | 7,914,735 | 9,681,294 | 312,099 |
| Distinct entities for each type | 84,203 | 36,704 | 25,489 | 134,574 | 208,466 |
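
As a quick consistency check on the table above, the per-type totals can be summed and compared with the 330,394,494 lines listed for Bio-entities_Main in the dataset details. The sketch below only redoes that arithmetic with the figures copied from the table; the variable names are illustrative.

```python
# Totals copied from the "Total number of extracted entities" row.
extracted = {
    "Species": 65_737_425,
    "Disease": 98_865_897,
    "Gene/Protein": 81_035_640,
    "Drug/Chemical": 83_367_191,
    "Mutation": 1_388_341,
}

total = sum(extracted.values())

# Share of each entity type among all extracted mentions, in percent.
shares = {name: 100 * count / total for name, count in extracted.items()}

print(f"{total:,}")
for name, share in shares.items():
    print(f"{name}: {share:.2f}%")
```
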
Data type for records of Author_List.
| Index | Format | # of Lines with non-empty values | Short description |
|---|---|---|---|
| id | Integer | 114,345,178 | Unique ID for each author instance. |
| PMID | Integer | 114,345,178 | Unique ID assigned by PubMed to identify PubMed articles. |
| AND_ID | Integer | 109,245,192 | Unique author ID allocated by AND. |
| AuOrder | Integer | 114,345,178 | Position of the current author in the author list of the current article. |
| LastName | String | 114,130,643 | Last name of the current author. |
| ForeName | String | 113,452,639 | First name of the current author. |
| Initials | String | 114,007,764 | Middle initials of the current author. |
| Suffix | String | 513,508 | Suffix name of the current author. |
| AuNum | Integer | 114,345,178 | Number of co-authors of the current article. |
| PubYear | Integer | 114,345,178 | Publication year of the current article. |
| BeginYear | Integer | 109,245,192 | Publication year of the current author’s first article. |
Data type for records of NIH_Projects.
| Index | Format | # of Lines with non-empty values | Short description |
|---|---|---|---|
| id | Integer | 12,340,431 | Unique ID for each project instance. |
| AND_ID | Integer | 11,013,198 | Unique author ID allocated by AND. |
| PI_ID | String | 12,340,431 | Unique PI ID allocated by NIH. |
| PMID | Integer | 12,340,431 | Unique ID assigned by PubMed to identify PubMed articles. |
| ProjectNumber | String | 12,340,431 | Project number of the current project. |
| subProjectNumber | String | 9,438,420 | Subproject number of the current project. |
| PI_Name | String | 12,340,431 | Full name of a PI. |
Test results of biomedical NER.
| Entity Type | Datasets | Metrics | State-of-the-art | BERT (Wiki + Books) | BioBERT (+PubMed + PMC) |
|---|---|---|---|---|---|
| Disease | NCBI disease[ | P % | 86.41 | 84.12 | |
| R % | 88.31 | 87.19 | |||
| F % | 87.34 | 85.63 | |||
| 2010 i2b2/VA[ | P % | 84.04 | |||
| R % | 84.08 | 85.44 | |||
| F % | 84.06 | ||||
| BC5CDR[ | P % | 85.61 | 81.97 | ||
| R % | 82.61 | 82.48 | |||
| F % | 84.08 | 82.41 | |||
| Drug/Chemical | BC5CDR[ | P % | 90.94 | ||
| R % | 92.38 | 91.38 | |||
| F % | 91.16 | ||||
| BC4CHEMD[ | P % | 91.30 | 91.19 | ||
| R % | 87.53 | 88.92 | |||
| F % | 89.37 | 90.04 | |||
| Gene/Protein | BC2GM[ | P % | 81.81 | 81.17 | |
| R % | 81.57 | 82.42 | |||
| F % | 81.69 | 81.79 | |||
| JNLPBA[ | P % | 69.57 | |||
| R % | 81.20 | ||||
| F % | 74.94 | ||||
| Species | LINNAEUS[ | P % | 91.17 | ||
| R % | 84.30 | ||||
| F % | 87.6 | ||||
| Species·800[ | P % | 69.35 | |||
| R % | 74.05 | ||||
| F % | 71.63 | ||||
| Average | P % | 82.61 | |||
| R % | 84.00 | ||||
| F % | 83.25 | ||||
Data type for records of Researcher_Education.
| Index | Format | # of Lines with non-empty values | Short description |
|---|---|---|---|
| id | Integer | 512,267 | Unique ID for each scholar’s education instance. |
| AND_ID | Integer | 512,267 | Unique author ID allocated by AND. |
| ORCID | String | 512,267 | Unique researcher ID that distinguishes the researcher from others. |
| BeginYear | String | 453,122 | The beginning year of the researcher’s education. |
| Organization | String | 512,267 | The organization where the researcher was educated. |
| City | String | 512,267 | The city where the researcher works. |
| Region | String | 378,188 | The region where the researcher works. |
| Country | String | 512,267 | The country where the researcher works. |
| Identifier | String | 410,239 | The identifier of an organization. |
| IdSource | String | 410,239 | The provider of the organization’s identifier. |
| EndYear | String | 440,750 | The end year of the researcher’s education. |
| Role | String | 487,218 | The degree that the researcher received. |
Performance of the multi-type normalization model.
| Entity type | Normalization model | Test sets | Precision % | Recall % | F1 score % | Accuracy % |
|---|---|---|---|---|---|---|
| Gene/Protein | GNormPlus | BC2 Gene Normalization, human species | 87.1 | 86.4 | 86.7 | — |
| Gene/Protein | GNormPlus | BC3 Gene Normalization, multispecies | — | — | 50.1 | — |
| Disease | Sieve-based entity linking | ShARe/CLEF eHealth Challenge corpus | — | — | — | 90.75 |
| Disease | Sieve-based entity linking | NCBI disease | — | — | — | 84.65 |
| Mutation | tmVar 2.0 | OSIRISv1.2 | 97.20 | 80.62 | 88.14 | — |
| Mutation | tmVar 2.0 | Thomas | 89.94 | 88.24 | 89.08 | — |
| Species | Dictionary lookup of SR4GN | BioCreative III GN | — | — | 46.91 | — |
Note: There are empty cells in the table because GNormPlus and tmVar 2.0 did not report their accuracies, the sieve-based entity linking model only reported its accuracy, and SR4GN only reported its F1 score. The authors of tmChem did not report the normalization performance of tmChem independently, so there were no performance data for Drug/Chemical.
Fig. 3 Calculation of Precision, Recall, and F1 Score.
Evaluation results of AND.
| | Precision | Recall | F1 score |
|---|---|---|---|
| Author-ity | 99.43% | 96.92% | 98.16% |
| Semantic Scholar | 96.24% | 97.66% | 96.94% |
| AND Integration | 98.62% | 97.56% | 98.09% |
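
The F1 scores in the table above can be reproduced from the reported precision and recall, assuming the standard definition of F1 as the harmonic mean of the two. The sketch below is only a check on the published figures; the function name is illustrative, and the evaluation procedure itself is not shown here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the AND evaluation table.
reported = {
    "Author-ity": (99.43, 96.92, 98.16),
    "Semantic Scholar": (96.24, 97.66, 96.94),
    "AND Integration": (98.62, 97.56, 98.09),
}

for name, (p, r, f1) in reported.items():
    computed = f1_score(p, r)
    print(f"{name}: computed {computed:.2f}, reported {f1:.2f}")
```

With these inputs the computed values round to the reported ones for all three rows.
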
Fig. 4 Trends over time of researcher-centric and bio-entity-centric activity.
Fig. 5 Bipartite network analysis of coronavirus.
| Measurement(s) | textual entity • author information textual entity • funding source declaration textual entity • abstract • Biologic Entity Classification |
| Technology Type(s) | machine learning • computational modeling technique |
Data type for records of Bio_entities_Main.
| Index | Format | # of Lines with non-empty values | Short description |
|---|---|---|---|
| id | Integer | 330,394,594 | Unique ID for each bio-entity instance. |
| PMID | Integer | 330,394,594 | Unique ID assigned by PubMed to identify PubMed articles. |
| Start | Integer | 330,394,594 | Start position of mention in an abstract. |
| End | Integer | 330,394,594 | End position of mention in an abstract. |
| Mention | String | 330,394,594 | Entity mentioned in an abstract. |
| EntityID | Integer | 265,304,264 | Normalized entity ID. |
| Type | String | 330,394,594 | Enumerated type of entity; values include species, disease, gene, drug, and mutation. |
Data type for records of Bio_entities_Mutation.
| Index | Format | # of Lines with non-empty values | Short description |
|---|---|---|---|
| Main_id | Integer | 1,388,341 | Foreign key references from Bio-entities_Main (id). |
| Mention | String | 1,388,341 | Mutation entity mentioned in the abstract. |
| MutationType | String | 1,388,341 | Type of the mutation. |
| NormalizedName | String | 1,388,341 | Normalized name of the mutation. |
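
Because Main_id is a foreign key into the id column of the main bio-entity file, each mutation record can be joined back to its article and text span. The following is a minimal sketch of that join, assuming comma-separated files with header rows named as in the two record descriptions. All sample rows and field values are invented placeholders and do not reflect the real contents or value formats of PKG.

```python
import csv
import io

# Hypothetical sample data; only the column names come from the record descriptions.
MAIN = """id,PMID,Start,End,Mention,EntityID,Type
10,111,5,11,mentionA,,mutation
11,111,20,24,mentionB,673,gene
12,222,0,6,mentionC,,mutation
"""
MUTATION = """Main_id,Mention,MutationType,NormalizedName
10,mentionA,typeA,nameA
12,mentionC,typeC,nameC
99,mentionZ,typeZ,nameZ
"""

# Index the main records by their primary key.
main_by_id = {row["id"]: row for row in csv.DictReader(io.StringIO(MAIN))}

joined = []
for mut in csv.DictReader(io.StringIO(MUTATION)):
    main = main_by_id.get(mut["Main_id"])
    if main is None:
        continue  # skip mutation rows whose Main_id has no match
    joined.append({
        "PMID": int(main["PMID"]),
        "Start": int(main["Start"]),
        "End": int(main["End"]),
        "Mention": mut["Mention"],
        "MutationType": mut["MutationType"],
        "NormalizedName": mut["NormalizedName"],
    })

for record in joined:
    print(record)
```

An in-memory dictionary is only practical for small samples. With hundreds of millions of lines in the main file, loading both files into a database and joining on an indexed key is likely the more workable route.
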
Data type for records of Affiliations.
| Index | Format | # of Lines with non-empty values | Short description |
|---|---|---|---|
| id | Integer | 46,065,099 | Unique ID for each affiliation. |
| PMID | Integer | 46,065,099 | Unique ID assigned by PubMed to identify PubMed articles. |
| AuOrder | Integer | 46,065,099 | Author order of the current author in the author list of the current article. |
| AND_ID | Integer | 42,242,447 | Unique author ID allocated by AND. |
| AffiliationOrder | Integer | 46,065,099 | Affiliation order in the affiliation list of the current author. |
| Affiliation | String | 42,676,487 | Affiliation string. |
| Department | String | 29,438,469 | The department that the author belongs to. |
| Institution | String | 38,955,031 | The institution that the author belongs to. |
| | String | 8,092,262 | The author’s email address. |
| ZipCode | String | 16,573,810 | The postcode of this affiliation. |
| Location | String | 42,590,482 | The address of the affiliation. |
| Country | String | 39,536,798 | The country that the author belongs to. |
| City | String | 32,151,044 | The city that the author belongs to. |
| State | String | 31,910,547 | The state that the author belongs to. |
| AffiliationType | String | 35,706,926 | Enumerated type of affiliation; values include COM, EDU, EDU-HOS, GOV, HOS, MIL, ORG, and UNK. |
| Latitude | Float | 36,371,281 | The latitude of the affiliation. |
| Longitude | Float | 21,679,300 | The longitude of the affiliation. |
| Fips | Integer | 8,727,595 | FIPS code of the county that includes the geocode. |
Data type for records of Researcher_Employment.
| Index | Format | # of Lines with non-empty values | Short description |
|---|---|---|---|
| id | Integer | 532,356 | Unique ID for each scholar’s employment instance. |
| AND_ID | Integer | 532,356 | Unique author ID allocated by AND. |
| ORCID | String | 532,356 | Unique researcher ID that distinguishes the researcher from others. |
| Department | String | 426,597 | The department which the researcher belongs to. |
| BeginYear | String | 487,183 | The beginning year of the researcher’s employment. |
| Organization | String | 532,356 | The institution which the researcher belongs to. |
| City | String | 532,356 | The city where the researcher works. |
| Region | String | 363,066 | The region where the researcher works. |
| Country | String | 532,356 | The country where the researcher works. |
| Identifier | String | 392,562 | The identifier of an organization. |
| IdSource | String | 392,562 | The provider of the organization’s identifier. |
| EndYear | String | 251,826 | The end year of the researcher’s employment. |