Kevin W Boyack, Caleb Smith, Richard Klavans.
Abstract
Portfolio analysis is a fundamental practice of organizational leadership and is a necessary precursor of strategic planning. Successful application requires a highly detailed model of research options. We have constructed a model, the first of its kind, that accurately characterizes these options for the biomedical literature. The model comprises over 18 million PubMed documents from 1996–2019. Document relatedness was measured using a hybrid citation analysis + text similarity approach. The resulting 606.6 million document-to-document links were used to create 28,743 document clusters and an associated visual map. Clusters are characterized using metadata (e.g., phrases, MeSH) and over 20 indicators (e.g., funding, patent activity). The map and cluster-level data are embedded in Tableau to provide an interactive model enabling in-depth exploration of a research portfolio. Two example usage cases are provided, one to identify specific research opportunities related to coronavirus, and the second to identify research strengths of a large cohort of African American and Native American researchers at the University of Michigan Medical School.
Year: 2020 PMID: 33219227 PMCID: PMC7680135 DOI: 10.1038/s41597-020-00749-y
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Fig. 1 Data and process used to create the PubMed model and associated tools.
Primary data sources, sizes and brief descriptions, 1996–2019.
| Data Source and Version | # Records | # PMID | Description |
|---|---|---|---|
| PubMed, pubmed.ncbi.nlm.nih.gov | 18,214,654 | 18,214,654 | Bibliographic metadata |
| NIH iCite2.0 | 18,214,654 | 18,214,654 | Paper-level metrics (translation, RCR, etc.) |
| Open Citation Collection | 315,512,095 | 12,578,393 | Citation links by PMID |
| OpenCitations (Jan 2020) | 186,399,013 | 7,783,835 | Citation links by DOI |
| PMID Similar Article Scores (top 20, current as of Jan 2020) | 364,534,609 | 18,205,619 | Text-based relatedness scores based on Lin & Wilbur |
| Star Metrics (2008–2018) | 861,170 | n/a | Annual project data (including funding amounts) for NIH, NSF and other US agencies |
| NIH ExPORTER (1996–2018) | 4,224,360 | 1,789,416 | Link tables – PMID to NIH project |
| NSF Awards API (1996–2017) | 566,155 | 149,091 | List of references by NSF project, matched to PMID |
| USPTO non-patent references (2015–2019) | 2,952,584 | 660,581 | Full-text XML of US patents; non-patent references were extracted and matched to PMID |
Properties of the three-level PubMed model.
| | PM5 | PM4 | PM3 |
|---|---|---|---|
| Resolution | 7.75E-05 | 10 | 21.25 |
| Minimum cluster size | 75 | 750 | 7500 |
| # Clusters | 28,743 | 3,074 | 288 |
| Largest cluster | 6052 | 44,163 | 583,979 |
| Ratio largest::smallest | 80.7 | 56.0 | 74.0 |
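As a quick consistency check on the table above, the largest::smallest ratio can be inverted to estimate the size of the smallest cluster at each level. The following is a minimal Python sketch; the dictionary layout and the function name `implied_smallest` are illustrative assumptions, and the numbers are copied from the table.

```python
# Values copied from the table above; the layout is illustrative only.
LEVELS = {
    "PM5": {"min_size": 75, "largest": 6052, "ratio": 80.7},
    "PM4": {"min_size": 750, "largest": 44163, "ratio": 56.0},
    "PM3": {"min_size": 7500, "largest": 583979, "ratio": 74.0},
}


def implied_smallest(largest: int, ratio: float) -> float:
    """Estimate the smallest cluster size from the largest size and the ratio."""
    return largest / ratio


for name, row in LEVELS.items():
    est = implied_smallest(row["largest"], row["ratio"])
    print(f"{name}: implied smallest ~ {est:.0f} (minimum allowed {row['min_size']})")
```

With these inputs the implied smallest cluster sits at or slightly above the minimum cluster size at every level, allowing for rounding of the published ratio.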
Fig. 2 Visual map of the PubMed model showing 28,743 clusters. Each cluster is colored according to its dominant field (see legend).
Fig. 3 Detailed characterization of a single cluster in the Excel workbook.
Description of sheets in the Excel workbook.
| Sheet Name | # of Lines | Description |
|---|---|---|
| CLUST | 28,743 | Cluster positions, metrics and percentiles |
| TRANSP | 28,743 | Transparency metrics by cluster using Stanford data extractions from PMCOA documents, 2015–2019 |
| QUERY | 28,743 | COVID/University query counts by cluster, 2015–2019 |
| COUNT | 28,743 | Annual document counts by cluster, 1996–2019 |
| PHRASE | 280,200 | Top 10 phrases by cluster (rank, phrase, score), 2015–2019 |
| IDIO | 280,200 | Top 10 idiosyncratic (differentiating) phrases by cluster (rank, phrase, score), 2015–2019 |
| MESH | 286,296 | Top 10 MeSH headings by cluster (rank, MeSH, count), 2015–2019 |
| ASJC | 275,606 | Top 10 journal categories by cluster (rank, category, count), 2015–2019 |
| JNL | 234,438 | Top 10 journals/sources by cluster (rank, journal, count), 2015–2019 |
| AUTH | 286,425 | Top 10 authors by cluster (rank, count, cpp, author), 2015–2019 |
| CORE | 284,441 | Top 10 most central papers (excluding reviews) by cluster (rank, score, type, bibentry, cites), 2015–2019 |
| REVIEW | 109,171 | Top 5 most central review papers by cluster (rank, score, type, bibentry, cites), 2015–2019 |
| PM5_SHEET | | Enter PM5 cluster number to populate this sheet with metadata from the preceding sheets |
| JNL_EXCL | 42 | Journals excluded from the model |
| METHDISC | 764,405 | List of method and discovery papers by cluster (PMID, meth, disc), 1996–2019 |
Data types for records in the CLUST Excel sheet.
| Index | Format | Description |
|---|---|---|
| PM5 | Integer | PM5 cluster number |
| PM4 | Integer | Corresponding PM4 cluster number |
| PM3 | Integer | Corresponding PM3 cluster number |
| X | Double | X coordinate value on map |
| Y | Double | Y coordinate value on map |
| field | String | High-level field of science, see Fig. |
| nptot | Integer | Number of documents, 1996–2019 |
| np1519 | Integer | Number of documents, 2015–2019 |
| cpp19 | Double | Mean cites per paper for documents 2015–2019 as of end-2019 |
| cpp19_pctl | Double | cpp19 percentile among clusters |
| rcr | Double | Mean RCR (relative citation ratio) value, 2015–2019 |
| rcr_pctl | Double | rcr percentile among clusters |
| snip | Double | Mean SNIP (source normalized impact per paper), 2015–2019; 2018 SNIP value used for 2019 documents |
| snip_pctl | Double | snip percentile among clusters |
| apt | Double | Mean APT (approximate potential to translate) value, 2015–2019 |
| apt_pctl | Double | apt percentile among clusters |
| ind_fr | Double | Fraction of documents with at least one industry affiliation/address, 2015–2019 |
| ind_pctl | Double | ind percentile among clusters |
| nprpp | Double | Mean number of patent citations per paper, patents 2015–2019, documents 1996–2019 |
| npr_pctl | Double | nprpp percentile among clusters |
| clin_fr | Double | Fraction of documents with at least one clinical affiliation/address, 2015–2019 |
| clin_pctl | Double | clin percentile among clusters |
| rlev | Double | Mean research level, 2015–2019 |
| fundpp | Double | Mean number of funding types per paper, 2015–2019 |
| nf_pctl | Double | fundpp percentile among clusters |
| grantpp | Double | Mean number of grants indexed in PubMed per paper, 2015–2019 |
| ng_pctl | Double | grantpp percentile among clusters |
| starpp | Double | Mean funding per paper in $M, 2015–2019, NIH and NSF funding from Star Metrics |
| star_pctl | Double | starpp percentile among clusters |
| meth_fr | Double | Fraction of documents identified as method, 1996–2019 |
| meth_pctl | Double | meth percentile among clusters |
| disc_fr | Double | Fraction of documents identified as discovery, 1996–2019 |
| disc_pctl | Double | disc percentile among clusters |
| rev_fr | Double | Fraction of documents identified as review, 2015–2019 |
| rev_pctl | Double | rev percentile among clusters |
| trl_fr | Double | Fraction of documents identified as clinical trial, 2015–2019 |
| trl_pctl | Double | trl percentile among clusters |
| nauth2 | Integer | Number of authors with at least 2 papers in cluster, 2015–2019 |
| nauth5 | Integer | Number of authors with at least 5 papers in cluster, 2015–2019 |
| age | Double | Mean age of papers in cluster |
| vit19 | Double | Mean vitality of papers in cluster as of end-2019 |
| 3yrgrw | Double | Annualized growth rate in cluster from 2016–2019 |
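Many CLUST columns are paired with a `_pctl` column described as a percentile among clusters. The source does not state how these percentiles were computed or on what scale they are stored, so the sketch below is only one plausible reading: a percentile rank on a 0–100 scale, defined as the share of clusters whose value is less than or equal to the given one. The function name and the sample values are hypothetical.

```python
from bisect import bisect_right


def percentile_ranks(values):
    """Return, for each value, the percentage of values less than or equal to it.

    Assumption: this is one common definition of a percentile rank; the
    dataset's own method and scale are not documented here.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [100.0 * bisect_right(ordered, v) / n for v in values]


# Hypothetical cpp19 values for four clusters (not taken from the dataset).
cpp19 = [2.0, 8.5, 4.0, 4.0]
print(percentile_ranks(cpp19))
```

Tied values receive the same rank under this definition, which keeps the result independent of row order.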
Data types for records in the TRANSP Excel sheet.
| Index | Format | Description |
|---|---|---|
| PM5 | Integer | PM5 cluster number |
| npoa | Integer | Number of open access documents per PubMed Central (PMCOA) |
| oa_fr | Double | Fraction of documents in cluster from PMCOA |
| coi_fr | Double | Fraction of PMCOA documents with a COI statement |
| coi_pctl | Double | coi percentile among clusters |
| fund_fr | Double | Fraction of PMCOA documents with a funding statement |
| fund_pctl | Double | fund percentile among clusters |
| reg_fr | Double | Fraction of PMCOA documents with a registration statement |
| reg_pctl | Double | reg percentile among clusters |
| data_fr | Double | Fraction of PMCOA documents with a data sharing statement |
| data_pctl | Double | data percentile among clusters |
| code_fr | Double | Fraction of PMCOA documents with a code sharing statement |
| code_pctl | Double | code percentile among clusters |
Common format for PHRASE, IDIO, MESH, ASJC and JNL Excel sheets.
| Index | Format | Description |
|---|---|---|
| PM5 | Integer | PM5 cluster number |
| rank | Integer | Field rank within cluster |
| descriptor | String | Phrase/idio/MeSH heading/category/journal |
| score | Double | Score or count of descriptor |
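Because these five sheets share one long format (one row per cluster and rank), a ranked descriptor list per cluster can be rebuilt by grouping on PM5 and sorting by rank. The following is a minimal Python sketch under that assumption; the `DescriptorRow` type, the `top_descriptors` function and the sample rows are hypothetical and not part of the workbook.

```python
from collections import defaultdict
from typing import NamedTuple


class DescriptorRow(NamedTuple):
    pm5: int         # PM5 cluster number
    rank: int        # rank within cluster
    descriptor: str  # phrase, MeSH heading, category or journal
    score: float     # score or count of descriptor


def top_descriptors(rows):
    """Group rows by cluster and return each cluster's descriptors ordered by rank."""
    by_cluster = defaultdict(list)
    for row in rows:
        by_cluster[row.pm5].append(row)
    return {
        pm5: [r.descriptor for r in sorted(items, key=lambda r: r.rank)]
        for pm5, items in by_cluster.items()
    }


# Hypothetical rows, for illustration only.
rows = [
    DescriptorRow(1, 2, "vaccine", 3.1),
    DescriptorRow(1, 1, "coronavirus", 5.4),
    DescriptorRow(2, 1, "insulin", 4.2),
]
print(top_descriptors(rows))
```

Sorting inside each group means the input rows do not need to arrive in rank order.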
Data types for records in the QUERY Excel sheet.
| Index | Format | Description |
|---|---|---|
| PM5 | Integer | PM5 cluster number |
| #CORD | Integer | Number of documents found in the CORD-19 (Allen AI Covid 19) dataset |
| %CORD | Double | Fraction of documents found in the CORD-19 (Allen AI Covid 19) dataset |
| MICH | Integer | Number of documents with a University of Michigan address, 2015–2019 |
| STAN | Integer | Number of documents with a Stanford University address, 2015–2019 |
Data types for records in the METHDISC Excel sheet.
| Index | Format | Description |
|---|---|---|
| PM5 | Integer | PM5 cluster number |
| PMID | Integer | PubMed ID for document |
| method | String | Identified as a method paper (=METH) |
| discovery | String | Identified as a discovery paper (=DISC) |
Fig. 4 Tableau views of the PubMed model filtered to show only those clusters with University of Michigan Medical School (UMMS) papers. Color reflects the research level of each cluster. (a) Map view. (b) Scatterplot view with the approximate potential to translate percentile on the x-axis and NIH/NSF funding percentile on the y-axis.
Fig. 5 Tableau views of subsets of clusters related to coronavirus. (a) Map view of clusters with at least 25 CORD-19 documents and a CORD-19 document concentration of at least 10%. (b) Scatterplot view of clusters further filtered to those containing UMMS papers.
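The cluster selection described for Fig. 5(a) can be approximated outside Tableau with a simple filter over the QUERY sheet. The following is a minimal Python sketch, assuming each row is a dictionary keyed by the QUERY column names and that `%CORD` is stored as a fraction between 0 and 1 (the sheet describes it as a fraction); the function name and sample rows are hypothetical.

```python
def cord19_clusters(rows, min_docs=25, min_fraction=0.10):
    """Return PM5 ids of clusters meeting both CORD-19 thresholds.

    Assumption: "%CORD" holds a fraction in [0, 1], so 10% is 0.10.
    """
    return [
        r["PM5"]
        for r in rows
        if r["#CORD"] >= min_docs and r["%CORD"] >= min_fraction
    ]


# Hypothetical QUERY rows, for illustration only.
rows = [
    {"PM5": 101, "#CORD": 40, "%CORD": 0.22},
    {"PM5": 102, "#CORD": 30, "%CORD": 0.04},
    {"PM5": 103, "#CORD": 12, "%CORD": 0.35},
]
print(cord19_clusters(rows))
```

Requiring both thresholds excludes large clusters in which CORD-19 documents are a small minority as well as small clusters with only a handful of CORD-19 documents.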
Fig. 6 Publication profile of African American and Native American principal investigators at UMMS overlaid on the PubMed map. Sizes of colored circles reflect numbers of publications.
| Measurement(s) | Biomedical Research • document relatedness |
| Technology Type(s) | digital curation • machine learning • Cluster Analysis • text similarity • citation analysis |
Data types for records in the CORE and REVIEW Excel sheets.
| Index | Format | Description |
|---|---|---|
| PM5 | Integer | PM5 cluster number |
| rank | Integer | Core paper rank |
| score | Double | Relative score based on relatedness values within cluster |
| type | String | Document type(s) from PubMed |
| source | String | Source metadata - PMID, title, journal, volume, page, year, DOI |
| ncited | Integer | Number of times cited from OCC, January 2020 |