David Schindler, Felix Bensmann, Stefan Dietze, Frank Krüger.
Abstract
Science across all disciplines has become increasingly data-driven, creating additional needs for software to collect, process, and analyse data. Transparency about the software used in the scientific process is therefore crucial for understanding the provenance of individual research data and insights, is a prerequisite for reproducibility, and can enable macro-analysis of the evolution of scientific methods over time. However, a lack of rigor in software citation practices makes the automated detection and disambiguation of software mentions a challenging problem. In this work, we provide a large-scale analysis of software usage and citation practices, facilitated by an unprecedented knowledge graph of software mentions and affiliated metadata. The graph was generated through supervised information extraction models trained on a unique gold standard corpus and applied to more than 3 million scientific articles. Our information extraction approach distinguishes different types of software and mentions, disambiguates mentions, and significantly outperforms the state of the art, leading to the most comprehensive corpus of 11.8 M software mentions, described through a knowledge graph consisting of more than 300 M triples. Our analysis provides insights into the evolution of software usage and citation patterns across various fields, ranks of journals, and impact of publications. To the best of our knowledge, this is the most comprehensive analysis of software use and citation to date; all data and models are shared publicly to facilitate further research into the scientific use and citation of software.
Keywords: Knowledge graph; Named entity recognition; Software citation; Software mention
Year: 2022 PMID: 35111920 PMCID: PMC8771769 DOI: 10.7717/peerj-cs.835
Source DB: PubMed Journal: PeerJ Comput Sci ISSN: 2376-5992
Figure 1. Annotated sentences from SOMESCI missing information required by software citation standards.
Summary of investigations concerning software in science together with source of the articles, number of articles and software, and a quality indicator.
The level of extracted detail varies between the listed approaches. Note that PLoS is a subset of PMC. M, manual; A, automatic; κ, Cohen’s κ; F, FScore; O, Percentage Overlap.
| Approach | Quality | Source | Articles | Software |
|---|---|---|---|---|
| M | O = 0.68–0.83 | Biology | 90 | 286 |
| M | – | Nature (Journal) | 40 | 211 |
| M | O = 0.76 | PMC, Economics | 4,971 | 4,093 |
| M | | PMC | 1,367 | 3,756 |
| A | F = 0.58 | PLoS ONE | 10 K | 26 K |
| A | F = 0.67 | PMC | 714 K | 3.9 M |
| A | F = 0.82 | PLoS (Social Science) | 51 K | 133 K |
Figure 2. Sentence from SOMESCI annotated with respect to software, additional information, mention type, and software type as well as corresponding relations.
Overview of the SOMESCI corpus.
Further details can be found in Schindler et al. (2021b).
| | SOMESCI |
|---|---|
| # Articles | 1,367 |
| # Sentences w/ Software | 2,728 |
| # Sentences w/o Software | 44,796 |
| # Annotations | 7,237 |
| # Software | 3,756 |
| # unique Software | 883 |
| # Relations | 3,776 |
| Software Type | |
| Mention Type | |
| Additional Information | |
Hyper-parameters considered for BERT models including their default setting.
| Parameter | Default |
|---|---|
| Learning Rate (LR) | 1e−5 |
| Sampling | all data |
| Dropout | 0.1 |
| Gradient Clipping | 1.0 |
Figure 3. Illustration of the employed multi-task, hierarchical, sequence labeling model.
Features are generated based on shared layers. The features are passed to 3 separate tasks and loss signals are summed to update shared weights. Outputs of classification layers are passed back to the network as input features to other classification layers, depicted from left to right in the image. Teacher forcing—replacing lower level classification outputs with gold label data—is used during training to stop potentially wrong classification outputs from being passed to other classification layers. Colors represent similar types of information.
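The two training mechanics named in this caption, summed task losses and teacher forcing, can be sketched in a few lines. This is a minimal illustration only: the function names `lower_level_signal` and `total_loss` are hypothetical, labels are plain strings rather than tensors, and nothing here reproduces the authors' network.

```python
from typing import List, Optional, Sequence


def lower_level_signal(predicted: Sequence[str],
                       gold: Optional[Sequence[str]],
                       training: bool) -> List[str]:
    """Choose which lower-level labels are passed on to the next classifier.

    With teacher forcing, the gold labels replace the model's own (possibly
    wrong) predictions during training; at inference time only the
    predictions are available.
    """
    if training and gold is not None:
        return list(gold)
    return list(predicted)


def total_loss(task_losses: Sequence[float]) -> float:
    """Sum the per-task loss signals into one value used to update shared weights."""
    return sum(task_losses)
```

During training, `lower_level_signal(pred, gold, training=True)` returns the gold labels, so an early misclassification cannot propagate into the higher-level classifiers while they are being fitted.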
Example for enforcing tagging consistency.
Inconsistencies are underlined.
| Sentence | We | Used | SPSS | Statistics | 16 | . |
|---|---|---|---|---|---|---|
| Entities | O | O | B-App | I-App | | O |
| Types | O | O | | | O | O |
| Fixed | O | O | B-App-Use | B-App-Use | B-Ver | O |
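One widely used way to enforce consistency in BIO-style tag sequences is to turn any `I-` tag that does not continue an entity of the same type into a `B-` tag. The sketch below shows that generic rule only; it is an assumption for illustration, not the rule set used by the authors, and the function name `repair_bio` and the input sequence in the usage note are hypothetical.

```python
from typing import List, Optional, Sequence


def repair_bio(tags: Sequence[str]) -> List[str]:
    """Rewrite I- tags that do not continue the preceding entity type as B- tags."""
    fixed: List[str] = []
    prev_type: Optional[str] = None  # entity type of the previous token, None if outside
    for tag in tags:
        if tag == "O":
            fixed.append(tag)
            prev_type = None
            continue
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and prev_type != entity_type:
            # An inside tag cannot start an entity or follow a different type.
            tag = "B-" + entity_type
        fixed.append(tag)
        prev_type = entity_type
    return fixed
```

For a hypothetical input `["O", "O", "B-App", "I-App", "I-Ver", "O"]`, the dangling `I-Ver` is rewritten to `B-Ver` while the well-formed `B-App`, `I-App` pair is left untouched.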
Figure 4. Overview of the software name disambiguation. For all pairs of extracted software entities (E1, E2), features are extracted (feature extraction) and used to determine a probability of linking (perceptron).
Finally, agglomerative clustering is performed to cluster similar software names.
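The pipeline in this caption (pairwise linking score, then agglomerative clustering) can be sketched with the standard library alone. Two assumptions are made loudly here: the learned perceptron is replaced by a plain string-similarity stand-in from `difflib`, and the clustering is single-linkage with a fixed cut-off, which is equivalent to taking connected components of the "linked" graph. The names `link_probability`, `cluster_names`, and the threshold value are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, List, Sequence, Set


def link_probability(a: str, b: str) -> float:
    """Stand-in for the learned linking model: case-insensitive similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_names(names: Sequence[str], threshold: float = 0.8) -> List[Set[str]]:
    """Single-linkage agglomerative clustering cut at a fixed linking threshold."""
    parent: Dict[str, str] = {name: name for name in names}

    def find(name: str) -> str:
        # Follow parent pointers to the cluster representative (with path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    # Merge the clusters of every pair whose linking score reaches the threshold.
    for a, b in combinations(names, 2):
        if link_probability(a, b) >= threshold:
            parent[find(a)] = find(b)

    clusters: Dict[str, Set[str]] = {}
    for name in names:
        clusters.setdefault(find(name), set()).add(name)
    return list(clusters.values())
```

Called on `["SPSS", "spss", "ImageJ", "Image J", "Stata"]`, the sketch groups the two SPSS spellings and the two ImageJ spellings and leaves Stata on its own. A trained model would replace `link_probability` and could use richer pair features than surface similarity.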
Figure 5. Data model of the Knowledge Graph representing extracted software mentions and their related information.
For reasons of conciseness some details are left out.
Development set results on software mention recognition.
Models marked with opt were optimized with respect to hyper-parameters, models marked with plain were not. Bold results highlight best performance for both plain and optimized models.
| Model | Precision | Recall | FScore |
|---|---|---|---|
| S | 0.82 | 0.77 | 0.79 |
| M | 0.829 | 0.762 | 0.794 |
| M | 0.862 | 0.808 | 0.834 |
| M | | | |
| M | | | |
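The FScore column can be related to the other two columns through the usual harmonic mean of precision and recall. The sketch below assumes that standard F1 definition, which the source does not spell out; the function name `f_score` is hypothetical.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1); 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For the rows above, `f_score(0.862, 0.808)` rounds to 0.834 and `f_score(0.829, 0.762)` rounds to 0.794. Scores averaged over several runs need not match this identity exactly, so small deviations in other tables are expected.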
Selected hyper-parameters for M fine-tuning.
| Parameter | Value |
|---|---|
| Learning Rate (LR) | 1e−6 |
| Sampling | all data |
| Dropout | 0.2 |
| Gradient clipping | 1.0 |
Software mention extraction results for M in comparison with SOMESCI baseline as reported by Schindler et al. (2021b), where n denotes the number of samples available for each classification target.
Please note that the baseline model applies hierarchical classifiers on the task and does not adjust the performance for error propagation between the initial classification of software and all other down-stream tasks. Therefore, all baseline results except for software are prone to overestimate performance when compared to the given results. Bold results highlight best performance in terms of FScore.
| Label | M Precision | M Recall | M FScore | S FScore | n |
|---|---|---|---|---|---|
| Software | 0.876 | 0.891 | | 0.83 | 590 |
| Abbreviation | 0.884 | 0.879 | | 0.71 | 17 |
| AlternativeName | 0.719 | 0.734 | | 0.25 | 4 |
| Citation | 0.868 | 0.855 | 0.861 | | 120 |
| Developer | 0.867 | 0.901 | | 0.88 | 110 |
| Extension | 0.331 | 0.688 | 0.444 | | 5 |
| License | 0.799 | 0.83 | | 0.80 | 14 |
| Release | 0.499 | 0.771 | 0.605 | | 9 |
| URL | 0.858 | 0.979 | 0.914 | | 53 |
| Version | 0.927 | 0.94 | | 0.92 | 190 |
| Entities | 0.875 | 0.897 | | 0.85 | 1,112 |
| Application | 0.788 | 0.865 | | 0.81 | 415 |
| OS | 0.933 | 0.852 | | 0.82 | 30 |
| PlugIn | 0.652 | 0.408 | | 0.43 | 78 |
| PE | 0.924 | 0.998 | 0.96 | | 63 |
| Software Type | 0.792 | 0.818 | | 0.78 | 590 |
| Creation | 0.784 | 0.805 | | 0.64 | 53 |
| Deposition | 0.71 | 0.821 | | 0.65 | 28 |
| Allusion | 0.603 | 0.464 | | 0.29 | 71 |
| Usage | 0.832 | 0.883 | | 0.80 | 438 |
| Mention Type | 0.794 | 0.823 | | 0.74 | 590 |
Summary of RE results for both development and test set.
SOMESCI represents baseline FScores for comparison. P, Precision; R, Recall; F1, FScore; n, Number of samples per relation. Bold results highlight best performance in terms of FScore.
| Label | Dev P (Random forest) | Dev R (Random forest) | Dev F1 (Random forest) | Dev F1 (S) | Dev n | Test P (Random forest) | Test R (Random forest) | Test F1 (Random forest) | Test F1 (S) | Test n |
|---|---|---|---|---|---|---|---|---|---|---|
| Abbreviation | 1.00 | 1.00 | 1.00 | 1.00 | 17 | 1.00 | 0.94 | 0.97 | 0.97 | 17 |
| Developer | 0.94 | 0.97 | 0.95 | 0.95 | 87 | 0.95 | 0.95 | | 0.94 | 111 |
| AltName | 1.00 | 1.00 | | 0.83 | 6 | 1.00 | 1.00 | 1.00 | 1.00 | 4 |
| License | 0.88 | 0.70 | | 0.57 | 10 | 1.00 | 0.93 | | 0.64 | 14 |
| Citation | 0.94 | 0.97 | | 0.83 | 90 | 0.94 | 0.92 | | 0.86 | 121 |
| Release | 0.78 | 1.00 | | 0.80 | 7 | 0.88 | 0.78 | | 0.53 | 9 |
| URL | 0.93 | 0.94 | | 0.80 | 70 | 0.98 | 0.92 | | 0.89 | 53 |
| Version | 0.97 | 0.99 | | 0.96 | 139 | 0.98 | 0.96 | | 0.95 | 190 |
| Extension | 1.00 | 1.00 | 1.00 | 1.00 | 5 | 1.00 | 1.00 | | 0.89 | 5 |
| PlugIn | 0.77 | 0.66 | | 0.60 | 35 | 0.85 | 0.72 | | 0.65 | 39 |
| Specification | 0.67 | 0.67 | | 0.60 | 6 | 0.83 | 0.62 | | 0.22 | 8 |
| Overall | 0.93 | 0.94 | | 0.87 | 472 | 0.95 | 0.92 | | 0.88 | 571 |
Overview of the 27 main research domains and 3 of their sub categories that were used to group journals.
Bold font highlights the abbreviation of the respective research domain used here.
| Main research domain | Research subcategories (excerpt) |
|---|---|
| | Acoustics and Ultrasonics, Astronomy and Astrophysics, Atomic and Molecular Physics, and Optics |
| | Analytical Chemistry, Chemistry (miscellaneous), Electrochemistry |
| | Anthropology, Archeology, Communication |
| | Biomaterials, Ceramics and Composites, Electronic |
| | Aerospace Engineering, Architecture, Automotive Engineering |
| | Economics and Econometrics, Economics, Econometrics and Finance (miscellaneous) |
| | Multidisciplinary |
| | Energy (miscellaneous), Energy Engineering and Power Technology, Fuel Technology |
| | Agricultural and Biological Sciences (miscellaneous), Agronomy and Crop Science, Animal Science and Zoology |
| | Ecological Modeling, Ecology, Environmental Chemistry |
| | Equine, Food Animals, Small Animals |
| | Advanced and Specialized Nursing, Assessment and Diagnosis, Care Planning |
| | Statistics, Probability and Uncertainty, Information Systems and Management |
| | Atmospheric Science, Computers in Earth Sciences, Earth and Planetary Sciences (miscellaneous) |
| | Drug Discovery, Pharmaceutical Science, Pharmacology |
| | Algebra and Number Theory, Analysis, Applied Mathematics |
| | Artificial Intelligence, Computational Theory and Mathematics, Computer Graphics and Computer-Aided Design |
| | Aging, Biochemistry, Biochemistry |
| | Dentistry (miscellaneous), Oral Surgery, Orthodontics |
| | Behavioral Neuroscience, Biological Psychiatry, Cellular and Molecular Neuroscience |
| | Archeology (arts and humanities), Arts and Humanities (miscellaneous), Conservation |
| | Applied Psychology, Clinical Psychology, Developmental and Educational Psychology |
| | Accounting, Business and International Management, Business |
| | Anatomy, Anesthesiology and Pain Medicine, Biochemistry (medical) |
| | Applied Microbiology and Biotechnology, Immunology, Immunology and Microbiology (miscellaneous) |
| | Chiropractics, Complementary and Manual Therapy, Health Information Management |
| | Bioengineering, Catalysis, Chemical Engineering (miscellaneous) |
Figure 6. Cumulative distribution of software mentions per unique software.
Left (bottom) scale gives the relative values, whereas right (top) scale provides the absolute numbers.
Information about the 10 most frequent software mentions across all disciplines together with their absolute and relative number of mentions, the number of articles that contain at least one mention, and the number of spelling variations that could be disambiguated.
| Software | Absolute # | Relative # (%) | # Articles | # Spellings |
|---|---|---|---|---|
| SPSS | 539,250 | 4.57 | 466,505 | 440 |
| R | 469,751 | 3.98 | 235,180 | 1 |
| Prism | 220,175 | 1.87 | 189,578 | 1 |
| ImageJ | 228,140 | 1.93 | 144,737 | 83 |
| Windows | 140,941 | 1.19 | 127,691 | 6 |
| Stata | 147,586 | 1.25 | 118,413 | 141 |
| Excel | 151,613 | 1.29 | 118,082 | 54 |
| SAS | 140,214 | 1.19 | 112,679 | 215 |
| BLAST | 271,343 | 2.30 | 104,734 | 383 |
| MATLAB | 160,164 | 1.36 | 89,346 | 6 |
Figure 7. Top 10 software per domain.
Higher rank within the domain is represented by darker color. The number on the tile gives the rank within the domain. Software ranked below the top 10 is excluded from the plot to improve readability. Software is ordered by rank over all domains, left to right.
Figure 8. Blue: Relative frequency of articles with at least one software mention per year. Green: Absolute mean frequency of unique software mentioned per article with at least one software mention per year.
Please note that standard deviations are at the same level as the actual average values but are omitted here for reasons of readability.
Figure 9. Blue: Relative frequency of articles with at least one software mention per research domain. Green: Average number of different software mentioned per article with at least one software mention given by research domain.
Note that standard deviations are large (similar to average values) and are omitted here.
Figure 10. Blue: Relative frequency of articles that contain at least one software mention per rank of bibliometric measure. Green: Average number of different software per article per bibliometric measure.
Note that the high standard deviations (at the same level as the average values) are left out to increase readability.
Figure 11. Distribution of software completeness per year with the percentage of unique software per article that is cited with provided additional information.
The colored bars represent the different levels of completeness while the line chart separately indicates how many software mentions were accompanied by a formal citation. The numbers at the top of the bars represent the absolute number of software considered per year.
Figure 12. Distribution of software completeness per research domain.
The numbers at the top of the bars represent the absolute numbers of software considered per domain. Please note that articles may belong to multiple categories.
Figure 13. Distribution of software completeness per ventile of journal rank per research domain.
The numbers at the top of the bars represent the absolute numbers considered per ventile.
Figure 14. Distribution of software completeness per ventile of citation count per research domain.
Note that only articles published before 2020 were included to prevent a bias towards lower citation ventiles.
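A ventile is one of 20 equally sized, rank-ordered groups. The sketch below shows one simple way to assign ventiles to a list of citation counts or journal-rank scores. It is an assumption for illustration: the source does not state how ties are handled or whether ventiles were computed within each research domain, and the function name `ventiles` is hypothetical.

```python
from typing import List, Sequence


def ventiles(values: Sequence[float]) -> List[int]:
    """Assign each value a ventile from 1 (lowest 5%) to 20 (highest 5%) by rank.

    Ties are broken by input order, so equal values may land in
    neighbouring ventiles.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # indices from lowest to highest value
    result = [0] * n
    for rank, index in enumerate(order):
        result[index] = rank * 20 // n + 1
    return result
```

With 100 distinct values, each ventile receives exactly five of them; grouping articles by the returned index then allows completeness to be summarised per ventile.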
Overview of the relative frequency of software and mention types as well as their combinations over all software mentions.
Note that overall numbers do not necessarily sum to 100 due to rounding issues.
| | Allusion | Creation | Deposition | Usage | Overall |
|---|---|---|---|---|---|
| Application | 13.95 | 1.80 | 0.56 | 68.14 | 84.49 |
| OperatingSystem | 0.26 | 0.00 | 0.00 | 1.69 | 1.95 |
| PlugIn | 0.47 | 0.17 | 0.04 | 5.58 | 6.27 |
| ProgrammingEnvironment | 0.40 | 0.00 | 0.00 | 6.88 | 7.29 |
| Overall | 15.09 | 2.01 | 0.60 | 82.31 | 100.00 |
Most frequent host software, i.e., mentioned together with a PlugIn, in combination with the most frequently used PlugIns for each of them.
# PlugIn, distinct, disambiguated PlugIns; # Mention, overall PlugIn mentions.
| Software | # PlugIn | # Mention | Top 5 PlugIn incl. % of mentions |
|---|---|---|---|
| R | 19,442 | 220,750 | Bioconductor (4.79%), ggplot 2 (3.63%), lme 4 (3.21%), vegan (3.09%), DESeq 2 (2.52%) |
| MATLAB | 4,442 | 18,616 | Psychophysics Toolbox (6.75%), Psychtoolbox (6.03%), Statistics Toolbox (3.92%), Image Processing Toolbox (2.87%), Neural Network Toolbox (1.43%) |
| Python | 3,157 | 11,688 | scikit - learn (12.55%), SciPy (4.10%), TensorFlow (3.75%), Network (2.37%), scipy (2.30%) |
| python | 1,449 | 3,533 | scikit - learn (10.44%), scipy (3.85%), sklearn (2.83%), matplotlib (2.52%), HTSeq (2.12%) |
| ImageJ | 1,286 | 10,761 | Fiji (44.10%), NeuronJ (3.00%), Cell Counter (2.77%), MTrackJ (2.11%), Analyze Particles (1.64%) |
| Stata | 809 | 2,190 | metan (5.94%), runmlwin (3.06%), mvmeta (2.33%), Image Composite Editor (1.87%), metareg (1.78%) |
| Perl | 774 | 1,176 | MISA (3.74%), speaks - NONMEM (2.64%), Bioconductor (2.21%), NONMEM (1.36%), Shell (1.11%) |
| Excel | 644 | 1,946 | XLSTAT (17.83%), nSolver (3.96%), Microsatellite Toolkit (2.77%), Analysis ToolPak (2.00%), @ Risk (1.95%) |
| Cytoscape | 553 | 5,902 | ClueGO (13.62%), MCODE (13.00%), BiNGO (7.66%), NetworkAnalyzer (7.56%), Enrichment Map (5.71%) |
| SPM | 521 | 2,671 | DARTEL (20.10%), MarsBaR (9.55%), Marsbar (2.92%), CONN (2.62%), DPARSF (2.55%) |
Top 10 most frequent URLs accompanying software deposition and usage statements together with their absolute and relative frequencies.
| Deposition URL | Absolute | Relative | Usage URL | Absolute | Relative |
|---|---|---|---|---|---|
| github.com | 8,602 | 13.93 | github.com | 18,918 | 3.90 |
| journals.plos.org | 5,926 | 9.60 | ncbi.nlm.nih.gov | 16,832 | 3.47 |
| sourceforge.net | 918 | 1.49 | r-project.org | 13,176 | 2.71 |
| cran.r-project.org | 673 | 1.09 | pacev2.apexcovantage.com | 10,504 | 2.16 |
| bioconductor.org | 651 | 1.05 | ebi.ac.uk | 9,850 | 2.03 |
| ebi.ac.uk | 478 | 0.77 | blast.ncbi.nlm.nih.gov | 8,797 | 1.81 |
| ncbi.nlm.nih.gov | 454 | 0.74 | cbs.dtu.dk | 6,539 | 1.35 |
| bitbucket.org | 423 | 0.69 | fil.ion.ucl.ac.uk | 6,439 | 1.33 |
| code.google.com | 353 | 0.57 | cran.r-project.org | 6,015 | 1.24 |
| string-db.org | 204 | 0.33 | targetscan.org | 5,738 | 1.18 |
Statistics of SoftwareKG. Left: general KG properties. Right: frequencies of resources per type.
| Property | Frequency |
|---|---|
| Triples | 301,825,757 |
| Resources | 55,953,270 |
| Distinct Types | 12 |
| Distinct Properties | 47 |
| Reification Statements | 2,042,076 |
Figure 15. Relative and absolute amount of articles per year mentioning the top statistical software.