Lucie Beranová1, Marcin P Joachimiak2, Tomáš Kliegr3, Gollam Rabby3, Vilém Sklenák4,3.
Abstract
Multiple studies have investigated bibliometric factors predictive of the citation count a research article will receive. In this article, we go beyond bibliometric data by using a range of machine learning techniques to find patterns predictive of citation count using both article content and available metadata. As the input collection, we use the CORD-19 corpus containing research articles, mostly from biology and medicine, applicable to the COVID-19 crisis. Our study employs a combination of state-of-the-art machine learning techniques for text understanding, including the embeddings-based language model BERT and several systems for detection and semantic expansion of entities: ConceptNet, PubTator, and ScispaCy. To interpret the resulting models, we use several explanation algorithms: random forest feature importance, LIME, and Shapley values. We compare the performance and comprehensibility of models obtained by "black-box" machine learning algorithms (neural networks and random forests) with models built with rule learning (CORELS, CBA), which are intrinsically explainable. Multiple rules were discovered that referred to biomedical entities of potential interest. Of the rules with the highest lift measure, several pointed to dipeptidyl peptidase 4 (DPP4), a known MERS-CoV receptor and a critical determinant of camel-to-human transmission of the camel coronavirus (MERS-CoV). Some other interesting patterns related to the type of animal investigated were found. Articles referring to bats and camels tend to draw citations, while articles referring to most other animal species related to coronavirus are lowly cited. Bat coronavirus is the only other virus from a non-human species in the betaB clade along with the SARS-CoV and SARS-CoV-2 viruses. MERS-CoV is in a sister betaC clade, also close to human SARS coronaviruses. Thus, both species linked to high citation counts harbor coronaviruses that are more phylogenetically similar to human SARS viruses.
On the other hand, feline (FIPV, FCoV) and canine (CCoV) coronaviruses are in the alpha coronavirus clade and more distant from the betaB clade with human SARS viruses. Other results include detection of apparent citation bias favouring authors with Western-sounding names. Equal performance of TF-IDF weights and a binary word incidence matrix was observed, with the latter resulting in better interpretability. The best predictive performance was obtained with a "black-box" method, a neural network. The rule-based models led to the most insights, especially when coupled with text representation using semantic entity detection methods. Follow-up work should focus on the analysis of citation patterns in the context of phylogenetic trees, as well as on patterns referring to DPP4, which is currently considered a SARS-CoV-2 therapeutic target. © Akadémiai Kiadó, Budapest, Hungary 2022.
Keywords: Bibliometry; CORD-19: COVID-19 open research dataset; Citation prediction; Interpretability; Phylogenetic distance; SARS-CoV-2; Text analysis; Virus clades
Year: 2022 PMID: 35431364 PMCID: PMC8993675 DOI: 10.1007/s11192-022-04314-9
Source DB: PubMed Journal: Scientometrics ISSN: 0138-9130 Impact factor: 3.801
Fig. 1 Overview of methodological pipeline
Fig. 2 Process of collecting data and data reduction
Fig. 3 Correlation between number of citations retrieved from OpenCitations and from Web of Science Expanded API
Fig. 4 Distribution of highly vs lowly cited articles before the normalization by age (left) and after normalization (right) for the V1 dataset
Fig. 5 Distribution of highly vs lowly cited articles before the normalization by age (left) and after normalization (right) for the V2 dataset
Distribution of the target variable (discretized citation count adjusted for article age)
| Category | V1 (small): Low | V1 (small): High | V2 (large): Low | V2 (large): High |
|---|---|---|---|---|
| Citation count | [0;2] | (2;190] | 0 | (0;2905] |
| Frequency | 1127 | 1096 | 36171 | 36165 |
Entity recognition systems used
| Training corpus | Entity types |
|---|---|
| CRAFT | GGP, SO, TAXON, CHEBI, GO, CL |
| JNLPBA | DNA, CELL_TYPE, CELL_LINE, RNA, PROTEIN |
| BC5CDR | DISEASE, CHEMICAL |
| BIONLP13CG | AMINO_ACID, ANATOMICAL_SYSTEM, CANCER, CELL, CELLULAR_COMPONENT, DEVELOPING_ANATOMICAL_STRUCTURE, GENE_OR_GENE_PRODUCT, IMMATERIAL_ANATOMICAL_ENTITY, MULTI-TISSUE_STRUCTURE, ORGAN, ORGANISM, ORGANISM_SUBDIVISION, ORGANISM_SUBSTANCE, PATHOLOGICAL_FORMATION, SIMPLE_CHEMICAL, TISSUE |
List adapted from https://allenai.github.io/scispacy/
Overview of input datasets
| Dataset (Matrix) | Features | Reduction method | Columns | Original columns |
|---|---|---|---|---|
| AuthorsNames | Binary | min_df = 32 | 162 | 504,237 |
| BibliometricFeatures | Binary | min_df = 32 | 539 | 504,614 |
| Bow | Binary | min_df = 32 | 1495 | 547,769 |
| Bow_BibliometricFeatures | Binary | min_df = 32 | 2034 | 1,052,383 |
| TF-IDF | float | min_df = 32 | 1495 | 547,769 |
| TF-IDF_BibliometricFeatures | mixed | min_df = 32 | 2034 | 1,052,383 |
| PubTator | Binary | FI = max 1500 | 1500 | 3322 |
| PubTator_Conceptnet | Binary | FI = max 1500 | 1500 | 2087 |
| ScispaCy | Binary | FI = max 1500 | 1500 | 23,685 |
| ScispaCy_Conceptnet | Binary | FI = max 1500 | 1500 | 9483 |
| Bow_PubTator | Binary | min_df = 32, FI = max 1500 | 2995 | 551,091 |
| Bow_PubTator_Conceptnet | Binary | min_df = 32, FI = max 1500 | 2995 | 549,856 |
| Bow_ScispaCy | Binary | min_df = 32, FI = max 1500 | 2995 | 571,454 |
| Bow_ScispaCy_Conceptnet | Binary | min_df = 32, FI = max 1500 | 2995 | 557,252 |
| Bow_Pubtator_Conceptnet_BibliometricFeatures | Binary | min_df = 32, FI = max 1500 | 3534 | 1,054,470 |
Evaluation of the BOW matrix for different values of the minimum document frequency (min_df) parameter with the RandomForest classifier
| min_df | Features | Fit time | Accuracy |
|---|---|---|---|
| 1 | 426,272 | 87.03 | 0.72 |
| 4 | 13,397 | 4.25 | 0.72 |
| 8 | 5848 | 1.86 | 0.73 |
| 12 | 3841 | 1.23 | 0.73 |
| 16 | 2940 | 0.97 | 0.73 |
| 20 | 2359 | 0.83 | 0.72 |
| 24 | 1961 | 0.72 | 0.73 |
| 28 | 1708 | 0.66 | 0.72 |
| 32 | 1495 | 0.60 | 0.73 |
| 36 | 1350 | 0.57 | 0.71 |
| 40 | 1178 | 0.54 | 0.72 |
| 44 | 1079 | 0.51 | 0.72 |
| 48 | 985 | 0.49 | 0.72 |
| 52 | 894 | 0.47 | 0.73 |
| 90 | 456 | 0.36 | 0.71 |
| 120 | 320 | 0.32 | 0.71 |
| 150 | 232 | 0.30 | 0.70 |
| 200 | 153 | 0.28 | 0.70 |
| 250 | 101 | 0.26 | 0.70 |
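The effect of the min_df threshold in the table above can be illustrated with a minimal, standard-library sketch of a binary bag-of-words matrix. The function name `binary_bow`, the whitespace tokenizer, and the toy documents are assumptions made for illustration and are not taken from the article's pipeline; min_df is interpreted here as the minimum number of documents in which a term must occur.

```python
from collections import Counter


def binary_bow(docs, min_df=1):
    """Build a binary term-incidence matrix, keeping terms found in >= min_df documents."""
    # Naive whitespace tokenizer (an assumption); each document becomes a set of terms.
    token_sets = [set(doc.lower().split()) for doc in docs]
    # Document frequency: the number of documents in which each term occurs.
    doc_freq = Counter(term for tokens in token_sets for term in tokens)
    vocab = sorted(term for term, df in doc_freq.items() if df >= min_df)
    # One row per document, one 0/1 column per retained term.
    matrix = [[int(term in tokens) for term in vocab] for tokens in token_sets]
    return vocab, matrix


# Toy documents, invented for illustration only.
docs = [
    "camel mers coronavirus transmission",
    "bat coronavirus transmission",
    "canine coronavirus isolate",
]
vocab_all, _ = binary_bow(docs, min_df=1)
vocab_2, matrix_2 = binary_bow(docs, min_df=2)
```

Raising min_df shrinks the vocabulary sharply, which mirrors the drop in the Features column of the table, while the retained columns stay directly readable as "term present or absent".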
Fig. 6 Architecture of the convolutional neural network used, generated by Netron (https://netron.app/)
Hyperparameter combinations evaluated for the neural network model
| Parameters | Run 1 | Run 2 | Run 3 | Run 4 |
|---|---|---|---|---|
| EMB_DIM | 200 | 1400 | 500 | 1300 |
| CNN_FILTERS | 100 | 130 | 200 | 50 |
| DNN_UNITS | 256 | 256 | 256 | 256 |
| OUTPUT_CLASSES | 3 | 3 | 3 | 3 |
| DROPOUT_RATE | 0.2 | 0.2 | 0.2 | 0.2 |
| NB_EPOCHS | 5 | 5 | 5 | 5 |
| Accuracy | 0.56 | 0.70 | 0.59 | 0.68 |
Predictive performance of random forests and neural networks for V1 dataset of 2223 articles
| Matrix | Binary: Accuracy | Regression: MSE | Regression: Accuracy |
|---|---|---|---|
| AuthorsNames | 0.68 | 25.25 | 0.55 |
| BibliometricFeatures | 0.68 | 21.26 | 0.61 |
| Bow | 0.70 | 22.79 | 0.61 |
| Bow_BibliometricFeatures | 0.70 | 21.98 | 0.59 |
| Bow_PubTator | 0.72 | 22.50 | 0.61 |
| Bow_PubTator_Conceptnet | 0.71 | 22.66 | 0.62 |
| Bow_PubTator_Conceptnet_Bibliometric_Features | 0.72 | 17.61 | 0.67 |
| Bow_ScispaCy | 0.69 | 22.61 | 0.59 |
| Bow_ScispaCy_Conceptnet | 0.70 | 22.36 | 0.64 |
| PubTator | 0.68 | 27.90 | 0.60 |
| PubTator_Conceptnet | 0.67 | 28.60 | 0.58 |
| ScispaCy | 0.60 | 32.12 | 0.53 |
| ScispaCy_Conceptnet | 0.60 | 31.80 | 0.53 |
| TF-IDF | 0.70 | 25.17 | 0.54 |
| TF-IDF_Bibliometric_Features | 0.71 | 21.30 | 0.61 |
| BERT embeddings | 0.67 | 32.92 | 0.54 |
| BERT embeddings (a) | 0.83 | 22.04 | 0.80 |
(a) BERT results were updated for the final version of the article using the BERT TF Hub model (bert_en_uncased_L-12_H-768_A-12/2) instead of V1 of the same model. The previous accuracy for BERT (classification) was 0.71 and the previous accuracy for BERT (regression) was 0.56
Predictive performance and model size of rule learning (CBA and CORELS) for V1 dataset of 2223 articles
| Matrix | CORELS: Accuracy | CORELS: avgRuleLen | CORELS: ruleCount | CBA: Accuracy | CBA: avgRuleLen | CBA: ruleCount |
|---|---|---|---|---|---|---|
| AuthorsNames | 0.51 | 1.0 | 1 | 0.67 | 1.3 | 99 |
| BibliometricFeatures | 0.66 | 1.5 | 2 | 0.69 | 2.1 | 192 |
| Bow | 0.64 | 1.5 | 2 | 0.66 | 2.2 | 350 |
| Bow_BibliometricFeatures | 0.65 | 1.5 | 2 | 0.68 | 2.1 | 417 |
| Bow_PubTator | 0.66 | 1.5 | 2 | 0.67 | 2.1 | 349 |
| Bow_PubTator_Conceptnet | 0.66 | 1.5 | 2 | 0.67 | 2.1 | 424 (465) |
| Bow_Pubtator_Conceptnet_BibliometricFeatures | 0.61 | 1.5 | 2 | 0.68 | 2.0 | 349 |
| Bow_ScispaCy | 0.61 | 1.5 | 2 | 0.68 | 1.5 | 162 |
| Bow_ScispaCy_Conceptnet | 0.65 | 1.0 | 2 | 0.67 | 1.1 | 121 |
| PubTator | 0.62 | 1.8 | 5 | 0.66 | 1.7 | 64 |
| PubTator_Conceptnet | 0.64 | 1.5 | 2 | 0.64 | 1.9 | 73 |
| ScispaCy | 0.60 | 1.5 | 2 | 0.57 | 0.5 | 2 |
| ScispaCy_Conceptnet | 0.60 | 1.5 | 2 | 0.57 | 0.7 | 3 |
For Bow_PubTator_Conceptnet, the number in parentheses is for the version with extra feature cleaning described in the “Feature cleaning for improving interpretability” section; the remaining results were the same as for the base version
Predictive performance of Random Forest and Neural Network for both versions of the input dataset
| Model | Matrix | Accuracy: V1 (small) | Accuracy: V2 (large) |
|---|---|---|---|
| Random Forest | Bow | 0.70 | 0.70 |
| Random Forest | TF-IDF | 0.70 | 0.70 |
| Neural Network | BERT | 0.83 | 0.66 |
Results for V1 (2223 articles) are taken for reference from Table 6; the V2 dataset contains 72,336 articles
Topmost important features (MDI method) by matrix
| BibliometricFeatures: Feature | Imp. | PubTator_Conceptnet: Feature | Imp. | Bow_PubT_Conc_BiblFeatures: Feature | Imp. |
|---|---|---|---|---|---|
| FORD_0_impactQ_Q2 | 0.076 | Mers | 0.050 | FORD_0_impactQ_Q2 | 0.026 |
| FORD_0_aisQ_Q1_D1 | 0.047 | Humans | 0.023 | east | 0.021 |
| WoScategory_0_impactQ_Q3 | 0.031 | Human | 0.021 | middle east | 0.019 |
| FORD_0_impactQ_Q1_D2 | 0.031 | Dromedary | 0.019 | east respiratory | 0.017 |
| license_elscovid | 0.029 | Camels | 0.016 | WoScategory_0_aisQ_Q1_D2 | 0.016 |
| FORD_0_aisQ_Q2 | 0.029 | Cov | 0.015 | FORD_0_aisQ_Q1_D1 | 0.015 |
| WoScategory_0_aisQ_Q3 | 0.028 | Cow | 0.013 | FORD_0_impactQ_Q1_D2 | 0.014 |
| FORD_0_impactQ_Q1_D1 | 0.028 | Body | 0.010 | east respiratory syndrome | 0.013 |
| WoScategory_0_aisQ_Q1_D2 | 0.025 | Infection | 0.009 | respiratory syndrome | 0.012 |
| FORD_0_aisQ_Q1_D2 | 0.023 | Rats | 0.009 | syndrome | 0.010 |
| license_unk | 0.022 | Fever | 0.008 | license_unk | 0.010 |
| WoScategory_0_aisQ_Q1_D1 | 0.020 | Bovine | 0.008 | FORD_0_aisQ_Q1_D2 | 0.009 |
| FORD_0_aisQ_Q3 | 0.013 | Canine | 0.008 | WoScategory_0_aisQ_Q3 | 0.009 |
| peter | 0.011 | Failure | 0.008 | middle_east_raspiratory | 0.009 |
| journal_Journal_of_Virology | 0.010 | C | 0.007 | middle | 0.008 |
| journal_Arch_Virol | 0.008 | Pneumonia | 0.007 | WoScategory_0_aisQ_Q1_D1 | 0.008 |
| WoScategory_0_impactQ_Q2 | 0.008 | Respiratory | 0.007 | FORD_0_impactQ_Q1_D1 | 0.007 |
| WoScategory_0_aisQ_Q2 | 0.008 | Transgenic | 0.006 | FORD_0_aisQ_Q2 | 0.007 |
| WoScategory_0_impactQ_Q2 | 0.007 | Dog | 0.006 | journal_Journal_of_Virology | 0.007 |
| paul | 0.007 | People | 0.006 | license_elscovid | 0.005 |
(WoScategory|FORD)_0 indicates that the value is for the journal’s primary Web of Science (FORD) category. (ais|impact)Q_Q{q} indicates that the journal in which the publication appeared is in the q-th quartile by AIS (impact factor). If the journal is in the first two deciles of Q1, D{d} indicates the decile
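The naming scheme described in this note can be decoded mechanically. The following is a minimal sketch, assuming the feature names follow exactly the pattern seen in the table (for example FORD_0_aisQ_Q1_D1); the regular expression, the function name `parse_journal_feature`, and the dictionary keys are hypothetical and not part of the original study.

```python
import re

# Pattern inferred from the feature names listed in the table (an assumption).
FEATURE_RE = re.compile(r"^(WoScategory|FORD)_0_(ais|impact)Q_Q([1-4])(?:_D(\d+))?$")


def parse_journal_feature(name):
    """Split a journal-rank feature name into scheme, metric, quartile and decile."""
    match = FEATURE_RE.match(name)
    if match is None:
        return None  # not a journal-rank feature, e.g. "license_unk"
    scheme, metric, quartile, decile = match.groups()
    return {
        "scheme": scheme,  # classification scheme: WoScategory or FORD
        "metric": metric,  # ranking metric: ais or impact
        "quartile": int(quartile),
        "decile": int(decile) if decile is not None else None,
    }


parsed = parse_journal_feature("FORD_0_aisQ_Q1_D1")
```

Here `parsed` describes a journal in the first decile of the first quartile by AIS within its primary FORD category.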
Fig. 7 Shapley plot for bibliometric features (left) and article abstracts (right). Features are sorted by mean SHAP value. Example explanation: articles annotated with license_nocc (“no Creative Commons license”) in CORD-on-FHIR-19 have value 1 (denoted by red dots), and articles with another license have value 0 (blue dots). The concentration of red dots left of the vertical line (SHAP value 0) indicates that the article license “nocc” has a negative effect on the number of citations. Note that some features, such as camel and camels, could have been aggregated by stemming. This was not performed for the Random Forest model, since it had a negative effect on predictive performance (e.g. human and humans are often used in different contexts). (Color figure online)
Fig. 8 LIME plot for authors
Fig. 9 LIME plot for abstract
Example rules generated by the CBA algorithm grouped by input dataset (matrix)
| Matrix | LHS | RHS | Supp. | Conf. | Cov. | Lift |
|---|---|---|---|---|---|---|
| Bow_ScispaCy | {oc43 strain,bcv} | {low} | 22 | 1.00 | 0.01 | 1.89 |
| Bow_ScispaCy | {merscov infection,virus} | {high} | 22 | 1.00 | 0.01 | 2.13 |
| Bow_ScispaCy | {killing,merscov infection} | {high} | 44 | 0.97 | 0.02 | 2.07 |
| Bow_ScispaCy | {neutralizing antibody, merscov infection} | {high} | 44 | 0.96 | 0.02 | 2.04 |
| Bow_ScispaCy_Conc | {bradycardie,bcv} | {low} | 22 | 1.00 | 0.01 | 1.89 |
| Bow_ScispaCy_Conc | {results indicated} | {low} | 23 | 0.96 | 0.01 | 1.81 |
| Bow_ScispaCy_Conc | {canine,dogs} | {low} | 21 | 0.96 | 0.01 | 1.81 |
| Bow_ScispaCy_Conc | {merscov infection} | {high} | 40 | 0.94 | 0.03 | 2.01 |
| Bow_PubTator | {reversed,bcv} | {low} | 22 | 1.00 | 0.01 | 1.89 |
| Bow_PubTator | {merscov infection} | {high} | 40 | 0.94 | 0.03 | 2.01 |
| Bow_PubT_Conceptnet | {canine,virus} | {low} | 23 | 1.00 | 0.01 | 1.89 |
| Bow_PubT_Conceptnet | {bats,coronavirus,transmission} | {high} | 17 | 1.00 | 0.11 | 2.01 |
| Bow_PubT_Conceptnet | {merscov,mice} | {high} | 22 | 1.00 | 0.01 | 2.13 |
| Bow_PubT_Conceptnet | {ifn,innate,respiratory} | {high} | 22 | 1.00 | 0.01 | 2.13 |
| Bow_PubT_Conceptnet | {mice,protection,vaccine} | {high} | 22 | 1.00 | 0.01 | 2.13 |
| Bow_PubT_Conceptnet | {dpp4,respiratory} | {high} | 22 | 1.0 | 0.014 | 2.02 |
| Bow_PubT_Conceptnet | {virus,infectious} | {high} | 34 | 0.72 | 0.02 | 1.53 |
| BibliometricFeatures | {FORD_0_aisQ_Q1_D1,christian} | {high} | 20 | 1.00 | 0.01 | 2.02 |
| BibliometricFeatures | {FORD_0_impactQ_Q1_D1,van} | {high} | 22 | 1.00 | 0.01 | 2.02 |
| BibliometricFeatures | {WoSkateg._0_obor_VIROLOGY_SCIE, FORD_0_aisQ_Q1_D2} | {high} | 378 | 0.7 | 0.24 | 1.41 |
| BibliometricFeatures | {FORD_0_ford_10600, FORD_0_aisQ_Q1_D2, FORD_0_impactQ_Q1_D2} | {high} | 375 | 0.7 | 0.25 | 1.4 |
| Bow_BibFeatures | {FORD_0_aisQ_Q1_D1, merscov} | {high} | 67 | 1.00 | 0.03 | 2.02 |
| Bow_BibFeatures | {antibodies,middle east} | {high} | 44 | 1.00 | 0.03 | 2.02 |
| Bow_BibFeatures | {WoScategory_0_aisQ_Q1_D1,merscov} | {high} | 67 | 1.00 | 0.03 | 2.02 |
| Bow_BibFeatures | {middle east,spike protein} | {high} | 36 | 1.00 | 0.02 | 2.02 |
| Bow_BibFeatures | {dromedary, east respiratory syndrome} | {high} | 35 | 1.00 | 0.02 | 2.02 |
| Bow_BibFeatures | {WoScategory_0_obor_VIROLOGY_SCIE, homology,sequence} | {low} | 22 | 0.89 | 0.01 | 1.76 |
| Bow_BibFeatures | {FORD_0_impactQ_Q2, coronavirus, substitutions} | {low} | 22 | 0.89 | 0.01 | 1.76 |
| Bow_PubT_Conc_BibF | {merscov, FORD_0_aisQ_Q1_D1} | {high} | 67 | 1.00 | 0.03 | 2.02 |
| Bow_PubT_Conc_BibF | {antibodies,middle east} | {high} | 67 | 1.00 | 0.03 | 2.02 |
| Bow_PubT_Conc_BibF | {merscov,WoScategory_0_aisQ_Q1_D1} | {high} | 67 | 1.00 | 0.03 | 2.02 |
| Bow_PubT_Conc_BibF | {flea,canine, FORD_0_ford_10600} | {low} | 22 | 0.89 | 0.01 | 1.76 |
| Bow_PubT_Conc_BibF | {coronavirus,isolate, FORD_0_impactQ_Q2} | {low} | 22 | 0.89 | 0.01 | 1.76 |
| ScispaCy | {simple antibody test methods} | {low} | 689 | 0.58 | 0.53 | 1.1 |
| PubTator_Conceptnet | {mers,Mice,us} | {high} | 22 | 1.00 | 0.01 | 2.13 |
| PubTator_Conceptnet | {mers,infection,Mice} | {high} | 22 | 0.96 | 0.01 | 2.04 |
| PubTator_Conceptnet | {canine,flea} | {low} | 21 | 0.95 | 0.01 | 1.8 |
| PubTator_Conceptnet | {mers,body,us} | {high} | 44 | 0.94 | 0.02 | 2 |
| PubTator_Conceptnet | {infection,infected} | {high} | 22 | 0.65 | 0.02 | 1.39 |
| PubTator | {recombinant fcov nucleocapsid protein rnp} | {low} | 689 | 0.58 | 0.53 | 1.1 |
| AuthorsNames | {woo,yuen kwok yung} | {high} | 44 | 0.97 | 0.02 | 2.06 |
| AuthorsNames | {patrick,yuen kwok yung} | {high} | 22 | 0.97 | 0.01 | 2.05 |
| AuthorsNames | {chan,patrick} | {high} | 22 | 0.96 | 0.01 | 2.05 |
| AuthorsNames | {chan,yuen kwok} | {high} | 27 | 0.94 | 0.02 | 2.01 |
| Bow | {canine,virus} | {low} | 23 | 1.00 | 0.01 | 1.89 |
| Bow | {merscov,mice} | {high} | 28 | 1.00 | 0.01 | 2.13 |
| Bow | {ifn,innate,respiratory} | {high} | 22 | 1.00 | 0.01 | 2.13 |
| Bow | {mice,protection,vaccine} | {high} | 22 | 1.00 | 0.01 | 2.13 |
| Bow | {hepatitis,study} | {low} | 16 | 0.73 | 0.03 | 1.38 |
| Bow | {associated,recently} | {high} | 22 | 0.73 | 0.02 | 1.55 |
LHS: antecedent of the rule. RHS: prediction made by the rule. Supp.: number of articles matching the complete rule. Conf.: proportion of articles matching the LHS for which the RHS is true (1 is 100%). Cov.: proportion of articles in the input dataset for which the LHS is true. Lift: ratio of the confidence of the rule to the expected confidence, which is the proportion of articles in the input dataset assigned to the target class in the RHS of the rule
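The four measures defined in this note can be computed directly from a labelled collection. The sketch below follows those definitions; the function name `rule_metrics`, the representation of each article as a set of features plus a label, and the toy rows are assumptions for illustration and do not reproduce the article's data.

```python
def rule_metrics(rows, lhs, rhs):
    """Compute support, confidence, coverage and lift of the rule lhs -> rhs.

    rows: list of (set_of_features, label) pairs, one per article.
    """
    lhs = set(lhs)
    n = len(rows)
    n_lhs = sum(1 for feats, _ in rows if lhs <= feats)  # articles matching LHS
    n_rule = sum(1 for feats, label in rows if lhs <= feats and label == rhs)
    n_rhs = sum(1 for _, label in rows if label == rhs)  # articles in target class
    confidence = n_rule / n_lhs if n_lhs else 0.0
    expected = n_rhs / n  # expected confidence: share of the target class
    return {
        "support": n_rule,  # absolute count, as in the Supp. column
        "confidence": confidence,
        "coverage": n_lhs / n,
        "lift": confidence / expected if expected else 0.0,
    }


# Toy rows, invented for illustration only.
rows = [
    ({"merscov", "mice"}, "high"),
    ({"merscov", "mice", "vaccine"}, "high"),
    ({"merscov"}, "high"),
    ({"virus"}, "high"),
    ({"canine", "virus"}, "low"),
    ({"canine"}, "low"),
    ({"mice"}, "low"),
    ({"bats"}, "low"),
]
metrics = rule_metrics(rows, lhs={"merscov", "mice"}, rhs="high")
```

With a balanced target, as in these toy rows, a rule with confidence 1.0 reaches a lift of 2.0, which is why the strongest rules in the table cluster around a lift of 2.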
Example rule lists generated by CORELS
| Matrix | Rule list |
|---|---|
| ScispaCy | if [cos7 cells && not wildtype di rna ne1 rna]: high_citation = True, else high_citation = False |
| PubTator_Conceptnet | if [not mers && not body]: high_citation = False, else high_citation = True |
| ScispaCy_Conceptnet | if [chemoattractant && not pegylated]: high_citation = True, else high_citation = False |
| PubTator | if [not human && not MERS-CoV]: high_citation = False, else high_citation = True |
| AuthorsNames | if [not paul && not peter]: high_citation = False, else high_citation = True |
| Bow | if [respiratory syndrome]: high_citation = True, else high_citation = False |
| Bow_ScispaCy | if [assessment && not east]: high_citation = False, else high_citation = True |
| Bow_ScispaCy_Conceptnet | if [respiratory syndrome]: high_citation = True, else high_citation = False |
| Bow_PubTator | if [respiratory syndrome]: high_citation = True, else high_citation = False |
| Bow_PubTator_Conceptnet | if [respiratory syndrome && not sars patients]: high_citation = True, else high_citation = False |
| BibliometricFeatures | if [not FORD_0_aisQ_Q1_D1 && not FORD_0_aisQ_Q1_D2]: high_citation = False, else high_citation = True |
| Bow_BibliometricFeatures | if [FORD_0_impactQ_Q2 && not middle east]: high_citation = False, else high_citation = True |
| Bow_PubT_Conc_BibFeatures | if [respiratory syndrome]: high_citation = True, else high_citation = False |
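Each rule list in this table is a single conjunctive condition followed by a default. As a minimal sketch of how such a list classifies an article, the helper below evaluates one condition over a set of detected features; `make_rule_list` and its argument format are hypothetical and are not the CORELS API.

```python
def make_rule_list(conditions, then_value):
    """Return a classifier for a one-rule list.

    conditions: list of (feature, must_be_present) pairs joined by AND.
    then_value: prediction when the condition matches; the else branch
    returns the opposite value, as in the table above.
    """
    def predict(features):
        matched = all((feat in features) == present for feat, present in conditions)
        return then_value if matched else not then_value

    return predict


# "if [not mers && not body]: high_citation = False, else high_citation = True"
pubtator_conceptnet = make_rule_list([("mers", False), ("body", False)], then_value=False)

# "if [respiratory syndrome]: high_citation = True, else high_citation = False"
bow = make_rule_list([("respiratory syndrome", True)], then_value=True)

prediction_mers = pubtator_conceptnet({"mers", "camel"})
prediction_canine = pubtator_conceptnet({"canine"})
```

An article mentioning mers falls through to the else branch and is predicted as highly cited, whereas an article with neither mers nor body is predicted as lowly cited.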
Fig. 10 Rule clustering results for CBA model generated on BOW_Pubtator_Conceptnet (version with additional cleaning)
Examples of rules involving animals generated by the CBA algorithm
| Matrix | LHS | RHS | Support | Confidence | Lift |
|---|---|---|---|---|---|
| Bow_ScispaCy | {evolutionary flexibility,animal models} | {Target=high} | 16 | 0.76 | 1.53 |
| Bow_ScispaCy_Conceptnet | {theoretic,animal models} | {Target=high} | 16 | 0.76 | 1.53 |
| Bow_PubTator | {animal,evidence} | {Target=high} | 18 | 0.90 | 1.81 |
| Bow_PubTator_Conceptnet | {us,animal} | {Target=high} | 25 | 1.00 | 2.01 |
| Bow_BibliometricFeatures | {animals,zoonotic} | {Target=high} | 18 | 1.00 | 2.01 |
| Bow_BibliometricFeatures | {journal_Journal_of_Virology,animals} | {Target=high} | 30 | 0.85 | 1.72 |
| Bow_Pubtator_Conceptnet_BibliometricFeatures | {cov,animal} | {Target=high} | 25 | 1 | 2.01 |
| Bow | {animals,middle east} | {Target=high} | 26 | 1.00 | 2.01 |
| Bow | {animals,zoonotic} | {Target=high} | 18 | 1.00 | 2.01 |
| Bow | {animal,evidence} | {Target=high} | 18 | 0.90 | 1.81 |
| Bow_ScispaCy | {dogs} | {Target=low} | 22 | 0.79 | 1.55 |
| Bow_BibliometricFeatures | {canine,dogs} | {Target=low} | 21 | 0.88 | 1.73 |
| Bow_PubTator_Conceptnet | {Cats,feline coronavirus,infectious_y} | {Target=low} | 18 | 0.81 | 1.62 |
| PubTator_Conceptnet | {coronavirus,cat,Cats} | {Target=low} | 22 | 0.73 | 1.45 |
| PubTator | {cats,feline coronavirus} | {Target=low} | 20 | 0.71 | 1.41 |
| Bow_ScispaCy | {bats rhinolophus ferrumequinum, demyelinating} | {Target=low} | 16 | 1.00 | 1.98 |
| Bow_ScispaCy_Conceptnet | {bats} | {Target=high} | 47 | 0.80 | 1.60 |
| Bow | {bats,coronavirus, east respiratory syndrome} | {Target=high} | 21 | 1.00 | 2.01 |
| Bow | {bats,coronavirus,transmission} | {Target=high} | 17 | 1.00 | 2.01 |
| Bow | {bats,respiratory,virus} | {Target=high} | 22 | 0.96 | 1.92 |
| PubTator | {rats} | {Target=low} | 16 | 0.84 | 1.67 |
| Bow_BibliometricFeatures | {camels,east respiratory syndrome, infection} | {Target=high} | 31 | 1.00 | 2.01 |
| Bow_BibliometricFeatures | {camel,east respiratory syndrome} | {Target=high} | 21 | 1.00 | 2.01 |
| PubTator_Conceptnet | {mers,camel} | {Target=high} | 35 | 1.00 | 2.01 |
| PubTator_Conceptnet | {humans,camel} | {Target=high} | 25 | 1.00 | 2.01 |
| PubTator_Conceptnet | {camel,coronavirus} | {Target=high} | 16 | 1.00 | 2.01 |
| PubTator_Conceptnet | {camel,infection} | {Target=high} | 31 | 0.97 | 1.95 |
| PubTator | {humans,camels} | {Target=high} | 22 | 1.00 | 2.01 |
| PubTator | {camel,camels} | {Target=high} | 17 | 1.00 | 2.01 |
| PubTator | {camel} | {Target=high} | 24 | 0.96 | 1.93 |
| PubTator | {camels} | {Target=high} | 46 | 0.96 | 1.93 |
| Bow_ScispaCy_Conceptnet | {dromedary} | {Target=high} | 36 | 0.97 | 1.96 |
| PubTator_Conceptnet | {mers,dromedary} | {Target=high} | 26 | 1.00 | 2.01 |
| PubTator | {humans,dromedary} | {Target=high} | 18 | 1.00 | 2.01 |
| Bow | {dromedary,east respiratory syndrome} | {Target=high} | 35 | 1.00 | 2.01 |
Fig. 11 Visualization of CBA rules related to animals from BOW_Pubtator_Conceptnet (version with extra feature cleaning). This graph was automatically generated by arulesViz (Hahsler & Karpienko, 2017) and subsequently edited for better readability (visually overlapping text and nodes were moved; no changes were made to the nodes, their labels, or their connections)
Number of rules predicting the high/low categories containing the given concept in the antecedent
| Concept | High | Low |
|---|---|---|
| Camel | 503 | 0 |
| Dromedary | 575 | 0 |
| Feline | 0 | 162 |
| Dog | 0 | 35 |
| Rat | 0 | 6 |
| Mouse | 537 | 950 |
| Bat | 292 | 0 |
| Cow | 0 | 156 |
| Bovine | 0 | 200 |
| Squirrel | 0 | 8 |
The counts were generated from all candidate rules learned with the apriori algorithm from BOW_Pubtator_Conceptnet (version with extra feature cleaning)
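A tally of this kind can be sketched as follows, assuming each candidate rule is available as an (antecedent set, predicted class) pair and that a concept is matched by exact item equality. How the original counts handled variants such as camel and camels is not stated, so the matching strategy, the function name `count_rules_by_concept`, and the toy rules are all assumptions.

```python
from collections import Counter


def count_rules_by_concept(rules, concepts):
    """Count, per concept, how many rules with it in the antecedent predict each class."""
    counts = {concept: Counter() for concept in concepts}
    for lhs, rhs in rules:
        for concept in concepts:
            if concept in lhs:  # exact item match (an assumption)
                counts[concept][rhs] += 1
    return counts


# Toy candidate rules, invented for illustration only.
rules = [
    ({"camel", "mers"}, "high"),
    ({"camel", "infection"}, "high"),
    ({"canine", "flea"}, "low"),
    ({"mice", "vaccine"}, "high"),
    ({"mice", "hepatitis"}, "low"),
]
counts = count_rules_by_concept(rules, ["camel", "canine", "mice"])
```

A concept that appears only in rules predicting one class, like camel in the toy data, yields a zero in the opposite column, which is the pattern the table shows for most animal concepts.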
Examples of author-name rules generated by the CBA algorithm
| LHS | RHS | Support | Lift |
|---|---|---|---|
| {m_ller marcel} | {Target=high} | 0.012211 | 2.015544 |
| {abdullah} | {Target=high} | 0.010925 | 2.015544 |
| {baric ralph,mark} | {Target=high} | 0.010283 | 2.015544 |
| {peter,van} | {Target=high} | 0.013496 | 1.923928 |
| {memish} | {Target=high} | 0.012853 | 1.919566 |
| {lai,michael} | {Target=low} | 0.010925 | 1.874433 |
| {wang,zheng} | {Target=high} | 0.010925 | 1.903569 |
| {yi,yuen kwok} | {Target=high} | 0.010925 | 1.903569 |
| {haagmans} | {Target=high} | 0.015424 | 1.860502 |
| {lai} | {Target=low} | 0.017352 | 1.786224 |
| {baker susan} | {Target=high} | 0.010925 | 1.803382 |
| {drosten christian} | {Target=high} | 0.015424 | 1.791595 |
| {kwok yung,woo} | {Target=high} | 0.017995 | 1.763601 |
| {woo patrick} | {Target=high} | 0.017352 | 1.755474 |
| {m_ller} | {Target=high} | 0.012853 | 1.752647 |
| {albert} | {Target=high} | 0.012211 | 1.740697 |
| {hung,woo} | {Target=high} | 0.012211 | 1.740697 |
| {woo,yuen} | {Target=high} | 0.018638 | 1.719141 |
| {jian,zheng} | {Target=high} | 0.010925 | 1.713212 |
| {huang,yi} | {Target=high} | 0.010925 | 1.713212 |
| {eric} | {Target=high} | 0.025064 | 1.708831 |
| {poon} | {Target=high} | 0.014139 | 1.70546 |
| {chan,patrick} | {Target=high} | 0.014139 | 1.70546 |
| {chan,yuen kwok} | {Target=high} | 0.017352 | 1.700615 |
| {peiris malik} | {Target=high} | 0.010283 | 1.6973 |
| {te,tseng} | {Target=high} | 0.010283 | 1.6973 |
| {li,zheng} | {Target=high} | 0.010283 | 1.6973 |
| {li,wang,yi} | {Target=high} | 0.010283 | 1.6973 |
| {berend} | {Target=high} | 0.013496 | 1.693057 |
| {baric} | {Target=high} | 0.029563 | 1.685728 |
| {woo} | {Target=high} | 0.022494 | 1.67962 |
| {wang,yi} | {Target=high} | 0.012853 | 1.67962 |
| {susanna} | {Target=high} | 0.015424 | 1.668036 |
| {christian} | {Target=high} | 0.020566 | 1.65378 |
| {deng} | {Target=high} | 0.011568 | 1.649081 |
| {fang,li} | {Target=high} | 0.011568 | 1.649081 |
| {graham} | {Target=high} | 0.014139 | 1.642295 |
| {lau} | {Target=high} | 0.018638 | 1.623633 |
| {patrick} | {Target=high} | 0.023136 | 1.612435 |
| {ali} | {Target=high} | 0.015424 | 1.612435 |
| {du,jiang} | {Target=high} | 0.010283 | 1.612435 |
| {li,zhou} | {Target=high} | 0.010283 | 1.612435 |
| {chan,yi} | {Target=high} | 0.010283 | 1.612435 |
| {stefan} | {Target=high} | 0.012211 | 1.595639 |
| {al} | {Target=high} | 0.023136 | 1.577382 |
| {baker} | {Target=high} | 0.011568 | 1.577382 |
| {shibo} | {Target=high} | 0.011568 | 1.577382 |
| {te} | {Target=high} | 0.016067 | 1.574644 |
| {zheng} | {Target=high} | 0.017995 | 1.567645 |
| {matthew} | {Target=high} | 0.01928 | 1.550418 |
| {buonavoglia} | {Target=low} | 0.012853 | 1.526688 |
| {vincent} | {Target=high} | 0.014781 | 1.54525 |
| {van} | {Target=high} | 0.041131 | 1.535653 |
| {jan,peter} | {Target=high} | 0.010283 | 1.535653 |
| {peter} | {Target=high} | 0.044344 | 1.52827 |
| {mark} | {Target=high} | 0.025064 | 1.511658 |
| {li,yi} | {Target=high} | 0.015424 | 1.511658 |
| {christopher} | {Target=high} | 0.012211 | 1.472898 |
| {joo} | {Target=low} | 0.012211 | 1.450353 |
| {yee} | {Target=low} | 0.012211 | 1.450353 |
| {xiang} | {Target=high} | 0.015424 | 1.46585 |
| {alexander} | {Target=high} | 0.010283 | 1.46585 |
| {liu ding} | {Target=high} | 0.010283 | 1.46585 |
| {haan} | {Target=high} | 0.010283 | 1.46585 |
| {ching} | {Target=low} | 0.010283 | 1.443414 |
| {jiang,liu} | {Target=high} | 0.010283 | 1.46585 |
| {chen,wei} | {Target=low} | 0.011568 | 1.42898 |
| {jan} | {Target=high} | 0.017995 | 1.447057 |
| {andrew} | {Target=high} | 0.021208 | 1.445934 |
| {jian} | {Target=high} | 0.017352 | 1.432097 |
| {yuan} | {Target=low} | 0.018638 | 1.403808 |
| {peiris} | {Target=high} | 0.015424 | 1.422737 |
| {zhou} | {Target=high} | 0.019923 | 1.420042 |
| {nicola} | {Target=low} | 0.012211 | 1.396636 |
| {paul} | {Target=high} | 0.037918 | 1.41568 |
The continent, nationality, and number of author names, based on (Ye et al., 2017)
| Continent | Nationality | Number of names |
|---|---|---|
| Africa | Muslim-Nubian | 2 |
| Africa | African-WestAfrican | 1 |
| Africa | Muslim-Maghreb | 1 |
| Africa | African-EastAfrican | 3 |
| Asia | EastAsian-Malay-Indonesia | 1 |
| Asia | EastAsian-Malay-Malaysia | 1 |
| Asia | Muslim-ArabianPeninsula | 1 |
| Asia | EastAsian-Indochina-Myanmar | 2 |
| Asia | EastAsian-Chinese | 64 |
| Asia | EastAsian-South Korea | 8 |
| Asia | EastAsian-Indochina-Vietnam | 9 |
| Asia | EastAsian-Japan | 4 |
| Europe | Hispanic-Portuguese | 3 |
| Europe | Hispanic-Spanish | 4 |
| Europe | European-SouthSlavs | 1 |
| Europe | European-Italian-Romania | 1 |
| Europe | European-Italian-Italy | 1 |
| Europe | Europe- French | 1 |
| Europe | European-German | 12 |
| Europe | European-French | 13 |
| Europe | Celtic-English | 31 |
| Europe | Nordic-Finland | 1 |
| Total | | 165 |
Fig. 12 Left: distribution of the number of European and Asian names by the percentage of highly cited articles, discretized into 20 bins. Right: distribution of the number of articles with European and Asian author names by the number of citations