Shubhanshu Mishra, Brent D Fegley, Jana Diesner, Vetle I Torvik.
Abstract
It was recently reported that men self-cite >50% more often than women across a wide variety of disciplines in the bibliographic database JSTOR. Here, we replicate this finding in a sample of 1.6 million papers from Author-ity, a version of PubMed with computationally disambiguated author names. More importantly, we show that the gender effect largely disappears when accounting for prior publication count in a multidimensional statistical model. Gender has the weakest effect on the probability of self-citation among an extensive set of features tested, including byline position, affiliation, ethnicity, collaboration size, time lag, subject-matter novelty, reference/citation counts, publication type, language, and venue. We find that self-citation is the hallmark of productive authors, of any gender, who cite their novel journal publications early and in similar venues, and more often cross citation-barriers such as language and indexing. As a result, papers by authors with short, disrupted, or diverse careers miss out on the initial boost in visibility gained from self-citations. Our data further suggest that this disproportionately affects women because of attrition and not because of disciplinary under-specialization.
Year: 2018 PMID: 30256792 PMCID: PMC6157831 DOI: 10.1371/journal.pone.0195773
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Comparison of gender proportions by using SSA data (with a 95% cut-off) versus Genni 2.0, aggregated by ethnicity.
U denotes the percentage of authorships labelled Unknown, %F denotes the percentage of female authorships among male and female authorships, and G = SSA denotes the percentage of male and female SSA predictions that match the Genni predictions.
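| Ethnicity | First Author: Proportion | First Author: Total | First Author: Genni U | First Author: Genni %F | First Author: SSA U | First Author: SSA %F | First Author: G = SSA | Last Author: Proportion | Last Author: Total | Last Author: Genni U | Last Author: Genni %F | Last Author: SSA U | Last Author: SSA %F | Last Author: G = SSA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|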
| ENGLISH | ||||||||||||||
| GERMAN | ||||||||||||||
| HISPANIC | ||||||||||||||
| CHINESE | ||||||||||||||
| JAPANESE | ||||||||||||||
| SLAV | ||||||||||||||
| FRENCH | ||||||||||||||
| ITALIAN | ||||||||||||||
| INDIAN | ||||||||||||||
| NORDIC | ||||||||||||||
| ARAB | ||||||||||||||
| DUTCH | ||||||||||||||
| KOREAN | ||||||||||||||
| UNKNOWN | ||||||||||||||
| OTHER | ||||||||||||||
| OVERALL | ||||||||||||||
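The three quantities defined in the caption (U, %F, and G = SSA) can be computed from two parallel lists of predicted labels. The following is a minimal sketch, assuming each authorship carries one Genni label and one SSA label coded as `'F'`, `'M'`, or `'U'`; the function name, the label coding, and the output keys are illustrative assumptions and are not taken from the paper.

```python
from typing import Dict, Sequence, Tuple


def compare_gender_labels(genni: Sequence[str], ssa: Sequence[str]) -> Dict[str, float]:
    """Summarise two parallel label lists; each label is 'F', 'M', or 'U' (assumed coding)."""

    def pct(num: int, den: int) -> float:
        # Percentage, or NaN when the denominator is empty.
        return 100.0 * num / den if den else float("nan")

    def unknown_and_female(labels: Sequence[str]) -> Tuple[float, float]:
        known = [g for g in labels if g in ("F", "M")]
        # U: share of all authorships labelled Unknown.
        u = pct(len(labels) - len(known), len(labels))
        # %F: share of female authorships among male and female authorships.
        f = pct(known.count("F"), len(known))
        return u, f

    genni_u, genni_f = unknown_and_female(genni)
    ssa_u, ssa_f = unknown_and_female(ssa)

    # G = SSA: among authorships where SSA predicts F or M,
    # the share whose Genni label is the same.
    ssa_known = [(g, s) for g, s in zip(genni, ssa) if s in ("F", "M")]
    agreement = pct(sum(g == s for g, s in ssa_known), len(ssa_known))

    return {
        "genni_pct_unknown": genni_u,
        "genni_pct_female": genni_f,
        "ssa_pct_unknown": ssa_u,
        "ssa_pct_female": ssa_f,
        "pct_ssa_matching_genni": agreement,
    }
```

Grouping the authorships by predicted ethnicity before calling the function would give one table row per ethnicity.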
Descriptions of all the explanatory features.
| Feature | Description |
|---|---|
| is the gender of the author in question as predicted by Genni 2.0 [ | |
| is the age of the author as measured by the number of papers published in the years prior to the article in question. | |
| is the ethnicity of the author as predicted by Ethnea [ | |
| is the country of affiliation of the first-listed author on the article in question, as inferred by MapAffil [ | |
| is the number of authors on the article in question, capped at 20. | |
| is 1 if the article was written in English as tagged in MEDLINE [ | |
| is the total number of references listed on the article in question. | |
| is the number of MeSH terms (and all their unique ancestors in the MeSH tree structure) as assigned in MEDLINE. | |
| is the number of prior papers for the youngest MeSH term assigned to the article in question (the so-called “volume novelty” in [ | |
| is the publication type(s) of the referenced article as tagged in MEDLINE [ | |
| is encoded by indicating whether the article and its reference were published in the same or similar journal as captured by the exact name match and the implicit journal score [ | |
| is the difference in publication years between the article in question and its reference | |
| is 1 if the referenced article was written in English as tagged in MEDLINE [ | |
| is the number of citations received by the referenced article prior to the citation in question. | |
| is the number of MeSH terms (and all their unique ancestors in the MeSH tree structure) assigned to the referenced article. | |
| is the number of prior papers for the youngest MeSH term assigned to the referenced article (the so-called “volume novelty” in [ | |
| is the publication type of the referenced article as tagged in MEDLINE [ |
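The MeSH count described above includes each assigned term together with all of its unique ancestors. One way to sketch this, assuming each assigned term is represented by a dot-separated MeSH tree number (so that every proper prefix is an ancestor node), is shown below. The function name is hypothetical, the sketch counts tree nodes rather than descriptors, and the paper does not state which of the two it counts.

```python
from typing import Iterable, Set


def mesh_count_with_ancestors(tree_numbers: Iterable[str]) -> int:
    """Count unique tree nodes covered by the given tree numbers and their ancestors."""
    nodes: Set[str] = set()
    for tree_number in tree_numbers:
        parts = tree_number.split(".")
        # Every prefix of a tree number is an ancestor node (or the node itself).
        for depth in range(1, len(parts) + 1):
            nodes.add(".".join(parts[:depth]))
    return len(nodes)


# Illustrative tree numbers: the shared ancestor "C04" is counted only once.
example_count = mesh_count_with_ancestors(["C04.557.337", "C04.588"])
```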
Distribution (in percentage) of 41.6 million references (from 1.6 million articles with 2 or more authors published during 2002-2005) across select categorical features.
| Features | First Author | Last Author |
|---|---|---|
Fig 1. Self-citation rates as functions of author age, as measured by prior publication count (top panels).
The horizontal lines show the overall self-citation rates. The bottom panels show the cumulative distributions of author age.
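The curves in Fig 1 amount to grouping references by the citing author's prior publication count and taking the share that are self-citations. A minimal sketch follows, assuming a reference counts as a self-citation when the disambiguated author identifier also appears on the referenced paper; the identifiers, function names, and input layout are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def is_self_citation(author_id: str, referenced_author_ids: Set[str]) -> bool:
    """Assumed definition: the citing author is also an author of the referenced paper."""
    return author_id in referenced_author_ids


def self_citation_rate_by_age(references: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Map prior publication count -> fraction of references that are self-citations.

    Each input item is (prior_publication_count, is_self_citation).
    """
    totals: Dict[int, int] = defaultdict(int)
    selfs: Dict[int, int] = defaultdict(int)
    for age, is_self in references:
        totals[age] += 1
        selfs[age] += int(is_self)
    return {age: selfs[age] / totals[age] for age in sorted(totals)}


def overall_rate(references: Iterable[Tuple[int, bool]]) -> float:
    """Overall self-citation rate (the horizontal line in the figure)."""
    refs = list(references)
    return sum(int(is_self) for _, is_self in refs) / len(refs) if refs else float("nan")
```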
Gender effects for selected journals using a simple model with author age (pub count) only.
| Category | Journal | First Author: references | First Author: gender effect | First Author: SE | First Author: p-val | Last Author: references | Last Author: gender effect | Last Author: SE | Last Author: p-val |
|---|---|---|---|---|---|---|---|---|---|
| Science | PNAS | 298,356 | -0.067 | 0.020 | 0.001 | 334,968 | -0.016 | 0.015 | 0.297 |
| Science | Ann N Y Acad Sci | 52,540 | -0.035 | 0.033 | 0.288 | 54,224 | 0.214 | 0.034 | 0.000 |
| Science | Nature | 46,138 | -0.162 | 0.054 | 0.003 | 50,625 | 0.028 | 0.044 | 0.531 |
| Science | Science | 41,328 | -0.154 | 0.053 | 0.004 | 45,043 | 0.066 | 0.041 | 0.107 |
| Biology | J Biol Chem | 676,859 | -0.095 | 0.013 | 0.000 | 758,553 | 0.009 | 0.010 | 0.348 |
| Biology | Biochemistry | 182,433 | -0.036 | 0.024 | 0.143 | 204,527 | 0.030 | 0.017 | 0.084 |
| Biology | J Virol | 155,081 | -0.063 | 0.025 | 0.013 | 172,265 | 0.024 | 0.018 | 0.185 |
| Biology | Biochim Biophys Acta | 104,434 | -0.028 | 0.029 | 0.344 | 110,138 | 0.003 | 0.025 | 0.915 |
| Biology | J Bacteriol | 102,012 | -0.020 | 0.032 | 0.535 | 109,737 | 0.082 | 0.022 | 0.000 |
| Biology | Nucleic Acids Res | 98,322 | -0.107 | 0.035 | 0.002 | 104,933 | -0.051 | 0.027 | 0.061 |
| Biology | FEBS Lett | 94,021 | -0.086 | 0.031 | 0.005 | 99,364 | 0.007 | 0.027 | 0.805 |
| Biology | Biochem J | 91,576 | -0.176 | 0.034 | 0.000 | 97,174 | -0.075 | 0.026 | 0.004 |
| Biology | Mol Cell | 33,538 | -0.090 | 0.067 | 0.182 | 39,524 | -0.203 | 0.045 | 0.000 |
| Biology | Cell | 32,399 | -0.200 | 0.076 | 0.008 | 37,042 | -0.089 | 0.050 | 0.073 |
| Biology | Adv Exp Med Biol | 21,625 | -0.081 | 0.053 | 0.124 | 22,072 | 0.101 | 0.056 | 0.070 |
| Biology | Bioinformatics | 20,756 | -0.080 | 0.103 | 0.437 | 23,014 | 0.009 | 0.082 | 0.912 |
| Medicine | J Immunol | 208,354 | -0.021 | 0.024 | 0.389 | 228,129 | -0.017 | 0.017 | 0.324 |
| Medicine | Blood | 140,887 | -0.041 | 0.028 | 0.140 | 152,394 | 0.000 | 0.022 | 0.984 |
| Medicine | Cancer Res | 131,329 | 0.056 | 0.029 | 0.057 | 149,313 | 0.051 | 0.022 | 0.018 |
| Medicine | Brain Res | 100,389 | -0.071 | 0.031 | 0.025 | 108,379 | 0.004 | 0.028 | 0.882 |
| Medicine | Circulation | 92,220 | -0.020 | 0.036 | 0.575 | 98,741 | -0.051 | 0.035 | 0.143 |
| Medicine | Clin Cancer Res | 91,687 | 0.057 | 0.035 | 0.101 | 96,551 | 0.050 | 0.031 | 0.113 |
| Medicine | J Clin Oncol | 61,722 | 0.069 | 0.041 | 0.093 | 62,049 | 0.056 | 0.041 | 0.176 |
| Medicine | J Am Coll Cardiol | 50,523 | -0.070 | 0.061 | 0.248 | 52,579 | 0.124 | 0.057 | 0.030 |
| Medicine | J Urol | 49,314 | 0.274 | 0.063 | 0.000 | 50,379 | 0.099 | 0.063 | 0.113 |
| Medicine | JAMA | 33,674 | 0.164 | 0.050 | 0.001 | 34,651 | -0.012 | 0.055 | 0.825 |
| Medicine | Gut | 31,027 | 0.000 | 0.065 | 1.000 | 32,663 | 0.028 | 0.069 | 0.683 |
| Medicine | N Engl J Med | 30,970 | 0.068 | 0.061 | 0.270 | 31,971 | 0.030 | 0.065 | 0.637 |
| Medicine | Lancet | 26,344 | -0.074 | 0.059 | 0.209 | 26,301 | -0.061 | 0.065 | 0.348 |
| Medicine | BMJ | 23,987 | 0.170 | 0.065 | 0.009 | 24,438 | 0.123 | 0.073 | 0.091 |
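The p-values in this table appear consistent with a two-sided Wald test on each estimate and its standard error, although the source does not name the test. The sketch below works under that assumption; the function name is hypothetical, and because the tabulated estimates and standard errors are rounded, recomputed p-values will only approximately match the printed ones.

```python
import math


def wald_p_value(estimate: float, se: float) -> float:
    """Two-sided p-value for H0: coefficient = 0, assuming a normal (Wald) test."""
    z = estimate / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal CDF Phi.
    return math.erfc(abs(z) / math.sqrt(2.0))


# Nature, first author row: estimate -0.162 with SE 0.054 gives z = -3.0.
p_nature_first = wald_p_value(-0.162, 0.054)
```

A coefficient of exactly zero gives a p-value of 1, which agrees with the first-author entry for Gut.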
Models of self-citation behavior of first and last authors based on 41.6 million references from 1.6 million articles with 2 or more authors published during 2002-2005.
| Predictor | First author effects, simple models | First author effects, complete model | Last author effects, complete model |
|---|---|---|---|
| Intercept | — | ||
| Female | |||
| Male | |||
| ARAB | |||
| CHINESE | |||
| DUTCH | |||
| FRENCH | |||
| GERMAN | |||
| HISPANIC | |||
| INDIAN | |||
| ITALIAN | |||
| JAPANESE | |||
| KOREAN | |||
| NORDIC | |||
| OTHER | |||
| SLAV | |||
| UNKNOWN | |||
| Australia | |||
| Canada | |||
| China | |||
| France | |||
| Germany | |||
| India | |||
| Italy | |||
| Japan | |||
| Netherlands | |||
| Other | |||
| Spain | |||
| Sweden | |||
| UK | |||
| Unknown | |||
| ref. English | |||
| English | |||
| count = 1 | |||
| ref = Case Report | |||
| ref = Journal | |||
| ref = Letter | |||
| ref = Review | |||
| Case Report | |||
| Journal | |||
| Letter | |||
| Review | |||
| same journal | |||
| count = 0 |
†Format: logit (SE) signif., where X = p ≥ 0.05, * = p < 0.05, . = p < 0.01, and p < 0.001 otherwise.
‡Each category represents a simple model (with only one predictor); intercepts not shown.
Fig 2. Change in effect of gender at each model-fitting step.
The sub-figures show the contribution of gender at each step in the iterative process of fitting and evaluating combinations of factors; only the model at the final step is the best-fitting among them. In both models, confounding factors ultimately minimize the effect of gender in self-citation; the most influential of them is author’s publication count (note Table 6). Y-axis is on log scale.
Fit statistics for individual and accretive models of self-citation based on 41.6 million references from 1.6 million articles with 2 or more authors published during 2002-2005.
The best-performing model at each step is the one with the largest log-likelihood (LL); from step 2 onward, only the highest-ranking model is shown. Each model comprises the predictors from the best-performing models in all previous steps along with the newly added category indicated by the plus sign (+). AUC (Area Under the receiver operating characteristic Curve), given as a percentage, roughly measures the accuracy of estimated probabilities. nf denotes the number of terms in the model, excluding the intercept.
| Step | First authors: Feature | First authors: LL (10^5) | First authors: nf | First authors: AUC | Last authors: Feature | Last authors: LL (10^5) | Last authors: nf | Last authors: AUC |
|---|---|---|---|---|---|---|---|---|
| 1 | ref. citation count | | | | time lag | | | |
| 1 | age (pub count) | | | | ref. citation count | | | |
| 1 | time lag | | | | age (pub count) | | | |
| 1 | venue | | | | venue | | | |
| 1 | pub type | | | | pub type | | | |
| 1 | reference count | | | | reference count | | | |
| 1 | gender | | | | MeSH count | | | |
| 1 | MeSH count | | | | country | | | |
| 1 | ethnicity | | | | ethnicity | | | |
| 1 | country | | | | gender | | | |
| 1 | novelty | | | | novelty | | | |
| 1 | author count | | | | language | | | |
| 1 | language | | | | author count | | | |
| 2 | + age (pub count) | | | | + age (pub count) | | | |
| 3 | + pub type | | | | + pub type | | | |
| 4 | + time lag | | | | + venue | | | |
| 5 | + venue | | | | + ref. citation count | | | |
| 6 | + reference count | | | | + reference count | | | |
| 7 | + country | | | | + country | | | |
| 8 | + novelty | | | | + novelty | | | |
| 9 | + language | | | | + author count | | | |
| 10 | + author count | | | | + language | | | |
| 11 | + ethnicity | | | | + MeSH count | | | |
| 12 | + MeSH count | | | | + ethnicity | | | |
| 13 | + gender | | | | + gender | | | |
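The accretive procedure described in the caption resembles greedy forward selection: at each step, every remaining feature is added in turn to the current model, and the one giving the largest log-likelihood is kept. The sketch below illustrates that idea only. The `log_likelihood` callable is a stand-in for fitting a logistic regression on the chosen features, and both it and the function name are assumptions rather than the authors' code.

```python
from typing import Callable, List, Sequence, Tuple


def forward_select(
    candidates: Sequence[str],
    log_likelihood: Callable[[List[str]], float],
) -> List[Tuple[str, float]]:
    """Greedy forward selection; returns (added feature, model LL) for each step."""
    selected: List[str] = []
    remaining = list(candidates)
    history: List[Tuple[str, float]] = []
    while remaining:
        # Score every one-feature extension of the current model.
        scored = [(log_likelihood(selected + [feature]), feature) for feature in remaining]
        best_ll, best_feature = max(scored)
        selected.append(best_feature)
        remaining.remove(best_feature)
        history.append((best_feature, best_ll))
    return history


# Toy scorer with made-up additive gains, used only to exercise the loop.
TOY_GAINS = {"age (pub count)": 5.0, "venue": 3.0, "gender": 0.1}


def toy_log_likelihood(features: List[str]) -> float:
    return -100.0 + sum(TOY_GAINS[f] for f in features)


steps = forward_select(list(TOY_GAINS), toy_log_likelihood)
```

With the toy scorer, the feature with the smallest gain enters last, which mirrors gender entering at step 13 in both columns of the table.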
Fig 3. Change in the odds of self-citation, relative to the reference values given in parentheses, for select predictors in the first- and last-author models.
Shaded regions indicate 95% confidence intervals. Y-axis is on log scale.
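Because the models report effects as logits with standard errors, a change in odds and its 95% confidence interval can be obtained by exponentiating the estimate and its normal-approximation bounds. The following is a minimal sketch under that assumption; the function name and the default multiplier of 1.96 are illustrative choices, not details stated in the source.

```python
import math
from typing import Tuple


def odds_ratio_ci(logit: float, se: float, z: float = 1.96) -> Tuple[float, float, float]:
    """Return (odds ratio, lower bound, upper bound) from a logit coefficient and its SE."""
    # The interval is symmetric on the logit scale and asymmetric on the odds scale,
    # which is one reason such plots are usually drawn with a log axis.
    return (
        math.exp(logit),
        math.exp(logit - z * se),
        math.exp(logit + z * se),
    )


# Example using an estimate of 0.274 with SE 0.063.
odds, lower, upper = odds_ratio_ci(0.274, 0.063)
```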
Fig 4. Change in the odds of self-citation, relative to the stated reference values, for select predictors in the first- and last-author models.
Error bars indicate 95% confidence intervals. Among other points of interest, the likelihood of self-citation is lowest for last authors with a non-USA affiliation, which suggests that self-citing is more customary among USA-affiliated authors. X-axis is on log scale.
Comparison of full model (based on all 41.6 million references from 1.6 million articles with 2 or more authors published during 2002-2005) with filtered models (26.2 million references for first authors, and 27.5 million for last authors).
| Predictor | First Author: Full | First Author: Filtered | Last Author: Full | Last Author: Filtered |
|---|---|---|---|---|
| Intercept | | | | |
| Journal | | - | | - |
| Review | | - | | - |
| Case Report | | - | | - |
| Letter | | - | | - |
| ref = Journal | | | | |
| ref = Review | | | | |
| ref = Case Report | | | | |
| ref = Letter | | | | |
| same journal | | | | |
| | | - | | - |
| Unknown | | - | | - |
| UK | | - | | - |
| Japan | | - | | - |
| Germany | | - | | - |
| France | | - | | - |
| Italy | | - | | - |
| Canada | | - | | - |
| China | | - | | - |
| Australia | | - | | - |
| Spain | | - | | - |
| Netherlands | | - | | - |
| Sweden | | - | | - |
| India | | - | | - |
| Other | | - | | - |
| English | | - | | - |
| ref. English | | | | |
| GERMAN | | - | | - |
| HISPANIC | | - | | - |
| CHINESE | | - | | - |
| JAPANESE | | - | | - |
| SLAV | | - | | - |
| FRENCH | | - | | - |
| ITALIAN | | - | | - |
| INDIAN | | - | | - |
| NORDIC | | - | | - |
| ARAB | | - | | - |
| DUTCH | | - | | - |
| KOREAN | | - | | - |
| UNKNOWN | | - | | - |
| OTHER | | - | | - |
| Female | | | | |
| Male | | | | |
†Format: logit (SE) signif., where X = p ≥ 0.05, * = p < 0.05, . = p < 0.01, and p < 0.001 otherwise.
Percentage of authorships, on the 1.6 million articles with 2 or more authors published between 2002 and 2005, by authors who (a) started, (b) ended, and (c) both started and ended their career during this period.
Note that career start and end years were determined based on the full 2009 Author-ity dataset.
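| Ethnicity | First Author: Started (U) | First Author: Started (F) | First Author: Started (M) | First Author: Ended (U) | First Author: Ended (F) | First Author: Ended (M) | First Author: Started and ended (U) | First Author: Started and ended (F) | First Author: Started and ended (M) | Last Author: Started (U) | Last Author: Started (F) | Last Author: Started (M) | Last Author: Ended (U) | Last Author: Ended (F) | Last Author: Ended (M) | Last Author: Started and ended (U) | Last Author: Started and ended (F) | Last Author: Started and ended (M) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|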
| ENGLISH | ||||||||||||||||||
| GERMAN | ||||||||||||||||||
| HISPANIC | ||||||||||||||||||
| CHINESE | ||||||||||||||||||
| JAPANESE | ||||||||||||||||||
| SLAV | ||||||||||||||||||
| FRENCH | ||||||||||||||||||
| ITALIAN | ||||||||||||||||||
| INDIAN | ||||||||||||||||||
| NORDIC | ||||||||||||||||||
| ARAB | ||||||||||||||||||
| DUTCH | ||||||||||||||||||
| KOREAN | ||||||||||||||||||
| UNKNOWN | ||||||||||||||||||
| OTHER | ||||||||||||||||||
| OVERALL | ||||||||||||||||||
Fig 5. Author expertise as a function of prior publication count.
Expertise of an author on a given paper is measured by the proportion of subjects (MeSH; a paper typically has a dozen or so terms) on which the author has previously published. Expertise naturally grows with age but never reaches 100% because authors tend to publish on some topics that are new to them.
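The expertise measure described in this caption is a simple set proportion. A minimal sketch, assuming the paper's MeSH terms and the author's previously used MeSH terms are both available as sets of strings (the function name and the handling of a paper with no terms are assumptions):

```python
from typing import Iterable


def expertise(paper_mesh_terms: Iterable[str], prior_mesh_terms: Iterable[str]) -> float:
    """Proportion of the paper's MeSH terms on which the author has previously published."""
    paper = set(paper_mesh_terms)
    if not paper:
        # Assumed convention: a paper without MeSH terms gets expertise 0.
        return 0.0
    return len(paper & set(prior_mesh_terms)) / len(paper)
```

An author with no prior papers has an empty set of prior terms and therefore an expertise of 0, while any term new to the author keeps the value below 1.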