Julian Sienkiewicz, Eduardo G Altmann.
Abstract
We investigate how textual properties of scientific papers relate to the number of citations they receive. Our main finding is that correlations are nonlinear and affect differently the most cited and typical papers. For instance, we find that, in most journals, short titles correlate positively with citations only for the most cited papers, whereas for typical papers, the correlation is usually negative. Our analysis of six different factors, calculated both at the title and abstract level of 4.3 million papers in over 1500 journals, reveals the number of authors, and the length and complexity of the abstract, as having the strongest (positive) influence on the number of citations.
Keywords: citation analysis; quantile regression; sentiment analysis
Year: 2016 PMID: 27429773 PMCID: PMC4929908 DOI: 10.1098/rsos.160140
Source DB: PubMed Journal: R Soc Open Sci ISSN: 2054-5703 Impact factor: 2.963
List of textual factors whose relation to citations we investigate in our paper. Whenever possible, factors are obtained on the title and abstract of a paper. See §§4.2 and 4.3 for exact definitions. Additionally, we consider the number of authors (motivated by previous studies, e.g. [7,13]).
| property | title | abstract |
|---|---|---|
| length | number of characters | number of words |
| complexity | — | Gunning fog index |
| | Herdan's C | Herdan's C |
| sentiment | valence | valence |
| | arousal | arousal |
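As a rough illustration of one complexity factor, the sketch below computes Herdan's C in its common textbook form, C = log V / log N, where V is the number of distinct words and N the total number of words. This is an assumption: the exact definition used in the paper is given in §4.2, which is not reproduced here, and the whitespace tokenizer and the function name `herdan_c` are hypothetical choices made for this example.

```python
import math


def herdan_c(text):
    """Herdan's C = log(V) / log(N) for a piece of text.

    V = number of distinct words, N = total number of words.
    Tokenization (lowercase + whitespace split) is an assumption of this sketch.
    """
    tokens = text.lower().split()
    n_tokens = len(tokens)
    n_types = len(set(tokens))
    if n_tokens < 2:
        raise ValueError("need at least two words to compute Herdan's C")
    return math.log(n_types) / math.log(n_tokens)


# Hypothetical example text: 11 words, 5 of them distinct.
example_c = herdan_c("the cat saw the dog and the dog saw the cat")
```

Values close to 1 indicate that almost every word is new, while lower values indicate more repetition.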
Figure 1. Relation between different factors and the number of citations in two journals: Science (top) and Nature Genetics (bottom). Left-side plots: each black dot corresponds to one paper, and lines show quantile regression (QR) results for colour-coded quantiles τ={0.02,0.04,…,0.98}. Middle panels: β coefficients (slopes of QR in the left panel) as a function of quantile τ. The red arrows (summary pointers) show βlow≡β(τ=0.02), βhalf≡β(τ=0.5) and βtop≡β(τ=0.98), as, respectively, the nock, a circle on the shaft, and the head of the arrow. Right panels: summary pointers for all factors.
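The summary pointers can be illustrated with a minimal, dependency-free sketch of quantile regression for a single factor. It relies on the fact that a line minimizing the pinball (check) loss can always be found among the lines passing through two data points, so for small samples it is enough to try every pair. The function names and the toy data are hypothetical and are not taken from the paper; a real analysis would use a dedicated quantile-regression routine and the paper's own variable definitions.

```python
from itertools import combinations


def pinball_loss(residuals, tau):
    """Check loss of quantile regression: tau*r for r >= 0, (tau-1)*r for r < 0."""
    return sum(r * tau if r >= 0 else r * (tau - 1) for r in residuals)


def fit_quantile_line(xs, ys, tau):
    """Return (intercept, slope) minimizing the pinball loss for quantile tau.

    Brute force over all lines through two data points (O(n^3)),
    which is fine for a small illustrative sample.
    """
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # a vertical line is not a valid regression line
        slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
        intercept = ys[i] - slope * xs[i]
        loss = pinball_loss(
            [y - (intercept + slope * x) for x, y in zip(xs, ys)], tau
        )
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    if best is None:
        raise ValueError("need at least two points with different x values")
    return best[1], best[2]


# Hypothetical fan-shaped data: at each x the responses are x, 2x, ..., 5x,
# so low, median and high quantiles have different slopes.
xs = [x for x in range(1, 7) for _ in range(5)]
ys = [x * k for x in range(1, 7) for k in range(1, 6)]

# Summary pointer: slopes at the three quantiles named in the caption.
pointer = {
    "beta_low": fit_quantile_line(xs, ys, 0.02)[1],
    "beta_half": fit_quantile_line(xs, ys, 0.50)[1],
    "beta_top": fit_quantile_line(xs, ys, 0.98)[1],
}
```

In this toy data the slope grows with τ, which is the kind of difference between typical and top papers that the summary pointers are designed to show.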
Figure 2. Strength of factors calculated over all journals. Box-plots (see definition on the right) summarize the distribution of βhalf values across different journals. Influential factors are identified as those for which |βhalf| is large for almost all journals (e.g. when the box does not contain the βhalf=0 line, at least 75% of the journals have βhalf above zero, or at least 75% have it below zero).
Factors often affect top and typical papers differently. Percentages of journals for which βtop>βhalf and for which βtop<βhalf are reported. The right column, βtop≠βhalf, is the sum of the two others.
| property | factor | βtop>βhalf | βtop<βhalf | βtop≠βhalf |
|---|---|---|---|---|
| length | no. characters (title) | 2.6 | 44.4 | 47.0 |
| | no. words (abstract) | 8.3 | 29.4 | 36.7 |
| | mean | | | 41.9 |
| complexity | Herdan's C | 18.7 | 8.5 | 27.2 |
| | Herdan's C | 34.9 | 6.5 | 41.4 |
| | | 8.3 | 16.7 | 25.0 |
| | | 24.6 | 7.7 | 32.3 |
| | fog index (abstract) | 26.4 | 8.0 | 34.4 |
| | mean | | | 32.0 |
| sentiment | arousal (title) | 11.0 | 13.5 | 24.5 |
| | arousal (abstract) | 15.7 | 13.7 | 29.4 |
| | valence (title) | 16.1 | 11.3 | 27.4 |
| | valence (abstract) | 29.2 | 5.7 | 34.9 |
| | mean | | | 29.1 |
| no. authors | | 4.0 | 39.6 | 43.6 |
| overall mean | | | | 33.7 |
Percentage of journals with positive βlow, βhalf and βtop for each factor. All values are statistically significant (p<0.001) except for those marked with an asterisk (see §4.5).
| property | factor | βlow | βhalf | βtop |
|---|---|---|---|---|
| length | no. characters (title) | 71.4 | 56.2 | 27.7 |
| | no. words (abstract) | 96.5 | 96.7 | 83.4 |
| complexity | Herdan's C | 50.1* | 56.7 | 62.4 |
| | Herdan's C | 19.4 | 28.1 | 51.2* |
| | | 62.0 | 58.2 | 47.9* |
| | | 71.3 | 82.1 | 81.0 |
| | fog index (abstract) | 62.9 | 68.2 | 72.9 |
| sentiment | arousal (title) | 56.5 | 61.6 | 58.3 |
| | arousal (abstract) | 62.8 | 67.8 | 61.9 |
| | valence (title) | 42.9 | 43.1 | 49.5* |
| | valence (abstract) | 42.0 | 47.6* | 63.4 |
| no. authors | | 93.4 | 92.5 | 61.6 |
Figure 3. Summary pointers show βlow, βhalf and βtop for two factors: number of title characters (top) and valence in abstract (bottom) (see figure 1 for the definition of summary pointers). Journals are grouped according to the OECD bibliographic categories (see footnote 3). The eight journals with the highest IF in each category are shown (six for other natural sciences). The categories are sorted with respect to the number of positive βhalf values. Testing the null hypothesis that categories are randomly attributed to journals (we compare the average standard deviation within categories with that obtained under a random attribution of categories to journals) yields p=0.002 for title length and p<10⁻⁸ for valence in abstract. The same procedure performed for IF (by creating 12 categories according to decreasing IF) gives p=0.02 for title length and p<10⁻⁵ for valence in abstract, suggesting that the scientific category exerts the stronger influence.
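The randomization test described in the caption can be sketched as a permutation test: compute the average within-category standard deviation of a coefficient, then count how often a random reassignment of categories to journals gives a value at least as small. The sketch below is one plausible reading of that procedure, not the authors' code; the function names, the number of permutations, the add-one correction and the toy β values are all assumptions made for illustration.

```python
import random
import statistics


def mean_within_sd(values, labels):
    """Average, over categories, of the standard deviation within each category."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return statistics.mean([statistics.pstdev(g) for g in groups.values()])


def permutation_p_value(values, labels, n_perm=2000, seed=0):
    """Share of random label assignments whose within-category spread is
    at least as small as the observed one (with an add-one correction)."""
    rng = random.Random(seed)
    observed = mean_within_sd(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random attribution of categories to journals
        if mean_within_sd(values, shuffled) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical beta_half values for 12 journals in 3 made-up categories.
betas = [0.10, 0.12, 0.11, 0.13,
         0.50, 0.52, 0.49, 0.51,
         -0.30, -0.28, -0.31, -0.29]
categories = ["A"] * 4 + ["B"] * 4 + ["C"] * 4

p_value = permutation_p_value(betas, categories)
```

A small p-value indicates that journals in the same category have more similar coefficients than expected by chance; the smallest value this estimate can report is 1/(n_perm+1), so resolving very small p-values requires many more permutations than the 2000 used here.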