Erjia Yan, Jake Williams, Zheng Chen.
Abstract
Publication metadata help deliver rich analyses of scholarly communication. However, research concepts and ideas are more effectively expressed through unstructured fields such as full texts. Thus, the goals of this paper are to employ a full-text enabled method to extract terms relevant to disciplinary vocabularies, and through them, to understand the relationships between disciplines. This paper uses an efficient, domain-independent term extraction method to extract disciplinary vocabularies from a large multidisciplinary corpus of PLoS ONE publications. It finds a power-law pattern in the frequency distributions of terms present in each discipline, indicating a semantic richness potentially sufficient for further study and advanced analysis. The salient relationships amongst these vocabularies become apparent through the application of a principal component analysis. For example, Mathematics and Computer and Information Sciences were found to have similar vocabulary use patterns along with Engineering and Physics; while Chemistry and the Social Sciences were found to exhibit contrasting vocabulary use patterns along with the Earth Sciences and Chemistry. These results have implications for studies of scholarly communication as scholars attempt to identify the epistemological cultures of disciplines, and as a full text-based methodology could lead to machine learning applications in the automated classification of scholarly work according to disciplinary vocabularies.
Year: 2017 PMID: 29186141 PMCID: PMC5706669 DOI: 10.1371/journal.pone.0187762
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
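The power-law pattern mentioned in the abstract can be checked roughly by fitting a straight line to log(frequency) against log(rank). The following is a minimal sketch, not the authors' procedure: the function name `zipf_exponent` and the synthetic counts are assumptions made for illustration, and a least-squares fit on log-log values is only a coarse diagnostic.

```python
import math
from typing import List

def zipf_exponent(frequencies: List[float]) -> float:
    """Estimate s in freq ~ rank**(-s) by least squares on log-log values.

    All frequencies must be positive.
    """
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for freq in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    # The fitted slope is negative for a decaying distribution; return its magnitude.
    return -covariance / variance

# Synthetic counts following freq = 1000 / rank (illustrative, not corpus data).
counts = [1000 / rank for rank in range(1, 51)]
print(zipf_exponent(counts))
```

An estimate near 1 would correspond to the classic Zipf shape; real term counts would need to be substituted for the synthetic list.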
Examples of extracted terms using the defined POS pattern.
| POS tag sequence | Extracted terms |
|---|---|
| JJ NN NN | mesenchymal stem cell |
| NN JJ NN | mouse embryonic fibroblast |
| NN IN NN | mutation in gene |
| VBG NN | cold-seeking response |
| NN VBN NN | hiv-2 uninfected individual |
| NN JJ NN | brand new innovation |
| NN IN NN | people with history |
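The table above suggests that candidate terms are token spans whose POS tags match a fixed set of sequences. As a minimal sketch under that assumption, the matcher below takes already-tagged tokens and returns the matching spans. The function name `extract_terms` and the hand-tagged sentence are hypothetical, and the pattern list contains only the sequences shown in the table, which may not be the complete pattern used in the paper.

```python
from typing import List, Tuple

# POS tag sequences copied from the table above.
PATTERNS = [
    ("JJ", "NN", "NN"),
    ("NN", "JJ", "NN"),
    ("NN", "IN", "NN"),
    ("VBG", "NN"),
    ("NN", "VBN", "NN"),
]

def extract_terms(tagged: List[Tuple[str, str]]) -> List[str]:
    """Return every token span whose tag sequence equals one of PATTERNS."""
    tags = [tag for _, tag in tagged]
    terms = []
    for start in range(len(tagged)):
        for pattern in PATTERNS:
            end = start + len(pattern)
            # A slice running past the end is shorter than the pattern, so it never matches.
            if tuple(tags[start:end]) == pattern:
                terms.append(" ".join(word for word, _ in tagged[start:end]))
    return terms

# Hand-tagged example; the tags are illustrative, not the output of a real tagger.
sentence = [
    ("the", "DT"), ("mesenchymal", "JJ"), ("stem", "NN"),
    ("cell", "NN"), ("divides", "VBZ"),
]
print(extract_terms(sentence))
```

In practice the `(word, tag)` pairs would come from a POS tagger, and the extracted candidates would then be scored, for instance with the C-Value measure named in the evaluation table.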
Evaluation results.
| | C-Value | W1 Weighted | W1 W2 Weighted |
|---|---|---|---|
| Precision | 0.8170 | 0.9390 | 0.9520 |
| Recall | 0.4230 | 0.4670 | 0.4990 |
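Precision and recall are often summarized by their harmonic mean (F1). The source reports no F1 values, so the sketch below is only a convenience for comparing the three columns; the helper `f1_score` and the dictionary `RESULTS` are names introduced here, and the numbers are copied from the table above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the evaluation table.
RESULTS = {
    "C-Value": (0.8170, 0.4230),
    "W1 Weighted": (0.9390, 0.4670),
    "W1 W2 Weighted": (0.9520, 0.4990),
}

for method, (precision, recall) in RESULTS.items():
    print(f"{method}: F1 = {f1_score(precision, recall):.4f}")
```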
Reclassification of PLoS subjects.
| No. | Reclassified disciplines | Original disciplines |
|---|---|---|
| 1 | Agriculture | Agriculture |
| 2 | Biology | Biology; Biology and life sciences |
| 3 | Chemistry | Chemistry |
| 4 | Computer and Information Sciences | Computer science; Computer and information sciences |
| 5 | Earth Sciences | Earth sciences |
| 6 | Ecology and Environmental Sciences | Ecology and environmental sciences |
| 7 | Engineering | Engineering and technology; Engineering; Materials science |
| 8 | Mathematics | Mathematics |
| 9 | Medicine and Health Sciences | Medicine and health sciences; Medicine |
| 10 | Physics | Physics; Astronomical sciences; Physical sciences |
| 11 | Research and Analysis Methods | Research and analysis methods |
| 12 | Social Sciences | Social sciences; Social and behavioral sciences |
Descriptive statistics of the 12 disciplines.
| Disciplines | No. of publications | No. of unique terms | Terms per paper (tpp) |
|---|---|---|---|
| Agriculture | 2,047 | 42,005 | 20.52 |
| Biology | 41,136 | 497,492 | 12.09 |
| Chemistry | 12,530 | 195,853 | 15.63 |
| Computer and Information Sciences | 8,356 | 67,296 | 8.05 |
| Earth Sciences | 1,676 | 29,971 | 17.88 |
| Ecology and Environmental Sciences | 7,638 | 117,838 | 15.43 |
| Engineering | 2,866 | 102,327 | 35.70 |
| Mathematics | 2,793 | 53,372 | 19.11 |
| Medicine and Health Sciences | 25,068 | 368,896 | 14.72 |
| Physics | 4,773 | 87,349 | 18.30 |
| Research and Analysis Methods | 3,089 | 33,764 | 10.93 |
| Social Sciences | 7,514 | 67,328 | 8.96 |
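The "Terms per paper (tpp)" column appears to be the number of unique terms divided by the number of publications. The short check below, with a hypothetical helper `terms_per_paper`, recomputes the ratio for three rows copied from the table so the relationship can be confirmed against the reported values.

```python
# (publications, unique terms, reported tpp) copied from three rows of the table above.
ROWS = {
    "Agriculture": (2047, 42005, 20.52),
    "Engineering": (2866, 102327, 35.70),
    "Social Sciences": (7514, 67328, 8.96),
}

def terms_per_paper(publications: int, unique_terms: int) -> float:
    """Unique terms divided by publications, rounded to two decimals."""
    return round(unique_terms / publications, 2)

for name, (publications, unique_terms, reported) in ROWS.items():
    print(name, terms_per_paper(publications, unique_terms), reported)
```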
Top 10 terms of each discipline.
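| Rank | Agriculture | Biology | Chemistry | Computer and Information Sciences | Earth Sciences | Ecology and Environmental Sciences |
|---|---|---|---|---|---|---|
| 1 | seedling | chromatin | SDS-PAGE | algorithm | ecosystem | biodiversity |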
| 2 | molecular marker | chromatin immunoprecipitation | mutant protein | node | environmental variable | predation |
| 3 | transcriptome | wnt | phospholipid | dataset | biomass | habitat |
| 4 | cm | zebrafish | cysteine | database | biodiversity | ecosystem |
| 5 | nutrient | histone | ion | functional annotation | nutrient | environmental variable |
| 6 | genetic diversity | dorsal | fusion protein | simulation | habitat | biomass |
| 7 | functional annotation | transcriptional regulator | recombinant protein | matrix | predation | conservation |
| 8 | genome sequence | chromosome | protease | equation | China | nutrient |
| 9 | biomass | phylogenetic analysis | tyrosine | dynamics | gradient | mammal |
| 10 | cellular component | recombination | proteasome | parameter | cm | genetic diversity |
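| Rank | Engineering | Mathematics | Medicine and Health Sciences | Physics | Research and Analysis Methods | Social Sciences |
|---|---|---|---|---|---|---|
| 1 | seedling | equation | CD8 T cell | equation | heterogeneity | SD |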
| 2 | laser | algorithm | CD4 | dynamics | follow-up | variable |
| 3 | algorithm | dynamics | CD4 T cell | orientation | meta-analysis | evaluation |
| 4 | electron | node | T cell | voltage | injection | sensitivity |
| 5 | ph | parameter | follow-up | ligand | questionnaire | ANOVA |
| 6 | fusion protein | dataset | HIV | microtubule | diagnosis | questionnaire |
| 7 | voltage | simulation | morbidity | simulation | diabetes mellitus | impact |
| 8 | ml | heterogeneity | IL-10 | ion | consensus | meta-analysis |
| 9 | gel | regression coefficient | vaccination | node | metabolic syndrome | respondent |
| 10 | PCR amplification | matrix | diagnosis | radiation | OR | feedback |
Fig 1. The distribution of terms in disciplines.
Fig 2. The rank-frequency distribution of terms in documents.
Fig 3. The rank-frequency distribution of terms in documents for 12 disciplines.
Fig 4. The plot of the first two components of the principal component analysis.
Fig 5. The plot of the first component of the principal component analysis.
Fig 6. Stacked loadings of the first two components.
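The principal component analysis behind Figs 4 to 6 can be illustrated on a toy discipline-by-term matrix. The sketch below is an assumption-laden illustration, not the authors' pipeline: the matrix values are invented, the function names are hypothetical, and the first component is approximated by power iteration on the covariance matrix rather than by a full eigendecomposition.

```python
import math
from typing import List

def first_principal_axis(rows: List[List[float]], iterations: int = 200) -> List[float]:
    """Approximate the first principal axis of `rows` by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix of the columns (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Repeated multiplication converges to the eigenvector with the largest
    # eigenvalue, provided the starting vector is not orthogonal to it.
    axis = [1.0] * d
    for _ in range(iterations):
        product = [sum(cov[i][j] * axis[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in product))
        axis = [x / norm for x in product]
    return axis

def project(rows: List[List[float]], axis: List[float]) -> List[float]:
    """Score each row on the given axis after centering the columns."""
    n, d = len(rows), len(axis)
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [sum((row[j] - means[j]) * axis[j] for j in range(d)) for row in rows]

# Invented term counts: rows are four hypothetical disciplines, columns are three terms.
toy_matrix = [
    [10.0, 9.0, 1.0],
    [9.0, 10.0, 1.0],
    [1.0, 2.0, 10.0],
    [2.0, 1.0, 9.0],
]
axis = first_principal_axis(toy_matrix)
scores = project(toy_matrix, axis)
print(scores)
```

Rows with similar term usage receive scores of the same sign on the first component, while rows with contrasting usage fall on opposite sides, which is the kind of separation the component plots are meant to show.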