Go Eun Heo, Keun Young Kang, Min Song, Jeong-Hoon Lee.
Abstract
BACKGROUND: Bioinformatics is an interdisciplinary field at the intersection of molecular biology and computing technology. To characterize the field as a convergent domain, researchers have used bibliometrics, augmented with text-mining techniques for content analysis. In previous studies, Latent Dirichlet Allocation (LDA) was the most representative topic modeling technique for identifying the topic structure of subject areas. However, LDA displays only the simple topic structure; it does not reveal how topics relate to metadata such as authors, publication dates, and journals.
Keywords: ACT model; Bioinformatics; Keyphrase extraction; Text mining; Topic modeling
Year: 2017 PMID: 28617229 PMCID: PMC5471940 DOI: 10.1186/s12859-017-1640-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Research workflow. The workflow of our approach consists of data collection, preprocessing, keyphrase extraction, ACT model application, and topic analysis
Statistics of collected publications
| Ranking | Journal Name | Number of Papers | Ratio (%) |
|---|---|---|---|
| 1 | Biochemistry | 62,270 | 25.78 |
| 2 | Journal of Molecular Biology | 29,968 | 12.41 |
| 3 | The EMBO Journal | 17,296 | 7.16 |
| 4 | Journal of Theoretical Biology | 12,200 | 5.05 |
| 5 | Bioinformatics | 9,847 | 4.08 |
| 6 | Human Molecular Genetics | 9,347 | 3.87 |
| 7 | Genomics | 8,316 | 3.44 |
| 8 | BMC Genomics | 7,741 | 3.20 |
| 9 | BMC Bioinformatics | 6,780 | 2.81 |
| 10 | Protein Science: A Publication of the Protein Society | 6,047 | 2.50 |
| 11 | Journal of Proteome Research | 5,575 | 2.31 |
| 12 | Proteomics | 5,545 | 2.30 |
| 13 | Journal of Biotechnology | 5,204 | 2.15 |
| 14 | PLOS Genetics | 5,139 | 2.13 |
| 15 | PLOS Computational Biology | 3,852 | 1.59 |
| 16 | BMC Research Notes | 3,743 | 1.55 |
| 17 | Mammalian Genome | 3,499 | 1.45 |
| 18 | Genome Biology | 3,411 | 1.41 |
| 19 | PLOS Biology | 3,280 | 1.36 |
| 20 | Trends in Biochemical Sciences | 3,171 | 1.31 |
| 21 | Trends in Genetics | 3,035 | 1.26 |
| 22 | Journal of Molecular Modeling | 2,852 | 1.18 |
| 23 | Molecular & Cellular Proteomics: MCP | 2,796 | 1.16 |
| 24 | Trends in Biotechnology | 2,353 | 0.97 |
| 25 | Bulletin of Mathematical Biology | 2,331 | 0.96 |
| 26 | Journal of Proteomics | 2,158 | 0.89 |
| 27 | Physiological Genomics | 1,794 | 0.74 |
| 28 | Journal of Computer-Aided Molecular Design | 1,706 | 0.71 |
| 29 | BMC Systems Biology | 1,397 | 0.58 |
| 30 | Bioinformation | 1,297 | 0.54 |
| 31 | Pharmacogenetics and Genomics | 1,072 | 0.44 |
| 32 | Statistical Methods in Medical Research | 976 | 0.40 |
| 33 | Journal of Computational Neuroscience | 925 | 0.38 |
| 34 | Molecular Systems Biology | 822 | 0.34 |
| 35 | Genome Medicine | 676 | 0.28 |
| 36 | Theoretical Biology and Medical Modeling | 498 | 0.21 |
| 37 | Comparative and Functional Genomics | 466 | 0.19 |
| 38 | Neuroinformatics | 385 | 0.16 |
| 39 | Cancer Informatics | 355 | 0.15 |
| 40 | Briefings in Functional Genomics & Proteomics | 290 | 0.12 |
| 41 | Evolutionary Bioinformatics | 249 | 0.10 |
| 42 | Algorithms for Molecular Biology | 245 | 0.10 |
| 43 | Journal of Biomedical Semantics | 240 | 0.10 |
| 44 | BioData Mining | 149 | 0.06 |
| 45 | EURASIP Journal on Bioinformatics and Systems Biology | 140 | 0.06 |
| 46 | Source Code for Biology and Medicine | 131 | 0.05 |
| Total | | 241,569 | 100.00 |
Fig. 2 Data distribution. The publication years in our dataset range from 1996 to 2015. To identify topical trends in bioinformatics, we divided the 20 years into four time periods. The X-axis is the publication year and the Y-axis is the number of papers
Time-based statistics for 20 years
| Year | Number of Papers | Ratio (%) | Ranking |
|---|---|---|---|
| 1996 | 5,713 | 3.36 | 19 |
| 1997 | 5,549 | 3.26 | 20 |
| 1998 | 5,853 | 3.44 | 18 |
| 1999 | 5,877 | 3.46 | 17 |
| 2000 | 5,947 | 3.50 | 16 |
| Period 1 | 28,939 | 17.01 | |
| 2001 | 6,199 | 3.64 | 14 |
| 2002 | 6,456 | 3.80 | 13 |
| 2003 | 6,668 | 3.92 | 12 |
| 2004 | 7,564 | 4.45 | 11 |
| 2005 | 8,545 | 5.02 | 10 |
| Period 2 | 35,432 | 20.83 | |
| 2006 | 9,845 | 5.79 | 9 |
| 2007 | 10,112 | 5.94 | 8 |
| 2008 | 10,352 | 6.09 | 7 |
| 2009 | 10,868 | 6.39 | 6 |
| 2010 | 11,031 | 6.49 | 5 |
| Period 3 | 52,208 | 30.69 | |
| 2011 | 11,518 | 6.77 | 4 |
| 2012 | 11,986 | 7.05 | 2 |
| 2013 | 11,695 | 6.88 | 3 |
| 2014 | 12,251 | 7.20 | 1 |
| 2015 | 6,070 | 3.57 | 15 |
| Period 4 | 53,520 | 31.46 | |
| Total | 170,099 | 100.00 | |
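The period subtotals and ratios in the table above follow from simple grouping arithmetic. The snippet below is a minimal sketch of that calculation, using the yearly counts copied from the table; the helper name `period_of` and the dictionary layout are assumptions for illustration, not the authors' code.

```python
# Yearly paper counts copied from the table above.
papers_per_year = {
    1996: 5713, 1997: 5549, 1998: 5853, 1999: 5877, 2000: 5947,
    2001: 6199, 2002: 6456, 2003: 6668, 2004: 7564, 2005: 8545,
    2006: 9845, 2007: 10112, 2008: 10352, 2009: 10868, 2010: 11031,
    2011: 11518, 2012: 11986, 2013: 11695, 2014: 12251, 2015: 6070,
}


def period_of(year, start=1996, span=5):
    """Map a publication year to its 1-based five-year period (hypothetical helper)."""
    return (year - start) // span + 1


total = sum(papers_per_year.values())

# Sum the yearly counts within each five-year period.
period_totals = {}
for year, count in papers_per_year.items():
    period = period_of(year)
    period_totals[period] = period_totals.get(period, 0) + count

# Share of each period in the whole dataset, in percent.
period_ratios = {p: round(100 * n / total, 2) for p, n in period_totals.items()}
```

Running this reproduces the four period subtotals and the overall total of 170,099 papers listed in the table.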
Example of keyphrase extraction results and other metadata for PMID 26030820
| Information | Content |
|---|---|
| Title | encoding cell amplitude frequency modulation |
| Author | Micali Gabriele, Aquino Gerardo, Richards David M, Endres Robert G |
| Year | 2015 |
| Journal | PLOS computational biology |
| Keyphrases | Down-Regulation; Ion Channels; Ions; L Cells (Cell Line); Ligands; Social Control, Formal; Social Control, Informal; Up-Regulation |
Fig. 3 ACT model. The Author-Conference-Topic (ACT) model, proposed by Tang et al., is a probabilistic topic model that extracts topics, authors, and conferences simultaneously
Notation and description of the ACT model
| Symbol | Description | Symbol | Description |
|---|---|---|---|
| d | Paper | N_d | Total number of words in paper d |
| x | Author | A_d | Total number of authors in paper d |
| w | Word | z | Topic |
| j | Journal | θ | Author-topic distribution |
| D | Total number of papers | φ | Topi |
| A | Total number of authors | ψ | Topic-word distribution |
| K | Selected number of topics | α,β,γ | Hyper-parameters of Dirichlet distribution |
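The notation above corresponds to a generative process that can be sketched in a few lines. The following is a minimal, standard-library-only illustration of the ACT generative story as described by Tang et al., with the conference stamp replaced by a journal stamp. The toy sizes, the hyper-parameter values, and all function and variable names (`sample_dirichlet`, `generate_paper`, `author_topic`, and so on) are assumptions made for illustration and do not come from the source. It samples from the model; it is not the inference procedure the authors ran.

```python
import random

rng = random.Random(0)

# Toy sizes (hypothetical): authors, topics, vocabulary words, journals.
A, K, V, J = 3, 2, 5, 4
# Toy symmetric Dirichlet hyper-parameters (hypothetical values).
alpha, beta, gamma = 0.5, 0.5, 0.5


def sample_dirichlet(dim, concentration):
    """Draw one symmetric Dirichlet sample by normalizing Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [g / total for g in draws]


# One topic distribution per author, one word and one journal distribution per topic.
author_topic = [sample_dirichlet(K, alpha) for _ in range(A)]
topic_word = [sample_dirichlet(V, beta) for _ in range(K)]
topic_journal = [sample_dirichlet(J, gamma) for _ in range(K)]


def generate_paper(authors, n_words):
    """Sample (author, topic, word, journal) tuples for the tokens of one paper."""
    tokens = []
    for _ in range(n_words):
        x = rng.choice(authors)                                  # author, uniform over the paper's authors
        z = rng.choices(range(K), weights=author_topic[x])[0]    # topic from that author's distribution
        w = rng.choices(range(V), weights=topic_word[z])[0]      # word from the topic
        j = rng.choices(range(J), weights=topic_journal[z])[0]   # journal stamp from the topic
        tokens.append((x, z, w, j))
    return tokens


paper = generate_paper(authors=[0, 2], n_words=6)
```

Each token carries its own author and topic assignment, which is what lets the model tie topics to authors and journals at the same time.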
Average Pearson correlation coefficient for each run
| Run | Pearson correlation coefficient |
|---|---|
| 1 | 0.155 |
| 2 | 0.140 |
| 3 | 0.152 |
| 4 | 0.177 |
| 5 | 0.180 |
| 6 | 0.146 |
| 7 | 0.136 |
| 8 | 0.160 |
| 9 | 0.158 |
| 10 | 0.178 |
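For reference, the Pearson coefficient and the mean over the ten runs can be computed as follows. This is a minimal sketch: the `pearson` function is the textbook formula, the run values are copied from the table above, and the excerpt does not state which two quantities were correlated in each run, so the function is shown only on generic input.

```python
import math
import statistics


def pearson(xs, ys):
    """Textbook Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Per-run coefficients copied from the table above.
run_coefficients = [0.155, 0.140, 0.152, 0.177, 0.180,
                    0.146, 0.136, 0.160, 0.158, 0.178]

# Mean over the ten runs, rounded to four decimals: 0.1582
mean_coefficient = round(statistics.fmean(run_coefficients), 4)
```

All ten values lie between 0.136 and 0.180, so the mean of about 0.158 summarizes a consistently weak positive correlation across runs.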
Perplexity results of the topic model
| Number of Topics | 1996–2000 | 2001–2005 | 2006–2010 | 2011–2015 | Average |
|---|---|---|---|---|---|
| 10 | 2,712 | 2,060 | 875,088 | 501,176 | 345,259 |
| 20 | 2,978 | 3,161 | 726,329 | 513,176 | 311,411 |
| 30 | 2,872 | 2,176 | 742,307 | 481,875 | 307,308 |
| 50 | 2,480 | 2,149 | 635,960 | 466,676 | 276,816 |
| Average | 2,760 | 2,387 | 744,921 | 490,726 | |
Fig. 4 Perplexity result. We used perplexity to evaluate the topic modeling results. We calculated perplexity for each period with the number of topics set to 10, 20, 30, and 50. The X-axis is the period and the Y-axis is the perplexity value
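Perplexity is commonly defined as the exponential of the negative average per-token log-likelihood, so lower values indicate a better fit. The sketch below assumes that standard definition; the excerpt does not state the exact formula the authors used, and the function name `perplexity` and its input format (one list of token log-probabilities per document) are hypothetical.

```python
import math


def perplexity(token_log_probs_per_doc):
    """Return exp(-(sum of token log-likelihoods) / (total number of tokens))."""
    total_log_prob = sum(sum(doc) for doc in token_log_probs_per_doc)
    total_tokens = sum(len(doc) for doc in token_log_probs_per_doc)
    if total_tokens == 0:
        raise ValueError("need at least one token")
    return math.exp(-total_log_prob / total_tokens)


# Toy check: a model that assigns probability 1/4 to every token has
# perplexity 4 (up to floating-point error), whatever the document lengths.
uniform_docs = [[math.log(0.25)] * 5, [math.log(0.25)] * 3]
toy_value = perplexity(uniform_docs)
```

Under this reading, the 50-topic setting, which has the lowest average perplexity in the table (276,816), is the best-fitting of the four settings.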
Fig. 5 Journal-focused topic distribution with related authors and keyphrases. For integrated pattern analysis, we examined eight representative journals with their top authors and keyphrases. The patterns fall into four prominent types: rising (a-b), falling (c-f), convex (g), and concave (h)
Journals that appeared in all periods
| Journal Name | Sum of Probability | Average of Probability |
|---|---|---|
| Biochemistry | 13.13233 | 3.28308 |
| Bioinformatics (Oxford, England) | 2.49214 | 0.62304 |
| BMC Bioinformatics | 1.64624 | 0.41156 |
| BMC Genomics | 1.92022 | 0.48005 |
| Bulletin of mathematical biology | 0.68576 | 0.17144 |
| Genome biology | 1.28690 | 0.32173 |
| Genomics | 2.40453 | 0.60113 |
| Human molecular genetics | 2.98031 | 0.74508 |
| Journal of biotechnology | 1.32392 | 0.33098 |
| Journal of computational neuroscience | 0.42877 | 0.10719 |
| Journal of computer-aided molecular design | 0.60322 | 0.15081 |
| Journal of molecular biology | 5.82534 | 1.45633 |
| Journal of theoretical biology | 3.00820 | 0.75205 |
| Mammalian genome | 1.06714 | 0.26678 |
| Physiological genomics | 0.61599 | 0.15400 |
| Protein science | 2.16372 | 0.54093 |
| Statistical methods in medical research | 0.25062 | 0.06266 |
| The EMBO journal | 3.35207 | 0.83802 |
| Trends in biochemical sciences | 0.50571 | 0.12643 |
| Trends in biotechnology | 0.50871 | 0.12718 |
| Trends in genetics | 0.73352 | 0.18338 |