Takahiro Kawamura, Katsutaro Watanabe, Naoya Matsumoto, Shusaku Egami, Mari Jibu.
Abstract
Maps of science, which represent the structure of science, can help us understand science and technology (S&T) development. Studies have therefore developed techniques for analyzing the relationships among research activities; however, inter-citation and co-citation analysis are difficult to apply to ongoing research projects and recently published papers. To characterize what is currently being attempted in the scientific landscape, this paper proposes a new content-based method of locating research projects in a multi-dimensional space using recent word/paragraph embedding techniques. Specifically, to address the lack of clustering observed with the original paragraph vectors, we introduce paragraph vectors based on the information entropies of concepts in an S&T thesaurus. The experimental results show that the proposed method successfully formed a clustered map from 25,607 project descriptions of the EU's 7th Framework Programme (FP7) from 2006 to 2016 and 34,192 project descriptions of the National Science Foundation (NSF) from 2012 to 2016.
Keywords: Information entropy; Map of science; Paragraph embedding; Thesaurus
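As a rough illustration of the quantity the abstract refers to, the sketch below computes Shannon entropy from raw counts. The abstract does not say how the entropy of a thesaurus concept is derived, so the term counts and the idea of grouping them per concept are hypothetical and should not be read as the authors' procedure.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy (in bits) of a discrete distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# Hypothetical counts of how often each term of one thesaurus concept
# appears in a corpus (illustrative values, not from the paper).
concept_term_counts = {"robot": 40, "robotics": 40, "automaton": 20}
concept_entropy = shannon_entropy(concept_term_counts.values())
print(f"entropy = {concept_entropy:.3f} bits")
```

A concept whose occurrences are spread evenly over many terms gets a higher entropy than one dominated by a single term, which is the property an entropy-based weighting would rely on.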
Year: 2018; PMID: 30147200; PMCID: PMC6096681; DOI: 10.1007/s11192-018-2783-x
Source DB: PubMed; Journal: Scientometrics; ISSN: 0138-9130; Impact factor: 3.238
Fig. 1 FP7 projects that have a cosine similarity of > 0.35 to other projects (BIO Biotechnology, MBI Medical Biotechnology, LIF Life Science, SOC Social Aspects, INF Information and Media, IPS Information Processing and Information Systems, ICT Information and Communication Technology Applications, ROB Robotics). (Color figure online)
Fig. 2 Construction of paragraph vectors based on cluster vectors
Fig. 3 Concepts in a thesaurus. (Color figure online)
Fig. 4 Funding map of FP7 projects that have a cosine similarity of > 0.35 to other projects. (Color figure online)
Fig. 5 Edges with higher similarities and nodes with higher degree centrality in the original paragraph vectors and entropy-clustered paragraph vectors from the FP7 data set
Fig. 6 Edges with higher similarities and nodes with higher degree centrality in the original paragraph vectors and entropy-clustered paragraph vectors from the NSF data set
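The captions above describe maps in which projects are linked when their cosine similarity exceeds 0.35, and nodes are compared by degree centrality. A minimal sketch of that construction follows, assuming each project is already represented by a numeric vector. The three toy vectors, the function names, and the normalization of degree by n - 1 are assumptions for illustration, not details taken from the paper.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarity_edges(vectors, threshold=0.35):
    """Return (id_a, id_b, similarity) for every pair above the threshold."""
    edges = []
    for (i, u), (j, v) in combinations(vectors.items(), 2):
        s = cosine(u, v)
        if s > threshold:
            edges.append((i, j, s))
    return edges

def degree_centrality(vectors, edges):
    """Degree of each node divided by (n - 1), the usual normalization."""
    n = len(vectors)
    degree = {k: 0 for k in vectors}
    for i, j, _ in edges:
        degree[i] += 1
        degree[j] += 1
    return {k: (d / (n - 1) if n > 1 else 0.0) for k, d in degree.items()}

# Hypothetical project vectors (real paragraph vectors have many more dimensions).
project_vectors = {
    "P1": [1.0, 0.0, 0.2],
    "P2": [0.9, 0.1, 0.3],
    "P3": [0.0, 1.0, 0.0],
}
edges = similarity_edges(project_vectors)
centrality = degree_centrality(project_vectors, edges)
print(edges)
print(centrality)
```

The pairwise loop is quadratic in the number of projects, so a data set of the size reported here would in practice call for a vectorized or approximate nearest-neighbor approach.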
Accuracy of project similarities (%)
| Metric | PV with entropy (Weak) | PV with entropy (Middle) | PV with entropy (Strong) | BM25 vector (Weak) | BM25 vector (Middle) | BM25 vector (Strong) |
|---|---|---|---|---|---|---|
| Precision | 77.5 | 83.3 | 100.0 | 65.0 | 60.0 | 100.0 |
| Recall | 98.6 | 33.3 | 83.3 | 18.6 | 20.0 | 66.7 |
| F1-score | 86.8 | 47.6 | 90.9 | 28.9 | 30.0 | 80.0 |
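The F1-scores in the table are the harmonic mean of precision and recall. The short check below recomputes them from the reported precision and recall values; the dictionary layout and function name are illustrative choices.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision %, recall %) as reported in the table above.
reported = {
    ("PV with entropy", "Weak"): (77.5, 98.6),
    ("PV with entropy", "Middle"): (83.3, 33.3),
    ("PV with entropy", "Strong"): (100.0, 83.3),
    ("BM25 vector", "Weak"): (65.0, 18.6),
    ("BM25 vector", "Middle"): (60.0, 20.0),
    ("BM25 vector", "Strong"): (100.0, 66.7),
}

f1 = {key: f1_score(p, r) for key, (p, r) in reported.items()}
for (method, strength), value in f1.items():
    print(f"{method:16s} {strength:7s} F1 = {value:.1f}")
```

The recomputed values agree with the table's F1-score row to one decimal place.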
Examples of relationships between two projects
| Title/(desc.) | cos. | Title/(desc.) |
|---|---|---|
| Understanding the physics of galaxy formation and evolution at high redshift/understanding the processes regulating galaxy ... | 0.50 (weak) | The birth of the first stars and galaxies/the aim of this proposal is to simulate the formation and evolution of galaxies within the ... |
| Asymptotic graph properties/many parts of graph theory have witnessed a huge growth over the last years, partly because of their relation to theoretical computer science and statistical physics ... | 0.52 (weak) | Benjamini-schramm approximation of groups and graphings/large graphs have become central objects in many fields in the last couple of decades: in neural sciences, network sciences ... |
| A high intensity neutrino oscillation facility in Europe/the recent discovery that the neutrino changes type (or flavour) as it travels through space, a phenomenon referred to as neutrino oscillations,... | 0.53 (weak) | Probing fundamental properties of the neutrino at the sno+ experiment/i propose a comprehensive programme of research on sno+, a multi-purpose neutrino experiment that has the capacity ... |
| Systems biology of pseudomonas aeruginosa in biofilms/systems biology is a new and rapidly growing discipline. it is widely ... | 0.54 (weak) | Cyclic-di-gmp: new concepts in second messenger signaling and bacterial biofilm formation/biofilms represent a multicellular ... |
| Investigation of human nucleoporins stoichiometry and intracellular distribution by quantitative mass spectrometry/the nuclear pore complex (npc) is one of the most intricate multi-protein ... | 0.56 (weak) | Atlas of cell-type specific nuclear pore complex structures/the nuclear pore complex (npc) is one of the most intricate components of eukaryotic cells and is assembled from ~ 30 nucleoporins ... |
| European science and technology in action building links with industry, schools and home/the aim of establish is to facilitate and implement an inquiry based approach in the teaching and learning ... | 0.67 (middle) | Science teacher education advanced methods/helping teachers raise the quality of science teaching and its educational environment has the potential to increase student engagement,... |
| Support to tenth european conference on turbomachinery-fluid dynamics and thermodynamics, lappeenranta, finland, 15–19 march 2013/the european turbomachinery conference is ... | 0.99 (strong) | Support to ninth european conference on turbomachinery-fluid dynamics and thermodynamics, istanbul, turkey, 21–25 march 2011/the european turbomachinery conference is ... |
Fig. 7 Similarities of artificial data with partial replacement in PV with entropy
Fig. 8 Similarities of artificial data with partial replacement in BM25 vector
SIC codes by cluster when the projects are split into the same number of clusters as there are SIC codes (COO coordination, cooperation, EMP employment issues, MED medicine, health, ENV environmental protection, IND industrial manufacture, ECO economic aspects)
| SIC | Cl 1 | Cl 2 | Cl 3 | Cl 4 | Cl 5 | Cl 6 | Cl 7 | Cl 8 | Cl 9 | Cl 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| BIO | … | 20 | 10 | 3 | 109 | 22 | 19 | 2 | 4 | 0 |
| SOC | 13 | … | 32 | 15 | 18 | 28 | 35 | 3 | 24 | 48 |
| COO | 30 | 50 | … | 29 | 26 | 17 | 52 | 10 | 19 | 30 |
| EMP | 26 | 95 | 3 | … | 35 | 10 | 51 | 13 | 5 | 11 |
| MED | 31 | 16 | 18 | 5 | … | 17 | 9 | 3 | 7 | 2 |
| INF | 2 | 27 | 25 | 36 | 10 | … | 10 | 1 | 92 | 20 |
| ENV | 0 | 15 | 35 | 1 | 1 | 14 | … | 3 | 25 | 26 |
| IND | 3 | 15 | 2 | 39 | 7 | 26 | 10 | … | 43 | 1 |
| ICT | 0 | 7 | 14 | 19 | 4 | 77 | 4 | 0 | … | 0 |
| ECO | 1 | 19 | 14 | 0 | 4 | 3 | 37 | 8 | 17 | … |
Bold numbers indicate the dominant code in each cluster
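Reading the table this way amounts to taking, for each cluster column, the SIC code with the largest count. The sketch below shows that lookup on a small hypothetical contingency table; the counts and the function name `dominant_codes` are made up for illustration and are not values from the paper.

```python
def dominant_codes(table):
    """For each cluster (column), return the code (row label) with the largest count.

    `table` maps a code to its list of per-cluster counts; all lists
    must have the same length.
    """
    codes = list(table)
    n_clusters = len(next(iter(table.values())))
    return [max(codes, key=lambda code: table[code][k]) for k in range(n_clusters)]

# Hypothetical counts for three codes and three clusters (not from the paper).
toy_table = {
    "BIO": [50, 3, 1],
    "SOC": [4, 40, 2],
    "ICT": [1, 5, 60],
}
dominant = dominant_codes(toy_table)
print(dominant)
```

On ties, `max` returns the first code in iteration order, so a real analysis would need an explicit tie-breaking rule.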
Fig. 9 Matching rate of SIC codes in project pairs according to cosine similarities
Fig. 10 Project distributions by SIC code and by cluster when the projects are split into the same number of clusters as there are SIC codes