Alexandre Hannud Abdo, Jean-Philippe Cointet, Pascale Bourret, Alberto Cambrosio.
Abstract
This paper presents a contribution to the study of bibliographic corpora through science mapping. From a graph representation of documents and their textual dimension, stochastic block models can provide a simultaneous clustering of documents and words that we call a domain-topic model. Previous work investigated the resulting topics, or word clusters, while ours focuses on the study of the document clusters we call domains. To enable the description and interactive navigation of domains, we introduce measures and interfaces that consider the structure of the model to relate both types of clusters. We then present a procedure that extends the block model to cluster metadata attributes of documents, which we call a domain-chained model, noting that our measures and interfaces transpose to metadata clusters. We provide an example application to a corpus relevant to current science, technology and society (STS) research and an interesting case for our approach: the abstracts presented between 1995 and 2017 at the American Society of Clinical Oncology Annual Meeting, the major oncology research conference. Through a sequence of domain-topic and domain-chained models, we identify and describe a group of domains that have notably grown through the last decades and which we relate to the establishment of "oncopolicy" as a major concern in oncology.
Year: 2021 PMID: 35873705 PMCID: PMC9299004 DOI: 10.1002/asi.24606
Source DB: PubMed Journal: J Assoc Inf Sci Technol ISSN: 2330-1635 Impact factor: 3.275
FIGURE 1. Incidence graph representation of relationships found in a research corpus. Each document appears as a node, with edges toward its textual content nodes (terms) and metadata nodes (e.g., authors, years, journals)
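The incidence graph described in this caption can be illustrated with a minimal sketch. The function name `build_incidence_graph`, the `(kind, value)` node encoding, and the toy corpus are all assumptions made for illustration; they are not taken from the paper or its software.

```python
from collections import defaultdict


def build_incidence_graph(documents):
    """Build an undirected adjacency mapping for a document-centered graph.

    Every node is a (kind, value) tuple. Each document node is linked to
    one node per term or metadata value attached to it, so documents are
    only connected to each other through shared terms or metadata.
    """
    adjacency = defaultdict(set)
    for doc_id, record in documents.items():
        doc_node = ("document", doc_id)
        for kind, values in record.items():
            for value in values:
                other = (kind, value)
                adjacency[doc_node].add(other)
                adjacency[other].add(doc_node)
    return dict(adjacency)


# Hypothetical two-document corpus, invented for this example.
toy_corpus = {
    "doc1": {"term": ["dose", "toxicity"], "year": [1998]},
    "doc2": {"term": ["toxicity", "survey"], "year": [2010]},
}
graph = build_incidence_graph(toy_corpus)
```

Restricting the records to the `"term"` key would give the bipartite document-term graph, while keeping only one metadata key such as `"year"` would give a document-metadata graph.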
FIGURE 2. Domain‐topic model of the bipartite graph of documents linked to their terms. The resulting dual structure features on the right side a hierarchy of topics (groups of terms) and on the left side a hierarchy of domains (groups of documents). Labels 1D = “level 1 domains,” 2D = “level 2 domains,” and similarly for 1T and 2T as topics
FIGURE 3. Domain‐chained model of documents linked to their year of publication. By keeping the inferred document partition fixed, the model can be extended to other variables that get assigned to documents. In this example, publication years get partitioned into nested blocks. The best fit partition will reflect the connectivity patterns between the chained dimension and the lexically structured domains (1P = level 1 periods, 2P = level 2 periods)
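To make "connectivity patterns between the chained dimension and the lexically structured domains" concrete, the sketch below counts document-year edges per domain while holding the document partition fixed. It does not perform block model inference; it only tabulates the kind of domain-by-year counts that such a partition of years would reflect. The function name and toy data are assumptions.

```python
from collections import Counter, defaultdict


def domain_year_counts(doc_domain, doc_year):
    """Count, for each year, how many of its documents fall in each domain.

    doc_domain is the fixed document partition (document id -> domain);
    doc_year assigns each document its publication year.
    """
    counts = defaultdict(Counter)
    for doc_id, year in doc_year.items():
        counts[year][doc_domain[doc_id]] += 1
    return {year: dict(per_domain) for year, per_domain in counts.items()}


# Hypothetical fixed partition and years, invented for this example.
doc_domain = {"a": "D1", "b": "D1", "c": "D2", "d": "D2"}
doc_year = {"a": 1995, "b": 1995, "c": 1995, "d": 2017}
profiles = domain_year_counts(doc_domain, doc_year)
```

Years whose profiles across domains look alike are the ones one would expect to end up in the same period block.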
Domain‐topic table for the level 2 subdomains of a level 3 domain, showing common topics for domains at levels above 1, and specific topics for level 1 domains
| [Level 3 domain] | | | |
|---|---|---|---|
| [Common topics of its level 2 subdomains] | | | |
| [Level 2 domain] | [Common topics] | [Level 1 domain] | [Specific topics] |
| | | [Level 1 domain] | [Specific topics] |
| [Level 2 domain] | [Common topics] | [Level 1 domain] | [Specific topics] |
| | | [Level 1 domain] | [Specific topics] |
FIGURE 4. Screen capture of a domain‐topic map, with longitudinal histogram and term search. Domains are shown on the left, in red, and topics on the right, in blue. Columns show the partitions at decreasing levels of the nested hierarchy, where each block is sliced into subblocks with equal heights. In the figure, domains appear colored for their usage of the selected level 1 topic, associated with health care, and topics appear colored for their usage in the selected domain, which happens to be the level 3 domain most associated with the selected topic
FIGURE 5. Abstracts presented at the American Society of Clinical Oncology Annual Meeting between 1995 and 2017, totaling 83,476. Advancing some of our results, we highlight the contribution of a group of domains we label “oncopolicy” and show the split between the two main periods detected
The base number (N) of documents and terms, followed by the number of partitions in domains and topics at each nested level of the domain‐topic model
| | N | | L1 | L2 | L3 | L4 | L5 |
|---|---|---|---|---|---|---|---|
| Documents | 83,476 | Domains | 479 | 110 | 24 | 4 | 1 |
| Terms | 253,758 | Topics | 407 | 112 | 37 | 5 | 1 |
FIGURE 6. The nested partition of conference years into periods. Contrary to domains and topics, levels L3, L4, and L5 are all equivalent, as the inference procedure found no statistically significant distinctions above level L2. Moreover, since years are treated as categorical data, the fact that the partitions respect the chronological sequence is not a given, but reveals a progressive character in the evolution of research domains at the American Society of Clinical Oncology Annual Meeting
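Because years enter the model as unordered categories, whether a partition "respects the chronological sequence" is something one can check after the fact. A minimal sketch of such a check follows; the function name and the two toy partitions are assumptions, not the paper's data.

```python
def is_chronologically_contiguous(year_to_period):
    """Return True if each period covers a single unbroken run of years.

    Years are visited in sorted order; a period label that reappears
    after a different period has started means the partition interleaves
    periods and does not respect chronological order.
    """
    seen = set()
    previous = None
    for year in sorted(year_to_period):
        period = year_to_period[year]
        if period != previous:
            if period in seen:
                return False
            seen.add(period)
            previous = period
    return True


# Hypothetical partitions, invented for this example.
ordered = {1995: "P1", 1996: "P1", 1997: "P2", 1998: "P2"}
scrambled = {1995: "P1", 1996: "P2", 1997: "P1", 1998: "P2"}
```

Here `ordered` passes the check and `scrambled` fails it, since period `P1` resumes after `P2` has begun.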
FIGURE 7. Domain‐topic network showing connections between level 3 domains and their shared most specific level 1 topics. For each domain node, the strength of the red filling corresponds to the volume of documents belonging to the domain, and the border color represents the intensity of its growth (green) or decline (blue) between periods (1995–2005) and (2006–2017). The label on each edge shows the word from the topic most specific to the domain it connects. As an example, at the bottom we see D50, which has an intermediate total volume and strongly decreases between the periods. T29 is a specific topic for D50, with “mtd” (for “maximum tolerated dose”) its most specific term for this domain
FIGURE 8. Colors represent the growth, in red, or decline, in gray, of the prevalence of domains between periods (1995–2005) and (2006–2017). Labels are the result of procedures akin to what we will perform for L3D44
FIGURE 9. Area bump chart for the 15 level 2 domains of “oncopolicy” (L3D44 and L3D48), displaying rank and volume changes along the 6 level 1 periods. Volumes are year averages within each period, and both absolute (continuous line) and relative (dotted line) volumes are shown for each domain
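The "year averages within each period" can be sketched as follows. Two assumptions are made here that the caption does not spell out: periods are given as inclusive year ranges, and the relative volume is the domain's share of all documents in the period. The function name and all counts are invented for illustration.

```python
def period_volumes(domain_counts, total_counts, periods):
    """Return {period: (absolute, relative)} volumes for one domain.

    absolute: mean number of the domain's documents per year in the period.
    relative: the domain's share of all documents in the period
              (an assumed definition, see the lead-in).
    """
    result = {}
    for name, (start, end) in periods.items():
        years = range(start, end + 1)  # inclusive year range
        domain_sum = sum(domain_counts.get(y, 0) for y in years)
        total_sum = sum(total_counts.get(y, 0) for y in years)
        absolute = domain_sum / len(years)
        relative = domain_sum / total_sum if total_sum else 0.0
        result[name] = (absolute, relative)
    return result


# Made-up yearly counts for a single domain and for the whole corpus.
domain_counts = {1995: 10, 1996: 30, 1997: 60}
total_counts = {1995: 100, 1996: 100, 1997: 200}
periods = {"P1": (1995, 1996), "P2": (1997, 1997)}
volumes = period_volumes(domain_counts, total_counts, periods)
```

Averaging per year keeps periods of different lengths comparable, which is what allows ranks to be compared along the horizontal axis of a bump chart.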
Line L2D120 from the domain‐topic table for L3D44
| L2 | Common topics | L1 | Specific topics |
|---|---|---|---|
| D120 | T156 (9%); T67 (8%); T14 (8%); T272 (7%); T23 (6%); T241 (4%); T12 (4%); T249 (3%); T367 (3%) | D415 | T156 (23%); T67 (15%); T14 (10%); T23 (4%) |
| | | D563 | T67 (42%); T249 (5%); T156 (4%) |
| | | D584 | T156 (30%); T14 (6%); T67 (5%); T241 (4%); T272 (4%); T23 (3%) |
| | | D614 | T134 (54%) |
| | | D635 | T38 (54%) |
| | | D687 | T12 (28%); T23 (4%); T79 (4%); T142 (3%); T181 (3%); T241 (3%); T28 (2%); T283 (2%); T355 (2%) |
| | | D805 | T40 (19%); T228 (5%); T93 (3%); T65 (3%); T251 (3%); T236 (3%); T292 (2%); T29 (2%); T11 (2%); T395 (2%); T258 (2%); T240 (2%); T239 (2%); T230 (1%) |
Note: Percentages are the item's fraction of total positive similarity (or commonality) contributions, either at the topic or term level. Topics are always level 1. Note that L1T12 has no terms present in all subdomains of L2D120.
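The percentages described in this note, each item's fraction of the total positive contributions, amount to a normalization over the positive values only. A minimal sketch, assuming the similarity (or commonality) contributions are already available as signed numbers; the function name, topic labels, and scores below are invented.

```python
def positive_shares(contributions):
    """Return each positively contributing item's share of the positive total.

    Items with zero or negative contributions are dropped, so the
    returned shares sum to 1 whenever any positive contribution exists.
    """
    positive = {k: v for k, v in contributions.items() if v > 0}
    total = sum(positive.values())
    if total == 0:
        return {}
    ranked = sorted(positive.items(), key=lambda item: -item[1])
    return {k: v / total for k, v in ranked}


# Made-up signed contributions for three hypothetical topics.
toy = {"T_a": 3.0, "T_b": 1.0, "T_c": -2.0}
shares = positive_shares(toy)
```

With these toy values, `T_c` is excluded and the remaining two topics split the total between them.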
Complete description of the “public health and health technology assessment” L3D44 domain in terms of its common topics and subdomains
| L2 | | Description |
|---|---|---|
| D120 | 1,796 | Quality of life (physical and psychological). Treatment side effects. |
| D121 | 2,303 | Quality improvement (professionals, costs, and practices). Professional education and communication with patients. |
| D124 | 1,343 | Comparative evaluation of treatment regimens (resources, costs, efficacy, side effects). |
| D126 | 1,931 | Epidemiological surveillance of outcomes (SEER = surveillance, epidemiology, and end results). Prognosis. |
| D131 | 1,387 | Risks and management of treatment side effects (toxicity, infection, etc.) |
| D147 | 1,523 | Assessment of physical and psychosocial side effects. Hospitalization. |
| D151 | 805 | Pain management and quality of life (randomized studies thereof). |
| D153 | 1,844 | Survey questionnaires of quality of life (including depression and anxiety). Survey of professional adherence to clinical guidelines. |
| D163 | 1,142 | Patient education concerning lifestyles. Barriers to screening. |
| D186 | 953 | Meta‐analysis of clinical trials (all sorts of cancers). |
Note: Level 2 subdomains were labeled by employing the domain‐topic table for the level 3 domain, whose row describing the “quality of life and treatment side effects” L2D120 subdomain was presented as Table B1.
Complete description of the “screening and risk factors for cancers” L3D48 domain
| L2 | | Description |
|---|---|---|
| D134 | 588 | Screening (mammography and genetic testing) for breast cancer, esp. hereditary one, and prophylaxis, including practice guidelines |
| D139 | 509 | Genetic testing and genetic counseling for hereditary cancer risks. Epidemiological surveillance |
| D141 | 433 | Cancer risk factors, including race, ethnicity, smoking, and obesity |
| D159 | 426 | Epidemiology of cancer, esp. blood cancers. Risk factors for secondary cancer (following treatment of primary cancer). Epidemiological surveillance in different world populations |
| D180 | 342 | Molecular pathology: Quality and comparison of different technologies. Molecular testing and its regulation, access to testing, expertise, and decision support for testing (both human expertise and computational approaches) |