Chaomei Chen1,2, Min Song2. 1. Department of Information Science, College of Computing and Informatics, Drexel University, Philadelphia, Pennsylvania, United States of America. 2. Department of Information Science, Yonsei University, Seoul, Republic of Korea.
Abstract
Systematic scientometric reviews, empowered by computational and visual analytic approaches, offer opportunities to improve the timeliness, accessibility, and reproducibility of studies of the literature of a field of research. On the other hand, effectively and adequately identifying the most representative body of scholarly publications as the basis of subsequent analyses remains a common bottleneck in the current practice. What can we do to reduce the risk of missing something potentially significant? How can we compare different search strategies in terms of the relevance and specificity of topical areas covered? In this study, we introduce a flexible and generic methodology based on a significant extension of the general conceptual framework of citation indexing for delineating the literature of a research field. The method, through cascading citation expansion, provides a practical connection between studies of science from local and global perspectives. We demonstrate an application of the methodology to the research of literature-based discovery (LBD) and compare five datasets constructed based on three use scenarios and corresponding retrieval strategies, namely a query-based lexical search (one dataset), forward expansions starting from a groundbreaking article of LBD (two datasets), and backward expansions starting from a recently published review article by a prominent expert in LBD (two datasets). We particularly discuss the relevance of areas captured by expansion processes with reference to the query-based scientometric visualization. The method used in this study for comparing bibliometric datasets is applicable to comparative studies of search strategies.
Systematic reviews play a critical role in scholarly communication [1]. They typically synthesize findings from original research in a field of study, assess the degree of consensus, or the lack of it, concerning the state of the art in the field, and identify challenges and future directions. For newcomers to a field of study, a timely and comprehensive systematic review can provide a valuable overview of the intellectual landscape and guide new researchers to pursue their research effectively. For experienced and active researchers, systematic reviews can be instrumental in keeping their knowledge of the field up to date, especially when they involve areas that are potentially relevant but fall outside the immediate topic of one's interest. Researchers have also studied scientific literature to uncover potentially significant but currently overlooked connections between disparate bodies of literature, notably as demonstrated in the research of literature-based discovery (LBD) [2-7].

A fast-growing trend is the increase of systematic reviews conducted with the assistance of science mapping tools [8]. A science mapping tool typically takes a set of bibliographic records of a research field and generates an overview of the underlying knowledge domain, as with CiteSpace [9, 10] and VOSviewer [11]. For example, systematic reviews facilitated by CiteSpace cover a diverse range of research areas, such as regenerative medicine [12], natural disaster research [13], greenhouse gas emission [14], and identifying disruptive innovation and emerging technology [15]. Similarly, VOSviewer was used in reviews of topics such as citizen science [16] and climate change [17]. A scientometric overview of a field of research provides a valuable source of input for conducting systematic reviews, especially in situations where relevant and up-to-date systematic reviews may not be readily available or accessible.
The quality of the input data is therefore critical to the overall quality of subsequent analyses and reviews. This practical issue is particularly significant when we need to select subsets of articles from a large pool of available data or when we want to limit the scope of a study to specific disciplines rather than leaving it open to all disciplines.

Considerations concerning the scope of selection have been discussed in the literature in terms of local, global, and hybrid approaches [18]. Global maps of science by definition provide a comprehensive coverage of all scientific disciplines [19], whereas local maps typically focus on selected areas of interest. Hybrid maps may use a global map as a base map and superimpose local maps as overlays [20]. Generating global maps of science requires a substantial array of resources that are not commonly accessible to most researchers, whereas the resources required for generating local maps are more reachable, especially with increasingly accessible science mapping tools. Global maps are valuable because they provide a broad context for a specific research interest. In practice, generating global maps is resource consuming and demanding, for example, requiring direct access to large data sources that are available to only a small number of researchers. Consequently, global maps may not be updated as frequently as end users need. For example, a significant update of a global model may not be undertaken for five years [20]. The underlying structure of a research field is often subject to a variety of changes as the scientific literature grows over time [21, 22]. Therefore, it is reasonable to question whether an existing global model created a few years ago remains a valid representation of the underlying structure for the specific analytic tasks at hand, although the answer may differ at different levels of granularity.

In contrast, localist approaches face different challenges.
For example, a common challenge for individual users of science mapping tools such as CiteSpace [9, 10], VOSviewer [11], and HistCite [23] is to select a representative subset of articles from the tens of millions or more of scholarly publications in the Web of Science, Scopus, Dimensions, and/or the Lens. Researchers have used lexical search, citation expansion, and hybrid strategies [24-26]. Complex and sophisticated queries are typically constructed with input from domain experts and iteratively refined, as demonstrated in several in-depth studies [25-27]. We refer to such strategies as query-based approaches in this article. The role of relevant and sufficient domain expertise in such approaches is critical, which makes it a double-edged sword. As we will see in this article, using computational approaches to reduce the cognitive burden on domain experts is a noticeable area of research. Following the principles of LBD, which emphasize potentially valuable but currently overlooked connections in scientific literature, it is conceivable that there are undiscovered connections for which even domain experts' input may be limited. Reducing the initial reliance on domain expertise can increase the applicability of query-based strategies.

In this article, we propose a flexible computational approach to the construction of a representative dataset of scholarly publications concerning a field of research through iterative citation expansions. Starting from an initial dataset or even a single seed article, the expansion process automatically expands the initial set by adding articles through citation links in the forward direction, the backward direction, or both. This flexibility is particularly valuable in some common scenarios in practice. For example, Swanson's article on fish oil and Raynaud's syndrome [6] is widely considered a groundbreaking one in LBD research.
A researcher may want to retrieve subsequently published articles that are connected to the groundbreaking article through potentially lengthy citation paths. Such expansions are valuable for preserving the continuity of the research literature identified across an extensive time span. In contrast, a researcher may come across a recently published review article of LBD and would like to find previously published articles that lead to the state of the knowledge summarized in the review. For example, in 2017, Swanson's long-term collaborator Smalheiser published an article entitled "Rediscovering Don Swanson: the past, present and future of literature-based discovery" [28], and researchers may be interested in tracing several generations of relevant articles.

Pragmatically speaking, the ability to expand any set of articles of interest provides a smooth transition across a local-global continuum. The agility of the approach enables us to improve the quality of an input dataset for scientometric studies by refining the context incrementally. We demonstrate the expansion and evaluation of five datasets retrieved by different strategies on literature-based discovery (LBD). The field of LBD is potentially very broad, as it is applicable to numerous scientific disciplines. Swanson's pioneering studies in the 1980s have been a major source of inspiration [6, 7, 29, 30], followed by a series of studies in collaboration with Smalheiser [31-33], who recently reviewed the past, present, and future of LBD [28]. Significant developments have also been made by other researchers along this generic framework, for example, [34-36]. It would be challenging to formulate a complex query to capture such a diverse range of relevant research activities, some of which may have drifted away from the core literature of LBD.

Judging the relevance of topics is often situational in nature and therefore challenging even with domain expertise.
In LBD, the threshold of relevance is by definition lower than in other fields because the focus is on undiscovered connections, which in turn demands an even higher level of recall for conducting systematic scientometric reviews. Shneider proposed a four-stage model to characterize how a scientific field may evolve, namely, identifying the problem, building tools and instruments, applying tools to the problem, and codifying lessons learned [22, 37]. The application stage may also reveal unanticipated problems, which in turn may lead to a new line of research and form a new field of research. LBD research may continue along the research directions set off by the pioneering studies in LBD. According to Shneider's four-stage model, new specialties of research may emerge. A systematic review of LBD should be open to such new developments as well as to the established ones that we are familiar with. What can we gain by using cascading citation expansions? What might we miss if we rely on the simple query-based search strategy alone? What types of biases may be introduced or avoided by cascading expansions and by the choice of points of departure?

To address these questions, we develop an intuitive visual analytic method to compare multiple search strategies and reveal the strengths and weaknesses of a specific procedure. We compare a total of five datasets in this study, including a dataset from a simple query with phrases such as "literature-based discovery" as a baseline reference and four datasets from cascading citation expansions based on two singleton seed article sets, using Dimensions as the source of data.

The rest of the article is organized as follows. We characterize existing science mapping approaches in terms of globalism and localism, especially their strengths and weaknesses for conducting systematic reviews of relevant literature.
We introduce a flexible citation expansion approach, cascading citation expansion, to improve the data quality for conducting systematic scientometric reviews. We first demonstrate what a query-based strategy may reveal about LBD research; then we combine the five datasets to form a common baseline for comparing the topics covered by the five individual datasets.
Mapping the scientific landscape
Considerations concerning the scope of a study of scientific literature can be characterized as global, local, and hybrid in terms of their intended scope and applications.
Globalism
Global maps of science aim to represent all scientific disciplines [18]. Commonly used units of analysis in global maps of science include journals and, to a much lesser extent, articles. Clusters of journals are typically used to represent disciplines of science.

Global maps of science are valuable for developing an understanding of the structure of scientific knowledge across all disciplines. Global maps provide a relatively stable representation of the underlying structure of knowledge, which can be rather volatile at finer levels of granularity. Klavans and Boyack compared the structure of 20 global maps of science [38]. They arrived at a consensus map generated from edges that occur in at least half of the input maps. A stable global representation is suitable for serving as an organizing framework to accommodate additional details as overlays [19, 20, 39]. Global maps may reveal insights that may not be possible at a smaller scale. For example, in studies of structural variations caused by scholarly publications, detecting a potentially transformative link in a network representation of the literature relies on the extent to which the contexts of both ends of the link are adequately represented [21].

On the other hand, the validity of a global map as a base map is likely to decrease over time, as subsequent publications may change the underlying knowledge structure considerably [21]. Given that updating a global map of science is a very resource-demanding task and it may be years before a major update becomes available [20], is the structure shown in a global map still a valid representation at the intended level of granularity? If a global map is updated too often, it may lose its value of stability as an organizing framework to sustain overlays. The fundamental question is not necessarily whether the representation is categorically comprehensive, but whether the structural representation, as a context for analyzing the literature, is adequate and effective.
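The consensus-map construction mentioned above, keeping edges that occur in at least half of the input maps, reduces to counting edge occurrences across maps. The sketch below uses hypothetical discipline labels, not data from [38]:

```python
from collections import Counter

def consensus_edges(maps, min_fraction=0.5):
    """Keep edges that occur in at least `min_fraction` of the input maps.

    Each map is a set of undirected edges, represented as frozensets of
    two node labels so that (a, b) and (b, a) count as the same edge.
    """
    counts = Counter()
    for edge_set in maps:
        counts.update(edge_set)
    threshold = min_fraction * len(maps)
    return {edge for edge, n in counts.items() if n >= threshold}

# Three toy journal-level maps (hypothetical labels):
m1 = {frozenset(("Physics", "Chemistry")), frozenset(("Biology", "Medicine"))}
m2 = {frozenset(("Physics", "Chemistry")), frozenset(("CS", "Math"))}
m3 = {frozenset(("Biology", "Medicine")), frozenset(("Physics", "Chemistry"))}

core = consensus_edges([m1, m2, m3])
# Physics-Chemistry appears in 3/3 maps and Biology-Medicine in 2/3, so both
# survive the 50% threshold; CS-Math (1/3) does not.
```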
Localism
In localism, the focus is on communicating the structure of a subject matter of interest at the level of scientific inquiry and scholarly communication, including analytic reasoning, hypothesis generation, and argumentation. Many studies of science and systematic scientometric reviews belong to this category. Scientometric studies commonly draw bibliographic data, especially citation data, from long-established sources such as the Web of Science and Scopus. More recent additions include Microsoft Academic Search, Dimensions, and the Lens. Knowledge domain visualization, for example, focuses on knowledge domains as the unit of analysis [10, 40, 41]. The concept of a knowledge domain emphasizes that the boundary of a research field should reflect the underlying structure of knowledge. Relying on organizing structures such as institutions, journals, or even articles alone carries the risk of missing potentially significant articles.

Query-based search is perhaps by far the most popular strategy for finding articles relevant to a topic of interest or a field of research. A query-based search process typically starts with a list of keywords or phrases provided by an end user. For example, a query of "literature-based discovery" OR "undiscovered public knowledge" could be a valid starting point for searching for articles in the field of LBD. The quality of retrieval is routinely measured in terms of recall and precision. The initial search for a systematic review tends to give a higher priority to recall than to precision.

Formulating a good query is a non-trivial task even for a domain expert because the quality of a query can be affected by several factors, including our current domain knowledge and the motivations of our search [24]. Furthermore, if our goal is to identify emerging topics and trends in a research field, which is very likely when we conduct a systematic review of the field, then it would be a significant challenge to formulate a complex query effectively in advance.
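The recall and precision measures just mentioned are simple set arithmetic over retrieved and relevant items; the article ids in the example are hypothetical:

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for a retrieved set against a gold set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# A broad initial query favors recall over precision:
p, r = precision_recall(retrieved={1, 2, 3, 4, 5, 6, 7, 8},
                        relevant={1, 2, 3, 4})
# p = 0.5, r = 1.0: every relevant article is found, at a cost in precision.
```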
As a result, iteratively refining queries based on lessons learned from the performance of previous queries is a common strategy, especially when combined with input from relevant domain experts [25, 26]. Scatter/Gather [42] is an influential early example of iteratively improving the quality of retrieved information based on feedback.

A profound challenge to query-based search is the detection of an implicit semantic connection, or a latent semantic relation. Search systems have utilized additional resources such as WordNet [43] and domain ontologies [44, 45] to augment users' original terms with their synonyms and/or closely related concepts. With techniques such as latent semantic indexing [46] and more recent advances in distributional semantic learning [47], the relevance of an article to a topic can be established in terms of the distributional properties of its language use. The semantic similarity of an article can be detected without the explicit presence of any keywords from the user's original query. This is known as the distributional hypothesis: linguistic items with similar distributions have similar meanings.
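A toy illustration of how latent semantic indexing can surface similarity without shared query terms: two documents that use the synonyms "car" and "automobile" look only half-similar in raw term space, but become nearly identical after a low-rank SVD projection. The miniature term-document matrix below is made up for illustration, not taken from [46]:

```python
import numpy as np

# Term-document matrix: rows = terms, columns = documents.
# d1 uses "car", d2 uses the synonym "automobile"; both use "engine".
terms = ["car", "automobile", "engine", "banana"]
A = np.array([[1, 0, 0],   # car
              [0, 1, 0],   # automobile
              [1, 1, 0],   # engine
              [0, 0, 1]],  # banana
             dtype=float)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Raw term-space similarity between d1 and d2 (they share only "engine"):
raw_sim = cosine(A[:, 0], A[:, 1])  # 0.5

# Rank-1 LSI: project documents onto the leading latent dimension.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
docs_latent = (np.diag(s[:1]) @ Vt[:1, :]).T  # one coordinate per document
latent_sim = cosine(docs_latent[0], docs_latent[1])

# In the latent space, d1 and d2 are nearly identical even though their
# only shared surface term is "engine".
```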
Cascading citation expansion
In this article, we conceptualize a unifying framework that accommodates both globalism and localism as special cases on a consistent and continuous spectrum through a combination of query-based search and cascading citation expansion. Under the new conceptualization, the coverage of a local map can grow incrementally, as the expansion process may continue for as many generations of citations as needed. Thus, the gap between a local map and a global map can be reduced considerably. Table 1 summarizes the major advantages and weaknesses of globalism and localism, along with the potential benefits that the conceptualized incremental expansion may provide.
Table 1
Contrasts between globalism and localism approaches.
Criteria for representing a research field | Globalism | Localism | Incremental Expansion
Broad context                              | High      | Low      | Increased
Coverage, Diversity, Recall                | High      | Low      | Increased
Redundancy                                 | High      | Low      | Reduced
Structural stability                       | High      | Low      | Increased
Precision                                  | Low       | High     | Increased
Timeliness                                 | Low       | High     | Increased
In this article, we demonstrate the flexibility and extensibility of an incremental expansion approach, cascading citation expansion, by applying this methodology to a few common research scenarios in relation to a field of research of our own interest, i.e., literature-based discovery (LBD).

Citation indexing was originally proposed by Eugene Garfield to tackle the information retrieval problem in the context of scientific literature [48]. While the nature of a citation may vary widely, as many researchers have documented [49], an instance of a citation from one article to another provides evidence of some potentially significant connection. A unique advantage of a citation-based search method is that it frees us from having to specify a potentially relevant topic in our initial query, which is useful for reducing the risk of missing important relevant topics that we may not be aware of.

We often encounter situations in which we have found a small set of highly relevant articles and yet they are still not fully satisfactory for some reason. For example, if we were new to a topic of interest and all we had to start with was a systematic review of the topic published many years ago, how could we bring our knowledge up to date? If what we have is a newly published review written by a domain expert, how can we expand the review to find more relevant articles? Another common scenario is when we want to construct a relatively comprehensive survey of a target topic: how can we generate an optimal dataset that is comprehensive enough but also contains the least amount of less relevant articles?

In these situations, an effective method would enable us to build on what we have found so far and add new articles iteratively. Scatter/Gather [42] was a dynamic clustering strategy proposed in the mid-1990s that allowed users to re-focus on query-specific relevant documents, as opposed to query-independent clusters, when browsing search results.
From the citation indexing point of view, articles that cite any articles in the initial result set are good candidates for further consideration. Thus, an incremental expansion strategy can be built on these insights to uncover additional relevant articles.

Incremental expansion processes can be performed in parallel or in sequence. When an expansion step is applied to a base set S of articles, articles associated with S through citation links are added to S, thus expanding the set. Users may define an inclusion threshold such that the expansion is limited to adding articles with sufficient citations. We refer to successive citation expansions as cascading citation expansion [50]. The concept of cascading citation expansion is intrinsic to the general framework of citation indexing. An expansion process is capable of generating an adequate context for a systematic scientometric review of a research field by incrementally retrieving more and more articles from the literature.

Cascading citation expansion requires constant programmatic access to a master source of scientific articles. We utilize the Dimensions API to access their collection of over 98 million publications (at the time of writing). A cascading citation expansion process may move forward and backward along citation paths. If the entire universe of scientific publications is completely reachable from one article to another, then cascading citation expansion will eventually reach all publications, which means that one can achieve a coverage as large as one wishes. If the entire universe is in fact made of several galaxies that are not reachable from one another through citation links, the question is whether it is desirable to transcend the void and reach other galaxies.
Strategies using hybrid lexical and semantic resources may become useful in such situations, but in this study we assume it is acceptable to terminate the expansion process once the citation links associated with a chosen starting point are exhausted.

Fig 1 illustrates the cascading citation expansion process. Given an initial set of seed articles, which can be a singleton set, a 1-generation forward citation expansion will add articles that cite any member article of the seed set directly, i.e., with a one-step citation path. A 2-generation forward expansion will add articles connecting to the seed set with two-step citation paths. We include a 5-generation forward expansion in this study.
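The multi-generation forward expansion just described can be sketched as a breadth-first traversal over a citation graph. In practice each neighbor lookup would be an API call to a source such as Dimensions; the in-memory graph and function name below are hypothetical:

```python
def forward_expand(seed_ids, cites_map, generations,
                   min_citations=0, citation_counts=None):
    """Multi-generation forward citation expansion.

    seed_ids:        iterable of article ids forming the seed set
    cites_map:       dict mapping an article id to the set of ids citing it
                     (in practice this lookup would query a citation index)
    generations:     number of expansion steps to apply
    min_citations:   inclusion threshold; less-cited articles are skipped
    citation_counts: dict id -> global citation count, used with the threshold
    """
    citation_counts = citation_counts or {}
    collected = set(seed_ids)
    frontier = set(seed_ids)
    for _ in range(generations):
        next_frontier = set()
        for article in frontier:
            for citer in cites_map.get(article, ()):
                if citer in collected:
                    continue
                if citation_counts.get(citer, 0) < min_citations:
                    continue  # filtered out by the inclusion threshold
                next_frontier.add(citer)
        collected |= next_frontier
        frontier = next_frontier
        if not frontier:  # citation links exhausted: stop early
            break
    return collected

# Toy chain a <- b <- c <- d (each arrow reads "is cited by"):
cites = {"a": {"b"}, "b": {"c"}, "c": {"d"}}
assert forward_expand({"a"}, cites, generations=1) == {"a", "b"}
assert forward_expand({"a"}, cites, generations=3) == {"a", "b", "c", "d"}
# With a threshold, a weakly cited citer is never added:
assert forward_expand({"a"}, cites, generations=3,
                      min_citations=10, citation_counts={"b": 3}) == {"a"}
```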
Fig 1
Applying incremental citation expansions increases the quality of input data for mapping a research field.
Use scenarios and corresponding search strategies
Here we consider two common scenarios in research. In the first scenario, we have identified a well-known classic work, for example, Swanson's 1986 article on fish oil and Raynaud's syndrome [6], and would like to retrieve all follow-up studies and articles that have cited the original work directly or indirectly since 1986. What are the newly developed major topics since then? What are the hottest and the most far-reaching topics in more recent years? Are there any areas branching off the main paths?

In the second scenario, we are attracted to a very interesting recently published article and would like to collect relevant studies in the past that lead to the article, i.e., its intellectual base. Smalheiser's 2017 review of LBD [28] is an example. Smalheiser co-authored several landmark studies with Swanson in the development of literature-based discovery. Smalheiser cited 71 references in his review. What would be a broader context of Smalheiser's 71-reference review? Are there topics that are relevant to LBD but excluded by the authoritative domain expert?

Pragmatically, if we were to rely on simple full-text search alone, how much would we miss? Are there topics that we might have missed completely? What would be an optimal search strategy that not only adequately captures the essence of the development of the field but also does so in the most efficient way? Borrowing the terminology of information retrieval, an optimal search strategy should maximize the recall and the precision at the same time.

Cascading citation expansion functions are implemented in CiteSpace based on the Dimensions API. The expansion process starts with an initial search query in DSL, Dimensions' search language. Users who are familiar with SQL should recognize the resemblance immediately. The result of the initial query forms the initial set of articles. In fact, in addition to publications, one can also retrieve grants, patents, and clinical trials from Dimensions.
In this study, we concentrate on publications.
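As a sketch, an initial DSL query for the LBD scenario can be assembled as a string before being posted to the Dimensions API. The overall search ... for ... return shape follows the DSL; treat the specific source and field names here as assumptions to be checked against the current DSL documentation:

```python
def build_dsl_query(phrases, fields=("id", "title", "year")):
    """Assemble a Dimensions DSL full-text query for a list of exact phrases.

    NOTE: the field names `id`, `title`, `year` and the `full_data` source
    are illustrative assumptions; verify them against the DSL documentation.
    """
    quoted = " OR ".join('\\"%s\\"' % p for p in phrases)
    return ('search publications in full_data for "%s" '
            'return publications[%s]' % (quoted, " + ".join(fields)))

query = build_dsl_query(["literature-based discovery",
                         "undiscovered public knowledge"])
# e.g.: search publications in full_data for
#       "\"literature-based discovery\" OR \"undiscovered public knowledge\""
#       return publications[id + title + year]
```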
Constructing five datasets of literature-based discovery
To demonstrate the flexibility and extensibility of the incremental expansion approach, we take literature-based discovery (LBD) research as the field to study. We choose LBD for several reasons: 1) we are familiar with the early development of the domain, 2) we are aware of a recent review written by one of the pioneering researchers and would like to set it in a broader context, and 3) we would like to take this opportunity to demonstrate how one can apply the methodology to a visual exploration of the relevant literature and develop a good understanding of the state of the art. These reasons echo the common scenarios discussed earlier.

Table 2 summarizes the construction of the five datasets included in the study, including key parameters such as citation thresholds.
Table 2
Five datasets on literature-based discovery.
Set      | Description                                                                                                                                                       | Records
F        | Fulltext search on Dimensions for "literature-based discovery" OR "undiscovered public knowledge" (Date of Search: 3/4/2019)                                      | 1,777
S3       | 3-generation forward expansion from a seed article by Swanson [6]. Dimensions: pub.1053548428 (Times Cited: 407). Citing Article and Reference Thresholds: 10, 10 | 748
S5       | 5-generation forward expansion of Swanson's pioneering work [6]. Citing Article and Reference Thresholds: 20, 20                                                  | 45,178
NF       | Forward expansion from a 2017 seed article by Smalheiser [28]                                                                                                     | 73
NB       | Backward expansion from NF                                                                                                                                        | 2,451
Combined |                                                                                                                                                                   | 48,298
Fig 2 illustrates the process of the comparative study of five datasets retrieved by a query-based search and cascading citation expansions. In this study, we applied a citation filter in cascading citation expansions to the selection of citing and cited articles. Articles with citations below the threshold are filtered out of the expansion processes. These filters provide users with a flexible trade-off between concentrating on major citation paths with a reduced completion time and retrieving articles comprehensively with a much longer completion time. Since the distribution of citations of articles follows a power law, a comprehensive expansion process may become too long to be viable for daily use of these functions. The DSL query searched for Swanson's two articles published in 1986, namely, the fish oil and Raynaud's syndrome article, 1986a [6], and the undiscovered public knowledge article, 1986b [7]. The query search found 1,777 articles, forming the set F (for Full-text search). Swanson 1986a is used as the seed article for two multi-generation forward citation expansions, one for 3 generations (the set S3) and the other for 5 generations (the set S5). S3 contains 748 articles, whereas S5 is about 60 times larger, containing 45,178 articles. The other two datasets are expanded from Smalheiser's 2017 review as the seed, NF and NB, where N stands for Neil, Smalheiser's first name. Smalheiser's review contains 71 references. At the time of the experiment, a forward expansion from it found two articles that cite the review. The resulting 73 articles form the set NF. The NB set is obtained by applying backward citation expansions to the set NF. The expansion stopped in 1934 with 2,451 articles. The five datasets are combined as the set All5, containing 48,298 unique articles. S5 contributed most of the articles to the combination. The five datasets overlap to different extents. S3 is a subset of S5. NB expands from NF.
F overlaps with S5 the most (702 of its 1,777 articles). Each of the five individual datasets and the combined dataset is visualized in CiteSpace as a network of co-cited references with thematic labels for clusters.
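The overlap bookkeeping reported above reduces to set arithmetic over publication ids. The ids in the sketch below are toy placeholders, not Dimensions ids:

```python
def overlap_report(datasets):
    """Pairwise overlap counts between named sets of article ids."""
    names = list(datasets)
    report = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            report[(a, b)] = len(datasets[a] & datasets[b])
    return report

# Toy id sets mimicking the structure described in the text:
sets = {
    "F":  {1, 2, 3, 4},
    "S3": {3, 4, 5},
    "S5": {3, 4, 5, 6, 7},  # S3 is a subset of S5, as in the study
}
combined = set().union(*sets.values())  # the analogue of the All5 set
assert sets["S3"] <= sets["S5"]
overlaps = overlap_report(sets)
# overlaps[("F", "S5")] == 2; combined holds the unique union of all sets.
```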
Fig 2
Using multiple search and expansion strategies improves the data quality for scientometric studies.
Given a dataset, its core is defined by references that satisfy two conditions, inspired by [24]: 1) the global citation score (GCS) in Dimensions is greater than two, and 2) the ratio of the local citation score (LCS) within the analyzed dataset to the corresponding GCS is greater than or equal to 0.01. According to [24], the core represents articles with sufficient specificity to the field of research in question. The core will downplay the role of an article that has a very high GCS but a low LCS, because this suggests that the article probably belongs to another field. It may be possible for an article to have its citations evenly split across multiple fields, but its LCS/GCS ratio in any of these fields should then have a good chance of qualifying the article for the core.

Researchers have used the main paths of a citation network to study major flows of information or the diffusion of ideas [51, 52]. The main paths of a dataset are derived from the corresponding direct citation networks of the dataset. Direct citation networks are generated in CiteSpace with a GCS of 1 as the selection threshold. Pajek is used to select main paths based on the Search Path Link Count (SPLC), using the top 30 key routes found by local search.

Datasets, their cores, main paths, and clusters can all serve as the initial seed set for cascading citation expansion. They can also be used as network overlays in CiteSpace to delineate the scope of a research field at various levels of granularity.

Fig 3 shows logarithmically transformed distributions of the five datasets. The distributions shown under the title are the original ones.
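The two core-membership conditions above translate directly into a predicate over a reference's citation scores; the thresholds follow the definitions in the text:

```python
def in_core(gcs, lcs, min_gcs=2, min_ratio=0.01):
    """Core membership test: GCS > min_gcs and LCS/GCS >= min_ratio."""
    return gcs > min_gcs and (lcs / gcs) >= min_ratio

# A famous article cited mostly outside the analyzed dataset is excluded;
# it likely belongs to another field:
assert not in_core(gcs=10_000, lcs=5)   # ratio 0.0005 < 0.01
# A moderately cited article with enough in-dataset citations qualifies:
assert in_core(gcs=100, lcs=1)          # ratio 0.01, GCS > 2
# Low-GCS articles fail the first condition regardless of the ratio:
assert not in_core(gcs=2, lcs=2)
```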
Fig 3
Logarithmically transformed distributions of articles by year in the five datasets.
The F dataset (in blue) is evenly distributed except for a surprising peak in 2009, which turns out to be due to many articles from an encyclopedia. The number of articles per year ranges between 60 and 130.

Both S5 (red) and S3 (orange) are forward expansions starting with Swanson's 1986 article on fish oil and Raynaud's syndrome [6]. In S3, the inclusion threshold was at least 10 citations, whereas it was 20 in S5 so as to keep the total processing time down. The majority of the articles in S3 appeared between 2006 and 2018. S3 had its first peak in 1984 and did not return to the same level for the next 10 years; it then climbed from 2003 onwards and reached a second peak in 2012. In contrast, the distribution of the more extensive forward expansion S5 shows a steady increase over time.

NF, a forward expansion from Smalheiser's review of Swanson's work [28], includes articles that cited the references used in Smalheiser's review. Its distribution increased steadily between 1986 and 2017. NB, a backward expansion from the set NF, contains articles that are cited by NF. The earliest article in NB was published in 1936. A noticeable hump starts around 1983, followed by a valley in the early 1990s; two peaks appeared in 2006 and 2011, respectively.

In the following analysis we will focus on how these datasets differ in terms of networks of co-cited references. There are of course many other ways to conduct scientometric studies based on these datasets, but we will limit our attention to the co-citation networks generated with CiteSpace.
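The forward expansions above amount to a breadth-first traversal over citation links. Below is a sketch under the assumption of a hypothetical lookup `citing(article_id)` that returns (citing article, citation count) pairs, standing in for a Dimensions citation query; it is not the authors' actual implementation.

```python
# A sketch of k-generation forward citation expansion. `citing` is a
# hypothetical lookup returning (citing_article_id, global_citation_count)
# pairs for a given article.

def forward_expand(seed_ids, citing, generations, min_citations):
    collected = set(seed_ids)
    frontier = set(seed_ids)
    for _ in range(generations):
        next_frontier = set()
        for article_id in frontier:
            for citing_id, gcs in citing(article_id):
                # Keep a citing article only if it is itself cited
                # at least `min_citations` times.
                if gcs >= min_citations and citing_id not in collected:
                    collected.add(citing_id)
                    next_frontier.add(citing_id)
        frontier = next_frontier
    return collected

# A toy citation chain: swanson1986 <- p1 <- p3 <- p4, plus swanson1986 <- p2.
graph = {
    "swanson1986": [("p1", 15), ("p2", 5)],
    "p1": [("p3", 12)],
    "p3": [("p4", 30)],
}
citing = lambda a: graph.get(a, [])
print(sorted(forward_expand({"swanson1986"}, citing, 2, 10)))
# ['p1', 'p3', 'swanson1986']  (p2 falls below the 10-citation threshold)
```

A backward expansion such as NB follows the same shape, except that the lookup returns the references an article cites rather than the articles citing it.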
It is important to note that co-citation networks in CiteSpace contain much more information than a classic co-citation network, notably various indicators, thematic labels derived from the articles citing clusters of co-cited references, year-by-year concept labels that track the evolution of a cluster, and a hierarchical representation of concept terms extracted from citing articles' titles and abstracts. Some of these features are illustrated in the following sections.

The five datasets and the combined dataset are processed in CiteSpace with consistent configurations. In particular, the link-to-node ratio is 3, the look-back period is 10 years, the annual citation threshold is 2, and node selection is based on the g-index with a scaling factor of 30. These settings were derived empirically and tend to identify meaningful patterns. Given a dataset, CiteSpace first develops a network model by synthesizing a time series of annual networks of co-cited references. The synthesized network is then divided into clusters of cited references. Themes in each cluster are identified from noun phrases extracted from the titles and abstracts of citing articles, where a citing article of a cluster is defined as an article that cites at least one member of the cluster. The extracted noun phrases are further ranked to identify the most representative ones as thematic labels for their cluster. CiteSpace supports three ways to select cluster labels, based on Latent Semantic Indexing, the Log-Likelihood Ratio test, and Mutual Information [53]. CiteSpace also includes a built-in database against which many attributes of datasets can be compared; the core of a dataset, for example, can be identified using SQL queries.

Interactive visualizations in CiteSpace support several views, i.e., types of visualization, including a cluster view, a timeline view, a history view, and a hierarchical view. A network can be superimposed on another network as a layer.
A list of references can be superimposed on a network as well. We use this feature to overlay the core and main paths of a dataset on its own network or on the network of another dataset. CiteSpace reports network and cluster properties such as modularity and silhouette scores. The modularity score of a network reflects the clarity of the network's structure at the level of decomposed clusters, and the silhouette score of a cluster measures the homogeneity of its members. A network with a high modularity and a high average silhouette score is desirable. We will focus on the largest connected component (LCC) of each network, which is shown as the default visualization; users may choose to reveal all components of a network if they wish.

Table 3 summarizes various properties of the five individual datasets and the combined set, along with their networks and core references. The F dataset, for example, contains 1,777 articles, which in turn cite 30,606 unique references. Among them, 367 references are identified as core references based on the LCS/GCS ratio and the GCS threshold (GCS > 2, i.e., at least 3 citations). The resultant network contains 1,269 references as nodes and 5,937 co-citation links. The largest connected component consists of 1,029 references, or 74% of the entire network. The modularity with respect to the clusters is 0.76, indicating a relatively high level of clarity; the average silhouette score of 0.34 out of 1.0 is moderate.
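The g-index-based node selection with a scaling factor, mentioned in the configuration above, can be illustrated as follows. Treating the scaling factor k as a multiplier on the cumulative citation count is our reading of the variant, not a documented CiteSpace detail.

```python
# A sketch of g-index node selection with a scaling factor k: the
# largest g such that the top g most-cited items jointly collect at
# least g*g/k citations (k = 1 gives the classic g-index). The
# scaled interpretation is an assumption on our part.

def g_index(citations, k=1):
    cs = sorted(citations, reverse=True)
    total, g = 0, 0
    for i, c in enumerate(cs, start=1):
        total += c
        if k * total >= i * i:
            g = i
        else:
            break
    return g

print(g_index([5, 4, 3, 2, 1, 0]))       # 3  (classic g-index)
print(g_index([5, 4, 3, 2, 1, 0], k=2))  # 5  (relaxed by the scaling factor)
```

A larger k admits more nodes into the network per time slice, which is how the scaling factor trades network size against selectivity.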
Table 3
Properties of datasets and their networks (1986–2019).
CiteSpace configuration. Network: LRF = 3, LBY = 10, e = 2.0, g-index (k = 30); Core: GCS > 2 and LCS/GCS ≥ 0.01.

The modularity of a network measures the clarity of its structure in terms of how well the entire network can be naturally divided into clusters such that nodes within the same cluster are tightly coupled, whereas nodes in different clusters are loosely coupled. The higher the modularity, the easier it is to find such a division. The silhouette score of a network measures the average homogeneity of the derived clusters [53]; the higher the average silhouette score, the more meaningful a group is as a cluster. The smallest dataset, NF, has both the highest modularity of 0.93 and the highest average silhouette score of 0.50. The full text search (F) has a modularity of 0.76, slightly lower than that of NB, but its silhouette value of 0.34 is lower than all the others except S5.
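For concreteness, here is a stdlib-only sketch of the Newman-style modularity just described; CiteSpace's internal computation may differ in its details.

```python
# Modularity Q sums, over clusters, the fraction of intra-cluster
# edges minus the fraction expected if edges were wired at random:
# Q = sum_c [ l_c/m - (d_c/(2m))^2 ], with l_c intra-cluster edges,
# d_c the total degree of cluster c, and m the edge count.

def modularity(edges, membership):
    """edges: list of (u, v) pairs; membership: node -> cluster id."""
    m = len(edges)
    intra = {}       # cluster -> number of internal edges
    degree_sum = {}  # cluster -> total degree of its nodes
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(intra.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())

# Two tight triangles joined by a single bridge: a clear 2-cluster structure.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
part = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(round(modularity(edges, part), 3))  # 0.357
```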
Literature-based discovery
In this section, we visualize the thematic landscape of the field of literature-based discovery from the multiple perspectives of the five datasets. We will start with the full text search results and then turn to the cascading citation expansions.
Full text search
The query for the full text search on Dimensions consists of 'literature-based discovery' and 'undiscovered public knowledge.' The phrase 'literature-based discovery' is commonly used as the name of the research field. The phrase 'undiscovered public knowledge' appears in the titles of two of Swanson's publications in 1986: one entitled "Fish Oil, Raynaud's Syndrome, and Undiscovered Public Knowledge" in Perspectives in Biology and Medicine [6], and the other entitled "Undiscovered Public Knowledge" in Library Quarterly [7].

The full text search found 1,777 records. Dimensions' export center supports the export of up to 50,000 records to a CSV file for CiteSpace [9, 40]. Publication records returned from Dimensions do not include abstracts. We found 431 matched records in PubMed with their abstracts, but in this study the analysis is based on the full set of 1,777 records regardless of whether they have abstracts, because we primarily focus on the references they cite.

Fig 4 shows an overview map of LBD according to the dataset F. The color of a link indicates the earliest year when two publications were first co-cited in the dataset. In this visualization, the earliest work appears near the top of the network, whereas the most recent work appears at the bottom, although CiteSpace does not use any particular layout mechanism to orient the visualization. The network is decomposed into clusters of references based on the strengths of co-citation links. Clusters are numbered in descending order of size: the largest is #0, followed by #1, and so on. Fig 4 also depicts the core of the F set as an overlay (in green) and its main paths (in red).
Fig 4
A synthesized document co-citation network of the F dataset along with cluster labels and overlays of main paths of direct citations (red lines) and core references (yellow lines).
CiteSpace configuration: LRF = 3, LBY = 10, e = 2.0, g-index (k = 30). The network contains 1,269 references and 5,937 co-citation links.

The largest cluster is #0 machine learning. More recent clusters, further down in the visualized network, include #1 semantic predication and #7 citation network.

Fig 5 shows more features of the dataset F through colormaps of time (a), an overlay of its core (b), and its main paths (c). Labeled nodes on the main paths include Swanson DR (1986), which is also one of the oldest core references, Swanson DR (1997), Smalheiser (1998), Weeber M (2001), and, most recently, Cohen T (2010).
Fig 5
A network visualization based on the dataset F and various overlays of substructures: a) the visualized network with clusters labeled, b) core references as an overlay, c) a main-paths overlay, d) an overlay of references cited by Smalheiser (2017), e) an overlay of references cited by Sebastian et al. (2017), and f) clusters shown in distinct colors.
Fig 5 also contrasts references cited by Smalheiser's 2017 review of LBD with the main paths (part d), and shows references cited by another LBD review published in 2017 by Sebastian et al. [5] (part e). There are several notable differences. Torvik VI (2007), cited in Smalheiser's review, is not on the main paths. In contrast, a few articles on the main paths, such as Hunter L (2006), Zweigenbaum P (2007), and Agarwal P (2008), are not cited in Smalheiser's review. Similarly, Sebastian et al.'s review also cited Torvik VI (2007) and a few other articles off the main paths, including Kostoff RN (2009), Chen C (2009), and Kostoff RN (2008). The general area of Sebastian et al.'s review overlaps considerably with that of Smalheiser's review, except that Sebastian et al.'s review reaches further towards cluster #7 citation network. The boundaries of the clusters are shown in distinct colors in part (f) of Fig 5. According to the colored cluster areas, the main paths go through #0 machine learning, whereas the two LBD reviews do not.

In Fig 6, references cited by the two LBD review articles are shown as overlays on a timeline visualization, in which each cluster is laid out horizontally and advances over time from left to right. Both LBD reviews make substantial connections between #1 semantic predication and #6 validating discovery. Given the recency of #1 semantic predication, the role of semantic predication is significant; our own ongoing research also investigates its role in understanding uncertainties of scientific knowledge [22]. Sebastian et al.'s review reaches further down to #8 biomarker discovery, which was not cited in Smalheiser's review.
Fig 6
Timeline visualization with overlays of references cited by two recent reviews of LBD, namely Smalheiser [28] and Sebastian et al. [5].
Comparing five individual datasets
In order to compare the coverage of the individual datasets, we construct a baseline map based on the combined dataset. The base map is created with the same procedure as applied to the individual datasets and is divided into clusters. As a measure of the specificity of a dataset, we calculate the K-L divergence between its normalized GCS and LCS distributions. A low K-L divergence suggests that the dataset is representative of the underlying field of research, whereas a high K-L divergence indicates that the dataset contains many out-of-place articles (Table 4). S3 has the lowest K-L divergence; F and NF have similar scores; NB and S5 have higher scores, as expected given their size.
Table 4
The K-L divergences of the datasets and their cores.
Dataset   Range       Max GCS   GCS ≥ 3       %   KL(GCS, LCS)    Core
F         1983–2019       667       823   45.47          0.138     350
NF        1986–2018     3,919        68   93.15          0.135      59
NB        1936–2018    49,363     2,414   98.61          0.182   1,686
S3        1975–2018     4,969       748  100.00          0.070      86
S5        1975–2019     6,929    31,376   69.91          0.283  13,202
All5      1936–2019    49,363    33,836   70.11          0.311  15,226
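The K-L divergence reported in Table 4 can be computed as below. The small smoothing constant guarding against zero counts is our assumption, not a documented detail of the original method.

```python
# KL(P||Q) between the normalized GCS distribution (P) and the
# normalized LCS distribution (Q) over a dataset's references.
import math

def kl_divergence(gcs, lcs, eps=1e-9):
    p_total = sum(gcs) + eps * len(gcs)
    q_total = sum(lcs) + eps * len(lcs)
    kl = 0.0
    for g, l in zip(gcs, lcs):
        p = (g + eps) / p_total
        q = (l + eps) / q_total
        kl += p * math.log(p / q)
    return kl

# Proportional GCS and LCS profiles diverge by (nearly) zero; a dataset
# whose local citations concentrate elsewhere diverges strongly.
print(kl_divergence([10, 20, 30], [1, 2, 3]))  # ~0.0
print(kl_divergence([100, 1], [1, 100]))       # large
```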
The overview of a visualized network based on the combined dataset is shown in Fig 7. The colors in the map on the left depict the time when a link was added; the youngest areas are located towards the lower left of the network, whereas the oldest ones are located near the top. The colors in the map on the right encode cluster membership, with the largest cluster shown in red, following a rainbow colormap, so that the relative size of a cluster is apparent. We then overlay the network of each individual dataset to identify the unique topics that each search strategy brings to the overall landscape.
Fig 7
A network visualization based on the combined dataset, featuring 3,095 references and 16,314 co-citation links.
Modularity: 0.84. Silhouette: 0.34.
Table 5 lists the distribution of each of the 10 largest clusters in the combined network across the five individual datasets. Thematic labels for each cluster include terms selected by Latent Semantic Indexing (LSI) and by the log-likelihood ratio (LLR). The former tends to identify common themes, whereas the latter tends to highlight unique ones; the two selections may agree as well as differ. Among the 10 largest clusters, the oldest is #4 information retrieval, with 1990 as the average year of publication, and the youngest is #8 deep learning, with an average year of 2014. The largest cluster, #0 systems biology/protein interaction network, has the lowest silhouette score, which is expected given its size of 283 references. #6 microRNAs/drosophila melanogaster development has the highest silhouette score of 0.967, followed by 0.965 for #7 big data, suggesting that both are highly homogeneous.
Table 5
Distributions of clusters across individual datasets.
Cluster   Major Themes of Citing Articles                    Size      F     NF     NB     S3      S5
0         systems biology / protein interaction network       283   70.0    4.9   73.9   65.0   100.0
1         biomedical literature                               250   98.4   42.4   95.2   42.8    98.4
2         cancer / autism spectrum disorder                   245   31.0    0.4   27.3   52.2   100.0
3         science / bibliometric analysis                     214   69.6   13.6   59.3   29.9   100.0
4         migraine / information retrieval                    183   54.6   13.7   51.9   12.6    84.2
5         drug / drug discovery                               160   71.9   11.9   48.8   40.0   100.0
6         microRNAs / drosophila melanogaster development     143    4.9   23.8  100.0    8.4    67.8
7         big data                                            132   18.2    4.5   14.4   43.9    96.2
8         deep learning                                       117   10.3    4.3   15.4   27.4    99.1
9         metabolomics / mass spectrometry                    106   24.5    1.9   25.5   56.6   100.0

Values under F through S5 are the percentages of each cluster's members covered by the corresponding dataset.
The distributions of clusters across the individual datasets show that clusters 0–1 and 3–5 are well represented in F, with over 50% of the members of these clusters (highlighted in the table). NB is essentially responsible for cluster #6, whereas S5 is responsible for #7 big data and #8 deep learning. Independently, we can identify the unique contributions associated with #6, #7, and #8 from the network overlays shown in Fig 8.
Fig 8
Network overlays of individual datasets (in red) and core reference overlays (in green).
Fig 8 shows a set of network overlays of the individual datasets (in red) and core reference overlays of the F and S5 datasets. The three circles in part (g) highlight the three unique clusters: #6 is contributed essentially by NB, whereas #7 and #8 are captured by S5. The effect of cascading citation expansion is evident: the query-based approach (F) failed to capture #7 big data and #8 deep learning, whereas the 5-generation forward expansion from Swanson's pioneering article did. Are these clusters relevant enough to still be considered part of a systematic scientometric review of LBD, or should they rather be considered applications of computational technologies to literature-based discovery? Similarly, #6 microRNAs is missed by the forward expansions from Swanson's 1986 article. What is the basis of its relevance? We address these questions below.

Each cluster can be further analyzed by applying the same visual analytic procedure at the next level, i.e., Level 2; the cluster at the original level is then known as a Level-1 cluster. One may continue this drill-down process iteratively as needed. Level-2 clusters are useful for interpreting their Level-1 cluster in terms of more specific topics.

Fig 9 illustrates a few reports from CiteSpace on Cluster #6, including a visualization that shows Level-2 clusters of the Level-1 cluster (left), a hierarchy of concepts (top), and year-by-year thematic terms of the Level-2 clusters. The hierarchy of concepts, also known as a concept tree, provides a useful context for identifying the major themes of a cluster according to the degree of a concept node in the tree, i.e., its number of children. In this case, RNA interference has the highest degree, suggesting that the cluster's overarching theme concerns RNA interference. A concept tree provides an informative context for selecting thematic labels; labels selected through LSI or LLR do not have the benefit of such contextual information.
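The log-likelihood ratio used for label selection can be sketched along the lines of Dunning's 1993 statistic; this is an illustrative reconstruction, not CiteSpace's actual labeling code.

```python
# LLR scoring of a candidate label term against a cluster: terms
# frequent in the cluster's citing articles but rare elsewhere score
# highly; evenly distributed terms score near zero.
import math

def _s(*counts):
    """Sum of k*ln(k) over nonzero counts."""
    return sum(k * math.log(k) for k in counts if k > 0)

def llr(k11, k12, k21, k22):
    """2x2 contingency counts: k11 = term occurrences in the cluster's
    citing articles, k12 = occurrences elsewhere, k21/k22 = all other
    term occurrences inside/outside the cluster."""
    n = k11 + k12 + k21 + k22
    return 2 * (_s(k11, k12, k21, k22)
                - _s(k11 + k12, k21 + k22)   # row marginals
                - _s(k11 + k21, k12 + k22)   # column marginals
                + _s(n))

print(round(llr(30, 5, 70, 895), 1))   # concentrated term: large score
print(round(llr(10, 90, 90, 810), 1))  # independent term: ~0.0
```

Ranking candidate noun phrases by this score and keeping the top few is the essence of LLR-based label selection.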
As shown in Fig 9, #6 appears to be different from the other clusters because its connections to the study of scientific literature are not obvious.
Fig 9
Cluster #6 is primarily contributed by dataset NB.
It is a very specific domain on RNA interference.
Since NB is exclusively responsible for cluster #6, we overlay the references cited by Smalheiser's review, i.e., its footprints, on a network visualization of NB (Fig 10). The NB network consists of two components that are loosely connected with each other. The footprints of Smalheiser's review mostly appear in the lower component, indicating that the lower component is strongly relevant to LBD. In contrast, the upper component contains only two footprint references, namely Smalheiser NR (2001) and Lugli G (2005). These two references may hold the key to the formation of #6.
Fig 10
The network of NB reveals two weakly connected continents.
The upper area is responsible for the formation of Cluster #6, which is in turn due to two references cited by Smalheiser in his 2017 review.
We examined the full text of Smalheiser's review for the contexts in which these two references are cited. As it turns out, Smalheiser cited them as atypical examples of LBD that "arose haphazardly during the course of laboratory investigation," unlike typical LBD examples in which complementary bodies of literature were purposefully sought out. Lugli G (2005) was cited in the first example, in which Smalheiser and his colleagues put together two lines of studies involving concepts such as double-stranded RNA, which is featured in the concept tree of #6 in Fig 9. Smalheiser NR (2001) was cited in the second example of atypical LBD, concerning RNA interference in the mammalian brain; it took a decade before provisional evidence that may validate the discovery was found in 2012.

It now becomes evident that the upper component is connected to LBD through this specific link, so the inclusion of #6 by the expansion is reasonable. On the other hand, if the upper component does not contain any other references specifically relevant to LBD, it seems necessary to investigate whether the NB expansion should be cut short in this area.

The query-based search (F) did not capture clusters #7 big data and #8 deep learning. As shown in Fig 11, the red lines indicate the coverage of F; no red lines even remotely approach either cluster.
Fig 11
Clusters #7 big data and #8 deep learning are captured by S5 but not by the query-based search (F).
The relevance of Cluster #8 deep learning is investigated as follows. Fig 12 depicts a drill-down analysis of Cluster #8, the youngest of the 10 largest Level-1 clusters. The concept tree of the cluster identifies deep learning as the primary theme. More specifically, the concept of deep learning appears in contexts that are relevant to LBD, namely in association with drug discovery and biomedical literature. Level-2 clusters include #0 deep learning, #1 deep learning, #2 ensemble gene selection, #3 neuromorphic computing, #4 drug discovery, and #5 medical record. Year-by-year thematic terms include deep learning in each of the four years since 2016, along with domain-specific terms such as radiology, breast ultrasound, and precision medicine. Given the multiple connections to biomedical literature, drug discovery, and other domain-specific topics, the cluster on deep learning should be considered a relevant development of LBD.
Fig 12
Major Level-2 clusters of Level-1 cluster #8 deep learning.
Discussions and conclusions
We have proposed and demonstrated a flexible method to improve the quality of data retrieved for systematic scientometric reviews, and have shown how one may use the approach to develop search strategies that meet the needs of common scenarios in practice. The comparisons of network visualization overlays of the five datasets have revealed what a commonly used full text search strategy could have missed. Such omissions are likely to be recently emerged topics, and missing them in a systematic review may undermine its overall quality. A strategy that combines query-based search and cascading citation expansion is likely to provide a more balanced coverage of a research domain and to reduce the risk of relying too heavily on topics explicitly specified in the initial queries. A practical implication for finding a representative body of the literature is the potential to uncover emerging topics that are currently connected to the main body of the literature only through a chain of weak links. We recommend that researchers consider this strategy in situations where they have only a small number of relevant articles to begin with. As our study demonstrates, a wide variety of articles can serve as the starting point of an expansion process, multiple processes can be utilized, and the combination of their results is likely to provide a comprehensive coverage of the underlying thematic landscape of a research field or a discipline.

The present study has some limitations, and it raises new questions that need to be addressed in future studies. Our approach assumes that the structure of scientific knowledge can be essentially captured through semantically similar text and/or explicit citation links. Is this assumption valid at the disciplinary level? To what extent does the choice of the seed articles for the expansion process matter? Does the choice of seed articles influence the stability of the expansion process?
How many generations of expansion would be optimal?

We have made a few observations and recommendations that are potentially valuable for adapting this type of search strategy to develop a systematic review of a body of scientific literature of interest:

- Using a combination of multiple cascading citation expansions with different seed articles is recommended to obtain a more balanced representation of a field than using a full text search alone.
- Multi-generation citation expansions provide a systematic approach to reduce the risk of missing topics that we may not be familiar with or aware of at all.
- Triangulating multiple aggregations of articles, such as the core references and main paths of a dataset as well as multiple review articles, provides useful insights.
- The flexibility of the approach enables researchers to apply the expansion and visual analytic procedure iteratively at multiple levels of granularity, for example, expanding a cluster, comparing the footprints of review articles with main paths, and drilling down into a cluster in terms of Level-2 clusters.
- The choice of the starting point and end point of a cascading expansion process may lead to different results, suggesting that the complexity of the networks and the threshold selections may play important roles in reproducing the results of similar studies.
- Modularity and cluster silhouette measures can help us assess the quality of an expansion process.
- Comparing multiple networks in the same context allows us to identify topic areas that are particularly well represented in some datasets but not in others. Such an understanding of the landscape of a field provides additional insights into its structure and long-term development.

As a methodology for generating systematic scientometric reviews of a knowledge domain, our approach bridges the formerly mutually exclusive globalism and localism by providing a scalable transition mechanism between them.
The most practical contribution of our work is the development and dissemination of a tool that is readily accessible by end users.2 Aug 2019PONE-D-19-16569Visualizing a Field of Research: A Methodology of Systematic Scientometric ReviewsPLOS ONEDear Dr. Chen,Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.I would follow Reviewer 2, who points to several majors issues that can be linked to two fundamental problems. In particular, the promised simplification of the conceptualisation of globalism and localism is not convincingly tackled and to a certain extent, the manuscript lacks proper documentation of methodology and its application. I would like to ask the authors to remedy these issues and also to resolve the minor issues according to the reviewer’s comments.We would appreciate receiving your revised manuscript by Sep 16 2019 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocolsPlease include the following items when submitting your revised manuscript:A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). 
This letter should be uploaded as separate file and labeled 'Response to Reviewers'.A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.We look forward to receiving your revised manuscript.Kind regards,Wolfgang Glanzel, PhDAcademic EditorPLOS ONEJournal Requirements:1. When submitting your revision, we need you to address these additional requirements.Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found athttp://www.journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and http://www.journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf2. We note that Figure 2 in your submission contains a copyrighted image. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. 
For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission:1. You may seek permission from the original copyright holder of Figure(s) [2] to publish the content specifically under the CC BY 4.0 license.We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text:“I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.”Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission.In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].”2. If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. 
If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only.3. Please upload a new copy of Figures 4 - 17 as the detail is not clear. Please follow the link for more information: http://blogs.PLOS.org/everyone/2011/05/10/how-to-check-your-manuscript-image-quality-in-editorial-manager/[Note: HTML markup is below. Please do not edit.]Reviewers' comments:Reviewer's Responses to QuestionsComments to the Author1. Is the manuscript technically sound, and do the data support the conclusions?The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.Reviewer #1: YesReviewer #2: Partly**********2. Has the statistical analysis been performed appropriately and rigorously?Reviewer #1: YesReviewer #2: N/A**********3. Have the authors made all data underlying the findings in their manuscript fully available?The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.Reviewer #1: YesReviewer #2: Yes**********4. 
Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors have demonstrated how to use the cascading citation expansion method to generally increase the body of literature to be included and analyzed. The integration between Dimensions and CiteSpace is beneficial to bibliometricians. The method described in the current manuscript offers an alternative solution to define a research landscape, which can be used in parallel with other bibliometric approaches such as cited reference analysis with CRExplorer.

Reviewer #2: This paper compares different ways of creating scientometric datasets in an attempt to identify a method that maximizes coverage of relevant documents while minimizing extraneous material.

I have several major concerns with the paper.

First, the paper is framed in terms of aiding systematic scientometric reviews and in bridging the local-to-global continuum in terms of datasets. However, the paper does little to address either of these issues. In fact, there is little to no referencing of these framing features throughout the results and discussion other than a simple reprisal of the claims at the very end of the paper. So, while the issues used to frame the paper are real, the body of the paper really doesn’t address the issues.
I would thus suggest that the paper be framed more simply in terms of differences between datasets, which is what the paper is really about.

Second, and perhaps more importantly, the purpose of the paper is to contrast various ways of creating a dataset. However, the analysis is a full step removed from this because the results of each query (and the superset) are not directly compared. Instead, these results are run through the CiteSpace co-citation black box, which hugely amplifies some queries while dramatically reducing the F5 query and the overall space, in addition to being a citation generation removed from all of the queries. Co-citation does tend to remove the cutting edge of research fronts, and since reviews are often focused on the hot or emerging topics, I fail to see how this method could be a great basis for aiding in reviews. I’m not saying that CiteSpace/co-citation should not be used, but I am saying that the authors need to honestly and accurately characterize what they are really doing. This does not currently come through in the paper.

Related to this, the CiteSpace step (bridging Tables 3 and 4) is all of one or two sentences, and the first impression from comparing the numbers of articles/nodes in Tables 3 and 4 is one of bewilderment. What is going on here? How does NF go from 73 articles to 1903 nodes? Also, why does CiteSpace end up with a much smaller set for F5 and smaller sets for the other queries? This all seems to be magic, and much more detail is required for this transition.

Finally, the comparison is qualitative, which it has to be given the lack of ground truth. On the other hand, the authors chose this subject because of their familiarity with it, and yet fail to mention which of the clusters are within vs. without what they would consider LBD to encompass. They seem to favor the F5 expansion. Are there clusters in the F5 that, because of over-expansion, are really outside the field?
We are left wanting a more definitive result.

Minor issues:

Michel Zitt wrote about citation expansion several times, most directly in IPM 2006, but also presaged in Scientometrics 1996 and 2003. Many people have used citation expansion – for instance, the original Places&Spaces display had a three-generation backward citation expansion based on Nobel Prize winners and others that was displayed on a science map. Author Chen also has a 2006 paper with forward expansion. So this paper is not introducing citation expansion, but is definitely refining it. Claims in this paper should be adjusted accordingly.

Although creation of global maps is indeed “limited to a small number of researchers”, it is also true that a global model is now available to all users of SciVal, which has over a thousand institutional subscribers, and is being widely used for institutional portfolio analysis. While the detailed information in this global model is not used widely to seed systematic reviews, it could be.

The authors claim that local maps are free from the need for stability. I disagree. If there is any comparison to be done from one point in time to another, stability is still needed, and cannot be achieved by a local map.

When it comes to query-based search, I would suggest using a reference to at least one paper by Alan Porter that contains a detailed multi-part query. Also, to be fair to the detailed multi-part query, these queries are typically designed in a multi-step process. A query is run, results are examined to see what should be dropped and/or added, and this is done multiple times, which results in the detailed query with presumably high precision/recall. This should be mentioned. Porter’s paper will mention this process.

Figure 3 would be far easier to interpret if the y-axis used base10 for its log scale.

In the Figure 7 overlay maps, it is very difficult to tell what is overlay and what is basemap.
It would be far easier to distinguish the overlay if the basemap were in gray or some other neutral color, and the overlay were colored.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

16 Sep 2019

PONE-D-19-16569

Visualizing a Field of Research: A Methodology of Systematic Scientometric Reviews

PLOS ONE

Dear Editor,

First, we would like to thank the reviewers for providing their comments. We have carefully revised the manuscript based on the reviews. Most of the manuscript has been re-written to improve the clarity of the description. Most of the figures have been re-generated as well. We summarize our responses to the reviewers as follows.
The text reflects the changes accordingly.

Best wishes,

Chaomei Chen and Min Song

2. We note that Figure 2 in your submission contains a copyrighted image. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.

Response: Figure 2 was in fact a screenshot of an interface in CiteSpace for connecting to Dimensions; thus we are its owner. Regardless, the figure is now replaced with a new Figure 2 to illustrate the workflow of handling the datasets and major analytic tasks.

Comments to the Author

________________________________________

5. Review Comments to the Author

Response: We are grateful to the reviewers for providing their comments. We have thoroughly revised the study to address these comments. We have in fact gone through the entire process and updated most of the figures and tables along with interpretations and discussions. As a result, the manuscript has been re-written substantially. Please be aware of the possible shifts in text between the revised version and the original version in terms of references to figures and details such as clusters.

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors have demonstrated how to use the cascading citation expansion method to generally increase the body of literature to be included and analyzed.
The integration between Dimensions and CiteSpace is beneficial to bibliometricians. The method described in the current manuscript offers an alternative solution to define a research landscape, which can be used in parallel with other bibliometric approaches such as cited reference analysis with CRExplorer.

Response: Thank you.

Reviewer #2: This paper compares different ways of creating scientometric datasets in an attempt to identify a method that maximizes coverage of relevant documents while minimizing extraneous material.

Response: Thank you. This is indeed a very accurate outline of the study.

I have several major concerns with the paper.

First, the paper is framed in terms of aiding systematic scientometric reviews and in bridging the local-to-global continuum in terms of datasets. However, the paper does little to address either of these issues. In fact, there is little to no referencing of these framing features throughout the results and discussion other than a simple reprisal of the claims at the very end of the paper. So, while the issues used to frame the paper are real, the body of the paper really doesn’t address the issues. I would thus suggest that the paper be framed more simply in terms of differences between datasets, which is what the paper is really about.

Response: While the paper indeed focuses on the differences between datasets, the motivation for doing so is driven by the lack of flexibility in accessing the local-global continuum. The cascading citation expansion presented in the paper offers a method that enables us to iteratively expand a set of articles so as to include relevant articles that are not matched by initial search queries. Therefore we believe this paper offers a concrete method in response to the local-global dilemma when researchers try to find publications with high recall (in theory in favor of global approaches) and up-to-date structures (as global structures may not be updated as frequently as needed).
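As an illustrative aside, the iterative, generation-by-generation expansion described in this response can be sketched as a breadth-first traversal of the citation graph. This is only a sketch under stated assumptions: `neighbors` stands in for a call to a bibliographic service such as the Dimensions API (returning either the articles citing a paper, for forward expansion, or its cited references, for backward expansion), and all names here are our own illustration, not CiteSpace's actual implementation.

```python
from typing import Callable, Iterable, List, Set


def cascade_expand(
    seeds: Iterable[str],
    neighbors: Callable[[str], List[str]],
    generations: int,
) -> Set[str]:
    """Expand a seed set of article IDs over a citation graph.

    `neighbors` returns the citation links of one article (citing
    articles for forward expansion, cited references for backward
    expansion). Each pass over the frontier is one generation.
    """
    collected: Set[str] = set(seeds)
    frontier: Set[str] = set(seeds)
    for _ in range(generations):
        next_frontier: Set[str] = set()
        for article in frontier:
            for linked in neighbors(article):
                if linked not in collected:
                    collected.add(linked)
                    next_frontier.add(linked)
        frontier = next_frontier
        if not frontier:  # no new articles found; stop early
            break
    return collected


# Toy citation graph standing in for API lookups: "a" is cited by "b"
# and "c", and so on. A 5-generation forward expansion from "a" then
# gathers every article reachable through citation links.
citing = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": [], "e": []}
five_gen = cascade_expand(["a"], lambda pid: citing.get(pid, []), generations=5)
```

In this toy graph the expansion saturates after two generations, which mirrors the point made above: the boundary of the collected set is determined by the citation links rather than by the wording of the initial query.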
As we show in the paper, with cascading citation expansions in different combinations, the boundaries of the topic of interest are no longer limited by the initial search. Another advantage of citation-based expansion is to reduce the burden of having to rely on domain experts in constructing complex queries, and thus to shift the focus of domain experts from the question of what we may miss to the question of whether a topic in front of us is relevant enough. The local-global continuum also provides a necessary context for the study in connection with the principles of literature-based discovery, namely to discover new connections, including ones that individual domain experts may not be aware of, as one cannot expect to have access to domain experts who know everything about the research area to be surveyed. We have revised the text with reference to the role of the local-global continuum.

Second, and perhaps more importantly, the purpose of the paper is to contrast various ways of creating a dataset. However, the analysis is a full step removed from this because the results of each query (and the superset) are not directly compared. Instead, these results are run through the CiteSpace co-citation black box, which hugely amplifies some queries while dramatically reducing the F5 query and the overall space, in addition to being a citation generation removed from all of the queries. Co-citation does tend to remove the cutting edge of research fronts, and since reviews are often focused on the hot or emerging topics, I fail to see how this method could be a great basis for aiding in reviews. I’m not saying that CiteSpace/co-citation should not be used, but I am saying that the authors need to honestly and accurately characterize what they are really doing. This does not currently come through in the paper.

Response: Thank you for your comments. We added direct comparisons between different datasets with new tables (Tables 3, 4).
In general, they capture many articles that are not in the F set, i.e., the query search result. For example, the S5 set contains 44,883 articles, with only 702 of them found in F. More details can be found in the manuscript.

There are a few reasons for using CiteSpace. First, the original motivation of the study is to improve the quality of systematic scientometric studies supported by CiteSpace and a few other tools such as VOSviewer; improving the quality of input data is a common need for applications of scientometric tools. Second, CiteSpace allows us to inspect the interrelationships between datasets obtained from different expansions in context, so as to contrast which areas are covered well by particular expansions and which areas are underrepresented. The visualization process retains articles with at least 2 citations in at least one year. The resultant map does not represent all the articles in a dataset, only those with sufficient citations. This is a justifiable selection, and it is used widely in practice, including in global approaches.

Co-citation maps in CiteSpace have grown considerably beyond the classic notion of co-citation by incorporating information from citing articles to a great extent. Perhaps we ought to give it a new name to indicate the difference instead of continuing to call it a co-citation network. Notably, clusters shown in the interactive visualization represent a duality between citing articles and cited references. Cluster labels are noun phrases extracted from citing articles to cited references in corresponding clusters. Year-by-year labels of a cluster provide detailed information on emerging topics of citing articles. For example, clusters such as big data and deep learning are identified in the study; citing articles to these clusters feature many articles as recent as 2018, considering the data was collected in early 2019. CiteSpace does support functions for generating maps based on bibliographic coupling.
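The retention rule mentioned above (an article must attract at least 2 citations in at least one year to appear as a node) can be sketched roughly as follows. This is our own illustration of such a threshold, not CiteSpace's actual code; the input layout and the function name are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def select_nodes(
    citation_links: Iterable[Tuple[str, str, int]],  # (citing_id, cited_id, year)
    min_per_year: int = 2,
) -> Set[str]:
    """Keep only cited references that receive at least `min_per_year`
    local citations within at least one year."""
    per_year: Dict[str, Dict[int, int]] = defaultdict(lambda: defaultdict(int))
    for _citing, cited, year in citation_links:
        per_year[cited][year] += 1
    return {
        ref
        for ref, yearly_counts in per_year.items()
        if max(yearly_counts.values()) >= min_per_year
    }


# r1 gets 2 citations within 2018 and is retained as a node;
# r2 gets 2 citations overall but never 2 in a single year, so it is dropped.
links = [
    ("p1", "r1", 2018),
    ("p2", "r1", 2018),
    ("p3", "r2", 2017),
    ("p4", "r2", 2018),
]
retained = select_nodes(links)
```

A filter of this kind explains why a small set of citing articles (such as the 73 in Nf) can yield a much larger node set drawn from their references, while datasets whose references are cited only sparsely shrink under the same rule.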
In this study, we limit our focus to how cascading citation expansion may affect subsequently generated science maps. We believe citation/co-citation provides additional vetted information about the bibliometric landscape that alternative methods may not readily reveal. Note that there is a wide variety of possible routes to conduct a visual analytic study of a given dataset; here these examples are primarily provided to demonstrate the major differences between different strategies of data collection.

As a side note, the ‘black-box’ process in CiteSpace has many user-controllable parameters that allow users to verify the role of each parameter in the workflow, although the complexity involved in the process is indeed non-trivial in terms of the number of decisions to make when selecting a particular workflow to proceed. We have updated the text accordingly to clarify the above points and to provide more detailed descriptions of the steps.

Related to this, the CiteSpace step (bridging Tables 3 and 4) is all of one or two sentences, and the first impression from comparing the numbers of articles/nodes in Tables 3 and 4 is one of bewilderment. What is going on here? How does NF go from 73 articles to 1903 nodes? Also, why does CiteSpace end up with a much smaller set for F5 and smaller sets for the other queries? This all seems to be magic, and much more detail is required for this transition.

Response: We have revised the descriptions to make them self-contained and added references to publications that describe the relevant information in further detail. The two tables are updated as well. The Nf dataset contains 73 articles, i.e., source or citing articles. These 73 articles cite 2,239 unique references; 1,903 of them satisfied the selection criteria based on local citation counts, and they become the nodes in the corresponding network of references for Nf. Similarly, S5 contains a much larger number of citing articles, which in turn cited an even larger number of references.
References must meet the selection criteria to be retained. As shown in publications such as van Noorden et al. (2014), the majority of the paper mountain consists of papers with zero citations globally. We have revised the text accordingly to clarify the details.

Finally, the comparison is qualitative, which it has to be given the lack of ground truth. On the other hand, the authors chose this subject because of their familiarity with it, and yet fail to mention which of the clusters are within vs. without what they would consider LBD to encompass. They seem to favor the F5 expansion. Are there clusters in the F5 that, because of over-expansion, are really outside the field? We are left wanting a more definitive result.

Response: We give a higher priority to reducing the risk of missing important relevant articles than to including less relevant ones. In part, we share the view of literature-based discovery in general in that we consider it critical to bring otherwise disparate bodies of scholarly publications to the attention of researchers. Furthermore, the quality of the coverage directly impacts subsequent analytic tasks, including the assessment of the relevance of a given set of articles retrieved.

In the revised version, we particularly discuss the relevance issue with reference to clusters on big data, deep learning, and human metabolisms. The first two clusters are considered relevant with reference to literature-based discovery research based on the roles they played in those studies (through inspections in the cluster explorer), whereas the third one (cluster #6), with a specific focus on domain-specific topics such as microRNA and RNA interference and without strong evidence of connections to literature-based discovery, is considered beyond the scope of literature-based discovery. We have revised the text accordingly.

Minor issues:

Michel Zitt wrote about citation expansion several times, most directly in IPM 2006, but also presaged in Scientometrics 1996 and 2003.
Many people have used citation expansion – for instance, the original Places&Spaces display had a three-generation backward citation expansion based on Nobel Prize winners and others that was displayed on a science map. Author Chen also has a 2006 paper with forward expansion. So this paper is not introducing citation expansion, but is definitely refining it. Claims in this paper should be adjusted accordingly.

Response: Michel Zitt’s (2006) IPM paper is indeed relevant and useful. We have added an overlay of the core references of a dataset based on the ratio of local and global citations. The core list is useful as an additional level of aggregation of references to highlight the boundary of the core references.

Theoretically, the concept of citation expansion is indeed conceivable from the original introduction of citation indexing. Pragmatically, however, the workflow has remained out of reach for most researchers. We emphasize the practical contribution of the cascading part of the citation expansion, which requires sustained access to the source database; therefore, we welcome the new opportunities enabled by the Dimensions API. The 5-generation citation expansions illustrated in the paper may continue further and thus reduce the gap between a local and a global view of the literature. More importantly to us, our work enables many researchers to adopt the workflow by applying the freely available functions supported in CiteSpace to the vast coverage of the Dimensions platform. We have adjusted our claims in the revised text accordingly.

Although creation of global maps is indeed “limited to a small number of researchers”, it is also true that a global model is now available to all users of SciVal, which has over a thousand institutional subscribers, and is being widely used for institutional portfolio analysis. While the detailed information in this global model is not used widely to seed systematic reviews, it could be.

Response: Yes, it could.
Updating a global model is, however, much more time- and effort-consuming than creating a local model from scratch. According to Börner et al. (2012), the 2005 UCSD map was created in 2007; its next version, the 2010 UCSD map, was reported in 2012, adding six years of data from WoS and three years from Scopus. The issue is how long global maps can stay valid, as the structure depicted by the existing model is constantly subject to the changes introduced by new publications, and how sensitive the current model is to changes that may present only at a finer level of granularity, e.g., as shown in local maps. What we contribute here is a compromise that offers increased flexibility and agility.

The authors claim that local maps are free from the need for stability. I disagree. If there is any comparison to be done from one point in time to another, stability is still needed, and cannot be achieved by a local map.

Response: The sentence you referred to is probably this one: “Localized science maps are free from the need to maintain a stable structure.” What we intend to convey here is that for local maps the structure is constructed based on the current data; thus the most recent data is taken into account. In contrast, if a global map was created a few years ago, then it is a legitimate question whether any subsequently published articles might have altered the structure significantly enough, especially at finer levels of granularity.

If we define global maps as the ones featuring all scientific disciplines (Börner et al. 2012), then it is certainly achievable by a local map to make comparisons. What we need in such scenarios is a sufficiently large context to accommodate snapshots taken at different time points.
The Link Walkthrough function in CiteSpace allows users to step through the visualized network one time slice at a time so that users can see which areas are covered in a particular time slice. We have revised the text accordingly.

When it comes to query-based search, I would suggest using a reference to at least one paper by Alan Porter that contains a detailed multi-part query. Also, to be fair to the detailed multi-part query, these queries are typically designed in a multi-step process. A query is run, results are examined to see what should be dropped and/or added, and this is done multiple times, which results in the detailed query with presumably high precision/recall. This should be mentioned. Porter’s paper will mention this process.

Response: We have added the following papers to the related work/discussion: two by Porter and his coauthors and one by Kostoff et al. They are in the same category of constructing complex queries through iterative processes. The key strength of such approaches is the input of domain-specific expertise, which is also a potential weakness of the method if the domain expertise is not readily accessible. It is cognitively more demanding to think of what is missing in the queries than to review whether specific articles retrieved are indeed relevant. Besides, multi-part complex query approaches can also benefit from cascading citation expansion, as the initial burden on the domain experts would be much reduced. In addition, consistent with the basic idea of literature-based discovery, it is valuable to uncover disparate bodies of relevant literature that even domain experts may not be aware of.
In the IPM paper you recommended to us, Zitt identified some of the issues with query-based approaches: “The reasons for the absence are various: terms forgotten by experts; terms deliberately excluded because of the fear of noise generated by too general or ambiguous formulation; terms deliberately excluded as deemed out of the scope of the topic.” (p. 1526)

• Huang Y, Schuehle J, Porter AL, Youtie J. A systematic method to create search strategies for emerging technologies based on the Web of Science: illustrated for ‘Big Data’. Scientometrics. 2015;105(3):2005-22. doi: 10.1007/s11192-015-1638-y.

• Porter AL, Youtie J, Shapira P, Schoeneck DJ. Refining search terms for nanotechnology. Journal of Nanoparticle Research. 2008;10(5):715-28. doi: 10.1007/s11051-007-9266-y.

• Kostoff RN, Koytcheff RG, Lau CGY. Technical structure of the global nanoscience and nanotechnology literature. Journal of Nanoparticle Research. 2007;9(5):701-24.

Figure 3 would be far easier to interpret if the y-axis used base10 for its log scale.

Response: Thank you for your comments. Revised as such.

In the Figure 7 overlay maps, it is very difficult to tell what is overlay and what is basemap. It would be far easier to distinguish the overlay if the basemap were in gray or some other neutral color, and the overlay were colored.

Response: Thank you for your comments. Revised as such.

Thanks again for the reviewers’ comments!

________________________________________

Submitted filename: Response to Reviewers.docx

3 Oct 2019

Visualizing a Field of Research: A Methodology of Systematic Scientometric Reviews

PONE-D-19-16569R1

Dear Dr. Chen,

We are pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it complies with all outstanding technical requirements. Within one week, you will receive an e-mail containing information on the amendments required prior to publication.
When all required modifications have been addressed, you will receive a formal acceptance letter and your manuscript will proceed to our production department and be scheduled for publication.

Shortly after the formal acceptance letter is sent, an invoice for payment will follow. To ensure an efficient production and billing process, please log into Editorial Manager at https://www.editorialmanager.com/pone/, click the "Update My Information" link at the top of the page, and update your user information. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, you must inform our press team as soon as possible and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

With kind regards,

Wolfgang Glanzel, PhD
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions.
Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: N/A

**********

4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?).
If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: Yes: Kevin W. Boyack

21 Oct 2019

PONE-D-19-16569R1

Visualizing a field of research: A methodology of systematic scientometric reviews

Dear Dr. Chen:

I am pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org. For any other questions or concerns, please email plosone@plos.org.

Thank you for submitting your work to PLOS ONE.

With kind regards,

PLOS ONE Editorial Office Staff
on behalf of
Prof. Dr. Wolfgang Glanzel
Academic Editor
PLOS ONE
Börner K, Klavans R, Patek M, Zoss AM, Biberstine JR, Light RP, Larivière V, Boyack KW. PLoS One. 2012-07-12.

Zhai F, Zhai Y, Cong C, Song T, Xiang R, Feng T, Liang Z, Zeng Y, Yang J, Yang J, Liang J. Int J Environ Res Public Health. 2020-05-26.