Measuring the diffusion of innovations with paragraph vector topic models.

David Lenz, Peter Winker.

Abstract

Measuring the diffusion of innovations from textual data sources besides patent data has not been studied extensively. However, early and accurate indicators of innovation and the recognition of trends in innovation are mandatory to successfully promote economic growth through technological progress via evidence-based policy making. In this study, we propose Paragraph Vector Topic Model (PVTM) and apply it to technology-related news articles to analyze innovation-related topics over time and gain insights regarding their diffusion process. PVTM represents documents in a semantic space, which has been shown to capture latent variables of the underlying documents, e.g., the latent topics. Clusters of documents in the semantic space can then be interpreted and transformed into meaningful topics by means of Gaussian mixture modeling. Using PVTM, we identify innovation-related topics from 170,000 technology news articles published over a span of 20 years and gather insights about their diffusion state by measuring the topic importance in the corpus over time. Our results suggest that PVTM is a credible alternative to widely used topic models for the discovery of latent topics in (technology-related) news articles. An examination of three exemplary topics shows that innovation diffusion could be assessed using topic importance measures derived from PVTM. Thereby, we find that PVTM diffusion indicators for certain topics are Granger causal to Google Trend indices with matching search terms.

Year:  2020        PMID: 31967999      PMCID: PMC6976149          DOI: 10.1371/journal.pone.0226685

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

The rapidly growing amount of digital information provides novel data sources for economic analysis with regard to identifying and measuring innovation trends. To exploit this valuable information, there is a growing need for automated information retrieval from large text corpora [1]. Meanwhile, great progress has been made in machine learning and neural network theory leading to the emergence of new methods for extracting high-quality indicators from text. However, the opportunities created by the ongoing digitization have not yet been fully acknowledged nor have they been extensively studied in the economic literature. Following [2], who suggested that machine learning methods should be more widely known to and used by economists, more research that appropriately incorporates these new data sets and methods needs to be conducted in the field of innovation economics. For a few early exceptions, see the broader economic-use cases discussed below. In an innovation context, working with textual data is a well-established practice. However, most studies have used patent registers as their main source of text data and have relied on more traditional methods to transform text into something readable by a computer. A major part of the literature considers either patents, some specific aspects of innovation, or technology trends when investigating innovation. [3] is an early work that took into account not only the number of filed patents or frequency of citations but also the primary texts of the patents. The authors utilized text mining techniques to extract important keywords to build frequency vectors representing single patents. Subsequently, the authors performed a network analysis to visualize the relationships between patents and proposed measures for “importance,” “newness,” and “similarity” of the patents. Later studies such as [4], who considered patents involving a light emitting diode (LED) and wireless broadband fields, also conducted a network analysis. 
[5] proposed a semantic approach for patent classification that can be used in addition to the traditional USPC (United States Patent Classification) classes and also helps predict technology convergence. Recent work by [6] described an automated approach for patent landscaping, i.e., the process of finding patents related to a particular topic. Starting with a human-selected seed set of patents and expanding it through citations and class codes, semi-supervised machine learning is used to prune the expanded data set. [7] applied clustering techniques based on keyword analysis and community detection to scientific publications in embryology to reconstruct the dynamics of the research field and to understand early-warning signals prior to the occurrence of some special events. [8] considered texts of approximately 170,000 awards from 2000 to 2011 that were included in the portfolio of the National Science Foundation (NSF). The authors were interested in measuring the interdisciplinarity of the NSF portfolio and introduced an improved topic model based on latent Dirichlet allocation (LDA) [9]; see [10] for an introduction to probabilistic topic modeling including the LDA algorithm. Our approach differs from previous work mainly in two aspects, the data and the methodology. While most studies focus on patents, our study is, to our knowledge, the first to examine news articles as a way to measure the diffusion of innovations. Additionally, very few studies concerned with innovation research used embedding methods to represent text. For example, one study closely related to ours is [11]. The authors utilized patent filings dating back to 1840 to estimate their novelty and significance by quantifying the impact on future technological innovations, using a time-aware term-weighting scheme based on term-frequency inverse-document-frequency (tf-idf) that the authors specifically constructed for this purpose.
With this approach, the authors can capture technological evolution over a long time span, demonstrating the captured trends as strong predictors of productivity at various levels. In the same line, we tried to capture the diffusion of innovations by representing potential innovations as topics that can be found in news articles using topic models. Our study differed from [11] in the time scope (20 years for our data set vs 170 years) as well as the data source (news articles vs patents). Further, we relied upon text embedding methods to represent text, compared to the tf-idf scheme utilized by [11]. We are also more interested in the diffusion of innovations than technological progress as a whole, for which our data set is not extensive enough. Regarding the methodology, in natural language processing (NLP), topic modeling describes a set of methods to extract the latent topics from a collection of documents. Topic models have been applied to extract stock topics from financial news with an application to predict abnormal returns [12], to categorize 8-K filings and determine which topics are associated with abnormal returns [13], to measure the novelty of financial news [14] and to determine the key issues faced by firms [15]. Researchers have also used topic models to investigate the effect of increased transparency on monetary policy [16] and have adapted topics found in a Norwegian business newspaper to model the impact of news on the business cycle [17]. Scientific discourse modeling is also within the scope of topic models. For example, [18] identified topics in the Journal of Economics and Statistics to study whether or not the scientific discussion of topics correlates with the actual development of economic key indicators. [19] examined how central bank communications affect real economic variables using topic modeling, and [20] employed LDA to model the topics in the Journal of Economic History between 1941 and 2016.
For many years, LDA has been the algorithm of choice for modeling latent topics in text corpora. However, LDA only describes the statistical relationships between words in a text corpus based on co-occurrence probabilities, which might not be the best feature representation for text [21]. Furthermore, LDA reportedly has long computation times, especially with large text corpora [22]. The interpretation of topics generated by LDA is not always straightforward, and the necessary mental effort to give meaning to the extracted words can be tedious and demanding work [23]. We propose a topic model architecture based on neural embedding methods that can generate meaningful and coherent topics as an alternative to LDA. In particular, we use Paragraph Vector (also known as Doc2Vec, [24]) to compute vector space representations of text documents and Gaussian mixture models (GaussMMs) to cluster the resulting document vectors into meaningful semantic topics. We call this combination of embedding and clustering Paragraph Vector Topic Modeling (PVTM). Our main methodological contribution is the way in which we combined these algorithms to extract latent topics from text corpora. A methodologically related approach to PVTM is from [25], who identified relevant articles for systematic reviews using Paragraph Vectors to construct vector representations of documents and then cluster the document vectors using k-means. The resulting cluster centroids were interpreted as latent topics, and the topic probabilities for the documents were constructed using the distance between cluster centroids and document vectors. In comparison, while also relying on Paragraph Vectors to construct vector representations of documents, we employed soft clustering via GaussMMs to directly assess topic memberships at the document level. The applicability of our approach was demonstrated on a corpus of news articles from the German IT news publisher Heise Medien from the last 20 years. 
We compared our constructed topic measures with Google Trend indices (GTIs, https://trends.google.com/trends/). Google Trends measures the evolution of the frequency of search terms entered by users of the site over time. The results are set in relation to the total search volume and have been available in weekly resolution since the beginning of 2004 for the whole world or individual regions. Our results suggest that PVTM is well suited for topic modeling this type of text data. Moreover, the topic probabilities generated by PVTM might serve as a proxy to measure the diffusion of certain types of innovations. In particular, we found that the topic importance measures generated by PVTM Granger-caused GTIs for matching search terms. The remainder of this article is organized as follows. Section 1 details the methodological background of our analysis. The news ticker data set and the experimental design are reviewed in Section 2. In Section 3 we discuss our findings. Section 4 summarizes our results and describes avenues for future research.

1 Paragraph vector topic modeling

Section 1 introduces the PVTM methodology to generate topics and topic membership probabilities. PVTM relies upon the Paragraph Vector algorithm to generate document representations that are then clustered using GaussMMs. Our main methodological contribution is the way in which we combined these algorithms to extract latent topics from text corpora. Also, to the best of our knowledge, very few studies have used embedding methods in an innovation context, e.g., [6]. Paragraph Vector borrows the main ideas from Word2Vec, which is why it is useful to discuss the Word2Vec mechanics for encoding single words before detailing Paragraph Vector. In the last part of this section, we discuss the document clustering algorithm and provide intuition about the outputs yielded by PVTM.

1.1 Neural embeddings of words and documents

Neural network based embedding methods like Word2Vec [26, 27] play an increasingly vital role for encoding the semantic and syntactic meaning of words. Word2Vec builds low-dimensional dense vector space representations which encode the meaning of a word in a given context. As similar words tend to appear in similar contexts [28], encoding words based on their local context captures interesting properties in the resulting vectors, which have been shown to represent the way in which we use these words [29]. Intuitively, words that share many contexts are more similar than words that share fewer contexts. Vector calculations on the resulting word vectors yielded useful results, for example, $v_{\text{king}} - v_{\text{man}} + v_{\text{woman}} \approx v_{\text{queen}}$ [26], where $v_{\text{word}}$ is the vector representation for a given word. These meaningful representations of words can be used as features in a variety of NLP tasks, including topic modeling. Word2Vec comes in two architectural variants: Skip-Gram (SG) and Continuous Bag of Words. We discuss the SG architecture in more detail, as this is relevant for our application of the Doc2Vec model.

Skip-Gram

During model training, the SG architecture iterates over a given text corpus in fixed-size sliding windows and generates (context, target) word pairs, where q target words are considered on both sides of the context word, which is the word in the middle. Assume the following 5-word window (q = 2): Innovation is good for business. The context word $w_i$ would be good, and the target words $w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}$ are Innovation, is, for and business. This results in 4 (context, target) pairs: (good, Innovation), (good, is), (good, for) and (good, business).

Given such pairs of context and target words resulting from the sliding window for a document, the construction of the vector representation is done as follows. First, each word in the vocabulary was represented as a one-hot encoded sparse vector $v$ of size V × 1, where V is the number of different words in the vocabulary. In a one-hot encoding scheme, a 0 indicated the absence of a word, whereas a 1 indicated the presence of a word. Thus, $v$ represented a sparse vector consisting of zeros and a single 1 in the row corresponding to the word. Second, for each context word with one-hot vector $v_c$, a lower dimensional (denser) vector representation $d_c$ was obtained using a projection matrix $W$ of dimension D × V, where D ≪ V:

$$d_c = W v_c \tag{1}$$

Eq (1) extracts the column of $W$ corresponding to the non-zero entry in $v_c$ to use it as a vector representation $d_c$ for the context word in a D-dimensional vector space. The same procedure was repeated for each target word using a different projection (weight) matrix $W'$ of size D × V and the one-hot representation $v_t$ of the target word, resulting again in a D-dimensional vector $d_t$, which is the vector representation of the target word:

$$d_t = W' v_t$$

The probability of finding a target word t close to the context word c was modeled based on the similarity between the corresponding vectors $d_t$ and $d_c$ in the low dimensional vector space. The dot product of both vectors was used as a measure of similarity, which is closely related to the cosine similarity but also takes into account the lengths of both vectors. Then, the probability that a target word t is observed close to context word c was defined making use of the softmax function [30]:

$$p(t \mid c) = \frac{\exp\left(d_t^{\top} d_c\right)}{\sum_{v=1}^{V} \exp\left(d_v^{\top} d_c\right)}$$

where the sum in the denominator is over all V words in the vocabulary, i.e., all potential target words. The projection matrices $W$ and $W'$ were obtained by maximizing the joint (log-)probability of observing the target words for all context words in the corpus:

$$\max_{W,\,W'} \; \sum_{r=1}^{R} \sum_{i=1}^{n_r} \sum_{\substack{-q \le j \le q \\ j \ne 0}} \log p\left(w^{r}_{i+j} \mid w^{r}_{i}\right)$$

Here, $C = \{D_1, \dots, D_R\}$ was the corpus of R news articles, where each article consisted of several words, i.e., $D_r = (w^{r}_{1}, \dots, w^{r}_{n_r})$. The number of words in article $D_r$ was denoted by $n_r$, and the size of the corpus was denoted by R. Thus, given an article $D_r$, $w^{r}_{i}$ was the current context word used to predict a target word $w^{r}_{i+j}$ that fell in a given range around the context word. The size of the context window is denoted by q. Gradient descent was used to iteratively update the weights in $W$ and $W'$ until a convergence criterion was met. After training the weights, $W$ acts as a lookup table for the vector representations of words.

We used negative sampling and thereby sampled 5 negative examples for each positive word. With negative sampling, the SG algorithm compared the observed word-context pairs with randomly generated unobserved pairs and minimized the probability of the negative pairs while maximizing the probability of the actual word-context pairs. Negative sampling avoids calculating the softmax function for all possible words in the vocabulary and has been shown to speed up training and improve the resulting word vectors. [31] showed that the Word2Vec-SG architecture with negative sampling is equivalent to a weighted logistic principal component analysis.
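The pair generation and the softmax probability described in this section can be illustrated with a minimal sketch in plain Python. The function names `skipgram_pairs` and `softmax_probability` are hypothetical helpers introduced here for illustration only; they are not taken from the authors' implementation, and the toy vectors are made up.

```python
import math

def skipgram_pairs(tokens, q):
    """Return (context, target) pairs; the context word is the window centre
    and up to q words on each side of it are used as target words."""
    pairs = []
    for i, context in enumerate(tokens):
        lo, hi = max(0, i - q), min(len(tokens), i + q + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((context, tokens[j]))
    return pairs

def softmax_probability(d_c, target_vectors, t):
    """p(t | c): softmax over the dot products between the context vector d_c
    and the vector of every potential target word."""
    scores = [sum(a * b for a, b in zip(d_t, d_c)) for d_t in target_vectors]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    return exps[t] / sum(exps)

# The 5-word window from the text, with q = 2.
tokens = ["Innovation", "is", "good", "for", "business"]
pairs_for_good = [p for p in skipgram_pairs(tokens, q=2) if p[0] == "good"]
print(pairs_for_good)
```

For the context word good this reproduces the four pairs listed in the text. Note that the full softmax shown here is exactly the computation that negative sampling avoids during training.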

1.2 Paragraph vector

The Paragraph Vector algorithm [24] expands the ideas of Word2Vec to longer pieces of texts. Instead of word vectors, document vectors are learned during the training process. The resulting vector representations have been shown to capture latent semantic properties of the text fragments, such as the underlying semantic topic of a document. The Paragraph Vectors algorithm also comes in two variants: Distributed Bag of Words (DBOW) and Distributed Memory (DM). In our experiments, we focused on the DBOW methodology, as it has been shown to produce slightly better results compared to DM. Although [24] reported that the DM architecture seems to perform better, subsequent research came to different conclusions [32]. DBOW builds upon the Word2Vec—SG architecture but replaces the center word with a unique document ID. Thus, instead of conditioning a single word on its surrounding words, the whole document is conditioned on the words appearing in it.
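To make the DBOW idea more concrete, the following sketch builds training examples in which the document ID takes the place of the center word. It is a simplified illustration, not the authors' code: `dbow_examples` is a hypothetical helper, the toy documents are invented, and negatives are drawn uniformly without replacement, whereas common Word2Vec implementations sample negatives from a smoothed word-frequency distribution.

```python
import random

def dbow_examples(docs, vocab, n_negative=5, seed=0):
    """Build (doc_id, word, label) examples: label 1 for a word observed in
    the document, label 0 for a randomly drawn negative word."""
    rng = random.Random(seed)
    examples = []
    for doc_id, tokens in docs.items():
        for word in tokens:
            examples.append((doc_id, word, 1))
            candidates = [w for w in vocab if w != word]
            k = min(n_negative, len(candidates))
            for negative in rng.sample(candidates, k):
                examples.append((doc_id, negative, 0))
    return examples

# Toy corpus (hypothetical); each document is identified by a unique ID.
docs = {
    "doc_0": ["tablet", "android", "display"],
    "doc_1": ["wiki", "encyclopedia", "article", "tablet"],
}
vocab = sorted({w for tokens in docs.values() for w in tokens})
examples = dbow_examples(docs, vocab)
print(len(examples), examples[:3])
```

Each positive example is followed by its sampled negatives, which mirrors the setting of 5 negative examples per positive word mentioned in Section 1.1.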

1.3 Gaussian mixture clustering

Topic modeling uses clusters of important words to define topics, where different topics may share some words. In our approach, this was done by clustering the vector representations obtained from Paragraph Vectors and then determining the most relevant words per cluster. A GaussMM (see, e.g., [33]) is a parametric probability density function represented as a weighted sum of Gaussian component densities [34]. GaussMMs employ the expectation maximization algorithm [35] to fit a mixture of Gaussian models to a given data set and can be used to represent normally distributed subpopulations within an overall population. GaussMMs have been used to track multiple objects in video sequences [36], to extract features from speech data [37] and for speaker verification [38]. Compared to frequently used clustering techniques such as k-means [39] or mean-shift [40], GaussMMs offer the advantage of soft clustering the data. Soft clustering allows multiple cluster memberships per document, so each document can be represented as a probability distribution over the cluster memberships. The result of the process is a matrix with one row per document and one column per identified cluster, where each entry represents the probability of belonging to a certain cluster. Given that Paragraph Vectors capture latent topics in the corpus, it is reasonable to suggest that clustering the resulting document vectors can be seen as identifying latent topics. 
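Particularly, given a D-dimensional vector representation $d_r$ of a news article $D_r$ and a pre-set number of Gaussian components M with mixture weights $w_i$, the GaussMM was defined as a weighted sum over the M Gaussian components, where the mixture weights satisfied the constraint $\sum_{i=1}^{M} w_i = 1$:

$$p(d_r \mid \lambda) = \sum_{i=1}^{M} w_i \, g(d_r \mid \mu_i, \Sigma_i)$$

Thereby, each mixture component $g(d_r \mid \mu_i, \Sigma_i)$, $i = 1, \dots, M$, was defined as a D-variate Gaussian function of the form

$$g(d_r \mid \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{D/2} \, |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (d_r - \mu_i)^{\top} \Sigma_i^{-1} (d_r - \mu_i) \right\}$$

with $\mu_i$ and $\Sigma_i$ representing the mean vector and covariance matrix respectively, and $\lambda$ collecting all parameters of all mixture components, i.e.,

$$\lambda = \{ w_i, \mu_i, \Sigma_i \}, \quad i = 1, \dots, M$$

The optimal parameter configuration $\lambda$ was estimated by iteratively updating the model components to best fit the training data using the EM algorithm.

$$p(W \mid \lambda) = \prod_{r=1}^{R} p(d_r \mid \lambda)$$

is the GaussMM likelihood given the training data $W = (d_1, \dots, d_R)$. Starting with an initial configuration $\lambda$, a new configuration $\bar{\lambda}$ was computed such that $p(W \mid \bar{\lambda}) \ge p(W \mid \lambda)$ for the R training vectors collected in $W$. The initial configuration was computed using k-means, and the mixture components were updated according to

$$\bar{w}_i = \frac{1}{R} \sum_{r=1}^{R} \Pr(i \mid d_r, \lambda) \tag{8}$$

$$\bar{\mu}_i = \frac{\sum_{r=1}^{R} \Pr(i \mid d_r, \lambda) \, d_r}{\sum_{r=1}^{R} \Pr(i \mid d_r, \lambda)} \tag{9}$$

$$\bar{\sigma}_i^{2} = \frac{\sum_{r=1}^{R} \Pr(i \mid d_r, \lambda) \, d_r^{2}}{\sum_{r=1}^{R} \Pr(i \mid d_r, \lambda)} - \bar{\mu}_i^{2} \tag{10}$$

where (8) is the update for weight $w_i$, the means are updated according to (9), and (10) details the variance re-estimation, in this case for a diagonal covariance (squares are taken element-wise). The a posteriori probability for component i is given by

$$\Pr(i \mid d_r, \lambda) = \frac{w_i \, g(d_r \mid \mu_i, \Sigma_i)}{\sum_{k=1}^{M} w_k \, g(d_r \mid \mu_k, \Sigma_k)}$$

The downside of GaussMMs is that the number of mixture components M needs to be specified beforehand, and the algorithm is always going to use all M components. This gives rise to the need for external validation methods. One way of evaluating the quality of a given GaussMM clustering is to use theoretical criteria like the Bayesian information criterion (BIC, [41]), which is the approach we took. The procedure described in Sections 1.1–1.3 is referred to as PVTM.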
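As an illustration of the EM iteration for a diagonal-covariance GaussMM, the following plain-Python sketch computes the a posteriori component probabilities and performs one re-estimation of weights, means and variances. It is a didactic sketch under stated assumptions: the function names, the toy data and the hand-picked initial configuration are hypothetical (the text above initializes with k-means), and the variance floor `min_var` is an added safeguard that is not part of the description above.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a D-variate Gaussian with diagonal covariance."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def posteriors(x, weights, means, variances):
    """A posteriori probabilities Pr(i | x, lambda) for one vector x."""
    logs = [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)  # log-sum-exp trick for numerical stability
    unnorm = [math.exp(l - top) for l in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def log_likelihood(data, weights, means, variances):
    """Log of the GaussMM likelihood of all training vectors."""
    total = 0.0
    for x in data:
        logs = [math.log(w) + log_gauss_diag(x, m, v)
                for w, m, v in zip(weights, means, variances)]
        top = max(logs)
        total += top + math.log(sum(math.exp(l - top) for l in logs))
    return total

def em_step(data, weights, means, variances, min_var=1e-6):
    """One EM iteration: weight, mean and diagonal variance re-estimation."""
    n, dim = len(data), len(data[0])
    resp = [posteriors(x, weights, means, variances) for x in data]
    new_w, new_mu, new_var = [], [], []
    for i in range(len(weights)):
        n_i = sum(r[i] for r in resp)
        mu = [sum(r[i] * x[k] for r, x in zip(resp, data)) / n_i
              for k in range(dim)]
        var = [max(sum(r[i] * x[k] ** 2 for r, x in zip(resp, data)) / n_i
                   - mu[k] ** 2, min_var)
               for k in range(dim)]
        new_w.append(n_i / n)
        new_mu.append(mu)
        new_var.append(var)
    return new_w, new_mu, new_var, resp

# Toy 2-dimensional "document vectors" forming two groups (hypothetical).
data = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
w0, mu0, var0 = [0.5, 0.5], [[1.0, 1.0], [4.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]]
w1, mu1, var1, resp = em_step(data, w0, mu0, var0)
print(w1, log_likelihood(data, w0, mu0, var0), log_likelihood(data, w1, mu1, var1))
```

Each row of `resp` is the soft cluster assignment of one vector, i.e., one row of the document-topic matrix in the PVTM setting.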
Similar to LDA, the most widely used topic model, results yielded by PVTM are twofold: 1) a list of topics, wherein each topic is associated with certain words especially relevant in the context of the topic, and 2) a document-topic matrix, in which each document is assigned a probability for each of the topics determined in 1). Topics were found by clustering document embeddings obtained from Doc2Vec. The document-topic matrix is created by applying the GaussMM with optimized weights to the vector representations of the documents. The result was a probability distribution across all topics (clusters). The most important words per topic were determined by the proximity of the trained word vectors to the topic vector. While learning the document embeddings, word vectors were also trained and embedded in the same space. The cosine similarity could then be used to determine the most similar words to the cluster centers (= topic vectors). These words then formed the word list for the corresponding topic. The topic vectors corresponded to GaussMM cluster centers. A disadvantage of the method is that a large corpus is necessary to get reasonable word vectors (and also document vectors). The intuition behind PVTM is that Doc2Vec embeddings group similar documents into similar regions of the embedding space. Using a clustering algorithm on the high-dimensional document representations yields document clusters, i.e., regions with a high density of documents sharing similar latent elements. PVTM makes these latent elements accessible in the form of words located in close proximity to the topic.
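The selection of topic words by cosine similarity to a topic vector can be sketched as follows. The helper names and the two-dimensional toy vectors are hypothetical and serve only to illustrate the ranking step; in PVTM the word vectors and the cluster centers would come from the trained embedding and the fitted GaussMM.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_words(topic_vector, word_vectors, n=2):
    """Return the n words whose vectors are most similar to the topic vector."""
    ranked = sorted(word_vectors,
                    key=lambda w: cosine(topic_vector, word_vectors[w]),
                    reverse=True)
    return ranked[:n]

# Hypothetical word vectors embedded in the same space as the documents.
word_vectors = {
    "tablet": [0.9, 0.1],
    "android": [0.8, 0.3],
    "wiki": [0.1, 0.9],
    "encyclopedia": [0.2, 0.8],
}
topic_vector = [1.0, 0.2]  # stands in for a GaussMM cluster center
print(top_words(topic_vector, word_vectors))
```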

2 Discovering innovation-related topics from news articles

We applied PVTM to news articles in an effort to discover innovation-related topics and measure their diffusion by means of topic probabilities over time.

2.1 Technology-related news corpus

The data set was formed by news articles published by the German IT news ticker heise online (https://www.heise.de/newsticker/) in their news-ticker archive from 1997 to 2016. The total number of news articles was 174,532, resulting in an average of 8,727 articles per year. However, the number of articles per year before the 2000s was considerably lower compared to subsequent periods. The average news article consisted of 278 words. Fig 1 details the number of documents per year and the number of words per document.
Fig 1

Text corpus descriptive statistics.

A: Number of documents per year. B: Distribution of the number of words per document.

We chose this corpus for the following reasons. First, the corpus has a clear technology reference, which makes the identification of technology-related topics easier. Second, the corpus is available over a long period of time, 20 years, making it possible to measure long-term processes such as the diffusion of innovations in the first place. Third, news is available in a timely manner in comparison to, for example, patent data, so that adjustments or trends can be identified in the short term, which is an important criterion for Science, Technology and Innovation (STI) policy. One of the potential disadvantages is that the corpus is in German, which makes the results less accessible to non-German-speaking readers. This disadvantage is partially mitigated by the fact that many of the frequently used terms in current technologies stem from English and can therefore be understood by non-German-speaking readers. Apart from that, our question was specifically aimed at measuring the diffusion of innovations via news in one country. Germany is an example of an industrialized country. Future studies could transfer these ideas to other countries.

2.2 Data preprocessing & parameter settings

We removed all non-alphanumeric characters and lowercased the resulting words. Next, we applied popularity-based pre-filtering, which is a commonly applied technique in recommendation systems [42]. This can be seen as removing corpus-specific stop words from the vocabulary by setting an upper and a lower threshold on the number of documents in which a word is allowed to occur before it is considered too frequent or too infrequent. All words that occurred in more than 65% of documents or in fewer than 0.05% of documents were filtered out, as these words appeared to be too common or too rare to be useful for our application of PVTM given the size of the corpus. The thresholds were determined empirically, i.e., they were adjusted until all corpus-specific stop words were removed. Thereby, we found that the results were not very sensitive to changes in the upper threshold. However, the lower threshold was a somewhat sensitive parameter for PVTM: with more liberal lower thresholds, it was often difficult to determine meaningful topics because of the way the topic words were obtained. The most similar words to a topic vector were used as topic words. When the lower threshold was very low, very rare words, which might even appear in a single document only, were mapped close to the vector of the document in which they appeared. Thus, with no restriction on the minimum appearance of a word, words that appeared very few times and were very specific happened to be close to the topic vector and were chosen as topic words. The algorithm to compute vector representations from news articles was run for 10 epochs, where each epoch consisted of going over all articles once. Following the default settings, we chose the dimensionality of the document vectors to be 100. We used the BIC to find the optimal number of Gaussian mixture components K (denoted M in Section 1.3) and the best approach to construct the covariance matrices Σ.
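The preprocessing and the popularity-based pre-filtering can be sketched in a few lines of plain Python. This is a minimal sketch, not the authors' pipeline: the function names are hypothetical, and the toy call uses a much higher lower threshold than the 0.05% from the text because the example corpus contains only four documents.

```python
import re

def tokenize(text):
    """Lowercase the text and keep only runs of alphanumeric characters."""
    return re.findall(r"[^\W_]+", text.lower())

def filter_by_document_frequency(docs, max_share=0.65, min_share=0.0005):
    """Drop words that occur in more than max_share or in fewer than
    min_share of all documents (shares are fractions of the corpus size)."""
    n_docs = len(docs)
    doc_freq = {}
    for tokens in docs:
        for word in set(tokens):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    keep = {w for w, c in doc_freq.items()
            if min_share <= c / n_docs <= max_share}
    return [[w for w in tokens if w in keep] for tokens in docs]

raw = ["The tablet: Android!", "The wiki, the tablet.", "The wiki?", "The OLED panel"]
docs = [tokenize(text) for text in raw]
# min_share is raised to 0.3 here only because the toy corpus is tiny.
filtered = filter_by_document_frequency(docs, max_share=0.65, min_share=0.3)
print(filtered)
```

In the toy corpus, "the" is removed as too frequent, while words that occur in a single document are removed as too rare.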
We considered four methods to estimate the covariance matrix: Full used the full covariance matrix for each individual component, i.e., each cluster can take any shape, while Diagonal only used the diagonal of the covariance matrix per component, resulting in cluster shapes that were oriented along the coordinate axes. Tied used a general covariance matrix for all mixture components; therefore, all clusters had an identical shape. Spherical made use of a single variance per component instead of a covariance matrix, resulting in spherical cluster forms in higher dimensions. All possible combinations of K and Σ for K ∈ {50, 100, …, 1000} and Σ ∈ {Diagonal, Full, Spherical, Tied} were tested. From there, we used a two-step procedure to find the optimal model parameters. We first iterated over the parameter space of K in steps of 50, starting from 50, i.e., 50, 100, …, 1000. At every step during the parameter optimization procedure, all four options to construct the covariance matrices were evaluated. The best result K* was then used to construct a smaller search space K ∈ [K* − 50, K* + 50], which was searched in steps of 5. The optimal number of Gaussian mixture components K (= topics) was found to be 675 after this run, which we kept as the final number of clusters. Eventually, the covariance matrices Σ of the GaussMM were constrained to be Diagonal, as this resulted in the lowest BIC scores for the data set at hand.
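The two-step search over K and the covariance structure can be sketched as below. The scoring function is passed in as `bic_fn`; in a real run it would fit a GaussMM for the given setting and return its BIC (for instance, scikit-learn's `GaussianMixture` offers a `bic` method, although the text does not state which library was used). The `toy_bic` function is a made-up stand-in that only mimics a minimum at 675 with a diagonal covariance; its values are not results from the study.

```python
from itertools import product

COVARIANCE_TYPES = ("diagonal", "full", "spherical", "tied")

def best_by_bic(candidates_k, bic_fn):
    """Return the (K, covariance type) pair with the lowest BIC."""
    return min(product(candidates_k, COVARIANCE_TYPES),
               key=lambda setting: bic_fn(*setting))

def two_step_search(bic_fn, k_min=50, k_max=1000, coarse=50, fine=5):
    """Coarse grid over K first, then a finer grid around the best K*."""
    k_star, _ = best_by_bic(range(k_min, k_max + 1, coarse), bic_fn)
    fine_grid = range(max(1, k_star - coarse), k_star + coarse + 1, fine)
    return best_by_bic(fine_grid, bic_fn)

def toy_bic(k, covariance_type):
    """Hypothetical stand-in for fitting a model and computing its BIC."""
    penalty = {"diagonal": 0, "full": 500, "spherical": 200, "tied": 300}
    return (k - 675) ** 2 + penalty[covariance_type]

print(two_step_search(toy_bic))
```

The coarse pass evaluates 20 values of K with all four covariance options, and the fine pass evaluates 21 further values, which is far cheaper than scanning every K between 50 and 1000.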

3 Approaching the diffusion of innovations from related topics

The Oslo Manual [43] defines innovation as the implementation of a new or significantly improved product or process. Diffusion can be described as the process by which an innovation is adopted through certain channels over time among the members of a social system [44]. The diffusion curve is often drawn as a hump-shaped line as shown in Fig 2.
Fig 2

The diffusion of innovations.

Adopters were categorized as innovators, early adopters, early majority, late majority and laggards depending on their adoption time.

Diffusion consists of four main elements: (1) an innovation (2) that is communicated through certain channels (3) over time (4) among the members of a social system. Thereby, mass media outlets are effective tools to disseminate knowledge about innovations [44]. In our approach, the communication channel is the internet via a media outlet, and the social system consists of the users of the media outlet. Using topic modeling, we attempted to measure the diffusion of topics related to innovations by aggregating topic probabilities in a large technology-related news corpus over time. This way, conclusions about an innovation regarding adoption time and status could be drawn based on the aggregated probabilities for this topic in the corpus. We present three exemplary topics that were related to innovative activities. In particular, we used the evolution of their weights over time as an approximation of the diffusion curve of topic relevance. The topics related to innovative activities are visualized in Figs 3–5. Thereby, the tablet topic in Fig 3 represents a product innovation, the wikipedia topic in Fig 4 refers to a process innovation and the virtual reality topic in Fig 5 can be described as a technology innovation.
Fig 3

Topic tablet.

A: Most relevant words. B: Monthly observations of the topic importance over time, smoothed using a 12-month exponentially weighted moving average.

Fig 4

Topic wikipedia.

A: Most relevant words. B: Monthly observations of the topic importance over time, smoothed using a 12-month exponentially weighted moving average.

Fig 5

Topic virtual reality.

A: Most relevant words. B: Monthly observations of the topic importance over time, smoothed using a 12-month exponentially weighted moving average.

Each figure consists of two panels. The left panel exhibits a word cloud representing the most relevant words in the topic, excluding stopwords. To this end we used stopwords from the Python package stop-words (https://pypi.python.org/pypi/stop-words), added stopwords from the Natural Language Toolkit [45] and also included a list of common stopwords from https://github.com/6/stopwords-json. The right panel in each figure shows a line plot of the evolution of topic weights over time. Given a certain time interval, we quantified the probability that a topic appears in the text corpus. Given the text corpus $C_t$ for a certain time frame $t$ (we considered one month), the probability for a single topic, $p(T \mid C_t)$, was defined as the sum of the topic probabilities for all articles in the corpus in that time frame, divided by the overall number of articles in period $t$, $D_t$:

$$p(T \mid C_t) = \frac{1}{D_t} \sum_{d \in C_t} p(T \mid d)$$

The first of the exemplary topics, represented by the word cloud in Fig 3, appears to be about (Android) tablets, as androidtablet or zoll (“inch”) are among the top words for this topic. This topic received major attention and started its take-off around 2006/2007 before peaking in 2010–2012, followed by a steep decline indicating a successful diffusion. Since then, the importance of the topic in the corpus has been decreasing and, given the topic timeline, one could conclude that the adoption of the tablet technology is almost completed. The evolution of the topic’s importance over time resembles a prototypical diffusion curve, spanning about 10 years. Fig 4 details the wikipedia topic, which has been labeled according to top words such as wiki, enzyklopaedie (“encyclopedia”) or wikimedia. The peak appears around 2006. Furthermore, the increase in the topic weights from 2002 to 2006 appears steeper than in the tablet topic, implying a faster rate of diffusion.
Similarly to the tablet topic, diffusion can be considered almost complete, which aligns with intuition. The rather fast diffusion rate observed for the wikipedia topic illustrates one of the advantages of the topic-model-based approach to determining the diffusion state of innovations: it allows for a quite granular and timely approximation of this indicator. In this application, a steeper curve might correspond to a more disruptive innovation compared to innovations with flatter curves, as adoption is quicker. Starting with this observation, one might build a measure for the disruptiveness of an innovation based on the rate of diffusion measured through the importance of a topic over time. The third topic, shown in Fig 5, is labeled the virtual reality topic. The relevant words in this topic include oculus, rift and vive. Considering the importance of the topic over time, it appears that the virtual reality topic had not yet peaked. However, it appears to be close to its peak, considering the overall shape of the aggregated weights over time, which resembles a Gaussian probability curve close to its maximum. Therefore, based on visual inspection, one might expect that after reaching the peak in 2018 or 2019, the relevance of this topic in the news corpus would decrease. To go one step beyond visual inspection and intuitive reasoning regarding the information content of our topic-based innovation measures, we compared their evolution over time with GTIs, which have been used in the literature as a leading indicator for a variety of economic quantities [46], including the diffusion of product innovations [47, 48]. In particular, [47] found that GTI data indicated a change in the interest in a product or technology well before it became visible in purchasing behavior. The rationale for using these indicators was that potential users will, as a first step, search for information on new products, possibly making use of a search engine.
The GTI provides a measure of the relative popularity of a specific search term in Google’s search engine over time. Unfortunately, it is not known in detail how these measures are constructed. The period of highest interest in the time span under consideration is set to 100, and the remaining observations are adjusted according to the relative frequency compared to the reference period. Thus, GTIs provide information about the relative importance of a topic over time, but not compared with other topics. For our application, we consider monthly observations for the three terms wikipedia, virtual reality, and tablet, which have been obtained for the period between 01/2004 and 12/2016 in Germany. The approach can be extended to further topics that are well described by one word or a short expression, as used in Google’s search engine. Some more examples are provided in the supporting information section in S1–S4 Figs. Fig 6 provides a graphical comparison of the GTI with the corresponding monthly relevance measures from our PVTM implementation. The PVTM-based measures have been smoothed to reduce noise using an exponentially weighted moving average over the last 12 months. In each panel, the left y-axis corresponds to the PVTM importance measure, while the scale of the GTI is given on the right y-axis.
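The monthly topic importance measure, its exponentially weighted smoothing, and a peak-equals-100 rescaling in the style of the GTI can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the study: the function names, the toy articles, and the smoothing weight alpha = 2 / (span + 1) are hypothetical choices, and the rescaling merely imitates the one property of the GTI described above.

```python
from collections import defaultdict


def monthly_topic_importance(articles, topic):
    """Average probability of `topic` over all articles published in each month.

    `articles` is an iterable of (month, topic_probabilities) pairs, where
    `month` is a sortable key such as "2010-03" and `topic_probabilities`
    maps topic labels to the article's topic probability.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for month, probs in articles:
        totals[month] += probs.get(topic, 0.0)
        counts[month] += 1
    # Sum of topic probabilities divided by the number of articles per month.
    return {m: totals[m] / counts[m] for m in sorted(counts)}


def ewma(values, span=12):
    """Recursive exponentially weighted moving average.

    Assumption: alpha = 2 / (span + 1); the source does not state the weight.
    """
    alpha = 2.0 / (span + 1)
    smoothed = []
    for v in values:
        if smoothed:
            smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
        else:
            smoothed.append(v)
    return smoothed


def rescale_to_100(values):
    """Scale a series so that its maximum equals 100 (GTI-style scaling)."""
    peak = max(values)
    return [100.0 * v / peak for v in values]


# Toy data (hypothetical): three months, two articles in the first month.
articles = [
    ("2010-01", {"tablet": 0.75, "wikipedia": 0.25}),
    ("2010-01", {"tablet": 0.25, "wikipedia": 0.75}),
    ("2010-02", {"tablet": 1.0}),
    ("2010-03", {"tablet": 0.25, "wikipedia": 0.75}),
]
importance = monthly_topic_importance(articles, "tablet")
smoothed = ewma(list(importance.values()), span=3)
index = rescale_to_100(smoothed)
```

A short span is used here only to keep the toy series readable; with monthly data, `span=12` would correspond to the 12-month window mentioned in the text.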
Fig 6

Paragraph vector topic model diffusion indicators obtained from topic modeling compared to Google Trends indices.

A: wikipedia topic. B: tablet topic. C: virtual reality topic. The PVTM measures are smoothed using the 12-month exponentially weighted moving average of the topic importance over time. We found statistical evidence that the unsmoothed PVTM topic importance measures Granger cause GTI for each of the presented topics.

The visual inspection of the plots in Fig 6 highlights some leading behaviour of the PVTM indicators as compared to GTI, which appears most pronounced for the tablet topic. To substantiate these findings by means of a statistical analysis, we applied Granger causality tests [49, pp. 48ff]. Intuitively, this test analyzes whether past information of one series, e.g., from PVTM, helps to improve forecasts of another series, e.g., from GTI, beyond the information already contained in the past development of this series. A positive result of the test indicates that the first series provides additional information, which is leading the other series. More formally, the Granger causality test is based on a bivariate vector autoregressive (VAR) model, in which the current value of each series is modeled by lagged values of both series under consideration. The null hypothesis of each test is that the lagged values of the other series jointly have no impact. The relevant test statistic asymptotically follows a chi-square distribution if the series under consideration are stationary or cointegrated. Given that the visual inspection of the smoothed time series already indicated non-stationarity, we ran augmented Dickey-Fuller tests with automatic lag length selection based on the BIC, assuming a maximum lag length of 13. As expected, the null hypothesis of non-stationarity could not be rejected for any of the variables under consideration at the 5 percent level. Consequently, Johansen’s cointegration test was applied to identify potential long-run cointegration relationships between the PVTM and corresponding GTI series.
These tests indicate the existence of cointegration relationships. However, the original series are jagged and exhibit structural breaks, which might bias the results of Johansen’s test. Therefore, we decided to apply the Granger causality test to first differences of the series as a more robust procedure, even though we might lose some level information if the lag length is too small. For each pair of variables, we selected the lag length of a VAR model in first differences based on the BIC with a maximum lag length of 24 months. The results, including the lag length used, are shown in Table 1.
Table 1

VAR Granger causality/block exogeneity Wald test.

| Topic | Null hypothesis | Chi-sq | df | Prob. |
|---|---|---|---|---|
| Wikipedia | Δ PVTM does not Granger cause Δ Google Trends | 5.565 | 2 | 0.062 |
| Wikipedia | Δ Google Trends does not Granger cause Δ PVTM | 3.088 | 2 | 0.214 |
| Tablet | Δ PVTM does not Granger cause Δ Google Trends | 40.127 | 13 | 0.000 |
| Tablet | Δ Google Trends does not Granger cause Δ PVTM | 9.271 | 13 | 0.752 |
| Virtual Reality | Δ PVTM does not Granger cause Δ Google Trends | 21.551 | 5 | 0.001 |
| Virtual Reality | Δ Google Trends does not Granger cause Δ PVTM | 15.309 | 5 | 0.001 |
The null hypothesis that PVTM does not Granger cause GTI may be rejected at the 1% significance level for virtual reality and tablet and at the 10% significance level for wikipedia. This suggests that the topic indicators derived from PVTM contain leading information for the corresponding GTIs. One possible cause for this leading behaviour might be that news outlets and investigative journalists are likely to catch novel ideas faster than typical users of Google search, such as consumers. Once reporting on specific topics in the news media increases, public interest also rises, possibly with some lag until people perceive and digest the news. Then, they use search engines like Google to find further information about these technologies. Summarizing these findings, we suggest that importance measures derived from topic models might exhibit some leads compared to GTIs and, consequently, might be more appropriate as early and leading indicators for the diffusion of innovations. As a robustness check, we also ran the Granger causality tests in a cointegration setting as proposed by [50], as well as a test adding one extra lag to the VAR model to account for uncertainty regarding the level of integration and the existence of cointegration [49, p. 49]. These tests resulted in qualitatively similar findings to the Granger causality tests applied to first differences reported here. As a further robustness check, we compared the topic importance of the wikipedia topic to the number of new wikipedians per month in Germany. The results can be found in the supporting information section in S5 Fig and generally support the conclusion that PVTM generates sensible diffusion indicators.
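The logic of a Granger causality test on first differences can be illustrated with a minimal, dependency-free Python sketch. It is not the procedure used to produce Table 1: the lag length is fixed by hand instead of being selected by the BIC, only the Wald statistic is computed (to be compared with a chi-square critical value with `p` degrees of freedom, e.g. 5.99 at the 5% level for `p = 2`), and the series `news` and `search` are simulated toy data. All function and variable names are hypothetical.

```python
import random


def ols_rss(X, y):
    """Residual sum of squares of an OLS fit, solved via the normal equations."""
    k = len(X[0])
    # Augmented system [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * t for row, t in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((t - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, t in zip(X, y))


def granger_wald(caused, causing, p):
    """Wald statistic for H0: lags 1..p of `causing` do not help predict `caused`.

    Both series are assumed to be stationary, e.g. first differences.
    Returns (statistic, degrees_of_freedom).
    """
    n = len(caused)
    target = caused[p:]
    # Restricted model: constant and own lags only.
    own = [[1.0] + [caused[t - lag] for lag in range(1, p + 1)]
           for t in range(p, n)]
    # Unrestricted model: additionally the lags of the other series.
    full = [row + [causing[t - lag] for lag in range(1, p + 1)]
            for row, t in zip(own, range(p, n))]
    rss_r = ols_rss(own, target)
    rss_u = ols_rss(full, target)
    sigma2 = rss_u / (len(target) - len(full[0]))
    return (rss_r - rss_u) / sigma2, p


def diff(series):
    """First differences of a series."""
    return [b - a for a, b in zip(series, series[1:])]


# Toy levels (hypothetical): `search` follows changes in `news` with a one-period lag.
rng = random.Random(0)
n = 200
shock = [rng.gauss(0, 1) for _ in range(n)]
news, search = [0.0], [0.0]
for t in range(1, n):
    news.append(news[-1] + shock[t])
    search.append(search[-1] + 0.8 * shock[t - 1] + 0.3 * rng.gauss(0, 1))

d_news, d_search = diff(news), diff(search)
stat_news_to_search, df = granger_wald(d_search, d_news, p=2)
stat_search_to_news, _ = granger_wald(d_news, d_search, p=2)
```

In this simulated setting, the statistic for "Δ news does not Granger cause Δ search" is far larger than the one for the reverse direction, mirroring the asymmetric pattern reported for the tablet topic. The sketch relies on the identity that, for linear restrictions in OLS, the Wald statistic equals the difference between restricted and unrestricted residual sums of squares divided by the estimated error variance.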

4 Conclusion and outlook

There is an increasing interest in topic modeling, driven by the fast-growing amount of textual data sources. In NLP, neural embedding methods have been shown to outperform standard methods on many tasks. They are therefore viable candidates for information retrieval from big text corpora, for example for topic modeling. We proposed PVTM, which uses Paragraph Vectors to construct document vectors and Gaussian mixture clustering to cluster the resulting vectors into meaningful topics. The applicability of our approach has been demonstrated by the emergence of coherent topics from technology-related news articles. It became apparent that PVTM offers a useful alternative for discovering latent topics in text corpora. Empirical examples derived from the application of PVTM to (technology) news ticker data demonstrate the potential relevance for innovation economics. In particular, we were interested in measuring the diffusion of innovations over time. It appeared that the importance measures from PVTM Granger caused GTIs for the presented topics, thus offering a promising early indicator for innovation diffusion. However, further research is necessary to better understand PVTM’s potential for innovation economics. One open issue of the methodological approach is an inherent forward-looking bias arising from estimating the model only once, based on the whole corpus. While the weights entering the Granger causality tests are only based on past documents, a future extension of the approach should also allow for a recursive estimation of the model as in [11]. However, this is not part of this study. It remains a task for further research to derive methods for predicting diffusion and for assessing entities involved in innovative activities with regard to their stage of technology adoption. Of particular relevance is also whether an innovation is successful, as communication without adoption does not constitute an innovation.
Taking the ongoing digitization into account, early identification and measurement of innovations will become more important. Therefore, we plan to analyze to what extent the method is also applicable in a dynamic setting. From an economic point of view, it would be interesting to know which entities act as the main players in a topic. A simple possibility would be to count how often an entity has been mentioned. Going further, one could use entity-related sentiment analysis to determine whether entities have a positive or negative impact on the topic. By identifying the first mentions of companies, they could be classified into categories such as innovators, early adopters, early majority, late majority and laggards with respect to specific topics. Using live news feeds could offer the possibility to capture the diffusion of innovations with very little delay, while including more news sources might offer the potential to cover a larger share of ongoing innovative activities.

Topic kindle.

A: Most relevant words. B: Monthly observations of the topic importance over time, smoothed using a 12-month exponentially weighted moving average.

Topic nfc.

A: Most relevant words. B: Monthly observations of the topic importance over time, smoothed using a 12-month exponentially weighted moving average.

Topic tegra.

A: Most relevant words. B: Monthly observations of the topic importance over time, smoothed using a 12-month exponentially weighted moving average.

PVTM diffusion indicators obtained from topic modeling compared to Google Trend indices.

A: kindle topic. B: nfc topic. C: tegra topic. Monthly PVTM observations were smoothed using a 12-month exponentially weighted moving average.

Topic wikipedia vs new wikipedians in Germany.

Monthly observations of the topic importance over time, smoothed using a 12-month exponentially weighted moving average, compared against the number of new wikipedians in Germany.

25 Jul 2019

PONE-D-19-15216
Measuring the diffusion of innovations with Paragraph Vector topic models
PLOS ONE

Dear Mr. Lenz,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

We would appreciate receiving your revised manuscript by Sep 08 2019 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.

A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.

An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.
Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. We look forward to receiving your revised manuscript. Kind regards, Diego Raphael Amancio Academic Editor PLOS ONE Journal Requirements: When submitting your revision, we need you to address these additional requirements. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at http://www.journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and http://www.journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Partly Reviewer #2: Yes Reviewer #3: Yes Reviewer #4: Yes ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes Reviewer #4: Yes ********** 3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). 
The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: No Reviewer #2: Yes Reviewer #3: No Reviewer #4: No ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: No Reviewer #2: Yes Reviewer #3: Yes Reviewer #4: Yes ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: Review: PONE-D-19-15216, “Measuring the diffusion of innovations with Paragraph Vector topic models” What the paper does: This article proposes to use newly developed Paragraph Vector algorithms together with a Gaussian mixture model to estimate technology related news topics from a large news corpus and measuring their diffusion across time. The derived innovation topics are shown to Granger cause (predict) Google trends indices with matching search terms. General comments: I find the paper interesting, but with large room for improvement. The idea, to use news articles and their underlying topics to measure diffusion of innovation seems sensible and novel. The approach, using paragraph vectors coupled with a Gaussian mixture models is creative and well placed. 
Finally, most of the analysis seems well conducted and the presentation is reasonably clear. However, I have two main objections with the paper: 1) First, I find the result section rather disappointing. It seems to me that they only focus on three handpicked dimensions of this very high dimensional problem. I would like to see a much richer presentation of the results from their analysis in this respect. E.g., results for a wide range of innovation terms. Maybe using actual patent data registers might be useful here? 2) Second, as the article is written now, it is somewhat unclear what the methodological contribution is. Is it the how the authors use the paragraph embedding technology together with the Gaussian clustering algorithm, or is it something more fundamental. This should be clarified better. Specific comments: i) I did not really understand the last sentence of the abstract ii) The authors should cite and relate to the article by “Measuring Technological Innovation over the Long Run”, by Bryan Kelly, Dimitris Papanikolaou, Amit Seru, Matt Taddy (2018). How do your paper differ, how do you contribute etc. ? iii) Line17 to 19: These are typical LDA assumptions that need not be general for all other types of existing topic models. Should be clarified. iv) Line 30: LDA term is introduced without having used before and explained. v) Paragraph starting on Line 32: It is somewhat unclear if you use or develop a new approach here. This should be made much more clear. Also relates to my main comment. If you mostly use, but in a new way, existing methods, I think much of the math and explanations in Sections 1.1, and 1.3 could be removed. Would rather focus on the intuition and what various input and output matrices(vectors) vectors actually represent. This is somewhat unclear to me. vi) How do you handle the potential forward looking bias arising from estimating the textual model/clusters on the same data as you do the Granger causality tests? 
This should be addressed, and I do not think you comment on it or address it now, or? vii) Line 76: Doc2Vec is mentioned, but not explained. Related to the point above and the articles methodological contribution. viii) The skipgram model. Is it not easier to just say that this is a logistic regression with latent parameters (word embeddings) and explanatory variables (context vectors)? Solved using a simple two-layer neural network? ix) Do you apply the standard negative sampling scheme when solving the model? x) In Section 1.3, it is not exactly clear to me what the final output from all this is. Should perhaps give the reader some more intuition for this. xi) Below equation 11. Correct the text. Error in formatting. xii) Section2.1. I think you should explain better why this corpus is a good choice in the current setting (and potential extensions etc. ) xiii) Line 191. K is used before it is defined and explained. xiv) I find Section 3 confusing. What is the point of this section? You start to talk about the results and figures, but then comes section 4. Likewise, if the innovations should follow the shape you illustrate, how can you test this? Should you test this? xv) In my version of the paper I could not find a figure of the word clouds that are mentioned in the text. xvi) Line 311-318 feels somewhat speculative. xvii) I do not understand why you need to use Johansen Trace Tests for cointegration in this setting. Would it not be enough to just difference the data to get them “roughly” stationary, and then run a battery of predictive tests (for many innovation terms)? References: Bryan Kelly, Dimitris Papanikolaou, Amit Seru, Matt Taddy (2018), “Measuring Technological Innovation over the Long Run”, NBER. Reviewer #2: The article describes a novel approach to measuring innovation. This approach consists of a combination of some well-known methods. By considering this framework, tests were executed by using real data sets. 
Additionally, to analyze the results, the authors compared the proposed index of importance with Google trends index. The analysis seems to be well-founded, but some crucial issues have to be corrected, as follows: 1- You mentioned LDA in line 30, but did not define what LDA means. It would be better to write "Latent Dirichlet Allocation (LDA) ". Additionally, I think that you could cite a book that describes this method. 2- In Skip-Gram subsection, you used two different symbols to mean $v_c$. 3- In Subsection 1.2, you named a method as Word2Vec-SG. However, it is not clear the meaning of SG. Is it Skip-gram? Please, define it in the text. 4- There is a problem with a citation on the least line of Subsection 1.2. 5- In subsection 2.2, you used a variable $K$ that was not defined before. 6- When you use BIC, do you mean Bayesian Information Criterion? It is not clear in the text. 7- Google Trends is crucial to your analysis. SO, I believe that you shall write a sentence explaining the index in the introduction. Furthermore, on page 8, you added a footnote with the website address, but this footnote should be on the first page where you mention it for the first time. 8- You mentioned that there are wordclouds in Figures 3 to 5, but there are no such figures in the PDF document. 9- Figure 6 is not necessary because its information is already shown in Figures 3 to 5. 10- You described some well-known methods of your paper but did not explain about the Granger Causality Test (GCT). I think that it would be appropriate to describe GCT and the advantages of the method when employed as a forecast of a series. Furthermore, you shall better explain all the techniques that take part in the comparison between PVTM and GTI. 11- In lines 292 and 298, you wrote GT instead of GTI. 12- The results regarding Table 1 could be better explained. 13- The data underlying the results are available on a website written in German. So, it was not possible to check it. 
Reviewer #3: The paper delivers a good contribution to the literature on applying the "Paragraph Vector Topic Model"(PVTM) to a data set or corpus of technolgy news articles. The PVT model is taken from the machine learning literature and adapted to the data set. The presentation of the PVTM is concise and clear. To evaluate the information content of the extracted topics, a Granger causality test is performed. All data are found to be I(1), the Johansen test confirms cointegration. The author proceeds using the Toda-Yamamoto approach by referring to Lütkepohl's textbook. This is inappropriate. This approach was developed for the case where the degree of integration is not clearly esdtablished. For the case here, the "trick" of adding an extra lag and estimating the VAR in levels is inferior to estimating a error correction model and analysing short-term and long-term aspects of cointegration separately (see the textbook on time series econometrics by Wolters, Kichgaessner and Hassler in this issue) by checking weak exogeneity and F-Tests/ Wald tests. Another appropriate test could also be found in the frequency domain Granger causality test proposed by Breitung and Candelon or Breitung and Schreiber. I recommend to redraft the section using an appropriate test approach. Furthermore there are a few type-setting errors (e.g. when the BIC test is referenced). Reviewer #4: I have made my review in a pdf that is attached to this report. Please refer to this document for further details ********** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. 
Reviewer #1: No Reviewer #2: No Reviewer #3: Yes: Ulrich Fritsche Reviewer #4: No [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step. Submitted filename: Referee_Report.pdf Click here for additional data file. 8 Oct 2019 Please see the "Response to Reviewers.pdf" file. Submitted filename: Response to Reviewers.pdf Click here for additional data file. 19 Nov 2019 PONE-D-19-15216R1 Measuring the diffusion of innovations with Paragraph Vector topic models PLOS ONE Dear Mr. Lenz, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. ============================== Minor edits are still necessary, as pointed out by the reviewers. ============================== We would appreciate receiving your revised manuscript by Jan 03 2020 11:59PM. 
When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols Please include the following items when submitting your revised manuscript: A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. We look forward to receiving your revised manuscript. Kind regards, Diego Raphael Amancio Academic Editor PLOS ONE [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. 
If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation. Reviewer #1: (No Response) Reviewer #2: All comments have been addressed Reviewer #3: All comments have been addressed Reviewer #4: All comments have been addressed ********** 2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes Reviewer #4: Yes ********** 3. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes Reviewer #4: Yes ********** 4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes Reviewer #4: Yes ********** 5. 
Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes
Reviewer #4: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: I think the authors have adequately addressed my comments raised in the previous round. I only have two minor comments left:

1) There are still some typos in the text. I encourage the authors to fix these before publication. E.g., line 122.

2) The authors should check the reference list. Some of the articles listed as working papers are now published. E.g., reference [17]: Vegard H. Larsen, Leif A. Thorsrud, The value of news for economic developments, Journal of Econometrics, Volume 210, Issue 1, 2019, Pages 203-218.

Reviewer #2: All the comments have been addressed in the new version. However, there is no label on the y-axis of the new figures.

Reviewer #3: The authors reacted to the recommendations appropriately. Specifically, a robustness check of the Granger non-causality tests was integrated. The paper adds to the literature sufficiently. The method is very interesting and appropriate for the purpose of the study. Congratulations.

Reviewer #4: Dear authors, Thank you for your response. I am now satisfied with the current version of the revisions and the responses to my comments.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?).
If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No
Reviewer #2: No
Reviewer #3: Yes: Ulrich Fritsche, Universität Hamburg
Reviewer #4: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

27 Nov 2019

Response to Reviewer 1

I think the authors have adequately addressed my comments raised in the previous round. I only have two minor comments left.

We are again indebted to the referee for the careful reading of our review paper and his/her overall positive assessment. We will describe in detail below how we reacted to the specific comments made while preparing our revised version.

1. There are still some typos in the text. I encourage the authors to fix these before publication. E.g., line 122

We thank the referee for this comment.
We have again proofread the paper very carefully and are confident that no typos are left at this stage.

2. The authors should check the reference list. Some of the articles listed as working papers are now published. E.g., reference [17]: Vegard H. Larsen, Leif A. Thorsrud, The value of news for economic developments, Journal of Econometrics, Volume 210, Issue 1, 2019, Pages 203-218.

We thank the referee for pointing out that some articles have been published in the meantime. We checked all unpublished papers and corrected the citations where appropriate.

Response to Reviewer 2

All the comments have been addressed in the new version. However, there is no label on the y-axis of the new figures.

We are again indebted to the referee for the careful reading of our review paper and his/her overall positive assessment. We will describe in detail below how we reacted to the specific comments made while preparing our revised version.

1. There is no label on the y-axis of the new figures.

We agree with the reviewer that the y-axis should have labels and have therefore labeled the y-axis in the relevant figures with "PVTM Topic Importance".

Response to Reviewer 3

The authors reacted to the recommendations appropriately. Specifically, a robustness check of the Granger non-causality tests was integrated. The paper adds to the literature sufficiently. The method is very interesting and appropriate for the purpose of the study. Congratulations.

We are again indebted to the referee for the careful reading of our review paper and his/her overall positive assessment.

Response to Reviewer 4

Dear authors, Thank you for your response. I am now satisfied with the current version of the revisions and the responses to my comments.

We are again indebted to the referee for the careful reading of our review paper and his/her overall positive assessment.

Submitted filename: Response to Reviewers.pdf
5 Dec 2019

Measuring the diffusion of innovations with Paragraph Vector topic models

PONE-D-19-15216R2

Dear Dr. Lenz,

We are pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it complies with all outstanding technical requirements.

Within one week, you will receive an e-mail containing information on the amendments required prior to publication. When all required modifications have been addressed, you will receive a formal acceptance letter and your manuscript will proceed to our production department and be scheduled for publication.

Shortly after the formal acceptance letter is sent, an invoice for payment will follow. To ensure an efficient production and billing process, please log into Editorial Manager at https://www.editorialmanager.com/pone/, click the "Update My Information" link at the top of the page, and update your user information. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, you must inform our press team as soon as possible and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

With kind regards,
Diego Raphael Amancio
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

26 Dec 2019

PONE-D-19-15216R2

Measuring the diffusion of innovations with Paragraph Vector topic models

Dear Dr. Lenz:

I am pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.
If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

For any other questions or concerns, please email plosone@plos.org.

Thank you for submitting your work to PLOS ONE.

With kind regards,
PLOS ONE Editorial Office Staff
on behalf of
Dr. Diego Raphael Amancio
Academic Editor
PLOS ONE
  5 in total

1.  Forecasting new product diffusion using both patent citation and web search traffic.

Authors:  Won Sang Lee; Hyo Shin Choi; So Young Sohn
Journal:  PLoS One       Date:  2018-04-09       Impact factor: 3.240

2.  Classifying patents based on their semantic content.

Authors:  Antonin Bergeaud; Yoann Potiron; Juste Raimbault
Journal:  PLoS One       Date:  2017-04-26       Impact factor: 3.240

3.  High quality topic extraction from business news explains abnormal financial market volatility.

Authors:  Ryohei Hisano; Didier Sornette; Takayuki Mizuno; Takaaki Ohnishi; Tsutomu Watanabe
Journal:  PLoS One       Date:  2013-06-06       Impact factor: 3.240

4.  Phylomemetic patterns in science evolution--the rise and fall of scientific fields.

Authors:  David Chavalarias; Jean-Philippe Cointet
Journal:  PLoS One       Date:  2013-02-11       Impact factor: 3.240

5.  Topic detection using paragraph vectors to support active learning in systematic reviews.

Authors:  Kazuma Hashimoto; Georgios Kontonatsios; Makoto Miwa; Sophia Ananiadou
Journal:  J Biomed Inform       Date:  2016-06-10       Impact factor: 6.317
