Measuring the diffusion of innovations with paragraph vector topic models.

Abstract

Measuring the diffusion of innovations from textual data sources besides patent data has not been studied extensively. However, early and accurate indicators of innovation and the recognition of trends in innovation are mandatory to successfully promote economic growth through technological progress via evidence-based policy making. In this study, we propose Paragraph Vector Topic Model (PVTM) and apply it to technology-related news articles to analyze innovation-related topics over time and gain insights regarding their diffusion process. PVTM represents documents in a semantic space, which has been shown to capture latent variables of the underlying documents, e.g., the latent topics. Clusters of documents in the semantic space can then be interpreted and transformed into meaningful topics by means of Gaussian mixture modeling. In using PVTM, we identify innovation-related topics from 170,000 technology news articles published over a span of 20 years and gather insights about their diffusion state by measuring the topic importance in the corpus over time. Our results suggest that PVTM is a credible alternative to widely used topic models for the discovery of latent topics in (technology-related) news articles. An examination of three exemplary topics shows that innovation diffusion could be assessed using topic importance measures derived from PVTM. Thereby, we find that PVTM diffusion indicators for certain topics are Granger causal to Google Trend indices with matching search terms.

Year: 2020 PMID: 31967999 PMCID: PMC6976149 DOI: 10.1371/journal.pone.0226685

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

The rapidly growing amount of digital information provides novel data sources for economic analysis with regard to identifying and measuring innovation trends. To exploit this valuable information, there is a growing need for automated information retrieval from large text corpora [1]. Meanwhile, great progress has been made in machine learning and neural network theory leading to the emergence of new methods for extracting high-quality indicators from text. However, the opportunities created by the ongoing digitization have not yet been fully acknowledged nor have they been extensively studied in the economic literature. Following [2], who suggested that machine learning methods should be more widely known to and used by economists, more research that appropriately incorporates these new data sets and methods needs to be conducted in the field of innovation economics. For a few early exceptions, see the broader economic-use cases discussed below. In an innovation context, working with textual data is a well-established practice. However, most studies have used patent registers as their main source of text data and have relied on more traditional methods to transform text into something readable by a computer. A major part of the literature considers either patents, some specific aspects of innovation, or technology trends when investigating innovation. [3] is an early work that took into account not only the number of filed patents or frequency of citations but also the primary texts of the patents. The authors utilized text mining techniques to extract important keywords to build frequency vectors representing single patents. Subsequently, the authors performed a network analysis to visualize the relationships between patents and proposed measures for “importance,” “newness,” and “similarity” of the patents. Later studies such as [4], who considered patents involving a light emitting diode (LED) and wireless broadband fields, also conducted a network analysis. 
[5] proposed a semantic approach for patent classification that can be used in addition to the traditional USPC (United States Patent Classification) classes and also helps predict technology convergence. Recent work by [6] described an automated approach for patent landscaping, i.e., the process of finding patents related to a particular topic. The approach starts with a human-selected seed set of patents, expands it through citations and class codes, and then uses semi-supervised machine learning to prune the expanded data set. [7] applied clustering techniques based on keyword analysis and community detection to scientific publications in embryology to reconstruct the dynamics of the embryology research field and to understand early-warning signals preceding certain special events. [8] considered texts of approximately 170,000 awards from 2000 to 2011 that were included in the portfolio of the National Science Foundation (NSF). The authors were interested in measuring the interdisciplinarity of the NSF portfolio and introduced an improved topic model based on latent Dirichlet allocation (LDA, [9]; see [10] for an introduction to probabilistic topic modeling, including the LDA algorithm). Our approach differs from previous work mainly in two aspects: the data and the methodology. While most studies focus on patents, our study is the first to examine news articles as a way to measure the diffusion of innovations. Additionally, very few studies concerned with innovation research have used embedding methods to represent text. One study closely related to ours is [11]. The authors utilized patent filings dating back to 1840 to estimate their novelty and significance by quantifying the impact on future technological innovations, using a time-aware term-weighting scheme based on term-frequency inverse-document-frequency (tf-idf) that the authors specifically constructed for this purpose. 
With this approach, the authors can capture technological evolution over a long time span and show that the captured trends are strong predictors of productivity at various levels. Along the same lines, we tried to capture the diffusion of innovations by representing potential innovations as topics that can be found in news articles using topic models. Our study differs from [11] in the time scope (20 years for our data set vs 170 years) as well as the data source (news articles vs patents). Further, we relied upon text embedding methods to represent text, whereas [11] utilized a tf-idf scheme. We are also more interested in the diffusion of innovations than in technological progress as a whole, for which our data set is not extensive enough. Regarding the methodology, in natural language processing (NLP), topic modeling describes a set of methods to extract the latent topics from a collection of documents. Topic models have been applied to extract stock topics from financial news with an application to predict abnormal returns [12], to categorize 8-K filings and determine which topics are associated with abnormal returns [13], to measure the novelty of financial news [14] and to determine the key issues faced by firms [15]. Researchers have also used topic models to investigate the effect of increased transparency on monetary policy [16] and to model the impact of news on the business cycle using topics found in a Norwegian business newspaper [17]. Scientific discourse modeling is also within the scope of topic models. For example, [18] identified topics in the Journal of Economics and Statistics to study whether the scientific discussion of topics correlates with the actual development of key economic indicators. [19] examined how central bank communications affect real economic variables using topic modeling, and [20] employed LDA to model the topics in the Journal of Economic History between 1941 and 2016. 
For many years, LDA has been the algorithm of choice for modeling latent topics in text corpora. However, LDA only describes the statistical relationships between words in a text corpus based on co-occurrence probabilities, which might not be the best feature representation for text [21]. Furthermore, LDA reportedly has long computation times, especially with large text corpora [22]. The interpretation of topics generated by LDA is not always straightforward, and the necessary mental effort to give meaning to the extracted words can be tedious and demanding work [23]. We propose a topic model architecture based on neural embedding methods that can generate meaningful and coherent topics as an alternative to LDA. In particular, we use Paragraph Vector (also known as Doc2Vec, [24]) to compute vector space representations of text documents and Gaussian mixture models (GaussMMs) to cluster the resulting document vectors into meaningful semantic topics. We call this combination of embedding and clustering Paragraph Vector Topic Modeling (PVTM). Our main methodological contribution is the way in which we combined these algorithms to extract latent topics from text corpora. A methodologically related approach to PVTM is from [25], who identified relevant articles for systematic reviews using Paragraph Vectors to construct vector representations of documents and then cluster the document vectors using k-means. The resulting cluster centroids were interpreted as latent topics, and the topic probabilities for the documents were constructed using the distance between cluster centroids and document vectors. In comparison, while also relying on Paragraph Vectors to construct vector representations of documents, we employed soft clustering via GaussMMs to directly assess topic memberships at the document level. The applicability of our approach was demonstrated on a corpus of news articles from the German IT news publisher Heise Medien from the last 20 years. 
We compared our constructed topic measures with Google Trend indices (GTIs, https://trends.google.com/trends/). Google Trends measures the evolution of the frequency of search terms entered by users of the site over time. The results are set in relation to the total search volume and have been available in weekly resolution since the beginning of 2004 for the whole world or individual regions. Our results suggest that PVTM is well suited for topic modeling this type of text data. The topic probabilities generated by PVTM might serve as a proxy to measure the diffusion of certain types of innovations. In particular, for certain topics the topic importance measures generated by PVTM Granger-caused the GTIs for matching search terms. The remainder of this article is organized as follows. Section 1 details the methodological background of our analysis. The news ticker data set and the experimental design are reviewed in Section 2. In Section 3 we discuss our findings. Section 4 summarizes our results and describes avenues for future research.

1 Paragraph vector topic modeling

Section 1 introduces the PVTM methodology to generate topics and topic membership probabilities. PVTM relies upon the Paragraph Vector algorithm to generate document representations that are then clustered using GaussMMs. Our main methodological contribution is the way in which we combined these algorithms to extract latent topics from text corpora. Also, to the best of our knowledge, very few studies have used embedding methods in an innovation context, e.g., [6]. Paragraph Vector borrows its main ideas from Word2Vec, which is why it is useful to discuss the Word2Vec mechanics for encoding single words before detailing Paragraph Vector. In the last part of this section, we discuss the document clustering algorithm and provide intuition about the outputs yielded by PVTM.

1.1 Neural embeddings of words and documents

Neural network based embedding methods like Word2Vec [26, 27] play an increasingly vital role for encoding the semantic and syntactic meaning of words. Word2Vec builds low-dimensional dense vector space representations which encode the meaning of a word in a given context. As similar words tend to appear in similar contexts [28], encoding words based on their local context captures interesting properties in the resulting vectors, which have been shown to represent the way in which we use these words [29]. Intuitively, words that share many contexts are more similar than words that share fewer contexts. Vector calculations on the resulting word vectors yielded useful results, for example, $v_{king} - v_{man} + v_{woman} \approx v_{queen}$ [26], where $v_x$ is the vector representation for a given word $x$. These meaningful representations of words can be used as features in a variety of NLP tasks, including topic modeling. Word2Vec comes in two architectural variants: Skip-Gram (SG) and Continuous Bag of Words. We discuss the SG architecture in more detail, as this is relevant for our application of the Doc2Vec model.

Skip-Gram

During model training, the SG architecture iterates over a given text corpus in fixed-size sliding windows and generates (context, target) word pairs, where $q$ target words are considered on both sides of the context word, which is the word in the middle. Assume the following 5-word window ($q = 2$): Innovation is good for business. The context word $w_t$ would be good, and the target words $w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}$ are Innovation, is, for and business. This results in 4 (context, target) pairs: (good, Innovation), (good, is), (good, for) and (good, business). Given such pairs of context and target words resulting from the sliding window for a document, the vector representation is constructed as follows. First, each word in the vocabulary is represented as a one-hot encoded sparse vector $v$ of size $V \times 1$, where $V$ is the number of different words in the vocabulary. In a one-hot encoding scheme, a 0 indicates the absence of a word, whereas a 1 indicates its presence. Thus, $v$ is a sparse vector consisting of zeros and a single 1 in the row corresponding to the word. Second, for each context word $v_c$, a lower-dimensional (denser) vector representation $d_c$ is obtained using a projection matrix $W$ of dimension $D \times V$, where $D \ll V$:

$$d_c = W v_c \tag{1}$$

Eq (1) extracts the column of $W$ corresponding to the non-zero entry in $v_c$ and uses it as the vector representation $d_c$ of word $v_c$ in a $D$-dimensional vector space. The same procedure is repeated for each target word using a different projection (weight) matrix $W'$ of size $D \times V$ and the one-hot representation $v_t$ of the target word, resulting again in a $D$-dimensional vector $d_t$, which is the vector representation of the target word:

$$d_t = W' v_t$$

The probability of finding a target word $t$ close to the context word $c$ is modeled based on the similarity between the corresponding vectors $d_t$ and $d_c$ in the low-dimensional vector space. The dot product of both vectors is used as a measure of similarity, which is closely related to the cosine similarity but also takes into account the lengths of both vectors. The probability that a target word $t$ is observed close to context word $c$ is then defined using the softmax function [30]:

$$p(t \mid c) = \frac{\exp(d_t^\top d_c)}{\sum_{v=1}^{V} \exp(d_v^\top d_c)}$$

where the sum in the denominator runs over all $V$ words in the vocabulary, i.e., all potential target words. The projection matrices $W$ and $W'$ are obtained by maximizing the joint (log-)probability of observing the target words for all context words in the corpus:

$$\max_{W, W'} \sum_{r=1}^{R} \sum_{i=1}^{n_r} \sum_{-q \le j \le q,\; j \ne 0} \log p(w_{r,i+j} \mid w_{r,i})$$

Here, $C = \{D_1, \dots, D_R\}$ is the corpus of $R$ news articles, where each article consists of several words, i.e., $D_r = (w_{r,1}, \dots, w_{r,n_r})$. The number of words in article $D_r$ is denoted by $n_r$, and the size of the corpus is denoted by $R$. Thus, given an article $D_r$, $w_{r,i}$ is the current context word used to predict a target word $w_{r,i+j}$ that falls within a given range around the context word. The size of the context window is denoted by $q$. Gradient descent is used to iteratively update the weights until a convergence criterion is met. After training, $W$ acts as a lookup table for the vector representations of words. We used negative sampling and sampled 5 negative examples for each positive word. With negative sampling, the SG algorithm compares the observed word-context pairs with randomly generated unobserved pairs and minimizes the probability of the negative pairs while maximizing the probability of the actual word-context pairs. Negative sampling avoids calculating the softmax function for all possible words in the vocabulary and has been shown to speed up training and improve the resulting word vectors. [31] showed that the Word2Vec-SG architecture with negative sampling is equivalent to a weighted logistic principal component analysis.
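The pair generation and the softmax probability described above can be illustrated with a minimal sketch in plain Python. The function names `skipgram_pairs` and `softmax_probability` are hypothetical and chosen for illustration only; the sketch computes the full softmax and does not implement negative sampling or training.

```python
import math


def skipgram_pairs(tokens, q=2):
    """Return (context, target) pairs: every word serves once as the
    context word and is paired with up to q neighbours on each side."""
    pairs = []
    for i, context in enumerate(tokens):
        lo, hi = max(0, i - q), min(len(tokens), i + q + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((context, tokens[j]))
    return pairs


def softmax_probability(d_c, target_vectors, t):
    """p(t | c): softmax over the dot products between the context vector
    d_c and every candidate target vector (index t selects the target)."""
    scores = [sum(a * b for a, b in zip(d_c, d_v)) for d_v in target_vectors]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    return exps[t] / sum(exps)


tokens = ["innovation", "is", "good", "for", "business"]
pairs = skipgram_pairs(tokens, q=2)
good_pairs = [p for p in pairs if p[0] == "good"]

# Toy 2-dimensional vectors (made up for illustration, not trained).
d_c = [0.5, 1.0]
target_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = [softmax_probability(d_c, target_vectors, t) for t in range(3)]
```

For the window from the text, `good_pairs` contains the four pairs formed with the context word "good", and `probs` is a probability distribution over the three toy target vectors.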

1.2 Paragraph vector

The Paragraph Vector algorithm [24] expands the ideas of Word2Vec to longer pieces of text. Instead of word vectors, document vectors are learned during the training process. The resulting vector representations have been shown to capture latent semantic properties of the text fragments, such as the underlying semantic topic of a document. The Paragraph Vector algorithm also comes in two variants: Distributed Bag of Words (DBOW) and Distributed Memory (DM). In our experiments, we focused on the DBOW methodology, as it has been shown to produce slightly better results compared to DM. Although [24] reported that the DM architecture seems to perform better, subsequent research came to different conclusions [32]. DBOW builds upon the Word2Vec-SG architecture but replaces the center word with a unique document ID. Thus, instead of predicting the surrounding words from a single center word, the model predicts the words appearing in a document from the document ID.
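To make the difference from Skip-Gram concrete, the following minimal sketch builds DBOW-style training pairs, in which the document ID takes the place of the center word. The function name `dbow_pairs` and the toy corpus are hypothetical; the sketch only shows the shape of the training data, not the training itself.

```python
def dbow_pairs(corpus):
    """Return (document_id, target_word) pairs: the document ID is the
    input and every word of the document is a prediction target."""
    pairs = []
    for doc_id, tokens in corpus.items():
        for word in tokens:
            pairs.append((doc_id, word))
    return pairs


# Toy corpus, made up for illustration.
corpus = {
    "doc_0": ["innovation", "is", "good"],
    "doc_1": ["tablets", "sell"],
}
pairs = dbow_pairs(corpus)
```

In the gensim library, for instance, this variant is selected with `Doc2Vec(..., dm=0)`; this section does not state which implementation was used, so that mapping is given only as an orientation.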

1.3 Gaussian mixture clustering

Topic modeling uses clusters of important words to define topics, where different topics may share some words. In our approach, this was done by clustering the vector representations obtained from Paragraph Vectors and then determining the most relevant words per cluster. A GaussMM (see, e.g., [33]) is a parametric probability density function represented as a weighted sum of Gaussian component densities [34]. GaussMMs employ the expectation maximization algorithm [35] to fit a mixture of Gaussian models to a given data set and can be used to represent normally distributed subpopulations within an overall population. GaussMMs have been used to track multiple objects in video sequences [36], to extract features from speech data [37] and for speaker verification [38]. Compared to frequently used clustering techniques such as k-means [39] or mean-shift [40], GaussMMs offer the advantage of soft clustering the data. Soft clustering allows multiple cluster memberships per document, so each document can be represented as a probability distribution over the cluster memberships. The result of the process is a matrix with one row per document and one column per identified cluster, where each entry represents the probability of belonging to a certain cluster. Given that Paragraph Vectors capture latent topics in the corpus, it is reasonable to suggest that clustering the resulting document vectors can be seen as identifying latent topics. 
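Specifically, given a $D$-dimensional vector representation $d_r$ of a news article $D_r$ and a pre-set number of Gaussian components $M$ with mixture weights $w_i$, the GaussMM is defined as a weighted sum over the $M$ Gaussian components, where the mixture weights satisfy the constraint $\sum_{i=1}^{M} w_i = 1$:

$$p(d_r \mid \lambda) = \sum_{i=1}^{M} w_i \, g(d_r \mid \mu_i, \Sigma_i)$$

Each mixture component $g(d_r \mid \mu_i, \Sigma_i)$, $i = 1, \dots, M$, is a $D$-variate Gaussian function of the form

$$g(d_r \mid \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (d_r - \mu_i)^\top \Sigma_i^{-1} (d_r - \mu_i) \right\}$$

with $\mu_i$ and $\Sigma_i$ representing the mean vector and covariance matrix, respectively, and $\lambda$ collecting all parameters of all mixture components, i.e.,

$$\lambda = \{ w_i, \mu_i, \Sigma_i \}, \quad i = 1, \dots, M.$$

The optimal parameter configuration $\lambda$ is estimated by iteratively updating the model components to best fit the training data using the EM algorithm.

$$p(W \mid \lambda) = \prod_{r=1}^{R} p(d_r \mid \lambda)$$

is the GaussMM likelihood given the training data $W = (d_1, \dots, d_R)$. Starting with an initial configuration $\lambda$, a new configuration $\bar{\lambda}$ is computed such that $p(W \mid \bar{\lambda}) \ge p(W \mid \lambda)$ for the $R$ training vectors collected in $W$. The initial configuration was computed using k-means, and the mixture components were updated according to

$$\bar{w}_i = \frac{1}{R} \sum_{r=1}^{R} \Pr(i \mid d_r, \lambda) \tag{8}$$

$$\bar{\mu}_i = \frac{\sum_{r=1}^{R} \Pr(i \mid d_r, \lambda) \, d_r}{\sum_{r=1}^{R} \Pr(i \mid d_r, \lambda)} \tag{9}$$

$$\bar{\sigma}_i^2 = \frac{\sum_{r=1}^{R} \Pr(i \mid d_r, \lambda) \, d_r^2}{\sum_{r=1}^{R} \Pr(i \mid d_r, \lambda)} - \bar{\mu}_i^2 \tag{10}$$

where (8) is the update for weight $w_i$, the means are updated according to (9), and (10) details the variance re-estimation, in this case for a diagonal covariance (the squares in (10) are taken element-wise). The a posteriori probability for component $i$ is given by

$$\Pr(i \mid d_r, \lambda) = \frac{w_i \, g(d_r \mid \mu_i, \Sigma_i)}{\sum_{k=1}^{M} w_k \, g(d_r \mid \mu_k, \Sigma_k)}$$

The downside of GaussMMs is that the number of mixture components $M$ needs to be specified beforehand, and the algorithm always uses all $M$ components. This gives rise to the need for external validation methods. One way of evaluating the quality of a given GaussMM clustering is to use theoretical criteria like the Bayesian information criterion (BIC, [41]), which is the approach we took. The procedure described in Sections 1.1 to 1.3 is referred to as PVTM. 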
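The EM updates for a diagonal covariance can be sketched in plain Python for the one-dimensional case, where each component has a single mean and variance. This is an illustrative sketch, not the implementation used in the study: the data are made up, the function names are hypothetical, and the starting values are picked by hand rather than by k-means.

```python
import math


def gaussian(x, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    """Log of the mixture likelihood of the data."""
    return sum(
        math.log(sum(w * gaussian(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data
    )


def em_step(data, weights, means, variances, min_var=1e-6):
    # E-step: a posteriori probability of every component for every point.
    resp = []
    for x in data:
        joint = [w * gaussian(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: re-estimate weights, means and variances.
    new_w, new_m, new_v = [], [], []
    for i in range(len(weights)):
        r_i = sum(r[i] for r in resp)
        mean = sum(r[i] * x for r, x in zip(resp, data)) / r_i
        var = sum(r[i] * x * x for r, x in zip(resp, data)) / r_i - mean ** 2
        new_w.append(r_i / len(data))
        new_m.append(mean)
        new_v.append(max(var, min_var))  # guard against a collapsing variance
    return new_w, new_m, new_v, resp


# Toy data with two well-separated groups (made up for illustration).
data = [0.0, 0.2, -0.1, 0.1, 5.0, 5.2, 4.9, 5.1]
weights, means, variances = [0.5, 0.5], [-1.0, 6.0], [1.0, 1.0]
history = [log_likelihood(data, weights, means, variances)]
for _ in range(10):
    weights, means, variances, resp = em_step(data, weights, means, variances)
    history.append(log_likelihood(data, weights, means, variances))
```

Each row of `resp` is the soft cluster membership of one data point, and `history` records the log-likelihood after every iteration, which should not decrease.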
Similar to LDA, the most widely used topic model, PVTM yields two results: 1) a list of topics, wherein each topic is associated with words that are especially relevant in the context of the topic, and 2) a document-topic matrix, in which each document is assigned a probability of belonging to each of the topics determined in 1). Topics were found by clustering document embeddings obtained from Doc2Vec. The document-topic matrix is created by applying the GaussMM with optimized weights to the vector representations of the documents. The result is a probability distribution across all topics (clusters). The most important words per topic were determined by the proximity of the trained word vectors to the topic vector. While learning the document embeddings, word vectors were also trained and embedded in the same space. The cosine similarity could then be used to determine the words most similar to the cluster centers (= topic vectors). These words then formed the word list for the corresponding topic. The topic vectors corresponded to the GaussMM cluster centers. A disadvantage of the method is that a large corpus is necessary to obtain reasonable word vectors (and also document vectors). The intuition behind PVTM is that Doc2Vec embeddings group similar documents into similar regions of the embedding space. Using a clustering algorithm on the high-dimensional document representations yields document clusters, i.e., regions with a high density of documents sharing similar latent elements. PVTM makes these latent elements accessible in the form of words located in close proximity to the topic.
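The selection of topic words by cosine similarity to a topic vector can be sketched as follows. The two-dimensional word vectors and the topic vector are made up for illustration, and the function names are hypothetical; in PVTM the topic vector would be a GaussMM cluster center and the word vectors would come from the trained embedding.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_words(topic_vector, word_vectors, k=2):
    """Return the k words whose vectors are most similar to the topic vector."""
    ranked = sorted(
        word_vectors,
        key=lambda word: cosine(topic_vector, word_vectors[word]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors, made up for illustration (not trained embeddings).
word_vectors = {
    "tablet": [0.9, 0.1],
    "ipad": [0.8, 0.2],
    "wiki": [0.1, 0.9],
    "encyclopedia": [0.2, 0.8],
}
topic_vector = [1.0, 0.1]  # stands in for one cluster center
topic_words = top_words(topic_vector, word_vectors, k=2)
```

With these toy values, the two words closest to the topic vector are "tablet" and "ipad".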

2 Discovering innovation-related topics from news articles

We applied PVTM to news articles in an effort to discover innovation-related topics and measure their diffusion by means of topic probabilities over time.

2.1 Technology-related news corpus

The data set was formed by news articles published by the German IT news ticker heise online (https://www.heise.de/newsticker/) in their news-ticker archive from 1997 to 2016. The total number of news articles was 174,532, resulting in an average of 8,727 articles per year. However, the number of articles per year before the 2000s was considerably lower compared to subsequent periods. The average news article consisted of 278 words. Fig 1 details the number of documents per year and the number of words per document.

Fig 1

Text corpus descriptive statistics.

A: Number of documents per year. B: Distribution of the number of words per document.

We chose this corpus for the following reasons. First, the corpus has a clear technology reference, which makes the identification of technology-related topics easier. Second, the corpus is available over a long period of time, 20 years, making it possible to measure long-term processes such as the diffusion of innovations in the first place. Third, news is available in a timely manner in comparison to, for example, patent data, so that adjustments or trends can be identified in the short term, which is an important criterion for Science, Technology and Innovation (STI) policy. One of the potential disadvantages is that the corpus is in German, which makes the results less accessible to non-German-speaking readers. This disadvantage is partially mitigated by the fact that many of the frequently used terms in current technologies stem from English and can therefore be understood by non-German-speaking readers. Apart from that, our question was specifically aimed at measuring the diffusion of innovations via news in one country. Germany is an example of an industrialized country. Future studies could transfer these ideas to other countries.

2.2 Data preprocessing & parameter settings

We removed all non-alphanumeric characters and lowercased the resulting words. Next, we applied popularity-based pre-filtering, which is a commonly applied technique in recommendation systems [42]. This can be seen as removing corpus-specific stop words from the vocabulary by setting an upper and a lower threshold on the number of documents in which a word is allowed to occur before it is considered too frequent or too infrequent. All words that occurred in more than 65% of documents or in fewer than 0.05% of documents were filtered, as these words appeared to be too common or too rare to be useful for our application of PVTM given the size of the corpus. The thresholds were determined empirically, i.e., they were adjusted until all corpus-specific stop words were removed. We found that the results were not very sensitive to changes in the upper threshold. The lower threshold, however, was a somewhat sensitive parameter for PVTM: with more liberal lower thresholds, it was often difficult to determine meaningful topics because of the way the topic words were obtained. The most similar words to a topic vector were used as topic words. When the lower threshold was very low, very rare words, which might even appear in a single document only, were mapped close to the vector of the document in which they appeared. Thus, with no restriction on the minimum appearance of a word, words that appeared very few times and were very specific happened to be close to the topic vector and were chosen as topic words. The algorithm to compute vector representations from news articles was run for 10 epochs, where each epoch consisted of going over all articles once. Following the default settings, we chose the dimensionality of the document vectors to be 100. We used the BIC to find the optimal number of Gaussian mixture components K (denoted M in Section 1.3) and the best approach to construct the covariance matrices Σ. 
We considered four methods to estimate the covariance matrix: Full used the full covariance matrix for each individual component, i.e., each cluster can take any shape, while Diagonal only used the diagonal of the covariance matrix per component, resulting in cluster shapes that were oriented along the coordinate axes. Tied used one general covariance matrix for all mixture components; therefore, all clusters had an identical shape. Spherical made use of a single variance per component instead of a covariance matrix, resulting in spherical cluster forms in higher dimensions. Combinations of K and Σ for K between 50 and 1000 and Σ ∈ {Diagonal, Full, Spherical, Tied} were tested using a two-step procedure to find the optimal model parameters. We first iterated over the parameter space of K in steps of 50, starting from 50, i.e., 50, 100, …, 1000. At every step during the parameter optimization procedure, all four options to construct the covariance matrices were evaluated. The best result K* was then used to construct a smaller search space, K* − 50 ≤ K ≤ K* + 50, which was searched in steps of 5. The optimal number of Gaussian mixture components K (= topics) was found to be 675 after this run, which we kept as the final number of clusters. Eventually, the covariance matrices Σ of the GaussMM were constrained to be Diagonal, as this resulted in the lowest BIC scores for the data set at hand.
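The two-step search over K and the covariance type can be sketched as follows. The function `two_step_search` and the scoring function `toy_bic` are hypothetical: `toy_bic` is a made-up stand-in that is lowest at K = 675 with a diagonal covariance, whereas in practice the score would be the BIC of a GaussMM fitted to the document vectors.

```python
def two_step_search(score, coarse=range(50, 1001, 50), radius=50, fine_step=5,
                    cov_types=("diag", "full", "spherical", "tied")):
    """Return the (K, covariance type) pair with the lowest score.

    Step 1 scans K on a coarse grid; step 2 rescans the neighbourhood
    of the best coarse K on a finer grid. Every K is combined with
    every covariance type.
    """
    def best(ks):
        candidates = ((k, c) for k in ks for c in cov_types)
        return min(candidates, key=lambda kc: score(*kc))

    k_star, _ = best(coarse)
    lo = max(1, k_star - radius)
    return best(range(lo, k_star + radius + 1, fine_step))


def toy_bic(k, cov_type):
    """Made-up score standing in for the BIC (lower is better)."""
    penalty = {"diag": 0.0, "full": 5.0, "spherical": 2.0, "tied": 3.0}
    return abs(k - 675) + penalty[cov_type]


best_k, best_cov = two_step_search(toy_bic)
```

With scikit-learn, one possible scoring function would fit `GaussianMixture(n_components=k, covariance_type=cov_type)` to the document vectors and return its `bic(X)` value; this section does not name the software that was used, so this is an assumption.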

3 Approaching the diffusion of innovations from related topics

The Oslo Manual [43] defines innovation as the implementation of a new or significantly improved product or process. Diffusion can be described as the process by which an innovation is adopted through certain channels over time among the members of a social system [44]. The diffusion curve is often drawn as a hump-shaped line, as shown in Fig 2.

Fig 2

The diffusion of innovations.

Adopters were categorized as innovators, early adopters, early majority, late majority and laggards depending on their adoption time.

Diffusion consists of four main elements: (1) an innovation (2) that is communicated through certain channels (3) over time (4) among the members of a social system. Mass media outlets are effective tools to disseminate knowledge about innovations [44]. In our approach, the communication channel is the internet via a media outlet, and the social system consists of the users of the media outlet. Using topic modeling, we attempted to measure the diffusion of topics related to innovations by aggregating topic probabilities in a large technology-related news corpus over time. In this way, conclusions about the adoption time and status of an innovation could be drawn based on the aggregated probabilities for the corresponding topic in the corpus. We present three exemplary topics that are related to innovative activities. In particular, we used the evolution of their weights over time as an approximation of the diffusion curve of topic relevance. The topics related to innovative activities are visualized in Figs 3–5. The tablet topic in Fig 3 represents a product innovation, the wikipedia topic in Fig 4 refers to a process innovation, and the virtual reality topic in Fig 5 can be described as a technology innovation.
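Aggregating topic probabilities over time can be sketched as follows. The sketch assumes that aggregation means averaging the probability of one topic over all documents published in the same year; the exact aggregation is not specified in this section, and the function name and toy data are hypothetical.

```python
from collections import defaultdict


def topic_importance_by_year(years, doc_topic, topic):
    """Mean probability of `topic` over all documents published in each year.

    years     -- publication year of every document
    doc_topic -- document-topic matrix, one row of probabilities per document
    topic     -- column index of the topic of interest
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for year, probabilities in zip(years, doc_topic):
        sums[year] += probabilities[topic]
        counts[year] += 1
    return {year: sums[year] / counts[year] for year in sorted(sums)}


# Toy data, made up for illustration: five documents, two topics.
years = [2009, 2009, 2010, 2010, 2010]
doc_topic = [
    [0.9, 0.1],
    [0.7, 0.3],
    [0.2, 0.8],
    [0.4, 0.6],
    [0.0, 1.0],
]
importance = topic_importance_by_year(years, doc_topic, topic=1)
```

In this toy example, the importance of the second topic rises from about 0.2 in 2009 to about 0.8 in 2010; plotted over many years, such a series approximates the diffusion curve of the topic.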