Abdullah Gök, Alec Waterworth, Philip Shapira.
Abstract
As enterprises expand and post increasing information about their business activities on their websites, website data promises to be a valuable source for investigating innovation. This article examines the practicalities and effectiveness of web mining as a research method for innovation studies. We use web mining to explore the R&D activities of 296 UK-based green goods small and mid-size enterprises. We find that website data offers additional insights when compared with other traditional unobtrusive research methods, such as patent and publication analysis. We examine the strengths and limitations of enterprise innovation web mining in terms of a wide range of data quality dimensions, including accuracy, completeness, currency, quantity, flexibility and accessibility. We observe that far more companies in our sample report undertaking R&D activities on their websites than would be suggested by looking only at conventional data sources. While traditional methods offer information about the early phases of R&D and invention through publications and patents, web mining offers insights that are more downstream in the innovation process. Handling website data is not as easy as handling alternative data sources, and care needs to be taken in executing search strategies. Website information is also self-reported and companies may vary in their motivations for posting (or not posting) information about their activities on websites. Nonetheless, we find that web mining is a significant and useful complement to current methods, as well as offering novel insights not easily obtained from other unobtrusive sources.
Keywords: Innovation; R&D; Web mining; Web scraping
Year: 2014 PMID: 26696691 PMCID: PMC4677352 DOI: 10.1007/s11192-014-1434-0
Source DB: PubMed Journal: Scientometrics ISSN: 0138-9130 Impact factor: 3.238
Table 1 Firms and data included in the web content analysis
| Year | Firms with websites | Webpages (thousands) | Phrases (millions) |
|---|---|---|---|
| 2004 | 125 | 14.9 | 1.9 |
| 2005 | 133 | 11.7 | 2.0 |
| 2006 | 131 | 15.8 | 2.0 |
| 2007 | 173 | 10.6 | 1.3 |
| 2008 | 173 | 12.8 | 1.2 |
| 2009 | 161 | 10.8 | 1.3 |
| 2010 | 163 | 13.0 | 1.6 |
| 2011 | 199 | 15.8 | 2.6 |
| 2012 | 237 | 51.7 | 10.3 |
Source: Website analysis of sample of 296 UK-based green goods small and medium-sized enterprises (see text for details)
Fig. 1 Web content analysis process
Table 2 R&D activity variables and keywords
| R&D variable | Keywords | Difference from previous keyword set |
|---|---|---|
| rndweb1 | Research* | |
| rndweb2 | Research* AND development* | rndweb1 + (development*) |
| rndweb3 | Research* AND development* AND R&D | rndweb2 + (R&D) |
| rndweb4 | Research* AND development* AND R&D AND lab* AND scientist* | rndweb3 + (lab, laboratory, scientist) |
| rndweb5 | Research* AND (development NEARBY research) AND R&D AND lab* AND scientist* | rndweb4 − (development [NOT NEARBY] research) |
| rndweb6 | (Research and development) AND R&D AND lab* AND scientist* AND research AND researcher AND scientist* AND (product development*) AND (technology development*) AND (development phase) AND (technical development*) AND (development program*) AND (development process*) AND (development project*) AND (development cent*) AND (development facilit*) AND (technological development*) AND (development efforts) AND (development cycle) AND (development research) AND (research & development) AND (development activity) | rndweb5 − (development [NEARBY] research) + (a set of development variants) |
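The keyword sets above can be read as wildcard searches whose hits are counted and then normalized by the number of noun phrases on a site. The following is a minimal sketch of that idea, not the authors' tooling: it assumes the terms in a set are combined as a union (which is how the "Difference from previous keyword set" column reads), it does not implement the NEARBY operator, and the function names, sample text and noun-phrase count are all hypothetical.

```python
import re


def wildcard_to_regex(term):
    """Compile a keyword such as 'research*' into a case-insensitive regex.

    '*' stands for any run of word characters; the lookarounds keep a
    match from starting or ending in the middle of a longer word.
    """
    body = r"\w*".join(re.escape(part) for part in term.split("*"))
    return re.compile(r"(?<!\w)" + body + r"(?!\w)", re.IGNORECASE)


def count_hits(text, terms):
    """Total number of matches for any term in the set (union count)."""
    return sum(len(wildcard_to_regex(t).findall(text)) for t in terms)


def normalized_score(text, terms, n_noun_phrases):
    """Keyword instances divided by the noun-phrase count.

    n_noun_phrases is supplied by the caller; extracting noun phrases
    needs an NLP tool that this sketch does not include.
    """
    if n_noun_phrases <= 0:
        return 0.0
    return count_hits(text, terms) / n_noun_phrases


# Hypothetical page text and noun-phrase count, for illustration only.
page = ("Our research and development team runs R&D projects. "
        "Researchers in the lab test new products.")
terms = ["research*", "R&D", "lab*", "scientist*"]

print(count_hits(page, terms))                          # 4
print(normalized_score(page, terms, n_noun_phrases=8))  # 0.5
```

A pattern such as `lab*` also matches words like "label" or "labour", which is one concrete reason the search strategy needs careful screening before the counts are interpreted.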
Fig. 2 R&D website variables: comparison of mean values by different transformations and normalizations. Source: Analysis of website variables for sample of UK green goods small and medium enterprises. Mean values reported. Covers 2004–2012, see Table 1 for N each year. See Table 2 for keyword definitions of R&D variable labels
Table 3 Coverage of website-based and other R&D variables
| Variable | Explanation | Number of firms (c) | Coverage of firms (%) | Number of observations (d) | Coverage of observations (%) |
|---|---|---|---|---|---|
| Publications | Number of publications | 43 | 14.5 | 150 | 5.6 |
| Patents | Number of patents | 15 | 5.1 | 33 | 1.2 |
| R&D expenditure | Research and development spending (a) | 51 | 17.2 | 91 | 3.4 |
| Grants | TSB (b) grant awards | 66 | 22.3 | 187 | 7.0 |
| rndweb6 | Website-based variable, number of instances of keywords normalized by the number of noun phrases in websites | 204 | 68.9 | 909 | 34.1 |
(a) R&D amount as reported in FAME (2014)
(b) TSB = UK Technology Strategy Board
(c) Firms reporting a value for this variable in any year between 2004 and 2012
(d) Non-missing observations over 9 years
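The coverage percentages can be checked with simple arithmetic. The sketch below assumes the firm denominator is the full sample of 296 and the observation denominator is 296 firms × 9 years = 2,664 firm-years; the second figure is an inference from the "over 9 years" footnote rather than something stated outright, but it reproduces every reported percentage.

```python
N_FIRMS = 296                      # sample size stated in the table sources
N_YEARS = 9                        # 2004-2012 inclusive
N_FIRM_YEARS = N_FIRMS * N_YEARS   # assumed denominator: 2664

# (number of firms, number of observations) as reported in the table
reported = {
    "Publications": (43, 150),
    "Patents": (15, 33),
    "R&D expenditure": (51, 91),
    "Grants": (66, 187),
    "rndweb6": (204, 909),
}


def pct(count, total):
    """Share of total as a percentage, rounded to one decimal place."""
    return round(100 * count / total, 1)


coverage = {
    name: (pct(firms, N_FIRMS), pct(obs, N_FIRM_YEARS))
    for name, (firms, obs) in reported.items()
}

for name, (firm_pct, obs_pct) in coverage.items():
    print(f"{name:16s} firms {firm_pct:5.1f} %   observations {obs_pct:5.1f} %")
```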
Table 4 Pairwise 95 % significant correlations between website-based and other R&D variables
| No | Variable | Explanation | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Publications | Total number of publications | 1 | |||||||||
| 2 | Patents | Total number of patents | 1 | |||||||||
| 3 | R&D expenditure | Amount of R&D expenditure as reported in the FAME database, in GBP | 0.5431 | 1 | ||||||||
| 4 | Number of grants | Total number of grants from the TSB | 0.9764 | 0.5367 | 1 | |||||||
| 5 | rndweb1 | Website based variables, number of instances of keywords normalized by the number of noun phrases in websites | 1 | |||||||||
| 6 | rndweb2 | 0.8437 | 1 | |||||||||
| 7 | rndweb3 | 0.8388 | 0.9974 | 1 | ||||||||
| 8 | rndweb4 | 0.8265 | 0.9924 | 0.9957 | 1 | |||||||
| 9 | rndweb5 | 0.8317 | 0.8038 | 0.8230 | 0.8205 | 1 | ||||||
| 10 | rndweb6 | 0.8635 | 0.9108 | 0.9016 | 0.905 | 0.9551 | 1 |
Source: Analysis of conventional and website-based R&D variables for sample of 296 UK-based green goods small and medium-sized enterprises (see text for details)
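Because most variables are missing for most firm-years, a pairwise correlation uses only the observations where both variables are present, and a coefficient is reported only if it passes a significance test. The sketch below illustrates that logic under stated assumptions: it uses Pearson's r and a two-sided t-test, neither of which is confirmed by the table itself, and the data and function names are hypothetical. The Python standard library has no Student's t distribution, so the critical value is passed in by the caller.

```python
import math


def pairwise_pearson(x, y):
    """Pearson r over positions where both values are present (not None).

    Returns (r, n); r is None when it cannot be computed.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 3:
        return None, n
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    if sxx == 0 or syy == 0:
        return None, n
    return sxy / math.sqrt(sxx * syy), n


def is_significant(r, n, t_critical):
    """Two-sided test of r against a caller-supplied critical t value
    (taken from a t table for n - 2 degrees of freedom)."""
    if abs(r) >= 1.0:
        return True
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return abs(t) > t_critical


# Hypothetical series with gaps; None marks a missing observation.
x = [1, 2, 3, None, 5, 6]
y = [2, 1, None, 3, 4, 6]

r, n = pairwise_pearson(x, y)
print(n, round(r, 4))
# 4.303 is the two-sided 5 % critical value of Student's t for 2 degrees of freedom.
print(is_significant(r, n, t_critical=4.303))
```

With only four overlapping observations, even a high coefficient fails the test here, which shows why sparse variables may leave blank cells in a table of significant correlations.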
Table 5 Comparison of different data sources
| | | R&D expenditure | R&D grants | Patents and publications | Analysis of R&D activity in websites |
|---|---|---|---|---|---|
| Source | | Financial database (FAME) | Government database (TSB) | Web of Science | Current and historic websites |
| Indicator type | | Input | Input | Output | Process |
| Data structure | | Structured | Structured | Semi-structured | Unstructured |
| Data quality dimensions (a, b) | | | | | |
| Completeness | Sufficient breadth and depth and scope | ★☆☆ | ★☆☆ | ★☆☆ | ★★☆ |
| Accuracy | Correct representation of the phenomenon | ★☆☆ | ★☆☆ | ★☆☆ | ★★☆ |
| Currency | How promptly data is updated | ★☆☆ | ★★☆ | ★★☆ | ★★★ |
| Volatility | Data change frequency | ★☆☆ | ★★☆ | ★★☆ | ★★★ |
| Consistency | Agreement among components | ★★☆ | ★★★ | ★★☆ | ★☆☆ |
| Interpretability | Easiness of interpreting meaning | ★★★ | ★★★ | ★★☆ | ★☆☆ |
| Accessibility | Easiness of access and analysis | ★☆☆ | ★☆☆ | ★★☆ | ★★★ |
| Handling | Easiness of analysis | ★★★ | ★★★ | ★★☆ | ★☆☆ |
| Amount | Quantity of data | ★☆☆ | ★☆☆ | ★★☆ | ★★★ |
| Flexibility | Adaptable to different purposes | ★☆☆ | ★☆☆ | ★★☆ | ★★★ |
(a) Adapted and extended from Batini and Scannapieco (2006)
(b) Stars qualitatively denote the relative performance of data sources for data quality dimensions (three stars indicate relatively superior performance, while relatively inferior performance is denoted by one star). Comparison is made for the UK