Sajad Ashouri1,2, Arho Suominen1,2, Arash Hajikhani1,3, Lukas Pukelis4, Torben Schubert5,6, Serdar Türkeli7, Cees Van Beers8, Scott Cunningham9.
Abstract
This article presents firm-level data on companies' innovative behavior, derived by web scraping the websites of medium-high and high-technology companies in the European Union and the United Kingdom. The data are retrieved from individual company websites and cover 96,921 companies in total. They provide information on various aspects of innovation, most notably the research and development orientation at the company and product level, the company's collaborative activities, its products, and its use of standards. In addition to the web-scraped data, the dataset aggregates a variety of firm-level indicators, including patenting activities. In total, the dataset includes 21 variables with unique identifiers, which enable linking to other databases such as financial data.
Keywords: Big data; Firm-level data; Innovation; Text data; Web scraped data
Year: 2022 PMID: 35599825 PMCID: PMC9120249 DOI: 10.1016/j.dib.2022.108246
Source DB: PubMed Journal: Data Brief ISSN: 2352-3409
Fig. 1. Heatmap showing the geographical distribution of companies in the dataset.
Fig. 2. Relational database including seven data tables constructed from text data analytics.
Description of the variables in each data table.
| Variable | Data format | Description | Source table |
|---|---|---|---|
| ID | Varchar | Unique identifier of company | Companies Table |
| ISO | Dictionary including {ISO code: full name of ISO standard} | ISO standards associated with the company. | Companies Table |
| Keywords | Dictionary including {keyword: frequency in the text} | Meaningful keywords extracted from company's text. | Companies Table |
| FOS IDs | Dictionary including {fos_id: frequency in the text} | FOS id number and its similarity score for the company. FOS names can be found in the FOS table. | Companies Table |
| Linked countries | Dictionary including: {country_code: score} | Countries that are mentioned in the company's website text (not banners, ribbons or such). The scores are Min-Max scaled mention counts. e.g., score of 0.5 means that that country constitutes 50% of all country mentions. | Companies Table |
| Timestamp | timestamp | Timestamp of when the entity was uploaded to the database. | Companies Table |
| ID | Varchar | Unique identifier of company | Collaborations Table |
| Name | Varchar | Name of the collaborator | Collaborations Table |
| Category | Varchar | Category of collaborator ("RPO/University" or "Other") | Collaborations Table |
| Country code | Varchar | Country code of collaborator | Collaborations Table |
| ID | Varchar | Unique identifier of a company | Patents Table |
| Patent ID | Varchar | Patent publication number | Patents Table |
| DOI | Varchar | Unique identifier of publication | Publications Table |
| FOS IDs | List | List of fos ids associated with the publication | Publications Table |
| ID | Varchar | The company associated with the publication | Publications Table |
| ID | Varchar | Unique identifier of a company | Products Table |
| Product Name | Varchar | Name of the product | Products Table |
| Trademark | Boolean | If set True, the extracted product is a trademark | Products Table |
| Product FOS | Dictionary including {FOS id: Frequency} | FOS ids associated with the product | Products Table |
| Product Keywords | List | List of keywords associated with the products | Products Table |
| FOS ID | Varchar | FOS identifier | FOS table |
| FOS name | Varchar | FOS name | FOS table |
| Level | Categorical | Indicates FOS level in hierarchical structure. Can be 1-5. | FOS table |
| Paper count | Numerical | Count of publications in the Microsoft Academic database with the specified FOS | FOS table |
| Parent ID | Varchar | Parent FOS ID | FOS Relations Table |
| Child ID | Varchar | Child FOS ID | FOS Relations Table |
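
The tables above share the company ID, and several Companies Table columns hold dictionaries. The following is a minimal Python sketch of how such rows might be linked, not the authors' pipeline. It assumes the tables have been exported as lists of Python dicts and that dictionary-valued columns are serialized as JSON strings; the lower-case field names, the sample values, and the helper names `parse_dict_column` and `join_products` are hypothetical.

```python
import json
from collections import defaultdict

# Hypothetical rows: field names mirror the variable table, values are made up.
companies = [
    {"id": "C1", "iso": '{"ISO 9001": "Quality management systems"}',
     "linked_countries": '{"DE": 0.5, "US": 0.5}'},
    {"id": "C2", "iso": None, "linked_countries": '{"IT": 1.0}'},
]
products = [
    {"id": "C1", "product_name": "valve", "trademark": False},
    {"id": "C1", "product_name": "pump", "trademark": True},
]

def parse_dict_column(raw):
    """Decode a dictionary-valued column (assumed JSON); missing values become {}."""
    return json.loads(raw) if raw else {}

def join_products(companies, products):
    """Attach each company's product names via the shared ID column."""
    by_company = defaultdict(list)
    for row in products:
        by_company[row["id"]].append(row["product_name"])
    return {
        c["id"]: {
            "iso": parse_dict_column(c["iso"]),
            "linked_countries": parse_dict_column(c["linked_countries"]),
            "products": by_company.get(c["id"], []),
        }
        for c in companies
    }

joined = join_products(companies, products)
```

Companies without an ISO entry or without products simply end up with an empty dictionary or list, which keeps later aggregation code free of special cases.
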
Number of missing/not reported values for each variable.
| Variable | Missing values/ not reported | Source table | Variable | Missing values/ not reported | Source table |
|---|---|---|---|---|---|
| ID | 0 | Companies Table | FOS IDs | 0 | Publications Table |
| ISO | 61919 | Companies Table | Similar Companies | 0 | Publications Table |
| Keywords | 78 | Companies Table | Bvd ID | 0 | Products Table |
| FOS IDs | 4566 | Companies Table | Product Name | 0 | Products Table |
| Linked countries | 5614 | Companies Table | Trademark | 0 | Products Table |
| Timestamp | 13789 | Companies Table | Product FOS | 66299 | Products Table |
| ID | 0 | Collaborations Table | Product Keywords | 0 | Products Table |
| Name | 0 | Collaborations Table | FOS ID | 0 | FOS Table |
| Category | 0 | Collaborations Table | FOS name | 0 | FOS Table |
| Country Code | 35196 | Collaborations Table | Level | 0 | FOS Table |
| Patent ID | 0 | Patents Table | Paper count | 0 | FOS Table |
| ID | 0 | Patents Table | Parent ID | 0 | FOS Relations Table |
| | | | Child ID | 0 | FOS Relations Table |
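
Counts like those in the table above can be reproduced on an exported copy of a table with a few lines of Python. The sketch below is illustrative only: the source does not state how "missing/not reported" was determined, so it assumes that an absent key, `None`, an empty string, or an empty container counts as missing. The function name `count_missing` and the sample rows are hypothetical.

```python
def count_missing(rows, columns):
    """Count, per column, rows whose value is absent, None, or empty.

    Assumption: "missing" means an absent key, None, "", {} or [].
    A Boolean False (e.g. Trademark) is a reported value, not a missing one.
    """
    counts = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or value == "" or value == {} or value == []:
                counts[col] += 1
    return counts

# Hypothetical product rows for illustration.
product_rows = [
    {"id": "C1", "trademark": False, "product_fos": {"50549864": 2}},
    {"id": "C1", "trademark": True, "product_fos": {}},
    {"id": "C2", "trademark": False},
]
missing = count_missing(product_rows, ["id", "trademark", "product_fos"])
```
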
Summary statistics for categorical variables.
| Variable | Source table | Unique observations | Top count |
|---|---|---|---|
| ID | Companies Table | 96921 | - |
| ISO code | Companies Table | 3406 | ('ISO 9001', 25861), ('ISO 14001', 10749), ('ISO 13485', 3398) |
| Linked Countries | Companies Table | 195 | ('DE', 51943), ('US', 49695), ('IT', 34930) |
| FOS IDs | Companies Table | 105527 | ('204441458', 8628), ('2775945657', 8593), ('160403385', 7474) |
| Keywords | Companies Table | 849203 | ('contact', 10261), ('product', 10169), ('high', 10145) |
| ID | Collaboration Table | 18697 | - |
| Name | Collaboration Table | 57899 | ('FOOD AND DRUG ADMINISTRATION', 632), ('MINISTRY OF ECONOMIC AFFAIRS', 397), ('MINISTRY OF DEFENCE', 392) |
| Category | Collaboration Table | 2 | Other |
| Country Code | Collaboration Table | 190 | ('US', 55711), ('DE', 28376), ('GB', 25090) |
| ID | Patent Table | 3114 | - |
| Patent ID | Patent Table | 361121 | ('US2015233026A1', 25), ('CN105542514A', 25), ('US2015217877A1', 25) |
| FOS ID | Publication Table | 69834 | (71924100, 29685), (126322002, 18176), (41008148, 16493) |
| ID | Publication Table | 3631 | - |
| ID | Product Table | 71082 | ('GB08774049', 186), ('GBML3898974', 176), ('GB02027512', 150) |
| Product name | Product Table | 387420 | ('product portfolio', 468), ('management system', 386), ('surface finish', 377) |
| Trademark | Product Table | 2 | ('False', 606892) |
| Product FOS | Product Table | 165763 | ('50549864', 25201), ('122555611', 18212), ('122707667', 14822) |
| Product Keywords | Product Table | 109891 | ('type-members', 28312), ('pressure-testing', 19918), ('metal-insulator-semiconductors', 18909) |
| FOS ID | FOS Table | 664968 | - |
| FOS name | FOS Table | 664968 | - |
| Level | FOS Table | 6 | ('3', 321082), ('2', 131604), ('4', 111271) |
| Parent ID | FOS Relations Table | 53933 | ('59822182', 19563), ('141071460', 9129), ('555293320', 7993) |
| Child ID | FOS Relations Table | 429817 | ('2777753429', 15), ('144623209', 14), ('120592756', 13) |
Collaborator names such as Food and Drug Administration do not include the country in the name column. To avoid bias from such co-occurrences in future analyses, such collaborators should be identified using their country code.
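
One way to follow the note above is to count collaborators by the pair (name, country code) rather than by name alone, so that identically named bodies in different countries are kept apart. The sketch below is a hedged illustration with made-up rows. The field names, the `"unknown"` placeholder for unreported country codes, and the function name `top_collaborators` are assumptions, not part of the dataset.

```python
from collections import Counter

# Hypothetical collaboration rows; the same name occurs in several countries.
collaborations = [
    {"id": "C1", "name": "MINISTRY OF DEFENCE", "country_code": "GB"},
    {"id": "C2", "name": "MINISTRY OF DEFENCE", "country_code": "NL"},
    {"id": "C3", "name": "MINISTRY OF DEFENCE", "country_code": "GB"},
    {"id": "C4", "name": "MINISTRY OF DEFENCE", "country_code": None},
]

def top_collaborators(rows, n=3):
    """Return the n most frequent (name, country code) pairs.

    Rows without a country code are grouped under "unknown" (an assumption)
    instead of being merged with any specific country.
    """
    counts = Counter(
        (row["name"], row["country_code"] or "unknown") for row in rows
    )
    return counts.most_common(n)

top = top_collaborators(collaborations)
```

Counting by name alone would report a single collaborator with four mentions here, whereas the paired key separates the GB, NL, and unreported cases.
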
Summary statistics for numerical variables.
| Variable | Source table | Mean | Std | Min | 25% | 50% | 75% | Max |
|---|---|---|---|---|---|---|---|---|
| #ISO codes | Companies Table | 2.62 | 3.67 | 0 | 1 | 2 | 3 | 199 |
| #Keywords | Companies Table | 199.8 | 537.52 | 0 | 17 | 30 | 30 | 2000 |
| #FOS IDs | Companies Table | 88.82 | 25.65 | 0 | 100 | 100 | 100 | 100 |
| #Linked countries | Companies Table | 10.67 | 16.07 | 1 | 2 | 5 | 12 | 190 |
| #Product keywords | Product Table | 5.11 | 3.98 | 1 | 2 | 4 | 5 | 30 |
| #Product FOS | Product Table | 13.37 | 27.87 | 1 | 3 | 3 | 8 | 474 |
| Paper count | FOS Table | 1873.3 | 76235.6 | 1 | 2 | 9 | 79 | 2807188 |
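
The "#" variables above count the entries of a list- or dictionary-valued column per row. A minimal sketch of producing the same summary layout with the Python standard library follows. The source does not state which tool or which quartile and standard deviation conventions were used, so the sample standard deviation and the linearly interpolated ("inclusive") quartiles are assumptions, as are the function name `describe` and the sample data.

```python
import statistics

def describe(values):
    """Summarize a list of numbers as mean, std, min, quartiles, and max.

    Assumptions: sample standard deviation (n - 1 denominator) and
    linearly interpolated quartiles over the observed range.
    """
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
    }

# Hypothetical keyword lists for four products.
product_keywords = [
    ["valve", "steel"],
    ["pump"],
    ["sensor", "iot", "cloud"],
    ["gear", "shaft", "oil", "seal"],
]
stats = describe([len(keywords) for keywords in product_keywords])
```
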
Descriptive statistics for the digitalization score.
| Missing values | Mean | Std | Min | 25% | 50% | 75% | Max |
|---|---|---|---|---|---|---|---|
| 28948 | 0.095 | 0.202 | 0 | 0 | 0 | 0.100 | 1 |
Fig. 3. Distribution of the product digitalization measure at the industry level (NACE code).
| Subject area | Management of Technology and Innovation |
| More specific subject area | Big data in innovation management |
| Type of data | Web scraped data; Text data |
| How data were acquired | Data were acquired by web scraping companies' websites. |
| Data Format | Semi-structured (raw and preprocessed) |
| Description of data collection | The web-scraped data from companies' websites are stored in a relational PostgreSQL database hosted on a virtual machine in the Microsoft Azure cloud. Some data tables are constructed by linking the web-scraped data to publicly available data. |
| Data source location | Sample of medium-high and high-technology companies based in the EU-27 and the UK. |
| Data accessibility | Repository name: DataverseNL |