Gaétan de Rassenfosse, Florian Seliger.
Abstract
We present a general method for imputing missing information in the Worldwide Patent Statistical Database (PATSTAT) and make the resulting datasets publicly available. PATSTAT is the de facto standard for academic research using patent data. Complete information on patents is essential to obtain an accurate picture of technological activities across countries and over time, yet the coverage of the database is far from complete. Our data imputation method exploits detailed institutional knowledge about the international patent system, which we codify in an SQL algorithm. We provide two datasets, one imputing missing country codes and one imputing missing technology classifications. We also release the algorithm, which can be easily adapted to impute other pieces of information that are missing in PATSTAT.
Keywords: Imputation; Missing data; PATSTAT; Patents; PostgreSQL
Year: 2020 PMID: 33354599 PMCID: PMC7744924 DOI: 10.1016/j.dib.2020.106615
Source DB: PubMed Journal: Data Brief ISSN: 2352-3409
Share of data available (%) in different fields in PATSTAT for selected years at the largest patent offices.
| Patent office | Year | No. Filings | Abstract | Applicant address | Applicant country | Citation | Inventor address | Inventor country | IPC | Legal Event |
|---|---|---|---|---|---|---|---|---|---|---|
| CA | 1990 | 28402 | 100 | 59 | 87 | 0 | 0 | 100 | 100 | 100 |
| CN | 1990 | 9518 | 52 | 0 | 100 | 0 | 0 | 100 | 98 | 100 |
| DE | 1990 | 102954 | 20 | 0 | 93 | 20 | 0 | 91 | 100 | 71 |
| EP | 1990 | 63898 | 99 | 100 | 100 | 93 | 100 | 100 | 100 | 100 |
| FR | 1990 | 14582 | 100 | 0 | 77 | 92 | 0 | 1 | 100 | 82 |
| GB | 1990 | 28086 | 32 | 33 | 37 | 35 | 0 | 0 | 61 | 44 |
| JP | 1990 | 347106 | 95 | 0 | 0 | 17 | 0 | 0 | 100 | 25 |
| KR | 1990 | 14890 | 33 | 0 | 100 | 0 | 0 | 100 | 99 | 91 |
| US | 1990 | 99540 | 80 | 95 | 100 | 100 | 99 | 100 | 100 | 100 |
| WO | 1990 | 19078 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| CA | 2000 | 42750 | 98 | 0 | 100 | 0 | 0 | 100 | 100 | 100 |
| CN | 2000 | 57560 | 58 | 0 | 98 | 0 | 0 | 97 | 100 | 100 |
| DE | 2000 | 108233 | 46 | 0 | 100 | 29 | 0 | 98 | 100 | 98 |
| EP | 2000 | 110729 | 54 | 100 | 100 | 99 | 100 | 100 | 100 | 100 |
| FR | 2000 | 15196 | 100 | 0 | 100 | 94 | 0 | 1 | 100 | 89 |
| GB | 2000 | 31725 | 41 | 0 | 43 | 40 | 0 | 42 | 61 | 96 |
| JP | 2000 | 372797 | 92 | 0 | 0 | 12 | 0 | 0 | 100 | 69 |
| KR | 2000 | 82130 | 87 | 0 | 71 | 6 | 0 | 74 | 91 | 100 |
| US | 2000 | 199450 | 100 | 92 | 97 | 98 | 96 | 100 | 100 | 100 |
| WO | 2000 | 91203 | 100 | 0 | 100 | 99 | 0 | 70 | 100 | 100 |
| CA | 2010 | 36176 | 97 | 0 | 100 | 0 | 0 | 100 | 100 | 100 |
| CN | 2010 | 361268 | 98 | 0 | 0 | 82 | 0 | 0 | 100 | 100 |
| DE | 2010 | 50734 | 94 | 0 | 100 | 82 | 0 | 96 | 100 | 99 |
| EP | 2010 | 133630 | 50 | 100 | 100 | 99 | 99 | 99 | 100 | 100 |
| FR | 2010 | 14270 | 100 | 0 | 100 | 94 | 0 | 100 | 100 | 93 |
| GB | 2010 | 21987 | 46 | 0 | 46 | 45 | 0 | 45 | 47 | 86 |
| JP | 2010 | 265346 | 88 | 0 | 0 | 71 | 0 | 0 | 100 | 83 |
| KR | 2010 | 158707 | 79 | 0 | 99 | 38 | 0 | 98 | 100 | 100 |
| US | 2010 | 323703 | 100 | 100 | 100 | 98 | 100 | 100 | 100 | 100 |
| WO | 2010 | 160839 | 100 | 0 | 100 | 100 | 0 | 96 | 100 | 100 |
"WO" refers to the World Intellectual Property Organization.
Calculations by EPO, source: https://public.tableau.com/profile/patstat.support#!/vizhome/CoverageofPATSTAT2019AutumnEdition/CoveragePATSTATGlobal, accessed on 2020/05/24
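
Coverage shares like those in the table above can be computed by counting, for each patent office and year, the filings for which a field is populated. The following is a minimal sketch only: it uses Python's built-in `sqlite3` module rather than PostgreSQL, and the table `filings` and its columns are hypothetical names chosen for illustration, not the PATSTAT schema. It also assumes that a missing value may be stored either as NULL or as an empty string.

```python
import sqlite3


def coverage_share(rows):
    """Return (office, year, n_filings, share_pct) per office and year.

    rows: iterable of (office, year, inventor_country), where a missing
    country is None or '' (an assumption about how gaps are encoded).
    """
    con = sqlite3.connect(":memory:")
    # Hypothetical toy table; not the actual PATSTAT layout.
    con.execute(
        "CREATE TABLE filings (office TEXT, year INTEGER, inventor_country TEXT)"
    )
    con.executemany("INSERT INTO filings VALUES (?, ?, ?)", rows)
    query = """
        SELECT office,
               year,
               COUNT(*) AS n_filings,
               -- NULLIF maps '' to NULL; COUNT(expr) skips NULLs.
               ROUND(100.0 * COUNT(NULLIF(inventor_country, '')) / COUNT(*))
                   AS share_pct
        FROM filings
        GROUP BY office, year
        ORDER BY office, year
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    sample = [
        ("FR", 1990, None),
        ("FR", 1990, ""),
        ("FR", 1990, "FR"),
        ("FR", 1990, "DE"),
        ("US", 1990, "US"),
    ]
    for row in coverage_share(sample):
        print(row)
```

The same `COUNT(NULLIF(column, ''))` pattern should carry over to other fields (abstract, IPC, and so on) by swapping the column name.
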
Fig. 1. Flowchart of the algorithm.
Share of available information for inventor countries before and after imputation (sources 1 to 3).
| Patent office | Year | No. first filings | Inventor country before imputation (%) | Inventor country after imputation (%) |
|---|---|---|---|---|
| CA | 1990 | 4832 | 99.6 | 99.6 |
| CN | 1990 | 28289 | 99.7 | 99.7 |
| DE | 1990 | 31422 | 78.4 | 88.2 |
| EP | 1990 | 8605 | 99.4 | 99.4 |
| FR | 1990 | 11046 | 1.1 | 49.8 |
| GB | 1990 | 4451 | 0.9 | 27.5 |
| JP | 1990 | 317810 | 0.0 | 8.9 |
| KR | 1990 | 15069 | 100.0 | 100.0 |
| US | 1990 | 63210 | 99.9 | 99.9 |
| WO | 1990 | 3127 | 99.9 | 99.9 |
| CA | 2000 | 5014 | 98.9 | 99.6 |
| CN | 2000 | 76905 | 99.9 | 99.9 |
| DE | 2000 | 44793 | 99.9 | 99.9 |
| EP | 2000 | 11207 | 99.5 | 99.7 |
| FR | 2000 | 12999 | 0.6 | 60.0 |
| GB | 2000 | 6670 | 99.6 | 99.8 |
| JP | 2000 | 352646 | 0.0 | 13.0 |
| KR | 2000 | 84934 | 99.1 | 99.1 |
| US | 2000 | 130298 | 100.0 | 100.0 |
| WO | 2000 | 12198 | 85.2 | 90.8 |
| CA | 2010 | 2464 | 99.6 | 99.7 |
| CN | 2010 | 585934 | 0.0 | 2.8 |
| DE | 2010 | 38316 | 100.0 | 100.0 |
| EP | 2010 | 13705 | 99.6 | 99.8 |
| FR | 2010 | 13560 | 100.0 | 100.0 |
| GB | 2010 | 5456 | 99.3 | 99.7 |
| JP | 2010 | 241803 | 0.0 | 19.6 |
| KR | 2010 | 115917 | 90.4 | 94.7 |
| US | 2010 | 159334 | 100.0 | 100.0 |
| WO | 2010 | 23611 | 97.1 | 98.5 |
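
To illustrate how a before/after comparison like the one above can arise, the sketch below fills a missing country on a first filing from a linked later filing and writes the result to an output table. This is an illustrative assumption, not the authors' algorithm: the excerpt does not describe sources 1 to 3, so the fallback rule, the tables `filings`, `links`, and `output`, and all column names are hypothetical. It uses Python's built-in `sqlite3` module in place of PostgreSQL.

```python
import sqlite3


def impute_country(filings, links):
    """Fill missing countries from linked later filings (assumed rule).

    filings: iterable of (appln_id, country or None)
    links:   iterable of (first_id, later_id), a hypothetical link from a
             first filing to a later filing of the same invention
    Returns (rows, share_before_pct, share_after_pct), where rows are
    (appln_id, country, source) tuples ordered by appln_id.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE filings (appln_id INTEGER PRIMARY KEY, country TEXT)")
    con.execute("CREATE TABLE links (first_id INTEGER, later_id INTEGER)")
    con.executemany("INSERT INTO filings VALUES (?, ?)", filings)
    con.executemany("INSERT INTO links VALUES (?, ?)", links)
    # Keep the original value when present; otherwise borrow from a linked
    # filing. MIN() ignores NULLs and serves as an arbitrary but
    # deterministic tie-break when several linked filings disagree.
    con.execute("""
        CREATE TABLE output AS
        SELECT f.appln_id,
               COALESCE(f.country, MIN(s.country)) AS country,
               CASE WHEN f.country IS NOT NULL THEN 'original'
                    WHEN MIN(s.country) IS NOT NULL THEN 'imputed'
                    ELSE 'missing' END AS source
        FROM filings AS f
        LEFT JOIN links AS l ON l.first_id = f.appln_id
        LEFT JOIN filings AS s ON s.appln_id = l.later_id
        GROUP BY f.appln_id, f.country
    """)
    rows = con.execute(
        "SELECT appln_id, country, source FROM output ORDER BY appln_id"
    ).fetchall()
    con.close()
    if not rows:
        return rows, 0.0, 0.0
    n = len(rows)
    before = 100.0 * sum(r[2] == "original" for r in rows) / n
    after = 100.0 * sum(r[1] is not None for r in rows) / n
    return rows, before, after


if __name__ == "__main__":
    toy_filings = [(1, None), (2, "FR"), (3, None), (4, None)]
    toy_links = [(1, 2), (3, 4)]
    print(impute_country(toy_filings, toy_links))
```

Recording a `source` column alongside each value keeps imputed entries distinguishable from original ones, which is one way to let downstream users decide whether to rely on them.
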
Share of available information on IPC before and after imputation.
| Patent office | Year | No. first filings | IPC before imputation (%) | IPC after imputation (%) |
|---|---|---|---|---|
| CA | 1990 | 4832 | 99.8 | 100.0 |
| CN | 1990 | 28289 | 99.9 | 100.0 |
| DE | 1990 | 31422 | 91.7 | 100.0 |
| EP | 1990 | 8605 | 93.9 | 100.0 |
| FR | 1990 | 11046 | 99.5 | 100.0 |
| GB | 1990 | 4451 | 86.5 | 100.0 |
| JP | 1990 | 317810 | 96.6 | 100.0 |
| KR | 1990 | 15069 | 95.7 | 100.0 |
| US | 1990 | 63210 | 72.5 | 100.0 |
| WO | 1990 | 3127 | 98.6 | 100.0 |
| CA | 2000 | 5014 | 98.9 | 100.0 |
| CN | 2000 | 76905 | 98.9 | 100.0 |
| DE | 2000 | 44793 | 90.6 | 100.0 |
| EP | 2000 | 11207 | 77.8 | 100.0 |
| FR | 2000 | 12999 | 98.4 | 100.0 |
| GB | 2000 | 6670 | 84.7 | 100.0 |
| JP | 2000 | 352646 | 92.5 | 100.0 |
| KR | 2000 | 84934 | 97.8 | 100.0 |
| US | 2000 | 130298 | 85.7 | 100.0 |
| WO | 2000 | 12198 | 98.2 | 100.0 |
| CA | 2010 | 2464 | 99.5 | 100.0 |
| CN | 2010 | 585934 | 98.8 | 100.0 |
| DE | 2010 | 38316 | 90.2 | 100.0 |
| EP | 2010 | 13705 | 68.6 | 100.0 |
| FR | 2010 | 13560 | 98.2 | 100.0 |
| GB | 2010 | 5456 | 55.7 | 100.0 |
| JP | 2010 | 241803 | 87.9 | 100.0 |
| KR | 2010 | 115917 | 96.2 | 100.0 |
| US | 2010 | 159334 | 93.3 | 100.0 |
| WO | 2010 | 23611 | 97.6 | 100.0 |

Specifications table.
| Item | Description |
|---|---|
| Subject | Social Sciences (General) |
| Specific subject area | Innovation policies, regional studies, strategic management |
| Type of data | Table; dataset; PostgreSQL code |
| How data were acquired | The data were extracted from the Worldwide Patent Statistical Database (PATSTAT) |
| Data format | Raw |
| Parameters for data collection | PATSTAT needs to be set up as a PostgreSQL database |
| Description of data collection | The datasets result from querying different PATSTAT tables, extracting the desired information and inserting it into an output table |
| Data source location | European Patent Office, Vienna |
| Data accessibility | Repository name: Harvard Dataverse Data identification number: |