Zi-Lin He, Tony W Tong, Yuchen Zhang, Wenlong He.
Abstract
To meet researchers' increasing interest in the fast-growing innovation activities taking place in China, we match patents filed with China's State Intellectual Property Office to firms covered in China's Census. China has experienced strong growth in patent filings over the past two decades, and since 2011 has been the world's top patent filing country. China's Census database covers about one million unique manufacturing firms from 1998-2009, representing the broad Chinese economy. We design data parsing and pre-processing routines to clean and stem firm and assignee names, create a matching algorithm that fits our data and balances matching accuracy against the workload of manual checking, and implement a systematic manual check process to filter out false positives generated by computerized matching. Our project generates 1,113,588 matches for the Census firms, among which 849,647 patents are uniquely matched. By creating the patent-firm linked dataset, we hope to reduce duplicative effort and encourage more research to better understand China's fast-changing innovation landscape.
Year: 2018 PMID: 29583142 PMCID: PMC5956277 DOI: 10.1038/sdata.2018.42
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Figure 1. Workflow to match SIPO patents to ASIE firms and generate the final data files.
The rectangular boxes represent the processing procedures, and the cylinders represent the data in separate phases.
Distribution of ASIE firms across 31 provinces, autonomous regions, and province-equivalent municipalities, 1998-2009.
This table reports the number of firms covered in our database by province (autonomous region, municipality) and year. HMI (HuaMei Information) total can be found at

| Province | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Anhui | 3,820 | 3,782 | 3,681 | 3,674 | 3,920 | 4,159 | 4,793 | 5,278 | 6,523 | 8,111 | 11,112 | 14,516 |
| Beijing | 4,502 | 5,240 | 4,575 | 4,327 | 4,557 | 4,024 | 6,906 | 6,297 | 6,400 | 6,397 | 6,886 | 6,934 |
| Fujian | 6,103 | 5,548 | 6,006 | 6,554 | 7,474 | 9,211 | 11,953 | 12,396 | 13,755 | 15,178 | 16,898 | 18,743 |
| Gansu | 1,654 | 2,253 | 2,859 | 3,097 | 3,217 | 2,894 | 2,022 | 1,733 | 1,733 | 1,841 | 1,807 | 1,993 |
| Guangdong | 17,977 | 18,881 | 19,695 | 20,652 | 22,620 | 24,519 | 34,738 | 35,157 | 37,494 | 42,260 | 51,134 | 53,422 |
| Guangxi | 3,365 | 3,146 | 3,159 | 3,059 | 2,913 | 2,873 | 3,751 | 3,686 | 4,049 | 4,408 | 5,089 | 5,841 |
| Guizhou | 2,051 | 2,121 | 2,088 | 1,923 | 2,069 | 2,123 | 2,546 | 2,584 | 2,594 | 2,295 | 2,501 | 2,807 |
| Hainan | 640 | 578 | 596 | 573 | 602 | 620 | 634 | 616 | 595 | 488 | 521 | 500 |
| Hebei | 7,524 | 7,337 | 7,164 | 7,511 | 7,536 | 7,816 | 9,345 | 9,938 | 10,633 | 10,870 | 12,192 | 13,509 |
| Henan | 10,445 | 9,913 | 9,924 | 9,644 | 9,663 | 9,089 | 11,741 | 10,867 | 11,895 | 13,510 | 17,825 | 18,754 |
| Heilongjiang | 3,558 | 2,995 | 2,716 | 2,504 | 2,635 | 2,612 | 3,345 | 2,887 | 2,956 | 3,172 | 4,296 | 4,496 |
| Hubei | 7,398 | 6,871 | 6,281 | 6,146 | 6,176 | 6,272 | 6,366 | 6,814 | 7,546 | 8,995 | 11,759 | 14,214 |
| Hunan | 4,557 | 4,797 | 4,808 | 4,779 | 5,439 | 5,959 | 7,610 | 8,022 | 8,999 | 10,201 | 11,345 | 13,509 |
| Jilin | 2,845 | 2,837 | 2,768 | 2,606 | 2,622 | 2,343 | 3,451 | 2,774 | 3,249 | 3,984 | 5,151 | 6,133 |
| Jiangsu | 17,997 | 18,004 | 18,313 | 19,610 | 21,467 | 23,856 | 40,899 | 32,224 | 36,319 | 41,841 | 63,610 | 63,380 |
| Jiangxi | 3,951 | 3,737 | 3,556 | 3,105 | 3,085 | 3,054 | 4,263 | 4,403 | 5,333 | 6,028 | 6,750 | 7,712 |
| Liaoning | 6,250 | 5,806 | 6,018 | 5,693 | 6,018 | 6,844 | 11,458 | 11,509 | 14,754 | 16,556 | 21,124 | 26,276 |
| Inner Mongolia | 1,368 | 1,280 | 1,262 | 1,195 | 1,320 | 1,531 | 2,284 | 2,448 | 3,074 | 3,363 | 3,691 | 4,639 |
| Ningxia | 539 | 521 | 435 | 424 | 408 | 437 | 666 | 685 | 761 | 745 | 876 | 1,007 |
| Qinghai | 570 | 555 | 441 | 378 | 396 | 398 | 478 | 406 | 437 | 473 | 470 | 552 |
| Shandong | 11,443 | 11,432 | 11,721 | 12,149 | 13,508 | 16,226 | 23,916 | 27,540 | 31,936 | 36,145 | 41,927 | 46,671 |
| Shanxi | 3,934 | 3,349 | 3,280 | 3,021 | 3,460 | 3,610 | 5,067 | 4,441 | 4,671 | 4,472 | 4,296 | 4,049 |
| Shaanxi | 2,683 | 2,587 | 2,551 | 2,329 | 2,464 | 2,489 | 3,117 | 2,998 | 3,372 | 3,373 | 3,702 | 4,494 |
| Shanghai | 9,401 | 9,340 | 8,588 | 9,745 | 10,094 | 11,126 | 15,766 | 14,806 | 14,403 | 15,099 | 18,291 | 18,043 |
| Sichuan | 4,982 | 4,542 | 4,399 | 4,477 | 4,907 | 5,434 | 7,454 | 7,958 | 8,995 | 10,709 | 13,258 | 13,461 |
| Tianjin | 5,437 | 5,245 | 5,465 | 5,598 | 5,376 | 5,381 | 6,466 | 6,145 | 6,302 | 6,361 | 7,658 | 8,639 |
| Xizang | 340 | 328 | 361 | 363 | 345 | 324 | 187 | 195 | 202 | 98 | 74 | 90 |
| Xinjiang | 1,821 | 1,627 | 1,456 | 1,326 | 1,266 | 1,256 | 1,446 | 1,445 | 1,481 | 1,575 | 1,807 | 2,022 |
| Yunnan | 2,514 | 2,131 | 2,122 | 1,988 | 2,070 | 1,992 | 2,398 | 2,362 | 2,603 | 2,699 | 2,994 | 3,547 |
| Zhejiang | 13,450 | 13,274 | 14,552 | 18,549 | 21,869 | 25,508 | 41,369 | 40,277 | 45,688 | 51,604 | 57,739 | 62,271 |
| Chongqing | 1,997 | 1,976 | 2,042 | 2,032 | 2,061 | 2,242 | 2,657 | 2,943 | 3,208 | 3,916 | 5,987 | 6,481 |
| Province unknown | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 47 | 36 |
| Our total | 165,118 | 162,033 | 162,883 | 169,031 | 181,557 | 196,222 | 279,092 | 271,835 | 301,960 | 336,767 | 412,817 | 448,741 |
| HMI total | 165,119 | 162,034 | 162,885 | 169,031 | 181,557 | 196,222 | 279,092 | 271,835 | 301,961 | 336,768 | 412,000 | 434,000 |
| NBS total | 165,080 | 162,033 | 162,885 | 171,256 | 181,557 | 196,222 | 276,474 | 271,835 | 301,961 | 336,768 | 426,113 | 434,364 |
Examples of the left-aligned strict substring condition.
This table lists several exemplary inputs and outputs of left-aligned strict substring matching. ASIE firms' full names and stem names are in the first and second columns. Patent assignees' full names and stem names are in the third and fourth columns. The matching output (true match or not) based on stem names is in the last column.

| ASIE firm full name | ASIE firm stem name | Patent assignee full name | Patent assignee stem name | True match |
|---|---|---|---|---|
| 贵州黄果树烟草集团公司 | 贵州黄果树烟草 | 贵州黄果树烟草集团有限责任公司贵阳烟叶购销分公司原料分厂 | 贵州黄果树烟草贵阳烟叶购销原料分 | Yes |
| 鞍山钢铁集团公司 | 鞍山钢铁 | 鞍山钢铁集团公司水泥厂 | 鞍山钢铁水泥 | Yes |
| 长飞光纤光缆有限公司 | 长飞光纤光缆 | 长飞光纤光缆(上海)有限公司 | 长飞光纤光缆上海 | Yes |
| 上海创开无框阳台有限公司 | 上海创开无框阳台 | 上海创开无框阳台窗有限公司 | 上海创开无框阳台窗 | Yes |
| 上海宝钢集团公司 | 上海宝钢 | 上海宝钢建筑工程设计研究院 | 上海宝钢建筑工程设计研究院 | Yes |
| 北京市北郊冷饮食品厂 | 北京市北郊冷饮食品 | 北京市北郊冷饮食品三厂 | 北京市北郊冷饮食品三 | No |
| 天津市自动化仪表厂 | 天津市自动化仪表 | 天津市自动化仪表七厂 | 天津市自动化仪表七 | No |
| 洛阳市工程机械厂 | 洛阳市工程机械 | 洛阳市工程机械设计所 | 洛阳市工程机械设计所 | No |
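
The condition illustrated in the table above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that "left-aligned strict substring" means the ASIE stem name is a proper prefix of the assignee stem name, as it is in every row of the table, and the function name is hypothetical.

```python
def is_left_aligned_strict_substring(firm_stem: str, assignee_stem: str) -> bool:
    """Return True when firm_stem is a proper prefix of assignee_stem.

    Assumption: "left-aligned" means the match starts at the first character,
    and "strict" means the two stems are not identical.
    """
    return len(firm_stem) < len(assignee_stem) and assignee_stem.startswith(firm_stem)


# Stem-name pairs; the first two are taken from the table above.
pairs = [
    ("鞍山钢铁", "鞍山钢铁水泥"),                   # recorded as a true match
    ("北京市北郊冷饮食品", "北京市北郊冷饮食品三"),  # recorded as not a true match
    ("鞍山钢铁", "鞍山钢铁"),                       # identical stems: exact match, not strict
]
flags = [is_left_aligned_strict_substring(firm, assignee) for firm, assignee in pairs]
print(flags)
```

Both of the first two pairs satisfy the condition, yet the table records the second as not a true match. Under this reading, the condition only generates candidate pairs, and the manual check described in the abstract filters out the false positives.
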
Variable names and definitions.
This table lists the variable names and definitions of the final database.

| Column | Variable name | Definition |
|---|---|---|
| A | ASIE_id (ASIE 企业代码) | Unique ID of ASIE firm |
| B | Fullname (ASIE 企业全名) | Full name of ASIE firm |
| C | Stemname (ASIE 企业根名) | Stem name of ASIE firm |
| D | Patent_type (专利类型) | Type of patent: d = design patent; i = invention patent; u = utility model patent |
| E | Serial_no (序列号) | Serial number of patent in SIPO CD-ROMs |
| F | Application_year (申请年份) | Patent application year |
| G | Assignee (申请人) | Assignee field |
| H | Assignee_full (申请人全名) | Full name of focal assignee to be matched |
| I | Assignee_stem (申请人根名) | Stem name of focal assignee to be matched |
| J | Manual_check (手工校验) | Manual check flag: 1 = manual check is needed; 0 = manual check is not needed |
| K | True_match (正确匹配) | True match flag: Yes = true match; No = false match |
| L | Publication_date (公开日) | Patent publication date |
| M | Application_date (申请日) | Patent application date |
| N | Primary_class (主分类号) | Primary technology class |
| O | Class (分类号) | All technology class(es) |
| P | Divisional_application (分案原申请号) | Number of prior SIPO application, if any, that focal application refers to |
| Q | Priority (优先权) | Priority number(s) |
| R | Address (地址) | Address of the first assignee |
| S | Patent_agency (专利代理机构) | Name of patent agency |
| T | Patent_attorney (代理人) | Name of patent attorney |
| U | Pages (页数) | Number of pages of patent application |
| V | Country_or_province_code (国省代码) | Country/province code of the first assignee |
| W | Grant (专利授权) | Grant status as of April 2013: 1 = granted; 0 = not granted |
| X | Grant_date (专利授权日) | Grant date as of April 2013 (NA if not granted) |
Breakdown of results from automated matching program and manual check.
This table reports the subtotals of automated matching results and manual checks. Total number of computer generated matches based on exact matching = 356,121 + 291,449 + 362,901 = 1,010,471. Total number of computer generated matches based on left-aligned strict substring matching = 42,362 + 41,233 + 61,583 = 145,178. Total number of matches that require manual check = 575 + 38,621 + 578 + 32,195 + 686 + 50,242 = 122,897, among which 90,860 are true matches and 32,037 are false matches. Total number of matches that do not require manual check = 355,546 + 3,741 + 290,871 + 9,038 + 362,215 + 11,341 = 1,032,752, among which 1,022,728 are true matches and 10,024 are false matches.

| Patent type (total matches) | Matching method (subtotal) | Manual check | Matches | True matches | False matches |
|---|---|---|---|---|---|
| Design patents (398,483) | Exact matching (356,121) | Required | 575 | 551 | 24 |
| Design patents (398,483) | Exact matching (356,121) | Not required | 355,546 | 355,527 | 19 |
| Design patents (398,483) | Left-aligned strict substring matching (42,362) | Required | 38,621 | 29,022 | 9,599 |
| Design patents (398,483) | Left-aligned strict substring matching (42,362) | Not required | 3,741 | 2,150 | 1,591 |
| Invention patents (332,682) | Exact matching (291,449) | Required | 578 | 532 | 46 |
| Invention patents (332,682) | Exact matching (291,449) | Not required | 290,871 | 290,094 | 777 |
| Invention patents (332,682) | Left-aligned strict substring matching (41,233) | Required | 32,195 | 22,983 | 9,212 |
| Invention patents (332,682) | Left-aligned strict substring matching (41,233) | Not required | 9,038 | 3,659 | 5,379 |
| Utility model patents (424,484) | Exact matching (362,901) | Required | 686 | 605 | 81 |
| Utility model patents (424,484) | Exact matching (362,901) | Not required | 362,215 | 361,833 | 382 |
| Utility model patents (424,484) | Left-aligned strict substring matching (61,583) | Required | 50,242 | 37,167 | 13,075 |
| Utility model patents (424,484) | Left-aligned strict substring matching (61,583) | Not required | 11,341 | 9,465 | 1,876 |
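
The totals quoted in the note to this table can be re-derived from the twelve true/false cells. The short sketch below does only that arithmetic; the dictionary layout and the helper name `total` are illustrative choices, not part of the published dataset.

```python
# Key: (patent type, matching method, manual check required).
# Value: (true matches, false matches), copied from the table above.
cells = {
    ("design", "exact", True): (551, 24),
    ("design", "exact", False): (355_527, 19),
    ("design", "substring", True): (29_022, 9_599),
    ("design", "substring", False): (2_150, 1_591),
    ("invention", "exact", True): (532, 46),
    ("invention", "exact", False): (290_094, 777),
    ("invention", "substring", True): (22_983, 9_212),
    ("invention", "substring", False): (3_659, 5_379),
    ("utility", "exact", True): (605, 81),
    ("utility", "exact", False): (361_833, 382),
    ("utility", "substring", True): (37_167, 13_075),
    ("utility", "substring", False): (9_465, 1_876),
}


def total(method=None, manual=None):
    """Sum (true, false) counts over the cells that pass the optional filters."""
    true_sum = false_sum = 0
    for (_, cell_method, needs_check), (true_n, false_n) in cells.items():
        if method is not None and cell_method != method:
            continue
        if manual is not None and needs_check != manual:
            continue
        true_sum += true_n
        false_sum += false_n
    return true_sum, false_sum


exact_total = sum(total(method="exact"))          # all exact-matching pairs
substring_total = sum(total(method="substring"))  # all substring-matching pairs
checked = total(manual=True)                      # (true, false) among manually checked pairs
unchecked = total(manual=False)                   # (true, false) among the rest
print(exact_total, substring_total, checked, unchecked)
```
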
List of “fuzzy” ASIE firm names.
This table provides a full list of ASIE firm names with ambiguous semantic meaning.

| No. | ASIE firm name |
|---|---|
| 1 | 印刷厂 |
| 2 | 建材厂 |
| 3 | 机械厂 |
| 4 | 林化厂 |
| 5 | 电机厂 |
| 6 | 电缆厂 |
| 7 | 食品厂 |
| 8 | 无线电厂 |
| 9 | 水电公司 |
| 10 | 油脂集团 |
| 11 | 热电公司 |
| 12 | 物流公司 |
| 13 | 电业公司 |
| 14 | 电力公司 |
| 15 | 黄金公司 |
| 16 | 天然食品厂 |
| 17 | 建筑材料厂 |
| 18 | 水电总公司 |
| 19 | 电力总公司 |
| 20 | 石油化工厂 |
| 21 | 矿业总公司 |
| 22 | 第三化工厂 |
| 23 | 有色金属公司 |
| 24 | 汽车零部件厂 |
| 25 | 油脂有限公司 |
| 26 | 电力工业公司 |
| 27 | 电力集团公司 |
| 28 | 电装有限公司 |
| 29 | 造纸有限公司 |
| 30 | 医药集团有限公司 |
| 31 | 电力有限责任公司 |
| 32 | 食品集团有限责任公司 |
| 33 | 2 |
| 34 | 鼎盛 |
| 35 | 微生物研究所 |
| 36 | 陶瓷有限责任公司 |
Assessing degrees of contamination and omission of alternative approaches.
This table reports the assessment of our matching approach against alternative approaches. Degree of contamination = 100 × ("alternative 1" – "our approach") / "our approach". Degree of omission = 100 × ("our approach" – "alternative 2") / "our approach".

| | Our approach: combine computerized matching and manual check | Alternative 1: regard all computer-generated name pairs as true match | Alternative 2: regard all name pairs requiring manual check as false match |
|---|---|---|---|
| # of matched design patents | 387,250 | 398,483 (degree of contamination: 2.9%) | 357,677 (degree of omission: 7.6%) |
| # of matched invention patents | 317,268 | 332,682 (degree of contamination: 4.9%) | 293,753 (degree of omission: 7.4%) |
| # of matched utility model patents | 409,070 | 424,484 (degree of contamination: 3.8%) | 371,298 (degree of omission: 9.2%) |
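
The two measures defined in the note to this table are simple ratios, and the reported percentages can be reproduced directly from the counts. The sketch below is only a worked check of that arithmetic; the function and variable names are illustrative.

```python
def degree_of_contamination(ours: int, alternative_1: int) -> float:
    """100 x (alternative 1 - our approach) / our approach."""
    return 100 * (alternative_1 - ours) / ours


def degree_of_omission(ours: int, alternative_2: int) -> float:
    """100 x (our approach - alternative 2) / our approach."""
    return 100 * (ours - alternative_2) / ours


# (our approach, alternative 1, alternative 2), copied from the table above.
counts = {
    "design": (387_250, 398_483, 357_677),
    "invention": (317_268, 332_682, 293_753),
    "utility model": (409_070, 424_484, 371_298),
}

results = {
    patent_type: (
        round(degree_of_contamination(ours, alt1), 1),
        round(degree_of_omission(ours, alt2), 1),
    )
    for patent_type, (ours, alt1, alt2) in counts.items()
}
for patent_type, (contamination, omission) in results.items():
    print(f"{patent_type}: contamination {contamination}%, omission {omission}%")
```
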
Assessing the extent of duplicate matches.
This table reports the assessment of the extent of duplicate matches for the three different types of patents.

| Patent type | Computer-generated matches (a) | Unique patents matched (b) | % (b/a) | True matches after manual check (c) | Unique patents matched (d) | % (d/c) |
|---|---|---|---|---|---|---|
| Design patents | 398,483 | 300,956 | 75.5 | 387,250 | 291,578 | 75.3 |
| Invention patents | 332,682 | 265,338 | 79.8 | 317,268 | 253,628 | 79.9 |
| Utility model patents | 424,484 | 316,303 | 74.5 | 409,070 | 304,441 | 74.4 |
| Total | 1,155,649 | 882,597 | 76.4 | 1,113,588 | 849,647 | 76.3 |
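
The percentage columns in this table are the share of matches that correspond to distinct patents, that is, unique patents divided by matches. A brief check of that arithmetic, using only the counts above (the names in the code are illustrative):

```python
# (computer-generated matches a, unique patents b,
#  true matches after manual check c, unique patents d), from the table above.
rows = {
    "design": (398_483, 300_956, 387_250, 291_578),
    "invention": (332_682, 265_338, 317_268, 253_628),
    "utility model": (424_484, 316_303, 409_070, 304_441),
}

# Column-wise sums give the "Total" row.
totals = tuple(sum(row[i] for row in rows.values()) for i in range(4))
rows["total"] = totals

shares = {
    name: (round(100 * b / a, 1), round(100 * d / c, 1))
    for name, (a, b, c, d) in rows.items()
}
for name, (before, after) in shares.items():
    print(f"{name}: {before}% unique before manual check, {after}% after")
```

The total share after manual check corresponds to the 849,647 uniquely matched patents out of 1,113,588 matches reported in the abstract.
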
Comparing our number of matched patents with that in prior studies.
This table provides a comparison of the number of matched patents in this study with the numbers reported in the four prior studies. a). Eberhardt et al[

| Authors | ASIE period | Number of firms | Distinguish different types of patents? | Number of patents matched | Our number of unique patents matched during the same period |
|---|---|---|---|---|---|
| Eberhardt et al. [ | 1999-2006 | about 590,000 | Only invention patents are matched | 44,344 (invention patents) | 95,902 b (invention patents) |
| Dang & Motohashi [ | 1998-2008 | 12,208 (panel data) | Only invention patents are matched | 126,386 (invention patents) | 188,773 (invention patents) |
| Chen et al. [ | 1998-2007 | 11,631 (panel data) | Yes | 50,013 (three types of patents in total) | 484,359 (three types of patents in total) |
| Xie & Zhang [ | 1998-2009 | 682,814 | Yes | 749,691 granted patents d (three types of patents in total) | 849,647 patent applications, among which 737,834 are granted e (three types of patents in total) |
Distribution of matched patents, by year and type of patents.
This table reports the distribution of our matched patents, by year and type of patents. Columns 1 to 3 report the distribution for the three types of patents separately, and Column 4 reports the distribution for all three types of patents combined.

| Year | (1) Design patents: number of matches | (1) Design patents: % | (2) Invention patents: number of matches | (2) Invention patents: % | (3) Utility model patents: number of matches | (3) Utility model patents: % | (4) Three types in total: number of matches | (4) Three types in total: % |
|---|---|---|---|---|---|---|---|---|
| 1998 | 9,375 | 2.4 | 1,226 | 0.4 | 5,430 | 1.3 | 16,031 | 1.4 |
| 1999 | 13,089 | 3.4 | 2,162 | 0.7 | 8,011 | 2 | 23,262 | 2.1 |
| 2000 | 15,095 | 3.9 | 3,392 | 1.1 | 9,848 | 2.4 | 28,335 | 2.5 |
| 2001 | 16,741 | 4.3 | 4,822 | 1.5 | 12,053 | 2.9 | 33,616 | 3 |
| 2002 | 22,373 | 5.8 | 9,281 | 2.9 | 16,938 | 4.1 | 48,592 | 4.4 |
| 2003 | 23,652 | 6.1 | 14,630 | 4.6 | 21,855 | 5.3 | 60,137 | 5.4 |
| 2004 | 26,986 | 7 | 18,420 | 5.8 | 23,435 | 5.7 | 68,841 | 6.2 |
| 2005 | 31,863 | 8.2 | 26,868 | 8.5 | 29,807 | 7.3 | 88,538 | 8 |
| 2006 | 41,118 | 10.6 | 38,828 | 12.2 | 40,384 | 9.9 | 120,330 | 10.8 |
| 2007 | 52,071 | 13.4 | 50,791 | 16 | 52,103 | 12.7 | 154,965 | 13.9 |
| 2008 | 60,354 | 15.6 | 65,539 | 20.7 | 76,205 | 18.6 | 202,098 | 18.1 |
| 2009 | 74,533 | 19.2 | 81,309 | 25.6 | 113,001 | 27.6 | 268,843 | 24.1 |
| Total | 387,250 | 100 | 317,268 | 100 | 409,070 | 100 | 1,113,588 | 100 |
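
Each percentage in this table is a year's match count divided by the column total. As a small worked check, the sketch below recomputes the shares for Column 4 (all three patent types combined) from the counts above; the variable names are illustrative.

```python
# Number of matches per application year, Column 4 of the table above.
matches_by_year = {
    1998: 16_031, 1999: 23_262, 2000: 28_335, 2001: 33_616,
    2002: 48_592, 2003: 60_137, 2004: 68_841, 2005: 88_538,
    2006: 120_330, 2007: 154_965, 2008: 202_098, 2009: 268_843,
}

grand_total = sum(matches_by_year.values())

# Share of all matches falling in each year, in percent, to one decimal place.
share_by_year = {
    year: round(100 * count / grand_total, 1)
    for year, count in matches_by_year.items()
}
for year, share in share_by_year.items():
    print(year, f"{share:.1f}")
```
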