Linda Reijnhoudt, Rodrigo Costas, Ed Noyons, Katy Börner, Andrea Scharnhorst.
Abstract
The study of science at the individual scholar level requires the disambiguation of author names. The creation of authors' publication oeuvres involves matching a list of unique author names to the names used in publication databases. Despite recent progress in the development of unique author identifiers, e.g., ORCID, VIVO, or DAI, author disambiguation remains a key problem for large-scale bibliometric analysis using data from multiple databases. This study introduces and tests a new methodology called seed + expand for semi-automatic bibliographic data collection for a given set of individual authors. Specifically, we identify the oeuvres of a set of Dutch full professors during the period 1980-2011 by combining author records from the Dutch National Research Information System (NARCIS) with publication records from the Web of Science. Starting with an initial list of 8,378 names, we identify 'seed publications' for each author using five different approaches. Subsequently, we 'expand' the set of publications using three different approaches. The approaches are compared, and the resulting oeuvres are evaluated on precision and recall using a 'gold standard' dataset of authors for whom verified publications in the period 2001-2010 are available.
Keywords: Author disambiguation; Publication oeuvre; Scalable methods
Year: 2014 PMID: 25328257 PMCID: PMC4190454 DOI: 10.1007/s11192-014-1256-0
Source DB: PubMed Journal: Scientometrics ISSN: 0138-9130 Impact factor: 3.238
Fig. 1 General workflow and relevant data sources
Result sets obtained by different seeds
| Seed method | CWTS publications | NARCIS full professors | Full professors unique to this seed |
|---|---|---|---|
| EM | 40,826 | 4,786 | 790 |
| RP | 81,079 | 5,819 | 149 |
| DL | 79,515 | 5,749 | 158 |
| AL | 28,837 | 5,018 | 76 |
| DAI | 30,322 | 2,742 | 162 |
| Total unique in combined seed | 174,568 | 6,989 | |
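The last column and the totals row of this table are set operations over the per-seed results. The following is a minimal sketch of how such counts could be derived, assuming each seed method yields a set of professor identifiers; the function name `seed_overlap` and the toy identifiers are hypothetical and are not taken from the study's data or code.

```python
from collections import Counter


def seed_overlap(seeds):
    """Count professors found by only one seed, and the combined total.

    seeds: dict mapping a seed name to the set of professor ids it found
    (hypothetical input shape, chosen for illustration).
    """
    # How many seeds found each professor
    counts = Counter(pid for found in seeds.values() for pid in found)
    # Professors found by this seed and by no other seed
    unique_per_seed = {
        name: sum(1 for pid in found if counts[pid] == 1)
        for name, found in seeds.items()
    }
    # Professors found by at least one seed
    combined = set().union(*seeds.values())
    return unique_per_seed, len(combined)


# Toy data, not the real NARCIS/CWTS records
toy_seeds = {
    "EM": {"p1", "p2", "p3"},
    "RP": {"p2", "p3", "p4"},
    "DAI": {"p3", "p5"},
}
unique_per_seed, total = seed_overlap(toy_seeds)
print(unique_per_seed)  # {'EM': 1, 'RP': 1, 'DAI': 1}
print(total)            # 5
```

Because the seeds overlap, the combined total is smaller than the sum of the per-seed counts, which matches the pattern in the table (6,989 professors overall versus 4,786 to 5,819 for the larger individual seeds).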
Fig. 2 Example of direct author organization linkage
Pruning the seeds to increase precision
| Seed method | Number of found professors | Remove multiple assignments | Remove common names |
|---|---|---|---|
| EM | 4,786 | 4,786 | 4,786 |
| RP | 5,819 | 5,696 | 4,648 |
| DL | 5,749 | 5,629 | 4,675 |
| AL | 5,018 | 4,864 | 3,147 |
| DAI | 2,742 | 2,742 | 2,740 |
| Total unique in combined seed | 6,989 | 6,947 | 6,753 |
Missing professors per university with more than 100 professors
| Abbr. | University | No. profs | % Missed | % Found EM |
|---|---|---|---|---|
| OUH | Open Universiteit - OUNL | 124 | 50 | 34 |
| UVT | Tilburg University | 452 | 32 | 26 |
| VUA | VU University Amsterdam | 889 | 24 | 47 |
| RUM | Maastricht University | 552 | 24 | 40 |
| RUL | Leiden University | 802 | 21 | 62 |
| UVA | University of Amsterdam | 961 | 19 | 60 |
| RUG | University of Groningen | 938 | 18 | 62 |
| EUR | Erasmus University Rotterdam | 668 | 18 | 67 |
| KUN | Radboud University Nijmegen | 689 | 17 | 62 |
| TUD | Delft University of Technology | 621 | 17 | 67 |
| RUU | Utrecht University | 930 | 16 | 55 |
| TUE | Technische Universiteit Eindhoven | 369 | 11 | 73 |
| TUM | University of Twente | 358 | 11 | 68 |
| WUR | Wageningen University & Research Centre | 333 | 3 | 84 |
Fig. 3 Dutch University profiles, from (2012) NARCIS: Network of Experts and Knowledge Organizations in the Netherlands
Percentage missed professors by affiliation in the first three seeds
| Job title | Abbr. | No. profs | % Missed | % Found EM |
|---|---|---|---|---|
| Visiting professor | GHL | 46 | 61 | 22 |
| Rector | RMA | 11 | 55 | 36 |
| Honorary professor (without salary) | OHL | 119 | 41 | 34 |
| Honorary professor | HHL | 18 | 39 | 44 |
| Extraordinary professor | BHL | 1,380 | 28 | 46 |
| Dean | DCN | 78 | 26 | 44 |
| Part-time professor | PTH | 300 | 25 | 41 |
| Professor emeritus | EMT | 38 | 21 | 37 |
| Professor | HGL | 4,596 | 18 | 60 |
| Associate professor | UHD | 2,465 | 14 | 61 |
| Management | DIR | 612 | 14 | 66 |
| Researcher | OND | 557 | 11 | 71 |
| Contact person organisation | CPO | 58 | 9 | 74 |
| University professor | UHL | 29 | 7 | 68 |
| Project leader | PRL | 162 | 2 | 85 |
Fig. 4 Number of authors (y-axis) from the seed with the number of matched Scopus author identifiers (x-axis). The 41 authors with ten or more were ultimately discarded
Performance of the three expansion approaches—individually and combined
| | Scopus identifier | Meso | Micro | ScopusI and Meso | ScopusI and Micro |
|---|---|---|---|---|---|
| True pos. (A∩B) | 55,405 | 55,459 | 55,394 | 55,509 | 55,460 |
| False pos. (¬A∩B) | 8,055 | 10,430 | 7,212 | 13,200 | 10,260 |
| False neg. (A∩¬B) | 2,370 | 2,316 | 2,381 | 2,260 | 2,315 |
| Precision | 87.3 | 84.2 | 88.5 | 80.8 | 84.4 |
| Recall | 95.9 | 96.0 | 95.9 | 96.1 | 96.0 |
| F1 | 45.7 | 44.9 | 46.0 | 43.9 | 44.9 |
Fig. 5 Gold standard set (a) versus the result of the expansion (b)
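The precision and recall rows of the performance table can be recomputed from the true-positive, false-positive, and false-negative counts. The sketch below does this for the Scopus identifier column, using the conventional definitions; the function name `evaluate` is hypothetical. One observation, offered with caution: the F1 row as printed appears to be about half of the conventional harmonic-mean F1, which would be consistent with computing P·R/(P+R) without the factor 2.

```python
def evaluate(tp, fp, fn):
    """Return (precision, recall, F1) as fractions, using standard definitions."""
    precision = tp / (tp + fp)  # share of retrieved publications that are correct
    recall = tp / (tp + fn)     # share of gold-standard publications retrieved
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Counts from the "Scopus identifier" column of the performance table
p, r, f1 = evaluate(tp=55_405, fp=8_055, fn=2_370)

print(round(100 * p, 1))       # 87.3, matches the Precision row
print(round(100 * r, 1))       # 95.9, matches the Recall row
print(round(100 * f1, 1))      # 91.4, conventional F1
print(round(100 * f1 / 2, 1))  # 45.7, the value printed in the F1 row
```

Under either scaling the ranking of the approaches is unchanged, so the comparison between expansion methods in the table is not affected.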