Mark-Christoph Müller, Florian Reitz, Nicolas Roy.
Abstract
Data sets of publication metadata with manually disambiguated author names play an important role in current author name disambiguation (AND) research. We review the most important data sets used so far and compare their respective advantages and shortcomings. From the results of this review, we derive a set of general requirements for future AND data sets. These include both trivial requirements, like absence of errors and preservation of author order, and more substantial ones, like full disambiguation and adequate representation of publications with a small number of authors and highly variable author names. On the basis of these requirements, we create and make publicly available a new AND data set, SCAD-zbMATH. Both the quantitative analysis of this data set and the results of our initial AND experiments with a naive baseline algorithm show the SCAD-zbMATH data set to be considerably different from existing ones. We consider it a useful new resource that will challenge the state of the art in AND and benefit the AND research community.
Keywords: Author name disambiguation; Author name homography; Author name variability; Data sets; Digital libraries
Year: 2017 PMID: 28596627 PMCID: PMC5438420 DOI: 10.1007/s11192-017-2363-5
Source DB: PubMed Journal: Scientometrics ISSN: 0138-9130 Impact factor: 3.238
Fig. 1 Author-centric information in the KISTI-AD-E-01 data set
Fig. 2 Information from Fig. 1 in canonical, publication-centric format
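The conversion between the two figure formats can be illustrated with a minimal sketch. The input layout (a mapping from author ID to a list of `(pub_id, position, name)` tuples) and the function name `to_publication_centric` are assumptions made for illustration only; the actual file format of KISTI-AD-E-01 is not reproduced here.

```python
from collections import defaultdict


def to_publication_centric(author_centric):
    """Invert {author_id: [(pub_id, position, name), ...]} into
    {pub_id: {position: (name, author_id)}}.

    Both layouts are hypothetical stand-ins for the author-centric
    and publication-centric formats named in Fig. 1 and Fig. 2.
    """
    publications = defaultdict(dict)
    for author_id, entries in author_centric.items():
        for pub_id, position, name in entries:
            # Each identified author fills one author slot of the publication.
            publications[pub_id][position] = (name, author_id)
    return dict(publications)


# Illustrative input modelled on the KISTI-AD-E-01 sample record:
# the author with ID 2 appears at position 2 of one publication.
author_centric = {
    2: [("conf/pomc/AnceaumeDGS02", 2, "Ajoy Kumar Datta")],
}
pubs = to_publication_centric(author_centric)
```

In this layout, author slots that carry no ID simply do not appear; a fuller version would also need the unidentified co-author names, which an author-centric source may or may not provide.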
Some quantitative and qualitative properties of AND data sets
| Data set | Measure | Total | ID | No ID | Author order | Full + short names | Fully disambiguated | External link | Analysed names |
|---|---|---|---|---|---|---|---|---|---|
| Han-DBLP | Publications | 8453 | | | – | – | – | – | – |
| | Records | 27,575 | 8431 | 19,144 | | | | | |
| | Distinct names | 7443 | 157 | 7286 | | | | | |
| | Distinct authors | 479 | | | | | | | |
| Culotta-REXA | Publications | 3007 | | | x | – | – | – | x |
| | Records | 9362 | 3015 | 6347 | | | | | |
| | Distinct names | 3411 | 298 | 3113 | | | | | |
| | Distinct authors | 324 | | | | | | | |
| Wang-Arnetminer | Publications | 6656 | | | x | – | – | – | – |
| | Records | 23,608 | 6729 | 16,879 | | | | | |
| | Distinct names | 8007 | 121 | 7886 | | | | | |
| | Distinct authors | 1257 | | | | | | | |
| KISTI-AD-E-01 (Full names) | Publications | 37,613 | | | x | x | – | x | – |
| | Records | 116,565 | 41,674 | 74,891 | | | | | |
| | Distinct names | 35,532 | 6250 | 29,282 | | | | | |
| | Distinct authors | 6921 | | | | | | | |
| KISTI-AD-E-01 (Short names) | Publications | 37,613 | | | x | x | – | x | – |
| | Records | 116,565 | 41,674 | 74,891 | | | | | |
| | Distinct names | 22,393 | 881 | 21,512 | | | | | |
| | Distinct authors | 6921 | | | | | | | |
| Cota-BDBComp | Publications | 361 | | | – | – | – | – | – |
| | Records | 1251 | 361 | 890 | | | | | |
| | Distinct names | 820 | 245 | 575 | | | | | |
| | Distinct authors | 205 | | | | | | | |
| Qian-DBLP | Publications | 6716 | | | – | – | – | x | – |
| | Records | 24,755 | 6716 | 18,039 | | | | | |
| | Distinct names | 8899 | 672 | 8227 | | | | | |
| | Distinct authors | 1200 | | | | | | | |
| SCAD-zbMATH (original names) | Publications | 28,321 | | | x | x | x | x | x |
| | Records | 33,810 | | | | | | | |
| | Distinct names | 4696 | | | | | | | |
| | Distinct authors | 2946 | | | | | | | |
| SCAD-zbMATH (short names) | Publications | 28,321 | | | x | x | x | x | x |
| | Records | 33,810 | | | | | | | |
| | Distinct names | 2919 | | | | | | | |
| | Distinct authors | 2946 | | | | | | | |
Han-DBLP sample record
| Pub-ID | – | |||
| Title | Information-Theoretic Analysis of Neural Coding | |||
| Venue | Journal of Computational Neuroscience | |||
| Year | – | |||
| Author-Pos. | ? | ? | ? | ? |
| Original name | – | Chandran Seshagiri | Charlotte M Gruner | Keith A Baggerly |
| Short name | D Johnson | – | – | – |
| Block | D Johnson | – | – | – |
| Author-ID | 8 | – | – | – |
Culotta-REXA sample record
| Pub-ID | – | |||
| Title | Accurate Building Structure Recovery from Aerial Imagery | |||
| Venue | – | |||
| Year | – | |||
| Author-Pos. | 1 | 2 | 3 | |
| Original name | Cocquerez, Jean Pierre | Cord, Mathieu | – | |
| Short name | – | – | Jordan, M | |
| Block | – | – | jordan_m | |
| Author-ID | – | – | MichelJordan | |
Wang-Arnetminer sample record
| Pub-ID | 738300 | |||
| Title | A dynamic learning model for on-line quality control using the TAGUCHI approach | |||
| Venue | Applied Artificial Intelligence | |||
| Year | 1992 | |||
| Author-Pos. | 1 | 2 | 3 | |
| Original name | – | – | Ram Ramesh | |
| Short name | H. Raghav Rao | M. V. Thirumurthy | – | |
| Block | – | – | R. Ramesh | |
| Author-ID | – | – | 5 | |
KISTI-AD-E-01 sample record
| Pub-ID | conf/pomc/AnceaumeDGS02 | |||
| Title | Publish/subscribe scheme for mobile networks | |||
| Venue | – | |||
| Year | 2002 | |||
| Author-Pos. | 1 | 2 | 3 | 4 |
| Original name | Emmanuelle Anceaume | Ajoy Kumar Datta | Maria Gradinariu | Gwendal Simon |
| Short name | E. Anceaume | A. Datta | M. Gradinariu | G. Simon |
| Block | – | A. Datta | – | |
| Author-ID | – | 2 | – | – |
Cota-BDBComp sample record
| Pub-ID | – | |||
| Title | Towards a web service for geographic and multidimensional processing | |||
| Venue | vi simposio brasileiro de geoinformatica | |||
| Year | – | |||
| Author-Pos. | ? | ? | ? | ? |
| Original name | – | – | – | joel da silva |
| Short name | v times | r fidalgo | r barros | – |
| Block | – | – | – | – |
| Author-ID | – | – | – | 114 |
Qian-DBLP sample record
| Pub-ID | 2545 | |||
| Title | An Approach to Composing Web Services with Context Heterogeneity | |||
| Venue | International Conference on Web Services | |||
| Year | 2009 | |||
| Author-Pos. | ? | ? | ? | ? |
| Original name | Hongwei Zhu | xitong li | stuart e. madnick | yushun fan |
| Short name | – | – | – | – |
| Block | – | – | – | – |
| Author-ID | 295 | – | – | |
Publications per Author. Identified authors with n publications
| # Publications | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | >10 |
|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Data set | ||||||||||||
| Han-DBLP | 66 | 99 | 39 | 36 | 17 | 16 | 17 | 9 | 5 |
| 479 | |
| 14 | 21 | 8 | 8 | 4 | 3 | 4 | 2 | 1 |
| 100 | ||
| Culotta-REXA |
| 44 | 14 | 10 | 6 | 2 | 1 | 3 | 4 | 1 | 26 | 324 |
|
| 14 | 4 | 3 | 2 | 1 | 0 | 1 | 1 | 0 | 8 | 100 | |
| Cota-BDBComp |
| 35 | 5 | 3 | 1 | 1 | 2 | 1 | 6 | 205 | ||
|
| 17 | 2 | 1 | 0 | 0 | 1 | 0 | 3 | 100 | |||
| Qian-DBLP |
| 175 | 119 | 90 | 60 | 48 | 39 | 24 | 26 | 11 | 168 | 1200 |
|
| 15 | 10 | 8 | 5 | 4 | 3 | 2 | 2 | 1 | 14 | 100 | |
| Wang-Arnetminer |
| 199 | 96 | 52 | 33 | 32 | 19 | 12 | 14 | 11 | 114 | 1257 |
|
| 16 | 8 | 4 | 3 | 3 | 2 | 1 | 1 | 1 | 9 | 100 | |
| KISTI-AD-E-01 |
| 1071 | 655 | 461 | 317 | 215 | 168 | 136 | 116 | 91 | 827 | 6921 |
|
| 15 | 9 | 7 | 5 | 3 | 2 | 2 | 2 | 1 | 12 | 100 |
Authors per Publication. Publications with n authors
| # Authors | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | >10 |
|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Data set | ||||||||||||
| Han-DBLP | 675 | 2410 |
| 1462 | 697 | 312 | 164 | 77 | 40 | 20 | 59 | 8453 |
| 8 | 29 |
| 17 | 8 | 4 | 2 | 1 | 0 | 0 | 1 | 100 | |
| Culotta-REXA | 445 |
| 729 | 466 | 189 | 102 | 54 | 22 | 13 | 13 | 36 | 3034 |
| 15 |
| 24 | 15 | 6 | 3 | 2 | 1 | 0 | 0 | 1 | 100 | |
| Cota-BDBComp | 11 |
| 95 | 75 | 37 | 25 | 5 | 7 | 1 | 1 | 361 | |
| 3 |
| 26 | 21 | 10 | 7 | 1 | 2 | 0 | 0 | 100 | ||
| Qian-DBLP | 306 | 1416 |
| 1548 | 863 | 337 | 172 | 78 | 51 | 31 | 64 | 6716 |
| 5 | 21 |
| 23 | 13 | 5 | 3 | 1 | 1 | 0 | 1 | 100 | |
| Wang-Arnetminer | 467 | 1548 |
| 1458 | 708 | 315 | 148 | 77 | 40 | 25 | 66 | 6656 |
| 7 | 23 |
| 22 | 11 | 5 | 2 | 1 | 1 | 0 | 1 | 100 | |
| KISTI-AD-E-01 | 3349 |
| 11268 | 6304 | 2738 | 1117 | 482 | 238 | 124 | 91 | 208 | 37,613 |
| 9 |
| 30 | 17 | 7 | 3 | 1 | 1 | 0 | 0 | 1 | 100 |
Author Name Homography. Names applying to n identified authors
| # Authors | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | >10 |
|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Data set | ||||||||||||
| Han-DBLP |
| 8 | 1 | 13 | 157 | |||||||
|
| 5 | 0 | 8 | 100 | ||||||||
| Culotta-REXA |
| 26 | 8 | 5 | 4 | 2 | 1 | 3 | 298 | |||
|
| 9 | 3 | 2 | 1 | 1 | 0 | 1 | 100 | ||||
| Cota-BDBComp |
| 8 | 1 | 1 | 245 | |||||||
|
| 3 | 0 | 0 | 100 | ||||||||
| Qian-DBLP |
| 93 | 37 | 20 | 18 | 9 | 3 | 2 | 3 | 4 | 10 | 672 |
|
| 14 | 6 | 3 | 3 | 1 | 0 | 0 | 0 | 1 | 1 | 100 | |
| Wang-Arnetminer | 24 | 12 | 7 | 10 | 5 | 7 | 6 | 6 | 7 | 3 |
| 121 |
| 20 | 10 | 6 | 8 | 4 | 6 | 5 | 5 | 6 | 2 |
| 100 | |
| KISTI-AD-E-01 (original names) |
| 679 | 138 | 50 | 33 | 14 | 4 | 2 | 2 | 2 | 1 | 6250 |
|
| 11 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 100 | |
| KISTI-AD-E-01 (short names) | 106 | 104 | 82 | 80 | 79 | 70 | 63 | 45 | 44 | 31 |
| 881 |
| 12 | 12 | 9 | 9 | 9 | 8 | 7 | 5 | 5 | 4 |
| 100 |
Author Name Variability. Identified authors appearing with n names
| # Names | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | >10 |
|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Data set | ||||||||||||
| Han-DBLP |
| 82 | 19 | 6 | 1 | 1 | 479 | |||||
|
| 17 | 4 | 1 | 0 | 0 | 100 | ||||||
| Culotta-REXA |
| 25 | 12 | 4 | 3 | 2 | 2 | 1 | 2 | 324 | ||
|
| 8 | 4 | 1 | 1 | 1 | 1 | 0 | 1 | 100 | |||
| Cota-BDBComp |
| 29 | 4 | 2 | 1 | 1 | 205 | |||||
|
| 14 | 2 | 1 | 0 | 0 | 100 | ||||||
| Qian-DBLP |
| 46 | 1 | 1200 | ||||||||
|
| 4 | 0 | 100 | |||||||||
| Wang-Arnetminer |
| 12 | 1257 | |||||||||
|
| 1 | 100 | ||||||||||
| KISTI-AD-E-01 (original names) |
| 556 | 58 | 13 | 2 | 6921 | ||||||
|
| 8 | 1 | 0 | 0 | 100 | |||||||
| KISTI-AD-E-01 (short names) |
| 6921 | ||||||||||
|
| 100 |
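The two tables above count, respectively, how many names apply to n identified authors (homography) and how many identified authors appear with n names (variability). The following is a minimal sketch of how such histograms could be computed from identified records, assuming each record is a `(name, author_id)` pair. The function names, the author IDs, and the variant "Tanaka, Kenji" are invented for illustration and are not taken from any of the data sets.

```python
from collections import Counter, defaultdict


def homography_histogram(records):
    """Map n -> number of distinct names that apply to n identified authors."""
    authors_per_name = defaultdict(set)
    for name, author_id in records:
        authors_per_name[name].add(author_id)
    return Counter(len(authors) for authors in authors_per_name.values())


def variability_histogram(records):
    """Map n -> number of identified authors that appear with n distinct names."""
    names_per_author = defaultdict(set)
    for name, author_id in records:
        names_per_author[author_id].add(name)
    return Counter(len(names) for names in names_per_author.values())


# Hypothetical identified records (name, author_id); IDs are made up.
records = [
    ("Simon, L.", "a1"),
    ("Simon, L.", "a2"),      # same name, different author: homography
    ("Tanaka, K.", "a3"),
    ("Tanaka, Kenji", "a3"),  # same author, different name: variability
]

homography = homography_histogram(records)
variability = variability_histogram(records)
```

Only records with an author ID can enter these counts, which is why the tables are restricted to identified authors.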
SCAD-zbMATH sample record
| Pub-ID | zbmath:0738.35028 | |||
| Title | The nonlinear heat equation | |||
| Venue | Proc. Math. Meet. in Honor of A. Dou, Madrid/Spain 1988, 251–258 (1989) | |||
| Year | 1989 | |||
| Author-Pos. | 1 | |||
| Original name | Vázquez Suárez, Juan Luis | |||
| Short name | Vázquez Suárez, J. | |||
| Block | – | |||
| Author-ID | vazquez.juan-luis | |||
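The sample records pair each original name with a short name, for example "Vázquez Suárez, Juan Luis" with "Vázquez Suárez, J." and "Ajoy Kumar Datta" with "A. Datta". The sketch below shows one way such short names could be derived. It is an assumption that reproduces these examples, not the procedure used to build any of the data sets, and the function name `short_name` is hypothetical.

```python
def short_name(original: str) -> str:
    """Reduce the given names to a single initial.

    Assumed conventions (for illustration only):
    - 'Last, Given Middle' -> 'Last, G.'
    - 'Given Middle Last'  -> 'G. Last'
    """
    if "," in original:
        last, given = (part.strip() for part in original.split(",", 1))
        return f"{last}, {given[0]}." if given else last
    parts = original.split()
    if len(parts) < 2:
        # A single token cannot be shortened any further.
        return original.strip()
    return f"{parts[0][0]}. {parts[-1]}"


scad_short = short_name("Vázquez Suárez, Juan Luis")
kisti_short = short_name("Ajoy Kumar Datta")
```

Shortening of this kind maps different full names onto the same string, which is consistent with the short-name variants in the tables having fewer distinct names than the original-name variants.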
Publications per Author. Identified authors with n publications
| # Publications | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | >10 |
|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Data set | ||||||||||||
| SCAD-zbMATH |
| 414 | 223 | 166 | 110 | 76 | 68 | 46 | 41 | 32 | 572 | 2946 |
|
| 14 | 8 | 6 | 4 | 3 | 2 | 2 | 1 | 1 | 19 | 100 |
Authors per Publication. Publications with n authors
| # Authors | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|
|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Data set | ||||||||||||
| SCAD-zbMATH |
| 4400 | 461 | 44 | 1 | 5 | 1 | 28,321 | ||||
|
| 16 | 2 | 0 | 0 | 0 | 0 | 100 |
Author Name Homography. Names applying to n identified authors
| # Authors | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | >10 |
|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Data set | ||||||||||||
| SCAD-zbMATH (original names) |
| 508 | 59 | 25 | 8 | 4 | 1 | 4696 | ||||
|
| 11 | 1 | 0 | 0 | 0 | 0 | 100 | |||||
| SCAD-zbMATH (short names) |
| 407 | 80 | 27 | 17 | 5 | 1 | 2 | 2919 | |||
|
| 14 | 3 | 1 | 1 | 0 | 0 | 0 | 100 |
Author Name Variability. Identified authors appearing with n names
| # Names | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | >10 |
|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Data set | ||||||||||||
| SCAD-zbMATH (original names) |
| 383 | 61 | 10 | 7 | 158 | 64 | 39 | 25 | 13 | 16 | 2946 |
|
| 13 | 2 | 0 | 0 | 5 | 2 | 1 | 0 | 0 | 0 | 100 | |
| SCAD-zbMATH (short names) |
| 138 | 83 | 47 | 31 | 15 | 7 | 1 | 3 | 1 | 1 | 2946 |
|
| 5 | 3 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 100 |
Naive Disambiguation Algorithm Performance (full data sets)
| Data set | Names | % Homogr. | % Variab. |
|
|
|
|---|---|---|---|---|---|---|
| Han-DBLP | Mixed | 14% | 23% |
| 23.34 | 37.08 |
| Culotta-REXA | Mixed | 16% | 16% | 48.03 |
| 62.26 |
| Wang-Arnetminer | Mixed | 80% | 01% |
| 47.11 | 63.92 |
| KISTI-AD-E-01 | Original | 15% | 09% |
| 90.94 | 93.40 |
| Short | 88% | 00% |
| 43.01 | 60.15 | |
| Cota-BDBComp | Mixed | 04% | 18% | 76.10 |
| 84.49 |
| Qian-DBLP | Mixed | 30% | 04% |
| 55.53 | 71.22 |
| SCAD-zbMATH | Original | 13% | 26% | 60.87 |
| 73.66 |
| Short | 18% | 11% | 82.47 |
| 86.88 |
Naive Disambiguation Algorithm Performance (Data Sub Sets)
| Data set SCAD-zbMATH | # Ident. Records | % Homogr. | % Variab. |
|
|
|
|---|---|---|---|---|---|---|
| Top Variable Authors | 8587 | 00% | 100% | 42.39 |
| 59.54 |
| Top Ambiguous Names | 1578 | 100% |
|
| 53.82 | 69.91 |
| Merged | 10,162 | 10% | 22% | 51.29 |
| 66.07 |
| “Simon, L.”, “Tanaka, K.” | 37 | 100% | 00% |
| 21.85 | 35.86 |