Wanli Liu, Rezarta Islamaj Doğan, Sun Kim, Donald C Comeau, Won Kim, Lana Yeganova, Zhiyong Lu, W John Wilbur.
Abstract
Log analysis shows that PubMed users frequently use author names in queries for retrieving scientific literature. However, author name ambiguity may lead to irrelevant retrieval results. To improve the PubMed user experience with author name queries, we designed an author name disambiguation system consisting of similarity estimation and agglomerative clustering. A machine-learning method was employed to score the features for disambiguating a pair of papers with ambiguous names. These features enable the computation of pairwise similarity scores to estimate the probability of a pair of papers belonging to the same author, which drives an agglomerative clustering algorithm regulated by 2 factors: name compatibility and probability level. With transitivity violation correction, high-precision author clustering is achieved by focusing on minimizing false-positive pairing. Disambiguation performance is evaluated with manual verification of random samples of pairs from clustering results. When compared with a state-of-the-art system, our evaluation shows that, among all the pairs, the lumping error rate drops from 10.1% to 2.2% for our system, while the splitting error rate rises from 1.8% to 7.7%. This results in an overall error rate of 9.9%, compared with 11.9% for the state-of-the-art method. Other evaluations based on gold standard data also show the increase in accuracy of our clustering. We attribute the performance improvement to the machine-learning method driven by a large-scale training set and the clustering algorithm regulated by a name compatibility scheme preferring precision. With integration of the author name disambiguation system into the PubMed search engine, the overall click-through rate of PubMed users on author name query results improved from 34.9% to 36.9%.
Year: 2013 PMID: 28758138 PMCID: PMC5530597 DOI: 10.1002/asi.23063
Source DB: PubMed Journal: J Assoc Inf Sci Technol ISSN: 2330-1635 Impact factor: 2.687
FIG. 1. User behavior statistics with different numbers of retrieved citations.
FIG. 2. Workflow of similarity computation and clustering.
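
The clustering stage described in the abstract (pairwise same-author probabilities driving agglomerative merging, regulated by name compatibility and a probability level) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the published algorithm: the function name `agglomerate`, the average-link score, the fixed `threshold` of 0.5, and the caller-supplied `prob` and `compatible` functions are all hypothetical, and the transitivity violation correction is not modeled.

```python
from itertools import combinations


def agglomerate(papers, prob, compatible, threshold=0.5):
    """Greedy agglomerative clustering sketch (illustrative only).

    papers:     list of hashable paper ids
    prob:       function (paper_a, paper_b) -> estimated same-author probability
    compatible: function (cluster_a, cluster_b) -> bool, the name check
    threshold:  assumed minimum average-link probability needed to merge
    """
    # Start with every paper in its own cluster.
    clusters = [frozenset([p]) for p in papers]
    while True:
        best, best_score = None, threshold
        for a, b in combinations(clusters, 2):
            # Factor 1: never merge clusters whose names are incompatible.
            if not compatible(a, b):
                continue
            # Factor 2: average pairwise probability across the two clusters
            # (average link is an assumption of this sketch).
            score = sum(prob(x, y) for x in a for y in b) / (len(a) * len(b))
            if score >= best_score:
                best, best_score = (a, b), score
        if best is None:
            # No compatible pair reaches the threshold: clustering is done.
            return clusters
        a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
```

Merging the highest-scoring compatible pair first and stopping at a threshold favors precision over recall, which is the trade-off the abstract reports; a lower threshold would merge more aggressively and raise the risk of lumping errors.
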
Availability of PubMed citation fields.
| Field | Title | Affiliation | Grant | Journal | Abstract | Substance | MeSH | Author | Date |
|---|---|---|---|---|---|---|---|---|---|
| Available | 100% | 53.2% | 8.5% | 100% | 48.7% | 48.5% | 91.3% | 97.6% | 100% |
Computed features from PubMed fields.
| Field | Field content | Stopword list | Feature |
|---|---|---|---|
| Title | text string | General | similarityoverall(similarity1) |
| Affiliation | text string | affiliation | similarityoverall(similarity2) |
| Grant | text string | General | similarityoverall(similarity3) |
| Journal | text string | General | similarityoverall(similarity4) |
| Abstract | text string | General | similarityoverall(similarity5) |
| Substance | text string | General | similarityoverall(similarity6) |
| MeSH | text string | MeSH | similarityoverall(similarity7) |
| Author | text string | | similarityname(similarity8) |
| Date | numerical | | yeardiff(similarity9) |
Note. The affiliation stopword list is the PubMed general stopword list with the addition of common affiliation terms. The MeSH stopword list is the PubMed general stopword list with the addition of common MeSH terms.
FIG. 3. Weight functions of PubMed field features.
FIG. 4. PAV functions of Huber score.
FIG. 5. Coauthor pair proportion and name space size (x-axis shows the floor of the natural logarithm of name space size).
Name information based clustering priority.
| Priority | First name (Cluster 1) | First name (Cluster 2) | Middle name (Cluster 1) | Middle name (Cluster 2) |
|---|---|---|---|---|
| 1 | Same | Same | Same | Same |
| 2 | Same | Same | Middle name | None |
| 3 | Full first name | First initial | Compatible | |
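
The priority scheme in the table above can be expressed as a small lookup function. This is a sketch under stated assumptions rather than the authors' implementation: `NameLabel`, `compatible`, and `merge_priority` are hypothetical names, names are assumed to be lowercased already, row 3's "Compatible" is read as applying to the pair of middle names, and the prefix rule used for compatibility is a guess. Name combinations that the table does not list return `None`.

```python
from typing import NamedTuple, Optional


class NameLabel(NamedTuple):
    first: str   # full first name or a single initial, e.g. "john" or "j"
    middle: str  # middle name or initial; "" when absent


def compatible(a: str, b: str) -> bool:
    """Assumed rule: an empty name matches anything; otherwise one
    name must be a prefix of the other (e.g. "w" and "william")."""
    return not a or not b or a.startswith(b) or b.startswith(a)


def merge_priority(x: NameLabel, y: NameLabel) -> Optional[int]:
    """Return 1, 2, or 3 following the table, or None if no row applies."""
    if x.first == y.first:
        if x.middle == y.middle:
            return 1  # same first name, same middle name
        if bool(x.middle) != bool(y.middle):
            return 2  # same first name, middle name present on one side only
        return None   # differing middle names: not covered by the table
    # Row 3: one side has a full first name, the other only its initial.
    full, short = (x, y) if len(x.first) > len(y.first) else (y, x)
    if (len(short.first) == 1 and len(full.first) > 1
            and full.first.startswith(short.first)
            and compatible(x.middle, y.middle)):
        return 3
    return None
```

A clustering loop could use the returned value to try priority-1 merges before priority-2 and priority-3 merges, so that the most specific name evidence is consumed first.
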
Comparing clustering results.
| Authority 2009 | Our clustering: Clustered (C) | Our clustering: Unclustered (U) |
|---|---|---|
| Clustered (C) 80.4% | CC: 65% | CU: 15.4% |
| Unclustered (U) 19.6% | UC: 1.6% | UU: 18% |
Pairwise error rate by human review.
| Category | CC | UU | CU | UC | PSER | PLER | Error rate |
|---|---|---|---|---|---|---|---|
| Authority 2009 | 2% | 6% | 57% | 46% | 9.3% | 12.5% | 11.9% = 1.8% + 10.1% |
| Our clustering | 2% | 6% | 43% | 54% | 23.1% | 3.2% | 9.9% = 7.7% + 2.2% |

| Category | Precision | Recall | F-score |
|---|---|---|---|
| Authority 2009 | 87.5% | 97.5% | 92.2% |
| Our clustering | 96.8% | 89.3% | 92.9% |
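
The error rates above can be retraced from the category shares in the clustering comparison table. The sketch below rests on a reading inferred from the numbers rather than stated in the tables: each percentage under CC, UU, CU, and UC is taken to be the share of reviewed pairs in that category on which the given system was wrong, PLER and PSER are taken to be the lumping and splitting errors divided by the share of pairs the system clustered or left unclustered, and precision is taken to be 1 − PLER. The function name `pairwise_rates` is hypothetical.

```python
# Share of all sampled pairs per category. First letter: Authority 2009,
# second letter: our clustering (C = clustered, U = unclustered).
SHARE = {"CC": 0.65, "CU": 0.154, "UC": 0.016, "UU": 0.18}

# Assumed reading: fraction of reviewed pairs in each category that the
# given system got wrong.
ERR = {
    "Authority 2009": {"CC": 0.02, "UU": 0.06, "CU": 0.57, "UC": 0.46},
    "Our clustering": {"CC": 0.02, "UU": 0.06, "CU": 0.43, "UC": 0.54},
}

# Which letter of the category code holds each system's decision.
POSITION = {"Authority 2009": 0, "Our clustering": 1}


def pairwise_rates(system):
    pos = POSITION[system]
    lumping = splitting = clustered = 0.0
    for category, share in SHARE.items():
        wrong = share * ERR[system][category]
        if category[pos] == "C":
            clustered += share
            lumping += wrong      # clustered, but different authors
        else:
            splitting += wrong    # unclustered, but same author
    unclustered = 1.0 - clustered
    true_pos = clustered - lumping
    precision = true_pos / clustered
    recall = true_pos / (true_pos + splitting)
    return {
        "lumping": lumping,
        "splitting": splitting,
        "error": lumping + splitting,
        "PLER": lumping / clustered,
        "PSER": splitting / unclustered,
        "precision": precision,
        "recall": recall,
        "f_score": 2 * precision * recall / (precision + recall),
    }
```

For Authority 2009, for example, the lumping error is 0.65 × 0.02 + 0.154 × 0.57 ≈ 0.101 and the splitting error is 0.18 × 0.06 + 0.016 × 0.46 ≈ 0.018, which sum to the reported 11.9%. Under the same reading, the function reproduces the remaining table entries for both systems after rounding to one decimal place.
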
Pairwise precision, recall, and F-scores for highly cited researchers.
| Researcher name | Our clustering: Precision | Our clustering: Recall | Our clustering: F-score | Authority 2009: Precision | Authority 2009: Recall | Authority 2009: F-score |
|---|---|---|---|---|---|---|
| Agarwal, Ashok | 0.882 | 0.405 | 0.556 | 0.778 | 0.991 | 0.871 |
| Alves, Cintia | 1.000 | 0.960 | 0.980 | 0.960 | 0.960 | 0.960 |
| Ammenwerth, Elske | 0.935 | 1.000 | 0.966 | 0.935 | 1.000 | 0.966 |
| Amorim, Antonio | 0.887 | 0.988 | 0.935 | 0.848 | 1.000 | 0.918 |
| Antman, Elliott | 0.974 | 0.916 | 0.944 | 0.965 | 0.970 | 0.967 |
| Bates, David | 1.000 | 0.778 | 0.875 | 1.000 | 0.822 | 0.902 |
| Buring, Julie | 1.000 | 1.000 | 1.000 | 1.000 | 0.984 | 0.992 |
| Camargo, Carlos | 0.915 | 0.993 | 0.952 | 0.743 | 1.000 | 0.853 |
| Cannon, Christopher | 0.977 | 0.955 | 0.966 | 0.964 | 0.968 | 0.966 |
| Carrell, Douglas | 0.890 | 1.000 | 0.942 | 0.890 | 1.000 | 0.942 |
| Durham, Stephen | 1.000 | 0.946 | 0.972 | 1.000 | 0.843 | 0.915 |
| Eisenberg, David | 0.977 | 0.630 | 0.766 | 0.973 | 0.463 | 0.628 |
| Epstein, Ronald | 1.000 | 0.807 | 0.893 | 1.000 | 0.653 | 0.790 |
| Hellstrom, Wayne | 0.908 | 0.981 | 0.943 | 0.901 | 1.000 | 0.948 |
| Hood, Kerenza | 0.821 | 0.739 | 0.778 | 1.000 | 0.718 | 0.836 |
| Hu, Frank | 0.977 | 0.703 | 0.818 | 0.983 | 0.679 | 0.803 |
| Ioannidis, John | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Jorgensen, Niels | 1.000 | 1.000 | 1.000 | 1.000 | 0.941 | 0.969 |
| Kaptchuk, Ted | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Kritchevsky, Stephen | 1.000 | 0.967 | 0.983 | 1.000 | 0.903 | 0.949 |
| Lako, Majlinda | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Leffers, Henrik | 0.964 | 1.000 | 0.982 | 0.964 | 1.000 | 0.982 |
| Libby, Peter | 0.921 | 0.885 | 0.903 | 0.911 | 0.966 | 0.938 |
| Manson, JoAnn | 1.000 | 0.962 | 0.981 | 1.000 | 0.993 | 0.996 |
| Ridker, Paul | 1.000 | 0.977 | 0.988 | 1.000 | 0.995 | 0.998 |
| Rifai, Nader | 1.000 | 0.966 | 0.983 | 0.989 | 0.994 | 0.992 |
| Rimm, Eric | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Rodriguez Martinez, Heriberto | 0.780 | 1.000 | 0.877 | 0.768 | 1.000 | 0.869 |
| Roewer, Lutz | 0.925 | 1.000 | 0.961 | 0.925 | 1.000 | 0.961 |
| Schneider, Peter | 0.984 | 0.923 | 0.953 | 0.983 | 0.792 | 0.877 |
| Simonsick, Eleanor | 0.985 | 1.000 | 0.992 | 0.985 | 1.000 | 0.992 |
| Stampfer, Meir | 0.992 | 0.914 | 0.952 | 0.990 | 0.938 | 0.963 |
| Sunde, Kjetil | 0.904 | 1.000 | 0.949 | 0.861 | 1.000 | 0.925 |
| Szibor, Reinhard | 0.924 | 1.000 | 0.961 | 0.901 | 1.000 | 0.948 |
| Ter Kuile, Feiko | 0.823 | 1.000 | 0.903 | 0.823 | 1.000 | 0.903 |
| Thomson, James | 0.946 | 0.895 | 0.920 | 1.000 | 0.797 | 0.887 |
| Vincent, Jean Louis | 1.000 | 0.876 | 0.934 | 1.000 | 0.957 | 0.978 |
| Weiss, Scott | 0.971 | 0.809 | 0.883 | 0.973 | 0.700 | 0.814 |
| Willett, Walter | 0.998 | 0.938 | 0.967 | 0.997 | 0.991 | 0.994 |
| Yen, Kathrin | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Average ± standard deviation | 0.957 ± 0.057 | 0.923 ± 0.123 | 0.934 ± 0.083 | 0.950 ± 0.071 | 0.925 ± 0.125 | 0.930 ± 0.076 |
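
Pairwise precision, recall, and F-score of the kind tabulated above are conventionally computed over all pairs of papers under one ambiguous name. The sketch below assumes that standard definition, which the table itself does not spell out: a pair counts as predicted-positive when both papers fall in the same predicted cluster, and as truly positive when both belong to the same gold-standard author. The function name `pairwise_scores` and the dictionary-based input format are hypothetical.

```python
from itertools import combinations


def pairwise_scores(predicted, gold):
    """predicted, gold: dicts mapping paper id -> cluster/author label.

    Returns (precision, recall, f_score) over all unordered paper pairs.
    """
    tp = fp = fn = 0
    for a, b in combinations(sorted(gold), 2):
        same_pred = predicted[a] == predicted[b]
        same_gold = gold[a] == gold[b]
        if same_pred and same_gold:
            tp += 1   # correctly clustered together
        elif same_pred:
            fp += 1   # lumping error
        elif same_gold:
            fn += 1   # splitting error
    # With no predicted (or no true) positive pairs, treat the score as 1.0.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score
```

The F-score is the harmonic mean of precision and recall. With precision 1.000 and recall 0.960 (the Alves, Cintia row under our clustering), it is 2 × 1.000 × 0.960 / 1.960 ≈ 0.980, matching the table.
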