Maria Karaulova, Abdullah Gök, Philip Shapira.
Abstract
This research article puts forward a method to identify the national heritage of authors based on the morphology of their surnames. Most studies in the field use variants of dictionary-based surname methods to identify ethnic communities, an approach that suffers from methodological limitations. Using the public file of ORCID (Open Researcher and Contributor ID) identifiers in 2015, we developed a surname-based identification method and applied it to infer Russian heritage from suffix-based morphological regularities. The method was developed conceptually and tested in an undersampled control set. Identification based on surname morphology was then complemented by using first-name data to eliminate false-positive results. The method achieved 98% precision and 94% recall rates, superior to most other methods that use name data. The procedure can be adapted to identify the heritage of a variety of national groups with morphologically regular naming traditions. We elaborate on how the method can be employed to overcome long-standing limitations of using name data in bibliometric datasets. This identification method can contribute to advancing research in scientific mobility and migration, patenting by certain groups, publishing and collaboration, transnational and scientific diaspora links, and the effects of diversity on the innovative performance of organizations, regions, and countries.
Year: 2019 PMID: 31763359 PMCID: PMC6853192 DOI: 10.1002/asi.24104
Source DB: PubMed Journal: J Assoc Inf Sci Technol ISSN: 2330-1635 Impact factor: 2.687
Figure 1. Publication records with ORCID identifiers listing affiliations in Russia (source: Web of Science, N = 406,830).
Types of Russian surnames (adapted from Unbegaun, 1972).
| Surname type | Definition | Examples | General popularity |
|---|---|---|---|
| Patronymic/metronymic | Surnames are derived from a name, place, or a profession. Low variability. | Ivanov, Nikitina, Vyazemskiy | Overwhelming majority |
| Adjectival | As above, derived from adjectives. Low variability. | Chernykh | Rare |
| Substantival | Surnames derived from nouns. High variability. | Medved, Golub | Negligible |
| Surnames of foreign origin | Russianized surnames of foreign origin. Varying popularity and some suffix variability depending on the origin. | Landau, Bidon'ko | Rare |
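
The suffix regularity of the first two surname types can be illustrated with a minimal sketch. The suffix list below is an assumption read off the example surnames in the table (Ivanov, Nikitina, Vyazemskiy, Chernykh); it is not the authors' rule set, and the function name `has_regular_suffix` is hypothetical.

```python
# Illustrative suffixes taken from the table's example surnames only.
# ASSUMPTION: this is not the rule set used in the article.
ILLUSTRATIVE_SUFFIXES = ("ov", "ina", "skiy", "ykh")


def has_regular_suffix(surname: str, suffixes=ILLUSTRATIVE_SUFFIXES) -> bool:
    """Return True if the surname ends with one of the given suffixes."""
    # str.endswith accepts a tuple and matches any of its members.
    return surname.lower().endswith(tuple(suffixes))


examples = ["Ivanov", "Nikitina", "Vyazemskiy", "Chernykh",
            "Medved", "Golub", "Landau", "Bidon'ko", "Medina"]
for name in examples:
    print(name, has_regular_suffix(name))
```

With this toy list, the patronymic and adjectival examples match, while the substantival and foreign-origin examples (Medved, Golub, Landau, Bidon'ko) do not, which suggests why those types need separate handling. The Spanish surname Medina also matches on "ina"; that is the kind of false positive the abstract says the first-name step is meant to eliminate.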
Figure 2. Russian heritage identification procedure sequence. Note: Rectangle shapes indicate a rule‐based search query; oval shapes indicate a dictionary‐based search. A dotted rectangle signifies mutually nonexclusive selection; bold lines signify choices made based on an F‐measure increment when applied to the testing dataset (source: authors).
Proportionate undersampling strategy of non‐Russian names for the control set.
| Country | Share in the ORCID dataset | Author names sampled |
|---|---|---|
| USA | 21.79% | 217 |
| Great Britain | 10.41% | 104 |
| Spain | 8.13% | 81 |
| Brazil | 6.18% | 61 |
| India | 5.85% | 58 |
| Italy | 5.20% | 52 |
| Canada | 4.88% | 49 |
| Australia | 3.90% | 39 |
| Portugal | 3.58% | 36 |
| Germany | 3.25% | 33 |
| France | 2.93% | 29 |
| Canada | 2.28% | 23 |
| Sweden | 2.28% | 23 |
| South Korea | 1.95% | 19 |
| Others | 14.47% | 158 |
| Total | 100% | 1000 |
Test dataset results (source: ORCID, calculations by the authors; N = 2000).
| Rule | Precision | Recall | F-measure | Change in F-measure | Decision |
|---|---|---|---|---|---|
| 1.a. Russian Lexicological Morphology | 91.05% | 84.4% | 87.60% | | Use |
| 1.b. Popular surnames (public domain) | 97.45% | 30.6% | 46.58% | | |
| 1.c. Zhuravlev ( | 100% | 27.3% | 42.89% | | |
| 2.a. Lexicological Morphology of Surnames with Origin in Soviet Countries | 80.87% | 93.4% | 86.68% | 1.08% increase from 1.a. | |
| 2.b. Lexicological Morphology of Selected Surnames with Origin in Soviet Countries | 90.36% | 90.9% | 90.63% | 3.03% increase from 1.a. | Use |
| 3.a. Russian Surnames of Jewish or Germanic Origin | 76.54% | 93.3% | 84.9% | 5.73% decrease from 2.b. | |
| 3.b. Irregular Russian Surnames | 90.76% | 95.3% | 92.98% | 2.35% increase from 2.b. | Use |
| 3.c. Russian Surnames of Romanian and Baltic Origin | 90.27% | 90.9% | 90.58% | 0.05% decrease from 2.b. | |
| 3.d. Irregular Surnames of Top‐Cited Scientists | 89.45% | 92.4% | 90.90% | 0.27% increase from 2.b. | Use |
| 3.e. Combined Identification with the two selected “Further Enrichments” | 89.87% | 96.7% | 93.16% | 2.53% increase from 2.b. | Use |
| 4.a. Russian Given Names | 98.12% | 94% | 96.02% | 2.86% increase from 3.e. | Use |
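
The third percentage in each row of the table above is consistent with the balanced F-measure, the harmonic mean of precision and recall. The sketch below recomputes it for three rows as a check; that the authors used exactly this balanced form is an assumption inferred from the reported numbers, and the function name `f_measure` is illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table, expressed as fractions.
rows = {
    "1.a": (0.9105, 0.844),
    "3.e": (0.8987, 0.967),
    "4.a": (0.9812, 0.94),
}

for rule, (p, r) in rows.items():
    print(f"{rule}: {100 * f_measure(p, r):.2f}%")
```

This prints 87.60%, 93.16%, and 96.02%, matching the values reported for rules 1.a, 3.e, and 4.a.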
Number of Russian heritage researchers for different types of ORCID users.
| Category of ORCID user | Total users in the group | Russian heritage users |
|---|---|---|
| (1) ORCID users with addresses in Russia who may have reported an international address in their career history | 8,799 | 7,501 (85.25%) |
| (2) ORCID users who reported Russian addresses, but no international addresses | 7,378 | 6,559 (88.90%) |
| (3) ORCID users who reported an international address and may have reported a Russian address | 287,484 | 8,561 (2.98%) |
| (4) ORCID users who reported at least one international and one Russian address | 1,473 | 994 (67.48%) |
| (5) ORCID users who reported an international address, but no Russian address | 286,030 | 7,718 (2.70%) |
Note. Numbers are calculated as a percentage of total users in each group (source: ORCID, calculations by the authors. N = 294,746).
Further applicability of the surname morphology method to identify heritage.
| Applicable | Partially applicable | |
|---|---|---|
| Japanese | Turkish | Indian |
| Finnish | German | US ethnicities and distinctions within groups |
| Iranian | Vietnamese | French and French Canadian |
| Italian | Estonian | Portuguese (in Europe only) |
| Greek | Nigerian (varied across ethnic groups) | |