A "Roziah" by any other name: a simple Bayesian method for determining ethnicity from names.

Kridaraan Komahan, Daniel D Reidpath.   

Abstract

Correct identification of ethnicity is central to many epidemiologic analyses. Unfortunately, ethnicity data are often missing. Successful classification typically relies on large databases (n > 500,000 names) of known name-ethnicity associations. We propose an alternative naïve Bayesian strategy that uses substrings of full names. Name and ethnicity data for Malays, Indians, and Chinese were provided by a health and demographic surveillance site operating in Malaysia from 2011 to 2013. The data comprised a training data set (n = 10,104) and a test data set (n = 9,992). Names were split into contiguous 3-letter substrings, and these were used as the basis for the Bayesian analysis. Performance was evaluated on both data sets using Cohen's κ and measures of sensitivity and specificity. There was little difference between the classification performance in the training and test data (κ = 0.93 and 0.94, respectively). For the test data, the sensitivity values for the Malay, Indian, and Chinese names were 0.997, 0.855, and 0.932, respectively, and the specificity values were 0.907, 0.998, and 0.997, respectively. A naïve Bayesian strategy for the classification of ethnicity is promising. It performs at least as well as more sophisticated approaches. The possible application to smaller data sets is particularly appealing. Further research examining other substring lengths and other ethnic groups is warranted.
© The Author 2014. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
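
The approach described in the abstract (3-letter substrings of a full name fed to a naïve Bayesian classifier, scored with Cohen's κ) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not say how spaces, punctuation, or unseen substrings are handled, so the choices here (letters only, lowercased, add-one smoothing) are assumptions, as are all names such as `trigrams`, `TrigramNaiveBayes`, and `cohen_kappa`. The six training names are invented for illustration and are not taken from the study data.

```python
import math
from collections import Counter, defaultdict


def trigrams(name):
    """Contiguous 3-letter substrings of a name.

    Assumption: non-letters (spaces, slashes) are dropped and case is ignored.
    """
    letters = "".join(ch for ch in name.lower() if ch.isalpha())
    return [letters[i:i + 3] for i in range(len(letters) - 2)]


class TrigramNaiveBayes:
    """Multinomial naive Bayes over name trigrams (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant: an assumption, not from the paper
        self.class_counts = Counter()            # names seen per class
        self.gram_counts = defaultdict(Counter)  # trigram counts per class
        self.vocab = set()                       # all trigrams seen in training

    def fit(self, names, labels):
        for name, label in zip(names, labels):
            self.class_counts[label] += 1
            for gram in trigrams(name):
                self.gram_counts[label][gram] += 1
                self.vocab.add(gram)
        return self

    def log_scores(self, name):
        """Unnormalised log posterior for each class."""
        total = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            counts = self.gram_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n / total)  # log prior
            for gram in trigrams(name):
                score += math.log((counts[gram] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, name):
        scores = self.log_scores(name)
        return max(scores, key=scores.get)


def cohen_kappa(truth, predicted):
    """Cohen's kappa: agreement between two labelings, corrected for chance."""
    n = len(truth)
    observed = sum(t == p for t, p in zip(truth, predicted)) / n
    t_counts, p_counts = Counter(truth), Counter(predicted)
    expected = sum(t_counts[c] * p_counts[c] for c in t_counts) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented toy training data (not from the study).
train_names = [
    "Roziah binti Ahmad", "Siti Aminah binti Abdullah",
    "Kumar a/l Subramaniam", "Priya a/p Raman",
    "Tan Wei Ling", "Lim Chee Keong",
]
train_labels = ["Malay", "Malay", "Indian", "Indian", "Chinese", "Chinese"]

model = TrigramNaiveBayes().fit(train_names, train_labels)
print(model.predict("Roziah binti Hassan"))
print(cohen_kappa(train_labels, [model.predict(n) for n in train_names]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a substring that was never seen for a class from forcing that class's probability to zero. Sensitivity and specificity per group could be computed from the same pairs of true and predicted labels.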

Keywords:  classification; ethnicity; naïve Bayesian approach

Year:  2014        PMID: 24944286     DOI: 10.1093/aje/kwu129

Source DB:  PubMed          Journal:  Am J Epidemiol        ISSN: 0002-9262            Impact factor:   4.897


  2 in total

1.  Surnames and ancestry in Brazil.

Authors:  Leonardo Monasterio
Journal:  PLoS One       Date:  2017-05-08       Impact factor: 3.240

2.  HDSS Profile: The South East Asia Community Observatory Health and Demographic Surveillance System (SEACO HDSS).

Authors:  Uttara Partap; Elizabeth H Young; Pascale Allotey; Ireneous N Soyiri; Nowrozy Jahan; Kridaraan Komahan; Nirmala Devarajan; Manjinder S Sandhu; Daniel D Reidpath
Journal:  Int J Epidemiol       Date:  2017-10-01       Impact factor: 7.196
