SNP@Ethnos: a database of ethnically variant single-nucleotide polymorphisms.

Jungsun Park, Sohyun Hwang, Yong Seok Lee, Sang-Cheol Kim, Doheon Lee.

Abstract

Inherited genetic variation plays a critical but largely uncharacterized role in human differentiation. The completion of the International HapMap Project makes it possible to identify loci that may cause human differentiation. We have devised an approach to find such ethnically variant single-nucleotide polymorphisms (ESNPs) from the genotype profile of the populations included in the International HapMap database. We selected ESNPs using the nearest shrunken centroid method (NSCM), and performed multiple tests for genetic heterogeneity and frequency spectrum on genes having ESNPs. The function and disease association of the selected SNPs were also annotated. This resulted in the identification of 100 736 SNPs that appeared uniquely in each ethnic group. Of these SNPs, 1009 were within disease-associated genes, and 85 were predicted as damaging using the Sorting Intolerant From Tolerant system. This study resulted in the creation of the SNP@Ethnos database, which is designed to make this type of detailed genetic variation approach available to a wider range of researchers. SNP@Ethnos is a public database of ESNPs with annotation information that currently contains 100 736 ESNPs from 10 138 genes, and can be accessed at http://variome.net and http://bioportal.net/ or directly at http://bioportal.kobic.re.kr/SNPatETHNIC/.

Year: 2006 PMID: 17135185 PMCID: PMC1747186 DOI: 10.1093/nar/gkl962

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Identifying genetic variations that give rise to human differences is one of the most interesting issues in human evolution. Many of the related natural-selection and selective-sweep studies have produced interesting findings (1–3), and the completion of the International HapMap Project (4) has increased the popularity of this type of study. According to the HapMap reports, candidate loci in which selection has occurred can be identified using long-range haplotype testing (5). However, previous studies did not measure nearly fixed variations despite evidence that rare variants with low minor-allele frequencies also contribute to observed variations in complex human traits (6,7). Therefore, in order to identify ethnically variant single-nucleotide polymorphisms (ESNPs), we devised a new systematic approach based on the nearest shrunken centroid method (NSCM) (8) that is not affected by the minor-allele frequency. The present study compared the genotype profiles of three ethnic groups: Yoruba in Ibadan, Nigeria (YRI), a combination of Japanese in Tokyo (JPT) and Han Chinese in Beijing (CHB) (CHB+JPT), and Utah residents with ancestry from northern and western Europe (CEU). The study identified 100 736 SNPs that could classify the ethnic groups based on the NSCM (8). Of those SNPs, 5515 were in well-known loci of natural selection (e.g. Duffy and lactase genes) and disease-associated genes. Using the Sorting Intolerant From Tolerant system (9), 85 coding nonsynonymous ESNPs were predicted as damaging, indicating that these SNPs may be highly relevant in disease research. This study resulted in the creation of the SNP@Ethnos database, which contains genetic-variation information for use in human differentiation studies.

DATABASE CONSTRUCTION

Data source

The International HapMap Phase I release #16.c genotype dataset was downloaded from the project web site. Unrelated individuals were selected for examination, comprising 60 CEU, 45 CHB, 44 JPT and 60 YRI samples. Our analysis involved the examination of 3 565 483 common SNPs.

Data processing

Pre-processing

Data pre-processing involved two steps: missing-allele imputation and the replacement of genotype features. We used the R package pamr, which does not allow missing data and accepts only numeric input, so we had to impute the missing values and replace genotype features with numbers. For the missing-allele imputation, we replaced the missing values with the major allele of each ethnic group class (10): CEU, YRI and CHB+JPT. The proportion of missing values was 0.50%. For processing convenience, genotype features were coded using four numbers: 1, homo-reference allele; 2, hetero allele; 3, homo-other allele; and 0, missing value. The data processing is outlined in Figure 1.
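The two pre-processing steps can be illustrated with a minimal Python sketch. It is not the authors' code: it assumes that the non-missing genotype codes are 1, 2 and 3, that 0 marks a missing value, and that "major allele of each class" can be approximated by the most frequent non-missing code of a SNP within that class. The function name `impute_by_class_mode` is hypothetical.

```python
import numpy as np

MISSING = 0  # code for a missing genotype, as described in the text


def impute_by_class_mode(genotypes, labels):
    """Fill missing codes with the most frequent code of the same SNP in the same class.

    genotypes: (n_snps, n_samples) integer array with codes 0 (missing), 1, 2, 3
    labels:    length n_samples sequence of class names (e.g. "CEU", "YRI", "CHB+JPT")

    Assumption: ties, and SNPs that are entirely missing in a class,
    resolve to the lowest code (1).
    """
    genotypes = np.asarray(genotypes)
    labels = np.asarray(labels)
    imputed = genotypes.copy()
    for cls in np.unique(labels):
        cols = np.flatnonzero(labels == cls)
        block = genotypes[:, cols]
        # Count how often each of the codes 1, 2, 3 occurs per SNP in this class.
        counts = np.stack([(block == code).sum(axis=1) for code in (1, 2, 3)], axis=1)
        mode = counts.argmax(axis=1) + 1
        # Replace only the missing entries, leaving observed genotypes untouched.
        imputed[:, cols] = np.where(block == MISSING, mode[:, None], block)
    return imputed


example = [[1, 0, 1, 3, 3, 0],
           [2, 2, 0, 0, 1, 1]]
classes = ["CEU", "CEU", "CEU", "YRI", "YRI", "YRI"]
print(impute_by_class_mode(example, classes))
```

Imputing within each class, rather than across all samples, avoids pulling a class-specific SNP toward the pooled majority genotype.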

Figure 1

The data processing strategy for identifying ethnically variant SNPs and their functional annotations. Ethnically variant SNPs (ESNPs) were identified using the nearest shrunken centroid method (NSCM) of the R package pamr. Gene mapping was performed by combining three databases: the University of California, Santa Cruz, Genome Browser hg17, HUGO Gene Nomenclature Committee and dbSNP (build 125). Multiple tests were performed for genetic heterogeneity and frequency spectrum on genes having ESNPs. Links are provided for the following online databases: dbSNP, SNP@Domain, Entrez Gene, Online Mendelian Inheritance in Man (OMIM), Haplotter, International HapMap and Human Gene Mutation Database (HGMD).

ESNP selection

ESNPs were identified using the NSCM of the R package pamr. This method has been proposed as a suitable approach for solving the classification problem when there are a large number of features from which to predict classes and a relatively small number of cases, and it is important to identify which features contribute most to the classification (8). A detailed mathematical explanation of NSCM is as follows. Let $x_{ij}$ be the genotype for SNPs $i$ (= 1, 2, … , p) and samples $j$ (= 1, 2, … , n). We have classes 1, 2, … , K, and let $C_k$ be indices of the $n_k$ samples in class $k$. The i-th component of the centroid for class $k$ is $\bar{x}_{ik} = \sum_{j \in C_k} x_{ij} / n_k$, which gives the mean genotype value in class $k$ for SNP $i$, and the i-th component of the overall centroid is $\bar{x}_i = \sum_{j=1}^{n} x_{ij} / n$. In words, we shrink the class centroids toward the overall centroids after standardizing by the within-class SD for each SNP. Let

$$d_{ik} = \frac{\bar{x}_{ik} - \bar{x}_i}{m_k \,(s_i + s_0)} \tag{1}$$

where $s_i$ is the pooled within-class SD for SNP $i$ and $m_k$ makes $m_k \cdot s_i$ equal to the estimated standard error of the numerator in $d_{ik}$. In the denominator, $s_0$ is a positive constant equal to the median of the $s_i$ values over the set of SNPs. Thus $d_{ik}$ is a t statistic for SNP $i$ that compares class $k$ to the overall centroid. This method shrinks each $d_{ik}$ toward zero, giving

$$d'_{ik} = \operatorname{sign}(d_{ik}) \, (|d_{ik}| - \Delta)_{+} \tag{2}$$

where $\Delta$ is the shrinkage threshold and $(\cdot)_{+}$ denotes the positive part, and yielding shrunken centroids or prototypes

$$\bar{x}'_{ik} = \bar{x}_i + m_k \,(s_i + s_0)\, d'_{ik} \tag{3}$$

Specifically, if for a SNP $i$ the value of $d_{ik}$ is shrunk to zero for all classes $k$, then the centroid for SNP $i$ is $\bar{x}_i$, and is the same for all classes. Thus SNP $i$ does not contribute to the nearest-centroid computation. As in the above explanation, the present study involved a large number of features from 1 007 376 SNPs and a relatively small number of classes (three ethnic groups). The use of standard statistical methods may cause problems in multiple comparisons because of the huge number of SNPs (11). For example, if 10 000 SNPs are discovered using those methods with a significance level of 5%, it is likely that 500 of them will be false-positive errors. NSCM has the desirable property that many of the SNPs that do not contribute to the nearest-centroid computation are eliminated from the class prediction.
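The shrinkage step described above can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not the pamr code used in the study. The function name `shrunken_centroids` and the threshold argument `delta` are hypothetical, and the factor m_k is assumed to be sqrt(1/n_k − 1/n), the value for which m_k·s_i is the standard error of the difference between a class centroid and the overall centroid.

```python
import numpy as np


def shrunken_centroids(x, labels, delta):
    """Soft-threshold the standardized class centroids of each SNP.

    x:      (p, n) array of numeric genotype codes (p SNPs, n samples)
    labels: length n sequence of class names
    delta:  non-negative shrinkage threshold (hypothetical parameter name)

    Returns (classes, d_shrunk, centroids_shrunk), the last two of shape (p, K).
    Assumes at least one SNP has non-zero within-class variance, so that s0 > 0.
    """
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    p, n = x.shape
    K = len(classes)

    overall = x.mean(axis=1)          # overall centroid per SNP
    centroids = np.empty((p, K))      # class centroid per SNP
    m = np.empty(K)
    ss = np.zeros(p)                  # pooled within-class sum of squares
    for k, cls in enumerate(classes):
        xk = x[:, labels == cls]
        n_k = xk.shape[1]
        centroids[:, k] = xk.mean(axis=1)
        ss += ((xk - centroids[:, [k]]) ** 2).sum(axis=1)
        m[k] = np.sqrt(1.0 / n_k - 1.0 / n)   # assumed form of m_k

    s = np.sqrt(ss / (n - K))         # pooled within-class SD per SNP
    s0 = np.median(s)                 # positive constant guarding small s
    scale = (s + s0)[:, None] * m[None, :]

    d = (centroids - overall[:, None]) / scale
    d_shrunk = np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)
    centroids_shrunk = overall[:, None] + scale * d_shrunk
    return classes, d_shrunk, centroids_shrunk


# SNP 0 separates the three classes; SNP 1 has identical class means.
x = [[1, 1, 3, 3, 2, 2],
     [1, 2, 1, 2, 1, 2]]
labels = ["A", "A", "B", "B", "C", "C"]
classes, d_shrunk, shrunk = shrunken_centroids(x, labels, delta=1.0)
print(classes, d_shrunk, shrunk, sep="\n")
```

A SNP whose row of `d_shrunk` is zero for every class has the same shrunken centroid in all classes, so it drops out of the nearest-centroid rule; raising `delta` removes more SNPs.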

Gene mapping and multiple analyses

Gene mapping was performed by combining three databases: the University of California, Santa Cruz, Genome Browser hg17 (12), HUGO Gene Nomenclature Committee (13) and dbSNP (build 125) (14). SNP sequence files were constructed for all genes using the gene mapping information and International HapMap genotype data for tests for genetic heterogeneity and frequency spectrum. The following tests were performed: Hudson, Kreitman and Aguade (HKA) (15) and Fst (16) for genetic heterogeneity and Tajima's D (17), Fu and Li's D (18) for frequency spectrum. The glutamate receptor and iduronate 2-sulfatase (19) genes were used as reference natural-selection loci. It should be noted that these statistics are affected by natural selection and by the frequency spectrum associated with demographic processes in a population (e.g. population expansion).
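As a rough illustration of the genetic-heterogeneity statistics, the sketch below computes a per-SNP Fst from class allele frequencies using the simple heterozygosity form (H_T − H_S)/H_T. This is only one of several Fst estimators and is not necessarily the one used in reference (16); the function name `fst_biallelic` and its arguments are hypothetical.

```python
def fst_biallelic(freqs, sizes):
    """Fst for one biallelic SNP from per-population allele frequencies.

    freqs: allele frequency of the reference allele in each population
    sizes: sample size of each population, used as weights

    Uses Fst = (H_T - H_S) / H_T, where H_T is the expected heterozygosity
    of the pooled population and H_S is the weighted mean of the
    within-population expected heterozygosities.
    """
    total = sum(sizes)
    p_bar = sum(p * n for p, n in zip(freqs, sizes)) / total
    h_t = 2.0 * p_bar * (1.0 - p_bar)
    if h_t == 0.0:
        return 0.0  # monomorphic SNP: no differentiation to measure
    h_s = sum(2.0 * p * (1.0 - p) * n for p, n in zip(freqs, sizes)) / total
    return (h_t - h_s) / h_t


# Equal frequencies give no differentiation; fixed differences give Fst = 1.
print(fst_biallelic([0.5, 0.5, 0.5], [60, 89, 60]))
print(fst_biallelic([0.0, 1.0], [10, 10]))
```

Values near 0 indicate similar allele frequencies across populations, while values near 1 indicate nearly fixed differences.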

Database contents and availability

SNP@Ethnos provides functional information on ESNPs, with natural-selection and disease-association annotation of the genes in which ESNPs are located. The SNP information in the search results consists of the NSCM score, minor-allele frequency, chromosomal location and the functional annotation. The NSCM score is a discriminating value from Equation 2, which is small if there is little difference between classes or if the variation of the SNP distribution is large. For example, three similar scores for CHB+JPT, CEU and YRI indicate that the SNP is not critical, whereas one score differing from the other two indicates that the SNP is specific to that population. The functional annotation of SNPs provides a link to the SNP@Domain database (20) when the SNP is in the coding region of a protein. The gene information in the search results consists of the statistical results from the tests for genetic heterogeneity and frequency spectrum, and annotation links to the Online Mendelian Inheritance in Man (OMIM) database (21), the Human Gene Mutation Database (22) and the genome browser. The results of the multiple tests include Fst, HKA test, Tajima's D and Fu and Li's D values. Some general guidelines for interpreting these statistics are given on the FAQ page of our web site. Example results from database searching are shown in Figure 2.

Figure 2

Example results of a SNP@Ethnos database search. (A) The gene information in the search results consists of statistical values for the neutrality test and annotation links to the OMIM database and the HGMD and genome browser. The SNP information in the search results consists of the NSCM score, minor-allele frequency, chromosomal location and the functional annotation. (B) The genome browser of SNP@Ethnos shows the location of ESNPs.

The database contains 100 736 ESNPs and 10 138 annotated genes, where 1009 of the latter have OMIM entries, while 436 SNPs are in protein domains. Some of the SNPs are found in disease-associated genes and cause functional protein defects. There are many reports on ethnic variations in genes associated with disease (23–27). Using SNP@Ethnos, the present study identified an interesting ESNP in a tyrosinase gene associated with albinism; this nonsynonymous ESNP may cause a functional defect, but this has yet to be shown. SNP@Ethnos appears to be useful for this type of genetic variation investigation.

Data characteristics

The identified ESNPs comprised 73.95% YRI-specific, 15.25% CEU-specific and 6.80% CHB+JPT-specific SNPs and 4.00% ethnically different SNPs. All ESNPs were evenly distributed across the chromosomes. However, when the ESNPs were restricted to the top 1%, there were three times more SNPs on the X chromosome (152) than on any other chromosome. These findings are consistent with those of other recent studies showing that the extent of population differentiation is similar across the autosomes, but higher on the X chromosome (Fst = 0.21), and that the number of low-frequency alleles is smaller for CEU and CHB+JPT samples than for YRI samples (5). These patterns may be attributable to bottlenecks in the history of the non-YRI populations (5). The Gene Ontology (GO) class was analyzed after the genes of the 100 736 selected SNPs were annotated with the GO database (28) using the OntoExpress (29) and FatiGO (30) programs. Searches using both OntoExpress and FatiGO resulted in some of the genes being assigned to biological processes (cell communications and cellular physiological processes) and cellular-component (membranes) classes. Comparison of ESNPs with non-ESNPs using the FatiGO program revealed significant correlations with biological processes (responses to biotic stimulus, localization and responses to external stimulus) and cellular components (membranes, voltage-gated calcium channel complexes and the extracellular matrix), suggesting population-specific adaptive polymorphisms. Detailed results and their probability values are given on the Statistics page of our web site. There were 82 SNPs that could be used to classify the ethnic groups perfectly; three of these (3.7%) were nonsynonymous coding-region SNPs, a high proportion compared with the original SNP functional distribution (0.84%). This suggests that the ESNPs play an important role in protein function.

DATA ACCESS AND VISUALIZATION

The SNP@Ethnos database can be queried using gene symbols, RefSeq mRNA IDs, dbSNP rs numbers and lists containing multiple genes. Regardless of the query type, the results are displayed in the same format. For visualization, SNP@Ethnos offers a generic genome browser (31) that displays an overview of chromosomes, contigs, genes, mRNAs and ESNPs. This genome browser can be accessed via the gene region of the results page. Moreover, the database provides an open-architecture web page using a wiki interface for data access. A user id for accessing the system is available from the authors on request. After logging in, users can submit comments and feedback. The content of web pages can be edited by users who wish to contribute, correct or add information.

30 in total

1. Genomic regions exhibiting positive selection identified from dense genotype data.

Authors: Christopher S Carlson; Daryl J Thomas; Michael A Eberle; Johanna E Swanson; Robert J Livingston; Mark J Rieder; Deborah A Nickerson
Journal: Genome Res Date: 2005-11 Impact factor: 9.043

2. A haplotype map of the human genome.

Authors:
Journal: Nature Date: 2005-10-27 Impact factor: 49.962

3. Genomic scans for selective sweeps using SNP data.

Authors: Rasmus Nielsen; Scott Williamson; Yuseob Kim; Melissa J Hubisz; Andrew G Clark; Carlos Bustamante
Journal: Genome Res Date: 2005-11 Impact factor: 9.043

4. Ethnic variation of Fc gamma receptor polymorphism in Sami and Norwegian populations.

Authors: Oivind Torkildsen; Egil Utsi; Svein Ivar Mellgren; Hanne F Harbo; Christian A Vedeler; Kjell-Morten Myhr
Journal: Immunology Date: 2005-07 Impact factor: 7.397

5. Human Gene Mutation Database.

Authors: D N Cooper; M Krawczak
Journal: Hum Genet Date: 1996-11 Impact factor: 4.132

6. Online Mendelian Inheritance in Man (OMIM).

Authors: A Hamosh; A F Scott; J Amberger; D Valle; V A McKusick
Journal: Hum Mutat Date: 2000 Impact factor: 4.878

7. Statistical tests of neutrality of mutations.

Authors: Y X Fu; W H Li
Journal: Genetics Date: 1993-03 Impact factor: 4.562

8. Patterns of genetic variation in the hypertension candidate gene GRK4: ethnic variation and haplotype structure.

Authors: K E Lohmueller; L J C Wong; M M Mauney; L Jiang; R A Felder; P A Jose; S M Williams
Journal: Ann Hum Genet Date: 2006-01 Impact factor: 1.670

9. SNP@Domain: a web resource of single nucleotide polymorphisms (SNPs) within protein domain structures and sequences.

Authors: Areum Han; Hyo Jin Kang; Yoobok Cho; Sunghoon Lee; Young Joo Kim; Sungsam Gong
Journal: Nucleic Acids Res Date: 2006-07-01 Impact factor: 16.971

10. A map of recent positive selection in the human genome.

Authors: Benjamin F Voight; Sridhar Kudaravalli; Xiaoquan Wen; Jonathan K Pritchard
Journal: PLoS Biol Date: 2006-03-07 Impact factor: 8.029
