Literature DB >> 19756043

A simple and efficient algorithm for genome-wide homozygosity analysis in disease.

Wei Liu¹, Jinhui Ding, Jesse Raphael Gibbs, Sue Jane Wang, John Hardy, Andrew Singleton.

Abstract

Here we propose a simple statistical algorithm for rapidly scoring loci associated with disease or traits due to recessive mutations or deletions using genome-wide single nucleotide polymorphism genotyping case-control data in unrelated individuals. This algorithm identifies loci by defining homozygous segments of the genome present at significantly different frequencies between cases and controls. We found that false positive loci could be effectively removed from the output of this procedure by applying different physical size thresholds for the homozygous segments. This procedure is then conducted iteratively using random sub-datasets until the number of selected loci converges. We demonstrate this method in a publicly available data set for Alzheimer's disease and identify 26 candidate risk loci in the 22 autosomes. In this data set, these loci can explain 75% of the genetic risk variability of the disease.

Entities: Chemical

Mesh：

Year: 2009 PMID： 19756043 PMCID： PMC2758715 DOI： 10.1038/msb.2009.53

Source DB: PubMed Journal: Mol Syst Biol ISSN： 1744-4292 Impact factor: 11.429

Introduction

Advances in whole-genome single nucleotide polymorphism (SNP) assay technology have provided a powerful array of tools for simultaneously scoring common genetic variation. However, it is often difficult to identify loci associated with disease because of the large number of tests carried out and the associated conservative multiplicity adjustment, such as Bonferroni method. We are interested in identifying such loci associated with a disease likely due to recessive mutation or gene deletions. High density SNP analysis readily reveals the presence of large homozygous segments in unrelated subjects (Hinds ; Simon-Sanches ; Wang ). The probability of a randomly selected SNP locus being homozygous (‘AA' or ‘BB') based on data from HapMap is about 0.65 (Hinds ; Rabbee and Speed, 2006) and this may lend itself to autozygosity mapping in ostensibly outbred populations; however, traditional autozygosity mapping methods (Lander and Botstein, 1987; Mueller and Bishop, 1993; Gschwend ) based on consanguineous relationships are not appropriate for unrelated individuals. To identify loci with possible recessive effects of relatively high penetrance in outbred populations, large sample sizes are needed for genotyping. Some recent studies on homozygosity analysis of SNP assays have been attempted using different approaches (Woods ; Lencz ; Miyazawa ). However, they either have some familial relationship requirements (Woods ; Miyazawa ) or a high false positive rate (Lencz ). In the context of SNP genotyping, it is often not easy to distinguish heterozygous genomic deletion from homozygosity; thus a segment with all loci genotyped being ‘AA' or ‘BB' in a pedigree genotype file could be either a region of genuine homozygosity or effective hemizygosity caused by genomic deletion. We call such a region ‘apparently homozygous region' (AH). By carrying out an appropriate association analysis on AHs, one can detect not only the possible recessively mutated loci from some common ancestor but also deletions (Hunter, 2005; Klein ; Van Eyken ). In this paper, we propose a simple statistical algorithm for genome-wide AH analysis (GAHA) of case–control data in unrelated subjects. It can robustly identify loci that are associated with disease by efficiently removing false positive loci. We demonstrate this method in a publicly available data set for Alzheimer's disease (AD) (Coon ), consisting 502 627 SNP loci genotyped in unrelated 859 cases and 552 neurologically normal controls. A total of 26 loci from the 22 autosomes are identified and they explain 75% of the genetic risk variability of the disease.

Results and discussion

AH size threshold

In the context of the current data, it is not appropriate to use the number of loci as a measure of AH size as previously reported (Lencz ) because of its dependence on SNP density. Here we use the number of nucleotide basepairs between the first and last loci of an AH as a measure of AH size. Let C be a size threshold of AHs. We are interested in identifying loci proportions of which are significantly different between controls and cases in AHs with sizes ⩾C. As seen in Figure 1, for example, there are n1 cases and with a given C we count the proportion of the locus SNP-1 on AHs p1=(number of AHs containing SNP-1)/n1. Similarly, for n0 controls, we find the proportion p0 of the same locus. Using p1 and p0, we compute z-statistic for proportional test as described in Materials and methods. The locus is selected for further screening if ∣z∣⩾z1−α/2, where α is the level of significance. The test statistic z follows a standard normal distribution asymptotically as n0 and n1 increase with each greater than 30.

Figure 1

Scheme for computing the proportion of a locus on AHs. For a given chromosome of a subject, the symbols (•, ○) represent SNP loci. The shaded segments denote AHs with size greater than or equal to a pre-selected threshold C. The proportion of a locus on AHs is computed as p= (the number of AHs containing this locus)/(the total number of individuals), for example p1=4/6 for SNP-1.

We investigated the power for selecting loci based on α, AH percentage difference between cases and controls, and AH size threshold C through simulation. The relationships between z value and AH percentage difference with various C are shown in Supplementary Figure 1. At a significance level α=0.001, the powers to detect candidate loci were computed accordingly. We define that a candidate locus is detectable if the power>0.8. Our results showed that at a significance level α=0.001, we could detect a locus on AHs⩾C with a difference of 30% between cases and controls using C=10 kb, or only of 7% using C=1 Mb. On the basis of above significance level α and a moderate C value, typically thousands of loci could be selected with a large false positive rate from data of unrelated subjects. A key step is to efficiently remove these falsely associated loci from the candidate list. If we knew the minimum size of risk loci, then we would set it as C and consider only AH⩾C, leading to a lower false positive rate. However, such a C value is unknown. One approach is to use multiple values of C as discussed below. In convention, define C=1 for considering AHs with size ⩾1.

Algorithm for screening risk loci

We propose to use multiple C values for screening risk loci. Suppose we choose C1 and C2, with C1C1 will remain using either C1 or C2, the loci, not in S1⋂S2, should be more likely false positives and thus be removed. For example, in the AD data using a significance level α=0.001, among the 25 086 loci on chromosome 1, there were 18 loci selected using C=10 kb and 12 loci using C=30 kb, respectively, with only three being common loci in both sets. In general, we set C={C, i=1, 2,…, L} with C1 The PACS can efficiently remove false positive loci, however, for a real data set in unrelated individuals with large genetic variation, the selected loci usually still contain some false positives, many of which could be removed through further ‘purification'. To achieve this, ideally we should repeat the above steps using an independent data set from the same population to get another candidate set. Then identify the common loci from both sets. This new candidate set contains fewer false positive loci, which could be further removed by repeating above steps iteratively until the number of candidate loci converges. Although it is generally not realistic to do so, we could do the ‘purification' using random subsets from the full data set as described below. Let n*=[f × n]>30 be the size of a random subset from the full data set of size n, where k=1 for cases and k=0 for controls, and f be a constant with 0 Let S be the set containing the selected loci from the full data set and S* be that from the first random sub-data set. Let S1*=S*⋂S containing the common loci in both sets and N1=∣S1*∣ be the number of loci in S1*. Next we generate a new S* from the second random sub-data set and let S2*=S1*⋂S* with N2=∣S2*∣. Repeating these steps to update the candidate loci set until the number of N, t=1,2,………, converges to a constant integer Nc with Nc =0 if the null hypothesis of no difference between p1 and p0 is true and Nc>0 if the alternative hypothesis p1≠p0 is true. For a given f, there are possible ways for selecting case–control subset, which should be much larger than the number required for reaching convergence at an appropriate level of significance. The above GAHA algorithm is summarized in Box 1. The false positive rate of a locus in the final set should be ⩽α. The false negative rates of loci selection in a random subset were estimated under the same settings for the full data set (Supplementary Table 2).

Application to AD data set

Set C={1, 10 kb, 30 kb, 50 kb, 100 kb, 140 kb, 250 kb, 500 kb, 1 Mb} and α=0.001. We identified 607 loci from 4054 loci whose ∣z∣⩾z1−α/2 (Figure 2A) from the 22 autosomes in the AD data set (Coon ).

Figure 2

The plot of z versus nucleotide basepair of chromosome 19 in the AD data set: (A) before and (B) after the procedure of adjacent-C-selection, (C) the most significant region—the peak locus is rs4420638, (D) the most significant region with two loci on APOE (↓).

The most significant AH region was on 19q13.2 (see Figure 2B) with positive z values suggesting significantly more AHs in controls than in cases. This region, covering the whole apolipoprotein E (APOE) gene, contains four loci including rs4420638 (Figure 2C), which is in linkage disequilibrium with APOE (Coon ). However, there were no genotypes within APOE in the AD data. We added available genotyping information (Coon ) of two loci on APOE, rs429358 and rs7412, to the AD data. The two APOE loci define the ɛ2/ɛ3/ɛ4 genotypes. Figure 2D shows the APOE loci indeed on the AH region where the majority controls have the ɛ3 genotype, supporting the observation that APOE ɛ3 is protective against the disease when compared with ɛ4 (Farrer ). To further reduce the false positive rate within this list, we chose f=0.9 for generating random subsets, each with 773 cases and 497 controls. The use of f=0.9 may not be the statistically optimal choice; it is, however, the best we tried. The convergence of the loci number is shown in Figure 3. There were 26 loci in the final list (Figure 3B) (Table I). Based on a logistic regression model fit, the percent variation of the genetic risk explained by these 26 loci was 75.3%. Model selection removed 10 confounder loci and retained 16 loci (each with P-value<0.05), including rs4420638, in the reduced model with 74.8% of the genetic risk variation explained (Supplementary Table 3, 4).

Figure 3

Convergence of the loci number. (A) At a level of significance α=0.001, a total of 607 loci (□) were selected from the 4054 loci for which ∣z∣⩾ z1−α/2 (Δ) by applying the procedure of adjacent-C-selection in the AD data set. Random case–control subsets were generated using f=0.9 and used in screening iteration (○). (B) The enlarged plot showing the convergence of selected loci to the number 26.

Table 1

List of candidate loci associated with AD from the 22 autosome of the AD SNP genotype data (Coon )

CHR	SNP ID	Location^a	Function	Gene	Gene ID	Effect
^aIn nucleotide basepair.
^bLoci remained in the model on logistic regression selection with a P-value<0.05.
^cLoci in homozygous regions containing candidate loci of recessive genetic lesion causing AD (Clarimón et al, 2008).
^dGenes are on known functional pathways and networks as revealed by the use of Ingenuity Pathway Analysis (Ingenuity Systems, www.ingenuity.com).
^eA SNP in Affymetrix 500K GeneChip, but without NCBI ID.
1	rs17325887^b,c	69998761	Intron	LRRC7^d	57554	Risk
1	rs7520521^c	70020703	Intron	LRRC7^d	57554	Risk
1	rs1913269^b,c	70052194	Intron	LRRC7^d	57554	Risk
1	rs10754339^b	117491795	mRNA–UTR	VTCN1^d	79679	Protect
1	rs16842422^b	196366613	–66918	LOC647195	647195	Protect
2	rs7582851	192032391	–392328	LOC647167	647167	Protect
3	rs6784615^b	52481466	Intron	NISCH^d	11188	Protect
4	rs9994615	40786592	Intron	APBB2^d	323	Risk
4	rs10015784^b	40793978	Intron	APBB2^d	323	Risk
5	rs1602843^b,c	86324342	0	COL24A1^d	255631	Risk
5	rs2913719^b	163947773	2403	LOC440700	440700	Protect
6	rs13213247^b	81572755	–91974	LOC729817	729817	Risk
6	rs16892285	81592721	–72008	LOC729817	729817	Risk
6	rs13193950	81593433	–71296	LOC729817	729817	Risk
6	rs156232^b	104979509	481535	LOC642337	642337	Risk
10	rs10827687^b	36999313	–39887	GRIK3^d	2899	Risk
10	rs10824310^b	53698470	Intron	PRKG1^d	5592	Risk
10	rs10740548	54877234	–2797	C1orf175^d	374977	Risk
11	rs1038891^b,c	40895642	0	RIMS3^d	9783	Risk
12	rs1354470^b	59088188	–32939	LOC645757	645757	Risk
12	rs7967572	73396068	51514	KRT8P21	126811	Risk
18	rs1785928^b	31979929	Coding non-synonymous	ELP2^d	55250	Risk
19	rs11879589	50065116	Intron	PVRL2^d	5819	Protect
19	rs4420638^b	50114786	Locus region	APOC1^d	341	Protect
19	^e	50150075
19	rs204907	50153836	Intron	CLPTM1^d	1209	Protect

The APOE ɛ4 was carried by ∼40% of the later-onset AD cases (Poirier ; Laws ). Recall that rs4420638 is in linkage disequilibrium with APOE, we found that the percent genetic risk variation explained by this locus alone was 34.2%. However, when rs4420638 was excluded from the reduced model, the percentage genetic risk variation explained by the remaining 15 loci was decreased only by 2.9% (from 74.8% to 71.9%). This suggests these loci explain the genetic risk variation of AD as a group. Several of the 26 loci identified in this screening were also found in homozygous regions identified in an early onset AD study of a consanguineous family (Clarimón ), suggesting that one of these regions harbors a recessive genetic lesion causing AD. The 26 loci are on 20 genes of which 13 are in known functional pathways or networks as revealed from an Ingenuity Pathway Analysis (Ingenuity Systems, www.ingenuity.com) (Supplementary Pathway/Network analysis). On the basis of the correlations among the 20 genes and AD status of subjects, we construct an AD genetic network (Supplementary Figure 2).

Summary

We propose a statistical method for GAHA of SNP case–control data in unrelated subjects to identify risk loci that are most likely associated with a disease or abnormality due to recessive mutation or deletion. The main novelty of this method over other approaches is to minimize the false positive rate of the risk candidates. We remove the false positive loci by selecting the common loci with different size thresholds of homozygous segments and repeating these steps iteratively using random sub-data sets until the number of selected loci converges. Furthermore, this method allows selects risk loci from a wider AH size range. By demonstrating of the method using a publicly available AD SNP assay data set, we identified 26 candidate risk loci from the 22 autosomes.

Materials and methods

Notes

Suppose there are n SNP loci genotyped on a given chromosome (an autosome). We view the sequences of SNP loci on a chromosome as linked regions either being heterozygous or AHs. Let H be a set such that H={h1, h2,…, h} where h denotes the number of AHs containing i consecutive SNP loci genotyped, and m is the maximum number of consecutive SNP loci. The probability of a randomly selected SNP locus on AHs with SNP number being equal to or larger than a predetermined integer k is .

Data

A SNP genotype data set of late-onset AD(500K Affymetrix) was downloaded from a publicly available website, http://www.neuron.org, to demonstrate our method. This data set consists of 502 627 SNP loci genotyped in unrelated 859 cases and 552 neurologically normal controls.

Proportion test

We are interested in identifying loci at which the proportion of a SNP locus, on AHs with size equal to or larger than a given threshold C, is significantly different between controls and cases. Our null hypothesis is that the SNP at a given locus has the same probability of being on AHs with size ⩾C in the control and case groups. The test statistic in a standard proportion test is and follows a Gaussian distribution under the null hypothesis, where the p0 is the proportion of the locus on AHs for the n0 control subjects and the p1 is that for the n1 cases. We define z=0 when both p0=0 and p1=0. For a given level of significance α, a locus is selected if ∣z∣⩾z1−α/2. This test requires large sample size (n0, n1>30).

Logistic regression

In logistic regression using the selected loci as predictor variables, let x=1 if the ith locus of the jth subject is on an AH with size being equal to or larger than C=10 kb and x=0 otherwise. Logistic regression is carried out using SAS 9.0.

Declaration

The views expressed in this article do not represent those of the US Food and Drug Administration. (1) For case–control SNP data with n1 cases and n0 controls, choose a level of significance α, set AH thresholds C={C, i=1, 2,…, L} with C (2) Compute z at each locus and select it if ∣z∣⩾z1-α/2. Perform the PACS and let Sold be the set of selected loci and Nold=∣Sold∣. Chose 0 < f < min{n}−1/min{n}, and ℓ=0 (3) Randomly select a case–control sub-dataset from (1) with n1* = [f × n1] >30 cases and n0* = [f × n0] >30 controls. Find AHs for each subject at given C, then compute z at each locus and select it if ∣z∣⩾ z1-α/2 (4) Carry out the PACS and let S* be the set containing all the loci selected from the sub-dataset. Find Snew=Sold⋂S* Nnew=∣Snew∣ (5) Supplementary figures S1–2, Supplementary tables S1–4, Pathway/Network Analysis

20 in total

1. Complement factor H polymorphism in age-related macular degeneration.

Authors: Robert J Klein; Caroline Zeiss; Emily Y Chew; Jen-Yue Tsai; Richard S Sackler; Chad Haynes; Alice K Henning; John Paul SanGiovanni; Shrikant M Mane; Susan T Mayne; Michael B Bracken; Frederick L Ferris; Jurg Ott; Colin Barnstable; Josephine Hoh
Journal: Science Date: 2005-03-10 Impact factor: 47.728

2. A genotype calling algorithm for affymetrix SNP arrays.

Authors: Nusrat Rabbee; Terence P Speed
Journal: Bioinformatics Date: 2005-11-02 Impact factor: 6.937

Review 3. Gene-environment interactions in human diseases.

Authors: David J Hunter
Journal: Nat Rev Genet Date: 2005-04 Impact factor: 53.242

4. Whole genome genotyping technologies on the BeadArray platform.

Authors: Frank J Steemers; Kevin L Gunderson
Journal: Biotechnol J Date: 2007-01 Impact factor: 4.677

5. Homozygosity haplotype allows a genomewide search for the autosomal segments shared among patients.

Authors: Hitoshi Miyazawa; Masaaki Kato; Takuya Awata; Masakazu Kohda; Hiroyasu Iwasa; Nobuyuki Koyama; Tomoaki Tanaka; Shunei Kyo; Yasushi Okazaki; Koichi Hagiwara
Journal: Am J Hum Genet Date: 2007-05-02 Impact factor: 11.025

6. Runs of homozygosity reveal highly penetrant recessive loci in schizophrenia.

Authors: Todd Lencz; Christophe Lambert; Pamela DeRosse; Katherine E Burdick; T Vance Morgan; John M Kane; Raju Kucherlapati; Anil K Malhotra
Journal: Proc Natl Acad Sci U S A Date: 2007-12-05 Impact factor: 11.205

7. Autozygosity mapping, complex consanguinity, and autosomal recessive disorders.

Authors: R F Mueller; D T Bishop
Journal: J Med Genet Date: 1993-09 Impact factor: 6.318

8. Homozygosity mapping: a way to map human recessive traits with the DNA of inbred children.

Authors: E S Lander; D Botstein
Journal: Science Date: 1987-06-19 Impact factor: 47.728

9. Genome-wide SNP assay reveals structural genomic variation, extended homozygosity and cell-line induced alterations in normal individuals.

Authors: Javier Simon-Sanchez; Sonja Scholz; Hon-Chung Fung; Mar Matarin; Dena Hernandez; J Raphael Gibbs; Angela Britton; Fabienne Wavrant de Vrieze; Elizabeth Peckham; Katrina Gwinn-Hardy; Anthony Crawley; Judith C Keen; Josefina Nash; Digamber Borgaonkar; John Hardy; Andrew Singleton
Journal: Hum Mol Genet Date: 2006-11-20 Impact factor: 6.150

10. Effects of age, sex, and ethnicity on the association between apolipoprotein E genotype and Alzheimer disease. A meta-analysis. APOE and Alzheimer Disease Meta Analysis Consortium.

Authors: L A Farrer; L A Cupples; J L Haines; B Hyman; W A Kukull; R Mayeux; R H Myers; M A Pericak-Vance; N Risch; C M van Duijn
Journal: JAMA Date: 1997 Oct 22-29 Impact factor: 56.272