Using a Machine Learning Approach to Identify Low-Frequency and Rare FLG Alleles Associated with Remission of Atopic Dermatitis.

Ronald Berna1, Nandita Mitra2, Ole Hoffstad1, Bradley Wubbenhorst3, Katherine L Nathanson3, David J Margolis1,2.   

Abstract

Atopic dermatitis (AD) is a common relapsing inflammatory skin disease. FLG is the gene most consistently associated with AD. Loss-of-function variants in FLG have been previously associated with AD. Low-frequency and rare alleles (minor allele frequency < 5%) in this gene have been given less attention than loss-of-function variants. We fine sequenced the FLG gene in a cohort of individuals with AD. We developed a machine learning‒based algorithm to associate low-frequency and rare alleles with the disease. We then applied this algorithm to the FLG data, searching for associations between groups of low-frequency and rare FLG alleles and AD remission. A group of 46 rare and low-frequency FLG alleles was associated with increased AD remission (P = 2.76e-11). Overall, 16 of these 46 FLG variants were identified in an independent cohort and were associated with decreased AD incidence (P = 0.0007). This study presents an application of statistical methods in AD genetics and suggests that low-frequency and rare alleles may play a larger role in AD pathogenesis than previously appreciated.
© 2021 The Authors.

Keywords:  AD, atopic dermatitis; GAD, genetics of atopic dermatitis; GEE, generalized estimating equation; LoF, loss of function; MAF, minor allele frequency; PEER, Pediatric Eczema Elective Registry; PoC, probability of clearance

Year:  2021        PMID: 34909743      PMCID: PMC8659778          DOI: 10.1016/j.xjidi.2021.100046

Source DB:  PubMed          Journal:  JID Innov        ISSN: 2667-0267


Introduction

Atopic dermatitis (AD) is a chronic inflammatory skin disease that typically presents with red itchy patches on the flexural parts of the limbs (Abramovits, 2005; Akdis et al., 2006; Bieber, 2008; Leung and Bieber, 2003). It is common, affecting up to 20% of children and 3.2–10.2% of adults in the industrialized world (Chiesa Fuxench et al., 2019; Pellerin et al., 2013). Genetic studies of AD have suggested that both barrier dysfunction and immunodysregulation are key in the disease pathogenesis (Leung and Bieber, 2003). Genetic variation in the cytokines involved in the T helper type 2 response, such as TSLP, and in epidermal surface barrier proteins, primarily FLG, has been associated with variation in AD onset and persistence (Irvine et al., 2011; Kim et al., 2019; Palmer et al., 2006). Of these, the most consistently associated and widely studied is FLG. Loss-of-function (LoF) mutations in FLG lead to decreased production of FLG protein and are thought to result in a barrier defect that predisposes an individual to AD (Irvine et al., 2011; Margolis et al., 2012; Palmer et al., 2006; Quiroz et al., 2020). The FLG protein belongs to a family of S100 fused-type proteins, many of which are part of the development of the skin’s cornified envelope (Pellerin et al., 2013; Wu et al., 2011b). The genes encoding these proteins are found in a region of chromosome 1 called the epidermal differentiation complex (Mischke et al., 1996; Pellerin et al., 2013; Wu et al., 2011b). Other skin barrier proteins in this family include FLG-2, TCHHL1, and hornerin (Margolis et al., 2014a; Pellerin et al., 2013; Pendaries et al., 2015; Wu et al., 2011b). In a previous study, we conducted next-generation sequencing of TSLP and IL7R and identified specific genetic variants likely to have a function in AD (Berna et al., 2021). However, that study was limited by the inability to effectively examine low-frequency (1% < minor allele frequency [MAF] < 5%) and rare (MAF < 1%) variants (Berna et al., 2021).
Indeed, a key limitation of examining uncommon variants with next-generation sequencing studies is that one requires very large sample sizes to effectively examine such variants, an expensive and laborious endeavor. This paper aims to use machine learning methods to identify low-frequency and rare alleles in FLG, in a diseased population, associated with AD remission. We developed a genetic algorithm‒based approach to low-frequency and rare allele association, assessed general characteristics of this model using a simulated dataset, applied this model to low-frequency and rare alleles in FLG, and sought to identify the low-frequency and rare alleles associated with AD remission in a large longitudinal African American cohort.

Results

Simulation results

Before applying our algorithm to Pediatric Eczema Elective Registry (PEER) genetic data, we conducted a number of simulation studies to determine the effectiveness of our algorithm in identifying low-frequency and rare alleles associated with longitudinal measures of disease. First, we constructed three longitudinal datasets: one in which the first 10 alleles were associated with more severe disease, one in which the first 10 alleles were associated with less severe disease, and one in which no alleles were differentially associated with disease. A total of 100 simulations of our algorithm were run on each dataset individually, with results showing clear identification of disease-associated alleles and good discrimination for alleles unassociated with disease (Figure 1a–c). The difference between these groups was statistically significant, with P < 0.001 for alleles associated with disease.
Figure 1

Simulation studies of genetic algorithm model (a–c). Three simulation studies showing the ability of the model to identify associated alleles. The first 10 alleles were associated with (a) moderate disease control and (b) more severe disease control. Associated alleles are identified ~60% of the time. By contrast, when no alleles are associated with disease control (c), the algorithm picks up no clearly associated alleles. (d) Identification of alleles as the number of associated alleles increases. The black box plots represent the number of times the associated alleles were identified, the red box plots represent the number of times the unassociated alleles were identified. Each column represents the results of 100 independent runs of the model. (e) The proportion of identified alleles associated with disease, as the alleles’ association with disease varied. Each column represents the results of 100 independent model runs. The black boxes correspond to the associated alleles, and the red boxes correspond to the unassociated alleles. (f) The proportion of alleles associated with disease, as population size increases. A total of 100 simulations were run for sample sizes of 50, 100, and 500, which corresponded to 5, 10, and 50 associated alleles, respectively. The black boxes represent the truly associated alleles, and the red boxes represent the unassociated alleles. ∗∗ P < 0.01, ∗∗∗ P < 0.001.

We next assessed the algorithm’s sensitivity to initial conditions, evaluating its ability to identify the associated variants when the number of associated alleles increased or decreased (Figure 1d), when the associated alleles were more or less strongly associated with the outcome measure (Figure 1e), and when the size of the cohort increased or decreased (Figure 1f). These analyses showed the identification of approximately 60% of associated alleles on an average run (Figure 1d), with improving identification for more strongly associated alleles (Figure 1e).
The algorithm also performs strongly when the population size is quite large and when the number of associated alleles increases (Figure 1f). All group differences in Figure 1d–f were statistically significant with P < 0.01 or P < 0.001.

PEER results

Within our PEER genetic cohort, 42% were male, the mean age of AD onset was 2.14 years, and the mean duration of observation was 97 months. Demographic information is provided in Table 1. A total of 60% of all surveys reported disease clearance; the proportion of surveys reporting clearance at any given time point is presented in Figure 2.
Table 1

Participant Demographics: Basic Demographic Data Regarding African Americans within PEER

Demographics                                 African Americans
Number                                       326
Age of AD onset, y, mean (SD)                2.14 (2.85)
Sex, male, n (%)                             136 (41.98)
Asthma, n (%)                                182 (56.17)
Seasonal allergies, n (%)                    228 (70.37)
Observation time in months, mean (95% CI)    97 (92.9–101.2)
Disease control [1], mean (SD)               2.57 (0.74)

Abbreviations: AD, atopic dermatitis; CI, confidence interval; PEER, Pediatric Eczema Elective Registry.

[1] Disease control measured on a 4-point scale, with 1 representing complete disease control, and 4 representing uncontrolled disease.

Figure 2

Proportion of subjects showing the outcome of disease clearance at each month of follow-up. The red line shows the proportion of surveys reporting disease clearance across the entire study period, 0.6.

Massively parallel sequencing revealed 583 unique alleles in FLG, of which 337 had a MAF < 5%. After filtering FLG LoF alleles and alleles strongly correlated with FLG LoF, 322 FLG low-frequency and rare alleles remained (Figure 3). We performed 100 runs of the genetic algorithm on the FLG low-frequency and rare alleles. Evaluation of the alleles revealed a group of 46 low-frequency and rare alleles in FLG significantly associated with increased AD remission (OR = 5.19; 95% confidence interval = 3.52–7.66, adjusted P = 2.76e-11). The P-value was adjusted for 250,000 independent tests. Annotation of these variants is presented in Table 2. The frequencies in a control population, the African American subset of the Allele Frequency Aggregator, are provided in Table 3 (Phan et al., 2020). Genotype frequencies are provided in Table 4. A total of 40 of the identified alleles were within exon 3, FLG’s largest exon (Figure 4). This group of low-frequency and rare alleles is also associated with greater reports of disease clearance at almost every follow-up survey (Figure 5). These 46 low-frequency and rare FLG alleles are present in 55 different individuals, representing 16.9% of the PEER African American population. These 46 alleles are largely not in linkage disequilibrium with each other (Figure 6).
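The multiple-testing adjustment reported here can be illustrated with a short calculation. The sketch below is not the authors' code: it assumes that the 250,000 tests arise as 100 runs × 50 iterations × 50 candidate subsets per iteration (the run structure given in Materials and Methods), and the names N_TESTS and bonferroni are hypothetical.

```python
# Hypothetical illustration of a Bonferroni adjustment.
# Assumption: tests = runs x iterations per run x subsets per iteration.
N_RUNS = 100
ITERATIONS_PER_RUN = 50
SUBSETS_PER_ITERATION = 50
N_TESTS = N_RUNS * ITERATIONS_PER_RUN * SUBSETS_PER_ITERATION


def bonferroni(p_raw, n_tests):
    """Multiply a raw P-value by the number of tests, capped at 1."""
    return min(1.0, p_raw * n_tests)


if __name__ == "__main__":
    # Unadjusted P-value implied by the reported adjusted value,
    # assuming the cap at 1 was not reached.
    adjusted_p = 2.76e-11
    print(N_TESTS, adjusted_p / N_TESTS)
```

Under this assumption the test count matches the 250,000 stated in the text, and the reported adjusted P-value corresponds to an unadjusted value on the order of 1e-16.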
Figure 3

Analysis plan. AD, atopic dermatitis; GAD, genetics of atopic dermatitis; MAF, minor allele frequency; NGS, next-generation sequencing; PEER, Pediatric Eczema Elective Registry.

Table 2

Annotation of the FLG Alleles Associated with Decreased AD Persistence in African Americans

Location   Nucleotide Change  RSID          Amino Acid Change  MAF    Functional Region
152275810  G>A                rs150496930   p.A3851V           0.002  Repeat domain 12
152276248  C>A                rs371128626   p.G3705V           0.005  Repeat domain 11
152276858  A>G                rs150047484   p.S3502P           0.002  Repeat domain 11
152276976  G>A                rs140294281   p.S3462S           0.002  Repeat domain 11
152276979  C>T                rs145466389   p.G3461G           0.003  Repeat domain 11
152277184  T>C                rs146234375   p.H3393R           0.002  Repeat domain 10
152277637  G>A                [1]           p.S3242F           0.002  Repeat domain 10
152277738  G>C                rs774463249   p.S3208R           0.002  Repeat domain 10
152277769  A>C                rs143183339   p.V3198G           0.002  Repeat domain 10
152277794  C>T                rs146288788   p.D3190N           0.002  Repeat domain 10
152277905  G>A                rs148315024   p.R3153C           0.002  Repeat domain 10
152278525  C>T                [1]           p.S2946N           0.002  Repeat domain 9
152278706  G>A                rs141172870   p.R2886C           0.002  Repeat domain 9
152278787  G>A                [1]           p.H2859Y           0.002  Repeat domain 9
152279561  C>T                rs146849256   p.D2601N           0.002  Repeat domain 8
152280131  A>G                rs966449727   p.S2411P           0.002  Repeat domain 7
152280134  G>A                rs141651911   p.R2410C           0.003  Repeat domain 7
152280330  T>A                rs368784083   p.S2344S           0.003  Repeat domain 7
152281133  G>A                rs151189270   p.R2077C           0.002  Repeat domain 6
152281389  C>T                rs138652718   p.A1991A           0.005  Repeat domain 6
152282238  G>A                rs151199504   p.D1708D           0.002  Repeat domain 5
152282384  T>A                rs200033409   p.T1660S           0.002  Repeat domain 5
152282684  G>A                rs151103850   p.R1560C           0.012  Repeat domain 5
152283239  G>A                rs772994159   p.H1375Y           0.002  Repeat domain 4
152283742  G>T                rs142660239   p.S1207Y           0.002  Repeat domain 4
152283962  T>C                rs140646945   p.T1134A           0.005  Repeat domain 4
152284289  C>G                rs145939718   p.A1025P           0.003  Repeat domain 3
152284376  C>T                rs149106390   p.G996R            0.002  Repeat domain 3
152284382  C>T                rs149595328   p.G994S            0.012  Repeat domain 3
152284540  C>T                rs547196696   p.R941H            0.003  Repeat domain 3
152285021  C>G                rs148739675   p.D781H            0.003  Repeat domain 2
152285119  G>C                rs201522026   p.T748S            0.003  Repeat domain 2
152285647  C>T                rs749798893   p.R572Q            0.002  Repeat domain 2
152285807  G>T                rs12036682    p.H519N            0.002  Repeat domain 2
152285981  G>A                rs184361545   p.R461W            0.002  Repeat domain 2
152286006  C>A                [1]           p.E452D            0.002  Repeat domain 1
152286029  G>A                rs141885805   p.L445L            0.002  Repeat domain 1
152286043  TG>T               [1]           p.H440fs           0.002  Repeat domain 1
152286061  G>T                rs144808372   p.T434K            0.002  Repeat domain 1
152286118  C>A                [1]           p.R415L            0.002  Repeat domain 1
152287645  G>A                rs991098106   -                  0.002  Intron variant
152289544  C>T                rs1010015022  -                  0.002  Intron variant
152291412  G>C                rs775513539   -                  0.002  Intron variant
152293991  A>T                [1]           -                  0.002  Intron variant
152296514  C>T                rs554188171   -                  0.003  Intron variant
152296874  C>G                rs550028010   -                  0.002  Intron variant

Abbreviations: AD, atopic dermatitis; CI, confidence interval; MAF, minor allele frequency; RSID, Reference SNP cluster ID.

The OR for the association of these alleles with AD persistence is 5.19 (95% CI = 3.52–7.66, adjusted P = 2.76e-11).

[1] RSID unavailable.

Table 3

Frequencies of Low-Frequency and Rare FLG Alleles in PEER Versus Frequencies in African American Subset of ALFA

Location   Nucleotide Change  MAF in PEER  ALFA Allele Frequency, African Americans
152275810  G>A                0.002        0
152276248  C>A                0.005        0.0014
152276858  A>G                0.002        0.001
152276976  G>A                0.002        0.0026
152276979  C>T                0.003        0.0027
152277184  T>C                0.002        0
152277637  G>A                0.002        NA
152277738  G>C                0.002        0
152277769  A>C                0.002        0
152277794  C>T                0.002        0.0036
152277905  G>A                0.002        0.0009
152278525  C>T                0.002        NA
152278706  G>A                0.002        0
152278787  G>A                0.002        NA
152279561  C>T                0.002        0.0006
152280131  A>G                0.002        0
152280134  G>A                0.003        0.0047
152280330  T>A                0.003        0.0003
152281133  G>A                0.002        0.0003
152281389  C>T                0.005        0.0012
152282238  G>A                0.002        0
152282384  T>A                0.002        0
152282684  G>A                0.012        0.0119
152283239  G>A                0.002        0
152283742  G>T                0.002        0.0006
152283962  T>C                0.005        0.0047
152284289  C>G                0.003        0
152284376  C>T                0.002        0
152284382  C>T                0.012        0.0012
152284540  C>T                0.003        0
152285021  C>G                0.003        0.0053
152285119  G>C                0.003        0
152285647  C>T                0.002        0
152285807  G>T                0.002        0.0002
152285981  G>A                0.002        0
152286006  C>A                0.002        NA
152286029  G>A                0.002        0
152286043  TG>T               0.002        NA
152286061  G>T                0.002        0
152286118  C>A                0.002        NA
152287645  G>A                0.002        0
152289544  C>T                0.002        0
152291412  G>C                0.002        0
152293991  A>T                0.002        NA
152296514  C>T                0.003        0.0025
152296874  C>G                0.002        0.0004

Abbreviations: ALFA, Allele Frequency Aggregator; MAF, minor allele frequency; NA, not applicable; PEER, Pediatric Eczema Elective Registry.

Table 4

Genotype Frequencies of FLG Low-Frequency and Rare Alleles

Location   Nucleotide Change  Genotype Frequencies
                              AA     Aa     aa
152275810  G>A                0.997  0.003  0
152276248  C>A                0.991  0.009  0
152276858  A>G                0.997  0.003  0
152276976  G>A                0.988  0.012  0
152276979  C>T                0.994  0.006  0
152277184  T>C                0.997  0.003  0
152277637  G>A                0.997  0.003  0
152277738  G>C                0.997  0.003  0
152277769  A>C                0.994  0.006  0
152277794  C>T                0.997  0.003  0
152277905  G>A                0.997  0.003  0
152278525  C>T                0.997  0.003  0
152278706  G>A                0.991  0.009  0
152278787  G>A                0.997  0.003  0
152279561  C>T                0.991  0.009  0
152280131  A>G                0.997  0.003  0
152280134  G>A                0.994  0.006  0
152280330  T>A                0.994  0.006  0
152281133  G>A                0.994  0.006  0
152281389  C>T                0.991  0.009  0
152282238  G>A                0.997  0.003  0
152282384  T>A                0.997  0.003  0
152282684  G>A                0.914  0.086  0
152283239  G>A                0.997  0.003  0
152283742  G>T                0.972  0.028  0
152283962  T>C                0.991  0.009  0
152284289  C>G                0.979  0.021  0
152284376  C>T                0.988  0.009  0.003
152284382  C>T                0.975  0.025  0
152284540  C>T                0.994  0.006  0
152285021  C>G                0.994  0.006  0
152285119  G>C                0.994  0.006  0
152285647  C>T                0.997  0.003  0
152285807  G>T                0.988  0.009  0.003
152285981  G>A                0.997  0.003  0
152286006  C>A                0.997  0.003  0
152286029  G>A                0.997  0.003  0
152286043  TG>T               0.997  0.003  0
152286061  G>T                0.997  0.003  0
152286118  C>A                0.994  0.006  0
152287645  G>A                0.997  0.003  0
152289544  C>T                0.997  0.003  0
152291412  G>C                0.997  0.003  0
152293991  A>T                0.997  0.003  0
152296514  C>T                0.994  0.006  0
152296874  C>G                0.997  0.003  0

Figure 4

Plot of low-frequency and rare alleles along the FLG gene. Alternating light and dark green segments correspond to the 12 tandem repeats in FLG. Allele locations based on Margolis et al. (2019).

Figure 5

Application of the model to FLG. Average clearance over time for individuals with (blue) and without (black) low-frequency and rare alleles listed earlier. Error bars represent 95% confidence intervals. AD, atopic dermatitis.

Figure 6

LD plot of the 46 low-frequency and rare FLG alleles. 152286043TG>T not shown. Numbers overlying the shaded boxes represent the R² values. AD, atopic dermatitis; LD, linkage disequilibrium.

Genetics of AD results

As a secondary study, we evaluated the contribution of these alleles to AD risk in a different population, the genetics of AD (GAD) group. We include the GAD data because we are seeking to show the significance of our low-frequency and rare alleles in an alternate cohort and thereby show that these alleles have significance beyond the PEER cohort. Within the GAD genetic group, 316 individuals were African American. Massively parallel sequencing identified 1,094 unique alleles in FLG. Of the 46 low-frequency and rare alleles associated with less persistent AD in PEER, 16 were identified in the GAD African American group. Table 5 presents the MAFs of these alleles in GAD cases and controls. The OR of the association between these alleles and the presence of AD was 0.376, with a P-value of 0.0007.
Table 5

MAFs in GAD Cases and Controls

RSID         Location   MAF, GAD Controls  MAF, GAD Cases
rs371128626  152276248  0                  0.005
rs140294281  152276976  0.025              0.010
rs145466389  152276979  0                  0.005
rs146288788  152277794  0.017              0.005
rs146849256  152279561  0.008              0
rs141651911  152280134  0.025              0.005
rs151189270  152281133  0                  0.005
rs138652718  152281389  0.058              0.015
rs200033409  152282384  0.050              0.050
rs151103850  152282684  0.008              0.020
rs142660239  152283742  0.008              0
rs145939718  152284289  0.008              0
rs149595328  152284382  0.008              0.020
rs148739675  152285021  0.033              0.005
rs201522026  152285119  0.108              0.010
rs12036682   152285807  0                  0.005

Abbreviations: GAD, genetics of atopic dermatitis; MAF, minor allele frequency; RSID, Reference SNP cluster ID.

Discussion

In this study, we used a genetic algorithm‒based approach to identify low-frequency and rare alleles associated with longitudinal measures of AD. We identified a group of low-frequency and rare alleles within FLG associated with AD. This FLG low-frequency and rare allele group represents a larger proportion of the PEER population than FLG LoF variants. This group is associated with lower odds of having AD in an independent dataset (the GAD group). A key contribution of this report was the use of an algorithm to assess the association of low-frequency and rare alleles (MAF < 5%) with longitudinal measures of disease. Traditionally, it has been very difficult to identify low-frequency and rare alleles associated with common diseases because (i) these alleles occur infrequently and there is typically insufficient statistical power to generate accurate ORs, (ii) existing methods (burden tests and variance component tests) are primarily useful for examining whether a whole gene, rather than particular variants, is implicated in a disease, and (iii) there are few methods for associating low-frequency and rare alleles with a disease when outcome measures are longitudinal (Lee et al., 2012; Wu et al., 2011a). Our genetic algorithm‒based approach has several strengths. By grouping alleles, as is done in burden tests and variance component tests, we can increase the statistical power of any given generalized estimating equation (GEE) calculation. By iteratively refining the alleles in our group, we can better localize low-frequency and rare alleles that are likely to be disease-associated. By focusing on optimization of the OR of the association, we can associate low-frequency and rare alleles with a disease even when the outcome measure is nonbinary. Numerous studies have examined the role of FLG in skin barrier formation.
One recent report suggested that it is essential for the assembly of keratohyalin granules in the terminally differentiating epidermal layers (Quiroz et al., 2020). The association between FLG LoF variation and AD has been established in multiple cohorts and holds across ancestral groups (Barker et al., 2007; Marenholz et al., 2006; Margolis et al., 2014b; Margolis et al., 2018; Palmer et al., 2006; Pigors et al., 2018; Weidinger et al., 2007). However, few studies have associated low-frequency and rare FLG alleles with disease. None, to our knowledge, have identified non-LoF alleles associated with AD. Our group of 46 low-frequency and rare alleles, associated with five-fold less severe AD, is present in 16.9% of the PEER population. By contrast, FLG LoF variants are present in only 11.3% of this study population. The association between our rare variant group and AD is visually apparent, with greater reports of disease clearance at almost every follow-up survey (Figure 5). Six of the 46 variants identified represent synonymous amino acid changes. This is both interesting and unexpected. It is possible that these regions represent regulatory regions within coding exons (Dong et al., 2010). Further study is needed to identify the function of these alleles. These FLG alleles are also protective against AD in an independent population, the GAD group. The fact that these variants are protective in the GAD group whereas they decrease persistence in the PEER cohort suggests that these variants mitigate AD presence and severity. The GAD results suggest that these variants have significance beyond the PEER population. Although our analyses do not show causality between these FLG alleles and AD remission, these findings are valuable for two reasons. First, these low-frequency and rare alleles could be incorporated into genetic risk models predicting AD severity. Evaluation of these alleles in alternate populations could further verify these associations. 
Second, these alleles could be investigated in future studies of the molecular mechanisms of AD. This study has several limitations. First, we only examined African American individuals, so our results may not apply to AD in different races/ethnicities. Because African Americans represent a distinct subset of those with African ancestry, our results may not generalize to all those of African ancestry. Second, as with all observational studies, our analyses show only that low-frequency and rare alleles are associated with AD severity; they do not establish causality. Third, although associated with AD persistence in PEER, our allele groups may not generalize to the broader AD population. Studies in different populations will be needed to confirm these findings. In this paper, we uncovered associations between low-frequency and rare alleles and AD remission using an approach employing a genetic algorithm to group these alleles. We identified a group of 46 low-frequency and rare alleles in FLG associated with increased AD remission. These variants represent a contribution to AD genetics independent of FLG LoF insofar as all LoF variants and all alleles strongly correlated with LoF variants were removed before generating these allele groups. These alleles were present at a clinically significant frequency in this study population and accounted for a larger proportion of the African American PEER population than FLG LoF alleles alone. A subset of these alleles is associated with AD risk in an independent population, the GAD group. This study presents an application of statistical methods in AD genetics and uncovers genetic associations that may be valuable for future study of the mechanisms and epidemiology of AD.

Materials and Methods

Rare allele model

Existing methods for associating low-frequency and rare alleles with a disease, such as burden tests and variance component tests, are intended to identify whether a gene is significant rather than whether any given low-frequency or rare variants are significant. However, machine learning methods enable the identification of clusters of covariates associated with a specific outcome and have been shown to be useful in studying AD in the past (Berna et al., 2020; Paternoster et al., 2018; Thijs et al., 2017). We approached the identification of low-frequency and rare alleles associated with disease as a clustering problem. Low-frequency and rare alleles (defined as SNPs with a MAF < 5%) were considered predictor variables, and a measure of disease outcome was considered the outcome variable. By iteratively drawing subsets of our predictor variables and evaluating their association with disease outcome, we sought to identify groups or clusters of low-frequency and rare alleles associated with different disease outcomes. Our outcome measure (the response variable) is described below. We used a genetic algorithm, implemented by the rbga.bin function from the genalg 0.2.0 package in R, to identify clusters of low-frequency and rare alleles (Chatterjee et al., 1996). The rbga.bin function was used to select alleles, which were then evaluated with GEE for association with AD remission/persistence. To iteratively improve the group of alleles selected, the rbga.bin function requires an outcome metric. This outcome metric was the GEE model’s OR for the association. Each run of the genetic algorithm drew 2,500 subsets of low-frequency and rare alleles (divided into 50 iterations, each with a population size of 50), with replacement, from the set of all low-frequency and rare alleles. Multiple runs of the algorithm were computed, and the results across runs were aggregated; alleles that occurred in ≥20 runs were included in the final rare allele groups.
Because the OR of the association was used to inform allele selection, this method represents a supervised clustering approach. P-values were computed from the GEE models. Because numerous independent estimates were computed to obtain the final rare allele groups, P-values were corrected for multiple comparisons according to a Bonferroni correction, accounting for the total number of estimates computed.
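The selection loop described in this section can be sketched in a few lines. The following Python example is only an illustration under stated assumptions, not the authors' R implementation: it replaces the GEE odds ratio with a pooled 2×2 odds ratio (with a 0.5 continuity correction), uses a simple genetic algorithm with truncation selection, one-point crossover, and bit-flip mutation rather than the genalg rbga.bin internals, and all function names (pooled_odds_ratio, run_ga, stable_alleles) are hypothetical.

```python
import random
from collections import Counter


def pooled_odds_ratio(mask, carriers, outcomes):
    """Odds of clearance for carriers of any selected allele vs. everyone else.

    Stand-in for the GEE odds ratio: visits are pooled, and a 0.5 continuity
    correction keeps the ratio finite when a cell of the table is empty.
    """
    table = [[0, 0], [0, 0]]  # table[is_carrier][cleared]
    for person_alleles, visits in zip(carriers, outcomes):
        is_carrier = int(any(m and a for m, a in zip(mask, person_alleles)))
        for cleared in visits:
            table[is_carrier][cleared] += 1
    (n00, n01), (n10, n11) = table
    return ((n11 + 0.5) * (n00 + 0.5)) / ((n10 + 0.5) * (n01 + 0.5))


def run_ga(carriers, outcomes, n_alleles, rng, pop_size=50, iters=50, p_mut=0.01):
    """One run: evolve 0/1 masks over alleles, maximizing the pooled OR."""
    def fitness(mask):
        return pooled_odds_ratio(mask, carriers, outcomes)

    pop = [[int(rng.random() < 0.1) for _ in range(n_alleles)]
           for _ in range(pop_size)]
    for _ in range(iters):
        ranked = sorted(pop, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]      # truncation selection
        children = [ranked[0][:]]              # elitism: keep the best mask
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_alleles)  # one-point crossover
            child = a[:cut] + b[cut:]
            children.append([bit ^ (rng.random() < p_mut) for bit in child])
        pop = children
    return max(pop, key=fitness)


def stable_alleles(carriers, outcomes, n_alleles, n_runs=100, min_runs=20,
                   seed=0, **ga_kwargs):
    """Repeat the GA and keep alleles selected in at least `min_runs` runs."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_runs):
        best = run_ga(carriers, outcomes, n_alleles, rng, **ga_kwargs)
        counts.update(i for i, bit in enumerate(best) if bit)
    return sorted(i for i, c in counts.items() if c >= min_runs)
```

Here `carriers` is a list of per-person 0/1 carrier flags (one per allele) and `outcomes` is a list of per-person binary clearance reports. The defaults mirror the 50 iterations, population size of 50, and ≥20-run threshold given in the text; a faithful reproduction would swap the pooled odds ratio for a GEE fit that accounts for repeated measures within individuals.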

Simulation studies

To assess the performance of our genetic algorithm‒based approach and to verify the model’s ability to accurately identify the alleles associated with disease, we conducted sensitivity analyses. We evaluated the model’s behavior when (i) low-frequency and rare alleles were associated with more severe disease, (ii) low-frequency and rare alleles were associated with less severe disease, and (iii) no low-frequency and rare alleles were disease-associated. A simulated dataset of 100 alleles, each with a MAF = 0.01, was constructed, for which binary outcome measures of disease clearance were recorded at 6-month intervals for 12 total years (the simulated data were intentionally reflective of the PEER data, in which binary outcome measures were recorded at 6-month intervals for ~12 years). Outcomes for each allele were binomially distributed, with a mean probability of a positive outcome varying between 0 (outcome never occurs) and 1 (outcome always occurs). For initial simulations, alleles were assigned outcomes according to a binomial distribution with means of 0.95, 0.05, and 0.5 to represent a strong positive association with AD clearance, strong negative association with AD clearance, and no association with AD clearance, respectively. First, the model was run 100 times on a dataset where 10 alleles had a probability of clearance (PoC) of 0.95 (the remaining alleles were assigned a PoC of 0.5). Second, the model was run 100 times on a dataset where 10 alleles had a PoC of 0.05 (the remaining alleles were assigned a PoC of 0.5). Third, the model was run 100 times on a dataset in which all 100 alleles had a PoC = 0.5. We next evaluated the algorithm’s ability to identify a variable number of associated alleles. We applied the algorithm to a simulated dataset of 100 alleles, as mentioned earlier. In this simulation, X alleles were assigned a PoC of 0.95 (the remaining alleles were assigned a PoC = 0.5). We ran 100 simulations each for every value of X between 5 and 15.
We also evaluated the model's ability to identify alleles as the strength of association varied. In this case, we assigned 10 of 100 alleles a PoC of Z (the remainder had a PoC = 0.5). We ran 10 simulations for each value of Z between 0.55 and 0.95, incrementing by 0.05 each time. We finally evaluated the model's ability to identify low-frequency and rare alleles as the pool of low-frequency and rare alleles increased in size. Three simulations of 100 runs each were computed for population sizes of 50, 100, and 500. Truly associated alleles represented 10% of the total alleles. t-Tests were used to compare group means. P-values for these t-tests are presented in the respective figures.
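A simulated dataset of the kind described above could be generated as follows. This is a minimal sketch, not the authors' simulation code: the cohort size, the use of 24 visits to represent 6-month intervals over 12 years, the carrier probability derived from the MAF, and the rule that a carrier of any associated allele reports clearance with the associated PoC are all assumptions, and simulate_cohort is a hypothetical name.

```python
import random


def simulate_cohort(n_people=300, n_alleles=100, n_associated=10,
                    poc_associated=0.95, poc_null=0.5, maf=0.01,
                    n_visits=24, seed=0):
    """Simulate 0/1 carrier flags and binary clearance reports per person.

    Assumptions (not taken from the paper): cohort size, carrier probability
    1 - (1 - maf)**2, and one shared PoC for carriers of any associated allele.
    """
    rng = random.Random(seed)
    p_carrier = 1 - (1 - maf) ** 2  # at least one minor allele in two copies
    carriers, outcomes = [], []
    for _ in range(n_people):
        alleles = [int(rng.random() < p_carrier) for _ in range(n_alleles)]
        # By convention here, the first `n_associated` alleles are associated.
        has_associated = any(alleles[:n_associated])
        poc = poc_associated if has_associated else poc_null
        visits = [int(rng.random() < poc) for _ in range(n_visits)]
        carriers.append(alleles)
        outcomes.append(visits)
    return carriers, outcomes


if __name__ == "__main__":
    carriers, outcomes = simulate_cohort()
    print(len(carriers), len(carriers[0]), len(outcomes[0]))
```

Varying `n_associated`, `poc_associated`, or `n_alleles` corresponds to the three sensitivity analyses (X, Z, and pool size) described in this section.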

Participants and genetic data

Genetic data were obtained from a subset of the PEER study for which DNA samples were available. The PEER study was approved by the Institutional Review Board of the University of Pennsylvania (Philadelphia, PA), and written informed consent was obtained from all participants or from their caregivers. Both the overall PEER cohort and the subset with genetic data have been previously described (Berna et al., 2021; Margolis et al., 2014c). This study examines only individuals self-described as African American. Self-described ancestry has previously been shown to correlate strongly with genetic markers of race within this cohort (Lou et al., 2019). We chose to examine only African American individuals because of the increased genetic variability observed in this population and the relative paucity of AD genetic studies of African Americans. DNA was collected with Oragene DNA collection kits (DNA Genotek, Ottawa, Canada). Genotyping of FLG by massively parallel sequencing was conducted on the 326 African American PEER individuals with sufficient DNA. Genotyping was by targeted capture using the Agilent SureSelect Platform (Agilent, Santa Clara, CA). Sequencing was performed on an Illumina HiSeq 4000 (Illumina, San Diego, CA). Raw sequencing data were aligned and mapped to the reference genome GRCh37 using the Burrows-Wheeler Aligner, version 0.7.17-r1188 (Li and Durbin, 2010). Single nucleotide variant and insertion and deletion calling were accomplished using the Genome Analysis Toolkit HaplotypeCaller, version 3.7 (Broad Institute, Boston, MA), after following Genome Analysis Toolkit best practices for realignment and recalibration (DePristo et al., 2011; Poplin et al., 2018; Van der Auwera et al., 2013).

Application of the uncommon allele model to PEER low-frequency and rare alleles

Before applying the uncommon allele model to low-frequency and rare FLG alleles found in PEER, alleles were filtered as shown in Figure 3. First, we chose a priori to examine only alleles with a MAF < 5%. Then, we removed the previously identified LoF variants (as identified in Margolis et al. [2018]). Then, alleles correlated with LoF alleles (R2 > 0.1) were removed. The uncommon allele model was applied to the remaining alleles. All alleles were read to a mean depth ≥30, with the majority read to a depth ≥100. After extensive discussion with our genetics partners, we concluded that this sequencing depth is more than adequate to accurately call even the rarest of these alleles. Sequencing depth for each of the 46 uncommon alleles of interest is presented in Table 6.
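The three filtering steps (MAF < 5%, removal of known LoF variants, removal of alleles with R2 > 0.1 with an LoF allele) can be sketched as follows. This is an illustrative Python sketch, not the authors' pipeline. The function `filter_alleles`, the toy variant identifiers, and the assumption that each allele is summarized by its largest R2 with any LoF allele are all hypothetical.

```python
def filter_alleles(maf_by_allele, lof_ids, max_r2_with_lof,
                   maf_cutoff=0.05, r2_cutoff=0.1):
    """Return the allele IDs that pass all three filters, in input order.

    maf_by_allele   : dict of allele ID -> minor allele frequency
    lof_ids         : set of allele IDs already known to be loss of function
    max_r2_with_lof : dict of allele ID -> largest R2 with any LoF allele
                      (assumed summary; alleles missing here are treated as 0)
    """
    kept = []
    for allele_id, maf in maf_by_allele.items():
        if maf >= maf_cutoff:
            continue  # step 1: keep only low-frequency and rare alleles
        if allele_id in lof_ids:
            continue  # step 2: drop previously identified LoF variants
        if max_r2_with_lof.get(allele_id, 0.0) > r2_cutoff:
            continue  # step 3: drop alleles correlated with an LoF allele
        kept.append(allele_id)
    return kept

# Toy input with made-up identifiers and values.
toy_maf = {"var_a": 0.20, "var_b": 0.010, "var_c": 0.003, "var_d": 0.040}
toy_lof = {"var_b"}
toy_r2 = {"var_c": 0.02, "var_d": 0.35}

remaining = filter_alleles(toy_maf, toy_lof, toy_r2)
```

In this toy input, `var_a` fails the MAF filter, `var_b` is a known LoF variant, and `var_d` is correlated with an LoF allele, so only `var_c` remains.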
Table 6. Sequencing Depth for Each Allele in Our 46 Allele Composite

Location | Nucleotide Change | RSID | Mean Total Depth | Mean Alternate Allele Depth | Alternate Allele Fraction | Number of Individuals with Each Allele
152275810 | G>A | rs150496930 | 463 | 218 | 0.471 | 1
152276248 | C>A | rs371128626 | 339 | 158 | 0.466 | 3
152276858 | A>G | rs150047484 | 285 | 137 | 0.481 | 1
152276976 | G>A | rs140294281 | 395 | 190 | 0.481 | 4
152276979 | C>T | rs145466389 | 530 | 276 | 0.521 | 2
152277184 | T>C | rs146234375 | 179 | 89 | 0.497 | 1
152277637 | G>A | ¹ | 197 | 84 | 0.426 | 1
152277738 | G>C | rs774463249 | 95 | 53 | 0.558 | 1
152277769 | A>C | rs143183339 | 398 | 213 | 0.535 | 2
152277794 | C>T | rs146288788 | 754 | 371 | 0.492 | 1
152277905 | G>A | rs148315024 | 113 | 56 | 0.496 | 1
152278525 | C>T | ¹ | 60 | 33 | 0.55 | 1
152278706 | G>A | rs141172870 | 590 | 175 | 0.297 | 3
152278787 | G>A | ¹ | 126 | 112 | 0.889 | 1
152279561 | C>T | rs146849256 | 357 | 183 | 0.513 | 3
152280131 | A>G | rs966449727 | 141 | 64 | 0.454 | 1
152280134 | G>A | rs141651911 | 353 | 184 | 0.521 | 2
152280330 | T>A | rs368784083 | 243 | 122 | 0.502 | 2
152281133 | G>A | rs151189270 | 548 | 156 | 0.285 | 1
152281389 | C>T | rs138652718 | 679 | 182 | 0.268 | 3
152282238 | G>A | rs151199504 | 137 | 75 | 0.547 | 1
152282384 | T>A | rs200033409 | 224 | 86 | 0.384 | 1
152282684 | G>A | rs151103850 | 241 | 140 | 0.581 | 28
152283239 | G>A | rs772994159 | 384 | 189 | 0.492 | 1
152283742 | G>T | rs142660239 | 440 | 214 | 0.486 | 9
152283962 | T>C | rs140646945 | 343 | 166 | 0.484 | 3
152284289 | C>G | rs145939718 | 230 | 108 | 0.47 | 1
152284376 | C>T | rs149106390 | 469 | 288 | 0.614 | 4
152284382 | C>T | rs149595328 | 422 | 183 | 0.434 | 7
152284540 | C>T | rs547196696 | 286 | 121 | 0.423 | 2
152285021 | C>G | rs148739675 | 450 | 215 | 0.478 | 2
152285119 | G>C | rs201522026 | 214 | 103 | 0.481 | 2
152285647 | C>T | rs749798893 | 473 | 225 | 0.476 | 1
152285807 | G>T | rs12036682 | 388 | 220 | 0.567 | 5
152285981 | G>A | rs184361545 | 503 | 263 | 0.523 | 1
152286006 | C>A | ¹ | 157 | 74 | 0.471 | 1
152286029 | G>A | rs141885805 | 404 | 209 | 0.517 | 1
152286043 | TG>T | ¹ | 161 | 80 | 0.497 | 1
152286061 | G>T | rs144808372 | 356 | 158 | 0.444 | 1
152286118 | C>A | ¹ | 268 | 133 | 0.496 | 1
152287645 | G>A | rs991098106 | 66 | 32 | 0.485 | 1
152289544 | C>T | rs1010015022 | 260 | 116 | 0.446 | 1
152291412 | G>C | rs775513539 | 194 | 98 | 0.505 | 1
152293991 | A>T | ¹ | 125 | 53 | 0.424 | 1
152296514 | C>T | rs554188171 | 184 | 99 | 0.538 | 2
152296874 | C>G | rs550028010 | 348 | 186 | 0.534 | 1

Abbreviation: RSID, Reference SNP cluster ID.

¹RSID unavailable.

Outcome measure

Disease clearance was defined using a self-reported outcome of whether or not a child's skin was symptom free during the previous 6 months. Because children in PEER could be followed for >10 years, participants could have multiple reports of this outcome over time. The association between these outcomes and individual low-frequency and rare alleles was evaluated with GEEs for binary outcomes, assuming an exchangeable working correlation structure with empirical standard errors. This GEE model provided an estimate of the likelihood of AD improvement over time, which we interpreted as a measure of AD remission. An OR > 1 indicates that a risk factor increases the odds of remission; an OR < 1 indicates that a risk factor decreases the odds of disease remission. P-values reported are from these GEE estimates. GEE models were implemented through the geeglm function from the R package geepack 1.2-1.
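To make the OR interpretation concrete, the following Python sketch computes a crude pooled odds ratio from a 2x2 table of symptom-free reports. It is deliberately simpler than the analysis described above: it ignores the within-child correlation of repeated reports that the GEE with an exchangeable working correlation accounts for, and it yields no standard errors. The function name and all counts are hypothetical.

```python
def pooled_odds_ratio(clear_carrier, not_clear_carrier,
                      clear_noncarrier, not_clear_noncarrier):
    """Crude odds ratio of a symptom-free report, carriers vs. non-carriers.

    This pools all 6-month reports and treats them as independent, which a
    GEE does not do; it is shown only to illustrate what OR > 1 means.
    """
    if min(not_clear_carrier, clear_noncarrier, not_clear_noncarrier) <= 0:
        raise ValueError("odds ratio undefined when a required cell is zero")
    odds_carrier = clear_carrier / not_clear_carrier
    odds_noncarrier = clear_noncarrier / not_clear_noncarrier
    return odds_carrier / odds_noncarrier

# Made-up counts: carriers report clear skin in 30 of 50 reports,
# non-carriers in 200 of 500 reports.
crude_or = pooled_odds_ratio(30, 20, 200, 300)
# Odds are 1.5 for carriers and 2/3 for non-carriers, so crude_or is 2.25:
# in this toy example, carrying the allele raises the odds of remission.
```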

Statistics and visualizations

Demographic characteristics are presented with means and SDs, as appropriate. Plots of clearance over time (with 95% confidence intervals) were constructed to show temporal differences in outcome measures for different groups. It is important to note that although these graphs provide an intuitive visualization of longitudinal differences in the disease course, the GEE models are not computing ORs on the basis of these curves. Plots of low-frequency and rare alleles along a gene were created using the lolliplot function in R, utilizing the locations in the Single Nucleotide Polymorphism Database (National Center for Biotechnology Information, Bethesda, MD), using reference genome GRCh37 (https://www.ncbi.nlm.nih.gov/snp/). The R2 between the associated alleles identified in the analysis discussed earlier was calculated to show the correlation between alleles within this study population and to show that these alleles are largely not collinear. R2 plots were generated in Haploview (Broad Institute, Boston, MA) (https://www.broadinstitute.org/haploview/haploview). Because these R2 calculations represent the true R2 values within our population of interest and are not intended to represent the R2 within any broader population, they are not powered to any condition.

GAD cohort

As a secondary study, we evaluated the contribution of these alleles to AD risk in a different population. Specifically, we evaluated the association of our low-frequency and rare alleles with AD in an independent cohort, the GAD group. Individuals in GAD were examined by dermatologists with expertise in the diagnosis of AD (from the University of Pennsylvania Perelman School of Medicine [Philadelphia, PA], Children’s Hospital of Philadelphia [Philadelphia, PA], Pennsylvania State University/Hershey Medical Center [Hershey, PA], and Washington University School of Medicine in St Louis [St Louis, MO]). All subjects had a history and an examination consistent with AD (cases) or had no evidence of AD by history and examination (controls). For this study, we analyzed only individuals who were African American, by self-report. All subjects or legal guardians provided written informed consent or, if appropriate, assent approved by their appropriate Institutional Review Board. Genotyping in the GAD cohort was carried out as previously described (Margolis et al., 2020). We then assessed the association between our rare allele composite and the presence of AD using a chi-square test. P-values presented for all GAD cohort analyses are from chi-square tests. All analyses were implemented in R, version 3.6.1.
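A chi-square test of association between carrying the rare allele composite and AD status can be sketched for a 2x2 table as below. This Python sketch uses the Pearson statistic without a continuity correction; the text does not state whether a correction was applied, so that choice is an assumption, and the counts are made up for illustration.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 degree of freedom, no continuity correction).

    Table layout (assumed):
                    AD     no AD
        carrier      a       b
        non-carrier  c       d
    Returns (statistic, p_value).
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column must have a nonzero total")
    stat = n * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Made-up counts: 10 of 40 carriers and 90 of 160 non-carriers have AD.
stat, p_value = chi_square_2x2(10, 30, 90, 70)
```

With these toy counts the statistic is 12.5, and carriers have AD less often than non-carriers, which is the direction of association reported for the composite.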

Data availability statement

The R code for the entire project was uploaded to FigShare and is publicly available. The DOI for the code is https://doi.org/10.6084/m9.figshare.14569467.v1. The Pediatric Eczema Elective Registry data (source) are not currently publicly available. The Pediatric Eczema Elective Registry study is an ongoing study sponsored by Valeant in response to a postmarketing commitment with the Food and Drug Administration. The GAD data (source) are not currently publicly available because this study is still enrolling. A Browser Extensible Data file of the probes used for targeted capture is available on request from the corresponding author.

ORCIDs

Ronald Berna: http://orcid.org/0000-0003-0520-1218
Nandita Mitra: http://orcid.org/0000-0002-7714-3910
Ole Hoffstad: http://orcid.org/0000-0002-0261-903X
Bradley Wubbenhorst: http://orcid.org/0000-0001-8489-3659
Katherine L. Nathanson: http://orcid.org/0000-0002-6740-0901
David J. Margolis: http://orcid.org/0000-0002-0506-8085

Author Contributions

Conceptualization: RB, DJM; Data Curation: RB, OH, BW; Formal Analysis: RB, DJM; Funding Acquisition: DJM; Investigation: RB, DJM; Methodology: RB, DJM, NM, KLN; Project Administration: RB, DJM, OH; Resources: DJM, KLN; Software: RB, OH, DJM, BW; Supervision: DJM, NM, KLN; Validation: RB, DJM, OH; Visualization: RB, DJM; Writing - Original Draft Preparation: RB, DJM; Writing - Review and Editing: RB, DJM, NM, KLN

References (35 in total)

1.  Optimal tests for rare variant effects in sequencing association studies.

Authors:  Seunggeun Lee; Michael C Wu; Xihong Lin
Journal:  Biostatistics       Date:  2012-06-14       Impact factor: 5.899

2.  Filaggrin mutations strongly predispose to early-onset and extrinsic atopic dermatitis.

Authors:  Stephan Weidinger; Elke Rodríguez; Caroline Stahl; Stefan Wagenpfeil; Norman Klopp; Thomas Illig; Natalija Novak
Journal:  J Invest Dermatol       Date:  2006-11-09       Impact factor: 8.551

3.  Atopic dermatitis.

Authors:  Thomas Bieber
Journal:  N Engl J Med       Date:  2008-04-03       Impact factor: 91.245

4.  Identifying Phenotypes of Atopic Dermatitis in a Longitudinal United States Cohort Using Unbiased Statistical Clustering.

Authors:  Ronald Berna; Nandita Mitra; Ole Hoffstad; Joy Wan; David J Margolis
Journal:  J Invest Dermatol       Date:  2019-08-22       Impact factor: 8.551

5.  Uncommon Filaggrin Variants Are Associated with Persistent Atopic Dermatitis in African Americans.

Authors:  David J Margolis; Nandita Mitra; Heather Gochnauer; Bradley Wubbenhorst; Kurt D'Andrea; Adam Kraya; Ole Hoffstad; Jayanta Gupta; Brian Kim; Albert Yan; Zelma Chiesa Fuxench; Katherine L Nathanson
Journal:  J Invest Dermatol       Date:  2018-02-08       Impact factor: 8.551

6.  Liquid-liquid phase separation drives skin barrier formation.

Authors:  Felipe Garcia Quiroz; Vincent F Fiore; John Levorse; Lisa Polak; Ellen Wong; H Amalia Pasolli; Elaine Fuchs
Journal:  Science       Date:  2020-03-13       Impact factor: 47.728

7.  Association between fine mapping thymic stromal lymphopoietin and atopic dermatitis onset and persistence.

Authors:  Carolyn Lou; Nandita Mitra; Bradley Wubbenhorst; Kurt D'Andrea; Ole Hoffstad; Brian S Kim; Albert Yan; Andrea L Zaenglein; Zelma Chiesa Fuxench; Katherine L Nathanson; David J Margolis
Journal:  Ann Allergy Asthma Immunol       Date:  2019-09-04       Impact factor: 6.347

8.  Persistence of mild to moderate atopic dermatitis.

Authors:  Jacob S Margolis; Katrina Abuabara; Warren Bilker; Ole Hoffstad; David J Margolis
Journal:  JAMA Dermatol       Date:  2014-06       Impact factor: 10.282

9.  Filaggrin-2 variation is associated with more persistent atopic dermatitis in African American subjects.

Authors:  David J Margolis; Jayanta Gupta; Andrea J Apter; Tapan Ganguly; Ole Hoffstad; Maryte Papadopoulos; Tim R Rebbeck; Nandita Mitra
Journal:  J Allergy Clin Immunol       Date:  2013-11-01       Impact factor: 10.793

10.  Fast and accurate long-read alignment with Burrows-Wheeler transform.

Authors:  Heng Li; Richard Durbin
Journal:  Bioinformatics       Date:  2010-01-15       Impact factor: 6.937
