Or Shemesh, Pazit Polak, Knut E A Lundin, Ludvig M Sollid, Gur Yaari.
Abstract
Celiac disease (CeD) is a common autoimmune disorder caused by an abnormal immune response to dietary gluten proteins. The disease has high heritability. HLA is the major susceptibility factor, and the HLA effect is mediated via presentation of deamidated gluten peptides by disease-associated HLA-DQ variants to CD4+ T cells. In addition to gluten-specific CD4+ T cells, patients have antibodies to transglutaminase 2 (the autoantigen) and to deamidated gluten peptides. These disease-specific antibodies recognize defined epitopes, and they display common usage of specific heavy and light chains across patients. Interactions between T cells and B cells are likely central to the pathogenesis, but how the repertoires of naïve T and B cells relate to the pathogenic effector cells is unexplored. To this end, we applied machine learning classification models to naïve B cell receptor (BCR) repertoires from CeD patients and healthy controls. Strikingly, we obtained a promising classification performance, with an F1 score of 85%. Clusters of heavy and light chain sequences were inferred and used as features for the model, and signatures associated with the disease were then characterized. These signatures included amino acid (AA) 3-mers with distinct biophysicochemical characteristics and enriched V and J genes. We found that CeD-associated clusters can be identified and that common motifs can be characterized from naïve BCR repertoires. The results may indicate a genetic influence of BCR-encoding genes in CeD. Analysis of naïve BCRs as presented here may become an important part of assessing an individual's risk of developing CeD. Our model demonstrates the potential of using BCR repertoires, and in particular naïve BCR repertoires, as disease susceptibility markers.
Keywords: BCR repertoire; celiac disease; immune response; machine learning; naïve B-cells
Year: 2021 PMID: 33790900 PMCID: PMC8006302 DOI: 10.3389/fimmu.2021.627813
Source DB: PubMed Journal: Front Immunol ISSN: 1664-3224 Impact factor: 7.561
Figure 1. Workflow scheme. Experimental and computational analysis included the following steps: collecting blood samples from individuals with CeD and healthy controls, sequencing naïve B cell repertoires, and analyzing the repertoires. Repertoire analysis consisted of V(D)J sequence annotation, creation of repertoire representations by antibody clustering, identification of clinically predictive features using ML methods, and characterization of disease biomarker motifs.
Figure 2. Classification performance of the ML model using heavy and light chain variable region repertoires. Bar graphs show the average F1 score ± standard deviation (black line) across 1,000 iterations. The light blue bar shows the result for the heavy chain (HC) dataset, the turquoise bar for the light chain dataset, the green bar for the integrated heavy and light chain (HC_LC) dataset, and the gray bar for a control analysis using the HC_LC dataset with random labels.
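The summary statistic in Figure 2 (mean F1 ± standard deviation over repeated iterations) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `f1_score` and `summarize_f1` are hypothetical, the positive class is assumed to be labeled 1, and the use of the sample standard deviation is an assumption, since the caption does not say which estimator was used.

```python
import statistics


def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def summarize_f1(scores):
    """Mean and sample standard deviation of per-iteration F1 scores.

    ASSUMPTION: sample (n - 1) standard deviation; the source does not
    specify the estimator.
    """
    return statistics.mean(scores), statistics.stdev(scores)
```

In a random-label control like the gray bar, the same summary would be computed after shuffling the class labels, which gives a baseline for how much of the score is attributable to chance.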
Figure 3. Characterization of key features used by our ML model. (A) Aggregation of feature selection outcomes. The y-axis indicates the frequency of feature selection across 1,000 subsamples, and the x-axis represents the feature rank index. (B) V gene usage within the features selected in more than 40% of subsamples. (C) J gene usage within the features selected in more than 40% of subsamples. (D) CDR3 length distribution within the features selected in more than 40% of subsamples. (E) Cluster set enrichment results. Top normalized enrichment score (NES) outcomes of three cluster set analyses (V-gene, J-gene, and CDR3-length). Colors indicate whether the FDR for multiple hypotheses is lower than 0.05, with blue for TRUE and red for FALSE. (F) A conjoint AA-triad (CT) motif is depleted in CeD-associated features. The left panel graphically displays the division of AAs into seven groups and an example representation of the possible AA conjoint triads (as described in section 2). The right panel is a volcano plot showing the k-mer enrichment analysis with CT representation of selected clusters. The statistical significance of the difference in k-mer expression is plotted as the log10 q-value (FDR-adjusted p-value, where q<0.05 is considered significant; horizontal line). The x-axis indicates the magnitude of the change, plotting the fold-change ratios on a log2 scale (log2FC). The blue point marks the point of interest, which represents the depleted k-mer. Non-significant k-mers are shown as gray points. (G) The classifier weights for the three residue positions, provided by the MIL method (7). Columns represent the categories of five biophysicochemical factors. Positive weight values are shown as upward-facing bars, and negative weight values as downward-facing bars. The length of each bar corresponds to the weight's magnitude, and the color corresponds to the position in the snippet.
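The conjoint-triad representation and the log2 fold change in panel (F) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the seven-group partition below is the one commonly used for conjoint triads in the literature and may differ from the grouping defined in section 2, and the helper names (`ct_encode`, `ct_triads`, `log2_fold_change`) and the pseudocount are hypothetical.

```python
import math
from collections import Counter

# ASSUMPTION: a commonly used seven-class amino acid partition for
# conjoint triads; the paper's own grouping is defined in its section 2.
CT_GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_GROUP = {aa: str(i + 1) for i, group in enumerate(CT_GROUPS) for aa in group}


def ct_encode(seq):
    """Map each residue to its group label (1-7).

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    return "".join(AA_TO_GROUP[aa] for aa in seq)


def ct_triads(seq):
    """Count overlapping 3-mers of the group-encoded sequence."""
    encoded = ct_encode(seq)
    return Counter(encoded[i:i + 3] for i in range(len(encoded) - 2))


def log2_fold_change(count_a, total_a, count_b, total_b, pseudo=1.0):
    """log2 ratio of triad frequencies between two sets of sequences.

    ASSUMPTION: a pseudocount is added to each count to avoid division
    by zero; the source does not describe how zero counts are handled.
    """
    freq_a = (count_a + pseudo) / total_a
    freq_b = (count_b + pseudo) / total_b
    return math.log2(freq_a / freq_b)
```

Under this grouping, a sequence of five residues yields three overlapping triads. A negative log2FC for a triad would correspond to depletion in the first set relative to the second, which is the direction highlighted in the volcano plot.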