Pegah Abed-Esfahani, Benjamin C Darwin, Derek Howard, Nick Wang, Ethan Kim, Jason Lerch, Leon French.
Abstract
High-resolution in situ hybridization (ISH) images of the brain capture spatial gene expression at cellular resolution. These spatial profiles are key to understanding brain organization at the molecular level. Previously, manual qualitative scoring and informatics pipelines have been applied to ISH images to determine expression intensity and pattern. To better capture the complex patterns of gene expression in the human cerebral cortex, we applied a machine learning approach. We propose gene re-identification as a contrastive learning task to compute representations of ISH images. We train our model on an ISH dataset of ~1,000 genes obtained from postmortem samples from 42 individuals. This model reaches a gene re-identification rate of 38.3%, a 13x improvement over random chance. We find that the learned embeddings predict expression intensity and pattern. To test generalization, we generated embeddings in a second dataset that assayed the expression of 78 genes in 53 individuals. In this set of images, 60.2% of genes are re-identified, suggesting the model is robust. Importantly, this dataset assayed expression in individuals diagnosed with schizophrenia. Gene- and donor-specific embeddings from the model predict schizophrenia diagnosis at levels similar to those reached with demographic information. Mutations in the most discriminative gene, Sodium Voltage-Gated Channel Beta Subunit 4 (SCN4B), may help understand cardiovascular associations with schizophrenia and its treatment. We have publicly released our source code, embeddings, and models to spur further application to spatial transcriptomics. In summary, we propose and evaluate gene re-identification as a machine learning task to represent ISH gene expression images.
Year: 2022 PMID: 35073334 PMCID: PMC8786163 DOI: 10.1371/journal.pone.0262717
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Overview of this study.
Patches are extracted from ISH images (left) and used to train a single ResNet50 model that shares weights in a triplet loss architecture (middle top). Learned embeddings are then evaluated at the level of genes and images. The trained ResNet model is used to embed patches from a second dataset of ISH images (bottom left). These embeddings are then evaluated at the image level to assess generalizability. Embeddings for individual genes at the level of donors are used to predict schizophrenia diagnosis (bottom right).
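The triplet loss named in this caption can be illustrated with a minimal sketch. It assumes Euclidean distance and a margin of 1.0; neither choice is stated in this record, and the toy 2-D embeddings are hypothetical. An anchor patch and a positive patch come from the same gene, while the negative patch comes from a different gene.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The margin value and the distance metric are assumptions for illustration.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy 2-D embeddings (hypothetical values, not taken from the study).
anchor = [0.0, 0.0]    # patch from gene A
positive = [0.0, 1.0]  # another patch from gene A
negative = [3.0, 4.0]  # patch from gene B

# The negative is already far from the anchor, so the loss is zero.
loss_easy = triplet_loss(anchor, positive, negative)
# Swapping the roles violates the margin and yields a positive loss.
loss_hard = triplet_loss(anchor, negative, positive)
```

Because the same network embeds all three patches, minimizing this quantity pulls patches of the same gene together and pushes patches of different genes apart.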
Rank-1 accuracy scores and fold increase over random in parentheses for each dataset split.
| Model | Training | Validation | Test |
|---|---|---|---|
| Random | 0.129% (1x) | 0.851% (1x) | 2.90% (1x) |
| ResNet Base | 4.86% (37.7x) | 18.8% (22.1x) | 24.9% (8.57x) |
| Triplet | 13.2% (102x) | 26.5% (31.1x) | 38.3% (13.2x) |
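As a rough illustration of how rank-1 accuracy and the fold increase in the table above could be computed, the sketch below scores each image by whether its nearest other image shows the same gene. The nearest-neighbour protocol, the Euclidean metric, and the toy embeddings are assumptions; the record does not describe the exact evaluation procedure (for example, whether candidates from the same donor are excluded).

```python
import math


def rank1_accuracy(embeddings, genes):
    """Fraction of images whose nearest other image shows the same gene."""
    hits = 0
    for i, query in enumerate(embeddings):
        best_j, best_d = None, math.inf
        for j, candidate in enumerate(embeddings):
            if j == i:
                continue  # an image may not match itself
            d = math.dist(query, candidate)
            if d < best_d:
                best_j, best_d = j, d
        hits += genes[best_j] == genes[i]
    return hits / len(embeddings)


def fold_over_random(accuracy_pct, random_pct):
    """Fold increase of an accuracy over the random baseline."""
    return accuracy_pct / random_pct


# Toy data (hypothetical): the last image of gene A lies closer to gene B.
toy_embeddings = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [4.5, 4.0]]
toy_genes = ["A", "A", "B", "B", "A"]
toy_rank1 = rank1_accuracy(toy_embeddings, toy_genes)

# Reproduce the reported test-set fold increase from the table's percentages.
test_fold = round(fold_over_random(38.3, 2.90), 1)
```

With the table's values, 38.3% divided by the 2.90% random baseline gives the reported 13.2x.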
Fig 2. Gene-level embeddings represented in a 2-dimensional UMAP projection.
Genes are coloured to represent summary descriptions of ISH images by expert neuroanatomists as described in Zeng et al. (2012). Specifically, colours mark the level of expression in the V1 region and range from - (no signal) to +++++ (over labelling) (panel A). Panels B and C are coloured by summarized expression pattern and cell-type specific expression, respectively.
Gene counts and prediction performance (AUC, AUC-PR, maximum p-value across folds) for Zeng et al. pattern description annotations.
| Pattern | # Genes | ResNet Base AUC | ResNet Base AUC-PR | ResNet Base max p-value | Triplet AUC | Triplet AUC-PR | Triplet max p-value |
|---|---|---|---|---|---|---|---|
| widespread | 417 | 0.893 | 0.855 | 1.13E-17 | 0.943 | 0.931 | 2.72E-24 |
| sparse | 16 | 0.655 | 0.091 | 5.85E-01 | 0.713 | 0.092 | 8.83E-01 |
| scattered | 151 | 0.819 | 0.541 | 3.07E-07 | 0.896 | 0.697 | 6.40E-10 |
| not determined | 263 | 0.911 | 0.769 | 5.38E-18 | 0.969 | 0.917 | 1.01E-20 |
| laminar | 125 | 0.731 | 0.336 | 5.27E-03 | 0.877 | 0.594 | 3.86E-07 |
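For readers unfamiliar with the AUC column, the sketch below computes it through its rank-based interpretation: the probability that a randomly chosen positive gene receives a higher score than a randomly chosen negative gene, with ties counted as one half. This is the standard equivalence, not necessarily the routine the authors used, and the scores and labels are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores for four genes; label 1 marks, say, "laminar".
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 0, 1, 0]
toy_auc = auc(toy_scores, toy_labels)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the table's values should be read.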
AUC scores for the top ten genes most predictive of schizophrenia diagnosis.
| Gene Symbol | Number of Donors | Embeddings | Demographic variables | Demographics + embeddings |
|---|---|---|---|---|
| SCN4B | 52 | 0.835 | 0.857 | 0.832 |
| BDNF | 51 | 0.829 | 0.840 | 0.739 |
| CUX2 | 53 | 0.779 | 0.763 | 0.793 |
| RASGRF2 | 53 | 0.750 | 0.763 | 0.751 |
| CIT | 53 | 0.735 | 0.763 | 0.726 |
| TRMT9B | 50 | 0.734 | 0.773 | 0.654 |
| PVALB | 53 | 0.720 | 0.763 | 0.727 |
| PPP1R1B | 52 | 0.712 | 0.832 | 0.750 |
| GAD2 | 53 | 0.702 | 0.763 | 0.801 |
| CTGF | 52 | 0.696 | 0.760 | 0.726 |
Fig 3. Patches of SCN4B ISH expression images that most (top) and least (bottom) activate embedding dimension 40.
Each patch is from a unique donor, and the diagnostic group is labelled for each.
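Selecting the patches that most and least activate a single embedding dimension amounts to sorting patches by that coordinate. The sketch below is one way to do this; the helper name `extreme_patches`, the toy embeddings, and the zero-based indexing are assumptions (the record does not say whether "dimension 40" is zero- or one-based), and the restriction to one patch per donor mentioned in the caption is not implemented here.

```python
def extreme_patches(embeddings, dim, k=3):
    """Return (top_k, bottom_k) patch indices ranked by one embedding dimension.

    top_k is ordered from the highest activation downward,
    bottom_k from the lowest activation upward.
    """
    order = sorted(range(len(embeddings)), key=lambda i: embeddings[i][dim], reverse=True)
    return order[:k], order[-k:][::-1]


# Hypothetical 2-D embeddings for five patches; we inspect dimension 1.
toy_patch_embeddings = [
    [0.1, 0.2],
    [0.3, 0.9],
    [0.7, -0.5],
    [0.2, 0.4],
    [0.5, 0.0],
]
top_idx, bottom_idx = extreme_patches(toy_patch_embeddings, dim=1, k=2)
```

Viewing the image patches at the returned indices side by side gives a qualitative sense of what the chosen dimension encodes.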