Samuel J Yang1, Scott L Lipnick2,3,4, Nina R Makhortova2,5, Subhashini Venugopalan1, Minjie Fan1, Zan Armstrong1, Thorsten M Schlaeger5, Liyong Deng6, Wendy K Chung6, Liadan O'Callaghan1, Anton Geraschenko1, Dosh Whye2, Marc Berndl1, Jon Hazard1, Brian Williams1, Arunachalam Narayanaswamy1, D Michael Ando1, Philip Nelson1, Lee L Rubin2,7.
Abstract
The etiological underpinnings of many CNS disorders are not well understood. This is likely due to the fact that individual diseases aggregate numerous pathological subtypes, each associated with a complex landscape of genetic risk factors. To overcome these challenges, researchers are integrating novel data types from numerous patients, including imaging studies capturing broadly applicable features from patient-derived materials. These datasets, when combined with machine learning, potentially hold the power to elucidate the subtle patterns that stratify patients by shared pathology. In this study, we interrogated whether high-content imaging of primary skin fibroblasts, using the Cell Painting method, could reveal disease-relevant information among patients. First, we showed that technical features such as batch/plate type, plate, and location within a plate lead to detectable nuisance signals, as revealed by a pre-trained deep neural network and analysis with deep image embeddings. Using a plate design and image acquisition strategy that accounts for these variables, we performed a pilot study with 12 healthy controls and 12 subjects affected by the severe genetic neurological disorder spinal muscular atrophy (SMA), and evaluated whether a convolutional neural network (CNN) generated using a subset of the cells could distinguish disease states on cells from the remaining unseen control-SMA pair. Our results indicate that these two populations could effectively be differentiated from one another and that model selectivity is insensitive to batch/plate type. One caveat is that the samples were also largely separated by source. These findings lay a foundation for how to conduct future studies exploring diseases with more complex genetic contributions and unknown subtypes.
Keywords: assay development; deep learning; disease modeling; high-content screening; spinal muscular atrophy
Year: 2019 PMID: 31284814 PMCID: PMC6710615 DOI: 10.1177/2472555219857715
Source DB: PubMed Journal: SLAS Discov ISSN: 2472-5552 Impact factor: 3.341
Figure 1. Plate layout design for a disease-focused experiment with 27 human fibroblast cell lines. Each square represents one well (on a 96-well plate) containing cells from one subject cell line (labeled with a two-digit subject ID). The images of the cells were used in two separate analyses with completely independent sets of subjects. In the first analysis, the gray wells representing three healthy control subjects (C1, C2, and C3) were used to assess the detectability of nuisance factors. The second analysis, for detecting disease state, used the green and magenta wells representing 24 experimental subjects (01, …, 24) consisting of 12 healthy subjects and 12 subjects with spinal muscular atrophy (SMA; five with SMA type 1 [SMA1], four with SMA2, and three with SMA3; SMA* refers to disease type). Unused wells were filled with media but contained no cells.
Figure 2. Flow chart of the three primary data analysis methods used. (Upper left) For the first two approaches, a pre-trained convolutional neural network (CNN) is used for dimensionality reduction to produce a 320-dimensional (320D) cell embedding (i.e., a numeric vector of length 320) for each segmented five-channel image of a cell. The per-dimension median across the cells in a well yields one embedding (i.e., one 320D point) per well, after which either t-distributed stochastic neighbor embedding (t-SNE) is used to further reduce the dimensionality so that each well is represented as a two-dimensional (2D) point for visualization, or a random forest classifier is trained to identify nuisance factors. (Bottom left) The final approach uses the original cellular images, labeled with healthy or disease status, to train a CNN to predict disease state.
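The per-well aggregation described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `well_median_embeddings`, the toy 3-dimensional vectors (standing in for the 320D embeddings), and the pairing of wells with values are assumptions made for the example.

```python
import numpy as np

def well_median_embeddings(cell_embeddings, well_ids):
    """Collapse per-cell embeddings into one median embedding per well.

    cell_embeddings: array-like of shape (n_cells, n_dims)
    well_ids: sequence of length n_cells giving each cell's well label
    Returns a dict mapping well label -> 1-D array of length n_dims.
    """
    cell_embeddings = np.asarray(cell_embeddings, dtype=float)
    well_ids = np.asarray(well_ids)
    return {
        # The median is taken independently along each embedding dimension.
        str(well): np.median(cell_embeddings[well_ids == well], axis=0)
        for well in np.unique(well_ids)
    }

# Toy data (hypothetical): 3-D vectors stand in for 320-D cell embeddings.
cells = [[1.0, 10.0, 0.0], [2.0, 20.0, 0.0], [9.0, 30.0, 3.0], [4.0, 4.0, 4.0]]
wells = ["B07", "B07", "B07", "H01"]
per_well = well_median_embeddings(cells, wells)
```

The resulting per-well vectors are what would then be passed to a 2D projection or to a classifier of nuisance factors.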
Figure 3. Cell Painting example and image focus analyses: (a) images of each stain acquired using Cell Painting; (b) image focus quality analysis as a function of position on six 96-well plates (PerkinElmer ViewPlate) for DAPI stain widefield images; (c) 128×128 crops around randomly sampled cells from well B07; (d) cropped cells from well H01; and (e) image focus quality analysis, similar to (b), but with a different plate type (Cellvis glass) and image acquisition scheme (maximum projections of a confocal z-stack).
Figure 4. Dimensionality reduction visualization with t-distributed stochastic neighbor embedding (t-SNE) of image embeddings from 24 experimental subjects. Each point represents the median cell image embedding from ~2000 cells in a single well, and the points are colored based on the following: (a) column; (b) row; (c) batch/plate type; (d) plate; and (e) disease condition.
Figure 5. Supervised learning assessment of nuisance signals. (a,d) The subset of wells, highlighted in yellow, corresponding to healthy control subjects (denoted by “C”; “E” denotes experimental subjects) that were selected on all 12 plates from both batches/plate types for analyses. (b,c,e,f) Accuracy from a random forest model trained on well-median 320-dimensional embeddings using threefold cross-validation, repeated five times. Error bars denote one standard deviation. Both the unmodified set of embeddings and a “permuted” baseline dataset (the same embeddings but with randomly permuted labels) were evaluated. (b) Column and (c) row predictions, both using the wells highlighted in (a). (e) Batch/plate type and (f) plate predictions, both using the wells highlighted in (d).
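The comparison between real and permuted labels under repeated threefold cross-validation can be sketched as below. To keep the example dependency-light, a nearest-centroid classifier is swapped in for the random forest named in the caption; the folds are not stratified, and the synthetic 8-dimensional data, function names, and seeds are all assumptions for illustration only.

```python
import numpy as np

def nearest_centroid_accuracy(X_train, y_train, X_test, y_test):
    """Accuracy of a nearest-centroid classifier (stand-in for a random forest)."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    # Distance from every test point to every class centroid.
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    pred = classes[dists.argmin(axis=1)]
    return float((pred == y_test).mean())

def repeated_kfold_accuracy(X, y, n_splits=3, n_repeats=5, seed=0):
    """Mean and standard deviation of accuracy over repeated k-fold splits."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_repeats):
        order = rng.permutation(len(y))  # reshuffle before each repeat
        for fold in np.array_split(order, n_splits):
            train = np.setdiff1d(order, fold)
            scores.append(
                nearest_centroid_accuracy(X[train], y[train], X[fold], y[fold])
            )
    return float(np.mean(scores)), float(np.std(scores))

# Synthetic "embeddings" (hypothetical): two separable groups of 30 wells each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (30, 8)), rng.normal(3.0, 1.0, (30, 8))])
y = np.array([0] * 30 + [1] * 30)

real_mean, real_std = repeated_kfold_accuracy(X, y)
# Permuted baseline: same embeddings, labels shuffled to break any real signal.
perm_mean, perm_std = repeated_kfold_accuracy(X, rng.permutation(y))
```

A nuisance factor counts as detectable in this scheme when accuracy on the real labels clearly exceeds accuracy on the permuted baseline.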
Figure 6. Convolutional neural network (CNN) performance in predicting disease state [i.e., healthy or spinal muscular atrophy (SMA)] of individual cell images from the listed unseen (i.e., not used during model training) subject pairs. The 24 experimental subjects, denoted by a two-digit subject ID, disease state, and lab source (A or B), are grouped into 12 subject pairs. Each bar denotes a CNN trained on images from the 11 other pairs of subjects and evaluated on images from an unseen subject pair using the well-level area under the receiver operating characteristic curve (AUC) metric.
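A well-level AUC for one held-out subject pair can be sketched as follows. The caption does not say how per-cell predictions were combined into a well-level score, so averaging is an assumption here, as are the function names, well labels, and toy scores.

```python
from collections import defaultdict

def well_level_scores(cell_scores, cell_wells):
    """Average per-cell scores within each well (averaging is an assumption)."""
    grouped = defaultdict(list)
    for score, well in zip(cell_scores, cell_wells):
        grouped[well].append(score)
    return {well: sum(vals) / len(vals) for well, vals in grouped.items()}

def auc(scores, labels):
    """AUC as the probability that a positive outranks a negative; ties count 0.5."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy held-out pair (hypothetical): wells A1/A2 from the SMA subject (label 1),
# wells B1/B2 from the healthy subject (label 0), two cells per well.
cell_scores = [0.9, 0.7, 0.6, 0.2, 0.4, 0.3, 0.8, 0.5]
cell_wells = ["A1", "A1", "A2", "A2", "B1", "B1", "B2", "B2"]
well_label = {"A1": 1, "A2": 1, "B1": 0, "B2": 0}

per_well = well_level_scores(cell_scores, cell_wells)
wells = sorted(per_well)
held_out_auc = auc([per_well[w] for w in wells], [well_label[w] for w in wells])
```

With these toy values, three of the four positive-negative well pairs are ranked correctly, so `held_out_auc` is 0.75. Repeating the evaluation with each of the 12 pairs held out in turn would give one such value per bar.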