Abel Szkalisity, Filippo Piccinini, Attila Beleon, Tamas Balassa, Istvan Gergely Varga, Ede Migh, Csaba Molnar, Lassi Paavolainen, Sanna Timonen, Indranil Banerjee, Elina Ikonen, Yohei Yamauchi, Istvan Ando, Jaakko Peltonen, Vilja Pietiäinen, Viktor Honti, Peter Horvath.
Abstract
Biological processes are inherently continuous, and the chance of phenotypic discovery is significantly restricted by discretising them. Using multi-parametric active regression we introduce the Regression Plane (RP), a user-friendly discovery tool enabling class-free phenotypic supervised machine learning, to describe and explore biological data in a continuous manner. First, we compare traditional classification with regression in a simulated experimental setup. Second, we use our framework to identify genes involved in regulating triglyceride levels in human cells. Subsequently, we analyse a time-lapse dataset on mitosis to demonstrate that the proposed methodology is capable of modelling complex processes at infinite resolution. Finally, we show that hemocyte differentiation in Drosophila melanogaster has continuous characteristics.
Year: 2021 PMID: 33953203 PMCID: PMC8100172 DOI: 10.1038/s41467-021-22866-x
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 Classification vs regression.
a Regression plane concept. The classical way to model a biological process relies on the phenotypic analysis of cells (i.e. subdividing cells into classes). However, in a high-content screening scenario, the multitude of different phenotypes makes it extremely challenging to create a set of representative classes. A possible solution builds on a regression line, which represents a single effect without the need for discretization. Nonetheless, biological processes are typically characterized by numerous ongoing effects. Thus, the regression plane represents a good trade-off between visualization capabilities and annotation complexity: it can represent a biological process within the limits of a planar graph. b Active regression. The aim of an active regression algorithm is to improve the training set (TS) to achieve better prediction performance. It is an iterative process where a cell that is difficult to annotate is proposed to the oracle, who annotates it and thereby moves it to the TS used to train the regression model. c Synthetic dataset. Image from the synthetic dataset, generated using SIMCEP. d Experimental design. The designed processes overlaid on the space of perturbations. Six processes are tracks in the space, and an extra process is formed of uniformly distributed cells (latent process 7). e Designed processes. The six continuous processes are modelled between two fixed endpoints: green cells of highly irregular shape and red, rounded cells. To assign a colour to the middle point of each process we interpolated between white (process 1) and blue (process 6). f Classification vs regression applied to synthetic data. Comparison of the performance of regression and classification. Statistics: precision, recall and the number of identified processes. Columns represent means; error bars show the standard deviations from n = 5 independent users/experimental setup. Source data are provided as a Source Data file.
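The iterative loop described in panel b can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the k-nearest-neighbour regressor, the use of distance to the nearest annotated cell as a stand-in for "difficult" cells, and the names `knn_predict`, `pick_query` and `active_regression` are all assumptions made for this example.

```python
import math


def knn_predict(train, features, k=3):
    """Predict a plane position as the mean position of the k nearest annotated cells.

    `train` is a list of (feature_vector, (x, y)) pairs. k-NN is an assumed
    stand-in for whatever regression model is actually used.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], features))[:k]
    xs = [pos[0] for _, pos in nearest]
    ys = [pos[1] for _, pos in nearest]
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def pick_query(train, pool):
    """Return the index of the pool cell farthest from every annotated cell.

    Distance to the training set is only a proxy (an assumption) for how
    uncertain the model is about a cell. `train` must not be empty.
    """
    def novelty(features):
        return min(math.dist(f, features) for f, _ in train)

    return max(range(len(pool)), key=lambda i: novelty(pool[i]))


def active_regression(train, pool, oracle, rounds):
    """Move `rounds` cells from the unlabelled pool into the training set.

    `oracle` is a callable that returns an (x, y) annotation for a cell,
    standing in for the human annotator.
    """
    train, pool = list(train), list(pool)
    for _ in range(min(rounds, len(pool))):
        i = pick_query(train, pool)
        features = pool.pop(i)
        train.append((features, oracle(features)))
    return train, pool
```

In use, `train` would start with a handful of expert-placed cells and `oracle` would prompt the annotator. After each round the regressor is effectively retrained, because `knn_predict` always reads the current training set.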
Fig. 2 Lipid droplet dataset.
a Training set. Regression plane of 457 cells representing various lipid morphologies, created by an expert biologist. b RP output. Kernel Density Estimation (KDE) maps of the predicted regression positions for cells treated with selected siRNAs. Arrows originate from the peak of the control KDE map and point to the peaks of the selected KDE maps. c HCS analysis. Plate-based analysis performed by comparing well-based KDE maps. Meta-visualization (in this case PCA, Principal Component Analysis) is obtained by extracting the principal components (PC1 and PC2) of the flattened KDE maps.
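The KDE maps and control-to-treatment arrows of panel b can be illustrated with a small sketch. It assumes plane coordinates normalised to the unit square, a Gaussian kernel, and a fixed bandwidth and grid size. None of these are stated in the caption, and the function names are hypothetical.

```python
import math


def kde_map(points, grid_size=20, bandwidth=0.1):
    """Gaussian KDE of (x, y) points, evaluated at the cell centres of a square grid.

    Assumes coordinates in [0, 1] x [0, 1]; bandwidth and grid size are
    illustrative defaults, not values from the source.
    """
    if not points:
        raise ValueError("need at least one point")
    centers = [(i + 0.5) / grid_size for i in range(grid_size)]
    norm = 1.0 / (2 * math.pi * bandwidth ** 2 * len(points))
    grid = []
    for gy in centers:
        row = []
        for gx in centers:
            total = sum(
                math.exp(-((gx - px) ** 2 + (gy - py) ** 2) / (2 * bandwidth ** 2))
                for px, py in points
            )
            row.append(norm * total)
        grid.append(row)
    return centers, grid


def kde_peak(centers, grid):
    """Return the (x, y) grid centre where the density map is highest."""
    _, r, c = max((v, r, c) for r, row in enumerate(grid) for c, v in enumerate(row))
    return centers[c], centers[r]


def peak_shift(control_points, treated_points, **kde_args):
    """Arrow (dx, dy) from the control KDE peak to the treated KDE peak."""
    cx, cy = kde_peak(*kde_map(control_points, **kde_args))
    tx, ty = kde_peak(*kde_map(treated_points, **kde_args))
    return tx - cx, ty - cy


def flatten(grid):
    """Flatten a 2D KDE map into one feature vector per well."""
    return [v for row in grid for v in row]
```

The flattened vectors returned by `flatten` are the kind of input that a PCA, as in panel c, would take. The PCA step itself is left out here because it needs a linear algebra library.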
Fig. 3 Mitosis data analysis.
a Regression plane of 585 cells annotated by a microscopy expert. b 498 trajectories for all the predicted cells. The median curve is shown in solid blue. c Example of a single-cell trajectory with representative cell icons visualized. d Regression plane with all (n = 19,920) predicted cells. The borders of the cell icons correspond to their nuclear area (Colour Frame module). Highlighted regions: early prophase region, large nuclear area (red). Metaphase region, nuclear area decreased (orange). Early-anaphase region, nuclear area is increasing as spindle fibres are pulling chromosomes apart (yellow). Anaphase, nuclear area dropped as the nucleus is considered as two separate objects with half the area (green). Late-telophase, nuclear area increasing up to half of the initial value (blue). e Trend for the normalized nuclear area according to standard mitotic time. Grey lines represent single-cell trajectories. f Trend for the normalized nuclear area according to the regression plane. Grey lines represent single-cell trajectories. The coordinates predicted by RP were converted to 1D by taking the angle argument of the polar coordinate representation as illustrated in a.
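The 2D-to-1D conversion mentioned for panel f, taking the angle of the polar representation, can be sketched as below. The choice of plane centre, the zero-angle direction and the counter-clockwise orientation are assumptions made for illustration, and `plane_to_phase` is a hypothetical name.

```python
import math


def plane_to_phase(x, y, center=(0.5, 0.5)):
    """Map a regression-plane position to a 1D angle in [0, 2*pi).

    The angle is measured counter-clockwise from the positive x direction
    around `center`, which is assumed here to be the middle of a unit-square plane.
    """
    cx, cy = center
    angle = math.atan2(y - cy, x - cx)  # atan2 returns a value in [-pi, pi]
    return angle % (2 * math.pi)        # shift negative angles into [0, 2*pi)
```

Plotting a per-cell measurement such as normalised nuclear area against this angle gives one curve per trajectory, analogous to the grey lines in panel f.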
Fig. 4 Hemocyte dataset analysis.
a Training set. 109 cells were placed on the regression plane by a microscopy expert. Cells were segmented by applying the NucleAIzer[40] deep learning method on brightfield microscopy images. b Single cell features. Colour-coded feature values overlaid on the predicted cells. c Density plots. Kernel density estimation of single cells. d Single-cell trajectories. 2323 cell trajectories on the regression plane. e Selected cell trajectories. Representative phenotypes highlighted in d. f Differentiation speed histogram. Cell differentiation speed on the regression plane. g Trajectory histogram. 2D trajectory histogram on the regression plane and 1D projection with trajectory counts, including only those trajectories that reach beyond the green line in c.
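One plausible reading of "differentiation speed on the regression plane" (panel f) is the distance a cell moves on the plane per unit time between consecutive frames. The caption does not define the measure, so the sketch below is an assumption. The function names and the `dt` frame interval are hypothetical.

```python
import math


def trajectory_speeds(points, dt=1.0):
    """Per-step speed along one trajectory of (x, y) regression-plane positions.

    Speed is taken as Euclidean displacement between consecutive frames
    divided by the frame interval `dt` (an assumed definition).
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    return [math.dist(p, q) / dt for p, q in zip(points, points[1:])]


def mean_speed(points, dt=1.0):
    """Average speed of one trajectory; 0.0 if it has fewer than two points."""
    speeds = trajectory_speeds(points, dt)
    return sum(speeds) / len(speeds) if speeds else 0.0
```

Collecting `mean_speed` over all trajectories and binning the values would give a histogram of the kind shown in panel f.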