Pelin Gundogdu, Carlos Loucera, Inmaculada Alamo-Alvarez, Joaquin Dopazo, Isabel Nepomuceno
Abstract
BACKGROUND: Single-cell RNA sequencing (scRNA-seq) data provide valuable insights into cellular heterogeneity, which is significantly improving current knowledge of biology and human disease. One of the main applications of scRNA-seq data analysis is the identification of new cell types and cell states. Deep neural networks (DNNs) are among the best methods to address this problem. However, this performance comes at the cost of a lack of interpretability in the results. In this work we propose an intelligible pathway-driven neural network that correctly solves cell-type-related problems at single-cell resolution while providing a biologically meaningful representation of the data.
Keywords: Deep neural network; Gene expression; Machine learning; Signaling pathway; Single cell; Transcriptomics; scRNA-seq
Year: 2022 PMID: 34980200 PMCID: PMC8722116 DOI: 10.1186/s13040-021-00285-4
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Details of input and hidden layers, as well as parameters for each architecture used
| Dataset | Architecture | Nodes layer 1 | Nodes layer 2 | Effective parameters (millions) |
|---|---|---|---|---|
| Mouse | Dense | 100 | – | 0.95 M |
| Mouse | Dense + pathway | 100 + 92 | – | 0.95 M |
| Mouse | Dense + PPI | 100 + 348 | – | 0.96 M |
| Mouse | Dense + PPI and GRN | 100 + 696 | – | 1.01 M |
| Mouse | Dense + PPI and GRN | 100 + 696 | 100 | 1.08 M |
| Mouse | Pathway | 92 | – | 0.01 M |
| Mouse | Pathway | 92 | 100 | 0.02 M |
| Human | Dense | 100 | – | 1.80 M |
| Human | Pathway | 93 | – | 0.01 M |
| Human | Pathway | 93 | 100 | 0.02 M |
Average performance (F1, accuracy, precision, recall) of the different models in a supervised task scenario. Although our pathway-primed models are nearly ten times smaller (sparse), their performance is very close to the PPI-based NN. We report the mean over 100 iterations of train/test splits
| Architecture | Number of nodes | Accuracy | Balanced accuracy | F1 (macro) | F1 (micro) | F1 (weighted) | Precision (macro) | Precision (micro) | Precision (weighted) | Recall (macro) | Recall (micro) | Recall (weighted) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dense | – | 0.825 | 0.788 | 0.748 | 0.825 | 0.802 | 0.769 | 0.825 | 0.844 | 0.788 | 0.825 | 0.825 |
| Dense with pathways | – | 0.810 | 0.781 | 0.743 | 0.810 | 0.783 | 0.763 | 0.810 | 0.823 | 0.781 | 0.810 | 0.810 |
| Dense with PPI | – | 0.802 | 0.770 | 0.730 | 0.802 | 0.774 | 0.753 | 0.802 | 0.817 | 0.770 | 0.802 | 0.802 |
| Dense with PPI/GRN | – | 0.800 | 0.777 | 0.735 | 0.800 | 0.771 | 0.757 | 0.800 | 0.815 | 0.777 | 0.800 | 0.800 |
| Signaling pathways | – | 0.813 | 0.781 | 0.743 | 0.813 | 0.790 | 0.764 | 0.813 | 0.834 | 0.781 | 0.813 | 0.813 |
| Signaling pathways | 100 | 0.766 | 0.724 | 0.673 | 0.766 | 0.728 | 0.690 | 0.766 | 0.762 | 0.724 | 0.766 | 0.766 |
Fig. 1 Network performance. The figure depicts the global metric distributions for each design in a supervised task scenario, i.e., the performance of the different models at cell-type prediction under a 100-times repeated stratified holdout cross-validation scheme with a test size of 0.30
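The evaluation protocol behind Fig. 1 can be sketched as follows. This is a minimal illustration with a synthetic dataset and a stand-in classifier (logistic regression) in place of the neural networks compared in the paper; the split scheme and metrics match the ones described above.

```python
# 100-times repeated stratified holdout (test size 0.30), scoring accuracy,
# F1, precision and recall as in the supervised experiments.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Synthetic stand-in for the expression matrix (cells x genes) and cell types
X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           random_state=0)

splitter = StratifiedShuffleSplit(n_splits=100, test_size=0.30, random_state=0)
scores = {"accuracy": [], "f1_macro": [], "precision_macro": [], "recall_macro": []}
for train_idx, test_idx in splitter.split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    y_pred = clf.predict(X[test_idx])
    scores["accuracy"].append(accuracy_score(y[test_idx], y_pred))
    scores["f1_macro"].append(f1_score(y[test_idx], y_pred, average="macro"))
    scores["precision_macro"].append(
        precision_score(y[test_idx], y_pred, average="macro", zero_division=0))
    scores["recall_macro"].append(recall_score(y[test_idx], y_pred, average="macro"))

# Mean over the 100 repetitions, as reported in the tables
means = {name: float(np.mean(vals)) for name, vals in scores.items()}
```

Swapping `average="macro"` for `"micro"` or `"weighted"` reproduces the other columns of the performance tables.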
Unknown cell-type clustering performance of the different models analyzed in the LPGO experiments (P = 4). Although our pathway-based models are nearly ten times smaller (sparse), their performance is very close to the PPI-based NN. We report the mean over 20 splits
| Architecture | Number of nodes | Homogeneity | Completeness | V-measure | ARI | AMI | Fowlkes-Mallows | Average |
|---|---|---|---|---|---|---|---|---|
| Dense | – | 0.801 | 0.799 | 0.798 | 0.725 | 0.786 | 0.814 | 0.787 |
| Dense with pathways | – | 0.804 | 0.797 | 0.798 | 0.718 | 0.786 | 0.811 | 0.786 |
| Dense with PPI | – | 0.811 | 0.804 | 0.805 | 0.728 | 0.794 | 0.817 | 0.793 |
| Dense with PPI and GRN | – | 0.820 | 0.808 | 0.812 | 0.746 | 0.802 | 0.827 | 0.802 |
| Signaling pathways | – | 0.797 | 0.788 | 0.790 | 0.716 | 0.778 | 0.809 | 0.780 |
| Signaling pathways | 100 | 0.775 | 0.803 | 0.786 | 0.729 | 0.774 | 0.820 | 0.781 |
Fig. 2 Clustering performance in the 4-left-out experiment. Each network is trained leaving 4 cell types out (LPGO technique). The left-out cell types are randomly selected, and the procedure is repeated 20 times. After the neural network training is completed, the encoding (learned representation) is computed for the test set (left-out cells) and used as input to the K-means algorithm. The output is then evaluated with a comprehensive set of metrics (see the “Materials and methods” section)
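The clustering step of the LPGO evaluation can be sketched as below. The encoding here is synthetic (well-separated Gaussian blobs); in the paper it would be the last-hidden-layer activations of the left-out cells. The six metrics are the ones reported in the table above.

```python
# K-means on a learned representation of held-out cells, scored against the
# true cell-type labels with the clustering metrics used in the LPGO tables.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (homogeneity_score, completeness_score,
                             v_measure_score, adjusted_rand_score,
                             adjusted_mutual_info_score, fowlkes_mallows_score)

rng = np.random.default_rng(0)
n_left_out = 4  # P = 4 cell types left out of training
labels = np.repeat(np.arange(n_left_out), 50)
# Synthetic "encoding": one well-separated blob per left-out cell type
encoding = rng.normal(size=(200, 32)) + labels[:, None] * 3.0

pred = KMeans(n_clusters=n_left_out, n_init=10, random_state=0).fit_predict(encoding)
metrics = {
    "Homogeneity": homogeneity_score(labels, pred),
    "Completeness": completeness_score(labels, pred),
    "V-measure": v_measure_score(labels, pred),
    "ARI": adjusted_rand_score(labels, pred),
    "AMI": adjusted_mutual_info_score(labels, pred),
    "Fowlkes-Mallows": fowlkes_mallows_score(labels, pred),
}
```

All six scores are label-permutation invariant, so the arbitrary numbering of the K-means clusters does not affect them.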
Average retrieval performance across the different cell types
| Architecture | Number of nodes | HSC | 4cell | ICM | Spleen | 8cell | Neuron | Zygote | 2cell | ESC | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PCA 100 (with full gene space) | – | 0.181 | 0.669 | 0.026 | 0.975 | 0.176 | 0.627 | 0.462 | 0.675 | 0.106 | 0.433 |
| PCA 100 (with signaling gene space) | – | 0.179 | 0.561 | 0.128 | 0.989 | 0.191 | 0.624 | 0.455 | 0.676 | 0.205 | 0.445 |
| Dense | – | 0.243 | 0.643 | 0.000 | 0.734 | 0.147 | 0.404 | 0.569 | 0.514 | 0.148 | 0.378 |
| Dense with signaling pathways | – | 0.259 | 0.648 | 0.000 | 0.849 | 0.236 | 0.486 | 0.509 | 0.656 | 0.130 | 0.419 |
| Dense with PPI | – | 0.196 | 0.638 | 0.041 | 0.927 | 0.212 | 0.550 | 0.575 | 0.686 | 0.179 | 0.445 |
| Dense with PPI/GRN | – | 0.194 | 0.645 | 0.007 | 0.930 | 0.294 | 0.542 | 0.600 | 0.711 | 0.190 | 0.457 |
| Dense with PPI/GRN | 100 | 0.068 | 0.771 | 0.182 | 0.956 | 0.849 | 0.561 | 0.415 | 0.553 | 0.710 | 0.563 |
| Signaling pathways (+) | – | 0.163 | 0.438 | 0.011 | 0.619 | 0.179 | 0.475 | 0.344 | 0.465 | 0.128 | 0.314 |
| Signaling pathways (+) | 100 | 0.149 | 0.307 | 0.050 | 0.402 | 0.198 | 0.352 | 0.592 | 0.368 | 0.137 | 0.284 |
| Signaling pathways (parameter tuning) (+) | – | 0.107 | 0.768 | 0.049 | 0.960 | 0.625 | 0.549 | 0.471 | 0.627 | 0.110 | 0.474 |
| Signaling pathways (parameter tuning) (+) | Size* | 0.155 | 0.803 | 0.117 | 0.955 | 0.568 | 0.550 | 0.497 | 0.623 | 0.150 | 0.491 |
| scvis | – | 0.203 | 0.522 | 0.000 | 0.813 | 0.077 | 0.733 | 0.424 | 0.683 | 0.512 | 0.441 |
*The size of the second layer is determined by hyperparameter tuning
Proposed network performance with log normalization
| Architecture | Number of nodes | Accuracy | Balanced accuracy | F1 (macro) | F1 (micro) | F1 (weighted) | Precision (macro) | Precision (micro) | Precision (weighted) | Recall (macro) | Recall (micro) | Recall (weighted) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dense | – | 0.938 | 0.839 | 0.859 | 0.938 | 0.934 | 0.923 | 0.938 | 0.938 | 0.839 | 0.938 | 0.938 |
| Pathways | – | 0.936 | 0.844 | 0.861 | 0.936 | 0.933 | 0.922 | 0.936 | 0.938 | 0.844 | 0.936 | 0.936 |
| Pathways | 100 | 0.930 | 0.834 | 0.847 | 0.930 | 0.926 | 0.901 | 0.930 | 0.932 | 0.834 | 0.930 | 0.930 |
Fig. 3 2D t-SNE showing the dimensionality-reduction result based on the learned representation (encoding) of the data. Each pathway-based NN (one- and two-layer designs) was trained on the training set and then used to compute the encoding of the training and test sets
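The projection in Fig. 3 can be sketched as follows. The encoding is random here; in the paper it would be the pathway-layer activations, e.g. a 93-dimensional vector per human cell (one node per signaling pathway).

```python
# Project a learned encoding to 2D with t-SNE, as in Fig. 3.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Synthetic stand-in encoding: 150 cells x 93 pathway nodes
encoding = rng.normal(size=(150, 93))

# perplexity must be smaller than the number of samples
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(encoding)
```

Each row of `embedding` is then plotted as a point, colored by cell type, to produce the figure.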
Number of nodes in the input layer (genes) and in the first hidden layer (biological information) according to the type of biological information used to relate genes among them
| Organism | Biological information | Source | Number of genes (input layer) | Number of nodes (first hidden layer) |
|---|---|---|---|---|
| Mouse | PPI | [ | 3553 | 348 |
| Mouse | GRN | [ | 8307 | 348 |
| Mouse | Signaling pathway | [ | 3737 | 92 |
| Human | Signaling pathway | [ | 2987 | 93 |
Fig. 4 Encoding information in the DNN. a The proposed network is based on a feedforward neural network. b The integration of biological knowledge is implemented in the first hidden layer, i.e., each neuron/node corresponds to one biological unit. c The learned representation or encoding (the activation of the last hidden layer) can be used as input to the t-SNE algorithm to produce a 2D visualization of the data. Finally, the supervised performance of the network can be evaluated with classical ML metrics
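The knowledge-primed first hidden layer of panel b can be sketched with plain numpy: a dense layer whose weight matrix is elementwise-masked by a binary gene-to-biological-unit membership matrix, so each node only aggregates the genes annotated to its pathway (or PPI module). The gene and pathway counts below are illustrative, not the paper's, and the random membership matrix stands in for a real annotation.

```python
# Minimal sketch of a pathway-masked dense layer.
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_pathways = 20, 4

# membership[i, j] = 1 if gene i is annotated to pathway j (assumed annotation)
membership = (rng.random((n_genes, n_pathways)) < 0.3).astype(float)

weights = rng.normal(scale=0.1, size=(n_genes, n_pathways))
bias = np.zeros(n_pathways)

def pathway_layer(x):
    """Forward pass: the mask zeroes every gene-pathway edge absent from the
    annotation, so each hidden node sees only its member genes."""
    return np.maximum(x @ (weights * membership) + bias, 0.0)  # ReLU

expression = rng.normal(size=(5, n_genes))  # 5 cells x 20 genes
activations = pathway_layer(expression)     # 5 cells x 4 pathway-node activations
```

Because masked entries stay zero, the layer has roughly `membership.sum()` trainable gene-pathway weights instead of `n_genes * n_pathways`, which is why the pathway-primed models in the tables above are nearly ten times smaller than their dense counterparts.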