Patrick S Stumpf, Xin Du, Haruka Imanishi, Yuya Kunisaki, Yuichiro Semba, Timothy Noble, Rosanna C G Smith, Matthew Rose-Zerili, Jonathan J West, Richard O C Oreffo, Katayoun Farrahi, Mahesan Niranjan, Koichi Akashi, Fumio Arai, Ben D MacArthur.
Abstract
Biomedical research often involves conducting experiments on model organisms in the anticipation that the biology learnt will transfer to humans. Previous comparative studies of mouse and human tissues were limited by the use of bulk-cell material. Here we show that transfer learning (the branch of machine learning that concerns passing information from one domain to another) can be used to efficiently map bone marrow biology between species, using data obtained from single-cell RNA sequencing. We first trained a multiclass logistic regression model to recognize different cell types in mouse bone marrow, achieving performance equivalent to that of more complex artificial neural networks. Furthermore, the model was able to identify individual human bone marrow cells with 83% overall accuracy. However, some human cell types were not easily identified, indicating important differences in biology. When re-training the mouse classifier using human data, fewer than 10 human cells of a given type were needed to accurately learn its representation. In some cases, human cell identities could be inferred directly from the mouse classifier via zero-shot learning. These results show how simple machine learning models can be used to reconstruct complex biology from limited data, with broad implications for biomedical research.
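The re-training idea in the abstract can be illustrated with a minimal sketch. It assumes plain full-batch gradient descent on a softmax (multiclass logistic) regression and uses invented toy data with two "genes" and two "cell types"; the optimizer, learning rate, step counts, and all variable names are assumptions made for illustration and are not taken from the paper.

```python
import math
import random

def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(weights, bias, x):
    scores = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, bias)]
    return softmax(scores)

def mean_cross_entropy(weights, bias, X, y):
    losses = [-math.log(predict_proba(weights, bias, x)[label])
              for x, label in zip(X, y)]
    return sum(losses) / len(losses)

def gradient_descent(weights, bias, X, y, steps, lr=0.1):
    """Return updated copies of (weights, bias); the inputs are not modified."""
    weights = [row[:] for row in weights]
    bias = bias[:]
    n = len(X)
    for _ in range(steps):
        grad_w = [[0.0] * len(row) for row in weights]
        grad_b = [0.0] * len(bias)
        for x, label in zip(X, y):
            p = predict_proba(weights, bias, x)
            for k in range(len(bias)):
                err = p[k] - (1.0 if k == label else 0.0)
                grad_b[k] += err / n
                for j, xj in enumerate(x):
                    grad_w[k][j] += err * xj / n
        for k in range(len(bias)):
            bias[k] -= lr * grad_b[k]
            for j in range(len(weights[k])):
                weights[k][j] -= lr * grad_w[k][j]
    return weights, bias

# Hypothetical toy data: abundant "mouse" cells, one "human" cell per class.
mouse_X = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
mouse_y = [0, 0, 0, 1, 1, 1]
human_X = [[0.7, 0.3], [0.3, 0.7]]
human_y = [0, 1]

# Source model: trained on the abundant source-domain data.
source_w, source_b = gradient_descent([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0],
                                      mouse_X, mouse_y, steps=200)
# Transfer: a few fine-tuning steps on sparse human data, starting from the source model.
transfer_w, transfer_b = gradient_descent(source_w, source_b, human_X, human_y, steps=5)
# Naive control: the same few steps, but starting from small random weights.
rng = random.Random(0)
random_w = [[rng.uniform(-0.1, 0.1) for _ in range(2)] for _ in range(2)]
naive_w, naive_b = gradient_descent(random_w, [0.0, 0.0], human_X, human_y, steps=5)

source_loss = mean_cross_entropy(source_w, source_b, human_X, human_y)
transfer_loss = mean_cross_entropy(transfer_w, transfer_b, human_X, human_y)
naive_loss = mean_cross_entropy(naive_w, naive_b, human_X, human_y)
```

In this toy setting the fine-tuned model starts from weights that already separate the two classes, so after the same small number of updates it reaches a lower loss on the human cells than the randomly initialized control.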
Year: 2020 PMID: 33277618 PMCID: PMC7718277 DOI: 10.1038/s42003-020-01463-6
Source DB: PubMed Journal: Commun Biol ISSN: 2399-3642
Fig. 1 Dissecting the cellular heterogeneity of the mouse bone marrow.
a Experiment schematic. Single-cell RNA sequencing was performed on total and depleted (CD45−/Ter119−) bone marrow cells. b Projection of data onto two dimensions using t-distributed stochastic neighbor embedding (tSNE[63]) indicates that bone marrow population structure is preserved in biological replicates (n = 3, shown in purple, blue, and green). c Cell types were identified using unsupervised clustering[9] followed by annotation of clusters according to localization of known markers for different cell types. d Clusters naturally arrange in accordance with the known bone marrow lineage tree. The lineage tree shown is taken from ref. [13]. HSPCs hematopoietic stem and progenitor cells. e Relative abundance of cell types in total and depleted bone marrow samples. Bar height indicates the mean over the biological replicates (n = 3). f Key markers of the main branches of the hematopoietic lineage tree and niche cells localize to distinct clusters in the data. The following representative markers are shown: stem and progenitor cells: Cd34; niche cells: Kitl; myeloid lineage: Spi1; erythroid lineage: Gata1; lymphoid lineage: Pax5. See Supplementary Fig. 2 for localization patterns of a range of other markers. g Schematic of the multinomial logistic regression (MLR) model used to identify cell types from gene expression profiles obtained from mouse bone marrow cell samples. The MLR consists of an input layer with 4372 units (corresponding to the set of high variability genes that have unique human homologs; see Methods), and a 14-class SoftMax output layer. h Confusion matrix of validation data, showing accurate classification of cell identities by the ANN. Data displayed are the average over a fivefold cross-validation. i Distribution of misclassified cells in the training data. Color represents the distance d between the true and predicted label in the cell lineage tree in panel d.
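The model shape described in panel g can be sketched as a single linear layer followed by a softmax. The dimensions (4372 inputs, 14 classes) come from the caption; the bias term, the random weights, and the random expression vector are assumptions used only to make the sketch runnable, and the function names are hypothetical.

```python
import math
import random

N_GENES = 4372       # input units, as stated in the caption
N_CELL_TYPES = 14    # softmax output classes, as stated in the caption

def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mlr_predict(weights, biases, expression):
    """Return one probability per cell type for a single expression profile."""
    scores = [sum(w * x for w, x in zip(row, expression)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)

# Placeholder parameters and input (assumed, not trained on real data).
rng = random.Random(0)
weights = [[rng.gauss(0.0, 0.01) for _ in range(N_GENES)]
           for _ in range(N_CELL_TYPES)]
biases = [0.0] * N_CELL_TYPES
cell = [rng.random() for _ in range(N_GENES)]

probabilities = mlr_predict(weights, biases, cell)
predicted_type = max(range(N_CELL_TYPES), key=probabilities.__getitem__)

# One weight per (gene, class) pair plus one bias per class (bias is an assumption).
n_parameters = N_GENES * N_CELL_TYPES + N_CELL_TYPES
```

With a bias term included, a model of this shape has 61,222 trainable parameters, which gives a sense of how small it is compared with a multi-layer network.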
Fig. 2 Bone marrow biology maps partially from mouse to humans.
a Schematic of the naïve transfer process. The MLR trained in the source domain (mouse) is used to classify test data from the target domain (humans). b Confusion matrix of classification consensus from fivefold cross-validation. The dashed box highlights cell types identified in the mouse but not the human data. c, d Projection of human data onto two dimensions using tSNE[63]. Points represent cells colored by c predicted cell identity or d misclassification. Cells for which the five classifiers did not agree are shown in turquoise. e Heatmap of similarity between mouse and MLR predicted human cell types. Similarities were calculated between cell-type mediancentres[27] using cosine similarity (see “Methods”). Clustering using single linkage reveals high similarity between equivalent mouse and human cell types. f Similarity of mouse and human cell types using annotations obtained from unsupervised Louvain clustering (x-axis) and MLR model predictions (y-axis). Pericytes (pink) and Pre-B-lymphocytes (brown) contain a mixture of cell types that are not resolved by unsupervised clustering but are identified by the MLR model. The black diagonal marks y = x.
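The similarity calculation in panel e can be sketched as follows. This is a simplified illustration: it uses the component-wise median of each cell type as a stand-in for the mediancentre cited in the caption, which may be defined differently, and the expression values and cell-type labels are invented.

```python
import math
from statistics import median

def componentwise_median(cells):
    """Median of each gene across the cells of one type (assumed stand-in for the mediancentre)."""
    return [median(values) for values in zip(*cells)]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical expression profiles: rows are cells, columns are three genes.
mouse_cells = {
    "B cell": [[5, 0, 1], [4, 1, 0], [6, 0, 1]],
    "Erythroid": [[0, 5, 1], [1, 4, 0], [0, 6, 1]],
}
human_cells = {
    "B cell": [[4, 0, 2], [5, 1, 1], [3, 0, 1]],
    "Erythroid": [[0, 4, 1], [1, 5, 2], [0, 3, 1]],
}

mouse_centres = {name: componentwise_median(c) for name, c in mouse_cells.items()}
human_centres = {name: componentwise_median(c) for name, c in human_cells.items()}

# Similarity between every mouse cell type and every human cell type.
similarity = {
    (m, h): cosine_similarity(mouse_centres[m], human_centres[h])
    for m in mouse_centres
    for h in human_centres
}
```

In this toy example, matching cell types have a similarity close to 1 and mismatched types a similarity close to 0, which is the pattern a heatmap of such values would make visible.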
Fig. 3 Mapping biology from mouse to human using transfer learning.
a Schematic of the transfer learning process. Abundant data from the source domain (here, the mouse) are used to train a source MLR. Sparse data from the target domain (here, humans) are used to fine-tune the parameters of the source MLR, thereby transferring knowledge from the source to the target domain. b Schematic of naïve learning as a control for transfer learning. Rather than updating the pre-trained mouse model, a series of separate MLRs are trained from random initial conditions on sparse data from the human target domain. c Both transfer and naïve learning improve with the number of human samples used for training (data are shown for 0, 1, 2, …, 10, 15, 20, 25, 30 human cells per class). Transfer learning performance (top) and naïve learning (bottom). Displayed is the average F1 score from fivefold cross-validation as a measure of classifier performance. d Learning curves illustrate the evolution of classification performance starting from the initial mouse model (triangle) to the final model (square; trained on 30 human examples per class). e Schematic to interpret the learning curves in panel d. Three features are of importance. A is the initial performance deficit, $1 - F1_{0}$, where $F1_{0}$ is the F1 score of the mouse model in predicting human samples. B is the learning curve: each point on this curve plots the F1 score of the re-trained mouse model against that of the naïve human model for a fixed number of human training examples from 0 to 30 per class. C is the final performance deficit, $1 - F1_{\mathrm{end}}$, where $F1_{\mathrm{end}}$ is the F1 score of the naïve human model trained on 30 samples per class in predicting human samples. The line y = x is shown in black. On this line the naïve and re-trained models have equivalent accuracy for the same number of human training samples. At this point the advantage of transfer learning is neutralized, and equivalent learning can be achieved by a naïve human model. All learning trajectories eventually converge to this line.
f Cell types may be grouped by their initial and final performance deficits. Equivalent re-training results for the ANN classifier are shown in Supplementary Fig. 5.
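The quantities defined in panel e can be worked through with a short sketch. The F1 scores below are invented for a single hypothetical cell type, and the function names are illustrative; only the definitions of the two deficits and of the learning-curve points follow the caption.

```python
def learning_curve(transfer_f1, naive_f1):
    """Pair the scores so that x is the naive model's F1 and y is the re-trained model's F1."""
    if len(transfer_f1) != len(naive_f1):
        raise ValueError("both models need one score per training-set size")
    return list(zip(naive_f1, transfer_f1))

def performance_deficits(transfer_f1, naive_f1):
    """Return (A, C) as defined in panel e.

    A = 1 - F1 of the mouse model before any human training (first transfer score).
    C = 1 - F1 of the naive human model at the largest training-set size.
    """
    initial_deficit = 1.0 - transfer_f1[0]
    final_deficit = 1.0 - naive_f1[-1]
    return initial_deficit, final_deficit

# Hypothetical scores for one cell type, indexed by human cells per class.
cells_per_class = [0, 1, 2, 5, 10, 30]
transfer_f1 = [0.60, 0.75, 0.82, 0.90, 0.93, 0.95]
naive_f1 = [0.05, 0.40, 0.60, 0.82, 0.90, 0.95]

curve = learning_curve(transfer_f1, naive_f1)
initial_deficit, final_deficit = performance_deficits(transfer_f1, naive_f1)

# Points above the line y = x are training-set sizes where transfer learning still helps.
transfer_helps = [n for n, (x, y) in zip(cells_per_class, curve) if y > x]
```

For these made-up numbers the initial deficit is 0.40 and the final deficit is 0.05, and the curve reaches the line y = x at 30 cells per class, where the two models perform equally.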