Phenotype Classification Using Moment Features of Single-Cell Data.

Chao Sima¹, Jianping Hua¹, Michael L Bittner², Seungchan Kim³, Edward R Dougherty⁴.

Abstract

Features for standard expression microarray and RNA-Seq classification are expression averages over collections of cells. Single-cell profiling provides expression measurements for individual cells in a collection of cells from a particular tissue sample. Hence, it can yield feature vectors consisting of higher order and mixed moments. This article demonstrates the advantage of using these expression moments in cancer-related classification. We use synthetic data generated from 2 real networks, the mammalian cell cycle network and a melanoma-related pathway network, and real single-cell data generated via fluorescent protein reporters from 2 cell lines, HT-29 and HCT-116. The networks consist of hidden binary regulatory networks with Gaussian observations. The steady-state distributions of both the original and mutated networks are found, and data are drawn from these for moment-based classification using the mean, variance, skewness, and mixed moments. For the real data, we only observe 1 gene at a time, so that only the mean, variance, and skewness are considered, the analysis being done for 2 genes, EGFR and ERBB2. For the synthetic data, classification improves as we move from just the mean to mean, variance, and skewness and then to these plus the mixed moments. Comparisons are done with 3, 4, or 5 features, using feature selection. Sample size effects are considered. For the real data, we only consider mean, variance, and skewness, with results improving when the higher order moments are used as features.

Keywords: Classification; gene regulatory network; moment features; single-cell data

Year: 2018 PMID: 29881253 PMCID: PMC5987911 DOI: 10.1177/1176935118771701

Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351

Introduction

Transcriptome analysis is a powerful strategy to connect genotype to phenotype of cells. Essentially, all cells share the same genetic code inherited from their ancestor, but transcriptomes of individual cells characterize a subset of genes expressed to reflect their epigenetic status or their genetic regulatory system leading to specific phenotypes.[1,2] Hence, ideally, the transcriptome should be profiled for each individual cell; however, owing to technical limitations, until recently, most transcriptomic profiling has been done on bulk cells, yielding only average behavior of tens of thousands of cells. Recent advances in next-generation sequencing technologies have allowed in-depth investigation of the transcriptome at a single-cell resolution,[3] thereby opening avenues for innovative target discovery.[4] A recent theoretical analysis has compared phenotype classification based on single-cell expression trajectories with mean expression levels across multiple cells, as is the case with both ordinary RNA-Seq and expression microarrays.[5] Bulk expression measurement (multiple-cell averaging) destroys both intercell and dynamical information, and therefore using single-cell trajectories should be expected to achieve lower misclassification rates for phenotype classification. In the work by Karbalayghareh et al.,[5] cell trajectory versus average cell classification is studied in the context of Boolean networks with perturbation (BNp) and, except in some cases where the network attractors have special form,[6-8] single-cell trajectory data outperform average cell measurements. In practice, regulatory asynchronicity would lead to missing values.[9] Moreover, lower amounts of messenger RNA in individual cells can cause experimental issues that would also lead to dropouts.[10] Thus, the modeling in the work by Karbalayghareh et al.[5] assumes random missing values in trajectory readouts, with missing value rates as high as 50%. 
Looking into the future, we should expect that nondynamical (nontrajectory) single-cell data in sufficient supply for cancer classification will become available before sufficient dynamical single-cell data. In this case, because a gene’s expression is not static, expression measurements for a gene will possess a distribution over the cell collection that arises not simply from measurement error. Moreover, as genes interact, they will possess a joint expression distribution. Reducing this multivariate distribution to a set of averages entails a significant compression of information. If single-cell measurements were available, then each cell would yield an expression vector, the collection of cells would yield a sample of expression vectors, sample moments (not just averages) could be computed, and moment-based classification could proceed using higher order and mixed moments. The latter can be particularly useful because they reveal interaction, and the alteration of signaling pathways can significantly alter gene interaction, thereby enabling phenotype discrimination.

Methods

For moment-based classification, suppose that for each tissue, a collection of cells is gathered and for each cell an expression vector is formed from a fixed set of genes. This yields a sample of expression vectors, one per cell. From these, we can compute empirical moments. A feature vector is composed of some set of moments. We focus on the first 3 moments, µ1, µ2, µ3 (mean, variance, and skewness), for each gene expression, and the second-order mixed moments for each pair of distinct genes. Together, these moments form the feature vector for each tissue. We will not consider additional moments due to the sample size being small, which is typical in genomics applications. If the tissue samples come from 2 phenotypes, some from phenotype 0 and the rest from phenotype 1, then the training sample consists of the moment feature vectors from class 0 and those from class 1, each paired with its class label. It is typically the case in biomedicine that sampling is separate and not random, meaning that tissues are chosen randomly from each class but not from the population as a whole, so that the prior probabilities of the 2 class labels cannot be estimated from the data[11]; we assume that they are known.
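As a rough illustration of how such a feature vector could be assembled, the following sketch computes the per-gene mean, variance, and skewness and one mixed moment per gene pair from a list of per-cell expression vectors. The function name `moment_features`, the use of population (divide-by-count) moments, and the choice of the central mixed moment (the covariance) as the second-order mixed moment are assumptions made for this example, not definitions taken from the article.

```python
from itertools import combinations
from statistics import fmean


def moment_features(cells):
    """Flat moment feature vector for one tissue sample.

    cells: list of per-cell expression vectors (one value per gene).
    Layout: means, variances, skewnesses, then one mixed moment per
    gene pair (assumed here to be the central mixed moment).
    """
    n_genes = len(cells[0])
    columns = [[cell[j] for cell in cells] for j in range(n_genes)]
    means = [fmean(col) for col in columns]

    def central(j, power):
        # Empirical central moment of the given order for gene j.
        return fmean((x - means[j]) ** power for x in columns[j])

    variances = [central(j, 2) for j in range(n_genes)]
    skews = [
        central(j, 3) / variances[j] ** 1.5 if variances[j] > 0 else 0.0
        for j in range(n_genes)
    ]
    mixed = [
        fmean((a - means[j]) * (b - means[k])
              for a, b in zip(columns[j], columns[k]))
        for j, k in combinations(range(n_genes), 2)
    ]
    return means + variances + skews + mixed
```

Under this layout, a sample with m genes gives 3m single-gene moments plus m(m-1)/2 mixed moments; with a single profiled gene, the mixed-moment part is empty.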

Synthetic data via a gene regulatory network

If we assume a network model, then we can solve for the Bayes classifier and generate synthetic data to study classifier design and feature selection.[12] We shall assume Gaussian networks generated from hidden discrete networks, which we will take to be BNp[6] but which could also be probabilistic Boolean networks[8] or Bayesian networks.[13] Knowing the generating BNp allows us to study the effects of regulatory alteration, for instance, classifying between a nominal network and another resulting from mutation or drugs. Using Gaussian measurements allows us to model basal-level expressions and variability. We describe the network model for a single BNp and later return to classification with 2 BNps.

Consider a BNp with a given number of genes. Each state is a binary vector with one entry per gene, so the number of states is 2 raised to the number of genes. The transition probability matrix (TPM) can be analytically derived and the steady-state distribution π can be derived from the TPM. A moment vector is associated with π. We assume a Gaussian observation: the observation of each gene is normally distributed, with one mean expression when the gene is considered off, another mean expression when the gene is on, and a common expression variance. There is no theoretical difficulty in making these parameters depend on the gene, but to keep the results more transparent and the simulations less burdensome, we will not. Each observation vector corresponds to a hidden state of the network. For a single subject, a collection of cells is observed, with the hidden states of the cells randomly drawn from π. A moment feature vector, as defined earlier in the section, is calculated from these observations.

When the BNp is perturbed, a different TPM will follow, which results in a different steady-state distribution. The steady-state distributions of the unperturbed and perturbed BNps define the 2 classes. We are interested in studying whether including different moments will improve the classification. Therefore, we categorize 3 types of moment features: M₁ for first moment features only, M₂ supersetting M₁ but also including the second and third moment features, and M₃ supersetting M₂ but also including mixed moments. If there are N subjects, each with a collection of cells in the sample, then this procedure yields a training sample of N labeled feature vectors, some based on the steady-state distribution of the unperturbed BNp and the rest on that of the perturbed BNp. Randomness in the training sample results from randomness in the hidden states drawn from the steady-state distributions and randomness in the observations.

In this study, we generate synthetic data using 2 pathway networks: Pathway Network 1 (PN1) is a mammalian cell cycle network and Pathway Network 2 (PN2) is a melanoma-related pathway network (Figure 1). Both of these have been previously proposed and described.[14] Briefly, PN1 includes a few key genes in the mammalian cell cycle whose signals and controls play a critical role in cell growth, among which P27 is active in the absence of the cyclins and blocks the action of CycE or CycA. When P27 is mutated and always off, it introduces a mutated phenotype where the growth factors are inactive. On the other hand, PN2 focuses on gene Wnt5a, which has been found to be highly discriminating between cells with properties typically associated with high versus low metastatic competence, as validated in melanoma cells. A different type of perturbation is applied to PN2, where we add the regulatory predictor Ret1 for S100P (dashed arrow from Ret1 in Figure 1) as a function wiring modification. A summary of these networks is shown in Table 1.
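The data-generation procedure described in this section can be sketched as follows, assuming an ergodic chain so that power iteration converges to the steady-state distribution. The toy 2-gene TPM, the parameter names (`mean_off`, `mean_on`, `sigma`), and all numeric values are made up for illustration; they are not the PN1 or PN2 networks, and the article derives its TPMs analytically rather than writing them down by hand.

```python
import random
from itertools import product


def steady_state(tpm, iterations=1000):
    """Approximate the stationary distribution of a TPM by power iteration."""
    n = len(tpm)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * tpm[i][j] for i in range(n)) for j in range(n)]
    return pi


def sample_cells(pi, states, n_cells, mean_off, mean_on, sigma, rng):
    """Draw hidden binary states from pi, then add Gaussian observations."""
    hidden = rng.choices(states, weights=pi, k=n_cells)
    return [
        [rng.gauss(mean_on if bit else mean_off, sigma) for bit in state]
        for state in hidden
    ]


# Toy 2-gene example; this TPM is hypothetical and only for illustration.
states = list(product([0, 1], repeat=2))  # (0,0), (0,1), (1,0), (1,1)
tpm = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
    [0.1, 0.1, 0.6, 0.2],
    [0.1, 0.1, 0.1, 0.7],
]
pi = steady_state(tpm)
cells = sample_cells(pi, states, n_cells=250, mean_off=0.0, mean_on=1.0,
                     sigma=0.3, rng=random.Random(0))
```

Repeating the sampling step once per subject, with one steady-state distribution for the unperturbed network and another for the perturbed network, would give the 2 classes of simulated subjects.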

Figure 1.

Logical regulatory network graphs for a mammalian cell cycle network (PN1) and a melanoma-related pathway network (PN2), modified from Figures 3 and 1 in Qian and Dougherty,[14] respectively. An arrow represents activation regulation, whereas an arrow ending with a bar represents inhibition. A different steady-state distribution resulted from P27 stuck-at-0 change (shaded node in PN1) or regulatory change (dashed arrow in PN2). PN1 indicates Pathway Network 1; PN2, Pathway Network 2.

Table 1.

A summary of the pathway networks in this study.

	Pathway Network 1 (PN1)	Pathway Network 2 (PN2)
Description	Mammalian cell cycle	Melanoma-related pathway
No. of genes	10	7
Perturbation	P27 mutated and stuck at 0	Adding regulatory predictor

Figure 3.

Distribution plots of the LDA error rates (mean errors shown on the horizontal axis): (a) and (b) for PN1, and (c) and (d) for PN2.

Real data from fluorescent protein reporters

Ideally, single-cell RNA-Seq technology could be used to profile the whole transcriptome of hundreds of cells from individual patients/cell lines. However, because this technology is still under intensive development and not widely adopted with a common protocol, current publicly available single-cell RNA-Seq data sets are generated to demonstrate the ability of a certain methodology, usually the amount and quality of cells profiled per patient/cell line. As a result, these data sets contain extremely small numbers of patient/cell line sample points per phenotype, usually 1 or 2, thus thwarting any effort for realistic classification based on such data. Thus, in this study, we use an in-house data set collected through a high-content imager that tracks a given gene’s transcription level in individual cancer cells via fluorescent protein reporters.[15,16] Prior to the introduction of single-cell RNA-Seq, fluorescent protein reporters, along with single-molecule fluorescent in situ hybridization and single-cell quantitative polymerase chain reaction, were the most common approaches to examine transcriptional heterogeneity among cells.[17-20] In the fluorescent technology, the protein reporter is assembled by fusing the coding sequence of a fluorescent protein reporter with the promoter region of the target gene and then transfecting it into target cells. The abundance of the fluorescent protein indicates the transcriptional level and it can be captured by an epifluorescent microscope. In our experimental setup, each cell is transfected with just one reporter to follow the transcription of one specific target gene. The fluorescent images are commonly taken as 2-color image pairs with a blue channel for the nuclei and a green channel for the fluorescent reporters. Then, the expression levels of individual cells are extracted using in-house software that first identifies the individual nuclei in the nuclei channel and then extracts the corresponding fluorescent protein signal in the other channel.

Compared with single-cell RNA-Seq technology, the fluorescent protein reporter technology detects only 1 gene, rather than the whole transcriptome. An advantage of a high-content imager is that one can capture the transcriptional activities in many wells on the plate simultaneously, where each well is an independent sample point from the corresponding cell line. With this many sample points, we can test the potential of moment classification, even with just a single gene. The drawback is that there are no second-order mixed moments. Nonetheless, we can test moment classification using the first 3 moments and demonstrate its advantage over classification using only the mean. Our study involves 2 cell lines, HT-29 and HCT-116, that are resistant and sensitive to the drug lapatinib, respectively. Lapatinib is a cancer drug that has been approved for treating breast cancer by inhibiting Egfr and Erbb2, 2 membrane-bound protein receptors commonly associated with cancer. Thus, we have selected Egfr and Erbb2 as the 2 genes to be profiled. Because each cell has only 1 fluorescent protein, Egfr and Erbb2 expressions are profiled separately. The number of wells (sample points) available for each (cell line, gene) combination from our experiment is shown in Table 2. The images are taken 2 hours before lapatinib is added.

Table 2.

Number of wells/samples measured for every gene and cell line.

	HT-29	HCT-116
Egfr	43	24
Erbb2	24	24

Median number of cells per well: 247.

Results and Discussion

Synthetic data

To compare classification error rates using features from different moments, we repeatedly draw training samples from the steady-state distributions of both the unperturbed and perturbed networks. For each sample, a sequential forward search[21] is used to find k features from each of the feature classes M₁, M₂, and M₃, with the moments estimated from the cells of each subject. For k = 3, 4, 5, the error rates are computed from a large set of test data for each category of moment features, M₁, M₂, and M₃, respectively, and average error rates are computed over the samples. We have computed error rates for linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), a support vector machine (SVM) with linear kernel, and a shallow feedforward neural network with hidden layer of size 10 (NNet). We use sample sizes N = 100 and N = 200, which are representative of many studies in genomics. Figure 2 shows data plots for network PN1, for k = 3, 4, 5, where for k > 3, multidimensional scaling[22] has been used to reduce the plot to 3 dimensions. For each value of k, there are 2 data plots arising from different samples: one possessing low LDA error (top figures) and the other possessing high LDA error (bottom figures). In all figures, the data appear to be compatible with linear discrimination, an observation that is borne out in the error rates. Table 3 shows the average error rates for feature sets M₁, M₂, M₃, feature counts k = 3, 4, 5, and sample sizes N = 100, 200 for networks PN1 and PN2.
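The feature-selection step can be illustrated with a small sketch of sequential forward search. To keep the example dependency-free, a nearest-centroid rule stands in for LDA and the selection criterion is the resubstitution error on the training sample; both choices, along with the function names, are assumptions for this example and are not stated in the article.

```python
def centroid_error(train, labels, features):
    """Resubstitution error of a nearest-centroid rule on chosen features."""
    def centroid(cls):
        rows = [x for x, y in zip(train, labels) if y == cls]
        return [sum(r[f] for r in rows) / len(rows) for f in features]

    c0, c1 = centroid(0), centroid(1)

    def dist(x, c):
        # Squared Euclidean distance restricted to the chosen features.
        return sum((x[f] - ci) ** 2 for f, ci in zip(features, c))

    wrong = sum(
        (1 if dist(x, c1) < dist(x, c0) else 0) != y
        for x, y in zip(train, labels)
    )
    return wrong / len(train)


def sequential_forward_search(train, labels, k):
    """Greedily add the feature that gives the lowest error until k are chosen."""
    selected = []
    remaining = list(range(len(train[0])))
    while len(selected) < k:
        best = min(remaining,
                   key=lambda f: centroid_error(train, labels, selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Restricting the candidate columns to those belonging to M₁, M₂, or M₃ before calling `sequential_forward_search` would mimic the comparison between feature classes for a given k.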

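Figure 2.

The 3-dimensional scatterplots for network PN1 sample points, for k = 3 ((a) and (d)), k = 4 ((b) and (e)), and k = 5 ((c) and (f)). For k > 3, multidimensional scaling has been used to reduce the plot to 3 dimensions. For each value of k, there are 2 data plots arising from different samples: one possessing low LDA error ((a)-(c)) and the other possessing high LDA error ((d)-(f)). LDA indicates linear discriminant analysis; PN1, Pathway Network 1.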

Table 3.

Average error rates for feature classes M₁, M₂, M₃, feature counts k = 3, 4, 5, sample sizes N = 100 and 200, and both networks PN1 and PN2.

			k=3			k=4			k=5
			M₁	M₂	M₃	M₁	M₂	M₃	M₁	M₂	M₃
			ε^13	ε^23	ε^33	ε^14	ε^24	ε^34	ε^15	ε^25	ε^35
PN1	LDA	N = 100	0.2070	0.2073	0.1867	0.2092	0.2093	0.1867	0.2109	0.2106	0.1868
		N = 200	0.2008	0.1970	0.1629	0.2005	0.1973	0.1607	0.2006	0.1975	0.1590
	QDA	N = 100	0.2121	0.2129	0.1914	0.2189	0.2190	0.1948	0.2274	0.2270	0.2011
		N = 200	0.2029	0.1999	0.1684	0.2054	0.2030	0.1683	0.2084	0.2063	0.1696
	SVM	N = 100	0.2145	0.2152	0.1962	0.2163	0.2193	0.1992	0.2200	0.2226	0.1996
		N = 200	0.2040	0.2004	0.1699	0.2041	0.2014	0.1679	0.2045	0.2025	0.1692
	NNet	N = 100	0.2476	0.2412	0.2198	0.2479	0.2542	0.2234	0.2611	0.2563	0.2230
		N = 200	0.2219	0.2216	0.1868	0.2224	0.2196	0.1850	0.2268	0.2204	0.1820
PN2	LDA	N = 100	0.1019	0.1017	0.0995	0.1015	0.1018	0.0991	0.1002	0.1014	0.0998
		N = 200	0.0936	0.0907	0.0869	0.0923	0.0892	0.0847	0.0910	0.0882	0.0840
	QDA	N = 100	0.1048	0.1050	0.1035	0.1064	0.1069	0.1053	0.1076	0.1097	0.1079
		N = 200	0.0965	0.0935	0.0895	0.0962	0.0932	0.0893	0.0951	0.0938	0.0885
	SVM	N = 100	0.1081	0.1085	0.1092	0.1079	0.1119	0.1111	0.1085	0.1139	0.1147
		N = 200	0.0985	0.0953	0.0922	0.0981	0.0956	0.0917	0.0976	0.0956	0.0923
	NNet	N = 100	0.1358	0.1304	0.1277	0.1285	0.1319	0.1264	0.1349	0.1347	0.1266
		N = 200	0.1112	0.1059	0.1051	0.1095	0.1043	0.1030	0.1060	0.1046	0.1028

Abbreviations: LDA, linear discriminant analysis; NNet, neural network; PN1, Pathway Network 1; PN2, Pathway Network 2; QDA, quadratic discriminant analysis; SVM, support vector machine.

A considerable amount of insight can be gleaned from Table 3. Focusing first on LDA for PN1 with N = 200, we see the kind of behavior one might expect regarding the relation between the feature classes M₁, M₂, and M₃. For all values of k, the errors decrease from M₁ to M₂ to M₃; for example, with k = 3, the error falls from 0.2008 to 0.1970 to 0.1629. Hence, using single-cell measurement is important. Most critically, the decrease from M₂ to M₃ is much greater than the decrease from M₁ to M₂. This means that adding mixed moments has a more significant effect than adding higher order single-variable moments. This behavior is important for genomics: the mixed moments capture gene-gene interaction, which is affected by regulatory mutations. If we fix the feature class, we do not see improvement for increasing k (a slight bit for M₃). This means that 3 features are enough (for N = 200), the issue being to have mixed-moment features in the mix.

For the smaller sample size N = 100, the errors are greater but again the decrease from M₂ to M₃ remains significant for all values of k (albeit less than for N = 200). Much of the advantage of the added high-order moments is lost from M₁ to M₂ so that the errors remain essentially the same. We are observing the peaking phenomenon[23-25]: for fixed sample size, as the overall number of features grows, at first, the error decreases, but then it increases, the phenomenon being more prominent for small samples. Thus, the advantage of a superset of features is diminished, and a superset can actually be harmful. For a fixed feature class and increasing k, for M₁ and M₂, the errors get worse, and for M₃, they remain the same. Feature selection mitigates to some extent the effect of peaking; however, as opposed to earlier studies,[23-25] in which there is no feature selection, the results in the work by Sima and Dougherty[26] demonstrate that peaking behavior is affected in peculiar ways by feature selection and is dependent on the classification rule.

Similar effects are seen for network PN2 with LDA, but the improvement is significantly less for PN2 as compared with PN1. In part this is because the overall error rates are much smaller and, perhaps, in part because there are fewer mixed moments to choose from or because there are no strong mixed-moment effects due to regulatory change (which can happen if the effects of the regulatory change are spread out in the steady-state distribution).

To get a better sense of the contribution of extra features resulting from single-cell classification, in Figure 3, we have plotted the distributions of the LDA error rates for the feature classes M₁, M₂, and M₃ (mean errors shown on the horizontal axis). For both networks, classification improvement when moving from M₁ to M₂ to M₃ is apparent as the error rate distributions shift to the left.

QDA, SVM, and NNet show similar behavior to that of LDA for network PN1 and N = 200. Most importantly, for all k, the errors decrease from M₁ to M₂ to M₃, and the decrease from M₂ to M₃ is much greater than the decrease from M₁ to M₂. Once again, the differences are less pronounced for N = 100 on account of peaking. Analogous comments also apply to network PN2, but to a lesser degree, as is the case with LDA.

Real data

For the fluorescent imaging data, as we have profiled 1 gene for each cell, for each case, there are 3 features associated with that gene: mean, variance, and skewness. To test the potential of moment-based classification, we have tested all 7 feature combinations: mean, variance, skewness, mean plus variance, mean plus skewness, variance plus skewness, and all 3 features. Linear discriminant analysis is used for classification (QDA performs poorly on account of small sample size). Errors are estimated by 10-fold cross-validation averaged over 10 repeats. The classification results are summarized in Table 4.
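The evaluation protocol in this paragraph can be sketched as two small helpers: one that enumerates the 7 non-empty feature combinations and one that averages held-out error over repeated 10-fold partitions. The function names, the fold-assignment scheme, and the `error_fn` callback (a placeholder for fitting a classifier such as LDA on the training part and scoring it on the held-out part) are assumptions for this example rather than the authors' code.

```python
import random
from itertools import combinations


def feature_subsets(names):
    """All non-empty subsets of the feature names (7 subsets for 3 names)."""
    return [c for r in range(1, len(names) + 1)
            for c in combinations(names, r)]


def repeated_kfold_error(samples, labels, error_fn, folds=10, repeats=10, seed=0):
    """Average held-out error over `repeats` random `folds`-fold partitions.

    error_fn(train_x, train_y, test_x, test_y) must return an error rate.
    """
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    errors = []
    for _ in range(repeats):
        rng.shuffle(indices)
        for f in range(folds):
            test_idx = indices[f::folds]  # every folds-th shuffled index
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]
            errors.append(error_fn(
                [samples[i] for i in train_idx],
                [labels[i] for i in train_idx],
                [samples[i] for i in test_idx],
                [labels[i] for i in test_idx],
            ))
    return sum(errors) / len(errors)
```

Looping over `feature_subsets(["mean", "variance", "skewness"])` and calling `repeated_kfold_error` on the corresponding feature columns would produce one error estimate per combination.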

Table 4.

Classification error rates for linear discriminant analysis on all possible feature combinations for Egfr or Erbb2, based on 10-fold cross-validation repeated 10 times.

	µ¹	µ²	µ³	µ¹ + µ²	µ¹ + µ³	µ² + µ³	µ¹ + µ² + µ³
Egfr	0.597	0.434	0.440	0.516	0.416	0.376	0.406
Erbb2	0.083	0.246	0.579	0.038	0.075	0.248	0.038

µ¹, mean; µ², variance; µ³, skewness.

With higher moments added to the feature pool, classification performance can improve significantly. For Egfr, it is hard to classify with only the mean. Using variance and skewness, the error rate improves from 0.597 to 0.376. For Erbb2, with the mean, the performance is already very good at an error rate of 0.083. By adding variance, the error rate is cut by more than half to 0.038. Clearly, the higher moments improve performance.

Conclusions

The advent of single-cell expression measurement creates the potential for high-throughput expression-based classification with much greater accuracy than simply using mean expression over many cells. As we have demonstrated with both synthetic data generated from real networks and real single-cell data, higher order moments can improve moment-based classification, and the inclusion of mixed moments can make a more substantial improvement, not only because there are simply more features but also because mixed moments can capture gene-gene regulatory differences. Hopefully, these results will spur the development of more sophisticated single-cell technology so that practical sample sizes can be efficiently generated.
