Chao Sima, Jianping Hua, Michael L Bittner, Seungchan Kim, Edward R Dougherty.
Abstract
Features for standard expression microarray and RNA-Seq classification are expression averages over collections of cells. Single-cell analysis provides expression measurements for individual cells in a collection of cells from a particular tissue sample. Hence, it can yield feature vectors consisting of higher-order and mixed moments. This article demonstrates the advantage of using these expression moments in cancer-related classification. We use synthetic data generated from 2 real networks, the mammalian cell cycle network and a melanoma-related pathway network, and real single-cell data generated via fluorescent protein reporters from 2 cell lines, HT-29 and HCT-116. The networks consist of hidden binary regulatory networks with Gaussian observations. The steady-state distributions of both the original and mutated networks are found, and data are drawn from these for moment-based classification using the mean, variance, skewness, and mixed moments. For the real data, we observe only 1 gene at a time, so only the mean, variance, and skewness are considered, the analysis being done for 2 genes, EGFR and ERBB2. For the synthetic data, classification improves as we move from just the mean to mean, variance, and skewness and then to these plus the mixed moments. Comparisons are done with 3, 4, or 5 features, using feature selection. Sample size effects are considered. For the real data, we only consider mean, variance, and skewness, with results improving when the higher-order moments are used as features.
Keywords: Classification; gene regulatory network; moment features; single-cell data
Year: 2018 PMID: 29881253 PMCID: PMC5987911 DOI: 10.1177/1176935118771701
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
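The moment features named in the abstract can be illustrated with a minimal sketch. It assumes a sample is a list of per-cell expression vectors, uses population (divide-by-n) central moments, and takes the pairwise covariance as the mixed moment; the function name `moment_features` and these conventions are illustrative assumptions, not the authors' exact definitions.

```python
from itertools import combinations


def moment_features(cells):
    """Build a moment-based feature vector for one sample.

    `cells` is a list of per-cell expression vectors (one float per gene).
    Returns a dict with the mean, variance, and skewness of every gene plus
    the pairwise mixed central moment (covariance) of every gene pair.
    The divide-by-n convention is an assumption made for this sketch.
    """
    n = len(cells)
    n_genes = len(cells[0])
    feats = {}
    centered = []
    for g in range(n_genes):
        values = [cell[g] for cell in cells]
        mu = sum(values) / n
        dev = [v - mu for v in values]
        m2 = sum(d ** 2 for d in dev) / n  # second central moment
        m3 = sum(d ** 3 for d in dev) / n  # third central moment
        feats[f"mean_{g}"] = mu
        feats[f"var_{g}"] = m2
        # Skewness is undefined for a constant gene; report 0.0 in that case.
        feats[f"skew_{g}"] = m3 / m2 ** 1.5 if m2 > 0 else 0.0
        centered.append(dev)
    for a, b in combinations(range(n_genes), 2):
        cross = sum(x * y for x, y in zip(centered[a], centered[b]))
        feats[f"mixed_{a}_{b}"] = cross / n
    return feats
```

When only 1 gene is observed per sample, as for the real data described above, the loop over gene pairs is empty and the feature vector reduces to mean, variance, and skewness.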
Figure 1. Logical regulatory network graphs for a mammalian cell cycle network (PN1) and a melanoma-related pathway network (PN2), modified from Figures 3 and 1 in Qian and Dougherty,[14] respectively. An arrow represents activation, whereas an arrow ending with a bar represents inhibition. A different steady-state distribution results from the P27 stuck-at-0 mutation (shaded node in PN1) or the regulatory change (dashed arrow in PN2). PN1 indicates Pathway Network 1; PN2, Pathway Network 2.
A summary of the pathway networks in this study.
| | Pathway Network 1 (PN1) | Pathway Network 2 (PN2) |
|---|---|---|
| Description | Mammalian cell cycle | Melanoma-related pathway |
| No. of genes | 10 | 7 |
| Perturbation | P27 mutated and stuck at 0 | Adding regulatory predictor |
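How a stuck-at-0 perturbation changes a steady-state distribution can be sketched on a toy model. The example below assumes a Boolean network with random gene perturbation (each gene flips independently with probability `p`; if no gene flips, the rules are applied synchronously) and approximates the stationary distribution by power iteration. The 3-gene rules in `RULES` are hypothetical and are not PN1 or PN2, and the Gaussian observation layer is omitted.

```python
from itertools import product


def transition_matrix(rules, n, p, stuck_at_zero=None):
    """Transition matrix of a Boolean network with random gene perturbation.

    Each gene flips independently with probability p; if no gene flips, the
    state is updated synchronously by `rules`. The gene index given in
    `stuck_at_zero` (if any) is clamped to 0 after every transition.
    """
    states = list(product((0, 1), repeat=n))
    index = {s: i for i, s in enumerate(states)}
    P = [[0.0] * len(states) for _ in states]
    for x in states:
        for mask in states:  # every possible pattern of gene flips
            flips = sum(mask)
            prob = p ** flips * (1 - p) ** (n - flips)
            if flips == 0:
                y = [rule(x) for rule in rules]
            else:
                y = [xi ^ mi for xi, mi in zip(x, mask)]
            if stuck_at_zero is not None:
                y[stuck_at_zero] = 0
            P[index[x]][index[tuple(y)]] += prob
    return states, P


def steady_state(P, iterations=5000):
    """Approximate the stationary distribution by power iteration."""
    k = len(P)
    pi = [1.0 / k] * k
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(k)) for j in range(k)]
    return pi


# Hypothetical 3-gene rules, chosen only for illustration.
RULES = [
    lambda x: 1 - x[2],     # gene 0 is inhibited by gene 2
    lambda x: x[0],         # gene 1 is activated by gene 0
    lambda x: x[0] & x[1],  # gene 2 needs both gene 0 and gene 1
]

states, P_wild = transition_matrix(RULES, n=3, p=0.01)
_, P_mut = transition_matrix(RULES, n=3, p=0.01, stuck_at_zero=1)
pi_wild = steady_state(P_wild)
pi_mut = steady_state(P_mut)
```

In this toy case the unperturbed network spreads its mass over a 5-state cycle, while clamping gene 1 to 0 concentrates almost all mass on the state `(1, 0, 0)`, so samples drawn from the two distributions differ in more than their means.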
Figure 3. Distribution plots (mean errors shown on the horizontal axis), for : (a) for PN1, (b) for PN1, (c) for PN2, and (d) for PN2.
Number of wells/samples measured for every gene and cell line.
| | HT-29 | HCT-116 |
|---|---|---|
| | 43 | 24 |
| | 24 | 24 |
Median number of cells per well: 247.
Figure 2. The 3-dimensional scatterplots for network PN1 sample points, with sample size , for ((a) and (d)), ((b) and (e)), and ((c) and (f)). For , multidimensional scaling has been used to reduce the plot to 3 dimensions. For each value of k, there are 2 data plots arising from different samples: one possessing low LDA error ((a)-(c)) and the other possessing high LDA error ((d)-(f)). LDA indicates linear discriminant analysis; PN1, Pathway Network 1.
Average error rates for and both networks PN1 and PN2.
| Network | Classifier | | M1 | M2 | M3 | M1 | M2 | M3 | M1 | M2 | M3 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PN1 | LDA | | 0.2070 | 0.2073 | 0.1867 | 0.2092 | 0.2093 | 0.1867 | 0.2109 | 0.2106 | 0.1868 |
| | | | 0.2008 | 0.1970 | 0.1629 | 0.2005 | 0.1973 | 0.1607 | 0.2006 | 0.1975 | 0.1590 |
| | QDA | | 0.2121 | 0.2129 | 0.1914 | 0.2189 | 0.2190 | 0.1948 | 0.2274 | 0.2270 | 0.2011 |
| | | | 0.2029 | 0.1999 | 0.1684 | 0.2054 | 0.2030 | 0.1683 | 0.2084 | 0.2063 | 0.1696 |
| | SVM | | 0.2145 | 0.2152 | 0.1962 | 0.2163 | 0.2193 | 0.1992 | 0.2200 | 0.2226 | 0.1996 |
| | | | 0.2040 | 0.2004 | 0.1699 | 0.2041 | 0.2014 | 0.1679 | 0.2045 | 0.2025 | 0.1692 |
| | NNet | | 0.2476 | 0.2412 | 0.2198 | 0.2479 | 0.2542 | 0.2234 | 0.2611 | 0.2563 | 0.2230 |
| | | | 0.2219 | 0.2216 | 0.1868 | 0.2224 | 0.2196 | 0.1850 | 0.2268 | 0.2204 | 0.1820 |
| PN2 | LDA | | 0.1019 | 0.1017 | 0.0995 | 0.1015 | 0.1018 | 0.0991 | 0.1002 | 0.1014 | 0.0998 |
| | | | 0.0936 | 0.0907 | 0.0869 | 0.0923 | 0.0892 | 0.0847 | 0.0910 | 0.0882 | 0.0840 |
| | QDA | | 0.1048 | 0.1050 | 0.1035 | 0.1064 | 0.1069 | 0.1053 | 0.1076 | 0.1097 | 0.1079 |
| | | | 0.0965 | 0.0935 | 0.0895 | 0.0962 | 0.0932 | 0.0893 | 0.0951 | 0.0938 | 0.0885 |
| | SVM | | 0.1081 | 0.1085 | 0.1092 | 0.1079 | 0.1119 | 0.1111 | 0.1085 | 0.1139 | 0.1147 |
| | | | 0.0985 | 0.0953 | 0.0922 | 0.0981 | 0.0956 | 0.0917 | 0.0976 | 0.0956 | 0.0923 |
| | NNet | | 0.1358 | 0.1304 | 0.1277 | 0.1285 | 0.1319 | 0.1264 | 0.1349 | 0.1347 | 0.1266 |
| | | | 0.1112 | 0.1059 | 0.1051 | 0.1095 | 0.1043 | 0.1030 | 0.1060 | 0.1046 | 0.1028 |
Abbreviations: LDA, linear discriminant analysis; NNet, neural network; PN1, Pathway Network 1; PN2, Pathway Network 2; QDA, quadratic discriminant analysis; SVM, support vector machine.
Classification error rates for linear discriminant analysis on all possible feature combinations for EGFR or ERBB2, based on 10-fold cross-validation repeated 10 times.
| | µ1 | µ2 | µ3 | µ1 + µ2 | µ1 + µ3 | µ2 + µ3 | µ1 + µ2 + µ3 |
|---|---|---|---|---|---|---|---|
| | 0.597 | 0.434 | 0.440 | 0.516 | 0.416 | 0.376 | 0.406 |
| | 0.083 | 0.246 | 0.579 | 0.038 | 0.075 | 0.248 | 0.038 |
µ1, mean; µ2, variance; µ3, skewness.
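The error estimate used for this table, 10-fold cross-validation repeated 10 times, can be sketched as follows. For brevity the sketch fits linear discriminant analysis on a single feature only (shared variance, empirical class priors); the function names, the random partitioning scheme, and the seed are assumptions made for illustration and may differ from the authors' implementation.

```python
import math
import random


def fit_lda_1d(xs, ys):
    """Fit univariate LDA (shared variance, empirical priors); return a predictor."""
    groups = {label: [x for x, y in zip(xs, ys) if y == label] for label in set(ys)}
    means = {label: sum(v) / len(v) for label, v in groups.items()}
    dof = max(len(xs) - len(groups), 1)
    pooled = sum((x - means[y]) ** 2 for x, y in zip(xs, ys)) / dof
    pooled = max(pooled, 1e-12)  # guard against a zero pooled variance
    priors = {label: len(v) / len(xs) for label, v in groups.items()}

    def predict(x):
        def score(label):
            m = means[label]
            return x * m / pooled - m * m / (2 * pooled) + math.log(priors[label])
        return max(groups, key=score)

    return predict


def repeated_kfold_error(xs, ys, fit, k=10, repeats=10, seed=0):
    """Average held-out error over `repeats` random k-fold partitions."""
    rng = random.Random(seed)
    n = len(xs)
    errors = []
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]
        wrong = 0
        for fold in folds:
            held_out = set(fold)
            train = [i for i in order if i not in held_out]
            predict = fit([xs[i] for i in train], [ys[i] for i in train])
            wrong += sum(predict(xs[i]) != ys[i] for i in fold)
        errors.append(wrong / n)
    return sum(errors) / repeats
```

Calling `repeated_kfold_error(feature_values, labels, fit_lda_1d)` with one moment feature per well and the cell line as the label would yield one entry of a table like the one above. Feature combinations would need a multivariate LDA in place of `fit_lda_1d`.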