Huilei Xu, Ihor R Lemischka, Avi Ma'ayan.
Abstract
BACKGROUND: Mouse embryonic stem cells (mESCs) are derived from the inner cell mass of a developing blastocyst and can be cultured indefinitely in vitro. Their distinguishing features are their ability to self-renew and to differentiate into all adult cell types. Genes that maintain mESC self-renewal and pluripotency identity are of interest to stem cell biologists. Although significant steps have been made toward the identification and characterization of such genes, the list is still incomplete and controversial. For example, the overlap among candidate self-renewal and pluripotency genes across different RNAi screens is surprisingly small. Meanwhile, machine learning approaches have been used to analyze multi-dimensional experimental data and integrate results from many studies, yet they have not been applied specifically to the task of predicting and classifying self-renewal and pluripotency gene membership.
Year: 2010 PMID: 21176149 PMCID: PMC3019180 DOI: 10.1186/1752-0509-4-173
Source DB: PubMed Journal: BMC Syst Biol ISSN: 1752-0509
Training set gene list
| MSMGs | Non-MSMGs |
|---|---|
| Bmp4, Cdyl, Cdyl2, Dmrt1, Dppa4, | Afp, Arid3a, Arid3b, Ascl1, Ascl2, |
| Dppa5a, Esrrb, Etv4, Etv5, Fgf4, Foxd3, | Bat1a, Bmp2, Bmp5, Bmper, Ccnd2, |
| Foxh1, Gbx2, Grhl2, Jarid2, Klf2, Klf5, | Cdh2, Cebpa, Cited1, Dach1, Dlx1, |
| Lefty2, Lin28, Mkrn1, Mycn, Nanog, | Dlx4, Dlx6, Ednra, En1, Eomes, Ets2, |
| Nodal, Nr0b1, Nr5a2, Phc1, Phf17, | Eya2, Fgf5, Foxb1, Gata1, Gata3, |
| Pou4f2, Pou5f1, Rif1, Sall1, Sall4, Sgk1, | Gata4, Gata5, Gata6, Gfap, Gli3, Gsc, |
| Slc27a2, Socs3, Sox2, Spp1, Tcf15, | Hand1, Hand2, Insm1, Isl1, Lbx1, |
| Tcfap2c, Tcfcp2l1, Tcl1, Tle4, Trp53, | Lhx2, Lhx5, Lmx1a, Mbd2, Meis1, |
| Utf1, Zfp296, Zfp42 | Mixl1, Myf5, Neurog1, Nfia, Npas3, |
| | Nr2f1, Nr2f2, Nrp1, Nrp2, Olig3, Otp, |
| | Otx1, Pax3, Pdx1, Peg3, Phox2b, |
| | Prl3d1, Prox1, Rybp, Shh, Sox1, |
| | Sox18, Sox3, Sox5, Sox9, Stra13, Syp, |
| | Tcf4 |
The training set comprises 46 positive examples labelled as the MSMG class and 70 negative examples labelled as the non-MSMG class. These genes were derived from expert curation.
Performance of SVM classifiers
| Datatype_kernel | TP | FP | TN | FN | TPR | FPR | Accuracy |
|---|---|---|---|---|---|---|---|
| micro_linear | 42 | 17 | 53 | 4 | 0.91 | 0.24 | 0.82 |
| micro_poly | 39 | 24 | 46 | 7 | 0.85 | 0.34 | 0.73 |
| **micro_RBF** | 37 | 3 | 67 | 9 | 0.80 | 0.04 | **0.90** |
| chip_binary_linear | 35 | 10 | 60 | 11 | 0.78 | 0.13 | 0.84 |
| chip_binary_poly | 36 | 5 | 65 | 10 | 0.78 | 0.07 | 0.87 |
| **chip_binary_RBF** | 39 | 8 | 62 | 7 | 0.85 | 0.11 | **0.87** |
| chip_contin_linear | 38 | 7 | 63 | 8 | 0.83 | 0.10 | 0.87 |
| chip_contin_poly | 36 | 8 | 62 | 10 | 0.78 | 0.11 | 0.84 |
| **chip_contin_RBF** | 39 | 5 | 65 | 7 | 0.85 | 0.07 | **0.90** |
| weight_binary_linear | 39 | 9 | 61 | 7 | 0.85 | 0.13 | 0.86 |
| weight_binary_poly | 37 | 5 | 65 | 9 | 0.80 | 0.07 | 0.88 |
| weight_binary_RBF | 40 | 4 | 66 | 6 | 0.87 | 0.06 | 0.91 |
| weight_contin_linear | 41 | 9 | 61 | 5 | 0.89 | 0.13 | 0.88 |
| weight_contin_poly | 37 | 8 | 62 | 9 | 0.80 | 0.11 | 0.85 |
| **weight_contin_RBF** | 42 | 5 | 65 | 4 | 0.91 | 0.07 | **0.92** |
| simple_binary_linear | 39 | 9 | 61 | 7 | 0.85 | 0.13 | 0.86 |
| simple_binary_poly | 37 | 3 | 67 | 9 | 0.80 | 0.04 | 0.90 |
| **simple_binary_RBF** | 42 | 3 | 67 | 4 | 0.91 | 0.04 | **0.94** |
| simple_contin_linear | 41 | 9 | 61 | 5 | 0.89 | 0.13 | 0.88 |
| simple_contin_poly | 43 | 17 | 53 | 3 | 0.93 | 0.24 | 0.83 |
| **simple_contin_RBF** | 41 | 3 | 67 | 5 | 0.89 | 0.04 | **0.93** |
Comparison of the performance of several kernel functions used for SVM learning, applied to single and heterogeneous data types (mRNA expression and ChIP-seq). The best performer in each category is highlighted in bold. Kernel functions: linear kernel, polynomial kernel (poly), and Gaussian radial basis function kernel (RBF) (see methods). Datasets: "micro", mRNA expression microarrays; "chip_binary", ChIP-seq data pre-processed into binary feature values; "chip_contin", ChIP-seq data pre-processed into continuous feature values. Data integration strategies: "weight", weighted kernel matrices; "simple", a single kernel matrix built from the concatenation of the two data types (see methods). For example, "simple_binary_poly" denotes concatenating microarray and binary ChIP-seq data and training an SVM with a polynomial kernel function.
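The rate columns in the table can be recomputed from the four counts. The sketch below assumes the standard definitions (TPR = TP / (TP + FN), FPR = FP / (FP + TN), accuracy = (TP + TN) / total); the function name `confusion_metrics` is illustrative and does not come from the paper.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Return (TPR, FPR, accuracy) from confusion-matrix counts.

    Assumes the standard definitions; the source table does not spell them out.
    """
    tpr = tp / (tp + fn)                        # fraction of true positives recovered
    fpr = fp / (fp + tn)                        # fraction of negatives called positive
    accuracy = (tp + tn) / (tp + fp + tn + fn)  # fraction of all genes classified correctly
    return tpr, fpr, accuracy


# Counts taken from the weight_binary_RBF row of the table.
tpr, fpr, acc = confusion_metrics(tp=40, fp=4, tn=66, fn=6)
print(f"TPR={tpr:.2f} FPR={fpr:.2f} accuracy={acc:.2f}")
# prints: TPR=0.87 FPR=0.06 accuracy=0.91
```

With 46 positive and 70 negative training genes, each row's four counts sum to 116, which gives a quick consistency check on any row.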
Figure 1. ROC curves. Representative ROC curves for three kernel-based SVM classifiers, generated using threefold cross-validation with only the mRNA expression microarray dataset for training. The ROC curves were generated by varying the decision threshold of each SVM classifier. The average AUCs for the linear, polynomial, and RBF kernels are 0.89, 0.85, and 0.95, respectively. ROC: receiver operating characteristic; TPR: true positive rate; FPR: false positive rate; AUC: area under the curve.
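As a rough illustration of how varying a decision threshold traces out a ROC curve, the following sketch sweeps a threshold over classifier scores and integrates the curve with the trapezoidal rule. The scores and labels are made-up toy values, and `roc_points` and `auc` are hypothetical helper names, not the authors' code.

```python
def roc_points(scores, labels):
    """Sweep the decision threshold over the scores; return (FPR, TPR) points."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # strictest threshold: nothing is called positive
    # Use each distinct score as a threshold, from strictest to most lenient.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical decision values and true labels (1 = MSMG, 0 = non-MSMG).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

curve = roc_points(scores, labels)
print(curve)
print(auc(curve))  # prints 0.875 for this toy data
```

In a cross-validation setting, one such curve would be computed per fold and the AUC values averaged, which is presumably how the average AUCs in the caption were obtained.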
Figure 2. Classification performance of different types of classifiers. The performance of the best SVM in each category is compared with three other standard machine learning methods, LDA (Linear Discriminant Analysis), Decision Tree, and ANN (Artificial Neural Networks), and with a simple fold-change-based predictor. The accuracy of the machine learning methods is measured using leave-one-out cross-validation (LOOCV). Panels are labelled as follows: "microarray", using genome-wide mRNA microarray profiling data; "chip", using genome-wide ChIP-seq data for transcription factors; "micro-chip", using both microarray and ChIP-seq data. The fold-change-based predictor appears only in the "microarray" panel because it uses only microarray data.
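The LOOCV procedure mentioned in the caption can be sketched as follows. To keep the example self-contained, a nearest-centroid rule stands in for the classifier; this is an assumption made for illustration and is not one of the methods compared in the figure. The feature values are toy data.

```python
def centroid(rows):
    """Mean feature vector of a list of equal-length rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(train_x, train_y, x):
    """Nearest-centroid stand-in classifier (not the SVM used in the paper)."""
    pos = centroid([r for r, y in zip(train_x, train_y) if y == 1])
    neg = centroid([r for r, y in zip(train_x, train_y) if y == 0])
    return 1 if sq_dist(x, pos) < sq_dist(x, neg) else 0


def loocv_accuracy(xs, ys):
    """Hold out each example once, train on the rest, score the held-out one."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)


# Hypothetical two-feature toy data (1 = MSMG, 0 = non-MSMG).
xs = [[2.0, 1.8], [1.9, 2.2], [2.3, 2.0], [0.1, 0.3], [0.4, 0.0], [0.2, 0.2]]
ys = [1, 1, 1, 0, 0, 0]
print(loocv_accuracy(xs, ys))  # prints 1.0 for this well-separated toy data
```

With a training set of 116 genes, LOOCV means 116 train-and-test rounds per classifier, each scoring a single held-out gene.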
Evaluation of RNAi screens as a test set
| Datatype_kernel | Signal-to-noise ratio |
|---|---|
| micro_linear | 1.44 |
| micro_poly | 1.00 |
| chip_binary_linear | 1.36 |
| chip_binary_poly | 1.84 |
| chip_contin_linear | 1.57 |
| chip_contin_poly | 1.86 |
| weight_binary_linear | 1.80 |
| weight_binary_poly | 1.96 |
| weight_contin_linear | 1.84 |
| weight_contin_poly | 1.65 |
| simple_binary_linear | 1.80 |
| simple_binary_poly | 3.43 |
| simple_contin_linear | 1.84 |
| simple_contin_poly | 1.38 |
Signal-to-noise ratio of each method, based on the percentage of genes predicted as positive in the positive test set (self-renewal screen) relative to the percentage predicted as positive in the negative test set (insulin-pathway screen).
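One plausible reading of this signal-to-noise ratio is the percentage of predicted positives in the positive test set divided by the percentage in the negative test set. The sketch below implements that reading; the exact definition is an assumption, since the caption names only the two percentages, and the prediction vectors are invented toy values.

```python
def signal_to_noise(pred_pos_set, pred_neg_set):
    """Ratio of predicted-positive percentages in the two test sets.

    Assumed definition: (% predicted positive in the positive test set)
    divided by (% predicted positive in the negative test set).
    """
    signal = 100.0 * sum(pred_pos_set) / len(pred_pos_set)
    noise = 100.0 * sum(pred_neg_set) / len(pred_neg_set)
    if noise == 0:
        raise ValueError("no predicted positives in the negative test set")
    return signal / noise


# Hypothetical classifier outputs (1 = predicted MSMG, 0 = predicted non-MSMG).
self_renewal_screen = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]  # 70% predicted positive
insulin_screen = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]       # 20% predicted positive

print(signal_to_noise(self_renewal_screen, insulin_screen))  # prints 3.5
```

Under this reading, a value of 1.00 would mean the classifier flags the same share of genes in both screens, so it does not separate self-renewal hits from insulin-pathway hits.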
Figure 3. SVM classifiers to prioritize candidate genes from genome-wide RNAi screens. SVM classifiers were applied to predict "stemness" genes in test sets from two independent genome-wide RNAi screens that identified candidate genes functional in self-renewal and in insulin cell signalling. The black bars show the percentage of predicted MSMGs among the total genes from (a) the positive test set (functional in self-renewal) and (b) the negative test set (functional in insulin signalling).