Yongli Hu, Takeshi Hase, Hui Peng Li, Shyam Prabhakar, Hiroaki Kitano, See Kiong Ng, Samik Ghosh, Lawrence Jin Kiat Wee.
Abstract
BACKGROUND: The ability to sequence the transcriptomes of single cells using single-cell RNA-seq technologies represents a paradigm shift: scientists can now investigate the complex biology of a heterogeneous population of cells one cell at a time. To date, however, there has been no suitable computational methodology for analyzing such large and intricate data sets, in particular techniques that aid the identification of the transcriptomic differences between cellular subtypes. In this paper, we describe a novel methodology for the analysis of single-cell RNA-seq data, obtained from neocortical cells and neural progenitor cells, using machine learning algorithms (support vector machine (SVM) and random forest (RF)).
Keywords: Machine learning; Network reconstruction; Single-cell RNA-seq; Systems biology
Year: 2016 PMID: 28155657 PMCID: PMC5260093 DOI: 10.1186/s12864-016-3317-7
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Workflow for the data analysis carried out in this paper. *All genes refers to the set of genes filtered by expression values, and ^selected genes refers to the optimal set of genes identified by gene set enrichment analysis (GSEA), statistical, and machine-learning approaches (see Methods for more information)
Fig. 2 Clustering of 65 neuronal cells. The approximately unbiased (AU) probability value at each node is shown in red font. There are four distinct clusters (red boxes labelled 1–4) with an AU value higher than 80. Box 1 comprises mainly NPCs, while boxes 2–4 consist primarily of neuronal cells
Genes/features selected by disparate feature selection techniques
| Feature selection techniques (a) | Features/genes (No.) |
|---|---|
| Filtered by low expression | 8281 |
| GSVA feature enrichment | 1161 |
| sRAP | 837 |
| SVM-RFE | 38 |
| RF-based Positive MDA | 3339 |
| | 60 |
(a) Feature selection used five methodologies: two machine learning algorithms (SVM and RF), traditional differential expression analysis (sRAP), t-test based analysis (limma), and genes in deregulated pathways (GSVA)
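The first row of the table (filtering by low expression) can be illustrated with a short sketch. It assumes the exclusion rule stated for the "All genes" set in the footnotes of the accuracy table: a transcript is dropped if its total expression is zero and/or more than six samples have an expression level below one. The function name, the dictionary layout, and the toy values are hypothetical and are not taken from the paper.

```python
def filter_low_expression(expression, max_low_samples=6, low_cutoff=1.0):
    """Keep transcripts that pass the low-expression rule.

    expression: dict mapping transcript name -> list of per-sample values
    (a hypothetical layout, chosen only for this illustration).
    """
    kept = {}
    for transcript, values in expression.items():
        total = sum(values)
        n_low = sum(1 for v in values if v < low_cutoff)
        # Exclude if total expression is zero and/or too many samples are low.
        if total == 0 or n_low > max_low_samples:
            continue
        kept[transcript] = values
    return kept


# Toy data with 8 samples per transcript (all values are made up).
toy = {
    "gene_A": [5.0, 3.2, 0.0, 7.1, 2.2, 4.0, 1.5, 9.3],   # 1 low sample
    "gene_B": [0.0] * 8,                                    # total is zero
    "gene_C": [0.2, 0.0, 0.5, 0.1, 0.0, 0.9, 0.3, 12.0],   # 7 low samples
}
filtered = filter_low_expression(toy)
print(sorted(filtered))
```

In this toy input only `gene_A` survives: `gene_B` has zero total expression and `gene_C` has seven samples below the cutoff.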
Accuracy of RF and SVM classifiers on the neuronal dataset
| Genes selected | Accuracy (%) (a), SVM | Accuracy (%) (a), RF | MCC (^), SVM | MCC (^), RF |
|---|---|---|---|---|
| All genes (b) | 95.3 | 76.9 | 0.91 | 0.00 |
| GSVA feature enrichment | 98.5 | 76.9 | 0.87 | 0.00 |
| sRAP | 100 | 76.9 | 1.00 | 0.00 |
| SVM-RFE | 100 | 100 | 1.00 | 1.00 |
| RF-based Positive MDA | 100 | 76.9 | 1.00 | 0.00 |
| | 100 | 97.0 | 1.00 | 0.91 |
The accuracy of the SVM predictors was obtained from leave-one-out (LOO) cross-validation. SVM and RF classifiers were constructed with each set of data listed in Table 2
(a) All percentages are rounded to three significant figures
(b) Transcripts with a total expression of zero and/or with more than six samples having expression levels less than one were excluded
(^) Matthews correlation coefficient (MCC), rounded to 2 decimal places
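Several RF rows pair an accuracy of 76.9% with an MCC of 0.00, which may look contradictory. The sketch below shows how accuracy and MCC are computed from a confusion matrix, and that a classifier assigning every cell to the majority class produces exactly this pair of numbers. The source does not state that this is what happened. The 50/15 class split is a hypothetical, chosen only because 50 of 65 cells is about 76.9%, and the function names are illustrative.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) for a binary labelling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By common convention, MCC is reported as 0 when the denominator is 0,
    # e.g. when every sample is predicted to be the same class.
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical labels: 50 cells of class 1 and 15 cells of class 0.
y_true = [1] * 50 + [0] * 15
y_majority = [1] * 65  # every cell assigned to the majority class

counts = confusion_counts(y_true, y_majority)
acc_majority = accuracy(*counts)
mcc_majority = mcc(*counts)
print(round(100 * acc_majority, 1), mcc_majority)
```

Because MCC uses all four cells of the confusion matrix, it stays at zero for a one-class predictor however unbalanced the classes are, which makes it a useful companion to accuracy on small, imbalanced data sets.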
Fig. 3 Gene regulatory network (GRN) in NPCs (a), GRN in neuronal cells (b), and differential GRN between the two cell types (c). a GRN in NPCs. Nodes represent transcripts, and links between two nodes represent regulatory interactions between the two transcripts in NPCs. Gene regulatory interactions with a high confidence score (confidence score > 0.75) in NPCs are shown in the diagram. b GRN in neuronal cells. Nodes represent transcripts, and links between two nodes represent regulatory interactions between the two transcripts in neuronal cells. Gene regulatory interactions with a high confidence score (confidence score > 0.75) in neuronal cells are shown in the diagram. c Differential GRN between the two cell types. Nodes are transcripts. Red links represent gene regulatory interactions that are activated in neuronal cells but not in NPCs, while blue links represent those activated in NPCs but not in neuronal cells. In this diagram, a regulatory interaction is assumed to be activated in one cell type but not in the other if the difference in its confidence score between the two cell types is greater than 0.75, e.g., an interaction whose confidence scores are 0.99 and 0.20 in neuronal cells and NPCs, respectively. Note that OTP is a representative DHG (see Table 3) and is therefore highlighted in red
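The rule used for panel c can be sketched in a few lines: an edge counts as cell-type specific when its confidence score in one cell type exceeds its score in the other by more than 0.75. This is a minimal illustration of that thresholding step only, not the authors' pipeline. The function name, the edge representation, the treatment of missing edges as score 0.0, and the gene names are assumptions. The 0.99 versus 0.20 pair mirrors the example in the caption.

```python
def differential_edges(conf_neuronal, conf_npc, threshold=0.75):
    """Split edges into neuronal-specific and NPC-specific lists.

    Both arguments map an edge, written as a (regulator, target) tuple, to a
    confidence score in [0, 1]. An edge missing from one network is treated
    as having score 0.0 there (an assumption of this sketch).
    """
    neuronal_specific, npc_specific = [], []
    for edge in sorted(set(conf_neuronal) | set(conf_npc)):
        diff = conf_neuronal.get(edge, 0.0) - conf_npc.get(edge, 0.0)
        if diff > threshold:
            neuronal_specific.append(edge)   # would be drawn as a red link
        elif -diff > threshold:
            npc_specific.append(edge)        # would be drawn as a blue link
    return neuronal_specific, npc_specific


# Hypothetical edges and scores (gene names are placeholders).
neuronal = {("gene_A", "gene_B"): 0.99,
            ("gene_C", "gene_D"): 0.10,
            ("gene_E", "gene_F"): 0.80}
npc = {("gene_A", "gene_B"): 0.20,
       ("gene_C", "gene_D"): 0.95,
       ("gene_E", "gene_F"): 0.70}

red, blue = differential_edges(neuronal, npc)
print(red, blue)
```

The third edge has high confidence in both cell types, so it appears in neither list: under this rule only the difference in score matters, not the score itself.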
Degree of difference between neuronal cells and NPCs in different genes
| Genes | Degree in neuronal cells (a) | Degree in NPCs (a) | Degree difference between neuronal cells and NPCs (a) |
|---|---|---|---|
| MRPS36 | 57.1 | 34.0 | 23.1 |
| RP11_4K3 | 23.0 | 44.0 | 21.0 |
| SENP5 | 19.9 | 35.7 | 15.8 |
| CLCNKB | 23.9 | 39.6 | 15.7 |
| POLR2F | 23.7 | 39.2 | 15.4 |
| OTP | 48.3 | 33.1 | 15.2 |
| RP1_58B11_1 | 24.6 | 39.3 | 14.7 |
| RP11_293M10_2 | 32.4 | 45.9 | 13.5 |
| RP3_465N24_6 | 25.6 | 39.1 | 13.5 |
| SNORA77 | 30.0 | 43.2 | 13.2 |
| C3orf65 | 50.8 | 38.0 | 12.8 |
| ANGPTL7 | 49.7 | 37.9 | 11.8 |
| RP4_580O19_2 | 47.2 | 36.9 | 10.2 |
| RNU4_27P | 45.1 | 35.0 | 10.0 |
| COL11A1 | 25.8 | 35.3 | 9.46 |
| RP11_68I18_10 | 47.9 | 39.3 | 8.67 |
| Y_RNA | 44.1 | 35.8 | 8.32 |
| THTPA | 29.9 | 37.5 | 7.61 |
| ZNF44 | 45.7 | 38.6 | 7.11 |
| CTH | 28.9 | 35.9 | 7.02 |
| RP11_692M12_4 | 42.8 | 35.8 | 7.02 |
| RP11_345P4_7 | 32.0 | 38.8 | 6.82 |
| RP13_614K11_2 | 36.5 | 43.3 | 6.81 |
| MIR4417 | 30.7 | 37.2 | 6.55 |
| RP11_223J15_2 | 36.7 | 42.5 | 5.77 |
| RNU6_1330P | 47.1 | 41.4 | 5.67 |
| AC004893_11 | 45.1 | 39.9 | 5.23 |
| RP3_406A7_5 | 33.4 | 37.5 | 4.06 |
| MIR378F | 39.1 | 35.5 | 3.63 |
| PNRC2 | 33.7 | 30.6 | 3.15 |
| RP5_886K2_3 | 43.5 | 40.6 | 2.89 |
| RP5_857K21_5 | 34.9 | 33.2 | 1.61 |
| PCID2 | 35.8 | 37.3 | 1.47 |
| MST1L | 38.5 | 37.4 | 1.00 |
| RP4_749H3_1 | 39.5 | 38.9 | 0.624 |
| RPL18AP2 | 36.0 | 36.6 | 0.573 |
| DHDDS | 41.1 | 40.7 | 0.414 |
(a) Degrees and degree differences are rounded to three significant figures
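The last column of the table is the absolute difference between the two degree columns, and the rows are ordered by it. The sketch below recomputes that ranking for four rows copied from the table and also records which cell type has the higher degree. The table's differences appear to have been computed from unrounded degrees (for example, 0.414 for DHDDS), so values recomputed from the rounded columns can differ in the last digit. The function name and output layout are illustrative.

```python
# (gene, degree in neuronal cells, degree in NPCs), copied from the table.
rows = [
    ("OTP", 48.3, 33.1),
    ("DHDDS", 41.1, 40.7),
    ("MRPS36", 57.1, 34.0),
    ("RP11_4K3", 23.0, 44.0),
]


def degree_differences(rows):
    """Return (gene, |difference|, higher_in) tuples, largest difference first."""
    result = []
    for gene, deg_neuronal, deg_npc in rows:
        diff = round(abs(deg_neuronal - deg_npc), 1)
        if deg_neuronal > deg_npc:
            higher_in = "neuronal"
        elif deg_neuronal < deg_npc:
            higher_in = "NPC"
        else:
            higher_in = "equal"
        result.append((gene, diff, higher_in))
    return sorted(result, key=lambda r: r[1], reverse=True)


ranked = degree_differences(rows)
for gene, diff, higher_in in ranked:
    print(gene, diff, higher_in)
```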
Parameters used to optimize each network-inference algorithm
| Network-inference algorithms | Parameter optimization setting (c) |
|---|---|
| GENIE3-A (a) | K = “all”, nb.trees = 10,000 |
| GENIE3-B (a) | K = “sqrt”, nb.trees = 10,000 |
| TIGRESS-A (b) | scoring = “area” |
| TIGRESS-B (b) | scoring = “max” |
| ARACNE | eps = 0.1 |
| BC3NET | boot = 10, alpha1 = 0.99, alpha2 = 0.99 |
| SiGN-BN | Number of bootstrap iterations = 1,000 |
(a) GENIE3-A and -B represent two different parameter settings for the GENIE3 algorithm used in this study
(b) TIGRESS-A and -B represent two different parameter settings for the TIGRESS algorithm used in this study
(c) We used default settings for parameters that are not shown in this table