Maurizio Giordano, Kumar Parijat Tripathi, Mario Rosario Guarracino.
Abstract
BACKGROUND: System toxicology aims at understanding the mechanisms used by biological systems to respond to toxicants. Such understanding can be leveraged to assess the risk of chemicals, drugs, and consumer products in living organisms. In system toxicology, machine learning techniques and methodologies are applied to develop prediction models for classification of toxicant exposure of biological systems. Gene expression data (RNA/DNA microarray) are often used to develop such prediction models.
Keywords: Feature selection; Gene signature; Smoking; Supervised learning; Toxicology
Year: 2018 PMID: 29536823 PMCID: PMC5850943 DOI: 10.1186/s12859-018-2035-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 SysTox challenge workflow. First stage (top row): selection of a gene signature from the gene expression data of human blood samples in the training dataset. Second stage (bottom row): development of inductive prediction models based on the training data of the gene signature, and classification of the testing dataset
Fig. 2 SysTox challenge datasets. Distributions of training labels and testing (gold) labels into classes of subjects: smokers (treated group), former smokers (cessation group), and never smokers (control group)
Prediction models
| Classifier | Acronym | Parameters |
|---|---|---|
| Random forests | RF | split=gini, max depth=none, min samples leaf=1, min samples split=1, max features=auto, no. estimators=10 |
| Gaussian Naive Bayes | GNB | |
| k-nearest neighbors | kNN | no. neighbors=3, algorithm=auto, metric=minkowski, p=2, weights=uniform, leaf size=30 |
| MultiLayer perceptron | MLP | activation=relu, algorithm=l-bfgs |
| Support vector classifier | SVC | kernel=linear, C=0.1, tolerance=0.001 |
| Logistic regression | LR | C=1.0, max iter=100, penalty=L2, tolerance=0.0001, multi class=OvR |
| Linear discriminant analysis | LDA | solver=SVD, tolerance=0.0001 |
| Gradient tree boosting | GTB | loss=deviance, subsample=1.0, learning rate=0.1, min samples split=2, min samples leaf=1, max depth=3, no. estimators=100 |
| Extremely randomized trees | ERT | split=gini, max depth=none, min samples leaf=1, min samples split=1, max features=auto, no. estimators=10 |
The set of nine prediction models built by means of supervised learning on expression data (from H1 training dataset) of gene signatures
RFE-SVM SvsNCS signature
|
|
|
|
| SEMA6B | RAD52 |
|
| SYCE1L |
|
| CRACR2B | MOG | ZP4 | KIT |
| AK8 | PLA2G4C | MIR4697HG | SPAG6 | ZNF618 |
|
| COL5A1 |
| TREM2 | TYR | MMP3 | LHX8 | KCNJ2-AS1 |
| SCIN |
| SPRY2 | ADRA2A | GCNT3 | PTGFR | PACRG-AS1 | LINC00599 | NR4A1 | CHI3L1 | TPPP3 | SLC25A20 |
| NT5C1A | TCEB3B | BMP7 | FANK1 | TMTC1 | FGD5 | APCDD1L | GYS2 | TIMM8A |
|
| SHISA6 | MYO1E | ADIRF-AS1 | CTTNBP2 | H19 | P2RY12 | DSTNP2 | MAGI2-AS3 | VSIG4 | NR4A2 |
| ICA1L | GFRA2 | GSE1 | NPIPB15 | ZFP64 | AFF3 | FOXC2 | CCR10 | ARHGAP32 |
|
| RRNAD1 | NOP9 | HYPM |
| SLC25A27 | C3orf65 | ZMYND12 | TM4SF4 | C6orf10 | DUSP4 |
| FUCA1 | PALLD | ETNPPL | HMGCS2 | LMOD3 | EFNB1 | FABP4 | WNT2 | FAM187B | LINC01270 |
| PRKG2 | NMNAT2 | CYP4A11 | FAM19A2 | S1PR5 | LINC00544 | LRPAP1 | CTSV | LOC200772 | THBS2 |
Gene signature obtained with Recursive Feature Elimination with SVM in the smokers versus non-current smokers case study. Gene names in bold are also present in the signatures found by Extra-Trees and LASSO-LARS methods
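As a rough illustration of how recursive feature elimination with a linear SVM ranks and prunes features, the sketch below assumes scikit-learn and runs on synthetic data, not on the challenge dataset. The sample count, feature count, number of retained features, and elimination step are illustrative assumptions; only the linear kernel and C=0.1 mirror the SVC row of the classifier table.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Synthetic stand-in for an expression matrix (samples x genes); assumption.
X, y = make_classification(
    n_samples=60, n_features=200, n_informative=10, random_state=0
)

# Linear SVM whose coefficients are used to rank features at each round.
svm = SVC(kernel="linear", C=0.1)

# Drop 10% of the features per round until 20 remain (both values assumed).
selector = RFE(estimator=svm, n_features_to_select=20, step=0.1)
selector.fit(X, y)

# Column indices of the retained features, i.e. the candidate "signature".
selected = [i for i, keep in enumerate(selector.support_) if keep]
print(selected)
```

On real data the indices in `selected` would be mapped back to gene names to obtain a table like the one above.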
Extra-Trees SvsNCS signature
|
| LINC00599 |
|
|
|
| CTTNBP2 |
|
| PF4 |
| RGL1 |
|
|
| C15orf54 | MCOLN3 | F2R | P2RY1 | GUCY1A3 | NRG1 |
| SEMA6B | ESAM | CR1L |
| GP1BA | MAPK14 | PBX1 | GNAZ | GP6 |
|
| RNASE1 | SLC44A1 | ASGR2 | GUCY1B3 | ZNF101 | LTBP1 | TRIP6 | SRRD | PRR5L | CYSTM1 |
|
| GRAP2 | ANKRD37 | MKNK1 | BEX2 | SV2B | FAXDC2 |
| ICOS | NFIB |
| TRDC | SLPI | CDK2AP1 | IL4R | GPR20 | SH2D1B | TLR5 | VIL1 | ITGB5 | IGSF9B |
| CDR2 | BTBD11 | ELOVL7 | ARL3 | TUBB1 | BZRAP1 | ADAMDEC1 | C2orf88 | COCH | LOC100506870 |
| LOC100130938 | CA2 | P2RY12 | SH3BGRL2 | PCSK6 | PRTFDC1 | SAMD14 | CYP4A11 | ASAP2 | H19 |
| LOC283194 | BLCAP | GORASP1 | TGM2 | SLC26A8 | ZAK | PARD3 | MB21D2 | GP9 | S100A12 |
| FANK1 | TNFSF4 | ZNF618 | FAM210B | MYBPC3 | SLC35G2 | ASIC3 | SLC6A4 | CNST | PAPSS2 |
Gene signature obtained with Extra-Trees feature selection in the smokers versus non-current smokers case study. Gene names in bold are also present in the signatures found by RFE-SVM and LASSO-LARS methods
LASSO-LARS SvsNCS signature
|
|
|
| GPR63 |
|
|
|
| GSE1 | ARHGAP32 |
|
| CRACR2B | PTGFR | LHX8 |
| SYCE1L | APCDD1L | OTC |
|
|
|
| CCR10 | P2RY12 |
|
| RAD52 | TRDC | BCLAF1 | KNTC1 | CLSTN3 |
| ZNF536 | ACAP1 | DLGAP5 | IFT140 | LAPTM4A | MTSS1 | SETD1A | CCP110 | GPRASP1 | USP34 |
| SPCS2 | PHACTR2 | TM9SF4 | HDAC9 | SART3 | BMS1 | KIAA0232 | DOCK4 | TBC1D5 | CEP104 |
| PIEZO1 | PTDSS1 | VPRBP | SECISBP2L | SLK | FAM65B | KIAA0195 | SNPH | EIF4A3 | RAPGEF5 |
| RASSF2 | KIAA0101 | JADE3 | KIAA0247 | ZFYVE16 | KIAA0513 | LZTS3 | RIMS3 | SNX17 | MLEC |
| TOX | DHX38 | RAB11FIP3 | HDAC4 | FRMPD4 | KMT2B | TBKBP1 | STARD8 | ZSCAN12 | RNF144A |
| ATG13 | KIAA0586 | PCDHA9 | MATR3 | NOS1AP | ZNF646 | SDC3 | KIAA0430 | DZIP3 | SAFB2 |
| EIF5B | IPO13 | WSCD2 | SLC25A44 | CEP135 | KIAA0040 | TTI1 | PPIP5K1 | PHF14 | FAM53B |
Gene signature obtained with Least Absolute Shrinkage and Selection Operator (with Least Angle Regression procedure) in smokers versus non-current smokers case study. Gene names in bold are also present in the signatures found by RFE-SVM and Extra-Trees methods
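The overlap between the three SvsNCS signatures can be checked with plain set operations. The sketch below uses only short excerpts of the gene lists shown in the three tables above, so its output describes those excerpts and not the complete signatures; the variable names are illustrative.

```python
# Excerpts of the three SvsNCS signatures, copied from the tables above.
rfe_svm = {
    "SEMA6B", "RAD52", "SYCE1L", "CRACR2B", "ZNF618", "LINC00599",
    "FANK1", "CTTNBP2", "H19", "P2RY12", "GSE1", "CCR10", "ARHGAP32",
    "CYP4A11", "PTGFR", "LHX8", "APCDD1L",
}
extra_trees = {
    "LINC00599", "CTTNBP2", "PF4", "RGL1", "F2R", "SEMA6B", "P2RY12",
    "CYP4A11", "H19", "FANK1", "ZNF618", "TRDC",
}
lasso_lars = {
    "GPR63", "GSE1", "ARHGAP32", "CRACR2B", "PTGFR", "LHX8", "SYCE1L",
    "APCDD1L", "CCR10", "P2RY12", "RAD52", "TRDC",
}

# Pairwise overlaps between selection methods.
svm_and_trees = rfe_svm & extra_trees
svm_and_lasso = rfe_svm & lasso_lars
trees_and_lasso = extra_trees & lasso_lars

# Genes selected by all three methods (within these excerpts).
shared_by_all = rfe_svm & extra_trees & lasso_lars

print(sorted(svm_and_trees))
print(sorted(svm_and_lasso))
print(sorted(trees_and_lasso))
print(sorted(shared_by_all))
```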
Fig. 3 SvsNCS signature. Boxplot distribution of expression data (from H1 training dataset) of genes from the signature obtained for the case study of smokers versus non-current smokers
RFE-SVM FSvsNS signature
| SLC38A3 | POU4F1 |
| GOLGA2P5 | IL17RD | CELF5 | ADAMTS14 | PTPN14 | MB21D2 | TBC1D29 |
| RRP12 | C4BPB | KRT73 | DCAF4 | ZNF280B | LOC648691 | DDX11 | TJP3 | LINC01097 | BCL2L12 |
| RAB42 | CLSPN | ADAM23 | CFD | TAS2R9 | CFAP46 | VSIG4 | GDF9 | SI | DOCK4-AS1 |
| SH3PXD2A-AS1 |
| MMP1 | PLA2G2A | RTN3 | LY6G6D | ANKRD6 | IGSF9B | ZNF582-AS1 | C8orf88 |
| REG3A | ETV2 | NDST3 | C6orf99 | WNT5B | PAX4 | NNAT | HCG26 | SLC5A11 | TAAR3 |
| TTC22 | HAGHL | C17orf78 | EDN2 | MTUS1 | PLCD4 | C1orf115 | PLEK |
| SLC34A2 |
| GGT5 | ZNF470 | SYN1 | SCD | MRAS | FOXI1 |
| HTN3 | SH3D19 | HIST1H4E |
| SHISA6 | MCOLN3 | LOC100507534 | SASH1 | APEX1 | C22orf31 | RNF114 | SRRM4 | SCN2B | HMBOX1 |
| ATP6V1C2 | HSF4 | SLC17A5 | SEPT2 | TFAP4 | WWTR1 | FGF4 | SRCIN1 | SLC35F1 | SLC16A2 |
| TAS2R50 | PCAT19 | ADAMTS18 | TMEM31 | CAMK1G | SLC25A31 | SMR3B | SLC17A4 | XRCC6BP1 | PTPRB |
Gene signature obtained with Recursive Feature Elimination with SVM in former smokers versus never smokers case study. Gene names in bold are also present in the signatures found by Extra-Trees and LASSO-LARS methods
Extra-Trees FSvsNS signature
| MMP1 | PRR29 | APCS |
| DLK2 |
| CNTN2 | CLDN17 | CHGA | TMEM31 |
| MAPK10 | ZNF280B | C20orf85 | LDHD |
| MAF | WFIKKN2 | CYP4B1 | NTRK3-AS1 | NKX6-1 |
| FAM221A | IFIT1 | SLC16A1 | HSD11B1L |
| CLCN1 | IGSF9B | CENPU | ZNF652 | GPAM |
| ENTPD7 | FBXL19-AS1 | PRKCE | HCG26 | NLRP14 | B3GNT7 | KLF14 | SLCO4A1 | SNCG | SLC34A2 |
| CEP76 | CXorf36 | ATF2 | STAU2-AS1 | SIGLEC11 | RWDD3 | ASB16 | FGB | HIST1H4H | ERN2 |
| CLRN1-AS1 | SLC50A1 | DOK4 | FASTKD1 | MB21D2 | HDAC1 | KIF2A | GMIP | CT83 | CYP2A13 |
| MED6 | CHDC2 | FGF13-AS1 | IFNA21 | DEPDC5 | CEP250 | MCM3AP | KRT75 | GLP1R | RAD51B |
| CFAP20 | TMEM184A | HOMEZ | LINC00922 | CRP | MAST1 | CBL | SDF4 | KRT19 | CELF5 |
| CDCA8 | ACTL8 | MRPS12 | ACER1 | SYCE3 | AP4E1 | TYK2 | LOC283914 | SLC12A1 | SCN2A |
| PLAC4 | OXCT1 | ABCA11P | GLB1 | TCEAL7 | LRRC32 | BHLHE22 | LINC01012 | TBK1 | TMEM225 |
Gene signature obtained with feature selection of Extra-Trees in former smokers versus never smokers case study. Gene names in bold are also present in the signatures found by RFE-SVM and LASSO-LARS methods
LASSO-LARS FSvsNS signature
| POU4F1 | PTPRB |
| SLC38A3 | PTPN14 | GDF9 |
| C4BPB | LINC00901 |
|
| HSF4 | ADAMTS18 | SEPT2 | LOC648691 | EDN2 | LINC00319 | DOCK4-AS1 | TMEM246 | PBK | LINC00964 |
| SLC7A11 | IL17RD | TBC1D29 | PTPN3 |
| KIAA0513 | KIAA0586 | IFT140 | LAPTM4A | RNF144A |
| MATR3 | RIMS3 | SETD1A | CCP110 | GPRASP1 | USP34 | SNX17 | DHX38 | KNTC1 | HDAC9 |
| PIEZO1 | SART3 | DOCK4 | CEP104 | VPRBP | SECISBP2L | RAB11FIP3 | ZNF646 | TMEM63A | UTP14C |
| SEMA3E | NOS1AP | GPRIN2 | ARHGAP32 | ACAP1 | ZFYVE16 | PCDHA9 | KIAA0247 | LZTS3 | MLEC |
| TOX | HDAC4 | FRMPD4 | JADE3 | KMT2B | TBKBP1 | KIAA0101 | STARD8 | ZSCAN12 | SNPH |
| ZNF536 | FAM65B | RASSF2 | RAPGEF5 | SLK | KIAA0195 | BCLAF1 | EIF4A3 | ATG13 | TM9SF4 |
| CLSTN3 | KIAA0232 | TBC1D5 | PHACTR2 | KIAA0226 | ADAMTSL2 | KIAA0430 | MDC1 | IQCB1 | ZNF516 |
| PDE4DIP | CEP135 | LPIN2 | DZIP3 | TTLL4 | SAFB2 | EIF5B | IPO13 | WSCD2 | SDC3 |
Gene signature obtained with Least Absolute Shrinkage and Selection Operator (with Least Angle Regression procedure) in former smokers versus never smokers case study. Gene names in bold are also present in the signatures found by RFE-SVM and Extra-Trees methods
Fig. 4 FSvsNS signature. Boxplot distribution of expression data (from H1 training dataset) of genes from the signature obtained for the case study of former smokers versus never smokers
Fig. 5 Disease, pathway, and GO term associations of the RFE-SVM, Extra-Trees and LASSO-LARS signatures. Comparative analysis of the diseases, pathways, and gene ontology terms associated with the gene signatures obtained with the RFE-SVM, Extra-Trees and LASSO-LARS selection methods in the case study of smokers versus non-current smokers
SvsNCS signature biological interpretation
| Gene name | Gene description | Chemical interaction |
|---|---|---|
| CLEC10A | C-type lectin domain containing 10A | Benzo(a)pyrene |
| GPR15 | G protein-coupled receptor 15 | Tobacco Smoke Pollution |
| B3GALT2 | beta-1,3-galactosyltransferase 2 | Tobacco Smoke Pollution, Tretinoin, Valproic Acid, Vehicle Emissions |
| CDKN1C | cyclin-dependent kinase inhibitor 1C (p57, Kip2) | Tetrachlorodibenzodioxin, tert-Butylhydroperoxide, Valproic Acid |
| DSC2 | desmocollin 2 | Tetrachlorodibenzodioxin, Valproic Acid |
| LRRN3 | leucine rich repeat neuronal 3 | Tobacco Smoke Pollution |
| AHRR | aryl-hydrocarbon receptor repressor; programmed cell death 6 | Benzo(a)pyrene |
| TMEM163 | transmembrane protein 163 | Valproic Acid, Benzo(a)pyrene |
| PID1 | phosphotyrosine interaction domain containing 1 | Valproic Acid, Benzo(a)pyrene |
| FSTL1 | follistatin-like 1 | Methylnitronitrosoguanidine co-treated with Cadmium Chloride |
| P2RY6 | pyrimidinergic receptor P2Y, G-protein coupled, 6 | Benzo(a)pyrene |
| PTGFRN | prostaglandin F2 receptor inhibitor | Benzo(a)pyrene, Tetrachlorodibenzodioxin, Valproic Acid |
| ST6GALNAC1 | ST6 N-acetylgalactosaminide alpha-2,6-sialyltransferase 1 | Acetaminophen, Clofibrate, Phenylmercuric Acetate |
| SASH1 | SAM and SH3 domain containing 1 | Benzo(a)pyrene |
Enrichment analysis of the proposed gene signature in the smokers versus non-current smokers case study
Signature overlaps among methods
| Gene | Our | PMI | T264 | T225 | T259 |
|---|---|---|---|---|---|
| CLEC10A | ✓ | ✓ | ✓ | ✓ | |
| GPR15 | ✓ | ✓ | ✓ | ✓ | |
| B3GALT2 | ✓ | | | | |
| CDKN1C | ✓ | ✓ | ✓ | ✓ | ✓ |
| DSC2 | ✓ | | | | |
| LRRN3 | ✓ | ✓ | ✓ | ✓ | ✓ |
| AHRR | ✓ | ✓ | ✓ | ✓ | |
| TMEM163 | ✓ | | | | |
| PID1 | ✓ | ✓ | ✓ | ✓ | |
| FSTL1 | ✓ | | | | |
| P2RY6 | ✓ | ✓ | ✓ | ✓ | |
| PTGFRN | ✓ | | | | |
| ST6GALNAC1 | ✓ | | | | |
| SASH1 | ✓ | ✓ | ✓ | ✓ | ✓ |
| RGL1 | ✓ | ✓ | ✓ | | |
| SEMA6B | ✓ | ✓ | ✓ | | |
| CTTNBP2 | ✓ | ✓ | | | |
| F2R | ✓ | ✓ | | | |
Overlap matrix of the proposed gene signature with those produced by PMI and by the three winning teams of the SysTox Computational Challenge (for the smokers versus non-current smokers case study)
Fig. 6 Disease-chemical association of common gene signature. Disease and chemical association of the 8 genes (common gene signature) from our signature which are shared by the three winning teams of the challenge (smokers versus non-current smokers case study)
Fig. 7 Disease-chemical association of total gene signature. Disease and chemical association of our signature, which includes 6 genes not shared by the three winning teams of the challenge (smokers versus non-current smokers case study)
Fig. 8 Pathways overlap in pathways dataset. Overlap of the pathway information of the common (8) and specific (6) gene signatures (obtained for the case study of smokers versus non-current smokers) with the complete dataset of pathways related to tobacco smoking exposure
FSvsNS signature biological interpretation
| Gene name | Gene description | Chemical interaction |
|---|---|---|
| CLUL1 | clusterin like 1 | Valproic Acid, bisphenol A |
| NS3BP | NS3 binding protein | |
| HSD11B1 | hydroxysteroid 11-beta dehydrogenase 1 | Hydrocortisone, bisphenol A, Tetrachlorodibenzodioxin |
Enrichment analysis of the proposed gene signature in the former smokers versus never smokers case study
Performance of classifiers using SvsNCS signature
| | RF | GNB | kNN | MLP | SVC | LR | LDA | GTB | ERT | T264 | T225 | T259 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AUPR | 0.961 | 0.938 | 0.9140 | 0.9043 | | 0.9537 | 0.9484 | 0.9650 | 0.9580 | 0.96 | 0.97 | 0.95 |
| MCC | 0.9012 | 0.8766 | 0.8025 | 0.8272 | | 0.8148 | 0.8765 | 0.9136 | 0.8642 | 0.90 | 0.77 | 0.79 |
Performance measures, in terms of AUPR and MCC scores, of nine classifiers using the signature obtained for the case study of smokers versus non-current smokers. Results are compared to the scores obtained by winners of SysTox Computational Challenge. Best results in boldface
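For readers unfamiliar with the MCC score reported above, the sketch below computes the Matthews correlation coefficient from the four cells of a binary confusion matrix using its standard definition. The function name and the example counts are illustrative and are not taken from the challenge data.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns a value in [-1, 1]: 1 for perfect agreement, 0 for
    chance-level prediction, -1 for total disagreement.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0 when any marginal count is zero.
    if denominator == 0:
        return 0.0
    return numerator / denominator


# Illustrative counts: 3 true positives, 3 true negatives, 1 error of each kind.
print(mcc(tp=3, tn=3, fp=1, fn=1))
```

Because MCC uses all four cells, it stays informative when the two classes differ in size, which is one reason it is often reported alongside AUPR.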
Performance of classifiers using FSvsNS signature
| | RF | GNB | kNN | MLP | SVC | LR | LDA | GTB | ERT | T264 | T225 | T259 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AUPR | 0.6366 | 0.6357 | 0.6594 | 0.6710 | | 0.7024 | 0.6581 | 0.5528 | 0.6774 | 0.58 | 0.50 | 0.47 |
| MCC | 0.0845 | 0.1092 | 0.1310 | 0.0307 | | 0.2318 | 0.1472 | -0.0644 | 0.1092 | 0.07 | 0.02 | -0.02 |
Performance measures, in terms of AUPR and MCC scores, of nine classifiers using the signature obtained for the case study of former smokers versus never smokers. Results are compared to the scores obtained by winners of SysTox Computational Challenge. Best results in boldface
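The AUPR score summarizes how well a classifier ranks positive samples ahead of negative ones. The sketch below uses average precision, one common step-wise estimate of the area under the precision-recall curve; whether the challenge scored AUPR with this estimator or with another interpolation is an assumption not confirmed by the source. The function name and the toy inputs are illustrative, and the code assumes that the scores contain no ties.

```python
def average_precision(labels, scores):
    """Step-wise estimate of the area under the precision-recall curve.

    labels: 0/1 class labels (1 = positive class).
    scores: classifier scores, higher meaning "more likely positive".
    Assumes distinct scores (no tie handling).
    """
    n_positive = sum(labels)
    if n_positive == 0:
        raise ValueError("at least one positive label is required")

    # Rank samples from the highest to the lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    true_positives = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            # Precision at the rank where this positive is retrieved.
            precision_sum += true_positives / rank
    return precision_sum / n_positive


# Toy example: positives ranked first and third out of four samples.
print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```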