Sait Can Yücebaş, Yeşim Aydın Son.
Abstract
Genome-wide association studies (GWAS) make it possible to investigate many relations between single nucleotide polymorphisms (SNPs) and complex diseases. GWAS output can be both large and high dimensional, and the relations between SNPs, phenotypes and diseases are most likely nonlinear. To handle high-volume, high-dimensional data and to find these nonlinear relations, we used data mining approaches and designed a hybrid feature selection model that combines a support vector machine and a decision tree. The model was tested on prostate cancer data, and for the first time combined genotype and phenotype information was used to increase diagnostic performance. We were able to select phenotypic features such as ethnicity and body mass index, and SNPs that map to specific genes such as CRR9 and TERT. On the prostate cancer dataset, the proposed hybrid model reached a sensitivity of 90.92% and an area under the ROC curve of 0.91, which shows the potential of the approach for prediction and early detection of prostate cancer.
Year: 2014 PMID: 24651484 PMCID: PMC3961262 DOI: 10.1371/journal.pone.0091404
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Major allele coding scheme.
| Major Alleles | Coding Value |
| AA | 1 |
| AT/TA | 2 |
| AC/CA | 3 |
| AG/GA | 4 |
| TT | 5 |
| CT/TC | 6 |
| GT/TG | 7 |
| CC | 8 |
| GC/CG | 9 |
| GG | 10 |
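The coding scheme in the table can be applied with a small lookup. The following is a minimal sketch, not the authors' implementation: the name `encode_genotype` is hypothetical, and returning `None` for anything outside the table is an assumption about how uncoded calls would be handled.

```python
# Coding values copied from the table above; all names here are hypothetical.
GENOTYPE_CODES = {
    "AA": 1, "AT": 2, "AC": 3, "AG": 4, "TT": 5,
    "CT": 6, "GT": 7, "CC": 8, "CG": 9, "GG": 10,
}


def encode_genotype(genotype):
    """Return the table's coding value for a two-letter genotype.

    ASSUMPTION: genotypes not listed in the table (or missing calls)
    map to None so that they can be handled as missing values later.
    """
    if genotype is None:
        return None
    alleles = genotype.strip().upper()
    if len(alleles) != 2:
        return None
    # Sorting makes the lookup order-insensitive: "TA" and "AT" share a code.
    key = "".join(sorted(alleles))
    return GENOTYPE_CODES.get(key)
```

Sorting the two alleles before the lookup is what lets pairs such as AT/TA or GC/CG share one coding value, as in the table.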
Figure 1. Overall workflow of the SVM-Tree Hybrid Model.
The workflow starts with data preprocessing, in which a representative SNP subset is formed by PLINK and METU-SNP analysis, the phenotype and genotype data are integrated, and missing values are either eliminated or filled manually with the class mean. After preprocessing, the integrated dataset is fed into the hybrid model, where the SVM model provides the attribute weights that are used in ID3.
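The class-mean filling mentioned in the preprocessing step could be sketched as follows. This is an illustration under assumptions: rows are plain dictionaries, missing values are represented by `None`, and the function name `impute_class_mean` is hypothetical.

```python
def impute_class_mean(rows, labels, column):
    """Fill None values in `column` with the mean of that column
    computed over rows that share the same class label.

    ASSUMPTION: rows are dicts and missing values are None; a class
    with no observed value for the column is left unfilled.
    """
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        value = row[column]
        if value is not None:
            sums[label] = sums.get(label, 0.0) + value
            counts[label] = counts.get(label, 0) + 1

    filled = []
    for row, label in zip(rows, labels):
        new_row = dict(row)  # copy so the input rows are not modified
        if new_row[column] is None and label in counts:
            new_row[column] = sums[label] / counts[label]
        filled.append(new_row)
    return filled
```

For example, a control sample with a missing BMI would receive the mean BMI of the other control samples, not the mean over the whole dataset.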
Performance comparison of stand-alone SVM model.
| Performance Criteria | Only-Genotype Dataset | Only-Phenotype Dataset | Integrated Genotype and Phenotype Dataset |
| Accuracy | 59.02 | 68.23 | |
| Precision | 61.29 | 76.80 | |
| Recall | 63.15 | 70.12 | |
| AUC | 0.606 | 0.768 | |
The SVM model was tested on three datasets: genotype only, phenotype only, and integrated phenotype and genotype. The integrated dataset performs best on the performance criteria given.
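For reference, the four criteria in the table can be computed from confusion-matrix counts and classifier scores as in the sketch below. The function names are hypothetical, the percentages follow the table's convention, and the AUC uses the standard pairwise-ranking definition; how the authors computed their figures is not stated in the source.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision and recall (in percent) from confusion counts.

    ASSUMPTION: every denominator is nonzero, i.e. there is at least one
    predicted positive and one actual positive.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": 100.0 * (tp + tn) / total,
        "precision": 100.0 * tp / (tp + fp),
        "recall": 100.0 * tp / (tp + fn),  # recall is also called sensitivity
    }


def auc_from_scores(scores, labels):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive sample gets the higher score; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```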
Performance comparison of SVM-ID3 Hybrid Model.
| Performance Criteria | Only-Genotype Dataset | Only-Phenotype Dataset | Integrated Genotype and Phenotype Dataset |
| Accuracy | 71.67 | 84.23 | |
| Precision | 72.69 | 86.20 | |
| Recall | 68.96 | 83.78 | |
| AUC | 0.674 | 0.857 | |
The hybrid SVM-ID3 model was tested on the same three datasets: genotype only, phenotype only, and integrated phenotype and genotype. The integrated dataset performs best on the performance criteria given.
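One way the SVM attribute weights could feed into ID3 is sketched below. The source does not say how the weights enter the tree construction, so the thresholding on absolute weight, the `min_weight` parameter and all function names are assumptions made for illustration; the weights are passed in as a dictionary and no SVM is trained here. The information gain itself is the standard ID3 criterion.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """ID3 information gain of splitting on a categorical attribute."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in partitions.values()
    )
    return entropy(labels) - remainder


def choose_split(rows, labels, svm_weights, min_weight=0.1):
    """Pick the ID3 split among attributes whose |SVM weight| passes a threshold.

    ASSUMPTION: filtering by absolute weight is only one plausible way to
    combine the two models. Raises ValueError if no attribute passes.
    """
    candidates = [a for a, w in svm_weights.items() if abs(w) >= min_weight]
    return max(candidates, key=lambda a: information_gain(rows, labels, a))
```

Filtering before the gain computation keeps low-weight attributes out of the tree even when they would split the training data well, which is the feature selection role the hybrid design gives to the SVM.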
Figure 2. Overall tree structure of the hybrid model.
The main tree is given in the Tree S1 material because the structure is too large to show here; this figure is a reduced representation of it. The decision starts with ethnicity: African Americans are represented by AA, Japanese by JAP and Latinos by LAT. For all ethnicities the most descriptive phenotypic attribute is body mass index (BMI). Other phenotypic attributes in the upper levels of the tree are smoking behavior, family history, lycopene intake and physical activity. The number of SNPs in a node indicates the total number of SNPs found at different levels on that particular path of the tree.
SNPnexus results.
| Gene | Entrez gene | Phenotype | Disease Class | Pubmed |
| MCPH1 | 79648 | Adenocarcinoma|Pancreatic Neoplasms | CANCER | 19690177 |
| MCPH1 | 79648 | breast cancer | CANCER | 20508983 |
| SMARCA4 | 6597 | breast cancer | CANCER | 19183483 |
| CSMD1 | 64478 | Chromosomal Instability|Cystadenocarcinoma, Serous|Ovarian Neoplasms | CANCER | 19383911 |
| MTAP | 4507 | Melanoma|Nevus|Precancerous Conditions|Skin Neoplasms | CANCER | 19578365 |
| MTAP | 4507 | melanoma|Nevus|Skin Neoplasms | CANCER | 20574843 |
| MTAP | 4507 | melanoma|Nevus|Skin Neoplasms|Sunburn | CANCER | 20647408 |
| MTAP | 4507 | Precursor Cell Lymphoblastic Leukemia-Lymphoma | CANCER | 19665068 |
| ST6GALNAC3 | 256435 | Alcoholism | CHEMDEPENDENCY | 20421487 |
| ANGPT2 | 285 | BMI- Edema rosiglitazone or pioglitazone | PHARMACOGENOMIC | 18996102 |
| KLF7 | 8609 | Body Weight|Diabetes Mellitus, Type 2|Obesity|Overweight | METABOLIC | 19147600 |
| MTAP | 4507 | diabetes, type 2 | METABOLIC | 11985785 |
| PACRG | 135138 | male infertility | REPRODUCTION | 19268936 |
| SEMA5B | 54437 | Tobacco Use Disorder | CHEMDEPENDENCY | 20379614 |
The SNPs found by the hybrid system were searched in SNPnexus. Many of them were found to be associated with specific genes and phenotypes; this table lists some of the genes matched by the SNPs. As the disease class and phenotype columns indicate, our findings match the cancer disease class and the phenotypes examined for prostate cancer, such as body mass index, smoking and drinking habits.
High score SNPs from RegulomeDB.
| rsID | Hits | score |
| rs1433369 | Motifs|Footprinting|IRF, Motifs|PWM|DMRT5, Motifs|Footprinting|DMRT5, Motifs|Footprinting|STAT1, Motifs|PWM|IRF, Motifs|PWM|STAT1, Chromatin_Structure|FAIRE, Chromatin_Structure|DNase-seq, Protein_Binding|ChIP-seq|SMARCB1, Protein_Binding|ChIP-seq|POLR2A | 2,2 |
| rs11790106 | Motifs|Footprinting|Pax-6, Motifs|PWM|Pax-6, Chromatin_Structure|FAIRE, Chromatin_Structure|DNase-seq, Protein_Binding|ChIP-seq|GATA1, Protein_Binding|ChIP-seq|HNF4A, Protein_Binding|ChIP-seq|HEY1, Protein_Binding|ChIP-seq|EP300, Protein_Binding|ChIP-seq|SMARCC2, Protein_Binding|ChIP-seq|CEBPB, Protein_Binding|ChIP-seq|FOXA2, Protein_Binding|ChIP-seq|NR3C1, Protein_Binding|ChIP-seq|STAT3, Protein_Binding|ChIP-seq|POLR2A, Protein_Binding|ChIP-seq|FOXA1, Protein_Binding|ChIP-seq|SRF, Protein_Binding|ChIP-seq|CDX2 | 2,2 |
| rs6774902 | Motifs|PWM|MAF, Motifs|PWM|c-Ets-1, Motifs|Footprinting|c-Ets-1, Motifs|Footprinting|MAF, Chromatin_Structure|DNase-seq, Protein_Binding|ChIP-seq|RAD21, Protein_Binding|ChIP-seq|CTCF | 2,2 |
| rs17701543 | Motifs|PWM|CP2, Chromatin_Structure|DNase-seq, Protein_Binding|ChIP-seq|CTCF | 3,1 |
| rs12644498 | Motifs|PWM|REST, Chromatin_Structure|FAIRE, Chromatin_Structure|DNase-seq, Protein_Binding|ChIP-seq|USF1 | 3,1 |
| rs17375010 | Chromatin_Structure|DNase-seq, Protein_Binding|ChIP-seq|CTCF | 4 |
| rs10788555 | Chromatin_Structure|FAIRE, Chromatin_Structure|DNase-seq, Protein_Binding|ChIP-seq|STAT1, Protein_Binding|ChIP-seq|STAT3 | 4 |
| rs6887293 | Chromatin_Structure|FAIRE, Chromatin_Structure|DNase-seq, Protein_Binding|ChIP-seq|FOXA1, Protein_Binding|ChIP-seq|GATA3 | 4 |
| rs744346 | Chromatin_Structure|FAIRE, Chromatin_Structure|DNase-seq, Protein_Binding|ChIP-seq|ELK4 | 4 |
| rs4562278 | Chromatin_Structure|FAIRE, Chromatin_Structure|DNase-seq, Protein_Binding|ChIP-seq|HNF4A | 4 |
The SNPs found by the hybrid system were searched in RegulomeDB. Many of them were found to be likely to affect binding; this table lists the SNPs with a score of 4 or lower.