Vincenzo Lagani, Alberto Montesanto, Fausta Di Cianni, Victor Moreno, Stefano Landi, Domenico Conforti, Giuseppina Rose, Giuseppe Passarino.
Abstract
BACKGROUND: Recent technological advances in DNA sequencing and genotyping have led to the accumulation of a remarkable quantity of data on genetic polymorphisms. However, the development of new statistical and computational tools for processing these data effectively has not kept pace. In particular, the Machine Learning literature contains relatively few papers focused on the development and application of data mining methods for the analysis of genetic variability. Moreover, these papers apply to genetic data procedures that were developed for a different kind of analysis and do not take into account the peculiarities of population genetics. The aim of our study was to define a new similarity measure, specifically conceived for measuring the similarity between the genetic profiles of two groups of subjects (i.e., cases and controls), taking into account that genetic profiles are usually distributed in a population group according to the Hardy-Weinberg equilibrium.
Year: 2009 PMID: 19534750 PMCID: PMC2697648 DOI: 10.1186/1471-2105-10-S6-S24
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Table 1. Example of a genetic dataset with two SNPs
| ID | SNP1 | SNP2 | Class |
| SAMPLE1 | A1A1 | A1A1 | +1 |
| SAMPLE2 | A2A2 | A1A2 | -1 |
| SAMPLE3 | A1A2 | A2A2 | -1 |
Numeric encoding of Table 1
| ID | SNP11 | SNP12 | SNP13 | SNP21 | SNP22 | SNP23 | Class |
| SAMPLE1 | 1 | 0 | 0 | 1 | 0 | 0 | +1 |
| SAMPLE2 | 0 | 0 | 1 | 0 | 1 | 0 | -1 |
| SAMPLE3 | 0 | 1 | 0 | 0 | 0 | 1 | -1 |
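The numeric encoding shown in the two tables above can be reproduced with a short sketch. It assumes that each SNP is expanded into three indicator columns in the order A1A1, A1A2, A2A2, which is the ordering implied by the example rows; the function names `encode_genotype` and `encode_sample` are hypothetical and introduced here only for illustration.

```python
# Illustrative sketch: one-hot encoding of SNP genotypes.
# Assumption: the three indicator columns per SNP follow the order
# A1A1, A1A2, A2A2, as implied by the example tables.
GENOTYPES = ("A1A1", "A1A2", "A2A2")


def encode_genotype(genotype):
    """Return three 0/1 indicators for a single genotype."""
    if genotype not in GENOTYPES:
        raise ValueError(f"unknown genotype: {genotype!r}")
    return [1 if genotype == g else 0 for g in GENOTYPES]


def encode_sample(genotypes):
    """Concatenate the indicators of every SNP of one sample."""
    row = []
    for genotype in genotypes:
        row.extend(encode_genotype(genotype))
    return row


# Genotypes of the three samples in the example dataset (SNP1, SNP2).
table1 = {
    "SAMPLE1": ["A1A1", "A1A1"],
    "SAMPLE2": ["A2A2", "A1A2"],
    "SAMPLE3": ["A1A2", "A2A2"],
}

encoded = {sample_id: encode_sample(g) for sample_id, g in table1.items()}

for sample_id, row in encoded.items():
    print(sample_id, row)
```

With this layout every SNP contributes exactly one non-zero entry per sample, so each encoded row sums to the number of SNPs.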
Genotypic frequencies as determined by the Hardy-Weinberg law given two alleles A1 (with a frequency of p) and A2 (with a frequency of q).
| Genotype | Frequency |
| A1A1 | p² |
| A1A2 | 2pq |
| A2A2 | q² |
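As a small worked example of the Hardy-Weinberg law, the sketch below computes the expected genotype frequencies from the frequency p of allele A1. It assumes a biallelic locus, so that q = 1 - p; the function name `hardy_weinberg_frequencies` and the value p = 0.3 are hypothetical and chosen only for illustration.

```python
# Illustrative sketch: expected genotype frequencies under
# Hardy-Weinberg equilibrium for a biallelic locus.
# Assumption: only two alleles exist, so q = 1 - p.
def hardy_weinberg_frequencies(p):
    """Map each genotype to its expected frequency given p = freq(A1)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency p must lie in [0, 1]")
    q = 1.0 - p
    return {
        "A1A1": p * p,      # p squared
        "A1A2": 2 * p * q,  # 2pq
        "A2A2": q * q,      # q squared
    }


# Hypothetical allele frequency, used only as an example.
freqs = hardy_weinberg_frequencies(0.3)

for genotype, frequency in freqs.items():
    print(genotype, round(frequency, 4))
```

Because p + q = 1, the three frequencies sum to (p + q)² = 1, which serves as a quick sanity check on any computed values.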
Figure 1. Results obtained on the simulated datasets. Each set of figures refers to a different rule for the generation of the r parameter (r = 1.5 (a), r = 2.0 (b), r = 2.5 (c), r = 3.0 (d), r in [1.5–3.0] (e), r in [1.5–3.0] with "low frequency uninformative" SNPs (f)); the horizontal axis reports the frequency p of the allele A1 of the relevant SNP, while the vertical axis reports the best cross-validated AUC values obtained with the HW (black) and linear (red) kernels.
Figure 2. Mean AUC values computed over the five repeated cross-validations, obtained using an SVM classifier with either the HW kernel (HWk, black) or the linear kernel (red) on the SCC dataset. The results refer to a threshold T for the univariate feature selection procedure of 0.20 (a), 0.15 (b) and 0.10 (c), respectively.