Joaquín Pérez, Emmanuel Iturbide, Víctor Olivares, Miguel Hidalgo, Alicia Martínez, Nelva Almanza.
Abstract
The data preparation phase is known to be the most time-consuming part of the data mining process, taking up as much as 50% to 70% of total project time. Current data mining methodologies are general-purpose, and one of their limitations is that they give no guidance on which particular tasks to carry out in a specific domain. This paper presents a new data preparation methodology oriented to the epidemiological domain, in which we identify two sets of tasks: General Data Preparation and Specific Data Preparation. For both sets, the Cross-Industry Standard Process for Data Mining (CRISP-DM) is adopted as a guideline. The main contribution of our methodology is fourteen specialized tasks for this domain. To validate the proposed methodology, we developed a data mining system and applied the entire process to real mortality databases. The results were encouraging: using the methodology reduced the time spent on some of the time-consuming tasks, and the data mining system revealed previously unknown and potentially useful patterns for the public health services in Mexico.
Keywords: Censuses databases; Data preparation methodology; Epidemiological data mining; Mortality databases
Year: 2015 PMID: 26385549 PMCID: PMC4575356 DOI: 10.1007/s10916-015-0312-5
Source DB: PubMed Journal: J Med Syst ISSN: 0148-5598 Impact factor: 4.460
Fig. 1 Tasks of the data preparation methodology
Description of the population-based databases
| Database | Attributes | Records | Description |
|---|---|---|---|
| Mortality | 38 | 437,667 | Deaths occurring in 2000. SINAIS-INEGI [ |
| Geographic | 7 | 2,475 | Geographical position of the districts of Mexico. SIMBAD-INEGI [ |
| Population | 3 | 2,475 | Total population by district in Mexico. INEGI [ |
| ICD | 24 | 2,049 | International Classification of Diseases. CEMECE [ |
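As a rough illustration of how the mortality and population databases described above could be combined, the following sketch counts deaths per district for one cause and turns the counts into rates. It is not the authors' system: the field names `district_id` and `icd_code`, the toy records, and the choice of a rate per 100,000 inhabitants are all assumptions made for this example.

```python
from collections import Counter

# Assumption: rates are expressed per 100,000 inhabitants (not stated in the source).
RATE_BASE = 100_000


def incidence_by_district(deaths, cause_prefix):
    """Count deaths per district whose ICD code starts with cause_prefix."""
    return Counter(
        record["district_id"]
        for record in deaths
        if record["icd_code"].startswith(cause_prefix)
    )


def rate_by_district(incidence, population):
    """Deaths per RATE_BASE inhabitants, skipping districts without a population."""
    return {
        district: count * RATE_BASE / population[district]
        for district, count in incidence.items()
        if population.get(district)
    }


# Toy records invented for illustration; field names are hypothetical.
deaths = [
    {"district_id": "A", "icd_code": "C169"},
    {"district_id": "A", "icd_code": "C160"},
    {"district_id": "A", "icd_code": "I219"},  # different cause, not counted
    {"district_id": "B", "icd_code": "C169"},
]
population = {"A": 50_000, "B": 20_000}

incidence = incidence_by_district(deaths, "C16")
rates = rate_by_district(incidence, population)
```

Matching on a code prefix is only one way to select a cause; a real pipeline would more likely resolve causes through the ICD catalogue.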
Fig. 2 Conceptual diagram of the data preparation subsystem
Cluster of interest for the cause of death C16
| Name of the District | Incidence | Rate |
|---|---|---|
| Guaymas | 15 | 11.52 |
| Hermosillo | 48 | 7.87 |
| La Paz | 14 | 7.11 |
| Los Cabos | 7 | 6.64 |
Cluster of interest for the cause of death C16
| Name of the District | Incidence | Rate |
|---|---|---|
| Minatitlan | 14 | 9.15 |
| Comalcalco | 14 | 8.50 |
| Tapachula | 21 | 7.73 |
| San Cristobal | 9 | 6.80 |
| Macuspana | 9 | 6.72 |
| Tuxtla Gutierrez | 28 | 6.45 |
Fig. 3 Clusters of interest for the cause of death C16
Dataset example
| Latitude | Longitude | Rate |
|---|---|---|
| 19.39073 | −99.14361 | 7.0031 |
| 18.92133 | −99.23468 | 4.6957 |
| 19.03247 | −98.19576 | 2.6367 |
| ... | ... | ... |
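A dataset of this shape can be produced by joining per-district coordinates with per-district rates. The sketch below shows one possible way to do that, assuming hypothetical district keys and helper names (`build_dataset`, `to_csv`); the single coordinate pair and rate are taken from the first row of the example above, and the second district is invented to show that entries without coordinates are dropped.

```python
import csv
import io


def build_dataset(coordinates, rates):
    """Join district coordinates with rates into (latitude, longitude, rate) rows."""
    rows = []
    for district, rate in rates.items():
        if district in coordinates:  # districts without a known position are skipped
            latitude, longitude = coordinates[district]
            rows.append((latitude, longitude, round(rate, 4)))
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text with the column names used in the example."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Latitude", "Longitude", "Rate"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical district keys; "d2" has no coordinates and is left out.
coordinates = {"d1": (19.39073, -99.14361)}
rates = {"d1": 7.0031, "d2": 3.5}

rows = build_dataset(coordinates, rates)
csv_text = to_csv(rows)
```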
Time comparison for the data preparation tasks for cause C16
| Task | Manual (min) | Automated (min) | Reduction (%) |
|---|---|---|---|
| Mortality Incidence | 33.53 | 0.058 | 99.83 |
| Mortality Rate | 5.16 | 0.330 | 99.61 |
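One plausible reading of the Reduction (%) column is the relative time saved, (manual − automated) / manual × 100. The formula is an assumption, not something the source states, and the function name `time_reduction_pct` is hypothetical. The sketch below checks it only against the Mortality Incidence row.

```python
def time_reduction_pct(manual_min, automated_min):
    """Relative time saved by automation, as a percentage of the manual time."""
    if manual_min <= 0:
        raise ValueError("manual time must be positive")
    return (manual_min - automated_min) / manual_min * 100


# Mortality Incidence row: 33.53 min manual, 0.058 min automated.
incidence_reduction = round(time_reduction_pct(33.53, 0.058), 2)
```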