M Sebban (1), I Mokrousov, N Rastogi, C Sola. (1) French West Indies and Guiana University, TRIVIA, Department of Mathematics and Computer Science, Campus Fouillole, 97159 Pointe-à-Pitre Cedex, Guadeloupe.
Abstract
MOTIVATION: The Direct Repeat (DR) locus of Mycobacterium tuberculosis is a suitable model for studying (i) the molecular epidemiology and (ii) the evolutionary genetics of tuberculosis. This is achieved with a DNA analysis (genotyping) technique called spacer oligonucleotide typing (spoligotyping). In this paper, we investigated data analysis methods for discovering intelligible knowledge rules from spoligotyping data, an approach that had not yet been applied to this representation. The C4.5 induction algorithm was applied to build decision trees, from which knowledge rules were produced. Finally, a Prototype Selection (PS) procedure was applied to eliminate noisy data. This simplified the decision rules and reduced the number of spacers that must be tested to solve classification tasks. In the second part of this paper, the contribution of 25 additional spacers and the knowledge rules inferred from them were studied from a machine learning point of view. From a statistical point of view, the correlations between spacers were analyzed; both the negative and the positive correlations may be related to potential structural constraints within the DR locus that may shape its evolution directly or indirectly.
RESULTS: By generating knowledge rules induced from decision trees, it was shown that expert knowledge can not only be modeled but also improved and simplified to solve automatic classification tasks on unknown patterns. A practical consequence of this study may be a simplification of the spoligotyping technique, resulting in fewer experimental constraints and an increase in the number of samples processed.
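To illustrate how a tree inducer such as C4.5 can reduce the number of spacers that need testing, the following is a minimal sketch of its split-selection criterion (gain ratio) applied to binary spacer patterns. It is not the authors' implementation: the four-spacer patterns, the family labels "A" and "B", and the function names are hypothetical, and a full C4.5 run would also recurse on each branch and prune the resulting tree.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(patterns, labels, spacer):
    """Gain ratio obtained by splitting on one binary spacer (1 = present, 0 = absent)."""
    total = len(labels)
    groups = {}
    for pattern, label in zip(patterns, labels):
        groups.setdefault(pattern[spacer], []).append(label)
    # Expected entropy of the class labels after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Entropy of the split itself; zero when the spacer never varies.
    split_info = entropy([pattern[spacer] for pattern in patterns])
    return info_gain / split_info if split_info > 0 else 0.0


def best_spacer(patterns, labels):
    """Index of the spacer with the highest gain ratio (first one on ties)."""
    return max(range(len(patterns[0])), key=lambda s: gain_ratio(patterns, labels, s))


# Hypothetical toy data: four spacers per isolate, invented family labels.
patterns = [
    (1, 1, 0, 1),
    (1, 0, 0, 1),
    (0, 1, 1, 1),
    (0, 0, 1, 1),
]
labels = ["A", "A", "B", "B"]

for s in range(4):
    print(f"spacer {s}: gain ratio = {gain_ratio(patterns, labels, s):.2f}")
print("spacer chosen for the root test:", best_spacer(patterns, labels))
```

In this toy set, a spacer that is present in every isolate carries no information and would never be tested, while two spacers that always vary together (here, in opposite directions) are redundant for classification, so testing one of them suffices.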