
KODAMA: an R package for knowledge discovery and data mining.

Stefano Cacciatore¹, Leonardo Tenori², Claudio Luchinat³, Phillip R Bennett¹, David A MacIntyre¹.

Abstract

Summary: KODAMA, a novel learning algorithm for unsupervised feature extraction, is specifically designed for analysing noisy and high-dimensional datasets. Here we present an R package of the algorithm with additional functions that allow improved interpretation of high-dimensional data. The package requires no additional software and runs on all major platforms.
Availability and Implementation: KODAMA is freely available from the R archive CRAN (http://cran.r-project.org). The software is distributed under the GNU General Public License (version 3 or later).
Contact: s.cacciatore@imperial.ac.uk.
Supplementary information: Supplementary data are available at Bioinformatics online.


Year: 2017; PMID: 27993774; PMCID: PMC5408808; DOI: 10.1093/bioinformatics/btw705

Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937

1 Introduction

Knowledge Discovery and Data Mining is an interdisciplinary area focused on methodologies for extracting useful knowledge from complex data. With the explosive growth of high-throughput experimental data, data-driven solutions are increasingly crucial. We recently published KODAMA, a novel learning algorithm for unsupervised feature extraction, specifically designed for analysing noisy and high-dimensional datasets (Cacciatore ). This versatile method has been successfully applied in a wide range of disciplines, including genomics (Cacciatore ) and metabolomics (Priolo ), and has even been used in the analysis of hyper-spectral images (Cao ). Here, we present for the first time the KODAMA package developed for use in the R programming environment.
The core of the algorithm consists of two main steps. In the first step, each sample is randomly assigned to a different class. In the second step, the cross-validated accuracy is maximized by an iterative procedure that swaps the class labels (no a priori information is needed). The cross-validated accuracy can be calculated using any supervised classifier. In the current version of KODAMA, two classifiers are implemented: k-Nearest Neighbors (kNN) and Partial Least Squares-Discriminant Analysis (PLS-DA). Because the iterative procedure leads to suboptimal solutions, it must be repeated (100 times by default) to average out the effects of randomness. External class information can be integrated in KODAMA before performing the iterative procedure, thereby supporting a semi-supervised approach for highlighting otherwise hidden features of interest. After each run of the procedure, a classification vector with high cross-validated accuracy is obtained. KODAMA subsequently collects and processes these results by constructing a dissimilarity matrix to provide a holistic view of the data while maintaining their intrinsic structure.
Here, we show that KODAMA has a high capacity to detect different underlying relationships in experimental datasets, including patient phenotypes (Cacciatore ; Priolo ). We also introduce the possibility of using KODAMA to correlate extracted features describing phenotype with accompanying metadata.
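The two-step procedure described above can be illustrated with a minimal Python sketch. It is not the package's implementation: it assumes a leave-one-out 1-nearest-neighbour classifier in place of the cross-validated kNN or PLS-DA classifiers, a simple accept-if-not-worse update rule, and a dissimilarity defined as the fraction of runs in which two samples end in different classes. All function names (`maximize_accuracy`, `dissimilarity_matrix`, and the helpers) are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def loo_1nn_predictions(data, labels):
    """Predict each sample's label from its nearest other sample."""
    n = len(data)
    preds = []
    for i in range(n):
        nearest = min((j for j in range(n) if j != i),
                      key=lambda j: squared_distance(data[i], data[j]))
        preds.append(labels[nearest])
    return preds


def accuracy(labels, preds):
    """Fraction of samples whose predicted label matches the current label."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def maximize_accuracy(data, rng, steps=200):
    """One run: start with one class per sample, then relabel iteratively."""
    n = len(data)
    labels = list(range(n))  # step 1: every sample in a different class
    best = accuracy(labels, loo_1nn_predictions(data, labels))
    for _ in range(steps):   # step 2: iterative label changes
        preds = loo_1nn_predictions(data, labels)
        wrong = [i for i in range(n) if preds[i] != labels[i]]
        if not wrong:
            break
        candidate = labels[:]
        # Give a random subset of misclassified samples their predicted label.
        for i in rng.sample(wrong, rng.randint(1, len(wrong))):
            candidate[i] = preds[i]
        score = accuracy(candidate, loo_1nn_predictions(data, candidate))
        if score >= best:    # keep the change only if accuracy does not drop
            labels, best = candidate, score
    return labels, best


def dissimilarity_matrix(data, runs=20, seed=0):
    """Fraction of runs in which two samples end in different classes."""
    rng = random.Random(seed)
    n = len(data)
    together = [[0] * n for _ in range(n)]
    for _ in range(runs):
        labels, _ = maximize_accuracy(data, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    together[i][j] += 1
    return [[1 - together[i][j] / runs for j in range(n)] for i in range(n)]


# Two well-separated toy clusters of three points each (illustrative data).
example = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
           (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
D = dissimilarity_matrix(example)
```

In this toy setting a sample can only inherit the label of its nearest neighbour, so samples from the two clusters never share a class and their dissimilarity stays at 1, while within-cluster dissimilarities depend on the random relabeling in each run.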

2 Methods

The revised KODAMA package includes improvements in the implementation of the code and seven major new functions: pls.kodama, knn.kodama, pls.double.cv, knn.double.cv, k.test, loads and mcplot. The package is computationally efficient, with the workhorse functions written in C++ using Rcpp (Eddelbuettel, 2011) and RcppArmadillo (Eddelbuettel and Sanderson, 2014) and integrating the Approximate Nearest Neighbour Searching (ANN) library (Arya ). Functions coded in C++ include the kNN (knn.kodama) and PLS-DA (pls.kodama) classifiers and the iterative procedure of KODAMA.
The pls.double.cv and knn.double.cv functions perform double cross-validation procedures using PLS-DA or kNN as classifiers, respectively (Bertini ). The internal parameter (i.e. the number of components or k) is optimized by maximising the cross-validated coefficient of determination (Q²y) obtained by an inner cross-validation on the training sets.
The loads function can be used to extract the variable ranking. After each maximization of the cross-validated accuracy, the final label set is used to calculate either the loadings of PLS-DA or the logarithm of the P-value from the Kruskal-Wallis rank sum test. The output of the loads function is the average of these values for each variable of the dataset. The highest values indicate the most important variables.
The k.test function performs a statistical test to assess the association between the KODAMA output and any additional related parameters, such as clinical metadata. The coefficient of determination (R²) is used to assess the proportion of the variance in the dependent variable (KODAMA output) that is predictable from the independent variable (e.g. a clinical parameter) and can thus be used as a measure of the goodness of fit (Cameron ). A permutation test is performed by randomly sampling the values of the labels to estimate the significance of the observed association.
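The idea behind the association test can be sketched in Python as follows. This is an illustration under stated assumptions, not the k.test implementation: it covers only a single numeric parameter and a single output dimension (where R² equals the squared Pearson correlation), and the function names, the number of permutations, and the toy data are all hypothetical.

```python
import random


def r_squared(x, y):
    """R² of a simple linear regression of y on x (squared Pearson correlation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable explains no variance
    return sxy ** 2 / (sxx * syy)


def permutation_p_value(x, y, n_perm=999, seed=0):
    """Estimate how often shuffled values of x reach the observed R²."""
    rng = random.Random(seed)
    observed = r_squared(x, y)
    shuffled = list(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the pairing between x and y
        if r_squared(shuffled, y) >= observed:
            hits += 1
    # The +1 terms count the observed arrangement as one permutation.
    return observed, (hits + 1) / (n_perm + 1)


# Toy data: a hypothetical clinical parameter and one output coordinate.
parameter = [float(v) for v in range(10)]
coordinate = [2.0 * v + 1.0 for v in parameter]
observed_r2, p_value = permutation_p_value(parameter, coordinate)
```

Counting the observed arrangement among the permutations keeps the estimated P-value strictly above zero, so the smallest reportable value is 1/(n_perm + 1).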
The mcplot function is now included as a diagnostic tool for the iterative maximization of cross-validated accuracy. This function visualizes the accuracy values step by step through each separate iterative process. The Shannon entropy (Shannon, 1948) is now provided as an output of the KODAMA function and can be used as a measure of the unpredictability of information content to select the optimal classifier and its relative parameter.
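For reference, the Shannon entropy of a discrete distribution can be computed as in the short Python sketch below. Applying it to the frequencies of a class-label vector is an assumption made here for illustration; the text does not specify which distribution the package evaluates, and the function name is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy, in bits, of the label frequencies in a sequence."""
    n = len(labels)
    proportions = [count / n for count in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in proportions)


balanced = shannon_entropy(["a", "a", "b", "b"])   # two equally likely classes
single = shannon_entropy(["a", "a", "a", "a"])     # one class only
```

Entropy is zero when every sample carries the same label and grows as the labels spread more evenly over more classes.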

3 Results

To briefly demonstrate the performance of the KODAMA package, we used the MetRef dataset (included in this package), a collection of 873 nuclear magnetic resonance spectra of urine samples from a cohort of 22 healthy donors (11 male and 11 female). Figure 1 shows a comparison between KODAMA and Principal Component Analysis (PCA), an unsupervised method widely used in metabolic profiling (Aimetti ; MacIntyre ). As can be observed in Figure 1a, PCA provides a comparatively poor description of the underlying variation in metabolic profiles of urine collected from healthy individuals. In contrast, KODAMA (Fig. 1b) permits identification of the underlying patient-specific signature of the urine metabolome in an unsupervised fashion. This biologically relevant information would otherwise have been lost using PCA. The script for generating Figure 1a, b is included in the Supplementary material. Figure 1c highlights the spectral features most responsible for the separation of patient samples, obtained with the loads function. Further analysis using the k.test function shows a statistically significant association between the KODAMA output and clinical metadata, including donor (P < 0.0001) and gender (P < 0.0001).

Fig. 1. (a) PCA and (b) KODAMA of the MetRef dataset. Color coding indicates samples from the same donor. (c) Average NMR spectrum of the MetRef dataset. Color coding represents the output of the loads function. The spectral features with the highest contribution to the spatial separation observed in the KODAMA output are represented in yellow.

4 Summary and outlook

KODAMA represents a valuable tool for performing feature extraction on noisy and high-dimensional datasets. Additional functions facilitate the identification of key features associated with the generated output and are easily interpretable for the user. The k.test function permits the identification of significant associations between the KODAMA output and related information.

References

1. Serum metabolome analysis by 1H-NMR reveals differences between chronic lymphocytic leukaemia molecular subgroups.

Authors: D A MacIntyre; B Jiménez; E Jantus Lewintre; C Reinoso Martín; H Schäfer; C García Ballesteros; J Ramón Mayans; M Spraul; J García-Conde; A Pineda-Lucena
Journal: Leukemia Date: 2010-01-21 Impact factor: 11.528

2. Knowledge discovery by accuracy maximization.

Authors: Stefano Cacciatore; Claudio Luchinat; Leonardo Tenori
Journal: Proc Natl Acad Sci U S A Date: 2014-03-24 Impact factor: 11.205

3. Metabolomic NMR fingerprinting to identify and predict survival of patients with metastatic colorectal cancer.

Authors: Ivano Bertini; Stefano Cacciatore; Benny V Jensen; Jakob V Schou; Julia S Johansen; Mogens Kruhøffer; Claudio Luchinat; Dorte L Nielsen; Paola Turano
Journal: Cancer Res Date: 2011-11-11 Impact factor: 12.701

4. AKT1 and MYC induce distinctive metabolic fingerprints in human prostate cancer.

Authors: Carmen Priolo; Saumyadipta Pyne; Joshua Rose; Erzsébet Ravasz Regan; Giorgia Zadra; Cornelia Photopoulos; Stefano Cacciatore; Denise Schultz; Natalia Scaglia; Jonathan McDunn; Angelo M De Marzo; Massimo Loda
Journal: Cancer Res Date: 2014-10-16 Impact factor: 12.701
