Irene Kouskoumvekaki, Gianni Panagiotou.
Abstract
Metabolomics is a rapidly evolving discipline that involves the systematic study of endogenous small molecules that characterize the metabolic pathways of biological systems. The study of metabolism at a global level has the potential to contribute significantly to biomedical research, clinical medical practice, and drug discovery. In this paper, we present the most up-to-date metabolite and metabolic pathway resources, and we summarize the statistical and machine-learning tools used for the analysis of data from clinical metabolomics. Through specific applications in cancer, diabetes, neurological and other diseases, we demonstrate how these tools can facilitate diagnosis and the identification of potential biomarkers for use in disease diagnosis. Additionally, we discuss the increasing importance of integrating metabolomics data in drug discovery. In a case study based on the Human Metabolome Database (HMDB) and the Chinese Natural Product Database (CNPD), we demonstrate the close relatedness of the two data sets of compounds, and we further illustrate how structural similarity with human metabolites could assist in the design of novel pharmaceuticals and the elucidation of the molecular mechanisms of medicinal plants.
Year: 2010 PMID: 20936122 PMCID: PMC2948926 DOI: 10.1155/2011/525497
Source DB: PubMed Journal: J Biomed Biotechnol ISSN: 1110-7243
Figure 1. Metabolomics holds the promise to deliver valuable information about biochemical pathways perturbed in disease and upon treatment, to monitor healthy people to detect early signs of disease, to diagnose disease or predict the risk of a disease, to subclassify disease, to make safer drugs by predicting the potential for adverse drug reactions, and to speed the discovery and development of novel drug molecules.
Machine-learning algorithms often used in metabolomics.
| Technique | Description |
|---|---|
| PCA | Principal Component Analysis (PCA) is a frequently used method for extracting the systematic variance in a data matrix. It helps to obtain an overview of dominant patterns and major trends in the data. The aim of PCA is to create a set of latent variables that is smaller than the set of original variables but still explains most of the variance of the original variables. In mathematical terms, PCA transforms a number of correlated variables into a smaller number of uncorrelated variables, the so-called principal components. |
| PLS | Partial Least Squares (PLS), also called Projection to Latent Structures, is a linear regression method that can be applied to establish a predictive model, even if the objects are highly correlated. The X variables (the predictors) are reduced to principal components, as are the Y variables (the dependents). The components of X are used to predict the scores on the Y components, and the predicted Y component scores are used to predict the actual values of the Y variables. In constructing the principal components of X, the PLS algorithm iteratively maximizes the strength of the relation of successive pairs of X and Y component scores by maximizing the covariance of each X-score with the Y variables. This strategy means that while the original X variables may be multicollinear, the X components used to predict Y will be orthogonal. Also, the X variables may have missing values, but there will be a computed score for every case on every X component. Finally, since only a few components (often two or three) will be used in predictions, PLS coefficients may be computed even when there may have been more original X variables than observations. |
| O-PLS | The Orthogonal Projections to Latent Structures (O-PLS) is a linear regression method similar to PLS. However, the interpretation of the models is improved because the structured noise is modeled separately from the variation common to X and Y. Therefore, the O-PLS loading and regression coefficients allow for a more realistic interpretation than PLS, which models the structured noise together with the correlated variation between X and Y. Furthermore, the orthogonal loading matrices provide the opportunity to interpret the structured noise. |
| PLS-DA | PLS-Discriminant Analysis (PLS-DA) is a frequently used classification method that is based on the PLS approach, in which the dependent variable is chosen to represent the class membership. PLS-DA makes it possible to accomplish a rotation of the projection to give latent variables that focus on class separation. The objective of PLS-DA is to find a model that separates classes of objects on the basis of their X-variables. This model is developed from the training set of objects of known class membership. The X-matrix consists of the multivariate characterization data of the objects. To encode a class identity, one uses as Y-data a matrix of dummy variables, which describe the class membership. A dummy variable is an artificial variable that assumes a discrete numerical value in the class description. The dummy matrix Y has G columns (for G classes) with ones and zeros, such that the entry in the gth column is one and the entries in other columns are zero for observations of class g. |
| ANN | Artificial Neural Networks (ANN) is a method, or more precisely a set of methods, based on a system of simple identical mathematical functions, that working in parallel yield for each multivariate input X a single or multiresponse answer. ANN methods can only be used if a comparably large set of multivariate data is available which enables ANN training by example and work best if they are dealing with nonlinear relationships between complex inputs and outputs. The main component of a neural network is the neuron. Each neuron has an activation threshold, and a series of weighted connections to other neurons. If the aggregate activation a neuron receives from the neurons connected to it exceeds its activation threshold, the neuron fires and relays its activation to the neurons to which it is connected. The weights associated with these connections can be modified by training the network to perform a certain task. This modification accounts for learning. ANN are often organized into layers, with each layer receiving input from one adjacent layer, and sending it to another. Layers are categorized as input layers, output layers, and hidden layers. The input layer is initialized to a certain set of values, and the computations performed by the hidden layers update the values of the output layers, which comprise the output of the whole network. Learning is accomplished by updating the weights between connected neurons. The most common method for training neural networks is back propagation, a statistical method for updating weights based on how far their output is from the desired output. To search for the optimal set of weights, various algorithms can be used. The most common is gradient descent, which is an optimization method that, at each step, searches in the direction that appears to come nearest to the goal. |
| SOM | The Self-Organizing Map (SOM), or Kohonen network, is an unsupervised neural network method which has both clustering and visualization properties. It can be used to classify a set of input vectors according to their similarity. The result of such a network is usually a two-dimensional map. Thus, SOM is a method for projecting objects from a high-dimensional data space to a two-dimensional space. This projection enables the input data to be partitioned into “similar” clusters while preserving their topology, that is, points that are close to one another in the multidimensional space are neighbors in the two-dimensional space as well. |
| SVM | Support Vector Machines (SVM) perform classification by constructing an N-dimensional hyperplane that optimally separates the data into two categories. An SVM model using a sigmoid kernel function is equivalent to a two-layer, perceptron neural network. The task of choosing the most suitable representation is known as feature selection. A set of features that describes one object is called a vector. The goal of SVM modeling is to find the optimal hyperplane that separates clusters of vectors in such a way that objects with one category of the target variable are on one side of the plane and objects with the other category are on the other side of the plane. The vectors near the hyperplane are the support vectors. |
| Genetic Algorithms | Genetic algorithms are nondeterministic stochastic search/optimization methods that apply the evolutionary concepts of selection, recombination or crossover, and mutation to data processing in order to solve a complex problem dynamically. Possible solutions to the problem are encoded as so-called artificial chromosomes, which are changed and adapted throughout the optimization process until an optimum solution is obtained. A set of chromosomes is called a population, and the creation of a population from a parent population is called a generation. In a first step, the original population is created. For each chromosome, the fitness is determined and a selection algorithm is applied to choose chromosomes for mating. These chromosomes are then subjected to the crossover and mutation operators, which finally yield a new generation of chromosomes. |
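The dummy-matrix encoding described in the PLS-DA row can be illustrated with a short sketch. It only builds the Y matrix of ones and zeros (one column per class); it does not fit a PLS model. The function name `dummy_matrix` and the class labels are hypothetical and are not taken from the paper.

```python
from typing import Hashable, List, Sequence, Tuple


def dummy_matrix(labels: Sequence[Hashable]) -> Tuple[List[Hashable], List[List[int]]]:
    """Encode class labels as an n x G matrix of ones and zeros.

    Returns the ordered list of classes (the column order) and the matrix Y,
    where Y[i][g] is 1 if observation i belongs to class g and 0 otherwise.
    """
    # Fix the column order by first appearance so the result is reproducible.
    classes = list(dict.fromkeys(labels))
    column = {cls: g for g, cls in enumerate(classes)}
    Y = [[0] * len(classes) for _ in labels]
    for i, label in enumerate(labels):
        Y[i][column[label]] = 1
    return classes, Y


# Hypothetical class labels for five samples (not data from the paper).
labels = ["control", "cancer", "diabetes", "cancer", "control"]
classes, Y = dummy_matrix(labels)
for label, row in zip(labels, Y):
    print(f"{label:>8}: {row}")
```

Each row of Y contains exactly one 1, in the column of the class the observation belongs to. A PLS regression of X on this Y would then give the PLS-DA model.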
Figure 2. A screenshot montage of the HMDB, SMPDB and T3DB databases.
Freely available databases on metabolic pathways and the metabolome.
| Metabolic Pathways Databases | Webpage |
|---|---|
| BRENDA, the enzyme database, has comprehensive information on enzymes and enzymatic reactions. It is one of several databases nested within the metabolic pathway database set of the SRS5 sequence retrieval system at EBI. | |
| Reactome is an online bioinformatics database of biology described in molecular terms. The largest set of entries refers to human biology, but Reactome covers a number of other organisms as well. It is an online encyclopedia of core human pathways: DNA replication, transcription, translation, the cell cycle, metabolism, and signaling cascades. | |
| KEGG Metabolic Pathways include graphical pathway maps for all known metabolic pathways from various organisms. Ortholog group tables, containing conserved, functional units in a molecular pathway or assembly, as well as comparative lists of genes for a given functional unit in different organisms, are also available. | |
| MetaCyc is a database of nonredundant, experimentally elucidated metabolic pathways. MetaCyc contains more than 1,400 pathways from more than 1,800 different organisms, and is curated from the scientific experimental literature. MetaCyc contains pathways involved in both primary and secondary metabolism, as well as associated compounds, enzymes, and genes. | |
| The WIT Metabolic Reconstruction project produces metabolic reconstructions for sequenced, or partially sequenced, genomes. It currently provides a set of over 25 such reconstructions in varying states of completion. Over 2900 pathway diagrams are available, associated with functional roles and linked to ORFs. | |
| The BioCarta website presents gene interactions in dynamic graphical models. The online maps depict molecular relationships, and the site catalogs and summarizes important resources providing information for more than 12,000 genes from multiple species. It contains both classical pathways and suggestions for new pathways. | |
| EcoCyc describes the genome and the biochemical machinery of E. coli. It provides a molecular and functional catalog of the E. coli cell to facilitate system-level understanding. Its Pathway/Genome Navigator user interface visualizes the layout of genes, of individual biochemical reactions, or of complete pathways. It also supports computational studies of the metabolism, such as pathway design, evolutionary studies, and simulations. A related metabolic database is Metalgen. | |
| BioSilico is a web-based database system that facilitates the search and analysis of metabolic pathways. Heterogeneous metabolic databases including LIGAND, ENZYME, EcoCyc and MetaCyc are integrated in a systematic way, thereby allowing users to efficiently retrieve the relevant information on enzymes, biochemical compounds and reactions. In addition, it provides well-designed view pages for more detailed summary information. | |
| EXPASY - Biochemical Pathways is a searchable database of metabolic pathways, enzymes, substrates and products. Based on a given search, it produces a graphic representation of the relevant pathway(s) within the context of an enormous metabolic map. Neighboring metabolic reactions can then be viewed through links to adjacent maps. | |
| BioPath is a database of biochemical pathways that provides access to metabolic transformations and cellular regulations derived from the Roche Applied Science “Biochemical Pathways” wall chart. BioPath provides access to biological transformations and regulations as described on the “Biochemical Pathways” chart. | |
| BioCyc is a collection of 505 Pathway/Genome Databases. Each database in the BioCyc collection describes the genome and metabolic pathways of a single organism. The BioCyc Web site contains many tools for navigating and analyzing these databases and for analyzing omics data, including a genome browser; display of individual metabolic pathways and of full metabolic maps; visual analysis of user-supplied omics datasets by painting onto the metabolic map, regulatory map, and genome map; and comparative analysis tools. | |
| Metabolome Databases | Webpage |
| The Biological Magnetic Resonance Data Bank (BMRB) focuses on quantitative data generated by spectroscopic investigations of biological macromolecules. It has links to search engines, such as PubChem, that connect to recent articles and new data. It also links to projects and other databases related to metabolomics and metabonomics. This database focuses on the NMR research aspect of metabolite discovery and the role of metabolites in metabolism. | |
| The Madison Metabolomics Consortium Database contains metabolites determined through NMR and MS. Its main focus is Arabidopsis thaliana, but it also covers many other species. The database also contains information on the presence of metabolites under several different physiological conditions, their structures in 2D and 3D, and links to related resources and other databases. | |
| The Human Metabolome Database is an extremely comprehensive, free electronic database that gives a detailed overview of human metabolites divided into chemical, clinical, and molecular biology/biochemistry data. | |
| KNApSAcK is a Java application that presents an interactive display of biochemical information that can be searched by organism or metabolite name. KNApSAcK focuses primarily on the origin and mass spectra of particular metabolites. | |
| The BiGG database is a metabolic reconstruction of human metabolism designed for systems biology simulation and metabolic flux balance modeling. It is a comprehensive literature-based genome-scale metabolic reconstruction that accounts for the functions of 1,496 ORFs, 2,004 proteins, 2,766 metabolites, and 3,311 metabolic and transport reactions. It was assembled from build 35 of the human genome. | |
| SetupX, developed by the Fiehn laboratory at UC Davis, is a web-based metabolomics LIMS. It is XML compatible and built around a relational database management core. It is particularly oriented towards the capture and display of GC-MS metabolomic data through its metabolic annotation database called BinBase. | |
| McGill-MD is a metabolome database containing metabolite mass spectra of organisms under abiotic/biotic stress or in homeostasis. Users are able to obtain a table containing the metabolome of an organism, or download mass spectra of all the metabolites entered in the database. | |
| SYSTOMONAS (SYSTems biology of pseudOMONAS) is a database for systems biology studies of Pseudomonas species. It contains extensive transcriptomic, proteomic and metabolomic data as well as metabolic reconstructions of this pathogen. Reconstruction of metabolic networks in SYSTOMONAS was achieved via comparative genomics. Broad data integration with the well-established databases BRENDA, KEGG and PRODORIC is also maintained. | |
| MassBank is a mass spectral database of experimentally acquired high resolution MS spectra of metabolites. Maintained and supported by the JST-BIRD project, it offers various query methods for standard spectra obtained from Keio University, RIKEN PSC, and other Japanese research institutions. It is officially sanctioned by the Mass Spectrometry Society of Japan. The database has very detailed MS data and excellent spectral/structure searching utilities. More than 13,000 spectra from 1900 different compounds are available. | |
| The Golm Metabolome Database provides public access to custom GC/MS libraries which are stored as Mass Spectral (MS) and Retention Time Index (RI) Libraries (MSRI). These libraries of mass spectral and retention time indices can be used with the NIST/AMDIS software to identify metabolites according to their spectral tags and RIs. The libraries are both searchable and downloadable and have been carefully collected under defined conditions on several types of GC/MS instruments (quadrupole and TOF). | |
| The METLIN Metabolite Database is a repository for mass spectral metabolite data. All metabolites are neutral or free acids. It is a collaborative effort between the Siuzdak and Abagyan groups and the Center for Mass Spectrometry at The Scripps Research Institute. METLIN is searchable by compound name, mass, formula or structure. It contains 15,000 structures, including more than 8000 di- and tripeptides. METLIN contains MS/MS, LC/MS and FTMS data that can be searched by peak lists, mass range, biological source and/or disease. | |
Figure 3. Comparison of the distribution of selected druglike molecular properties for natural compounds from CNPD and human metabolites from HMDB. Violin plots for (a) molecular weight, (b) hydrogen-bond donors, (c) hydrogen-bond acceptors, (d) number of rings and (e) number of rotatable bonds, along with a table of mean values and standard deviations. A violin plot is a combination of a box plot and a kernel density plot and offers a more detailed view of a dataset's variability than a box plot alone. The white marker indicates the median of the data and the black box the interquartile range (the difference between the third and first quartiles, which contains 50% of the distribution). The black lines extend to one and a half times the width of the box. Violin plots were made in R.
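The box-plot quantities named in the Figure 3 caption (median, interquartile range, and lines reaching 1.5 times the box width) can be computed as in the following minimal sketch. The paper's plots were made in R; this Python version is only an illustration. It assumes the common convention that whiskers end at the most extreme observations inside the 1.5 x IQR limits, and the molecular weights are made-up values, not data from CNPD or HMDB.

```python
import statistics


def box_summary(values):
    """Return the median, quartiles, IQR and 1.5 x IQR whisker ends."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    # Assumed convention: whiskers stop at the most extreme data points
    # that still lie inside the 1.5 x IQR limits.
    inside = [v for v in values if lower_limit <= v <= upper_limit]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
    }


# Hypothetical molecular weights in Da (illustrative only).
weights = [120.0, 150.0, 180.0, 210.0, 240.0, 270.0, 300.0, 330.0, 900.0]
summary = box_summary(weights)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this example the value 900.0 lies beyond the upper limit, so the upper whisker stops at 330.0. The kernel density outline of the violin would still show that outlying mass.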
Figure 4. Similarity network of 2-pyrocatechuic acid. Pink nodes indicate human metabolites from HMDB and green nodes indicate natural compounds from CNPD. Node labels denote the respective ID codes of the compounds. The nodes are linked when the two compounds have Tc ≥ 0.85. Due to the high number of pairs with similarity between 0.85 and 0.90, we included in the figure only connections of Tc ≥ 0.90 to allow better visualization of the network. The width and color of the edges are analogous to the value of Tc: Cyan: 0.90 ≤ Tc < 0.95, Blue: 0.95 ≤ Tc < 1.0, Black: Tc = 1. The two nodes in yellow denote 2-pyrocatechuic acid with HMDB ID and CAS registry number, respectively.
Figure 5. Similarity network of indole. Pink nodes indicate human metabolites from HMDB and green nodes indicate natural compounds from CNPD. Node labels denote the respective ID codes of the compounds. The nodes are linked when the two compounds have Tc ≥ 0.85. The width and color of the edges are analogous to the value of Tc: Cyan: 0.85 ≤ Tc < 0.95, Black: Tc = 1. The node in yellow denotes indole with HMDB ID and CAS registry number, respectively.
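The linking rule used in the similarity networks (an edge whenever Tc ≥ 0.85) can be sketched as follows. The sketch assumes that Tc is the Tanimoto coefficient computed on binary structural fingerprints, represented here as sets of on-bits. The captions do not state which fingerprint type was used, and the compound identifiers and bit sets below are invented for illustration.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    # Two empty fingerprints share nothing; treat them as dissimilar.
    return len(a & b) / union if union else 0.0


def similarity_edges(fingerprints, threshold=0.85):
    """Return (id1, id2, Tc) for every compound pair with Tc >= threshold."""
    edges = []
    for (id1, fp1), (id2, fp2) in combinations(fingerprints.items(), 2):
        tc = tanimoto(fp1, fp2)
        if tc >= threshold:
            edges.append((id1, id2, tc))
    return edges


# Hypothetical identifiers and on-bit sets (not real HMDB or CNPD entries).
fingerprints = {
    "metabolite_A": frozenset(range(0, 20)),
    "natural_B": frozenset(range(0, 20)),    # identical bits to A
    "natural_C": frozenset(range(0, 19)),    # shares 19 of 20 bits with A
    "natural_D": frozenset(range(10, 30)),   # shares 10 of 30 bits with A
}
edges = similarity_edges(fingerprints)
for id1, id2, tc in edges:
    print(f"{id1} -- {id2}: Tc = {tc:.2f}")
```

Only the pairs among A, B and C pass the threshold, and D stays unconnected. Edge width and color could then be assigned by binning Tc into the ranges given in the captions.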