Anis Karimpour-Fard, L Elaine Epperson, Lawrence E Hunter.
Abstract
Proteomics is an expanding area of research into biological systems, with significance for biomedical and therapeutic applications ranging from understanding the molecular basis of diseases to testing new treatments, studying the toxicity of drugs, and making biotechnological improvements in agriculture. Progress in proteomic technologies and growing interest have resulted in a rapid accumulation of proteomic data and, consequently, a great number of tools have become available. In this paper, we review well-known, ready-to-use tools for classification, clustering and validation, interpretation, and generation of biological information from experimental data. We suggest some rules of thumb for choosing the most suitable learning method for a particular dataset, conclude with pathway and functional analysis, and then provide information about submitting final results to a repository.
Year: 2015 PMID: 26510531 PMCID: PMC4624643 DOI: 10.1186/s40246-015-0050-2
Source DB: PubMed Journal: Hum Genomics ISSN: 1473-9542 Impact factor: 4.639
Summary and comparison of classification and clustering methods
| Method type | Classification | Classification | Classification | Classification | Classification | Clustering | Clustering |
|---|---|---|---|---|---|---|---|
| Method | PCA | ICA | RF | PLS | SVM | K-means | Hierarchical |
| What does it do? | Separates features into groups based on commonality and reports the weight of each component’s contribution to the separation | Separates features into groups by eliminating correlation and reports the weight of each component’s contribution to the separation | Separates features into groups based on commonality; identifies important predictors | Separates features into groups based on maximal covariation and reports the contribution of each variable | Uses a user-specified kernel function to quantify the similarity between any pair of instances and create a classifier | Separates features into clusters of similar expression patterns | Clusters treatment groups, features, or samples into a dendrogram |
| By what mechanism? | Orthogonal transformation; transfers a set of correlated variables into a new set of uncorrelated variables | Nonlinear, non-orthogonal transformation; standardizes each variable to a unit variance and zero mean | Uses an ensemble classifier that consists of many decision trees | Multivariate regression | Finds a decision boundary maximizing the distance to nearby positive and negative examples | Compares and groups magnitudes of changes in the means into K clusters where K is defined by the user | Compares all samples using either agglomerative or divisive algorithms with distance and linkage functions |
| Strengths | Unsupervised, nonparametric, useful for reducing dimensions before using supervision | Works well when other approaches do not because data are not normally distributed | Robust to outliers and noise; gives useful internal estimates of error; resistant to overtraining | Diverse experiments that have the same features are made comparable; variables can outnumber features | Robust to outliers, gives useful internal estimates of error, can exploit knowledge of the domain if using appropriate kernel functions | Easily visualized and intuitive; greatly reduces complexity; performs well when distance information between data points is important to clustering | Unsupervised; easily visualized and intuitive |
| Weaknesses | Number of features must exceed number of treatment groups | Features are assumed to be independent when they actually may be dependent | Does not allow missing data (requires imputation to replace missing values) | Fails to deal with data containing outliers | Selection of an inappropriate kernel yields poor results | Sensitive to initial conditions and specified number of clusters (K) | Does not provide feature contributions; not iterative and therefore sensitive to cluster distance measures and noise/outliers |
| More information | | | Performance depends on number of trees and varies among experiments | Supervised; requires training and testing; groups pre-defined | Supervised; requires training and testing; many good kernel functions have been described, e.g., based on structural alignment | Tools are available to determine the optimal cluster count (K) | User does not define the number of clusters |
| Sample size/data characteristics | Unlimited sample size; data normally distributed | Unlimited sample size; data non-normally distributed | Performs well on small sample size and is resistant to over-fitting | Unlimited sample size; sensitive to outliers | Performs well on small sample size and is resistant to over-fitting | Performs best with a limited dataset, i.e., ~20 to 300 features | Performs best with a limited dataset, i.e., ~20 to 300 features or samples |
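To make the K-means column more concrete, the following is a minimal sketch of the standard assign-and-update loop (Lloyd's algorithm) in plain Python. It is an illustration only, not the implementation used by any tool discussed in the paper. The function name `kmeans`, the `seed` parameter, and the toy abundance profiles are all assumptions introduced here.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=100, seed=0):
    """Cluster `points` into k groups; returns (centroids, labels).

    k is chosen by the user, and the starting centroids are drawn at
    random, which is why results can depend on both k and the seed.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # assignments stable, so stop early
            break
        labels = new_labels
    return centroids, labels

# Hypothetical abundance profiles of six proteins across three conditions.
profiles = [
    (1.0, 2.0, 3.0), (1.1, 2.1, 2.9), (0.9, 1.9, 3.2),  # rising pattern
    (3.0, 2.0, 1.0), (3.1, 1.9, 0.8), (2.9, 2.2, 1.1),  # falling pattern
]
centroids, labels = kmeans(profiles, k=2)
print(labels)
```

Rerunning the sketch with a different `seed` or `k` on noisier data is a simple way to observe the sensitivity to initial conditions and to the chosen number of clusters noted in the table.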
Summary of functional and network tools
| Name | Description | References | Function |
|---|---|---|---|
| KEGG | Kyoto Encyclopedia of Genes and Genomes | Kanehisa and Goto (2000) | Pathway |
| DAVID | The Database for Annotation, Visualization and Integrated Discovery | Dennis et al. (2003) | Pathway and functional annotation using GO |
| PID | Pathway Interaction Database | Schaefer et al. (2009) | Pathway interaction |
| IPA | Ingenuity Pathway Analysis | | Pathway and functional annotation |
| Cytoscape | An open source platform for complex network analysis and visualization | Shannon et al. (2003) | Network visualization |
| HAPPI | Human Annotated and Predicted Protein Interaction Database | Chen et al. (2009) | Protein interaction |
| GSEA | Gene Set Enrichment Analysis | Subramanian et al. (2005) | Pathway analysis and functional annotation |
| Reactome | Curated database of pathways and reactions (pathway steps) | Matthews et al. (2009) | Pathway |
| BioCarta | Pathway database | Nishimura (2001) | Pathway |
| HPD | Integrated Human Pathway Database | Chowbina et al. (2009) | Pathway |
| PAGED | Pathway and Gene Enrichment Database | Huang et al. (2012) | Pathway, functional annotation |
| HPRDB | Human Protein Reference Database | Keshava Prasad et al. (2009) | Annotation |
| DrugBank | Drug Bank | | Combines drug data with drug target |
| CPDB | Consensus Path DB | Kamburov et al. (2013) | Interaction networks (protein-protein, genetic, metabolic, signaling, gene regulatory, and drug-target) |
| BINGO | Biological Network Gene Ontology Tool | Maere et al. (2005) | Biological network gene ontology |
| GATHER | Gene Annotation Tool to Help Explain Relationships | Chang and Nevins (2006) | Gene annotation tool |
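Several of the tools listed above report pathway or functional enrichment. As a rough illustration of the over-representation idea behind such analyses, the sketch below computes a one-sided hypergeometric tail probability for the overlap between a list of proteins of interest and a pathway. This is a generic textbook test and is not claimed to be the method of any specific tool in the table (GSEA, for example, uses a rank-based statistic instead). The function name `enrichment_p_value`, its parameters, and the example counts are hypothetical.

```python
from math import comb

def enrichment_p_value(n_background, n_pathway, n_hits, n_overlap):
    """Probability of seeing at least `n_overlap` pathway members among
    `n_hits` proteins drawn without replacement from a background of
    `n_background` proteins, `n_pathway` of which belong to the pathway.
    """
    total = comb(n_background, n_hits)  # all possible hit lists
    upper = min(n_pathway, n_hits)      # largest possible overlap
    tail = sum(
        comb(n_pathway, i) * comb(n_background - n_pathway, n_hits - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: 2000 quantified proteins, a 40-member pathway,
# 100 differentially abundant proteins, 8 of them in the pathway.
p = enrichment_p_value(2000, 40, 100, 8)
print(f"enrichment p-value: {p:.3g}")
```

In practice, a value like this would be computed for every pathway tested and then corrected for multiple testing before interpretation.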