Pierre Poulain1, Jean-Michel Camadro2. 1. Université Paris Cité, CNRS, Institut Jacques Monod, Paris, F-75013, France. pierre.poulain@u-paris.fr. 2. Université Paris Cité, CNRS, Institut Jacques Monod, Paris, F-75013, France.
Abstract
OBJECTIVE: Data clustering is a common exploration step in the omics era, notably in genomics and proteomics where many genes or proteins can be quantified from one or more experiments. Bayesian clustering is a powerful unsupervised algorithm that can classify several thousands of genes or proteins. AutoClass C, its original implementation, handles missing data and automatically determines the best number of clusters, but is not user-friendly.
RESULTS: We developed an online tool called AutoClassWeb, which provides a simple, easy-to-use web interface for Bayesian clustering with AutoClass. Input data are entered as TSV files and quality controlled. Results are provided in formats that ease further analyses with spreadsheet programs or with programming languages, such as Python or R. AutoClassWeb is implemented in Python and is published under the 3-Clause BSD license. The source code is available at https://github.com/pierrepo/autoclassweb along with detailed documentation.
In biology, high-throughput technologies (notably in genomics and proteomics) enable identification and quantification of several thousands of genes or proteins in a single experiment. To analyze such a large amount of data, from one or more experiments, clustering algorithms, a family of unsupervised machine-learning methods, are widely used to group genes or proteins with similar patterns. Bayesian clustering is one such algorithm, and one of its implementations in the C programming language (AutoClass C) was developed in 1996 at the NASA Ames Research Center [1, 2]. The idea behind Bayesian clustering and the AutoClass algorithm is to find the classification that fits the data with the highest probability. The AutoClass algorithm provides two additional features of interest: it handles missing data and automatically determines the best number of clusters.
AutoClass C has been used in a wide variety of applications, from clustering cells of the prefrontal cortex in rats and mice [3] to detecting body patterns in the common cuttlefish [4] (see also references [5] and [6] for a detailed list of applications). However, AutoClass C, originally developed by physicists, is not user-friendly: the program is accessible only through the command line, only 32-bit binaries are available, and result files are difficult to parse for subsequent analysis.
More than 10 years ago, Achcar et al. published AutoClass@IJM [5], a web interface for AutoClass C. This web service drastically simplified the use of AutoClass C and widened its adoption, especially in biology [3, 7–10]. Unfortunately, this tool is no longer maintained, and its source code is not publicly available.
To continue to offer this powerful unsupervised Bayesian clustering method to the community, we first developed AutoClassWrapper [6], a Python library that wraps the functionality of AutoClass C. We now go further by providing AutoClassWeb, a new easy-to-use open-source web interface for AutoClass C. AutoClassWeb requires no programming skills, performs a quality control on the input data, and ships results in formats that support further analysis.
Main text
Implementation
AutoClassWeb uses AutoClassWrapper [6], a Python wrapper for the AutoClass C program. This wrapper facilitates the preparation and quality control of data, runs the actual classification, and finally prepares results in file formats that allow further analysis.
AutoClassWeb is written in Python [11] and uses the Flask library to build the web interface users interact with. For better reproducibility and sustainability, AutoClassWeb is packaged in a Docker image stored in the BioContainers [12] registry.
The web service itself has been designed to be user-friendly and straightforward. There is no user authentication and, by default, results are kept for 30 days before being deleted. A comprehensive help page provides the guidance the user might need.
Using Docker technology, AutoClassWeb can be quickly deployed on a local machine or on a public web server. To this end and to reduce the installation burden, we provide two companion GitHub repositories with detailed instructions, for local (https://github.com/pierrepo/autoclassweb-app) and server installation (https://github.com/pierrepo/autoclassweb-server).
Data submission
The input data must be formatted as tab-separated values (TSV) files. The first line is a header containing the column names, which must be unique. The first column contains the names of the objects studied (e.g. protein or gene identifiers). Missing data is supported and should be coded with an empty value (i.e. nothing).
AutoClass supports three categories of data:
- real location: negative and positive values, such as position, elevation or microarray log ratio.
- real scalar: singly bounded real values, typically bounded below at zero (e.g. length, weight, age).
- discrete: qualitative data, for instance color, phenotype or name.
If the initial input dataset contains several data types (real scalar, real location, discrete), it is recommended to split the initial dataset into several datasets of homogeneous type and submit them in the input form (Figure 1 (A)).
Fig. 1
Views of AutoClassWeb. A Main page with the form to input data according to its type. B View after data input and quality check. C Status page
For real location and real scalar data types, the user can optionally specify an absolute and a relative error, respectively. By default, the absolute error for real location data is 0.01 and the relative error for real scalar data is 0.01 (1%). However, it is worth noting that the AutoClass C algorithm is not very sensitive to the error parameter [5].
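The formatting rules above can be checked locally before a file is uploaded. The following is a minimal sketch, not the quality control that AutoClassWeb itself performs: the function name `check_tsv` and the sample data are hypothetical, and the check assumes a numeric dataset (real location or real scalar), so it would wrongly flag the qualitative values of a discrete dataset.

```python
import csv
import io


def check_tsv(text):
    """Return a list of problems found in TSV text (empty list: none found).

    Hypothetical helper, assuming a numeric (real location / real scalar) dataset.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        return ["file is empty"]
    header, body = rows[0], rows[1:]
    problems = []
    # Column names in the header must be unique.
    duplicates = sorted({name for name in header if header.count(name) > 1})
    if duplicates:
        problems.append(f"duplicated column names: {', '.join(duplicates)}")
    for line_number, row in enumerate(body, start=2):
        # Every data line needs exactly one field per header column.
        if len(row) != len(header):
            problems.append(
                f"line {line_number}: expected {len(header)} fields, found {len(row)}"
            )
            continue
        # The first column holds the object name; other fields are numeric or empty.
        for name, value in zip(header[1:], row[1:]):
            if value == "":
                continue  # an empty field encodes a missing value
            try:
                float(value)
            except ValueError:
                problems.append(
                    f"line {line_number}: non-numeric value {value!r} in column {name!r}"
                )
    return problems


# Made-up example: geneB has a missing value in column cond1.
sample = "gene\tcond1\tcond2\ngeneA\t0.52\t-1.30\ngeneB\t\t0.75\n"
print(check_tsv(sample))
```

On this sample the function reports no problem, because the empty field of geneB is treated as a missing value rather than as an error.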
Clustering
Upon submission, input data is quality checked and formatted to be usable by AutoClass C. The web interface provides a unique job name, a link to the status page and a quick summary of input data (toggled with the text Hide/show logs), as illustrated in Figure 1 (B).
The status page lists running, failed and completed runs with their respective identifier (Job name), creation date, status and running time (Figure 1 (C)).
Results
Once a job is completed, a green button allows the user to download the clustering results. Results are bundled in a zip archive with the following files (where xxx stands for the unique identifier of the job):
- xxx_autoclass_out.cdt and xxx_autoclass_out_withproba.cdt can be viewed with Java TreeView [13], a versatile viewer initially developed for microarray data. The file xxx_autoclass_out_withproba.cdt contains the probability for each object (gene or protein) to belong to each class.
- xxx_autoclass_out_stats.tsv contains means and standard deviations of numeric columns (real scalar and real location data types) for each class.
- xxx_autoclass_out_dendrogram.png is a dendrogram plot that visualizes the distance between all classes.
- xxx_autoclass_out.tsv contains all the data with the class assignment and membership probabilities for all classes. This file is in the TSV format and can be easily parsed with spreadsheet programs such as Microsoft Excel or LibreOffice Calc, or programming languages such as R or Python.
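As an illustration of parsing the TSV result file with Python, the sketch below counts the objects assigned to each class and lists the objects assigned with high probability. It uses only the standard library and a made-up excerpt. The column names `main-class` and `main-class-proba`, the identifier column `gene` and the 0.8 threshold are assumptions made for this example; the actual header of xxx_autoclass_out.tsv should be checked before reusing it.

```python
import csv
import io
from collections import Counter

# Made-up excerpt standing in for xxx_autoclass_out.tsv.
# Column names are assumptions; check the header of the real file.
results_text = (
    "gene\tcond1\tcond2\tmain-class\tmain-class-proba\n"
    "geneA\t0.52\t-1.30\t1\t0.98\n"
    "geneB\t\t0.75\t2\t0.61\n"
    "geneC\t0.49\t-1.10\t1\t0.87\n"
)


def class_sizes(text, class_column="main-class"):
    """Count how many objects were assigned to each class."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return Counter(row[class_column] for row in reader)


def confident_objects(text, id_column, proba_column="main-class-proba", threshold=0.8):
    """List objects whose probability for their assigned class reaches the threshold."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [row[id_column] for row in reader if float(row[proba_column]) >= threshold]


print(class_sizes(results_text))
print(confident_objects(results_text, "gene"))
```

To apply the same functions to a downloaded file, its content can be read with `open(path, newline="").read()` and passed in place of `results_text`.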
Performance
The AutoClass C algorithm has been designed to run on a single CPU. The running time depends exponentially on the size of the input dataset. Figure 2 illustrates the running time as a function of the input dataset size. Dataset size is expressed as the number of rows (usually genes or proteins) times the number of columns (features or properties of interest).
Fig. 2
Running time (in hours) as a function of the input dataset size. Dataset size is expressed as the number of rows (genes or proteins) times the number of columns (features or studied properties)
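To place a dataset on the horizontal axis of Figure 2, its size can be computed directly from the TSV file. The sketch below is one possible reading of the definition above: it assumes that the header line and the first (identifier) column are not counted, which the text does not state explicitly. The function name `dataset_size` and the sample data are hypothetical.

```python
import csv
import io


def dataset_size(text, id_columns=1):
    """Return rows x columns for TSV text.

    Assumption: the header line and the identifier column(s) are excluded.
    """
    rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
    if not rows:
        return 0
    n_rows = len(rows) - 1  # data lines, header excluded
    n_columns = len(rows[0]) - id_columns  # feature columns only
    return n_rows * n_columns


# Made-up example: 2 genes measured in 2 conditions.
sample = "gene\tcond1\tcond2\ngeneA\t0.52\t-1.30\ngeneB\t\t0.75\n"
print(dataset_size(sample))  # prints 4
```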
Conclusion
Data clustering is an essential unsupervised machine-learning approach used in most modern omics data analyses. The AutoClass algorithm, while very powerful, is not widely used, mainly because its original AutoClass C implementation is difficult to use. AutoClassWeb provides an easy-to-use and straightforward web interface for AutoClass C. The project is open-source and packaged in a Docker image available in BioContainers for better reproducibility and sustainability.
Limitations
AutoClassWeb provides a convenient online service to cluster results from high-throughput experiments such as RNA-seq or mass spectrometry-based proteomics. However, we would like to point out that the processing time required to cluster data with AutoClass increases with the number of genes or proteins to be clustered. A parallel version of AutoClass C that potentially reduces the processing time has been published [14]. Unfortunately, its source code is not available, and the project has been discontinued.
Another limitation is that AutoClassWeb requires users to split input data by type (real location, real scalar or discrete), with special attention to real location and real scalar, which may sometimes be confused.
Authors: Felipe da Veiga Leprevost; Björn A Grüning; Saulo Alves Aflitos; Hannes L Röst; Julian Uszkoreit; Harald Barsnes; Marc Vaudel; Pablo Moreno; Laurent Gatto; Jonas Weber; Mingze Bai; Rafael C Jimenez; Timo Sachsenberg; Julianus Pfeuffer; Roberto Vera Alvarez; Johannes Griss; Alexey I Nesvizhskii; Yasset Perez-Riverol Journal: Bioinformatics Date: 2017-08-15 Impact factor: 6.937