AutoClassWeb: a simple web interface for Bayesian clustering of omics data.

Abstract

OBJECTIVE: Data clustering is a common exploration step in the omics era, notably in genomics and proteomics, where many genes or proteins can be quantified from one or more experiments. Bayesian clustering is a powerful unsupervised algorithm that can classify several thousands of genes or proteins. AutoClass C, its original implementation, handles missing data and automatically determines the best number of clusters, but it is not user-friendly.
RESULTS: We developed an online tool called AutoClassWeb, which provides a simple, easy-to-use web interface for Bayesian clustering with AutoClass. Input data are entered as TSV files and quality controlled. Results are provided in formats that ease further analyses with spreadsheet programs or with programming languages such as Python or R. AutoClassWeb is implemented in Python and is published under the 3-Clause BSD license. The source code is available at https://github.com/pierrepo/autoclassweb along with detailed documentation.

Entities: Chemical

Keywords: Autoclass; Bayesian; Clustering; Genomics; Machine learning; Proteomics

Year: 2022 PMID: 35799281 PMCID: PMC9264669 DOI: 10.1186/s13104-022-06129-6

Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500

Introduction

In biology, high-throughput technologies (notably in genomics and proteomics) enable the identification and quantification of several thousands of genes or proteins in a single experiment. To analyze such a large amount of data, from one or more experiments, clustering algorithms are widely used unsupervised machine-learning methods that group genes or proteins with similar patterns. Bayesian clustering is one such algorithm, and one of its implementations in the C programming language (AutoClass C) was developed in 1996 at the NASA Ames Research Center [1, 2]. The idea behind Bayesian clustering and the AutoClass algorithm is to find a classification that fits the data with the highest probability. The AutoClass algorithm provides some additional useful features: it handles missing data and automatically determines the best number of clusters. AutoClass C has been used in a wide variety of applications, from clustering cells of the prefrontal cortex in rats and mice [3] to detecting body patterns in the common cuttlefish [4] (see also references [5] and [6] for a detailed list of applications). However, AutoClass C, originally developed by physicists, is not user-friendly: the program is accessible only through the command line, only 32-bit binaries are available, and result files are difficult to parse for subsequent analysis. More than 10 years ago, Achcar et al. published AutoClass@IJM [5], a web interface for AutoClass C. This web service drastically simplified the use of AutoClass C and widened its adoption, especially in biology [3, 7–10]. Unfortunately, this tool is no longer maintained, and its source code is not publicly available. To continue to offer this powerful unsupervised Bayesian clustering method to the community, we first developed AutoClassWrapper [6], a Python library that wraps the functionality of AutoClass C. We now go further by providing AutoClassWeb, a new easy-to-use open-source web interface for AutoClass C. AutoClassWeb requires no programming skills, performs quality control on the input data, and delivers results in formats that support further analysis.

Main text

Implementation

AutoClassWeb uses AutoClassWrapper [6], a Python wrapper for the AutoClass C program. This wrapper facilitates the preparation and quality control of data, runs the actual classification, and finally prepares results in file formats that allow further analysis. AutoClassWeb is written in Python [11] and uses the Flask library to build the web interface users interact with. For better reproducibility and sustainability, AutoClassWeb is packaged in a Docker image stored in the BioContainers [12] registry. The web service itself has been designed to be user-friendly and straightforward. There is no user authentication and, by default, results are kept for 30 days before being deleted. A comprehensive help page provides the guidance the user might need. Using Docker technology, AutoClassWeb can be quickly deployed on a local machine or on a public web server. To this end, and to reduce the installation burden, we provide two companion GitHub repositories with detailed instructions for local (https://github.com/pierrepo/autoclassweb-app) and server (https://github.com/pierrepo/autoclassweb-server) installation.

Data submission

The input data must be formatted as tab-separated values (TSV) files. The first line is a header containing the names of the columns, which must be unique. The first column contains the names of the objects studied (e.g. protein or gene identifiers). Missing data is supported and should be coded with an empty value (i.e. nothing). AutoClass supports three categories of data:

real location: negative and positive values such as position, elevation, microarray log ratio...
real scalar: singly bounded real values, typically bounded below at zero (e.g. length, weight, age).
discrete: qualitative data, for instance color, phenotype, name...

If the initial input dataset contains several data types (real scalar, real location, discrete), it is recommended to split the initial dataset into several datasets of homogeneous type and submit them in the input form (Figure 1 (A)).

Fig. 1

Views of AutoClassWeb. A Main page with the form to input data according to its type. B View after data input and quality check. C Status page

For real location and real scalar data types, the user can optionally specify an absolute and a relative error, respectively. By default, the absolute error for real location data is 0.01 and the relative error for real scalar data is 0.01 (1%). However, it is worth noting that the AutoClass C algorithm is not very sensitive to the error parameter [5].
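The recommended split of a mixed dataset into homogeneous TSV files can be sketched with the Python standard library. This is only an illustration and not part of AutoClassWeb: the function name split_by_type, the column names and the values are all made up, and the mapping from column name to data type is assumed to be supplied by the user.

```python
import csv
import io


def split_by_type(tsv_text, column_types):
    """Split TSV text into one TSV text per data type.

    column_types maps each data column name to "real location",
    "real scalar" or "discrete" (hypothetical helper, not AutoClassWeb code).
    Columns absent from column_types are left out of every output.
    """
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    header = rows[0]
    # The header line must contain unique column names.
    if len(set(header)) != len(header):
        raise ValueError("column names must be unique")
    outputs = {}
    for data_type in sorted(set(column_types.values())):
        # Index 0 is the object name column; it is copied into every subset.
        keep = [0] + [
            i for i, name in enumerate(header)
            if i > 0 and column_types.get(name) == data_type
        ]
        buffer = io.StringIO()
        writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
        for row in rows:
            writer.writerow([row[i] for i in keep])
        outputs[data_type] = buffer.getvalue()
    return outputs


# Made-up example; the second gene has a missing log ratio (empty field).
mixed = (
    "gene\tlog_ratio\tlength\tphenotype\n"
    "geneA\t0.52\t1160\tviable\n"
    "geneB\t\t1274\tviable\n"
)
types = {
    "log_ratio": "real location",
    "length": "real scalar",
    "phenotype": "discrete",
}
parts = split_by_type(mixed, types)
print(parts["real location"])
```

Each value in parts can then be saved to its own file and submitted in the matching field of the input form. Missing values stay encoded as empty fields.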

Clustering

Upon submission, input data is quality checked and formatted to be usable by AutoClass C. The web interface provides a unique job name, a link to the status page and a quick summary of input data (toggled with the text Hide/show logs), as illustrated in Figure 1 (B). The status page lists running, failed and completed runs with their respective identifier (Job name), creation date, status and running time (Figure 1 (C)).

Results

Once a job is completed, a green button allows the user to download the results of the clustering. Results are bundled in a zip archive with the following files (where xxx stands for the unique identifier of the job):

xxx_autoclass_out.cdt and xxx_autoclass_out_withproba.cdt can be viewed with Java TreeView [13], a versatile viewer initially developed for microarray data. The file xxx_autoclass_out_withproba.cdt contains the probability for each object (gene or protein) to belong to each class.
xxx_autoclass_out_stats.tsv contains means and standard deviations of numeric columns (real scalar and real location data types) for each class.
xxx_autoclass_out_dendrogram.png is a dendrogram plot that visualizes the distance between all classes.
xxx_autoclass_out.tsv contains all the data with the class assignment and membership probabilities for all classes. This file is in the TSV format and can be easily parsed with spreadsheet programs such as Microsoft Excel or LibreOffice Calc, or programming languages such as R or Python.
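As an illustration of parsing the TSV results with Python, the sketch below groups object names by their assigned class using only the standard library. The exact column names of xxx_autoclass_out.tsv are not given here, so the header used in this example (including assigned_class) and all values are assumptions; the name of the class column is passed as a parameter for that reason.

```python
import csv
import io
from collections import defaultdict


def objects_per_class(tsv_text, class_column):
    """Group object names (first column) by the value found in class_column."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    class_index = header.index(class_column)
    groups = defaultdict(list)
    for row in reader:
        groups[row[class_index]].append(row[0])
    return dict(groups)


# Made-up content imitating a results file; the column names are assumptions.
results = (
    "gene\tlog_ratio\tassigned_class\tclass_1_proba\tclass_2_proba\n"
    "geneA\t0.52\t1\t0.98\t0.02\n"
    "geneB\t-1.30\t2\t0.10\t0.90\n"
    "geneC\t0.47\t1\t0.85\t0.15\n"
)
print(objects_per_class(results, "assigned_class"))
# {'1': ['geneA', 'geneC'], '2': ['geneB']}
```

For a real results file, the text would be read from disk and class_column set to whatever name the header of that file actually uses.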

Performance

The AutoClass C algorithm has been designed to run on a single CPU. The running time depends exponentially on the size of the input dataset. Figure 2 illustrates the running time as a function of the input dataset sizes. Dataset size is expressed as the number of rows (usually genes or proteins) times the number of columns (features or properties of interest).

Fig. 2

Running time (in hours) as a function of the input dataset size. Dataset size is expressed as the number of rows (genes or proteins) times the number of columns (features or studied properties)
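The dataset size used above is simple to compute from a TSV file before submitting it. The sketch below assumes that the header line and the first column (object names) are not counted; that reading of "rows" and "columns" is an assumption, and the function name dataset_size is hypothetical.

```python
import csv
import io


def dataset_size(tsv_text):
    """Return (rows, columns, rows * columns) for TSV text with a header line."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return 0, 0, 0
    n_rows = len(rows) - 1        # the header line is not a gene or protein
    n_columns = len(rows[0]) - 1  # the first column holds object names
    return n_rows, n_columns, n_rows * n_columns


# Made-up example: 3 genes measured under 2 conditions.
example = (
    "gene\tcond_1\tcond_2\n"
    "geneA\t0.52\t-1.30\n"
    "geneB\t\t0.75\n"
    "geneC\t0.10\t0.20\n"
)
print(dataset_size(example))  # (3, 2, 6)
```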

Conclusion

Data clustering is an essential unsupervised machine-learning approach used in most modern omics data analyses. The AutoClass algorithm, while very powerful, is not widely used, mainly because its original AutoClass C implementation is difficult to use. AutoClassWeb provides an easy-to-use and straightforward web interface for AutoClass C. The project is open-source and is packaged in a Docker image, available in BioContainers, for better reproducibility and sustainability.

Limitations

AutoClassWeb provides a convenient online service to cluster results from high-throughput experiments such as RNA-seq or mass spectrometry-based proteomics. However, we would like to point out that the processing time required to cluster data with AutoClass is proportional to the number of genes or proteins to be clustered. A parallel version of AutoClass C that potentially reduces the processing time has been published [14]. Unfortunately, the source code is not available, and the project has been discontinued. Another limitation is that AutoClassWeb requires users to split input data by type (real location, real scalar or discrete), with special attention to real location and real scalar, which may sometimes be confused.

8 in total

1. Java Treeview--extensible visualization of microarray data.

Authors: Alok J Saldanha
Journal: Bioinformatics Date: 2004-06-04 Impact factor: 6.937

2. The metacaspase (Mca1p) has a dual role in farnesol-induced apoptosis in Candida albicans.

Authors: Thibaut Léger; Camille Garcia; Marwa Ounissi; Gaëlle Lelandais; Jean-Michel Camadro
Journal: Mol Cell Proteomics Date: 2014-10-27 Impact factor: 5.911

3. The control of tomato fruit elongation orchestrated by sun, ovate and fs8.1 in a wild relative of tomato.

Authors: Shan Wu; Josh P Clevenger; Liang Sun; Sofia Visa; Yuji Kamiya; Yusuke Jikumaru; Joshua Blakeslee; Esther van der Knaap
Journal: Plant Sci Date: 2015-06-09 Impact factor: 4.729

4. Identifying the structure in cuttlefish visual signals.

Authors: Anne C Crook; Roland Baddeley; Daniel Osorio
Journal: Philos Trans R Soc Lond B Biol Sci Date: 2002-11-29 Impact factor: 6.237

5. Serotonin Differentially Regulates L5 Pyramidal Cell Classes of the Medial Prefrontal Cortex in Rats and Mice.

Authors: Mary C Elliott; Peter M Tanaka; Ryan W Schwark; Rodrigo Andrade
Journal: eNeuro Date: 2018-02-06

6. AutoClass@IJM: a powerful tool for Bayesian classification of heterogeneous data in biology.

Authors: Fiona Achcar; Jean-Michel Camadro; Denis Mestivier
Journal: Nucleic Acids Res Date: 2009-05-27 Impact factor: 16.971

7. BioContainers: an open-source and community-driven framework for software standardization.

Authors: Felipe da Veiga Leprevost; Björn A Grüning; Saulo Alves Aflitos; Hannes L Röst; Julian Uszkoreit; Harald Barsnes; Marc Vaudel; Pablo Moreno; Laurent Gatto; Jonas Weber; Mingze Bai; Rafael C Jimenez; Timo Sachsenberg; Julianus Pfeuffer; Roberto Vera Alvarez; Johannes Griss; Alexey I Nesvizhskii; Yasset Perez-Riverol
Journal: Bioinformatics Date: 2017-08-15 Impact factor: 6.937

8. The adaptive response to iron involves changes in energetic strategies in the pathogen Candida albicans.

Authors: Celia Duval; Carole Macabiou; Camille Garcia; Emmanuel Lesuisse; Jean-Michel Camadro; Françoise Auchère
Journal: Microbiologyopen Date: 2019-12-01 Impact factor: 3.139
