Zhen Chen, Xuhan Liu, Pei Zhao, Chen Li, Yanan Wang, Fuyi Li, Tatsuya Akutsu, Chris Bain, Robin B Gasser, Junzhou Li, Zuoren Yang, Xin Gao, Lukasz Kurgan, Jiangning Song.
Abstract
The rapid accumulation of molecular data motivates the development of innovative approaches to computationally characterize sequences, structures and functions of biological and chemical molecules in an efficient, accessible and accurate manner. Although several computational tools characterize protein or nucleic acid data, there are no one-stop computational toolkits that comprehensively characterize a wide range of biomolecules. We address this vital need by developing a holistic platform that generates features from sequence and structural data for a diverse collection of molecule types. Our freely available and easy-to-use iFeatureOmega platform generates, analyzes and visualizes 189 representations for biological sequences, structures and ligands. To the best of our knowledge, iFeatureOmega provides the largest scope when directly compared to the current solutions, in terms of the number of feature extraction and analysis approaches and the coverage of different molecules. We release three versions of iFeatureOmega, including a webserver, a command line interface and a graphical interface, to satisfy the needs of experienced bioinformaticians and less computer-savvy biologists and biochemists. With the assistance of iFeatureOmega, users can encode their molecular data into representations that facilitate the construction of predictive models and analytical studies. We highlight the benefits of iFeatureOmega based on three research applications, demonstrating how it can be used to accelerate and streamline research in bioinformatics, computational biology and cheminformatics. The iFeatureOmega webserver is freely available at http://ifeatureomega.erc.monash.edu and the standalone versions can be downloaded from https://github.com/Superzchen/iFeatureOmega-GUI/ and https://github.com/Superzchen/iFeatureOmega-CLI/.
Year: 2022 PMID: 35524557 PMCID: PMC9252729 DOI: 10.1093/nar/gkac351
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 19.160
Comparison of existing state-of-the-art computational toolkits for feature engineering, extraction, calculation, analysis and visualization. Tools are sorted chronologically
| Tools | Molecule type: DNAs | Molecule type: RNAs | Molecule type: Protein sequences | Molecule type: Ligands | Molecule type: Protein structures | Performs feature analysis | Performs data visualization | Interface: Web server | Interface: CLI stand-alone | Interface: GUI stand-alone | Ref. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PIC | × | × | × | × | √ | × | √ | √ | × | × | ( |
| PseAAC | × | × | √ (3) | × | × | × | × | √ | × | × | ( |
| PROFEAT | × | × | √ (11) | √ (1) | × | × | × | √ | × | × | ( |
| PseAAC-Builder | × | × | √ (3) | × | × | × | × | × | √ | √ | ( |
| PyDPI | × | × | √ (14) | √ (13) | × | × | × | × | √ | × | ( |
| ChemoPy | × | × | × | √ (19) | × | × | × | × | √ | × | ( |
| Propy | × | × | √ (13) | × | × | × | × | × | √ | × | ( |
| PseAAC-General | × | × | √ (13) | × | × | × | × | × | √ | × | ( |
| Rcpi | × | × | √ (10) | √ (8) | × | × | × | × | √ | × | ( |
| Protr/ProtrWeb | × | × | √ (22) | × | × | × | × | × | √ | × | ( |
| BioTriangle | √(14) | √(14) | √(14) | √(18) | × | × | × | √ | × | × | ( |
| PDBparam | × | × | × | × | √(4) | × | × | √ | × | × | ( |
| repRNA | × | √ (11) | × | × | × | × | × | √ | × | × | ( |
| PseKRAAC | × | × | √ (16) | × | × | × | × | √ | × | × | ( |
| iFeature | × | × | √ (53) | × | × | √(10) | √(2) | √ | √ | × | ( |
| PyFeat | √(13) | √ (13) | √(9) | × | × | × | × | × | √ | × | ( |
| Seq2Feature | √(1) | √(1) | √ (4) | × | × | × | × | √ | × | × | ( |
| BioSeq-Analysis2.0* | √(36) | √(27) | √(53) | × | × | √(2) | √(1) | √ | √ | × | ( |
| PFeature* | × | × | √ | × | √ | × | × | √ | √ | × | ( |
| iLearn* | √(26) | √(18) | √(53) | × | × | √(15) | √(3) | √ | √ | × | ( |
| iLearnPlus* | √(46) | √(35) | √(66) | × | × | √( | √(7) | √ | × | √ | ( |
| MathFeature | √(30) | √(30) | √(12) | × | × | × | × | × | √ | √ | ( |
| iFeatureOmega | √( | √( | √( | √( | √( | √(15) | √( | √ | √ | √ | - |
Note: *the tool is a machine-learning platform. ‘×’ means that the function is unavailable. Numbers in brackets denote the number of different feature sets, or of analysis/visualization options.
Figure 1. The iFeatureOmega architecture and its three versions: iFeatureOmega-Web, iFeatureOmega-GUI and iFeatureOmega-CLI.
Figure 2. Screenshot of the GUI version of iFeatureOmega, showing the ‘Protein’, ‘DNA’, ‘RNA’, ‘Structure’, ‘Ligand’, ‘Feature analysis’ and ‘Plot’ panels.
Figure 3. Feature analysis results for protein zinc-binding sites, obtained with the ‘AAC_type2’ feature extraction method and the local CLI version of iFeatureOmega: data visualization for four types of zinc-binding sites (A) and for zinc-binding versus non-zinc-binding sites (B).
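To make the idea of a composition-style descriptor concrete, the following is a minimal sketch of a plain amino acid composition (the fraction of each of the 20 standard residues in a sequence). It is an illustration only: it does not use iFeatureOmega, the function name `amino_acid_composition` is hypothetical, and it is not claimed to match the ‘AAC_type2’ variant used for Figure 3.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so every sequence
# maps to a vector of the same length.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> dict[str, float]:
    """Return the fraction of each standard residue in `sequence`.

    Hypothetical helper for illustration; non-standard characters
    (gaps, 'X', etc.) are ignored rather than counted.
    """
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    counts = Counter(residues)
    total = len(residues)
    # Counter returns 0 for residues that never occur.
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


if __name__ == "__main__":
    composition = amino_acid_composition("HCCHHEC")
    # Print only the residues that actually occur in the toy sequence.
    for aa, fraction in composition.items():
        if fraction > 0:
            print(aa, round(fraction, 3))
```

Each sequence becomes a fixed-length numeric vector, which is what allows sequences of different lengths to be compared, clustered or plotted together.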
Figure 4. Data visualization for lncRNA and mRNA sequences using the local GUI version of iFeatureOmega: a histogram and kernel density plot showing the difference in distribution between lncRNA and mRNA sequences (A), a line chart showing the difference in mean values (B) and a box plot showing the difference in distribution (C) for each descriptor.
Figure 5. Data visualization for ligands with constitution (A–D) and geary (E and F) features. The feature matrix is shown as a heatmap, and the values in the matrix can be filtered with the color bar (A). The distribution of all feature values is shown in a histogram, and the line plot represents the probability density curve fitted with kernel density estimation (B). The distribution of each feature is shown using box plots (C). The first two components of a PCA on the calculated features, with the provided labels, are shown in a scatter plot, and detailed information is listed in the table on the right when points are selected (D). The similarity of the molecules is shown as a relationship plot in which each node stands for a molecule (E). If a molecule is similar to another molecule, there is an edge between them, and the similarity value is shown when the mouse pointer hovers over it. The clusters obtained by the clustering algorithms are labeled with different colors. The relationship plot can also visualize the similarity of features with different distance metrics (F).
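The relationship plot in panel E can be thought of as a thresholded similarity graph. The sketch below shows one way such edges could be derived from feature vectors. It rests on assumptions that are not stated in the source: cosine similarity as the metric, a cutoff of 0.9, and the hypothetical helper names `cosine_similarity` and `similarity_edges`. The tool itself may use other metrics or thresholds.

```python
import math
from itertools import combinations


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_edges(
    features: dict[str, list[float]], threshold: float = 0.9
) -> list[tuple[str, str, float]]:
    """Return (molecule_a, molecule_b, similarity) for every pair at or above `threshold`.

    The threshold value is an assumption chosen for illustration.
    """
    edges = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(features.items(), 2):
        sim = cosine_similarity(vec_a, vec_b)
        if sim >= threshold:
            edges.append((name_a, name_b, sim))
    return edges


if __name__ == "__main__":
    # Toy feature vectors for three hypothetical molecules.
    toy = {"mol1": [1.0, 0.0], "mol2": [2.0, 0.0], "mol3": [0.0, 1.0]}
    for a, b, sim in similarity_edges(toy):
        print(f"{a} -- {b} (similarity {sim:.2f})")
```

Swapping `cosine_similarity` for another metric changes which pairs are connected, which is the effect panel F illustrates with different distance metrics.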