Maria D Paraskevopoulou, Ioannis S Vlachos, Emmanouil Athanasiadis, George Spyrou.
Abstract
BiDaS is a web application that generates massive Monte Carlo simulated sequence or numerical feature data sets (e.g. dinucleotide content, composition, transition, distribution properties) from small user-provided data sets. The BiDaS server enables users to analyze their data and generate large amounts of: (i) simulated DNA/RNA and amino acid (AA) sequences whose sequence and/or extracted feature distributions are practically identical to those of the original data; (ii) simulated numerical features that present identical distributions while preserving the exact 2D or 3D between-feature correlations observed in the original data sets. The server can project the provided sequences to multidimensional feature spaces based on: (i) 38 DNA/RNA features describing conformational and physicochemical nucleotide sequence features from the B-DNA-VIDEO database, (ii) 122 DNA/RNA features based on conformational and thermodynamic dinucleotide properties from the DiProDB database and (iii) the pseudo-amino acid composition of the initial sequences. To the best of our knowledge, this is the first available web server that allows users to generate vast numbers of biological data sets with realistic characteristics while keeping between-feature associations. These data sets can be used for a wide variety of current biological problems, such as the in-depth study of gene, transcript, peptide and protein groups/families, the creation of large data sets from just a few available members, and the strengthening of machine learning classifiers. All simulations use advanced Monte Carlo sampling techniques. The BiDaS web application is available at http://bioserver-3.bioacademy.gr/Bioserver/BiDaS/.
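As a rough illustration of how a few sequences can seed a much larger simulated set that keeps their dinucleotide transition behaviour, the sketch below fits a first-order Markov chain to the input and samples new sequences from it. This is an assumption made for illustration only: the abstract does not specify the sampling scheme BiDaS uses, and the function names (`fit_transitions`, `sample_sequence`) and the toy sequences are hypothetical.

```python
import random
from collections import Counter, defaultdict

def fit_transitions(seqs):
    """Count start symbols and symbol-to-symbol transitions in the input."""
    start = Counter(s[0] for s in seqs if s)
    trans = defaultdict(Counter)
    for s in seqs:
        for a, b in zip(s, s[1:]):
            trans[a][b] += 1
    return start, trans

def sample_sequence(start, trans, length, rng):
    """Draw one sequence of the given length from the fitted chain."""
    symbols, weights = zip(*start.items())
    seq = [rng.choices(symbols, weights)[0]]
    while len(seq) < length:
        nxt = trans.get(seq[-1])
        # If a symbol was never followed by anything, restart from the
        # start distribution (illustrative fallback, not from the source).
        table = nxt if nxt else start
        symbols, weights = zip(*table.items())
        seq.append(rng.choices(symbols, weights)[0])
    return "".join(seq)

# Toy input data (hypothetical, not taken from the paper).
observed = ["ACGTACGTTGCA", "ACGGTCAACGTT", "TTGACGTACCGA"]
start, trans = fit_transitions(observed)
rng = random.Random(42)
simulated = [sample_sequence(start, trans, 12, rng) for _ in range(1000)]
print(simulated[:3])
```

Every adjacent pair in a simulated sequence is a pair that occurs in the input, and pairs are drawn in proportion to their observed counts, so the dinucleotide transition frequencies of a large simulated set approach those of the original sequences.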
Year: 2013 PMID: 23716644 PMCID: PMC3692108 DOI: 10.1093/nar/gkt420
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Histograms of AAs within the initial and the MC-generated training data sets.
Classification results of best feature combination for each of the three classifiers using real as well as simulated sequences in the training process
| Classifier | Best feature combination | Training data set | Accuracy | F-Score | |
|---|---|---|---|---|---|
| Bayesian | PseudoAA1-AA2 | Real | 75.63 | 88.67 | 0.35 |
| Bayesian | PseudoAA1-AA2 | Monte Carlo | 75.04 | 88.72 | 0.35 |
| PNN with Gaussian Kernel | PseudoAA1-AA2 | Real | 78.76 | 88.45 | 0.39 |
| PNN with Gaussian Kernel | PseudoAA1-AA2 | Monte Carlo | 77.03 | 87.42 | 0.34 |
| PNN with Exponential Kernel | PseudoAA1-AA2 | Real | 80.95 | 88.20 | 0.41 |
| PNN with Exponential Kernel | PseudoAA1-AA2 | Monte Carlo | 78.42 | 85.92 | 0.32 |
Figure 2. Simulated 2D and 3D correlated features using Cholesky decomposition. Original features are shown in all diagrams as blue points, while MC-simulated values are shown in black. Between-feature correlations are preserved in the simulated feature sets.
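The Cholesky step named in the Figure 2 caption can be sketched as follows: estimate the mean vector and covariance matrix of the original features, factor the covariance as L·Lᵀ, and map independent standard normal draws z to mean + L·z. This is a minimal sketch that assumes approximately Gaussian features; the source does not state how BiDaS handles non-Gaussian marginals. The function names and toy values are hypothetical, and the code uses only the standard library (in practice `numpy.linalg.cholesky` would typically replace the hand-written factorization).

```python
import math
import random

def cholesky(a):
    """Lower-triangular L with L * L^T == a (a symmetric positive definite)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def mean_and_cov(rows):
    """Sample mean vector and unbiased sample covariance matrix."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mu, cov

def simulate_correlated(rows, n_samples, seed=0):
    """Draw n_samples vectors with the mean and covariance of rows."""
    rng = random.Random(seed)
    mu, cov = mean_and_cov(rows)
    L = cholesky(cov)
    d = len(mu)
    out = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]  # independent N(0, 1)
        out.append([mu[i] + sum(L[i][k] * z[k] for k in range(i + 1))
                    for i in range(d)])
    return out

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Toy 2D feature set (hypothetical values, not from the paper).
original = [[1.0, 2.0], [2.0, 1.5], [3.0, 4.0],
            [4.0, 3.0], [5.0, 5.5], [6.0, 4.5]]
simulated = simulate_correlated(original, 5000, seed=1)
print("original r :", pearson([r[0] for r in original], [r[1] for r in original]))
print("simulated r:", pearson([r[0] for r in simulated], [r[1] for r in simulated]))
```

The same functions work unchanged for three features, as long as the sample covariance matrix is positive definite (more observations than features and no exactly collinear columns).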