Pu-Feng Du, Wei Zhao, Yang-Yang Miao, Le-Yi Wei, Likun Wang.
Abstract
With the avalanche of biological sequences in public databases, one of the most challenging problems in computational biology is to predict their biological functions and cellular attributes. Most existing prediction algorithms can only handle fixed-length numerical vectors. It is therefore important to be able to represent biological sequences of various lengths as fixed-length numerical vectors. Although several algorithms and software implementations have been developed to address this problem, the existing programs provide only a fixed number of representation modes. Every time a new sequence representation mode is developed, a new program is needed. In this paper, we propose UltraPse as a universal software platform for this problem. UltraPse not only generates various existing sequence representation modes, but also simplifies future programming work in developing novel representation modes. The extensibility of UltraPse is particularly enhanced: users can define their own representation modes, their own physicochemical properties, or even their own types of biological sequences. Moreover, UltraPse is also the fastest software of its kind. The source code package, as well as executables for both Linux and Windows platforms, can be downloaded from the GitHub repository.
Keywords: extensible software; pseudo-amino acid compositions; pseudo-k nucleotide compositions
Year: 2017 PMID: 29135934 PMCID: PMC5713368 DOI: 10.3390/ijms18112400
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. Computational efficiency comparisons. Three programs were compared by having each compute amino acid compositions on the same dataset on the same machine. Every program was executed three times with the same parameters, and the average execution time was used to calculate the computational efficiency. The computational efficiency is measured as the average number of sequences processed per second. Pse-In-One: a program from the literature [52]; PseAAC-General: a program from the literature [46]; UltraPse: the program presented in this work.
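The efficiency measure described in this caption can be illustrated with a minimal Python sketch. It is not the implementation of UltraPse or of the other programs: the function names `amino_acid_composition` and `sequences_per_second` and the toy sequences are assumptions made for illustration only. The sketch computes a 20-dimensional amino acid composition and derives the throughput from the execution time averaged over three runs.

```python
import time
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def amino_acid_composition(seq):
    """Return the 20 residue frequencies of seq, in AMINO_ACIDS order."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return [counts[a] / total for a in AMINO_ACIDS]

def sequences_per_second(seqs, repeats=3):
    """Throughput from the mean execution time of `repeats` runs (hypothetical helper)."""
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        for s in seqs:
            amino_acid_composition(s)
        elapsed.append(time.perf_counter() - start)
    mean_time = sum(elapsed) / repeats
    return len(seqs) / mean_time

# Toy data, not the dataset used in the paper.
toy = ["MKVLAAGIVG", "ACDEFGHIKLMNPQRSTVWY"] * 1000
vec = amino_acid_composition(toy[0])
print(len(vec), round(sum(vec), 6))
print(f"{sequences_per_second(toy):.0f} sequences/s")
```

Whatever the length of the input sequence, the output vector has 20 entries, which is the fixed-length property that downstream prediction algorithms rely on.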
Figure 2. Hierarchical organization of integrated sequence representation modes. UltraPse integrates these sequence representation modes in its distribution package. Most of them can also be applied to user-defined sequence types, as long as the users provide proper definitions of the physicochemical properties.
Software function comparison in terms of flexibility and extensibility.
| Software | Sequence Types | Extensibility |
|---|---|---|
| UltraPse | DNA, RNA, Protein, User-defined types | Users can define their own sequence types, representation modes and physicochemical properties |
| PseAAC-General | Protein | Users can define their own representation modes |
| PseAAC-Builder | Protein | No extensibility |
| Pse-In-One | DNA, RNA, Protein | Users can define their own physicochemical properties |
| PseKNC | DNA, RNA | Users can define their own physicochemical properties |
| PseKNC-General | DNA, RNA | Users can define their own physicochemical properties |
Software function comparison in terms of data processing ability.
| Software | Output Formats | Input Formats | Data Fault Tolerance a |
|---|---|---|---|
| UltraPse | SVM b, TSV c, CSV d | Multi-line FASTA (automatic ID recognition for UniProt, GenBank, EMBL, DDBJ and RefSeq) | User-controllable behavior on data faults |
| PseAAC-General | SVM, TSV, CSV | Single-line FASTA (with restrictions on comment line) e | Automatically ignore and report data faults |
| PseAAC-Builder | SVM, TSV, CSV | Single-line FASTA (with restrictions on comment line) e | Automatically ignore and report data faults |
| Pse-In-One | SVM, TSV, CSV | Multi-line FASTA | Abort processing on data faults |
| PseKNC | SVM, TSV, CSV | Multi-line FASTA | Abort processing on data faults |
| PseKNC-General | SVM, TSV, CSV | Multi-line FASTA | Abort processing on data faults |
a Data fault tolerance: the behavior of a program when it encounters invalid data records. Here, invalid data records are sequences containing non-standard letters and sequences of insufficient length; b SVM: data format for libSVM [61]; c TSV: tab-separated values; d CSV: comma-separated values; e Single-line FASTA: the sequence of a record in the file must not span multiple lines. Both PseAAC-General and PseAAC-Builder have this restriction.
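The input handling and output formats compared in this table can be sketched in Python as follows. This is an illustrative sketch rather than the code of any listed program: the names `read_fasta`, `filter_records`, `to_libsvm` and `to_delimited`, and the `on_fault` policy values, are hypothetical. It reads multi-line FASTA records, applies a selectable policy to records with non-standard letters or insufficient length, and formats a numerical vector for libSVM, TSV or CSV output.

```python
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")  # standard protein letters

def read_fasta(lines):
    """Yield (header, sequence) pairs; a sequence may span several lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)

def filter_records(records, min_length=1, on_fault="skip"):
    """Fault policy: 'skip' drops invalid records, 'abort' raises on the first one."""
    if on_fault not in ("skip", "abort"):
        raise ValueError("on_fault must be 'skip' or 'abort'")
    for header, seq in records:
        faulty = len(seq) < min_length or not set(seq) <= STANDARD
        if not faulty:
            yield header, seq
        elif on_fault == "abort":
            raise ValueError(f"invalid record: {header}")

def to_libsvm(label, vector):
    """One libSVM line: label followed by 1-based index:value pairs."""
    return f"{label} " + " ".join(f"{i}:{v:g}" for i, v in enumerate(vector, 1))

def to_delimited(vector, sep):
    """One TSV (sep='\\t') or CSV (sep=',') line."""
    return sep.join(f"{v:g}" for v in vector)

demo = """>seq1 demo
MKV
LAA
>seq2 has X
MKXL
>seq3
AC
""".splitlines()

kept = list(filter_records(read_fasta(demo), min_length=3))
print(kept)                            # [('seq1 demo', 'MKVLAA')]
print(to_libsvm(1, [0.5, 0.0, 0.25]))  # 1 1:0.5 2:0 3:0.25
```

With `on_fault="abort"`, the same input raises an error at the first invalid record, which mirrors the "abort processing" behavior listed in the table.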
Figure 3. The abstracted software design and data flow chart of UltraPse.
Figure 4. An example of using UltraPse to implement classic pseudo-amino acid compositions. A TDF, classic-pseaac.lua, was applied. The FASTA-format sequences are stored in the file demo.fas. The command options indicate that the Type 2 PseAAC will be computed with parameters λ = 10 and ω = 0.05. The output format is compatible with libSVM.
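The command itself is not reproduced in this text, so the following Python sketch illustrates only the kind of computation the caption refers to. It assumes the series-correlation formulation commonly called Type 2 PseAAC, which yields 20 + n·λ features for n physicochemical properties. The function names, the choice of two properties and the property values are placeholders for illustration; they are not taken from classic-pseaac.lua.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def standardize(prop):
    """Rescale a property table to zero mean and unit population std over the 20 residues."""
    vals = [prop[a] for a in AMINO_ACIDS]
    mean = sum(vals) / 20
    std = (sum((v - mean) ** 2 for v in vals) / 20) ** 0.5
    return {a: (prop[a] - mean) / std for a in AMINO_ACIDS}

def type2_pseaac(seq, props, lam=10, omega=0.05):
    """Series-correlation PseAAC sketch; seq must contain only standard residues."""
    if len(seq) <= lam:
        raise ValueError("sequence must be longer than lambda")
    props = [standardize(p) for p in props]
    freqs = [seq.count(a) / len(seq) for a in AMINO_ACIDS]
    taus = []
    for k in range(1, lam + 1):  # sequence-order gap
        for p in props:          # one correlation factor per property
            s = sum(p[seq[i]] * p[seq[i + k]] for i in range(len(seq) - k))
            taus.append(s / (len(seq) - k))
    denom = sum(freqs) + omega * sum(taus)
    return [f / denom for f in freqs] + [omega * t / denom for t in taus]

# Placeholder property tables, NOT real physicochemical values.
placeholder_a = {a: float(i) for i, a in enumerate(AMINO_ACIDS)}
placeholder_b = {a: float((7 * i) % 20) for i, a in enumerate(AMINO_ACIDS)}

seq = "MKVLAAGIVGACDEFGHIKLMNPQRSTVWY"
vec = type2_pseaac(seq, [placeholder_a, placeholder_b], lam=10, omega=0.05)
print(len(vec))  # 40
```

Here λ sets how many sequence-order gaps contribute correlation factors, and ω weights those factors against the plain amino acid composition.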