Bin Liu1,2,3, Hao Wu1, Deyuan Zhang4, Xiaolong Wang1,2, Kuo-Chen Chou3,5.
Abstract
To expedite genome/proteome analysis, we have developed a Python package called Pse-Analysis. The package automatically completes the following five procedures: (1) sample feature extraction, (2) optimal parameter selection, (3) model training, (4) cross validation, and (5) evaluation of prediction quality. The user only needs to input a benchmark dataset along with the query biological sequences concerned. Based on the benchmark dataset, Pse-Analysis automatically constructs a predictor and then yields the predicted results for the submitted query samples, so all of these otherwise tedious steps are carried out by the computer. Moreover, the multiprocessing technique was adopted to increase computational speed by about 6-fold. The Pse-Analysis Python package is freely accessible to the public at http://bioinformatics.hitsz.edu.cn/Pse-Analysis/, and can be run directly on Windows, Linux, and Unix.
Keywords: genome/proteome analysis; pseudo components; sequence analysis; support vector machine
Year: 2017 PMID: 28076851 PMCID: PMC5355101 DOI: 10.18632/oncotarget.14524
Source DB: PubMed Journal: Oncotarget ISSN: 1949-2553
Figure 1. The flowchart of the Pse-Analysis Python package
The “train.py” script trains the predictive model on the benchmark dataset submitted by the user. It covers four procedures: feature extraction, parameter selection, model training, and cross validation. The “predict.py” script uses the trained model to predict the query samples and evaluates the prediction quality with a set of widely used metrics: Acc, MCC, Sn, Sp [25], and AUC [68].
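The threshold-based metrics named in the caption above (Acc, MCC, Sn, Sp) can be illustrated with a minimal Python sketch using their standard confusion-matrix definitions. The function name `binary_metrics` and the 1/0 label coding are assumptions made for illustration; this is not code from the Pse-Analysis package, and AUC is left out because it requires ranked decision scores rather than hard labels.

```python
import math


def binary_metrics(y_true, y_pred):
    """Return Acc, Sn, Sp and MCC for labels coded as 1 (positive) and 0 (negative).

    Hypothetical helper for illustration; not part of Pse-Analysis.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    total = tp + tn + fp + fn
    # Denominator of the Matthews correlation coefficient.
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

    return {
        "Acc": (tp + tn) / total if total else 0.0,         # overall accuracy
        "Sn": tp / (tp + fn) if (tp + fn) else 0.0,         # sensitivity
        "Sp": tn / (tn + fp) if (tn + fp) else 0.0,         # specificity
        "MCC": (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0,
    }
```

For example, `binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])` counts two true positives, two true negatives, one false positive, and one false negative. Undefined ratios are returned as 0.0 here, which is one common convention rather than the only one.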
Figure 2. The computational cost of Pse-Analysis can be significantly reduced by using the multiprocessing technique
The blue curve shows the computational time of the parameter optimization process when Pse-Analysis uses one CPU core to process the five subsets for nucleosome positioning prediction in Caenorhabditis elegans [7], while the red curve shows the corresponding computational time when Pse-Analysis uses ten CPU cores for the same task.
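The idea behind the speed-up, evaluating independent parameter candidates on several CPU cores at once, can be sketched with Python's standard `multiprocessing.Pool`. Everything specific here is an assumption for illustration: the grid over two SVM-style parameters (`c`, `gamma`), the function names, and the toy objective in `score_params`, which merely stands in for a real cross-validation run. It does not reproduce how Pse-Analysis itself distributes the work.

```python
from itertools import product
from multiprocessing import Pool


def score_params(params):
    """Hypothetical stand-in for one cross-validation run.

    Returns (score, params). The toy objective peaks at c=3, gamma=-3;
    a real run would train and validate a model for these parameters.
    """
    c, gamma = params
    score = -((c - 3) ** 2 + (gamma + 3) ** 2)
    return score, params


def grid_search(c_values, gamma_values, processes=1):
    """Score every (c, gamma) pair and return the best (score, params)."""
    grid = list(product(c_values, gamma_values))
    if processes > 1:
        # Each candidate is independent, so the grid can be split across workers.
        with Pool(processes=processes) as pool:
            results = pool.map(score_params, grid)
    else:
        # Serial fallback: same results, one core.
        results = [score_params(p) for p in grid]
    return max(results)
```

A call such as `grid_search(range(-5, 16, 2), range(-15, 4, 2), processes=4)` scores the same candidates as the serial version, just in parallel. On Windows, where worker processes are spawned rather than forked, the call should sit under an `if __name__ == "__main__":` guard. The achievable speed-up depends on the cost of each evaluation and on the number of cores available.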