Chang-Jian Zhang 1, Hua Tang 2, Wen-Chao Li 1, Hao Lin 1,3, Wei Chen 1,4,3, Kuo-Chen Chou 1,4,3.
Abstract
The initiation of replication is an extremely important process in the DNA life cycle. Given an uncharacterized DNA sequence, can we identify where its origin of replication (ORI) is located? This is undoubtedly a fundamental problem in genome analysis. In particular, with the rapid development of genome sequencing technology producing a huge amount of sequence data, it is highly desirable to develop computational methods for rapidly and effectively identifying the ORIs in these genomes. Unfortunately, existing computational methods, such as sequence alignment or k-mer strategies, can hardly achieve decent success rates. To address this problem, we developed a predictor called "iOri-Human". Rigorous jackknife tests have shown that its overall accuracy and stability in identifying human ORIs are over 75% and 50%, respectively. The predictor uses pseudo nucleotide composition (an extension of pseudo amino acid composition) to incorporate 96 physicochemical properties of the 16 possible constituent dinucleotides, reflecting both the global and the local sequence patterns in DNA. Moreover, a user-friendly web-server for iOri-Human has been established at http://lin.uestc.edu.cn/server/iOri-Human.html, by which users can easily get their desired results without needing to go through the complicated mathematics involved.
Keywords: human DNA; origin of replication; physicochemical properties of dinucleotides; pseudo k-tuple nucleotide composition
Year: 2016 PMID: 27626500 PMCID: PMC5342515 DOI: 10.18632/oncotarget.11975
Source DB: PubMed Journal: Oncotarget ISSN: 1949-2553
Figure 1. Schematic diagram of the human origin of replication
The process of DNA replication requires two DNA polymerase complexes traveling in opposite directions (i.e. two bidirectional replication forks) from the origin.
Figure 2. A semi-screenshot of the top page of the iOri-Human web-server at http://lin.uestc.edu.cn/server/iOri-Human.html
The success rates obtained by various machine-learning algorithms via jackknife tests on the benchmark dataset (Supporting Information S1)
| Algorithm | Sn | Sp | Acc | MCC | AUC |
|---|---|---|---|---|---|
| iOri-Human | | | | | |
| SVM | 0.688 | 0.544 | 0.616 | 0.400 | 0.651 |
| Naive Bayes | 0.379 | 0.746 | 0.563 | 0.286 | 0.614 |
| KNN | 0.606 | 0.473 | 0.54 | 0.144 | 0.529 |
| Decision Tree | 0.078 | 0.936 | 0.508 | 0.028 | 0.511 |
See Eq. 8 for the definition of the metrics.
AUC means the area under the ROC curves in Figure 3; the greater the AUC value is, the better the predictor will be [53, 54].
The proposed predictor, for which the number of trees used was 100 and the seed was 1.
The optimal parameters used for SVM were C = 0.5 and γ = 0.125.
The optimal parameter used for KNN (K-nearest neighbor) was K = 1.
Figure 3. A graphical illustration of the performances of iOri-Human and its cohorts via the ROC (receiver operating characteristic) curves [53, 54]
The area under the ROC curve is called AUC (area under the curve). The greater the AUC value is, the better the performance will be. See the text for further explanation.
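As a rough illustration of what the AUC measures, the sketch below computes it from predicted scores using the rank-based (Mann-Whitney) formulation: the fraction of positive/negative pairs in which the positive sample receives the higher score, with ties counted as half. The function name `roc_auc` and the toy labels and scores are assumptions made for this example; the paper's own procedure for drawing the ROC curves is not described in this record.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: iterable of 1 (ORI) / 0 (non-ORI); scores: predicted scores.
    Equivalent to integrating the ROC curve, but O(n_pos * n_neg),
    so it is only meant for small illustrative inputs.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# Toy data, for illustration only.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))  # -> 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.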