Bin Liu, Shengyu Chen, Ke Yan, Fan Weng.
Abstract
Identification of replication origins plays a key role in understanding the mechanism of DNA replication, and the task is of great significance in DNA sequence analysis. Because of its importance, several computational approaches have been introduced. Among these predictors, iRO-3wPseKNC is the first discriminative method able to correctly identify entire replication origins. To further improve its predictive performance, in this study we proposed the Pseudo k-tuple GC Composition (PsekGCC) approach, which captures the "GC asymmetry bias" of yeast species by considering both the GC skew and the sequence order effects of the k-tuple GC Composition (k-GCC). Based on PsekGCC, we proposed a new predictor called iRO-PsekGCC to identify DNA replication origins. A rigorous jackknife test on benchmark datasets for two yeast species (Saccharomyces cerevisiae and Pichia pastoris) indicated that iRO-PsekGCC outperformed iRO-3wPseKNC. It can be anticipated that iRO-PsekGCC will be a useful tool for DNA replication origin identification.
Availability and implementation: The web-server for the iRO-PsekGCC predictor has been established and can be accessed at http://bliulab.net/iRO-PsekGCC/.
Keywords: DNA sequence analysis; pseudo k-tuple GC composition; random forest; replication origin identification; web-server
Year: 2019 PMID: 31620165 PMCID: PMC6759546 DOI: 10.3389/fgene.2019.00842
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. A schematic diagram illustrating how to calculate the GC skew in the front, middle, and rear windows along a DNA sequence. (A) The coupling between all the contiguous k-GCC (k = 3); (B) The coupling between the second most contiguous k-GCC (k = 3); (C) The coupling between all the contiguous k-GCC (k = 4); (D) The coupling between the second most contiguous k-GCC (k = 4).
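The GC skew and k-GCC quantities named in this caption can be illustrated with a small Python sketch. It assumes the standard definition GC skew = (G - C)/(G + C), an equal three-way split into front, middle, and rear windows, and a reduced alphabet {G, C, *} in which every non-G/C base is masked as `*` (suggested by feature labels such as `*GG`). The window boundaries used by the predictor are governed by its own parameters, so the function names and the equal split here are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from itertools import product


def gc_skew(seq):
    """Return (G - C) / (G + C) for seq, or 0.0 when it contains no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def split_windows(seq):
    """Split seq into front, middle, and rear thirds (an assumed equal split)."""
    n = len(seq)
    a, b = n // 3, 2 * n // 3
    return {"front": seq[:a], "middle": seq[a:b], "rear": seq[b:]}


def mask_non_gc(seq):
    """Keep G and C; replace every other character with '*'."""
    return "".join(ch if ch in "GC" else "*" for ch in seq.upper())


def kgcc(seq, k=3):
    """Normalized frequency of every k-tuple over the alphabet {G, C, *}."""
    masked = mask_non_gc(seq)
    total = len(masked) - k + 1
    counts = Counter(masked[i:i + k] for i in range(max(total, 0)))
    keys = ("".join(p) for p in product("GC*", repeat=k))
    return {key: (counts[key] / total if total > 0 else 0.0) for key in keys}


if __name__ == "__main__":
    seq = "ATGGCGTACCGGATTTGGGC"  # made-up sequence, for illustration only
    for name, window in split_windows(seq).items():
        print(name, round(gc_skew(window), 3))
    print(kgcc(seq, k=3)["*GG"])
```

With k = 3 the composition vector has 3^3 = 27 entries, and with k = 4 it has 81; the coupling terms in panels (A) to (D) would be built on top of such per-tuple values and are not reproduced here.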
Figure 2. A flowchart showing how the iRO-PseKGCC predictor works.
The results of the iRO-PseKGCC predictor and its comparison with iRO-3wPseKNC on the two benchmark datasets (cf. Equation 1), obtained by using the jackknife test.
| Species | Method | Acc(%) | MCC | Sn(%) | Sp(%) | AUC |
|---|---|---|---|---|---|---|
| Saccharomyces cerevisiae | iRO-PseKGCC | 76.46 | 0.5298 | 73.90 | 78.13 | 0.8129 |
| | iRO-3wPseKNC | 72.95 | 0.4594 | 70.67 | 75.22 | 0.8084 |
| Pichia pastoris | iRO-PseKGCC | 74.22 | 0.4844 | 74.51 | 73.93 | 0.8002 |
| | iRO-3wPseKNC | 71.10 | 0.4222 | 69.93 | 72.28 | 0.7962 |
The parameters are listed in .
The predictor reported in (Liu et al., 2018b) with parameters ε = 0.25, δ = 0.85, k = 5, λ = 6, w = 0.3, and = 700.
The predictor reported in (Liu et al., 2018b) with parameters ε = 0.15, δ = 0.55, k = 4, λ = 9, w = 0.3, and = 800.
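The Acc, MCC, Sn, and Sp columns reported above follow the usual confusion-matrix definitions, and a jackknife test corresponds to leave-one-out validation. The Python sketch below shows one way these could be computed. The nearest-neighbour classifier and the toy data are hypothetical stand-ins chosen only to keep the example self-contained; they are not the random forest or the benchmark datasets used in the study.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Return Sn, Sp, Acc (as fractions) and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    acc = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


def jackknife_counts(samples, labels, fit_predict):
    """Leave-one-out: hold out each sample once, train on the rest, predict it."""
    tp = fn = tn = fp = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predicted = fit_predict(train_x, train_y, samples[i])
        if labels[i] == 1:
            tp += predicted == 1
            fn += predicted != 1
        else:
            tn += predicted == 0
            fp += predicted != 0
    return tp, fn, tn, fp


def nearest_neighbour(train_x, train_y, query):
    """Hypothetical stand-in classifier: label of the closest training value."""
    best = min(range(len(train_x)), key=lambda j: abs(train_x[j] - query))
    return train_y[best]


if __name__ == "__main__":
    xs = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2]  # made-up one-dimensional features
    ys = [0, 0, 0, 1, 1, 1]              # 1 = origin, 0 = non-origin
    print(classification_metrics(*jackknife_counts(xs, ys, nearest_neighbour)))
```

Multiplying Sn, Sp, and Acc by 100 gives the percentage form used in the table; MCC stays on its native scale from -1 to 1.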
The top 10 most important features of the top two performing RF-based predictors on the two benchmark datasets (cf. Equation 1).
| Rank | Feature (Saccharomyces cerevisiae) | Window (Saccharomyces cerevisiae) | MDA (%) (Saccharomyces cerevisiae) | Feature (Pichia pastoris) | Window (Pichia pastoris) | MDA (%) (Pichia pastoris) |
|---|---|---|---|---|---|---|
| 1 | *** | Rear window | 20.49 | ***** | Middle window | 15.89 |
| 2 | *** | Middle window | 19.62 | ****G | Middle window | 5.69 |
| 3 | *GG | Rear window | 9.04 | G**** | Middle window | 5.38 |
| 4 | GG* | Rear window | 8.35 | *C*** | Middle window | 5.23 |
| 5 | *GG | Middle window | 8.26 | *G*** | Middle window | 5.14 |
| 6 | λ = 1 | Rear window | 7.67 | *CGCG | Middle window | 3.99 |
| 7 | GG* | Middle window | 7.45 | ****C | Middle window | 3.94 |
| 8 | CC* | Middle window | 7.31 | ***G* | Middle window | 3.77 |
| 9 | G*G | Rear window | 6.64 | *C*GG | Middle window | 3.47 |
| 10 | λ = 2 | Rear window | 6.12 | C**G* | Middle window | 3.40 |
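The MDA (mean decrease in accuracy) values above rank features by how much accuracy is lost when a feature's information is destroyed. A minimal, model-agnostic way to illustrate the idea is permutation importance: shuffle one feature column, re-score the predictor, and average the accuracy drop. The sketch below is a simplification under stated assumptions; the threshold rule `toy_predict` and the data are hypothetical, and a random forest would typically compute MDA on out-of-bag samples rather than on the full dataset as done here.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(labels)


def permutation_mda(predict, rows, labels, n_repeats=20, seed=0):
    """Mean accuracy drop (in %) after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)
            # Rebuild every row with only feature j replaced by a shuffled value.
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(100.0 * sum(drops) / n_repeats)
    return importances


def toy_predict(row):
    """Hypothetical predictor that only looks at feature 0."""
    return int(row[0] > 0.5)


if __name__ == "__main__":
    # Made-up data: feature 0 determines the label, feature 1 is noise.
    rows = [[0.1, 5.0], [0.2, 1.0], [0.3, 4.0], [0.4, 2.0],
            [0.6, 3.0], [0.7, 5.0], [0.8, 1.0], [0.9, 2.0]]
    labels = [toy_predict(r) for r in rows]
    print(permutation_mda(toy_predict, rows, labels))
```

In this toy setting the informative feature receives a positive score while the ignored feature scores exactly zero, which mirrors how a high MDA in the table marks a feature the predictor relies on.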
Figure 3. A semi-screenshot showing the homepage of the iRO-PseKGCC web-server, which can be accessed at http://bliulab.net/iRO-PsekGCC/.