
iRSpot-TNCPseAAC: identify recombination spots with trinucleotide composition and pseudo amino acid components.

Wang-Ren Qiu¹, Xuan Xiao², Kuo-Chen Chou³.

Abstract

Meiosis and recombination are two opposite aspects that coexist in a DNA system. As a driving force for evolution that generates natural genetic variation, meiotic recombination plays a very important role in the formation of eggs and sperm. Interestingly, recombination does not occur randomly across a genome, but with higher probability in some genomic regions called "hotspots" and with lower probability in so-called "coldspots". With the ever-increasing amount of genome sequence data in the postgenomic era, computational methods for effectively identifying hotspots and coldspots are urgently needed, as they can provide timely insights into the mechanism of meiotic recombination and the process of genome evolution. To meet this need, we developed a new predictor called "iRSpot-TNCPseAAC", in which a DNA sample is formulated by combining its trinucleotide composition (TNC) with the pseudo amino acid components (PseAAC) of the protein translated from the DNA sample according to the genetic code. The former incorporates its local or short-range sequence order information, while the latter incorporates its global or long-range sequence order information. Compared with the best existing predictor in this area, iRSpot-TNCPseAAC achieved higher accuracy, Matthews correlation coefficient, and sensitivity, indicating that the new predictor may become a useful tool for identifying recombination hotspots and coldspots or, at least, a complementary tool to the existing methods. It has not escaped our notice that this approach of incorporating DNA sequence order information into a discrete model may also be used for many other genome analysis problems. The web-server for iRSpot-TNCPseAAC is available at http://www.jci-bioinfo.cn/iRSpot-TNCPseAAC. Furthermore, for the convenience of the vast majority of experimental scientists, a step-by-step guide is provided on how to use the web server to obtain the desired results without the need to follow the complicated mathematical equations.


Substances：
Amino Acids
Codon

Year: 2014 PMID： 24469313 PMCID： PMC3958819 DOI： 10.3390/ijms15021746

Source DB: PubMed Journal: Int J Mol Sci ISSN： 1422-0067 Impact factor: 5.923

Introduction

Meiosis and recombination are two indispensable aspects of cell reproduction and growth (Figure 1). The former is a special type of cell division by which the genome is divided in half to generate daughter cells that participate in sexual reproduction, while the latter produces single-strand ends that can invade the homologous chromosome [1].

Figure 1.

An illustration to show the process of meiosis and recombination in a DNA system. Adapted from [2].

Recombination is initiated by double-strand breaks (or broken DNA ends), and defects in meiosis may lead to male infertility [3-5]. Meiotic recombination ensures accurate chromosome segregation during the first meiotic division and provides a mechanism to increase genetic heterogeneity among the meiotic products. Accordingly, identification of recombination spots may provide very useful information for an in-depth understanding of the reproduction and growth of cells. In the past decades, many global mapping studies have been performed to map double-strand break sites on chromosomes [6-13]. These studies revealed the following about meiotic recombination events: (i) they generally concentrate in 1-2.5 kilobase regions; (ii) they do not occur randomly across the entire genome but at a higher rate in some regions and a lower rate in others, the former being so-called “hotspots” and the latter “coldspots”; (iii) they do not share a consensus sequence pattern. With the rapidly increasing number of genome sequences, it is important to address the following problem: given a genome sequence, how can we predict which parts of it are recombination hotspots and which are not? Based on the nucleotide sequence contents, Liu et al. [14] proposed a computational method to deal with this problem. However, their method took no sequence-order effect into account, and hence its prediction power might be limited. Indeed, one of the most important, but also most difficult, problems in computational biology is how to formulate a biological sequence with a discrete model or a vector, yet still keep considerable sequence order information.
This is because all the existing operation engines, such as covariance discriminant (CD) [15-20], neural network [21-23], support vector machine (SVM) [24-26], random forest [27,28], conditional random field [29], nearest neighbor (NN) [30,31], K-nearest neighbor (KNN) [32-34], OET-KNN (optimized evidence-theoretic k-nearest neighbors) [35-38], and Fuzzy K-nearest neighbor [39-43], can only handle vector samples, not sequence samples. However, a vector defined in a discrete model may completely lose all the sequence-order information. To avoid completely losing the sequence-order information for proteins, the pseudo amino acid composition [44,45] or Chou’s pseudo amino acid components (PseAAC) [46] was proposed. Ever since the concept of PseAAC was proposed in 2001 [44], it has penetrated into almost all the areas of computational proteomics, such as identifying cysteine S-nitrosylation sites in proteins [29], predicting bacterial virulent proteins [47], predicting antibacterial peptides [48], identifying bacterial secreted proteins [49], predicting supersecondary structure [50], predicting protein subcellular location [51-59], predicting membrane protein types [60,61], discriminating outer membrane proteins [62], identifying allergenic proteins [63], predicting metalloproteinase family [64], predicting protein structural class [65], identifying GPCRs (G protein-coupled receptors) and their types [66,67], identifying protein quaternary structural attributes [68,69], predicting protein submitochondria locations [70-73], identifying risk type of human papillomaviruses [74], identifying cyclin proteins [75], predicting GABA(A) receptor proteins [76], classifying amino acids [77], predicting the cofactors of oxidoreductases [78], predicting enzyme subfamily classes [79], detecting remote homologous proteins [80], analyzing genetic sequences [81], and predicting anticancer peptides [82], among many others (see the long list of papers cited in the References section of [83]). Recently, the concept of PseAAC was further extended to represent the feature vectors of nucleotides [15], as well as other biological samples [84-86]. As it has been widely and increasingly used, two powerful software packages, called “PseAAC-Builder” [87] and “propy” [88], were recently established for generating various special modes of Chou’s pseudo amino acid composition, in addition to the web-server “PseAAC” [89], built in 2008. Encouraged by the success of introducing PseAAC for proteins, Chen et al. [25] recently proposed the pseudo dinucleotide composition, or PseDNC, to represent DNA sequences for identifying recombination spots. By taking some sequence-order effects into account, PseDNC remarkably improved the prediction results compared with those of Liu et al. [14], whose method included no sequence-order information. However, in PseDNC only the correlations of dinucleotides along a DNA sequence were considered, and hence some important sequence order effects might be missed. The present study was initiated in an attempt to incorporate the long-range or global correlations of trinucleotides along a DNA sequence, in the hope of further improving the prediction quality in identifying recombination spots.
As demonstrated in a series of recent publications [24,42,90-92] and summarized in a comprehensive review [83], to establish a really useful statistical predictor for a biological system, one needs to consider the following procedures: (i) construct or select a valid benchmark dataset to train and test the predictor; (ii) formulate the biological samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (iii) introduce or develop a powerful algorithm (or engine) to operate the prediction; (iv) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; and (v) establish a user-friendly web-server for the predictor that is accessible to the public. Below, let us elaborate how to deal with these procedures one-by-one.

Results and Discussion

Benchmark Dataset

The benchmark dataset S used in this study was taken from Liu et al. [14]; it contains 490 recombination hotspots and 591 recombination coldspots, and can be formulated as:

$$S = S^{+} \cup S^{-} \tag{1}$$

where the subsets S+ and S− contain the hotspots and coldspots, respectively, while ∪ is the symbol for “union” in set theory. For the reader’s convenience, the 490 DNA sequences in S+ and the 591 sequences in S− are given in the Supplementary Information S1.

Formulate DNA Samples by Combining Trinucleotide Composition and Pseudo Amino Acid Components

Suppose a DNA sequence D with L nucleotides; i.e.,

$$\mathbf{D} = N_1 N_2 N_3 \cdots N_i \cdots N_L \tag{2}$$

where

$$N_i \in \{A, C, G, T\} \tag{3}$$

denotes the i-th (i = 1, 2, …, L) nucleotide in the DNA sequence. If the feature vector of the DNA sequence is formulated by its mononucleotide composition (MNC), we have:

$$\mathbf{D} = [\, f(A) \;\; f(C) \;\; f(G) \;\; f(T) \,]^{\mathbf{T}} \tag{4}$$

where $f(A)$, $f(C)$, $f(G)$, and $f(T)$ are the normalized occurrence frequencies of adenine (A), cytosine (C), guanine (G), and thymine (T), respectively, in the DNA sequence, and the symbol T is the transpose operator. As we can see from Equation (4), all the sequence order information is lost if MNC is used to represent a DNA sequence. If the dinucleotide composition (DNC) is used instead, the corresponding feature vector contains 4 × 4 = 16 components rather than the four shown in Equation (4):

$$\mathbf{D} = [\, f(AA) \;\; f(AC) \;\; f(AG) \;\; f(AT) \;\; \cdots \;\; f(TT) \,]^{\mathbf{T}} \tag{5}$$

where $f(AA)$ is the normalized occurrence frequency of AA in the DNA sequence; $f(AC)$, that of AC; $f(AG)$, that of AG; and so forth. If the sequence is represented by the trinucleotide composition (TNC), the corresponding feature vector contains 4 × 4 × 4 = 4³ = 64 components:

$$\mathbf{D} = [\, f(AAA) \;\; f(AAC) \;\; f(AAG) \;\; f(AAT) \;\; \cdots \;\; f(TTT) \,]^{\mathbf{T}} \tag{6}$$

where $f(AAA)$ is the normalized occurrence frequency of AAA in the DNA sequence; $f(AAC)$, that of AAC; and so forth. Generally speaking, if a DNA sequence is represented by the K-tuple nucleotide composition, the corresponding vector D contains $4^K$ components; i.e.,

$$\mathbf{D} = [\, f_1^{K\text{-tuple}} \;\; f_2^{K\text{-tuple}} \;\; \cdots \;\; f_i^{K\text{-tuple}} \;\; \cdots \;\; f_{4^K}^{K\text{-tuple}} \,]^{\mathbf{T}} \tag{7}$$

As we can see from Equations (5–7), as the tuple number increases, the base sequence-order information within a local or very short range is gradually included, but none of the global or long-range sequence-order information is reflected by the formulation. In computational proteomics we have faced exactly the same situation: although the dipeptide composition, tripeptide composition, and K-tuple peptide composition were used by many investigators to represent protein sequences by incorporating their local sequence order information [93-97], their global or long-range sequence order information still could not be reflected.
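The K-tuple nucleotide composition described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `ktuple_composition` is hypothetical, and counting overlapping windows with a step of one nucleotide (and skipping windows that contain symbols other than A, C, G, T) is an assumption, since the text does not state how occurrences are counted.

```python
from itertools import product


def ktuple_composition(seq, k):
    """Return the 4**k normalized k-tuple frequencies of a DNA sequence.

    Components are ordered lexicographically over A, C, G, T
    (AA, AC, AG, AT, CA, ... for k = 2).
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    # Assumption: overlapping windows, step 1; invalid windows are skipped.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


# k = 1, 2, 3 give the MNC, DNC, and TNC vectors with 4, 16, and 64 components.
tnc = ktuple_composition("ATGCGATACGCTTGA", 3)
print(len(tnc))
```

Whatever counting convention is used, the vector only records how often each k-tuple occurs, not where, which is why the long-range order information is lost.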
As mentioned above, to deal with this kind of problem in proteomics, the concept of PseAAC [44,45] was introduced. Stimulated by the PseAAC approach [44,45] in computational proteomics, below let us propose a novel feature vector to represent the DNA sequence (cf. Equation (2)) by combining its TNC (see Equation (6)) and the pseudo amino acid components of its translated protein chain. As is well known, three nucleotides encode an amino acid (see Figure 2). Thus, according to the conversion table from DNA codons to amino acids (Table 1), the DNA sequence in Equation (2) can be translated into a protein sequence expressed by:

Figure 2.

A graph to show how a DNA codon of three nucleotides is converted to an amino acid. The characters in the first three rings from the center represent four bases in DNA, while those in the fourth ring represent the single-letter codes of the 20 native amino acids in protein. The symbol * means the “Stop” sign.

Table 1.

The conversion code of the 64 trinucleotides in DNA to the 20 amino acids in protein.

Trinucleotide	Amino acid
AAA	Lys (K)
AAC	Asn (N)
AAG	Lys (K)
AAT	Asn (N)
ACA	Thr (T)
ACC	Thr (T)
ACG	Thr (T)
ACT	Thr (T)
AGA	Arg (R)
AGC	Ser (S)
AGG	Arg (R)
AGT	Ser (S)
ATA	Ile (I)
ATC	Ile (I)
ATG	Met (M)
ATT	Ile (I)
CAA	Gln (Q)
CAC	His (H)
CAG	Gln (Q)
CAT	His (H)
CCA	Pro (P)
CCC	Pro (P)
CCG	Pro (P)
CCT	Pro (P)
CGA	Arg (R)
CGC	Arg (R)
CGG	Arg (R)
CGT	Arg (R)
CTA	Leu (L)
CTC	Leu (L)
CTG	Leu (L)
CTT	Leu (L)
GAA	Glu (E)
GAC	Asp (D)
GAG	Glu (E)
GAT	Asp (D)
GCA	Ala (A)
GCC	Ala (A)
GCG	Ala (A)
GCT	Ala (A)
GGA	Gly (G)
GGC	Gly (G)
GGG	Gly (G)
GGT	Gly (G)
GTA	Val (V)
GTC	Val (V)
GTG	Val (V)
GTT	Val (V)
TAA	Stop
TAC	Tyr (Y)
TAG	Stop
TAT	Tyr (Y)
TCA	Ser (S)
TCC	Ser (S)
TCG	Ser (S)
TCT	Ser (S)
TGA	Stop
TGC	Cys (C)
TGG	Trp (W)
TGT	Cys (C)
TTA	Leu (L)
TTC	Phe (F)
TTG	Leu (L)
TTT	Phe (F)

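$$\mathbf{P} = A_1 A_2 A_3 \cdots A_{L'} \tag{8}$$

with

$$L' = \mathrm{Int}\left[\frac{L}{3}\right] \tag{9}$$

where $A_i$ is the i-th amino acid of the translated chain and the symbol “Int” is an integer truncation operator that takes the integer part of the number in the brackets immediately after it. Now, according to the formulation of Chou’s PseAAC approach [44,45], for the protein chain of Equation (8), we have:

$$\begin{cases}
\theta_1 = \dfrac{1}{L'-1} \displaystyle\sum_{i=1}^{L'-1} \Theta(A_i, A_{i+1}) \\
\theta_2 = \dfrac{1}{L'-2} \displaystyle\sum_{i=1}^{L'-2} \Theta(A_i, A_{i+2}) \\
\qquad \vdots \\
\theta_\lambda = \dfrac{1}{L'-\lambda} \displaystyle\sum_{i=1}^{L'-\lambda} \Theta(A_i, A_{i+\lambda})
\end{cases} \qquad (\lambda < L') \tag{10}$$

where $\theta_k$ (k = 1, 2, 3, ···, λ) is called the k-th tier correlation factor, which reflects the sequence order correlation between all the k-th most contiguous residues along a protein chain. In this study, the correlation function in Equation (10) is given by:

$$\Theta(A_i, A_j) = \frac{1}{6} \sum_{n=1}^{6} \left[ H_n(A_j) - H_n(A_i) \right]^2 \tag{11}$$

where $H_n(A_i)$ (n = 1, 2, ···, 6) are the six physicochemical properties of amino acid $A_i$; they are, respectively, hydrophobicity, hydrophilicity, side-chain mass, pK1 (α-COOH), pK2 (NH3), and PI. Note that before substituting these physicochemical values into Equation (11), they were all subjected to a standard conversion as described by the following equation:

$$H_n(A_i) = \frac{H_n^0(A_i) - \langle H_n^0 \rangle}{\mathrm{SD}(H_n^0)} \tag{12}$$

where $H_n^0(A_i)$ (n = 1, 2, ···, 6) is the n-th original physicochemical property value for the amino acid $A_i$ as given in Table 2, the symbols < and > mean taking the average of the quantity therein over the 20 native amino acids, and SD means the corresponding standard deviation. Listed in Table 3 are the converted values obtained by Equation (12); they have a zero mean over the 20 native amino acids and remain unchanged if put through the same conversion procedure again.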
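The translation step and the correlation factors of Equations (8) to (11) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the names `translate`, `correlation`, and `correlation_factors` are hypothetical, translation starts at the first nucleotide in a single reading frame, and stop codons (and codons containing invalid symbols) are simply skipped, because the text does not say how they are handled. The small property table holds the Table 3 rows for M, A, and K only.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = Stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate the first Int[L/3] codons of a DNA sequence (Equations (8), (9)).

    Assumption: stop codons and codons with non-ACGT symbols are skipped.
    """
    dna = dna.upper()
    residues = (CODON_TABLE.get(dna[3 * i:3 * i + 3], "*") for i in range(len(dna) // 3))
    return "".join(r for r in residues if r != "*")


def correlation(a, b, props):
    """Mean squared difference of the property values of residues a and b (Equation (11))."""
    ha, hb = props[a], props[b]
    return sum((x - y) ** 2 for x, y in zip(ha, hb)) / len(ha)


def correlation_factors(protein, props, lam):
    """Return [theta_1, ..., theta_lam] for a protein chain (Equation (10))."""
    n = len(protein)
    if lam >= n:
        raise ValueError("lambda must be smaller than the protein length")
    return [
        sum(correlation(protein[i], protein[i + k], props) for i in range(n - k)) / (n - k)
        for k in range(1, lam + 1)
    ]


# Converted property values (H1..H6) for three amino acids, copied from Table 3.
PROPS = {
    "M": (0.64, -0.56, 0.39, 0.44, -0.54, -0.29),
    "A": (0.62, -0.15, -1.55, 0.78, 0.77, -0.10),
    "K": (-1.50, 1.67, 0.33, 0.06, -1.15, 1.56),
}
protein = translate("ATGGCCAAATAA")  # codons ATG GCC AAA TAA
print(protein, correlation_factors(protein, PROPS, 2))
```

A full implementation would carry all 20 rows of Table 3 in `PROPS`; the three rows here are enough to exercise the functions on the short example sequence.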

Table 2.

List of the original values of the six physical-chemical properties for each of the 20 native amino acids.

Amino acid	Hydrophobicity (a) H₁⁰	Hydrophilicity (b) H₂⁰	Side-chain mass (c) H₃⁰	pK1 (d) H₄⁰	pK2 (e) H₅⁰	PI (f) H₆⁰
A	0.62	−0.5	15	2.35	9.87	6.11
C	0.29	−1.00	47	1.71	10.78	5.02
D	−0.90	3.00	59	1.88	9.60	2.98
E	−0.74	3.00	73	2.19	9.67	3.08
F	1.19	−2.50	91	2.58	9.24	5.91
G	0.48	0.00	1	2.34	9.60	6.06
H	−0.40	−0.50	82	1.78	8.97	7.64
I	1.38	−1.80	57	2.32	9.76	6.04
K	−1.50	3.00	73	2.20	8.90	9.47
L	1.06	−1.80	57	2.36	9.60	6.04
M	0.64	−1.30	75	2.28	9.21	5.74
N	−0.78	0.20	58	2.18	9.09	10.76
P	0.12	0.00	42	1.99	10.60	6.30
Q	−0.85	0.20	72	2.17	9.13	5.65
R	−2.53	3.00	101	2.18	9.09	10.76
S	−0.18	0.30	31	2.21	9.15	5.68
T	−0.05	−0.40	45	2.15	9.12	5.60
V	1.08	−1.50	43	2.29	9.74	6.02
W	0.81	−3.40	130	2.38	9.39	5.88
Y	0.26	−2.30	107	2.20	9.11	5.63

(a) Taken from [98];
(b) Taken from [99];
(c) Taken from any biochemistry text book;
(d) Taken from [100] for α-COOH;
(e) Taken from [100] for NH3;
(f) Taken from [101].

Table 3.

The corresponding values obtained by the standard conversion of Equation 12 on the original values in Table 2.

Amino acid	H₁	H₂	H₃	H₄	H₅	H₆
A	0.62	−0.15	−1.55	0.78	0.77	−0.10
C	0.29	−0.41	−0.52	−2.27	2.57	−0.64
D	−0.90	1.67	−0.13	−1.46	0.24	−1.65
E	−0.74	1.67	0.33	0.01	0.37	−1.61
F	1.19	−1.19	0.91	1.87	−0.48	−0.20
G	0.48	0.11	−2.00	0.73	0.24	−0.13
H	−0.40	−0.15	0.62	−1.94	−1.01	0.65
I	1.38	−0.82	−0.19	0.63	0.55	−0.14
K	−1.50	1.67	0.33	0.06	−1.15	1.56
L	1.06	−0.82	−0.19	0.82	0.24	−0.14
M	0.64	−0.56	0.39	0.44	−0.54	−0.29
N	−0.78	0.22	−0.16	−0.03	−0.77	2.20
P	0.12	0.11	−0.68	−0.94	2.21	−0.01
Q	−0.85	0.22	0.29	−0.08	−0.69	−0.33
R	−2.53	1.67	1.23	−0.03	−0.77	2.20
S	−0.18	0.27	−1.03	0.11	−0.65	−0.32
T	−0.05	−0.10	−0.58	−0.18	−0.71	−0.36
V	1.08	−0.67	−0.65	0.49	0.51	−0.15
W	0.81	−1.65	2.17	0.92	−0.18	−0.22
Y	0.26	−1.08	1.43	0.06	−0.73	−0.34
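The standard conversion of Equation (12) can be checked numerically on one column of Table 2. The sketch below applies it to the side-chain mass column; the function name `standard_conversion` is hypothetical, and the use of the sample standard deviation (denominator n − 1, `statistics.stdev`) is an assumption, chosen because it reproduces the Table 3 entries for A and W to two decimals.

```python
from statistics import fmean, stdev

# Side-chain mass column (H3) of Table 2.
SIDE_CHAIN_MASS = {
    "A": 15, "C": 47, "D": 59, "E": 73, "F": 91,
    "G": 1, "H": 82, "I": 57, "K": 73, "L": 57,
    "M": 75, "N": 58, "P": 42, "Q": 72, "R": 101,
    "S": 31, "T": 45, "V": 43, "W": 130, "Y": 107,
}


def standard_conversion(values):
    """Subtract the mean over the 20 amino acids and divide by the standard deviation.

    Assumption: SD is the sample standard deviation (n - 1 denominator).
    """
    raw = list(values.values())
    mean, sd = fmean(raw), stdev(raw)
    return {aa: (v - mean) / sd for aa, v in values.items()}


converted = standard_conversion(SIDE_CHAIN_MASS)
print(round(converted["A"], 2), round(converted["W"], 2))

# Converting a second time leaves the values unchanged, because the
# converted column already has zero mean and unit standard deviation.
again = standard_conversion(converted)
```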

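By combining the λ correlation factors with the 64 components in TNC (see Equation (6)), the DNA sequence is formulated by:

$$\mathbf{D} = [\, d_1 \;\; d_2 \;\; \cdots \;\; d_{64} \;\; d_{64+1} \;\; \cdots \;\; d_{64+\lambda} \,]^{\mathbf{T}} \tag{13}$$

where:

$$d_u = \begin{cases}
\dfrac{f_u}{\sum_{i=1}^{64} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 1 \le u \le 64 \\[2ex]
\dfrac{w\,\theta_{u-64}}{\sum_{i=1}^{64} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 64 + 1 \le u \le 64 + \lambda
\end{cases} \tag{14}$$

where $f_u$ is the u-th TNC component of Equation (6), $\theta_j$ is the j-th tier correlation factor of Equation (10), and w is the weight factor, which is determined by optimizing the outcome as will be mentioned later. The rationale for using Equation (13) to represent the DNA sequence is that the local or short-range sequence order effect is directly reflected by the occurrence frequencies of its 64 trinucleotides, while the global or long-range sequence order effect is indirectly reflected by the λ pseudo amino acid components of its translated protein chain. As three nucleotides encode an amino acid, this approach is both rational and natural.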
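Assembling the (64 + λ)-dimensional vector of Equation (13) from its two parts can be sketched as below. The function name `tncpseaac_vector` is hypothetical, and the normalization (each component divided by the sum of the 64 frequencies plus w times the sum of the correlation factors) is an assumption that follows the usual PseAAC form; the inputs in the example are toy values, not data from the paper.

```python
def tncpseaac_vector(tnc, thetas, w):
    """Combine 64 TNC frequencies with lambda correlation factors into one vector.

    Assumption: standard PseAAC-style normalization, so the components sum to 1.
    """
    if len(tnc) != 64:
        raise ValueError("expected 64 trinucleotide frequencies")
    denom = sum(tnc) + w * sum(thetas)
    return [f / denom for f in tnc] + [w * t / denom for t in thetas]


# Toy inputs: a uniform TNC vector and two made-up correlation factors.
vec = tncpseaac_vector([1 / 64] * 64, [0.5, 0.25], w=1.1)
print(len(vec))
```

A larger w gives the pseudo components more weight relative to the trinucleotide frequencies, which is why w is treated as a parameter to be optimized.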

Use Support Vector Machine as an Operation Engine

Support vector machine (SVM) has been widely used to make classification predictions (see, e.g., [24,102-105]). The basic idea of SVM is to transform the input data into a high-dimensional feature space and then determine the optimal separating hyperplane. A brief introduction to the formulation of SVM was given in [103,106]. Here, the DNA samples as formulated by Equation (13) were used as inputs for the SVM. Its software was downloaded from the LIBSVM package [107,108], which provides a simple interface. Thanks to this advantage, users can easily perform classification prediction by properly selecting the built-in parameters C and γ. In order to maximize the performance of the SVM algorithm, the two parameters C and γ used with the RBF kernel were preliminarily optimized through a grid search strategy in this study. To obtain the optimized parameters, the search function “SVMcgForClass” was downloaded from http://www.matlabsky.com. The predictor obtained via the aforementioned procedures is called iRSpot-TNCPseAAC, where “i” means “identify”, “RSpot” means “Recombination Spots”, and “TNCPseAAC” means a combination of “Tri-Nucleotide Composition” and “Pseudo Amino Acid Components”. To objectively evaluate the quality of a new predictor, one should use proper metrics [109] and rigorous cross-validation [83] to test it. Below, let us address these problems.
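The grid search over C and γ can be sketched independently of any particular SVM library. The code below is not the “SVMcgForClass” function and does not call LIBSVM; `rbf_kernel`, `grid_search`, and `toy_score` are hypothetical names. It assumes a grid of powers of two, and it assumes that `score_fn` would, in practice, train an SVM with the given (C, γ) and return a cross-validated success rate. Here a toy scoring function stands in for that step.

```python
import math
from itertools import product


def rbf_kernel(x, y, gamma):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def grid_search(score_fn, c_exponents, gamma_exponents):
    """Try every (C, gamma) = (2**a, 2**b) pair and return (best_score, C, gamma)."""
    best = None
    for ce, ge in product(c_exponents, gamma_exponents):
        c, gamma = 2.0 ** ce, 2.0 ** ge
        score = score_fn(c, gamma)
        if best is None or score > best[0]:
            best = (score, c, gamma)
    return best


def toy_score(c, gamma):
    """Stand-in for a cross-validated accuracy; peaks at C = 32, gamma = 0.5 by construction."""
    return -abs(math.log2(c) - 5) - abs(math.log2(gamma) + 1)


# Exponent ranges are an assumption, not taken from the paper.
best = grid_search(toy_score, range(-5, 16), range(-15, 4))
print(best[1], best[2])
```

With a real scoring function, each grid point costs one full cross-validation run, so the grid resolution is a trade-off between search quality and computation time.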

Four Different Metrics for Measuring the Prediction Quality

In the literature, the following metrics are often used for examining the performance quality of a predictor:

$$\begin{cases}
\mathrm{Sn} = \dfrac{TP}{TP + FN} \\[1.5ex]
\mathrm{Sp} = \dfrac{TN}{TN + FP} \\[1.5ex]
\mathrm{Acc} = \dfrac{TP + TN}{TP + TN + FP + FN} \\[1.5ex]
\mathrm{MCC} = \dfrac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\end{cases} \tag{15}$$

where TP represents the number of true positives; TN, the number of true negatives; FP, the number of false positives; FN, the number of false negatives; Sn, the sensitivity; Sp, the specificity; Acc, the accuracy; MCC, the Matthews correlation coefficient. To most biologists, however, the four metrics as formulated in Equation (15) are not very intuitive or easy to understand, particularly the Matthews correlation coefficient. Here let us adopt the formulation proposed recently [25,29] based on Chou’s symbols and definitions [110]; i.e.,

$$\begin{cases}
\mathrm{Sn} = 1 - \dfrac{N_{-}^{+}}{N^{+}} \\[1.5ex]
\mathrm{Sp} = 1 - \dfrac{N_{+}^{-}}{N^{-}} \\[1.5ex]
\mathrm{Acc} = 1 - \dfrac{N_{-}^{+} + N_{+}^{-}}{N^{+} + N^{-}} \\[1.5ex]
\mathrm{MCC} = \dfrac{1 - \left( \dfrac{N_{-}^{+}}{N^{+}} + \dfrac{N_{+}^{-}}{N^{-}} \right)}{\sqrt{\left( 1 + \dfrac{N_{+}^{-} - N_{-}^{+}}{N^{+}} \right) \left( 1 + \dfrac{N_{-}^{+} - N_{+}^{-}}{N^{-}} \right)}}
\end{cases} \tag{16}$$

where $N^{+}$ is the total number of hotspot samples investigated and $N_{-}^{+}$ is the number of hotspot samples incorrectly predicted as coldspots; $N^{-}$ is the total number of coldspot samples investigated and $N_{+}^{-}$ is the number of coldspot samples incorrectly predicted as hotspots [111]. Now, it can be clearly seen from Equation (16) that when $N_{-}^{+} = 0$, meaning none of the hotspots was incorrectly predicted to be a coldspot, we have the sensitivity Sn = 1. When $N_{-}^{+} = N^{+}$, meaning that all the hotspots were incorrectly predicted to be coldspots, we have Sn = 0. Likewise, when $N_{+}^{-} = 0$, meaning none of the coldspots was incorrectly predicted to be a hotspot, we have the specificity Sp = 1; whereas when $N_{+}^{-} = N^{-}$, meaning all the coldspots were incorrectly predicted as hotspots, we have Sp = 0. When $N_{-}^{+} = N_{+}^{-} = 0$, meaning that none of the hotspots in the positive dataset and none of the coldspots in the negative dataset was incorrectly predicted, we have the overall accuracy Acc = 1 and MCC = 1; when $N_{-}^{+} = N^{+}$ and $N_{+}^{-} = N^{-}$, meaning that all the hotspots in the positive dataset and all the coldspots in the negative dataset were incorrectly predicted, we have Acc = 0 and MCC = −1; whereas when $N_{-}^{+} = N^{+}/2$ and $N_{+}^{-} = N^{-}/2$, we have Acc = 0.5 and MCC = 0, meaning no better than a random guess. As we can see from the above discussion based on Equation (16), the meanings of sensitivity, specificity, overall accuracy, and Matthews correlation coefficient become much more intuitive and easier to understand. It should be pointed out that the metrics given in Equation (15) and Equation (16) are valid only for single-label systems, as in the current case. For multi-label systems, whose emergence has become increasingly frequent in cellular molecular systems [112-118] and biomedical systems [43,119], a completely different set of metrics as defined in [109] is needed.
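The four metrics of Equation (16), and their link to the confusion-matrix counts of Equation (15), can be written out as a short sketch. The function names are hypothetical, the counts in the example are toy numbers rather than results from the paper, and returning NaN for MCC when the denominator vanishes is a choice made here, not something the text specifies.

```python
import math


def metrics(n_pos, n_neg, pos_wrong, neg_wrong):
    """Sn, Sp, Acc, MCC in the form of Equation (16).

    n_pos, n_neg         : numbers of hotspot and coldspot samples
    pos_wrong, neg_wrong : hotspots predicted as coldspots, and vice versa
    """
    sn = 1 - pos_wrong / n_pos
    sp = 1 - neg_wrong / n_neg
    acc = 1 - (pos_wrong + neg_wrong) / (n_pos + n_neg)
    radicand = (1 + (neg_wrong - pos_wrong) / n_pos) * (1 + (pos_wrong - neg_wrong) / n_neg)
    if radicand <= 0:
        mcc = float("nan")  # MCC is undefined when everything is assigned to one class
    else:
        mcc = (1 - (pos_wrong / n_pos + neg_wrong / n_neg)) / math.sqrt(radicand)
    return sn, sp, acc, mcc


def metrics_from_confusion(tp, tn, fp, fn):
    """Same metrics from TP, TN, FP, FN, using N+ = TP + FN and N- = TN + FP."""
    return metrics(tp + fn, tn + fp, fn, fp)


print(metrics(100, 100, 0, 0))      # every sample predicted correctly
print(metrics(100, 100, 100, 100))  # every sample predicted incorrectly
print(metrics(100, 100, 50, 50))    # half of each class predicted incorrectly
```

The three calls correspond to the limiting cases discussed above: perfect prediction, completely wrong prediction, and prediction no better than a random guess.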

Evaluate the Anticipated Success Rates by Jackknife Tests

The following three cross-validation methods are often used in statistical prediction to evaluate the anticipated accuracy of a predictor: the independent dataset test, the subsampling (K-fold cross-validation) test, and the jackknife test [120]. However, as elucidated in a review article [83], among the three methods the jackknife test is deemed the least arbitrary and most objective, because it always yields a unique outcome for a given benchmark dataset; hence it has been increasingly used and widely recognized by investigators for examining the accuracy of various predictors [48,60,63,65,69,76,121,122]. Accordingly, in this study we also used the results obtained by jackknife tests to optimize the uncertain parameters and to compare with the other predictors in this area.
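The jackknife (leave-one-out) procedure can be sketched generically: each sample is held out in turn, the predictor is trained on the remaining samples, and the held-out sample is predicted. In the sketch below the names `jackknife` and `nearest_neighbor` are hypothetical, the data are toy values, and a simple nearest-neighbor rule stands in for the SVM engine used in the paper.

```python
def jackknife(samples, labels, train_and_predict):
    """Leave-one-out test: predict each sample from a model trained on all the others."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predictions.append(train_and_predict(train_x, train_y, samples[i]))
    return predictions


def nearest_neighbor(train_x, train_y, query):
    """Stand-in predictor: label of the closest training sample (squared Euclidean distance)."""
    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    best = min(range(len(train_x)), key=lambda j: dist(train_x[j], query))
    return train_y[best]


# Toy one-dimensional feature vectors; 0 = coldspot, 1 = hotspot.
samples = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
labels = [0, 0, 0, 1, 1, 1]
predictions = jackknife(samples, labels, nearest_neighbor)
print(predictions)
```

Because every sample is held out exactly once and no random split is involved, repeating the test on the same benchmark dataset gives the same outcome, which is the uniqueness property referred to above.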

Experimental Section

The results obtained with iRSpot-TNCPseAAC on the benchmark dataset S of Supplementary Information S1 by the jackknife test are given in Table 4, where for facilitating comparison the corresponding results by the iRSpot-PseDNC [25] on the same benchmark dataset are also given.

Table 4.

A comparison of iRSpot-TNCPseAAC with the best existing method.

Predictor	Test method	Sn (%)	Sp (%)	Acc (%)	MCC
iRSpot-PseDNC a	Jackknife	73.06	89.49	82.04	0.638
iRSpot-TNCPseAAC b	Jackknife	87.14	79.59	83.72	0.671

a From [25];

b This paper with λ = 5, w = 1.1, C = 32 and γ = 0.5 for the LIBSVM operation engine [107,108].

As we can clearly see from the table, the iRSpot-TNCPseAAC predictor is superior to iRSpot-PseDNC [25] in three of the four metrics defined by Equation (16); i.e., it yields higher accuracy Acc, higher Matthews correlation coefficient MCC, and higher sensitivity Sn, although its specificity Sp is lower. Therefore, it is anticipated that the new predictor will become a useful tool for identifying recombination spots in DNA, or at the very least a complementary tool to iRSpot-PseDNC, the best existing prediction method in this area.

Conclusions

The above results also indicate that it is a feasible and promising approach to extend the concept of pseudo amino acid composition [44,45,123] developed in computational proteomics to the area of computational genomics. As shown by Equation (13) and the related equations defining its 64 + λ components, each of the DNA samples investigated in this study was formulated by a combination of its trinucleotide composition (TNC) with the pseudo amino acid components (PseAAC) derived from the protein translated from the DNA sample according to the genetic code. The former can better incorporate its local or short-range sequence order information in comparison with the dinucleotide composition (DNC) used in iRSpot-PseDNC [25], while the latter can incorporate its global or long-range sequence order effects in a more natural or logical manner. Accordingly, it is anticipated that the idea of extending Chou’s pseudo amino acid composition [44,45,123] for protein sequences to the pseudo oligonucleotide composition for DNA or RNA sequences may also be used to deal with many other genome analysis problems.

Web Server and User Guide

To enhance the value of its practical applications, a web-server for the iRSpot-TNCPseAAC predictor was established. Moreover, for the convenience of the vast majority of experimental scientists, a step-by-step guide is provided here on how to use the web server to get the desired results without the need to follow the mathematical equations, which were presented only for the integrity of the development of the predictor.

Step 1. Open the web server at http://www.jci-bioinfo.cn/iRSpot-TNCPseAAC and you will see the top page of the predictor on your computer screen, as shown in Figure 3. Click on the Read Me button to see a brief introduction to the iRSpot-TNCPseAAC predictor and the caveats when using it.

Figure 3.

A semi-screenshot for the top page of the web-server iRSpot-TNCPseAAC at http://www.jci-bioinfo.cn/iRSpot-TNCPseAAC.

Step 2. Either type or copy/paste the query DNA sequences into the input box at the center of Figure 3. The input sequences should be in FASTA format. For examples of sequences in FASTA format, click the Example button right above the input box.

Step 3. Click on the Submit button to see the predicted result. For example, if you use the three query DNA sequences in the Example window as the input, after clicking the Submit button you will see the following message on your computer screen: the outcome for the 1st query sample is “recombination hotspot”; the outcome for the 2nd query sample is “recombination coldspot”. These results are fully consistent with the experimental observations as summarized in the Supplementary Information S1. However, no result is given for the 3rd query sample because it contains some invalid characters, as warned in the output screen. It takes a few seconds for the predicted result to appear on your computer screen; the more query sequences there are and the longer each sequence is, the more time is usually needed.

Step 4. As shown on the lower panel of Figure 3, you may also choose the batch prediction by entering your e-mail address and your desired batch input file (in FASTA format) via the “Browse” button. To see a sample batch input file, click on the Batch-example button. After clicking the Batch-submit button, you will see “Your batch job is under computation; once the results are available, you will be notified by e-mail.”

Step 5. Click the Supporting Information button to download the benchmark dataset used to train and test the iRSpot-TNCPseAAC predictor.

Step 6. Click the Citation button to find the relevant papers that document the detailed development and algorithm of iRSpot-TNCPseAAC.

The benchmark dataset S consists of a positive dataset S+ and a negative dataset S−. The positive dataset contains 490 recombination hotspots, while the negative dataset contains 591 recombination coldspots.
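Before submitting sequences as described in Steps 2 and 3, it can be useful to check locally that the input is well-formed FASTA and contains only valid nucleotide symbols. The sketch below is an assumption about what such a check might look like; it is not the web server's own validation, and the names `parse_fasta` and `invalid_characters` as well as the example records are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def invalid_characters(sequence):
    """Return the sorted symbols in a sequence that are not A, C, G, or T."""
    return sorted(set(sequence) - set("ACGT"))


# Hypothetical query text with one clean record and one containing invalid symbols.
example = ">query_1\nACGTACGT\nacgt\n>query_2\nACGXN\n"
for name, seq in parse_fasta(example):
    bad = invalid_characters(seq)
    print(name, "ok" if not bad else "invalid characters: " + ", ".join(bad))
```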
