
sgRNA-PSM: Predict sgRNAs On-Target Activity Based on Position-Specific Mismatch.

Abstract

As a key technique for the CRISPR-Cas9 system, identification of single-guide RNAs (sgRNAs) on-target activity is critical for both theoretical research (investigation of RNA functions) and real-world applications (genome editing and synthetic biology). Because of its importance, several computational predictors have been proposed to predict sgRNAs on-target activity. All of these methods have clearly contributed to the developments of this very important field. However, they are suffering from certain limitations. We proposed two new methods called "sgRNA-PSM" and "sgRNA-ExPSM" for sgRNAs on-target activity prediction via capturing the long-range sequence information and evolutionary information using a new way to reduce the dimension of the feature vector to avoid the risk of overfitting. Rigorous leave-one-gene-out cross-validation on a benchmark dataset with 11 human genes and 6 mouse genes, as well as an independent dataset, indicated that the two new methods outperformed other competing methods. To make it easier for users to use the proposed sgRNA-PSM predictor, we have established a corresponding web server, which is available at http://bliulab.net/sgRNA-PSM/.


Keywords: XGBoost; position-specific mismatch; sgRNAs on-target activity

Year: 2020 PMID: 32199128 PMCID: PMC7083770 DOI: 10.1016/j.omtn.2020.01.029

Source DB: PubMed Journal: Mol Ther Nucleic Acids ISSN: 2162-2531 Impact factor: 8.886

Introduction

Three main genome editing tools, zinc-finger nucleases (ZFNs), transcription activator-like effector nucleases (TALENs), and CRISPR-Cas9 RNA-guided technologies, can be used to recognize and cleave specific DNA sequences. Compared with ZFNs and TALENs, CRISPR-Cas9 has been widely applied in various cell types and organisms in recent years. In the type II CRISPR-Cas9 system, a single-guide RNA (sgRNA) directs the Cas9 protein to the target site to cleave the DNA target sequence, and the sgRNA should be designed with a sequence of around 20 nt that is complementary to the guide sequence in the DNA target sequence. Rational design of sgRNAs is therefore a crucial part of CRISPR-Cas9, and the prediction of sgRNA on-target activity is very important. Researchers have proposed several computational methods for sgRNA on-target activity prediction. Most of them treat the problem as a binary classification task or a regression task, and the computational predictors were constructed with machine learning algorithms. These approaches differ in their feature extraction methods and machine learning techniques, such as gradient boosting regression (GBR), support vector machines (SVMs) [9–18], ensemble classifiers [19–24], and deep learning [25–32], among others. As shown in the aforementioned studies, discriminative features are critical for constructing the computational predictors. Accordingly, some features have been proposed to capture the characteristics of sgRNAs. For example, because the position of a nucleotide in an sgRNA affects its activity, the position-specific (PS) feature was proposed to incorporate these sequence patterns; it has been used in ge-CRISPR, Azimuth, and CRISPRpred. Kaur et al. proposed an integrated pipeline called ge-CRISPR to predict and analyze the genome editing efficiency of sgRNAs.
Azimuth employed GBR to train the model, achieving state-of-the-art performance. CRISPRpred is another efficient predictor, combining the discriminative features selected by random forest (RF) with SVM regression. All of the aforementioned predictors have obtained encouraging results and played a role in the development of computational predictors for sgRNA on-target activity prediction, but they also have limitations. Further work is required for the following reasons: (1) these predictors can only consider the short-range sequence information of the DNA sequences, because longer k-mers cause the "high-dimension disaster"; and (2) these predictors fail to incorporate evolutionary information, ignoring information between non-consecutive nucleotides. To solve these problems, we proposed a novel feature, PS mismatch (PSM), which shares the advantages of both the PS and mismatch features. RNA sequence evolution involves changes of single nucleotides, insertions and deletions of several nucleotides, and other factors. As these changes accumulate over a long period of evolution, the similarity between the initial and the final RNA sequences gradually decreases, yet the sequences still have many features in common. PSM extracts this evolutionary information from RNA sequences by allowing mismatches to occur in k-mers at specific positions. PSM has been applied to predict sgRNA on-target activity, and two predictors were established, called "sgRNA-PSM" and "sgRNA-ExPSM" (sgRNA-extended PSM). Finally, a corresponding web server has been constructed (http://bliulab.net/sgRNA-PSM/).

Results and Discussion

Parameter Optimization

According to Equations 9 and 10, there are two parameters in PSM, k and m, and three parameters in the XGBoost algorithm, C, R, and F. These parameters were optimized according to the AUC (area under the curve) by using leave-one-gene-out cross-validation on the benchmark dataset S (cf. Equation 3). In this study, the five parameters were searched over the ranges given in Equation 1. The final optimal values, selected by the AUC on the benchmark dataset S (cf. Equation 3), were k = 5, m = 2, C = 3, R = 0.1, and F = 800.

Feature Selection and Analysis

To remove redundant features and reduce the dimension of the resulting feature vectors, we used SelectKBest in scikit-learn to select the top n features with the highest scores under the scoring function f_regression, which lowers the risk of overfitting at low computational cost. We investigated the influence of n (the number of selected features) on the predictive performance of sgRNA-PSM. The results are shown in Figure 1, from which we can see that the value of n has little impact on the performance and that sgRNA-PSM achieves its best performance when n is equal to 2,000.
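The scoring step can be illustrated without any third-party package. The sketch below is an illustration only, not the authors' code: it ranks columns by a univariate F-statistic derived from the Pearson correlation with the target, which is the kind of score f_regression produces, and keeps the top n. The function names `f_scores` and `select_top_n` are hypothetical, and giving a constant column a score of 0 is a choice made here for simplicity.

```python
import math


def f_scores(X, y):
    """Univariate F-statistic of every column of X against the target y."""
    n = len(y)
    y_mean = sum(y) / n
    y_c = [v - y_mean for v in y]
    y_ss = sum(v * v for v in y_c)
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        c_mean = sum(col) / n
        c_c = [v - c_mean for v in col]
        c_ss = sum(v * v for v in c_c)
        if c_ss == 0 or y_ss == 0:
            scores.append(0.0)  # assumption: a constant column scores 0
            continue
        r = sum(a * b for a, b in zip(c_c, y_c)) / math.sqrt(c_ss * y_ss)
        r2 = min(r * r, 1.0 - 1e-12)  # avoid dividing by zero when |r| = 1
        scores.append(r2 / (1.0 - r2) * (n - 2))
    return scores


def select_top_n(X, y, n_keep):
    """Return the kept column indices (ascending) and the reduced matrix."""
    scores = f_scores(X, y)
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    keep = sorted(order[:n_keep])
    return keep, [[row[j] for j in keep] for row in X]
```

In practice the same selection would be a single SelectKBest call in scikit-learn; the hand-written version only makes visible what the F_score column of Table 1 measures.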

Figure 1

Graph Showing AUC Scores of the sgRNA-PSM Predictors with Different n Values, where n Denotes the Number of Selected Features

The importance of each feature can be analyzed based on its F_score. To explore why the proposed sgRNA-PSM predictor works so well, we analyzed the contribution of each feature. Table 1 lists the 10 most important features, from which we can see the following. (1) The top 9 most important features are all generated from sequence positions 23 to 30. In the CRISPR-Cas9 system, the DNA target sequence is composed of two parts: one is the guide sequence, and the other is the protospacer adjacent motif (PAM). The guide sequence is complementary to a sequence of around 20 bp in the sgRNA, and the PAM is the short sequence immediately downstream of the guide sequence that is recognized by the Cas9 protein. In the benchmark dataset S (cf. Equation 3), the guide sequence occupies sequence positions 5 to 24, and the PAM occupies positions 25 to 27. Therefore, the top 9 most important features all cover the PAM, indicating that the proposed PSM is able to incorporate this important sequence pattern. (2) The PAM is composed of any nucleotide in sequence position 25 followed by GG in positions 26 and 27, and 7 of the 10 most important features capture this sequence pattern.
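The position bookkeeping in this paragraph can be made concrete with a short sketch. It assumes the 30-nt layout described for the benchmark dataset (guide at positions 5 to 24, PAM at positions 25 to 27, all 1-based); the function names are hypothetical and are not part of any published tool.

```python
def split_target(seq30):
    """Split a 30-nt target context into (guide, PAM) using 1-based positions 5-24 and 25-27."""
    if len(seq30) != 30:
        raise ValueError("expected a 30-nt sequence")
    seq30 = seq30.upper()
    guide = seq30[4:24]  # positions 5..24
    pam = seq30[24:27]   # positions 25..27
    return guide, pam


def has_ngg_pam(seq30):
    """True when positions 26 and 27 are both G (any nucleotide at position 25)."""
    _, pam = split_target(seq30)
    return pam[1:] == "GG"


def window_overlaps_pam(start, k):
    """True when the k-mer covering positions start..start+k-1 touches positions 25-27."""
    end = start + k - 1
    return start <= 27 and end >= 25
```

With k = 5, the windows 23–27, 24–28, 25–29, and 26–30 from Table 1 all overlap the PAM, whereas 20–24 does not, which matches the observation that only the tenth feature lies outside the PAM region.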

Table 1

The 10 Most Important Features in the sgRNA-PSM Predictor

No.	PSM Feature (a)	Sequence Position (b)	F_score (c)
1	GGG	23–27	185.6
2	GGG	24–28	185.6
3	CGG	24–28	136.2
4	C**GG	24–28	136.2
5	CGG	23–27	129.0
6	CGG	24–28	129.0
7	**GGG	24–28	128.0
8	GGG	25–29	128.0
9	GGG**	26–30	128.0
10	**TTC	20–24	113.0

(a) Parameters were k = 5, m = 2.

(b) The sequence position of mismatches.

(c) Calculated by F regression.

Comparison with Other Methods

The results obtained by sgRNA-PSM and sgRNA-ExPSM on the benchmark dataset S are listed in Table 2, from which we can see that the AUC achieved by sgRNA-PSM was 73.8%. The AUC achieved by sgRNA-ExPSM was even better, at 74.4%. This is reasonable because the amino acid cut position and percent peptide features referred to in Equation 11 are complementary to the PSM features in Equation 9. The PSM feature vector reflects long-range sequence information, while the amino acid cut position and percent peptide are guide-positional features that describe where the cleavage site of the sgRNA lies relative to the start of the protein-coding region of the gene.

Table 2

List of AUC Scores Obtained by Various Methods via the Leave-One-Gene-Out Cross-Validation on the Same Benchmark Dataset S (cf. Equation 3)

Methods	AUC (%) (a)
Azimuth (b)	71.9
ge-CRISPR (c)	71.7
CRISPRpred (d)	71.6
sgRNA-PSM (e)	73.8
sgRNA-ExPSM (f)	74.4

(a) AUC means the area under the ROC curve; a better predictor corresponds to a larger AUC value.

(b) Results obtained by in-house implementation from Doench et al.

(c) Results obtained by in-house implementation from Kaur et al.

(d) Results obtained by in-house implementation from Rahman and Rahman.

(e) For the proposed predictor in this article, see Equations 9 and 10 with k = 5, m = 2, C = 3, R = 0.1, F = 800.

(f) For the proposed predictor in this article, see Equations 10 and 11 with k = 5, m = 2, C = 3, R = 0.1, F = 800.

We then compared sgRNA-PSM and sgRNA-ExPSM with ge-CRISPR, Azimuth, and CRISPRpred. All of these predictors were examined by leave-one-gene-out cross-validation on the benchmark dataset S (cf. Equation 3). To facilitate comparison, the corresponding results obtained by the ge-CRISPR, Azimuth, and CRISPRpred predictors are also given in Table 2 and Figure 2. Figure 2 shows the receiver operating characteristic (ROC) curves of the five predictors. The diagonal from the point (0,0) to (1,1) corresponds to a random guess, and a better predictor corresponds to a larger AUC.

Figure 2

Graph Showing the Predictive Quality of the Aforementioned Predictors via the ROC Curves

The corresponding AUC scores are 0.717, 0.716, 0.719, 0.738, and 0.744 for the ge-CRISPR, CRISPRpred, Azimuth, sgRNA-PSM, and sgRNA-ExPSM predictors, respectively, via the leave-one-gene-out cross-validation on the same benchmark dataset S.

The following conclusions can be drawn from Table 2 and Figure 2. (1) The AUC score achieved by the proposed sgRNA-PSM predictor is higher than that of ge-CRISPR, and even higher than those of Azimuth and CRISPRpred, which are based on wet-experiment features such as amino acid cut position and percent peptide. Please note that these two features are not sequence-based features, and they are often unavailable. (2) The sgRNA-ExPSM predictor outperforms the sgRNA-PSM predictor by incorporating the amino acid cut position and percent peptide features. In addition, the sgRNA-PSM predictor was further compared with Azimuth and DeepCRISPR (pt+aug CNN) on the on-target dataset. To make a fair comparison, the sgRNA-PSM predictor was trained on the training set of the on-target dataset reported in Chuai et al. and tested on the independent test dataset for the hct116, hela, and hl60 cell types. The hek293t dataset reported in Doench et al. is a subset of our benchmark dataset S (cf. Equation 3); therefore, our method was not tested on the hek293t dataset again. For sgRNA-PSM, SelectKBest with the scoring function chi2 in scikit-learn was used to select 1,100 dimensions of the PSM features, which were then fed into XGBoost for classification. The predictive results of sgRNA-PSM, DeepCRISPR (pt+aug CNN), and Azimuth are shown in Table 3. As shown in this table, our method outperformed Azimuth and DeepCRISPR (pt+aug CNN) on the hct116 and hela cell types, and it is highly comparable to DeepCRISPR (pt+aug CNN) on the hl60 cell type.

Table 3

List of the AUC Scores Obtained by Various Methods on the On-Target Dataset Reported in Chuai et al.

Cell Type (a)	Methods	AUC (%)
hct116	Azimuth (b)	74.1
hct116	DeepCRISPR (pt+aug CNN) (c)	87.4
hct116	sgRNA-PSM (d)	91.7
hct116	Retrained sgRNA-PSM (e)	74.0
hela	Azimuth (b)	67.5
hela	DeepCRISPR (pt+aug CNN) (c)	78.2
hela	sgRNA-PSM (d)	82.8
hela	Retrained sgRNA-PSM (e)	72.1
hl60	Azimuth (b)	79.2
hl60	DeepCRISPR (pt+aug CNN) (c)	73.9
hl60	sgRNA-PSM (d)	77.6
hl60	Retrained sgRNA-PSM (e)	83.7

(a) The cell type of the independent test dataset.

(b, c) Results reported in Chuai et al.

(d) The sgRNA-PSM predictor trained with the dataset reported in Chuai et al.; see Equations 9 and 10 with k = 4, m = 2, C = 9, R = 0.05, F = 2,300.

(e) The sgRNA-PSM predictor trained with each of the three datasets (hct116, hela, and hl60).

To further explore why our method did not perform as well on the hl60 cell type, we retrained the sgRNA-PSM classifier with each of the three datasets (hct116, hela, and hl60). For each dataset, 20% of the samples were used as the test dataset, stratified by labels following Chuai et al., and the remaining 80% of the samples were used as the training dataset. The results are also listed in Table 3, from which we can see that the sgRNA-PSM trained with the hl60 dataset outperformed the corresponding classifier trained with the training data consisting of all four cell types, and it even outperformed Azimuth. The results are not surprising because the four cell types have different data distributions, so combining all four cell types to train a computational predictor introduces noise. Therefore, the overall performance of sgRNA-PSM is better than that of all of the other competing methods.
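The 80/20 split stratified by label can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure of Chuai et al.: the function name, the fixed seed, and the rounding rule for the per-class test size are all choices made here.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that each label keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_label[lab].append(idx)
    train, test = [], []
    for lab in sorted(by_label):
        idxs = by_label[lab]
        rng.shuffle(idxs)  # random order within one class
        n_test = round(len(idxs) * test_fraction)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)
```

Splitting each class separately keeps the ratio of high-activity to low-activity sgRNAs the same in the training and test parts, which matters when the classes are as imbalanced as they are here.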

Web Server and User Guide

Providing a user-friendly and freely accessible web server for a new predictor can improve its impact. To make it easier for users to use the proposed predictor, we established the corresponding sgRNA-PSM web server. Because the sgRNA-ExPSM predictor requires two features obtained from wet experiments, which are often unavailable, a web server could not be constructed for it. The web server has the following functions: (1) it allows users to input sgRNA sequences in reverse-complementary order, and (2) it allows users to input longer sequences (30–1,000 bp). The web server will detect all of the possible sgRNAs and predict their on-target activities. The steps for using the sgRNA-PSM web server are as follows: Step 1. Open the sgRNA-PSM web server at http://bliulab.net/sgRNA-PSM/; the homepage of the website will appear as shown in Figure 3. A detailed introduction to the web server can be obtained by clicking on the "Read Me" button.

Figure 3

Graphic of the Homepage of the Web Server http://bliulab.net/sgRNA-PSM/

Step 2. Click on the "Browse" button to upload the input file, or type your query DNA sequences in FASTA format. Step 3. Click on the "Submit" button to get the final predictive results. When inputting the four DNA sequences in the "Example" window, you will see that the first and second are predicted as high on-target activity sgRNAs, while the third is a sequence in reverse-complementary order, which is predicted as a low on-target activity sgRNA, and the fourth contains four low on-target activity sgRNAs and one high on-target activity sgRNA. These results are consistent with the experimental results. To help users solve problems when using the web server, the Frequently Questioned Answers (FQA) are available by clicking on the FQA button.
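How candidate sgRNAs might be located in a longer input can be sketched as follows. The detection rule used by the web server is not described in the text, so this is an assumption: a 30-nt window is taken as a candidate when positions 26 and 27 are GG (the benchmark layout), and both the input strand and its reverse complement are scanned. All names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence made of A, C, G, T."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_candidates(seq, window=30):
    """List (strand, start, context) for every window whose positions 26-27 are GG.

    `start` is the 0-based offset on the strand being scanned.
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for start in range(len(s) - window + 1):
            ctx = s[start:start + window]
            if ctx[25:27] == "GG":
                hits.append((strand, start, ctx))
    return hits
```

Each returned 30-nt context could then be encoded and scored by a trained predictor; a sequence supplied in reverse-complementary order would surface on the "-" strand.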

Materials and Methods

Benchmark Datasets

In this study, a widely used benchmark dataset constructed from the FC dataset and the RES dataset was employed to evaluate the performance of different methods. The benchmark dataset consists of 5,310 sequences from 11 human genes (CD33, MED12, NF2, CD13, TADA2B, CUL3, TADA1, HPRT, NF1, CD15, CCDC101) and 6 mouse genes (Cd45, Cd43, Cd28, H2-K, Cd5, Thy1). There are 1,059 high on-target activity sgRNAs and 4,251 low on-target activity sgRNAs. The benchmark dataset S is as follows:

$$S = S_1 \cup S_2 \cup \cdots \cup S_{17} \qquad (3)$$

where

$$S_i = S_i^{+} \cup S_i^{-} \qquad (4)$$

with

$$N_i = N_i^{+} + N_i^{-} \qquad (5)$$

where $\cup$ represents the union symbol between two sets, $S_i$ denotes the subset whose sgRNAs are from the ith targeting gene, the positive subset $S_i^{+}$ contains the high on-target activity sgRNAs, the negative subset $S_i^{-}$ contains the low on-target activity sgRNAs, $N_i^{+}$ represents the number of sgRNAs in $S_i^{+}$, $N_i^{-}$ represents the number of sgRNAs in $S_i^{-}$, and $N_i$ denotes the number of sgRNAs in $S_i$, with positive and negative sgRNAs in a ratio of about 1:4. The corresponding detailed sequences can be found in Data S1. The most updated on-target dataset, established in Chuai et al., was employed to further evaluate the performance of the proposed method. This on-target dataset was constructed based on hct116, hek293t, hela, and hl60. Those datasets were also employed by Haeussler et al.

PSM

Feature extraction is very important for constructing a computational predictor. Inspired by the PS and mismatch features, here a novel feature extraction method, PSM, was proposed to capture the long-range sequence information and evolutionary information. Furthermore, PSM is able to efficiently reduce the dimension of the feature vectors. The detailed procedures of generating PSM are described as follows. A DNA sample D can be represented as follows:

$$D = N_1 N_2 N_3 \cdots N_L \qquad (6)$$

where

$$N_i \in \{A, C, G, T\} \qquad (7)$$

represents the ith nucleobase in the sequence, the symbol $\in$ denotes "member of" in the set, and L represents the length of D. The PS feature is an important and useful feature extraction method widely used in previous studies [35–38]. Because the position of a nucleotide in an sgRNA affects its activity, the PS feature incorporates the local sequence position information by representing the k-mers along a DNA sample D (cf. Equation 6) by "one-hot" encoding. By using the PS feature, D can be represented as follows [35–38]:

$$D = [\, f_{1,1} \cdots f_{1,4^k} \;\; f_{2,1} \cdots f_{2,4^k} \;\cdots\; f_{L-k+1,1} \cdots f_{L-k+1,4^k} \,]^{T} \qquad (8)$$

where T represents the transpose symbol, $f_{i,j}$ denotes the jth feature in the one-hot encoding at the ith position in D, whose value is 0 or 1, and k is the number of adjacent nucleotides in a k-mer. From Equation 8, we can see that the dimension of the PS vector increases rapidly as k increases. For example, when k is equal to 6, the dimension of the PS feature vector will be 4^6 × (30 − 6 + 1) = 1.024 × 10^5, which will cause the high-dimension disaster. Therefore, Equation 8 is useful only when k is small, and it ignores the information of non-consecutive nucleotides. As a result, it can only incorporate the short-range and consecutive nucleotide information without considering the long-range and non-consecutive nucleotide information. The mismatch feature considers the evolutionary process and allows mismatches to occur in k-mers. Therefore, the dimension of the corresponding feature vectors can be markedly decreased compared with that of k-mers.
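The position-specific one-hot encoding described here can be sketched in a few lines. This is a minimal illustration, assuming sequences that contain only A, C, G, and T and k-mers ordered lexicographically; the function name `ps_vector` is hypothetical, and the ordering of components is a choice made here rather than something the text specifies.

```python
from itertools import product


def ps_vector(seq, k):
    """One-hot encode the k-mer at every position; length is 4**k * (len(seq) - k + 1)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: j for j, kmer in enumerate(kmers)}
    n_pos = len(seq) - k + 1
    vec = [0] * (len(kmers) * n_pos)
    for i in range(n_pos):
        # exactly one component per position is switched on
        vec[i * len(kmers) + index[seq[i:i + k]]] = 1
    return vec
```

For a 30-nt sequence the vector has 464 components at k = 2 and 102,400 at k = 6, with only L − k + 1 of them nonzero, which is the growth the paragraph calls the high-dimension disaster.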
In this study, we combined the mismatch feature with the PS feature and proposed a novel feature, i.e., PSM, which is defined as follows:

$$D = [\, f_{1,1} \cdots f_{1,\alpha} \;\; f_{2,1} \cdots f_{2,\alpha} \;\cdots\; f_{L-k+1,1} \cdots f_{L-k+1,\alpha} \,]^{T} \qquad (9)$$

where $f_{i,j}$ represents the jth feature in one-hot encoding at the ith position in D, whose value is 0 or 1, and α denotes the number of mismatch features considering the one-hot encoding, which can be defined as follows:

$$\alpha = \binom{k}{m} \times 4^{\,k-m} \qquad (10)$$

where m is the number of mismatches in k-mers. As shown in Equations 9 and 10, the first α components reflect the one-hot-encoded feature vector corresponding to the first sequence position, whereas the components from α + 1 to 2α reflect the one-hot-encoded feature vector corresponding to the second sequence position, and so forth. A feature vector formed with α × (L − k + 1) components is called the PSM vector for D as defined in Equation 9. A schematic diagram illustrating how to generate the PSM vector for D is shown in Figure 4. Compared to the PS vector defined in Equation 8, the dimension of the PSM vector is significantly reduced. For example, when k = 6, the PS feature vector's dimension (cf. Equation 8) is 1.024 × 10^5, while the PSM feature vector's dimension is α × (L − k + 1) as defined in Equations 9 and 10. If we assume m = 5, the dimension will be $\binom{6}{5} \times 4^{6-5} \times (30 - 6 + 1) = 600$. The size of the latter is around 1/170th that of the former. Namely, PSM can markedly reduce the dimension of the feature vector compared with PS. This is especially true for larger k values (see Table 4).
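One plausible reading of the PSM encoding is sketched below. It is an interpretation, not the authors' implementation: each feature is a k-length pattern with exactly m wildcard positions (written "*", as in Table 1), and a component is 1 when the k-mer at that sequence position agrees with the pattern on every non-wildcard position. The function names and the ordering of patterns are assumptions.

```python
from itertools import combinations, product


def psm_patterns(k, m):
    """All k-length patterns with exactly m wildcard ('*') positions."""
    patterns = []
    for wild in combinations(range(k), m):
        fixed = [p for p in range(k) if p not in wild]
        for letters in product("ACGT", repeat=k - m):
            pat = ["*"] * k
            for p, ch in zip(fixed, letters):
                pat[p] = ch
            patterns.append("".join(pat))
    return patterns


def psm_vector(seq, k, m):
    """Concatenate, for every position, a 0/1 indicator per pattern."""
    patterns = psm_patterns(k, m)
    vec = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        vec.extend(
            int(all(p == "*" or p == c for p, c in zip(pat, kmer)))
            for pat in patterns
        )
    return vec
```

Under this reading there are C(k, m) × 4^(k − m) patterns per position, so k = 5 and m = 2 give 640 patterns and a 16,640-dimensional vector for a 30-nt sequence, in agreement with Table 4. Because the wildcards may sit between fixed nucleotides, a pattern such as C**GG links non-consecutive positions.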

Figure 4

Schematic Diagram Illustrating How to Generate the PSM Vector for a DNA Sequence

(A) Example of PSM with parameters of k = 2, m = 1. (B) Example of PSM with parameters of k = 3, m = 1.

Table 4

Comparison between the PS Feature Vector’s dimension (cf. Equation 8) and the PSM Feature Vector’s Dimension (cf. Equation 9)

k	Dimension of PS Vector (a)	m	Dimension of PSM Vector (b)	Ratio γ (c)
2	464	1	232	∼2
3	1,792	1	1,344	∼1.3
3	1,792	2	336	∼5.3
4	6,912	1	6,912	1
4	6,912	2	2,592	∼2.7
4	6,912	3	432	∼16
5	26,624	2	16,640	∼1.6
5	26,624	3	4,160	∼6.4
5	26,624	4	520	∼51.2
6	102,400	4	6,000	∼17.07
6	102,400	5	600	∼170.67
⋮	⋮	⋮	⋮	⋮

(a) Calculated by Equation 8.

(b) Calculated by Equation 9.

(c) Ratio of the value in column 2 to the value in column 4; it is equal to $4^{m} / \binom{k}{m}$, where m is given in column 3.
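The table entries can be reproduced from closed-form expressions. The sketch below assumes L = 30 and the counts 4^k × (L − k + 1) for PS and C(k, m) × 4^(k − m) × (L − k + 1) for PSM; these formulas are inferred from the rows of Table 4, and the function names are hypothetical.

```python
from math import comb


def ps_dim(k, L=30):
    """Dimension of the PS vector: 4**k one-hot slots at each of L - k + 1 positions."""
    return 4 ** k * (L - k + 1)


def psm_dim(k, m, L=30):
    """Dimension of the PSM vector: C(k, m) * 4**(k - m) slots at each position."""
    return comb(k, m) * 4 ** (k - m) * (L - k + 1)


def ratio(k, m):
    """PS dimension divided by PSM dimension; the position count cancels out."""
    return 4 ** m / comb(k, m)


if __name__ == "__main__":
    for k, m in [(2, 1), (3, 1), (3, 2), (4, 1), (4, 2), (4, 3),
                 (5, 2), (5, 3), (5, 4), (6, 4), (6, 5)]:
        print(k, ps_dim(k), m, psm_dim(k, m), round(ratio(k, m), 2))
```

The ratio depends only on k and m, not on the sequence length, and grows quickly as m approaches k.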

Therefore, the PSM vector (cf. Equation 9) should be used to represent the DNA samples, because PSM can overcome the aforementioned limitations for large values of k while avoiding the high-dimension disaster problem. Finally, we can augment the PSM vector (cf. Equation 9) to

$$D_{\mathrm{Ex}} = [\, D^{T} \;\; a \;\; b \,]^{T} \qquad (11)$$

where $D_{\mathrm{Ex}}$ is the augmented PSM, D is the PSM vector of Equation 9, a is the amino acid cut position, and b is the percent peptide given in Doench et al. Both of these features were obtained by wet experiments and are often unavailable. The feature vector formed with these α × (L − k + 1) + 2 components (α as in Equation 10) is the ExPSM vector for D.

XGBoost Algorithm

The XGBoost algorithm is a tree-boosting technique for classification and regression tasks. The most important advantage of XGBoost is its scalability in all scenarios. For more detailed information on XGBoost, please refer to Chen and Guestrin. In this study, the regression model of the XGBoost algorithm was employed. We used the scikit-learn package to implement the XGBoost algorithm. The values of its three main parameters (maximum depth of a tree C, boosting learning rate R, and number of boosted trees F) are given in the footnotes of Tables 2 and 3, and all the other parameters were set to their default values. Finally, according to Equations 9 and 11, two predictors have been proposed: sgRNA-PSM, which feeds the PSM vector (Equation 9) into XGBoost, and sgRNA-ExPSM, which feeds the ExPSM vector (Equation 11) into XGBoost.
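For readers who want to reproduce the setup, the three symbols can be mapped onto the argument names used by the scikit-learn-style interface of the xgboost package. The sketch below is an assumption about how the configuration might be written, not the authors' code: the values are those reported for the benchmark dataset in the Table 2 footnotes, and xgboost is imported lazily so that the mapping can be read without the package installed.

```python
# Hyperparameters named C, R, and F in the text, under xgboost's argument names.
SGRNA_PSM_PARAMS = {
    "max_depth": 3,        # C: maximum depth of a tree
    "learning_rate": 0.1,  # R: boosting learning rate
    "n_estimators": 800,   # F: number of boosted trees
}


def build_regressor(params=None):
    """Create an XGBoost regressor; every other setting keeps its default value."""
    from xgboost import XGBRegressor  # third-party package, assumed to be installed

    return XGBRegressor(**(params or SGRNA_PSM_PARAMS))
```

For the on-target dataset of Chuai et al., the Table 3 footnotes report different values (9, 0.05, and 2,300), which could be passed as `params`.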

Evaluation Method of Performance

The AUC, the area under the ROC curve [56–58], is a widely used measure for evaluating the performance of predictors. A better predictor corresponds to a larger AUC value.
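The AUC has a simple pairwise interpretation that can be computed directly: it is the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The sketch below uses this definition; the function name is hypothetical, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ordering
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to the random-guess diagonal of the ROC plot, and 1.0 to a perfect ranking.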

Cross-Validation

Cross-validation is an important step in evaluating the performance of a predictor. In this study, to ensure that a predictor generalizes across genes, leave-one-gene-out cross-validation was used: each of the 17 subsets of S (cf. Equation 3) was selected in turn as the test set, while the other 16 subsets were used to construct the training set to train the predictor. This process was repeated 17 times, so that each subset was selected as the test set once.
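The splitting scheme can be sketched as a small generator. It assumes that each sgRNA carries the name of its target gene; the function name and the alphabetical fold order are choices made for this illustration.

```python
def leave_one_gene_out(genes):
    """Yield (held_out_gene, train_idx, test_idx), one fold per distinct gene."""
    for held_out in sorted(set(genes)):
        test_idx = [i for i, g in enumerate(genes) if g == held_out]
        train_idx = [i for i, g in enumerate(genes) if g != held_out]
        yield held_out, train_idx, test_idx
```

With the 17 genes of the benchmark dataset this produces 17 folds. No sgRNA from the held-out gene is ever seen during training, which is what lets the resulting AUC speak to generalization across genes.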

Implementation of the Competing Methods

In this study, we compared the proposed methods with three state-of-the-art methods: ge-CRISPR, Azimuth, and CRISPRpred. These three approaches were implemented as follows. For ge-CRISPR, the 464 dinucleotide (1-degree) binary features were fed into an SVM regressor with a radial basis function (RBF) kernel and a c value of 25. For Azimuth, seven features were used to represent the samples: position-independent, position-specific, GC count, NGGN, thermodynamic features, amino acid cut position, and percent peptide. These features were combined with GBR with the parameters learning_rate = 0.1, max_depth = 3, and n_estimators = 100 to construct the predictor. For CRISPRpred, five different feature extraction methods were employed: position-independent, position-specific, thermodynamic features, amino acid cut position, and percent peptide. Please note that ViennaRNA package version 2.0 was used to generate the thermodynamic features. RF with a maximum of 500 trees was then applied to these features to select 2,899 relevant features according to their importance scores (Mean Decrease Gini). These features were finally fed into an SVM regressor with a linear kernel and a c value of 2^−2.

43 in total

1. Support vector machines for predicting membrane protein types by using functional domain composition.

Authors: Yu-Dong Cai; Guo-Ping Zhou; Kuo-Chen Chou
Journal: Biophys J Date: 2003-05 Impact factor: 4.033

2. Protein fold recognition based on sparse representation based classification.

Authors: Ke Yan; Yong Xu; Xiaozhao Fang; Chunhou Zheng; Bin Liu
Journal: Artif Intell Med Date: 2017-03-27 Impact factor: 5.326

3. Improving tRNAscan-SE Annotation Results via Ensemble Classifiers.

Authors: Quan Zou; Jiasheng Guo; Ying Ju; Meihong Wu; Xiangxiang Zeng; Zhiling Hong
Journal: Mol Inform Date: 2015-09-14 Impact factor: 3.353

4. A Consensus Community-Based Particle Swarm Optimization for Dynamic Community Detection.

Authors: Xiangxiang Zeng; Wen Wang; Cong Chen; Gary G Yen
Journal: IEEE Trans Cybern Date: 2019-09-23 Impact factor: 11.448

5. BioSeq-Analysis2.0: an updated platform for analyzing DNA, RNA and protein sequences at sequence level and residue level based on machine learning approaches.

Authors: Bin Liu; Xin Gao; Hanyu Zhang
Journal: Nucleic Acids Res Date: 2019-11-18 Impact factor: 16.971

6. ViennaRNA Package 2.0.

Authors: Ronny Lorenz; Stephan H Bernhart; Christian Höner Zu Siederdissen; Hakim Tafer; Christoph Flamm; Peter F Stadler; Ivo L Hofacker
Journal: Algorithms Mol Biol Date: 2011-11-24 Impact factor: 1.405

7. Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9.

Authors: John G Doench; Nicolo Fusi; Meagan Sullender; Mudra Hegde; Emma W Vaimberg; Jennifer Listgarten; Katherine F Donovan; Ian Smith; Zuzana Tothova; Craig Wilen; Robert Orchard; Herbert W Virgin; David E Root
Journal: Nat Biotechnol Date: 2016-01-18 Impact factor: 54.908

8. iPromoter-2L2.0: Identifying Promoters and Their Types by Combining Smoothing Cutting Window Algorithm and Sequence-Based Features.

Authors: Bin Liu; Kai Li
Journal: Mol Ther Nucleic Acids Date: 2019-08-14 Impact factor: 8.886

9. Improved Pre-miRNAs Identification Through Mutual Information of Pre-miRNA Sequences and Structures.

Authors: Xiangzheng Fu; Wen Zhu; Lijun Cai; Bo Liao; Lihong Peng; Yifan Chen; Jialiang Yang
Journal: Front Genet Date: 2019-02-25 Impact factor: 4.599

10. Evaluation of off-target and on-target scoring algorithms and integration into the guide RNA selection tool CRISPOR.

Authors: Maximilian Haeussler; Kai Schönig; Hélène Eckert; Alexis Eschstruth; Joffrey Mianné; Jean-Baptiste Renaud; Sylvie Schneider-Maunoury; Alena Shkumatava; Lydia Teboul; Jim Kent; Jean-Stephane Joly; Jean-Paul Concordet
Journal: Genome Biol Date: 2016-07-05 Impact factor: 13.583
