
Sc-ncDNAPred: A Sequence-Based Predictor for Identifying Non-coding DNA in Saccharomyces cerevisiae.

Wenying He¹, Ying Ju², Xiangxiang Zeng², Xiangrong Liu², Quan Zou^1,3.

Abstract

With the rapid development of high-speed sequencing technologies and the completion of many whole-genome sequencing projects, genomics research is advancing from genome sequencing to genome synthesis. Synthetic biology technologies such as DNA-based molecular assembly, genome editing, directed evolution, and DNA storage are emerging in succession. In particular, the rapid growth of DNA assembly technology may greatly advance the creation of artificial life. DNA assembly, however, requires a large number of well-characterized target sequences as data support. Non-coding DNA (ncDNA) sequences occupy most of an organism's genome, so recognizing them accurately is necessary. Although experimental methods have been proposed to detect ncDNA sequences, they are expensive for genome-wide detection. It is therefore necessary to develop machine-learning methods for predicting ncDNA sequences. In this study, we collected an ncDNA benchmark dataset of Saccharomyces cerevisiae and report a support vector machine-based predictor, called Sc-ncDNAPred, for predicting ncDNA sequences. The optimal feature extraction strategy was selected from a group comprising mononucleotide, dimer, trimer, tetramer, pentamer, and hexamer compositions, using a support vector machine. Sc-ncDNAPred achieved an overall accuracy of 0.98. For the convenience of users, an online web-server has been built at: http://server.malab.cn/Sc_ncDNAPred/index.jsp.


Keywords: DNA sequence; feature representation; genome synthesis; non-coding DNA; support vector machine

Year: 2018 PMID: 30258427 PMCID: PMC6144933 DOI: 10.3389/fmicb.2018.02174

Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 5.640

Introduction

With the completion of many whole-genome sequencing projects, a growing body of research has shown that non-coding DNA (ncDNA) is a major component of the genome. Numerous studies (Vogel, 1964; Thomas, 1971; Eddy, 2012; Puente et al., 2015; Liu et al., 2017a; Yao et al., 2018) have shown that the complexity of organisms is related to the length of non-coding regions, which are specifically transcribed in physiological and disease states. Although the function of most ncDNAs is still unknown (Khurana et al., 2016), some studies (Horn et al., 2013; Huang et al., 2013; Vinagre et al., 2013; Puente et al., 2015; Hu et al., 2017, 2018; Rheinbay et al., 2017; Liao et al., 2018; Zhang W. et al., 2018) have shown that most cancer-related gene mutations are located in ncDNA regions. How ncDNAs specifically affect tumor formation remains an urgent open problem. In addition, ncDNAs play an important role in gene expression, regulation, and inheritance (Khurana et al., 2016).

With the rapid growth of synthetic biology, genomics research is advancing from genome sequencing to genome synthesis (Erlich and Zielinski, 2017; Jain et al., 2018; Liu B. et al., 2018). In recent years, various DNA assembly technologies (Ni et al., 2017; Wu et al., 2017; Xie et al., 2017; Zhang et al., 2017b) have been developed based on the principles of atypical restriction-enzyme digestion and ligation (Engler et al., 2009; Sleight et al., 2010), single-strand annealing and splicing (Gibson et al., 2009; Li and Elledge, 2012), and PCR (Warrens et al., 1997), providing more rapid technical support for synthetic biology. Current efforts focus on improving the efficiency of large-scale DNA assembly. Meanwhile, with the rapid development of computer networks and the popularity of the Internet, the amount of digital information, such as network, audio, and video data, is increasing rapidly, and a storage system more efficient than the existing ones is urgently needed. DNA storage technology (Baum, 1995; Davis, 1996; Carr and Church, 2009) can meet this requirement. In a recent study (Shipman et al., 2017), the researchers introduced a method that encodes images and video into the genome of Escherichia coli and reads them back from the genomes of living bacterial cells.

All of the above studies require a large amount of DNA data. As a complex type of genetic information, DNA sequences have specific characteristics in both coding sequences (cDNA) and ncDNA sequences. Currently, the identification of cDNAs and ncDNAs relies mainly on experimental methods. However, traditional experimental methods are time-consuming and laborious, the amount of genomic data is large, and the sequence types are complex. There is therefore an urgent need for accurate and efficient prediction methods to mine the information and knowledge contained in ncDNAs and cDNAs. Computational methods complement experimental ones and have been shown to improve recognition accuracy (Zhou et al., 2016). In this study, an SVM-based computational method was established for the first time to recognize ncDNA sequences in Saccharomyces cerevisiae (S. cerevisiae). Six types of features were extracted: mononucleotide composition (MNC), dimer nucleotide composition (DNC), trimer nucleotide composition (TNC), tetramer nucleotide composition (TrNC), pentamer nucleotide composition (PNC), and hexamer nucleotide composition (HNC). The optimal feature extraction strategy was selected using the SVM. The workflow for constructing the Sc-ncDNAPred model is shown in Figure 1.

Figure 1

The workflow of Sc-ncDNAPred.

Methods

Benchmark dataset

In this study, the benchmark dataset was derived from the Ensembl genome database project (Hubbard et al., 2002), one of several well-known genome browsers for retrieving genomic information. Experimentally validated cDNA sequences of S. cerevisiae were extracted from this database, giving 6713 samples. The ncDNA sequences of S. cerevisiae were then excised from the original genomic data according to the annotated boundaries of the coding regions, which gave 6410 ncDNA samples. To remove redundancy, CD-HIT (Li and Godzik, 2006) was used to remove sequences with ≥ 75% sequence identity. Finally, we obtained 6030 ncDNA samples and 6251 cDNA samples. The benchmark dataset can thus be formulated as

$$S = S^{+} \cup S^{-} \tag{1}$$

where $S^{+}$ contains 6030 ncDNA samples, $S^{-}$ contains 6251 cDNA samples, and the symbol ∪ denotes the union in set theory. The length distribution of the ncDNA samples is shown in Figure 2; the lengths of the ncDNA samples lie mainly between 100 and 800.
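The excision step described above can be pictured as taking the complement of the annotated coding intervals on each chromosome. The following is only a minimal sketch under that assumption; the paper does not state how strands, overlapping genes, or chromosome ends were handled, and the function name `noncoding_intervals` and the 0-based, end-exclusive coordinates are hypothetical choices made for illustration.

```python
def noncoding_intervals(coding, chrom_length):
    """Return the gaps between coding intervals on one chromosome.

    coding: list of (start, end) pairs, 0-based and end-exclusive (assumed convention).
    chrom_length: total length of the chromosome.
    """
    gaps = []
    cursor = 0  # first position not yet covered by a coding interval
    for start, end in sorted(coding):
        if start > cursor:
            gaps.append((cursor, start))  # uncovered stretch before this gene
        cursor = max(cursor, end)  # overlapping genes extend the covered region
    if cursor < chrom_length:
        gaps.append((cursor, chrom_length))  # tail after the last gene
    return gaps


example = noncoding_intervals([(10, 20), (15, 30), (50, 60)], 100)
```

Each returned interval would then be sliced out of the chromosome sequence and passed to redundancy removal.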

Figure 2

The length distribution of ncDNA samples.

Feature vector construction

A DNA sample of length L can be written in the compact form

$$D = R_1 R_2 R_3 \cdots R_L \tag{2}$$

where $R_i$ (i = 1, 2, 3, …, L) represents the nucleotide at the i-th position of the sequence.

K-mer composition

K-mer nucleotide composition has been applied in many fields of bioinformatics (Liu et al., 2015b,c; Kim et al., 2017; Matias Rodrigues et al., 2017; Orenstein et al., 2017; Liu, 2018; Liu X. et al., 2018; Rangavittal et al., 2018). MNC corresponds to k = 1, DNC to k = 2, TNC to k = 3, TrNC to k = 4, PNC to k = 5, and HNC to k = 6. The occurrence frequency of the i-th k-mer, k-mer(i), can be represented as

$$f_i^{(k)} = \frac{N_i^{(k)}}{L - k + 1} \tag{3}$$

where $N_i^{(k)}$ denotes the number of occurrences of the i-th k-mer and L is the length of the sample sequence, so that $L - k + 1$ is the number of overlapping windows of width k. Each DNA sample is thus represented by a feature vector of dimension $4^k$. The general form of the feature vector X is

$$X = \left[ f_1^{(k)}, f_2^{(k)}, \ldots, f_{4^k}^{(k)} \right]^{T} \tag{4}$$
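To make the k-mer composition concrete, here is a minimal sketch that counts overlapping k-mers and normalizes by the number of windows, L − k + 1. The function name `kmer_frequencies`, the lexicographic ordering of the k-mers, and the choice to skip windows containing ambiguous bases are assumptions for illustration, not details taken from the paper.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(seq, k):
    """Return the 4**k k-mer frequencies of seq, in lexicographic k-mer order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = len(seq) - k + 1  # number of overlapping windows of width k
    if windows <= 0:
        raise ValueError("sequence is shorter than k")
    for i in range(windows):
        word = seq[i:i + k]
        if word in counts:  # windows with ambiguous bases (e.g. N) are skipped
            counts[word] += 1
    return [counts[w] / windows for w in kmers]


# Dimer composition (k = 2) of a toy sequence: windows AT, TA, AT, TG, GC
vec = kmer_frequencies("ATATGC", 2)
```

With k = 4 the same function yields the 256-dimensional TrNC vector.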

Feature ranking

Each sample sequence is represented by a large set of features, which introduces redundant information (Wei and Billings, 2007; Senawi et al., 2017). To distinguish the contributions of different features to the prediction model, the F-score method (Chen W. et al., 2016; Jia and He, 2016; Tang et al., 2016, 2018; He and Jia, 2017) was adopted in this study to rank the features. The F-score value of the i-th feature is defined as

$$F(i) = \frac{\left(\bar{x}_i^{+} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{-} - \bar{x}_i\right)^2}{\dfrac{1}{n^{+} - 1} \sum_{k=1}^{n^{+}} \left(x_{k,i}^{+} - \bar{x}_i^{+}\right)^2 + \dfrac{1}{n^{-} - 1} \sum_{k=1}^{n^{-}} \left(x_{k,i}^{-} - \bar{x}_i^{-}\right)^2} \tag{5}$$

where $\bar{x}_i$, $\bar{x}_i^{+}$, and $\bar{x}_i^{-}$ are the average values of the i-th feature in the whole, ncDNA, and cDNA datasets, respectively; $n^{+}$ is the number of ncDNA training samples; $n^{-}$ is the number of cDNA training samples; $x_{k,i}^{+}$ is the i-th feature of the k-th ncDNA sample; and $x_{k,i}^{-}$ is the i-th feature of the k-th cDNA sample. A feature with a larger F-score has better discrimination ability.
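The F-score ranking can be sketched in a few lines of plain Python. This is an illustrative implementation of the standard F-score described above, not the authors' code; the function name `f_score` and the convention of returning 0.0 for a feature with zero within-class variance are assumptions.

```python
def f_score(pos, neg):
    """F-score of every feature.

    pos, neg: lists of equal-length feature vectors for the ncDNA (positive)
    and cDNA (negative) samples; each class needs at least two samples.
    """
    n_pos, n_neg = len(pos), len(neg)
    scores = []
    for i in range(len(pos[0])):
        xp = [row[i] for row in pos]
        xn = [row[i] for row in neg]
        mean_p = sum(xp) / n_pos
        mean_n = sum(xn) / n_neg
        mean_all = (sum(xp) + sum(xn)) / (n_pos + n_neg)
        # Numerator: spread of the class means around the overall mean
        between = (mean_p - mean_all) ** 2 + (mean_n - mean_all) ** 2
        # Denominator: sum of the unbiased within-class variances
        within = (sum((v - mean_p) ** 2 for v in xp) / (n_pos - 1)
                  + sum((v - mean_n) ** 2 for v in xn) / (n_neg - 1))
        scores.append(between / within if within > 0 else 0.0)  # assumed guard
    return scores


scores = f_score([[1, 5], [3, 6]], [[7, 5], [9, 6]])
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

In the toy data the first feature separates the two classes and the second does not, so the first feature is ranked highest.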

Support vector machine

Support vector machine (SVM) (Hearst et al., 1998) is a widely used two-class classification algorithm based on statistical learning theory. It has proven powerful in many fields of pattern recognition and data classification (Byun and Lee, 2002; Nasrabadi, 2007; Zhang N. et al., 2018). A growing number of applications have also shown that SVM has strong data processing capabilities in bioinformatics (Xiong et al., 2011; Jia et al., 2013, 2017; Cao et al., 2014; Liu et al., 2014, 2017b; Wei et al., 2015; Chen X. X. et al., 2016; Jia and He, 2016; Yang et al., 2016; Zou et al., 2016; Xiao et al., 2017; Qiao et al., 2018; Su et al., 2018). The ncDNA and cDNA samples were represented by their feature vectors. The SVM classifies the data by mapping the input feature vectors into a high-dimensional feature space using a kernel function. In this study, the public LIBSVM package (Chang and Lin, 2011) was used to train models for discriminating between ncDNA and cDNA sequences. The radial basis function (RBF) was used as the kernel function. The penalty parameter C and the kernel parameter γ were preliminarily optimized through a grid search strategy.
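For readers unfamiliar with the two tuned parameters, the sketch below shows the RBF kernel and a generic grid search over (C, γ). It does not call LIBSVM; the exponent ranges are an assumption borrowed from common LIBSVM practice (the paper does not report its grid), and `score` is a hypothetical user-supplied function, for example the mean cross-validation accuracy of a model trained with the given pair.

```python
import math
from itertools import product


def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def parameter_grid(c_exps=range(-5, 16, 2), g_exps=range(-15, 4, 2)):
    """Candidate (C, gamma) pairs on a log2 grid (assumed ranges, not from the paper)."""
    return [(2.0 ** c, 2.0 ** g) for c, g in product(c_exps, g_exps)]


def grid_search(score, grid):
    """Return the (C, gamma) pair that maximizes the supplied scoring function."""
    return max(grid, key=lambda pair: score(*pair))


# Toy scoring function with a known optimum at C = 2, gamma = 2**-3
def toy_score(c, g):
    return -abs(math.log2(c) - 1) - abs(math.log2(g) + 3)


best = grid_search(toy_score, parameter_grid())
```

In practice the scoring function would wrap model training and cross-validation, which makes the search cost proportional to the grid size.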

Performance evaluation

K-fold cross-validation (Chou and Zhang, 1995; Kohavi, 1995; Zhang et al., 2012a,b, 2015; Liu et al., 2015a; Chen X. et al., 2016; Li et al., 2016; Luo et al., 2016; Chen et al., 2017b, 2018a,b; Pan et al., 2017a; Xu et al., 2017; He et al., 2018) is one of the most widely used approaches for examining the ability of a prediction model; the independent dataset test and the jackknife test (Chou and Shen, 2008) are also used in many applications. To reduce the computational cost, 10-fold cross-validation was used to examine how effectively each model identifies ncDNA sequences. The training dataset was randomly divided into 10 subsets of approximately equal size. In each iteration, one subset was used as the test set and the remaining 9 subsets were used to train the model; one complete cycle repeats this 10 times, so that each subset serves as the test set once. This 10-fold cross-validation procedure was repeated five times and the results were averaged. To evaluate the prediction performance of the models, five classic measures were used (Chou, 2001; Qiu et al., 2015, 2016; Liu et al., 2017; Pan et al., 2017b; Zhang et al., 2017a; Tang et al., 2018; Yang et al., 2018): sensitivity (Sn), specificity (Sp), accuracy (Acc), Matthews correlation coefficient (MCC), and the receiver operating characteristic (ROC) curve. The first four are defined as

$$\begin{cases} \mathrm{Sn} = 1 - \dfrac{N_{-}^{+}}{N^{+}} \\[2ex] \mathrm{Sp} = 1 - \dfrac{N_{+}^{-}}{N^{-}} \\[2ex] \mathrm{Acc} = 1 - \dfrac{N_{-}^{+} + N_{+}^{-}}{N^{+} + N^{-}} \\[2ex] \mathrm{MCC} = \dfrac{1 - \left( \dfrac{N_{-}^{+}}{N^{+}} + \dfrac{N_{+}^{-}}{N^{-}} \right)}{\sqrt{\left( 1 + \dfrac{N_{+}^{-} - N_{-}^{+}}{N^{+}} \right) \left( 1 + \dfrac{N_{-}^{+} - N_{+}^{-}}{N^{-}} \right)}} \end{cases} \tag{6}$$

In these expressions, $N^{+}$ and $N^{-}$ are the total numbers of ncDNA and cDNA samples, respectively, while $N_{-}^{+}$ is the number of ncDNA samples incorrectly predicted as cDNA and $N_{+}^{-}$ is the number of cDNA samples incorrectly predicted as ncDNA.
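The evaluation protocol can be sketched as a fold splitter plus a function that turns error counts into the four scalar metrics. This is an illustrative sketch, not the authors' pipeline; the names `k_fold_indices` and `metrics`, the fixed random seed, and the convention of returning MCC = 0 when its denominator vanishes are assumptions.

```python
import math
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle sample indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def metrics(n_pos, n_neg, fn, fp):
    """Return (Sn, Sp, Acc, MCC).

    n_pos, n_neg: numbers of ncDNA and cDNA samples.
    fn: ncDNA samples predicted as cDNA; fp: cDNA samples predicted as ncDNA.
    """
    sn = 1 - fn / n_pos
    sp = 1 - fp / n_neg
    acc = 1 - (fn + fp) / (n_pos + n_neg)
    denom = math.sqrt((1 + (fp - fn) / n_pos) * (1 + (fn - fp) / n_neg))
    mcc = (1 - (fn / n_pos + fp / n_neg)) / denom if denom > 0 else 0.0
    return sn, sp, acc, mcc


folds = k_fold_indices(103, k=10)
sn, sp, acc, mcc = metrics(n_pos=100, n_neg=100, fn=10, fp=20)
```

Each fold serves once as the test set while the other nine are used for training; repeating the whole procedure with different seeds and averaging corresponds to the five repetitions described above.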

Results and discussion

Prediction results of models

We used six feature extraction methods (MNC, DNC, TNC, TrNC, PNC, and HNC) as input to the SVM to establish six models. The ability of each feature extraction method to discriminate between ncDNA and cDNA samples was compared by 10-fold cross-validation (Table 1). As Table 1 shows, the model combining SVM with TrNC yielded the best prediction performance, with an accuracy of 98.26%, a sensitivity of 98.01%, a specificity of 98.51%, and an MCC of 0.965. The second-best performance was obtained with TNC, with an accuracy of 96.93%, a sensitivity of 96.62%, a specificity of 97.22%, and an MCC of 0.939. The PNC model also performed well, with an accuracy of 95.56%, a sensitivity of 95.25%, a specificity of 95.84%, and an MCC of 0.911.

Table 1

The 10-fold cross-validation results by different feature methods on the benchmark dataset.

Methods	Sn (%)	Sp (%)	ACC (%)	MCC
MNC	80.56	87.02	83.85	0.678
DNC	92.64	92.62	92.64	0.853
TNC	96.62	97.22	96.93	0.939
TrNC	98.01	98.51	98.26	0.965
PNC	95.25	95.84	95.56	0.911
HNC	90.71	92.25	91.49	0.830
All Features	95.99	96.08	96.03	0.921

The experiments were repeated five times and the reported values are the means.

To further investigate the overall prediction performance of each model, the ROC curves and AUC values of the different models under 10-fold cross-validation are shown in Figure 3. As k increased, the performance first increased and then decreased. The comparison demonstrated that TrNC produced the best results, so the TrNC feature was adopted for the final Sc-ncDNAPred model.
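The AUC summarizes a ROC curve as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; the function name `auc` and the label encoding (1 for ncDNA, 0 for cDNA) are assumptions, and the pairwise loop is quadratic in the number of samples, so it is meant for illustration rather than for large datasets.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
partial = auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```

The scores would typically be the decision values or probability estimates produced by the classifier on the held-out folds.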

Figure 3

The ROC curves to assess the predictive performance based on different feature extraction methods.

To further optimize the model, we performed multiple rounds of experiments on TrNC to select an appropriate subset of the 256 features (see Additional file 1: Table S1 for full details); however, the results showed no significant improvement in performance. A possible reason is that the selected feature subsets do not carry enough information for the discrimination.

Compositional analysis

To understand the bias of the 256 tetramers between ncDNAs and cDNAs, a heat map is provided in Figure 4. Each square in the heat map corresponds to the F-score value of one tetramer (see Table 2 for the layout). Deep red in the heat map corresponds to strong discrimination ability.

Figure 4

Heat map illustrating the F-score values of the 256 tetramers for discriminating between ncDNA and cDNA.

Table 2

Rules of composition of heat map.

AAAA	AAAC	AACA	AACC	ACAA	ACAC	ACCA	ACCC	CAAA	CAAC	CACA	CACC	CCAA	CCAC	CCCA	CCCC
AAAG	AAAT	AACG	AACT	ACAG	ACAT	ACCG	ACCT	CAAG	CAAT	CACG	CACT	CCAG	CCAT	CCCG	CCCT
AAGA	AAGC	AATA	AATC	ACGA	ACGC	ACTA	ACTC	CAGA	CAGC	CATA	CATC	CCGA	CCGC	CCTA	CCTC
AAGG	AAGT	AATG	AATT	ACGG	ACGT	ACTG	ACTT	CAGG	CAGT	CATG	CATT	CCGG	CCGT	CCTG	CCTT
AGAA	AGAC	AGCA	AGCC	ATAA	ATAC	ATCA	ATCC	CGAA	CGAC	CGCA	CGCC	CTAA	CTAC	CTCA	CTCC
AGAG	AGAT	AGCG	AGCT	ATAG	ATAT	ATCG	ATCT	CGAG	CGAT	CGCG	CGCT	CTAG	CTAT	CTCG	CTCT
AGGA	AGGC	AGTA	AGTC	ATGA	ATGC	ATTA	ATTC	CGGA	CGGC	CGTA	CGTC	CTGA	CTGC	CTTA	CTTC
AGGG	AGGT	AGTG	AGTT	ATGG	ATGT	ATTG	ATTT	CGGG	CGGT	CGTG	CGTT	CTGG	CTGT	CTTG	CTTT
GAAA	GAAC	GACA	GACC	GCAA	GCAC	GCCA	GCCC	TAAA	TAAC	TACA	TACC	TCAA	TCAC	TCCA	TCCC
GAAG	GAAT	GACG	GACT	GCAG	GCAT	GCCG	GCCT	TAAG	TAAT	TACG	TACT	TCAG	TCAT	TCCG	TCCT
GAGA	GAGC	GATA	GATC	GCGA	GCGC	GCTA	GCTC	TAGA	TAGC	TATA	TATC	TCGA	TCGC	TCTA	TCTC
GAGG	GAGT	GATG	GATT	GCGG	GCGT	GCTG	GCTT	TAGG	TAGT	TATG	TATT	TCGG	TCGT	TCTG	TCTT
GGAA	GGAC	GGCA	GGCC	GTAA	GTAC	GTCA	GTCC	TGAA	TGAC	TGCA	TGCC	TTAA	TTAC	TTCA	TTCC
GGAG	GGAT	GGCG	GGCT	GTAG	GTAT	GTCG	GTCT	TGAG	TGAT	TGCG	TGCT	TTAG	TTAT	TTCG	TTCT
GGGA	GGGC	GGTA	GGTC	GTGA	GTGC	GTTA	GTTC	TGGA	TGGC	TGTA	TGTC	TTGA	TTGC	TTTA	TTTC
GGGG	GGGT	GGTG	GGTT	GTGG	GTGT	GTTG	GTTT	TGGG	TGGT	TGTG	TGTT	TTGG	TTGT	TTTG	TTTT
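The ordering of the tetramers in Table 2 appears to follow a simple bitwise rule: if the 256 entries are read as a 16 × 16 grid, the nucleotide at each position is selected by one bit of the row index and one bit of the column index, in the order A, C, G, T. The sketch below generates the grid under that reading; the rule is inferred from the listed order rather than stated in the paper, and the function name `tetramer_grid` is hypothetical.

```python
NUCLEOTIDES = "ACGT"


def tetramer_grid(k=4):
    """Build the 2**k x 2**k grid of k-mers implied by the ordering of Table 2."""
    size = 2 ** k
    grid = []
    for r in range(size):
        row = []
        for c in range(size):
            word = ""
            for bit in range(k - 1, -1, -1):  # most significant bit gives the first base
                r_bit = (r >> bit) & 1
                c_bit = (c >> bit) & 1
                word += NUCLEOTIDES[2 * r_bit + c_bit]  # (row, col) bits pick A, C, G or T
            row.append(word)
        grid.append(row)
    return grid


grid = tetramer_grid()
first_row = grid[0]
```

Under this rule the first row contains only A and C, and each step down the rows swaps in G or T at one position, which matches the listed sequence.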

Heat map analysis revealed that the tetramers TATA, TTTT, CAAG, CCAA, ATAT, TAAA, TGGA, TTTA, ATGG, ATAA, AATA, and CTGG have the twelve highest F-score values among all tetramers. We also analyzed the other k-mer compositions with the F-score method. The key features were the two nucleotides G and T from MNC; the top five dimers (TA, CG, GA, TT, and CA) from DNC; the top five trimers (TGG, ATA, CCA, TAT, and TTT) from TNC; the top five pentamers (TTTTT, ATATA, TAAAA, TATAT, and TTTTA) from PNC; and the top five hexamers (TTTTTT, ATTTTT, TTTTTA, TTTTTC, and CTTTTT) from HNC. These key features are presented in a radar diagram (Figure 5). Studying these key features can deepen the understanding of the overall structure of the genome, which promotes both genome annotation and the study of biological evolution.

Figure 5

Key features of each k-mer composition selected by F-score method. Red color denotes F-score value of each feature.

Comparison with other classifiers

To the best of our knowledge, this is the first time that a machine learning method has been used to identify ncDNA in S. cerevisiae. To further test the proposed model, the predictions of Sc-ncDNAPred were compared with those of other powerful and widely used classifiers, namely k-Nearest Neighbor (KNN), Naïve Bayes, Random Forest, and J48 Tree, as implemented in WEKA (Frank et al., 2004). The 10-fold cross-validation results of these four classifiers on the same benchmark dataset are shown in Additional file 1: Table S2. The four metrics defined in Eq. 6 were all higher for Sc-ncDNAPred than for KNN, Naïve Bayes, Random Forest, and J48 Tree.

Web-server

Based on the benchmark dataset defined in Eq. 1, a predictor called Sc-ncDNAPred was established, where “Sc” stands for S. cerevisiae and “Pred” stands for “Prediction.” For the convenience of users, a step-by-step guide to the web-server is provided as follows:

Step 1. Open the web-server at http://server.malab.cn/Sc_ncDNAPred/index.jsp. You will see the home page of Sc-ncDNAPred, as shown in Figure 6. Click the “About” button to see a brief introduction to the server.

Figure 6

A semi-screenshot of the top page of the Sc-ncDNAPred web-server at: http://server.malab.cn/Sc_ncDNAPred/index.jsp.

Step 2. Paste the query DNA sequences into the input box. The input sequences should be in FASTA format. To see an example of DNA sequences in FASTA format, click the “example” button above the input box.

Step 3. Click the “Submit” button to start the prediction. If the prediction result for a sequence is positive, the output is “ncDNA”; otherwise, the output is “cDNA.”

Step 4. Click the “DataSet” button to download the benchmark dataset.

Step 5. Click the “Contact” button to contact us.
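Because the server expects FASTA input, it can help to check query sequences locally before pasting them. The following is a minimal sketch of a FASTA parser and a DNA alphabet check; the helper names `parse_fasta` and `is_dna` are hypothetical, and whether the server accepts ambiguous bases such as N is not stated in the paper, so the check below assumes that only A, C, G, and T are valid.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line.upper())  # sequences may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def is_dna(seq):
    """True if seq is non-empty and contains only A, C, G and T (assumed alphabet)."""
    return bool(seq) and set(seq) <= set("ACGT")


records = parse_fasta(">seq1\nATGC\nggta\n>seq2\nTTNA\n")
valid = [header for header, seq in records if is_dna(seq)]
```

Records that fail the check can be inspected or cleaned before submission.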

Conclusions

DNA assembly technology requires a large number of well-characterized target sequences as data support. Non-coding DNA (ncDNA) sequences occupy most of an organism's genome, so recognizing them accurately is necessary. In this study, an efficient computational model was proposed to identify ncDNAs in S. cerevisiae. The tetramer nucleotide composition (TrNC) was adopted to extract features, and the F-score method was used to analyze the feature vectors and find the key features. The high accuracy indicates that Sc-ncDNAPred is a powerful tool for predicting ncDNA. Finally, a free web-server was developed based on the proposed model. We hope that the predictor will be convenient for researchers. Currently, annotations for the genomic sequences of most species are lacking or unavailable. To analyze the ncDNA data of these organisms, data and methodological support can be obtained in a cross-species manner from annotated species. For example, the model built from the S. cerevisiae dataset could be used to analyze other microbial species that have not been explored in depth. In addition, we will apply this computational model to the prediction of potential disease-related non-coding DNA and, in the future, of disease-related non-coding RNA (Chen and Huang, 2017; Chen et al., 2017a, 2018c,d; You et al., 2017).

Author contributions

WH, QZ, and XL wrote the paper. XZ and YJ participated in preparation of the manuscript. QZ, WH, XL, XZ, and YJ participated in the research design. WH and QZ developed the web server. WH, YJ, XZ, XL, and QZ read and approved the final manuscript.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

83 in total

Review 1. The genetic organization of chromosomes.

Authors: C A Thomas
Journal: Annu Rev Genet Date: 1971 Impact factor: 16.830

2. Splicing by overlap extension by PCR using asymmetric amplification: an improved technique for the generation of hybrid proteins of immunological interest.

Authors: A N Warrens; M D Jones; R I Lechler
Journal: Gene Date: 1997-02-20 Impact factor: 3.688

3. Predicting miRNA-disease association based on inductive matrix completion.

Authors: Xing Chen; Lei Wang; Jia Qu; Na-Na Guan; Jian-Qiang Li
Journal: Bioinformatics Date: 2018-12-15 Impact factor: 6.937

4. LPI-ETSLP: lncRNA-protein interaction prediction using eigenvalue transformation-based semi-supervised link prediction.

Authors: Huan Hu; Chunyu Zhu; Haixin Ai; Li Zhang; Jian Zhao; Qi Zhao; Hongsheng Liu
Journal: Mol Biosyst Date: 2017-08-22

5. Computational identification of binding energy hot spots in protein-RNA complexes using an ensemble approach.

Authors: Yuliang Pan; Zixiang Wang; Weihua Zhan; Lei Deng
Journal: Bioinformatics Date: 2018-05-01 Impact factor: 6.937

6. BioSeq-Analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches.

Authors: Bin Liu
Journal: Brief Bioinform Date: 2019-07-19 Impact factor: 11.622

7. "Perfect" designer chromosome V and behavior of a ring derivative.

Authors: Ze-Xiong Xie; Bing-Zhi Li; Leslie A Mitchell; Yi Wu; Xin Qi; Zhu Jin; Bin Jia; Xia Wang; Bo-Xuan Zeng; Hui-Min Liu; Xiao-Le Wu; Qi Feng; Wen-Zheng Zhang; Wei Liu; Ming-Zhu Ding; Xia Li; Guang-Rong Zhao; Jian-Jun Qiao; Jing-Sheng Cheng; Meng Zhao; Zheng Kuang; Xuya Wang; J Andrew Martin; Giovanni Stracquadanio; Kun Yang; Xue Bai; Juan Zhao; Meng-Long Hu; Qiu-Hui Lin; Wen-Qian Zhang; Ming-Hua Shen; Si Chen; Wan Su; En-Xu Wang; Rui Guo; Fang Zhai; Xue-Jiao Guo; Hao-Xing Du; Jia-Qing Zhu; Tian-Qing Song; Jun-Jun Dai; Fei-Fei Li; Guo-Zhen Jiang; Shi-Lei Han; Shi-Yang Liu; Zhi-Chao Yu; Xiao-Na Yang; Ken Chen; Cheng Hu; Da-Shuai Li; Nan Jia; Yue Liu; Lin-Ting Wang; Su Wang; Xiao-Tong Wei; Mei-Qing Fu; Lan-Meng Qu; Si-Yu Xin; Ting Liu; Kai-Ren Tian; Xue-Nan Li; Jin-Hua Zhang; Li-Xiang Song; Jin-Gui Liu; Jia-Fei Lv; Hang Xu; Ran Tao; Yan Wang; Ting-Ting Zhang; Ye-Xuan Deng; Yi-Ran Wang; Ting Li; Guang-Xin Ye; Xiao-Ran Xu; Zheng-Bao Xia; Wei Zhang; Shi-Lan Yang; Yi-Lin Liu; Wen-Qi Ding; Zhen-Ning Liu; Jun-Qi Zhu; Ning-Zhi Liu; Roy Walker; Yisha Luo; Yun Wang; Yue Shen; Huanming Yang; Yizhi Cai; Ping-Sheng Ma; Chun-Ting Zhang; Joel S Bader; Jef D Boeke; Ying-Jin Yuan
Journal: Science Date: 2017-03-10 Impact factor: 47.728

8. Bug mapping and fitness testing of chemically synthesized chromosome X.

Authors: Yi Wu; Bing-Zhi Li; Meng Zhao; Leslie A Mitchell; Ze-Xiong Xie; Qiu-Hui Lin; Xia Wang; Wen-Hai Xiao; Ying Wang; Xiao Zhou; Hong Liu; Xia Li; Ming-Zhu Ding; Duo Liu; Lu Zhang; Bao-Li Liu; Xiao-Le Wu; Fei-Fei Li; Xiu-Tao Dong; Bin Jia; Wen-Zheng Zhang; Guo-Zhen Jiang; Yue Liu; Xue Bai; Tian-Qing Song; Yan Chen; Si-Jie Zhou; Rui-Ying Zhu; Feng Gao; Zheng Kuang; Xuya Wang; Michael Shen; Kun Yang; Giovanni Stracquadanio; Sarah M Richardson; Yicong Lin; Lihui Wang; Roy Walker; Yisha Luo; Ping-Sheng Ma; Huanming Yang; Yizhi Cai; Junbiao Dai; Joel S Bader; Jef D Boeke; Ying-Jin Yuan
Journal: Science Date: 2017-03-10 Impact factor: 47.728

9. SMOQ: a tool for predicting the absolute residue-specific quality of a single protein model with support vector machines.

Authors: Renzhi Cao; Zheng Wang; Yiheng Wang; Jianlin Cheng
Journal: BMC Bioinformatics Date: 2014-04-28 Impact factor: 3.169

10. A genetic algorithm-based weighted ensemble method for predicting transposon-derived piRNAs.

Authors: Dingfang Li; Longqiang Luo; Wen Zhang; Feng Liu; Fei Luo
Journal: BMC Bioinformatics Date: 2016-08-31 Impact factor: 3.169
