
Prediction of Protein Subcellular Localization Based on Fusion of Multi-view Features.

Bo Li1, Lijun Cai2, Bo Liao3,4, Xiangzheng Fu5, Pingping Bing6, Jialiang Yang7.   

Abstract

The prediction of protein subcellular localization is critical for inferring protein functions, gene regulation and protein-protein interactions. With advances in high-throughput sequencing technologies and proteomic methods, the protein sequences of numerous yeasts have become publicly available, which enables us to computationally predict yeast protein subcellular localization. However, widely-used protein sequence representation techniques, such as amino acid composition and Chou's pseudo amino acid composition (PseAAC), struggle to extract adequate information about the interactions between residues and the position distribution of each residue. Therefore, it is still urgent to develop novel sequence representations. In this study, we present two novel protein sequence representation techniques: Generalized Chaos Game Representation (GCGR), based on the frequency and distribution of residues in the protein primary sequence, and novel statistics and information theory (NSI), reflecting local position information of the sequence. In the GCGR + NSI representation, a protein primary sequence is represented by just a 5-dimensional feature vector, while other popular methods such as PseAAC and dipeptide composition use features with hundreds of dimensions or more. In practice, the feature representation is highly efficient in predicting protein subcellular localization. Even without machine learning-based classifiers, a simple model based on the feature vector achieves prediction accuracies of 0.8825 and 0.7736 for the CL317 and ZW225 datasets, respectively. To further evaluate the effectiveness of the proposed encoding schemes, we introduce a multi-view feature-based method that combines the two above-mentioned features with other well-known features, including PseAAC and dipeptide composition, and uses a support vector machine as the classifier to predict protein subcellular localization.
This novel model achieves prediction accuracies of 0.927 and 0.871 for the CL317 and ZW225 datasets respectively, better than other existing methods in the jackknife tests. The results suggest that the GCGR and NSI features are useful complements to popular protein sequence representations in predicting yeast protein subcellular localization. Finally, we validate a few newly predicted protein subcellular localizations with evidence from published articles in authoritative journals and books.

Keywords:  generalized chaos game representation; protein primary sequence; protein subcellular localization; statistical method; support vector machine; unitary distance

Year:  2019        PMID: 30845684      PMCID: PMC6429470          DOI: 10.3390/molecules24050919

Source DB:  PubMed          Journal:  Molecules        ISSN: 1420-3049            Impact factor:   4.411


1. Introduction

Assigning subcellular localizations to a protein is a significant step toward elucidating its interaction partners, functions and potential roles in the cellular machinery [1,2]. However, experimental methods to determine subcellular localization usually involve immunolabelling or tagging, which can be laborious and time-consuming [1,3,4,5]. With the development of high-throughput genomic and proteomic sequencing techniques, an increasing number of protein sequences have been sequenced and cataloged in the protein data banks. There is thus an urgent need for effective and efficient computational methods to predict protein subcellular localizations, especially for species like yeast. Typical computational methods to predict protein subcellular localizations consist of two steps: (1) protein sequence representation, in which each primary protein sequence is transformed into a numerical feature vector; and (2) protein classification, in which a classification model is trained on the feature vectors and labels of the training samples. Currently, there are three general categories of sequence representation methods: (1) amino acid composition-based methods, which calculate the occurrence frequencies of the 20 amino acids but ignore the sequence-order information of each residue; (2) Chou's Pseudo Amino Acid Composition (PseAAC)-based methods [6,7], which model not only the amino acid composition but also the interactions among adjacent residues, and which achieve roughly a 20% higher prediction accuracy than amino acid composition-based methods; (3) hybrid methods that integrate features from multiple views, which usually increase prediction accuracy [8,9,10].
After the sequence features were constructed, various classifiers, including the covariant discriminant classifier (CDC) [10,11], nearest neighbor (NN) [12,13], support vector machine (SVM) [14], deep learning [15] and ensemble classifiers [16,17], were adopted to predict protein subcellular localization. Over the past decades, significant progress has been made in developing efficient protein sequence representations and subsequent classifiers. For example, Zhang et al. introduced several amino acid hydrophobic patterns and the average power-spectral density to define a modified PseAAC; based on these features, they predicted protein subcellular localization by employing the covariant discriminant predictor [3]. Liao et al. attempted to identify protein subcellular locations based on amino acid composition components and adjacent triune residues [6]. Chen et al. utilized the measure of diversity and the increment of diversity on protein primary sequences [18]. Ding et al. represented apoptosis protein sequences by a novel approximate entropy (ApEn)-based PseAAC and employed an ensemble classifier model as the prediction engine, whose base classifier is the fuzzy K-nearest neighbor [16]. Lin et al. refined the PseAAC based on the physico-chemical characteristics of the 20 amino acids and adopted SVM to predict protein subcellular locations [19]. Zhang et al. introduced the concept of distance frequency to capture the positional distribution information of amino acids and also adopted SVM to classify proteins [2]. More recently, Yu et al. implemented the CELLO2GO web server (http://cello.life.nctu.edu.tw/cello2go/), which provides a protein subcellular location prediction service based on functional Gene Ontology annotation [20]. Wan et al. introduced a multi-label subcellular-localization predictor named HybridGO-Loc that leverages not only GO term occurrences but also the inter-term relationships [21]. Dehzangi et al.
proposed two segmentation-based feature extraction methods to explore potential local evolutionary information for Gram-positive and Gram-negative subcellular localizations [22]. Finally, Shao et al. employed a deep model-based descriptor (DMD) to extract high-level features from protein images, which proved useful for determining the subcellular localization of proteins [23]. However, due to the limitations of feature representation schemes and the relatively low accuracy of classification algorithms, most current algorithms still cannot be widely employed in real applications. To address this problem, we first introduce two novel feature representations based on Generalized Chaos Game Representation (GCGR) and novel statistics and information theory (NSI), respectively. Using the two types of features, we developed a prediction model based on unitary distances. Our experiments indicate that the model can quickly and accurately predict subcellular localizations for yeast even without machine learning classifiers. To further evaluate the effectiveness of the proposed new features, we constructed a multi-view feature by combining them with well-known features such as PseAAC and dipeptide composition, which was fed into an SVM classification system. We then tested the performance of the proposed features and models on two yeast benchmark datasets and compared them with several popular methods using the jackknife test.

2. Results and Discussions

Table 1 and Table 2 list the prediction results of the proposed model and other existing models under the jackknife test on CL317 and ZW225, respectively. As can be seen, our model achieved overall prediction accuracies of 0.8825 and 0.7736 on CL317 and ZW225, respectively. The performance on CL317 outperforms some existing methods, such as Wei et al. [15] (accuracy 0.827) and Zhang et al. [24] (accuracy 0.88). The improvement is notable considering that we only used a 2-D GCGR feature and a 3-D NSI feature, while other methods combined features such as the 20-D amino acid composition and the 400-D dipeptide composition. We further tested the performance of combining GCGR and NSI with other widely-recognized features, including pseudo-amino acid composition (PseAAC) and dipeptide composition (Dipeptide). Specifically, we applied three models, (1) PseAAC alone, (2) the fusion of PseAAC and Dipeptide, and (3) the fusion of PseAAC, Dipeptide, GCGR and NSI, to the CL317 dataset. Their prediction results under the jackknife test are summarized in Figure 1.
Table 1

The prediction results of dataset CL317 using unitary distance based on GCGR + NSI features in the jackknife test.

        Cy     Me     Nu     En     Mi     Se
Sn (%)  91.8   85.3   86.5   93.7   83.3   74.5
Sp (%)  86.2   99.6   91.9   86.2   91.5   90.9
MCC     0.83   0.83   0.86   0.88   0.86   0.81
Acc     0.8825 (overall)
Table 2

The prediction results of dataset ZW225 using unitary distance based on GCGR + NSI features in the jackknife test.

      Me      Cy      Nu      Mi
Sn    0.6617  0.8286  0.88    0.8439
Sp    0.7792  0.7432  0.9383  0.8841
MCC   0.6863  0.6789  0.9115  0.7745
Acc   0.7736 (overall)
Figure 1

The prediction results on CL317 using the support vector machine algorithm with different combinations of features.

As Figure 1 depicts, the model combining all features achieved much higher prediction accuracy than the others, indicating that: (1) feature fusion techniques are promising for improving prediction accuracy, since a single-view feature reflects only part of the information of a protein sequence; (2) the two features GCGR and NSI serve as helpful complements to features such as PseAAC and Dipeptide, which also reveals the effectiveness of the two novel feature representation techniques. To further evaluate the efficiency of the feature fusion technique and improve protein subcellular location prediction accuracy, we introduced the final multi-view-based model, in which the feature vector of each protein was represented by concatenating numerical vectors from GCGR, NSI, PseAAC and Dipeptide. In addition, SVM was selected as the classifier. Comparisons with other existing models under the jackknife test on CL317 and ZW225 are shown in Table 3 and Table 4, respectively.
Table 3

Comparison of prediction performance for CL317 in the jackknife test.

Predictor    MCC (Cy)  MCC (Me)  MCC (Nu)  MCC (En)  MCC (Mi)  MCC (Se)  Acc
[15]         0.80      0.77      0.73      0.90      0.74      0.68      0.827
[23]         0.87      0.90      0.86      0.95      0.86      0.80      0.909
[24]         0.84      0.85      0.84      0.91      0.77      0.80      0.88
[25]         0.89      0.88      0.87      0.95      0.88      0.78      0.911
[6]          0.946     0.909     0.885     0.957     0.882     0.706     0.912
This paper   0.896     0.913     0.929     0.892     0.853     0.905     0.921
Table 4

Comparison of prediction performance for ZW225 in the jackknife test.

Predictor    MCC (Me)  MCC (Cy)  MCC (Nu)  MCC (Mi)  Acc
[3]          0.933     0.90      0.634     0.60      0.831
[15]         0.91      0.929     0.732     0.68      0.858
[24]         0.921     0.871     0.732     0.64      0.84
[6]          0.91      0.871     0.756     0.72      0.849
This paper   0.909     0.892     0.867     0.778     0.889
As can be seen, our integration model achieved the highest overall accuracies, namely 0.921 and 0.889 on CL317 and ZW225, respectively. There are two indications: (1) our proposed protein sequence feature representations, GCGR and NSI, both contain valuable information, such as concentrated local information, that was not covered by previous features; (2) the integration of multiple informative features can improve prediction performance. In addition, our model achieved the highest MCCs for most of the subcellular location classes on CL317, except for Cy and Me. Moreover, the MCCs and the Accs of the classes Nu and Mi for our model are much better than those of other existing methods on ZW225. Finally, we searched authoritative journals and publications for further validation of the predicted subcellular locations of some proteins, and found that some of them have already been validated by experiments. For example, we predicted that the protein YHR196W localizes to the nucleolus, which has been reported by more than 20 publications, such as Eswara et al. [26] and Polymenis et al. [27]. We also predicted Sec17p to be localized in the cytoplasm and the endoplasmic reticulum, consistent with Aouida et al. [28]. Thus, our model is effective in screening out potential protein subcellular locations for further experimental validation. To summarize: first, the new simple unitary distance-based method is comparable to many methods in prediction accuracy; second, the proposed new perspectives (GCGR and NSI) truly contain valuable information from the protein primary sequence and can serve as complements to existing feature representations; third, the multi-feature-based model improves prediction accuracy notably and thus can help biologists determine protein subcellular locations. However, we are fully aware that there are several limitations in this study.
First of all, we only used the averages of the x- and y-coordinates of the points in the GCGR plot, which may retrieve only part of the information in the plot. An immediate option is to try other statistics of the GCGR plot, such as the median and percentiles. Second, the biological interpretation underlying the effectiveness of the features is not fully clear. Third, the current version of the software is not very user-friendly. In the future, we plan to offer an online web service so that more biologists can use the software. We will also try to use parallel algorithms for dealing with large-scale eukaryotic data, including human data.

3. Materials and Methods

3.1. Datasets

In this paper, two yeast datasets, CL317 and ZW225, are used for comparing different prediction models. The CL317 dataset was collected by Chen and Li [18]. The original 846 proteins, each explicitly annotated with one subcellular location, were derived from SWISS-PROT (version 49.0) of the European Bioinformatics Institute, Hinxton, Cambridge, United Kingdom (www.ebi.ac.uk/swissprot) [25]. Since short sequences are more likely to be homologous and it is difficult to extract enough information from them, we removed proteins with fewer than 80 residues, following Chen and Li [18]. The remaining dataset contains 317 apoptosis proteins belonging to six subcellular locations, namely cytoplasmic (Cy), membrane (Me), nuclear (Nu), endoplasmic reticulum (En), mitochondrial (Mi) and secreted (Se), with 112, 55, 52, 47, 34 and 17 proteins, respectively. ZW225 was curated by Zhang and Wang [24] and includes 225 proteins in four subcellular locations: 89 membrane proteins, 70 cytoplasmic proteins, 41 nuclear proteins and 25 mitochondrial proteins. The proteins were extracted from SWISS-PROT (version 50.3) using the same rules as for CL317.

3.2. Generalized Chaos Game Representation (GCGR) of Protein Primary Sequences

The chaos game representation (CGR) was initially introduced to visualize DNA sequences [29] and later for protein sequences as well [30]. Here, we further developed a generalized chaos game representation (GCGR) to represent a protein sequence by a 2-dimensional numerical feature vector describing the frequency of 20 amino acids and their neighbor information in the sequence. The construction of GCGR consists of three steps:

3.2.1. Step 1: Convert a Protein Sequence into a Sequence on an Alphabet of Size 6

We converted the 20 amino acids into six groups (Table 5). Specifically, Proline (P), Glycine (G) and Cysteine (C) formed three separate groups because of their unique backbone properties. The remaining 17 amino acids were classified into the other three groups according to their hydropathy scale including strongly hydrophilic (denoted by H), strongly hydrophobic (L), and weakly hydrophilic or weakly hydrophobic (S) [31]. As a result, each primary protein sequence could be uniquely represented by a string on the alphabet {H, L, S, P, G, C}. For example, the protein sequence “YAMQESHFTCI” can be represented by “SLLHHSHLSCL” according to Table 5.
Table 5

The six classes of the 20 amino acids.

Classification                                         Abbreviation   Amino Acids
Strongly hydrophilic or polar                          H              H, R, D, E, N, Q, K
Strongly hydrophobic                                   L              L, I, V, A, M, F
Weakly hydrophilic or weakly hydrophobic (ambiguous)   S              S, T, Y, W
Proline                                                P              P
Glycine                                                G              G
Cysteine                                               C              C
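The six-letter reduction can be sketched directly from Table 5; the small example below reproduces the "YAMQESHFTCI" → "SLLHHSHLSCL" conversion described in the text.

```python
# Sketch of the six-class reduction from Table 5.
GROUPS = {
    "H": "HRDENQK",  # strongly hydrophilic / polar
    "L": "LIVAMF",   # strongly hydrophobic
    "S": "STYW",     # weakly hydrophilic or weakly hydrophobic (ambiguous)
    "P": "P",        # proline
    "G": "G",        # glycine
    "C": "C",        # cysteine
}
# Invert the table: one entry per amino acid, 20 in total.
AA_TO_CLASS = {aa: label for label, members in GROUPS.items() for aa in members}

def reduce_sequence(seq: str) -> str:
    """Map a protein sequence onto the {H, L, S, P, G, C} alphabet."""
    return "".join(AA_TO_CLASS[aa] for aa in seq.upper())

print(reduce_sequence("YAMQESHFTCI"))  # -> SLLHHSHLSCL
```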

3.2.2. Step 2: Construct the GCGR Plot

Firstly, we drew a regular hexagon, in which each vertex is associated with a distinct label from H, L, S, P, G and C, and each edge is of unit length. Then, for each sequence encoded in the first step, we plotted its letters sequentially as points inside the hexagon as follows: the first point, corresponding to the first letter of the encoded sequence, was placed at the center of the hexagon; and the i-th point, corresponding to the i-th letter, was placed at the midpoint between the (i−1)-th point and the hexagon vertex labeled with the i-th letter. The resulting plot is called the GCGR of the primary sequence. As examples, we plotted in Figure 2 the GCGRs of six representative proteins, each belonging to a different subcellular location. From the six GCGR figures, we can directly retrieve some valuable information: for proteins in the Cy and Nu classes, the plotted points are close to vertices H and L; points of the protein in the Me class are uniformly distributed around all the vertices except for C; proteins in the Nu and En classes have fewer points around vertices G, C and P, G, respectively; proteins in the last two classes are almost uniformly distributed. In short, proteins in different subcellular locations are distributed differently in the GCGR plots.
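The construction above can be sketched in a few lines. The vertex ordering around the hexagon and its placement in the plane (here centered at the origin with unit circumradius, which gives unit edge length) are assumptions, so the absolute coordinate values differ from the ranges quoted later for the CL317 proteins; only the relative geometry matters for the sketch.

```python
import math

# Vertices of a regular hexagon with unit edge length, centered at the origin.
# The label-to-vertex assignment is an assumption for illustration.
LABELS = "HLSPGC"
VERTEX = {lab: (math.cos(math.pi * k / 3), math.sin(math.pi * k / 3))
          for k, lab in enumerate(LABELS)}

def gcgr_points(encoded: str):
    """First point at the hexagon center; each subsequent point is the
    midpoint between the previous point and the vertex of the current letter."""
    x, y = 0.0, 0.0
    pts = [(x, y)]
    for lab in encoded[1:]:
        vx, vy = VERTEX[lab]
        x, y = (x + vx) / 2.0, (y + vy) / 2.0
        pts.append((x, y))
    return pts

def gcgr_feature(encoded: str):
    """The 2-D GCGR feature used in Step 3: means of the x- and y-coordinates."""
    pts = gcgr_points(encoded)
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)
```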
Figure 2

The GCGRs of the primary sequences of proteins from six subcellular locations. GCGR: Generalized Chaos Game Representation.

3.2.3. Step 3: Convert Each Protein Sequence into a 2-D Vector according to its GCGR Plot

As can be seen from Figure 2, each letter in the protein sequence corresponds to an (x, y)-coordinate in the GCGR plot. We then modelled the GCGR plot as a combination of two series: one composed of the x-coordinates and the other of the y-coordinates, named the x-series and y-series, respectively. As can be seen from Figure 3 and Figure 4, there are several useful observations: (1) the average values of the x-series and y-series for proteins in the En class, denoted as x̄ and ȳ respectively, tend to be greater than those for proteins in the other classes; (2) proteins in the first class Cy also have a large x̄, but do not have a large ȳ; (3) proteins in the last two classes Mi and Se have moderate x̄ and ȳ, respectively.
Figure 3

Six time series that represent the first three GCGRs in Figure 2. Each panel in Figure 2 gives rise to two time series.

Figure 4

Six time series that represent the last three GCGRs in Figure 2. Each panel in Figure 2 gives rise to two time series.

However, unlike proteins in the Se class, those in the Mi class have an x̄ greater than their ȳ; (4) the x̄ of proteins in the second class Me is the smallest among all classes. In summary, x̄ and ȳ are two effective numerical features for identifying the subcellular location of proteins. This is not surprising, since x̄ and ȳ contain not only information about amino acid frequencies, but also about their order in a protein sequence. For a better view, we also drew in Figure 5 the boxplots of x̄ and ȳ for proteins in each of the six classes in CL317.
Figure 5

The boxplots of x̄ and ȳ for all the proteins in dataset CL317, grouped into the six subcellular locations.

Theoretically, a class with a narrow variation scope and fewer outliers can be discriminated more robustly. As can be seen, the proteins in the Nu class have substantially narrower variation, with x̄ ranging approximately from 1.22 to 1.26 and ȳ approximately from 1.23 to 1.28. For En, though x̄ is widely distributed, ȳ is more centralized, which can be used to differentiate this class. Similarly for Se, though ȳ is widely distributed, x̄ is more centralized. Finally, it is of note that all six classes have different medians and relatively differentiable variation scopes for both x̄ and ȳ. Therefore, by combining x̄ and ȳ, it is possible to predict the localization of most proteins.

3.3. Novel Statistics and Information Theory (NSI) of Protein Primary Sequences

In order to acquire more position information from the primary sequence, we presented a novel statistics- and information-theory-based method to extract features from the protein primary sequence. Different from the previous section, here we classified the 20 amino acids into just three groups according to their hydropathy profiles [21]: (1) the internal group, whose residues tend to appear on the inner side of the protein spatial structure; (2) the external group, whose residues tend to occur at the surface; and (3) the ambivalent group, whose residues have no fixed preferred position. A protein sequence can then be transformed into a three-letter string by replacing each residue with the letter of its group. After that, we calculated position features of the encoded sequence to represent its local information. Specifically, let P_k be the position sequence of a given encoded letter k, i.e., the ordered list of positions at which k occurs. We calculated the intervals between consecutive positions in P_k, which form a numerical distance sequence I_k; the symbolic sequences are treated as cyclic, so the gap from the last occurrence back around to the first is included as well. Clearly, I_k contains the positional and distribution information of the letter k in the primary sequence; for instance, the position sequences of the reduced letters F, D and S each yield such a distance sequence. The distance sequence provides a new profile to characterize the correlation of residues in the given sequence. In fact, the interval distance between two occurrences of k can be modelled by a random variable x, whose probability distribution p(x) we estimate from I_k.
Based on probability theory, we can further calculate the mean value and the variance by E(x) = Σ_x x·p(x) and D(x) = Σ_x (x − E(x))²·p(x). Then, we defined the positional information of letter k as the coefficient of variation CV_k = √D(x) / E(x), which is a pivotal statistic for comparing the degree of variation from one data set to another. Finally, the three values CV_k, one per encoding letter, characterize the positional information of the sequence and form a novel 3-dimensional feature vector of a protein primary sequence.
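A minimal sketch of the NSI computation, assuming the coefficient of variation as the positional-information statistic and taking as input a sequence already reduced to the three-letter alphabet (here {F, D, S}, the letters used in the text's example; the 20-to-3 residue mapping itself is not reproduced in this excerpt).

```python
import math

def interval_cv(reduced: str, letter: str) -> float:
    """Coefficient of variation of the cyclic gaps between occurrences
    of `letter` in the reduced sequence."""
    pos = [i for i, ch in enumerate(reduced) if ch == letter]
    if len(pos) < 2:
        return 0.0
    gaps = [b - a for a, b in zip(pos, pos[1:])]
    # Treat the sequence as cyclic: include the wrap-around gap.
    gaps.append(len(reduced) - pos[-1] + pos[0])
    mean = sum(gaps) / len(gaps)
    var = sum((g - mean) ** 2 for g in gaps) / len(gaps)
    return math.sqrt(var) / mean

def nsi_feature(reduced: str, alphabet: str = "FDS"):
    """3-D NSI vector: one coefficient of variation per reduced letter."""
    return [interval_cv(reduced, k) for k in alphabet]
```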

3.4. Unitary Distance

In this article, rather than the serial combination method, which concatenates different feature vectors into a super-vector, we used the parallel combination method, which combines two feature vectors X and Y into a complex vector [32] defined by Z = X + iY, where i is the imaginary unit. The parallel combined feature space is thus an m-dimensional complex vector space C, where m is the larger of the two dimensionalities. The inner product of two vectors u and v in this complex space is given by (u, v) = u^H v, where H denotes the conjugate transpose. A complex vector space equipped with such an inner product is usually called a unitary space. The norm in a unitary space is given by ‖u‖ = √(u^H u), and the unitary distance between two complex vectors u and v is then calculated as d(u, v) = ‖u − v‖ = √((u − v)^H (u − v)).
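Under the definitions above, the parallel combination and the unitary distance can be sketched in pure Python, with the zero-padding of the shorter vector described in Section 3.5:

```python
import math

def parallel_combine(x, y):
    """Combine two real feature vectors into a complex vector Z = X + iY,
    zero-padding the shorter vector first so dimensionalities match."""
    m = max(len(x), len(y))
    x = list(x) + [0.0] * (m - len(x))
    y = list(y) + [0.0] * (m - len(y))
    return [complex(a, b) for a, b in zip(x, y)]

def unitary_distance(u, v):
    """Norm of u - v under the inner product (a, b) = a^H b,
    i.e. sqrt(sum |u_i - v_i|^2)."""
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(u, v)))
```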

3.5. Performance Assessment

For evaluating the effectiveness of the two proposed features GCGR and NSI, we first introduced a simple model for fast prediction of protein subcellular location, described as follows: for a given protein, its numeric features from GCGR and NSI are first extracted; these two features are then combined in parallel and classified by a classifier-free prediction model. As is well acknowledged, all prediction models in the real number space can be extended to the complex number space by using different similarity measures. For example, the Euclidean distance is a commonly-used similarity metric in the real space, while the unitary distance, adopted in this paper, is often used in the complex space. Note that the dimensionalities of u and v in the complex space must be equal, so the lower-dimensional vector is padded with zeros until its dimensionality matches the higher-dimensional one before the vectors are combined. In order to measure the predictive capability of the algorithm, we adopted the following commonly used measures: Sn = TP/(TP + FN), Sp = TN/(TN + FP), Acc = (TP + TN)/N and MCC = (TP·TN − FP·FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)), where True Positive (TP) is the number of true positives for a subcellular location, True Negative (TN) the number of true negatives, False Positive (FP) the number of false positives, False Negative (FN) the number of false negatives, and N the total number of protein sequences. Sensitivity measures the rate of correctly predicted positives, and specificity measures the reliability of the predictive model. The Matthew's correlation coefficient (MCC) summarizes the comprehensive performance of the prediction algorithm. In this paper, all experiments were implemented in MatLab with the Library for Support Vector Machines (LIBSVM). We chose the jackknife test to validate the performance of each model.
The jackknife test drops each sample in turn from the data set as the test sample and fits the model on the remaining observations as the training samples. The prediction accuracy is obtained as the number of correctly classified samples divided by the total number of samples. It is worth noticing that, although there is no parameter in our feature construction process, there are two model parameters for LIBSVM, namely c and g. We adopted a grid search to find the best c and g in the jackknife test. Specifically, c and g both varied from 2^−5 to 2^5 by factors of 2; the best values are c = 8 and g = 0.0625 for the dataset CL317, and c = 16 and g = 0.125 for the dataset ZW225.
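The four measures named above are the standard per-class definitions; writing them out makes the reconstruction explicit (the export elided the formulas themselves).

```python
import math

def sensitivity(tp: int, fn: int) -> float:
    """Sn = TP / (TP + FN): rate of correctly predicted positives."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """Sp = TN / (TN + FP): reliability on negatives."""
    return tn / (tn + fp)

def accuracy(tp: int, tn: int, n: int) -> float:
    """Acc = (TP + TN) / N over all N sequences."""
    return (tp + tn) / n

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthew's correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```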

4. Conclusions

In this study, we first proposed a novel and fast method for predicting yeast subcellular locations, based on a generalized chaos game representation of the protein primary sequence and on statistics and information theory that uncover the residue distribution along the sequence. Experiments on two benchmark yeast datasets suggest that this model achieves classification performance comparable to that of machine learning-based classifiers. In addition, a fusion model incorporating GCGR and NSI together with well-known features, including PseAAC and dipeptide composition, was presented, which attains the highest overall accuracy and MCC on the two benchmark datasets. The results also indicate that the newly extracted features contain useful information that was not mined by previous methods.
References (26 in total; first 10 shown):

1.  Prediction of protein subcellular locations by incorporating quasi-sequence-order effect.

Authors:  K C Chou
Journal:  Biochem Biophys Res Commun       Date:  2000-11-19       Impact factor: 3.575

2.  Using functional domain composition and support vector machines for prediction of protein subcellular location.

Authors:  Kuo-Chen Chou; Yu-Dong Cai
Journal:  J Biol Chem       Date:  2002-08-16       Impact factor: 5.157

3.  The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003.

Authors:  Brigitte Boeckmann; Amos Bairoch; Rolf Apweiler; Marie-Claude Blatter; Anne Estreicher; Elisabeth Gasteiger; Maria J Martin; Karine Michoud; Claire O'Donovan; Isabelle Phan; Sandrine Pilbout; Michel Schneider
Journal:  Nucleic Acids Res       Date:  2003-01-01       Impact factor: 16.971

4.  Application of pseudo amino acid composition for predicting protein subcellular location: stochastic signal processing approach.

Authors:  Yu-Xi Pan; Zhi-Zhou Zhang; Zong-Ming Guo; Guo-Yin Feng; Zhen-De Huang; Lin He
Journal:  J Protein Chem       Date:  2003-05

5.  A new method for identification of protein (sub)families in a set of proteins based on hydropathy distribution in proteins.

Authors:  Josef Pánek; Ingvar Eidhammer; Rein Aasland
Journal:  Proteins       Date:  2005-03-01

6.  Prediction of the subcellular location of apoptosis proteins.

Authors:  Ying-Li Chen; Qian-Zhong Li
Journal:  J Theor Biol       Date:  2006-11-21       Impact factor: 2.691

7.  A novel method for apoptosis protein subcellular localization prediction combining encoding based on grouped weight and support vector machine.

Authors:  Zhen-Hui Zhang; Zheng-Hua Wang; Zhen-Rong Zhang; Yong-Xian Wang
Journal:  FEBS Lett       Date:  2006-10-17       Impact factor: 4.124

8.  Prediction of subcellular protein localization based on functional domain composition.

Authors:  Peilin Jia; Ziliang Qian; Zhenbin Zeng; Yudong Cai; Yixue Li
Journal:  Biochem Biophys Res Commun       Date:  2007-04-02       Impact factor: 3.575

9.  Fast Fourier transform-based support vector machine for subcellular localization prediction using different substitution models.

Authors:  Zhimeng Wang; Lin Jiang; Menglong Li; Lina Sun; Rongying Lin
Journal:  Acta Biochim Biophys Sin (Shanghai)       Date:  2007-09       Impact factor: 3.848

10.  STEM: a tool for the analysis of short time series gene expression data.

Authors:  Jason Ernst; Ziv Bar-Joseph
Journal:  BMC Bioinformatics       Date:  2006-04-05       Impact factor: 3.169

