QuaBingo: A Prediction System for Protein Quaternary Structure Attributes Using Block Composition.

Chi-Hua Tung1, Chi-Wei Chen2, Ren-Chao Guo2, Hui-Fuang Ng3, Yen-Wei Chu4.   

Abstract

Background. The quaternary structures of proteins are closely related to gene regulation, signal transduction, and many other biological functions. In the current study, we propose a new feature extraction method based on the composition of conserved protein motifs in block format, termed block composition. Results. QuaBingo, a prediction system for protein quaternary assembly states that combines blocks with functional domain composition, is built from three layers of classifiers and categorizes the quaternary structural attributes of monomers, homooligomers, and heterooligomers. The first layer uses support vector machines (SVM) based on the blocks and functional domains of proteins, the second layer uses SVM to process the outputs of the first layer, and the final result is determined by a Random Forest in the third layer. We compared the effectiveness of combinations of block composition, functional domain composition, and pseudoamino acid composition in the model. Across 11 kinds of functional protein families, the Matthews Correlation Coefficient (MCC) of QuaBingo is 23% higher than that of the existing prediction system. The results also reveal the biological characterization of the top five block compositions. Conclusions. QuaBingo provides better predictive ability for the quaternary structural attributes of proteins.

Year:  2016        PMID: 27610389      PMCID: PMC5005774          DOI: 10.1155/2016/9480276

Source DB:  PubMed          Journal:  Biomed Res Int            Impact factor:   3.411


1. Background

Proteins are responsible for a vast range of cellular functions, including biological synthesis, enzyme catalysis, and the transport of molecules, and their specific functions are closely associated with their molecular structure. Protein structure is described at four levels, from primary to quaternary. Many important biological functions require protein monomers to assemble into oligomeric proteins or higher order multimeric proteins. The concept of protein quaternary structure was first presented by Bernal in 1958 [1, 2], who found that the composition and structure of some proteins were more complicated than those of others. These proteins were shown to be composed of several protein subunits that together form a biological macromolecule. The subunits of a quaternary structure are held together by noncovalent bonds, and the structure can be classified according to the type of subunit. If the protein complex consists of identical subunits, it is called a homooligomer; otherwise it is referred to as a heterooligomer. Complexes can also be classified by the number of subunits into dimers, trimers, tetramers, and so forth [3]. Examples include (1) insulin, which can form a homodimer; (2) tumor necrosis factor-α, which forms a tight trimer; and (3) human hemoglobin, a heterotetramer with two identical α subunits and two identical β subunits. An excellent review summarizes what is known about the biological functions of nonhomologous homodimeric and heterodimeric complexes [4]. For example, thymidylate synthase, a homodimeric protein, is highly conserved among distant species. The structure of a thymidylate synthase complex (PDB ID: 4EB4) revealed an asymmetrical conformation of two homodimers. The closed and open forms of a molecule of the complex dimer may affect the ligand binding strength [5].
In addition, HIV-1 reverse transcriptase is a well-known drug target for treating HIV infections (PDB ID: 3HVT) [6]. Heterodimerization of its P66 and P51 subunits is required for DNA polymerase activity. Although there has been significant progress in the analysis of protein structure with various experimental approaches, determining protein structure experimentally is typically expensive and time-consuming. Consequently, it is necessary to develop a prediction system for protein quaternary assembly states that enables the analysis of protein structure and function using the current and rapidly increasing amount of sequence data. In previous studies, Garian predicted homodimers and nonhomodimers using a decision tree and an amino acid composition method that integrated AAindex. Zhang utilized support vector machines (SVM) and a weighted autocorrelation function in an attempt to identify the key features from the amino acid composition. These studies demonstrated that the primary structure does carry the information needed for quaternary structure formation [7, 8]. However, the general feature encoding method of amino acid composition loses much important protein sequence information, such as the physical and chemical properties of amino acids. Therefore, pseudoamino acid composition (PseAAC) was used to predict quaternary structure. This feature not only incorporates the sequence order effect but also reflects hydrophobic and hydrophilic properties [9]. Zhang et al. extended PseAAC into a sequence-segmented PseAAC that combines segments of the protein sequence with domain relationships in an effort to improve prediction results [10]. In recent years, functional domain composition was introduced from an evolutionary and functional perspective, because proteins that share similar domain structures often have similar functions [11-13].
This method is suitable for quaternary structure classification problems with multiple categories and can greatly improve prediction performance. However, a disadvantage is that some proteins may not contain any known functional domain. In addition, for some proteins the known functional domains are too few to represent them, so a classifier cannot learn effectively. These problems arise because the current databases are still incomplete. The objective of this study is to construct an accurate prediction system for protein quaternary structure attributes. Although previous studies have shown that functional domain composition achieves high prediction accuracy, the method has the problems noted above, which need to be overcome. Accordingly, we attempt to improve on this feature extraction method with block composition, which represents protein characteristics on the basis of homologous regions of protein sequences. Since protein interaction binding sites usually have more surface area and a higher solvent accessibility of exposed hydrophobic residues, we also combine amino acid solvent accessibility information with pseudoamino acid composition to calculate the sequence order effect. The system is a three-layer classifier framework. The first layer identifies the structure type of an unknown protein sequence, that is, monomer, dimer, trimer, tetramer, pentamer, hexamer, octamer, decamer, or dodecamer. The per-class results of the first layer then serve as input to the second layer classifier, which integrates the different features, weighing the strengths and weaknesses of each protein feature, to enhance prediction accuracy. Finally, the third layer classifier determines the structure type of the query protein.
Cross-validation results show that block composition gives the best predictions. Specifically, its overall average prediction accuracy exceeds 90% with each class filtered at 60% sequence similarity. Functional domain composition and PseASA are lower by about 10% and 20%, respectively. The results show that block composition is able to identify quaternary structure assembly states effectively. In addition, performance analysis of different types of functional proteins revealed that QuaBingo exhibits superior predictive ability for enzymes, gene regulation, signal transduction, molecular binding, and other important proteins. An online web server is freely available at http://predictor.nchu.edu.tw/QuaBingo/.

2. Methods

2.1. Compilation of Datasets

The protein oligomer sequences used in this study come from 3D Complex [14], a classification database of protein quaternary structures. This database provides protein structures, structure types, symmetry patterns, and other relevant information. We retrieved the homo- and heterooligomers of each class from 3D Complex and used the corrected number of subunits to construct the database. The sequences were processed in the following steps: (1) remove oligomer sequences shorter than 30 amino acids; (2) remove sequences containing three or more unknown amino acids; and (3) use CD-HIT [15, 16] at a 60% sequence identity threshold to remove redundant sequences and so avoid prediction bias. However, the pentamer, octamer, decamer, and dodecamer classes were processed with CD-HIT at 90% in order to retain enough sequences for statistical significance. The final database contained 8,444 sequences and was named Oli8444. It was employed as the training dataset of the first and second layer classifiers. Specifically, there were 3,273 monomers, 3,658 homooligomers, and 1,513 heterooligomers. Apart from the monomers, the homo- and heterooligomers each have eight subcategories, that is, dimer, trimer, tetramer, pentamer, hexamer, octamer, decamer, and dodecamer. Heptamer and undecamer sequences were not used because too little information is available. The number of sequences in each category is listed in Supplementary Table S1 in Supplementary Material available online at http://dx.doi.org/10.1155/2016/9480276. To obtain more representative sequences of each type, the training data for the third layer classifier were derived from the Oli8444 training set by applying CD-HIT at 40% and by further removing sequences associated with more than one type of oligomer; the resulting dataset was named Oli6926.
However, categories with too few sequences, such as pentamer, octamer, decamer, and dodecamer, were not subjected to the CD-HIT 40% treatment. The independent test set was collected from sequences that were not used for learning on Oli6926.
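The first two filtering steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, unknown residues are assumed to be written as "X", and the CD-HIT redundancy step is not reproduced here.

```python
# Minimal sketch of filters (1) and (2); names and the "X" symbol are assumptions.
MIN_LENGTH = 30        # sequences shorter than 30 residues are removed
UNKNOWN_LIMIT = 3      # sequences with three or more unknown residues are removed


def keep_sequence(seq, unknown="X"):
    """Return True if a sequence passes the length and unknown-residue filters."""
    seq = seq.upper()
    if len(seq) < MIN_LENGTH:
        return False
    if seq.count(unknown) >= UNKNOWN_LIMIT:
        return False
    return True


def filter_sequences(records):
    """Keep only the (name -> sequence) entries that pass both filters."""
    return {name: seq for name, seq in records.items() if keep_sequence(seq)}


# Toy records, invented for illustration only.
example = {
    "too_short": "MKT" * 5,                      # 15 residues
    "too_many_unknown": "A" * 40 + "XXX",        # 3 unknown residues
    "ok": "ACDEFGHIKLMNPQRSTVWY" * 2 + "X",      # 41 residues, 1 unknown
}
kept = filter_sequences(example)
print(sorted(kept))
```

Redundancy removal at 60%, 90%, or 40% identity would follow as a separate step on the sequences that survive these filters.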

2.2. Block Composition (Block)

A motif is a small and highly conserved sequence in the secondary structure, which is usually associated with protein function; a protein may contain multiple motifs. The Blocks database [17-19] is a protein motif database, based on SWISS-PROT and Prosite, that contains ungapped multiple alignments of short protein segments with high sequence similarity, called blocks. Because this feature extraction method is based on searching a sequence against the Blocks database, it is termed block composition. The Blocks database currently contains 29,068 protein blocks, so a protein P can be defined as a vector P_Block in a 29,068-dimensional space by (1). If the protein P matches the corresponding block i in the Blocks database, B_i is 1; otherwise it is 0. The rule is defined by (2). One has

\[ P_{\mathrm{Block}} = \left( B_1, B_2, \ldots, B_i, \ldots, B_{29068} \right) \tag{1} \]

\[ B_i = \begin{cases} 1, & \text{if } P \text{ matches block } i \text{ in the Blocks database}, \\ 0, & \text{otherwise}. \end{cases} \tag{2} \]
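As a rough illustration of this encoding, the sketch below turns a set of matched block identifiers into a binary vector. The function name, the toy four-entry index, and the assumption that the search step has already produced a set of matched IDs are all hypothetical; the real index would cover all 29,068 blocks.

```python
def block_composition(hit_ids, block_index):
    """Return a 0/1 vector with a 1 at the position of every matched block."""
    vector = [0] * len(block_index)
    for block_id in hit_ids:
        position = block_index.get(block_id)
        if position is not None:      # ignore IDs that are not in the index
            vector[position] = 1
    return vector


# Toy index over four block IDs that appear in Table 6.
block_index = {"IPB006052A": 0, "IPB006056A": 1, "IPB000817A": 2, "IPB002347D": 3}
hits = {"IPB006052A", "IPB002347D"}   # hypothetical search result for one protein
vector = block_composition(hits, block_index)
print(vector)  # [1, 0, 0, 1]
```

Because each position is only 0 or 1, the vector records which blocks occur in a protein but not how often or where they occur.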

2.3. Functional Domain Composition (FunD)

Proteins usually consist of one or more functional domains. When the same functional domains are discovered in different proteins, this indicates that the proteins may share an evolutionary origin and function. Version v3.10 of CDD [20] contains 44,354 protein domains and families and includes several external source databases (Pfam [21], SMART [22], KOG [23], COG [23], PRK [24], and TIGR [25]). We use a conservative threshold of E-value <0.01 to identify which functional domains are found for a query protein P. The protein P can then be expressed as a feature vector P_FunD in a 44,354-dimensional space by (3). If D_i is 1, the ith domain in CDD is found for P; otherwise it is 0. The rule is defined by (4). One has

\[ P_{\mathrm{FunD}} = \left( D_1, D_2, \ldots, D_i, \ldots, D_{44354} \right) \tag{3} \]

\[ D_i = \begin{cases} 1, & \text{if the } i\text{th domain in CDD is found for } P, \\ 0, & \text{otherwise}. \end{cases} \tag{4} \]
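The role of the E-value threshold can be shown with a short sketch. It assumes that a domain search has already returned (domain ID, E-value) pairs; the function name and the placeholder IDs "domA" to "domD" are hypothetical and do not refer to real CDD entries.

```python
E_VALUE_CUTOFF = 0.01   # conservative threshold stated in the text


def domain_composition(hits, domain_index, cutoff=E_VALUE_CUTOFF):
    """Return a 0/1 vector; a domain counts only if its E-value is below the cutoff."""
    vector = [0] * len(domain_index)
    for domain_id, e_value in hits:
        if e_value < cutoff and domain_id in domain_index:
            vector[domain_index[domain_id]] = 1
    return vector


# Placeholder index; the real one would have 44,354 entries.
domain_index = {"domA": 0, "domB": 1, "domC": 2, "domD": 3}
# Hypothetical search output for one query protein.
hits = [("domA", 1e-20), ("domB", 0.5), ("domC", 0.009)]
fund_vector = domain_composition(hits, domain_index)
print(fund_vector)  # [1, 0, 1, 0]
```

A protein with no hit below the cutoff yields an all-zero vector, which is the situation in which a classifier trained on this feature alone has nothing to learn from.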

2.4. Pseudoamino Acid Composition Based on Solvent Accessibility of Amino Acid (PseASA)

Protein quaternary structure is formed by interactions between two or more polypeptide chains, and these interactions depend on the surfaces of the amino acids in contact with each other. Recent studies of protein hotspots suggest that solvent accessibility constitutes an important feature of protein interactions [26, 27]. Protein binding sites usually have a more exposed hydrophobic area and higher solvent accessibility. Therefore, we apply this feature in encoding pseudoamino acid composition [9], named PseASA, to investigate the effect of the relationship between protein interactions and structure on the prediction system. First, the solvent accessibility of each amino acid is taken from NetSurfP version 1.1 [28] predictions and assigned to an “exposed” or “buried” state. The discontinuous exposed and buried amino acids are then linked into an exposed protein sequence P_E = E_1 E_2 E_3 E_4 E_5 ⋯ E_m and a buried protein sequence P_B = B_1 B_2 B_3 B_4 B_5 ⋯ B_n (m and n are the sequence lengths and may change with the prediction data of different proteins). PseAAC-Builder [29] was used for the feature encoding of pseudoamino acids. However, in view of the overall accuracy of the protein features on the Oli8444 dataset, QuaBingo did not use the PseASA feature (see Section 3; Tables 1 and 2).
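The linking of discontinuous residues into exposed and buried sequences can be sketched as follows. The per-residue labels "E" and "B" and the function name are assumptions for illustration; the actual NetSurfP output format and the subsequent PseAAC-Builder encoding are not reproduced.

```python
def split_by_accessibility(sequence, states):
    """Concatenate exposed and buried residues into two separate sequences.

    `states` holds one label per residue: "E" (exposed) or "B" (buried).
    """
    if len(sequence) != len(states):
        raise ValueError("sequence and states must have the same length")
    exposed = "".join(aa for aa, s in zip(sequence, states) if s == "E")
    buried = "".join(aa for aa, s in zip(sequence, states) if s == "B")
    return exposed, buried


# Toy sequence and labels, invented for illustration only.
seq = "MKTAYIAKQR"
states = "EEBBBEBEEE"
exposed_seq, buried_seq = split_by_accessibility(seq, states)
print(exposed_seq, len(exposed_seq))  # MKIKQR 6
print(buried_seq, len(buried_seq))    # TAYA 4
```

Here the two lengths (6 and 4) play the role of m and n; they differ from protein to protein because they depend on the predicted states.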
Table 1

Performance of using different features with SVM in 10-fold cross-validation for monomer classification on the Oli8444 dataset.

| Feature | Class | Sn (%) | Sp (%) | ACC (%) | MCC |
|---|---|---|---|---|---|
| Block | Monomer | 79.07 | 78.75 | 78.91 | 0.579 |
| FunD | Monomer | 93.68 | 57.80 | 75.75 | 0.552 |
| PseASA | Monomer | 70.15 | 56.33 | 63.24 | 0.268 |

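Table 2

Performance of using different features with SVM in 10-fold cross-validation for homo- and heterooligomer classification on the Oli8444 dataset.

| Feature | Class | Homooligomer Sn (%) | Homooligomer Sp (%) | Homooligomer ACC (%) | Homooligomer MCC | Heterooligomer Sn (%) | Heterooligomer Sp (%) | Heterooligomer ACC (%) | Heterooligomer MCC |
|---|---|---|---|---|---|---|---|---|---|
| Block | Dimer | 83.18 | 82.83 | 83.00 | 0.660 | 66.30 | 97.00 | 86.73 | 0.697 |
| Block | Trimer | 89.25 | 99.76 | 95.48 | 0.909 | 83.65 | 97.50 | 91.93 | 0.835 |
| Block | Tetramer | 75.32 | 97.53 | 90.12 | 0.776 | 85.03 | 98.03 | 92.82 | 0.853 |
| Block | Pentamer | 100.00 | 96.67 | 98.57 | 0.973 | 83.33 | 95.00 | 89.17 | 0.813 |
| Block | Hexamer | 89.03 | 98.52 | 94.27 | 0.887 | 82.50 | 97.00 | 90.42 | 0.812 |
| Block | Octamer | 95.71 | 72.62 | 84.17 | 0.709 | 90.00 | 85.33 | 87.67 | 0.782 |
| Block | Decamer | 91.67 | 100.00 | 95.83 | 0.928 | 100.00 | 100.00 | 100.00 | 1.000 |
| Block | Dodecamer | 95.50 | 98.00 | 96.75 | 0.941 | 86.00 | 94.67 | 90.33 | 0.823 |
| Block | Overall | | | 92.27 | 0.848 | | | 91.13 | 0.827 |
| FunD | Dimer | 52.49 | 90.26 | 71.37 | 0.462 | 74.74 | 90.16 | 85.01 | 0.659 |
| FunD | Trimer | 93.73 | 87.83 | 90.22 | 0.806 | 69.94 | 88.25 | 80.87 | 0.600 |
| FunD | Tetramer | 60.64 | 96.61 | 84.62 | 0.647 | 71.96 | 95.08 | 85.79 | 0.706 |
| FunD | Pentamer | 75.00 | 86.67 | 80.48 | 0.649 | 53.33 | 100.00 | 76.67 | 0.572 |
| FunD | Hexamer | 64.94 | 100.00 | 84.27 | 0.712 | 85.75 | 83.89 | 84.86 | 0.698 |
| FunD | Octamer | 48.81 | 100.00 | 74.40 | 0.566 | 44.67 | 100.00 | 72.33 | 0.517 |
| FunD | Decamer | 63.33 | 100.00 | 81.67 | 0.691 | 45.00 | 100.00 | 72.50 | 0.473 |
| FunD | Dodecamer | 63.00 | 100.00 | 81.50 | 0.680 | 69.00 | 100.00 | 84.50 | 0.733 |
| FunD | Overall | | | 81.07 | 0.652 | | | 80.32 | 0.620 |
| PseASA | Dimer | 66.95 | 46.74 | 56.85 | 0.140 | 12.62 | 93.07 | 66.18 | 0.094 |
| PseASA | Trimer | 39.16 | 85.93 | 66.93 | 0.288 | 36.08 | 82.75 | 63.98 | 0.218 |
| PseASA | Tetramer | 30.11 | 91.52 | 71.05 | 0.280 | 33.37 | 83.49 | 63.34 | 0.194 |
| PseASA | Pentamer | 64.17 | 70.00 | 67.62 | 0.343 | 86.67 | 66.67 | 76.67 | 0.564 |
| PseASA | Hexamer | 65.76 | 60.37 | 62.78 | 0.262 | 61.63 | 80.37 | 71.85 | 0.431 |
| PseASA | Octamer | 66.90 | 60.48 | 63.69 | 0.285 | 86.00 | 71.50 | 78.75 | 0.604 |
| PseASA | Decamer | 81.67 | 93.33 | 87.50 | 0.785 | 90.00 | 85.00 | 87.50 | 0.773 |
| PseASA | Dodecamer | 73.50 | 68.00 | 70.75 | 0.429 | 66.83 | 85.50 | 76.17 | 0.556 |
| PseASA | Overall | | | 68.40 | 0.352 | | | 73.05 | 0.429 |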

2.5. The Three-Layer Architecture of Classifiers

SVM is generally used as a binary classifier and was initially applied to pattern recognition and other fields [30]. It has since been applied successfully to classification problems in various fields, including the prediction of quaternary structure [8, 10, 31]. This study uses LibSVM, developed by Chang and Lin [32]. The prediction system employs a three-layer architecture. The first layer uses SVM to build binary classification models for each type of feature. Feature selection is performed with fselect.py [33], a Python script from the LibSVM package, which assigns each feature an F-score according to its importance; the trained models are then sorted by F-score. To avoid poor recognition and enormous computational time, the trained models are divided into four equal parts according to F-score, from high to low, and the models at or below the 25% mark or above the 75% mark are removed. Finally, the first layer classification model is chosen for the best sensitivity, specificity, and Matthews Correlation Coefficient (MCC) as measured by 10-fold cross-validation. The 10-fold cross-validation results of the first layer also reveal the predictive power of the three types of features for the different classes of oligomers. The second layer is an SVM that refines the predictions of the first layer; its purpose is to combine the outputs of the individual feature models for each oligomer class into one. The second layer integrated model is trained on the 10-fold cross-validation predictions of the first layer, so that it can weigh the strengths and weaknesses of the different protein features and improve prediction accuracy.
By comparing the data analysis ability of different machine learning algorithms, we finally selected Random Forest to construct the third layer classifier, which integrates these recognition results and determines the quaternary structure type of the protein oligomer. Figure 1 is a flowchart of the prediction system.
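To make the F-score ranking concrete, the sketch below computes the F-score as it is commonly defined for SVM feature selection: the between-class separation of a feature's means divided by the sum of its within-class sample variances. It is assumed, not confirmed here, that this matches what fselect.py reports; the function names and the toy data are hypothetical.

```python
def f_score(pos_values, neg_values):
    """F-score of one feature from its values in the positive and negative class.

    Each class needs at least two samples because sample variances are used.
    """
    n_pos, n_neg = len(pos_values), len(neg_values)
    mean_pos = sum(pos_values) / n_pos
    mean_neg = sum(neg_values) / n_neg
    mean_all = (sum(pos_values) + sum(neg_values)) / (n_pos + n_neg)
    numerator = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    var_pos = sum((x - mean_pos) ** 2 for x in pos_values) / (n_pos - 1)
    var_neg = sum((x - mean_neg) ** 2 for x in neg_values) / (n_neg - 1)
    denominator = var_pos + var_neg
    return numerator / denominator if denominator > 0 else 0.0


def rank_features(positives, negatives):
    """Return feature indices sorted from highest to lowest F-score."""
    n_features = len(positives[0])
    scores = [
        f_score([row[j] for row in positives], [row[j] for row in negatives])
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True), scores


# Toy binary feature vectors (rows = proteins, columns = features).
positives = [[1, 0], [1, 1], [1, 0]]
negatives = [[0, 1], [0, 0], [1, 1]]
ranking, scores = rank_features(positives, negatives)
print(ranking, [round(s, 4) for s in scores])  # [0, 1] [0.6667, 0.0833]
```

In this toy case feature 0 is present in every positive example and in only one negative example, so it receives the higher score and would be ranked first.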
Figure 1

Flowchart of the three-layer architecture of classifiers.

2.6. Evaluation Measures

To assess the predictive performance of the classifiers, we use the following measures, in which TP, FP, FN, and TN are the numbers of true positives, false positives, false negatives, and true negatives, respectively. Sensitivity (Sn) is the percentage of proteins of a given oligomer type that are correctly predicted as that type. Specificity (Sp) is the percentage of proteins not of that type that are correctly predicted as not belonging to it. Accuracy (ACC) assesses the overall predictive power. Matthews Correlation Coefficient (MCC) values range from −1 to 1, in which the value of 1 represents a completely correct prediction, the value of 0 represents random prediction, and the value of −1 represents exactly the opposite prediction:

\[ \mathrm{Sn} = \frac{TP}{TP + FN}, \qquad \mathrm{Sp} = \frac{TN}{TN + FP}, \qquad \mathrm{ACC} = \frac{TP + TN}{TP + FP + TN + FN}, \]

\[ \mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}. \]

For the third layer classifier, we evaluate the classification results with the Kappa statistic and the F-measure. The Kappa statistic [34] measures how far the classifier's results agree with the true classes beyond what random assignment would produce. Its value is in the range of −1 to 1. K = 1 indicates that the predictions agree completely with the true classes and thus differ entirely from random classification; K = 0 means that the predictions are no better than random prediction; K = −1 indicates that the classification has no effect or credibility. We also use the F-measure as a standard criterion for the classification results. The F-measure is a combination of precision and recall, with values from 0 to 1.
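The binary measures described above can be computed directly from the four confusion counts. The sketch below uses the standard definitions; the function name and the example counts are hypothetical and are not taken from the study.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Sn, Sp, ACC, MCC, and F-measure from confusion counts (as fractions)."""
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    acc = (tp + tn) / (tp + fp + fn + tn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f_measure = (
        2 * precision * sn / (precision + sn) if (precision + sn) else 0.0
    )
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc, "F": f_measure}


# Invented counts for one oligomer class versus all other classes.
metrics = binary_metrics(tp=80, fp=10, fn=20, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts Sn is 0.80, Sp is 0.90, and ACC is 0.85; multiplying the first three values by 100 gives the percentage form used in the tables.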

3. Results

3.1. Performance of Using Different Protein Features in the First Layer

To understand how the different types of feature encoding affect the accuracy of structure prediction, we trained SVM classification models and evaluated their validity with 10-fold cross-validation. Tables 1 and 2 show the 10-fold cross-validation sensitivity, specificity, accuracy, and MCC for the monomer, homooligomer, and heterooligomer. As the cross-validation results show, block composition achieved an overall accuracy of 78.91%, 92.27%, and 91.13% for the monomer, homooligomer, and heterooligomer, respectively. MCC was 0.579, 0.848, and 0.827, respectively. Since most sensitivities exceed 80%, block composition is indeed suitable for representing protein characteristics and effective for classifying structure type. For the functional domain (FunD) feature, the overall accuracy for the monomer, homooligomer, and heterooligomer was 75.75%, 80.26%, and 79.93%, respectively. The FunD results for homooligomers and heterooligomers were about 10% lower than those of block composition, and the sensitivities for homooctamer, heterooctamer, and heterodecamer were below 50%. These results indicate that FunD cannot represent the relevant characteristics of these classes. The overall PseASA prediction accuracy is relatively low, that is, 68.40% and 73.05%, respectively. However, for heterooligomeric pentamers, octamers, and decamers, PseASA achieves better sensitivity than FunD, at 86.67%, 86%, and 90%, respectively. In addition, the MCC of PseASA is generally lower, suggesting that the homology between whole sequences is not high or that the number and complexity of sequences within the same category increase, which makes correct predictions difficult to obtain.
Even when the pentamer, octamer, decamer, and dodecamer classes, which retain high sequence homology, are excluded, the overall accuracy of block composition for homo- and heterooligomers still reached 90.72% and 90.48%, respectively. To further enhance prediction accuracy, we used the second layer SVM to integrate the outputs of the various feature models.

3.2. Performance of Model Combination to Enhance Oligomer Type Prediction Accuracy

The purpose of the second layer is to integrate the results predicted by the different feature models in each category. We utilized different combinations of feature models built from the three features, referred to as B (Block), F (FunD), and P (PseASA). Table 3 shows the performance comparison of the model combinations in 10-fold cross-validation for oligomer classification in the second layer.
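Table 3

Performance comparison of model combination with SVM in 10-fold cross-validation for oligomer classification in the second layer.

| Class | F + P ACC (%) | F + P MCC | B + P ACC (%) | B + P MCC | B + F ACC (%) | B + F MCC | B + F + P ACC (%) | B + F + P MCC |
|---|---|---|---|---|---|---|---|---|
| Monomer | 75.75 | 0.552 | 78.91 | 0.579 | 82.64 | 0.663 | 82.64 | 0.663 |
| Homooligomer | | | | | | | | |
| Dimer | 71.01 | 0.456 | 83.00 | 0.660 | 83.00 | 0.660 | 83.00 | 0.660 |
| Trimer | 90.22 | 0.806 | 95.48 | 0.909 | 95.48 | 0.909 | 95.47 | 0.907 |
| Tetramer | 84.21 | 0.638 | 90.12 | 0.776 | 93.41 | 0.854 | 93.41 | 0.854 |
| Pentamer | 80.48 | 0.649 | 98.57 | 0.973 | 98.57 | 0.973 | 98.57 | 0.973 |
| Hexamer | 84.27 | 0.712 | 94.27 | 0.887 | 96.94 | 0.939 | 96.94 | 0.939 |
| Octamer | 74.40 | 0.566 | 84.17 | 0.709 | 85.60 | 0.743 | 84.17 | 0.709 |
| Decamer | 85.83 | 0.759 | 95.83 | 0.928 | 98.33 | 0.971 | 98.33 | 0.971 |
| Dodecamer | 81.50 | 0.680 | 96.75 | 0.941 | 99.00 | 0.982 | 99.00 | 0.982 |
| Overall | 81.49 | 0.658 | 92.27 | 0.848 | 93.79 | 0.879 | 93.61 | 0.874 |
| Heterooligomer | | | | | | | | |
| Dimer | 85.01 | 0.659 | 86.73 | 0.697 | 88.89 | 0.767 | 88.89 | 0.767 |
| Trimer | 80.87 | 0.600 | 91.93 | 0.835 | 91.93 | 0.835 | 92.07 | 0.836 |
| Tetramer | 85.79 | 0.706 | 92.82 | 0.853 | 94.48 | 0.889 | 94.48 | 0.889 |
| Pentamer | 77.50 | 0.577 | 89.17 | 0.799 | 93.33 | 0.886 | 93.33 | 0.886 |
| Hexamer | 84.86 | 0.698 | 90.42 | 0.812 | 90.42 | 0.812 | 93.72 | 0.875 |
| Octamer | 72.67 | 0.484 | 87.67 | 0.782 | 87.67 | 0.782 | 87.67 | 0.782 |
| Decamer | 92.50 | 0.873 | 97.50 | 0.958 | 100.00 | 1.000 | 97.50 | 0.958 |
| Dodecamer | 84.50 | 0.733 | 90.33 | 0.823 | 95.33 | 0.916 | 95.33 | 0.916 |
| Overall | 82.96 | 0.666 | 90.82 | 0.820 | 92.76 | 0.861 | 92.88 | 0.863 |
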
For the monomer, the B + P combination reached an accuracy of 78.91%, about 3% higher than the F + P combination. The B + F combination improved accuracy further, from 78.91% to 82.64%. However, adding PseASA to form the B + F + P combination gave no further gain over B + F. The same situation appears in the feature model combinations for homo- and heterooligomers. Overall, the B + F model combination performs better than the single Block model, with most categories improving by 1 to 6%. Therefore, this study uses the B + F combination to construct the first and second layers of the classification model.

3.3. Performance Comparison of Classification Algorithms in the Third Layer

To obtain a single result that determines the quaternary structure type of an unknown protein, we use a third classifier layer to process the output of the second layer. We compared the data analysis power and problem solving ability of different types of algorithms and selected the better algorithm for constructing the third layer classifier. Algorithms from six typical categories were tested, that is, Bayes, Functions, Lazy, Rules, Trees, and Meta. The Oli6926 dataset was used for this training. We used two validation methods, 10-fold cross-validation and self-consistency, to assess the learning effectiveness of the classifiers. In 10-fold cross-validation, the Correctly Classified Instances (CCI) of LibSVM and Logistic were 67.40% and 67.28%, respectively (Table 4). The Kappa statistics were 0.5288 and 0.5285, and the F-measures were 0.616 and 0.615, respectively. These two algorithms gave the best predicted results. However, we found that the predictive accuracy and statistical values of LibSVM and Logistic are higher because most correct predictions occur in the large quaternary categories, while minor categories such as pentamer, hexamer, and octamer are completely ignored. Other algorithms, such as decision table and Bagging, show a similar situation. Conversely, the accuracy of Random Forest, Random Tree, and IBk was 58.91%, 54.65%, and 58.45%, respectively. Kappa was 0.4306, 0.3817, and 0.4126, respectively. F-measure was 0.566, 0.537, and 0.551, respectively. Although the results of these three algorithms are not perfect, they are less susceptible to the imbalance in category sizes.
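Table 4

Performance comparison of classification algorithms in 10-fold cross-validation and self-consistency test.

| Category | Algorithm | Cross-validation CCI (%) | Cross-validation Kappa | Cross-validation F-measure | Self-consistency CCI (%) | Self-consistency Kappa | Self-consistency F-measure |
|---|---|---|---|---|---|---|---|
| Bayes | Bayes net | 64.80 | 0.5017 | 0.608 | 65.02 | 0.5053 | 0.611 |
| Bayes | Naïve Bayes | 39.91 | 0 | 0.228 | 39.91 | 0 | 0.228 |
| Functions | LibSVM | 67.40 | 0.5288 | 0.616 | 68.60 | 0.5464 | 0.632 |
| Functions | Logistic | 67.28 | 0.5285 | 0.615 | 67.57 | 0.5326 | 0.619 |
| Functions | Multilayer perceptron | 64.01 | 0.4893 | 0.598 | 69.97 | 0.5694 | 0.657 |
| Lazy | IB1 | 51.91 | 0.3513 | 0.515 | 87.18 | 0.8218 | 0.869 |
| Lazy | IBk | 58.45 | 0.4126 | 0.551 | 90.38 | 0.8682 | 0.902 |
| Lazy | KStar | 62.43 | 0.463 | 0.581 | 88.10 | 0.8362 | 0.877 |
| Meta | AdaBoostM1 | 59.80 | 0.3909 | 0.493 | 59.80 | 0.3909 | 0.493 |
| Meta | Bagging | 66.27 | 0.5147 | 0.605 | 69.88 | 0.5671 | 0.65 |
| Rules | Conjunctive rule | 59.80 | 0.3909 | 0.493 | 59.80 | 0.3909 | 0.493 |
| Rules | Decision table | 66.99 | 0.5189 | 0.601 | 67.22 | 0.5218 | 0.601 |
| Rules | DTNB | 67.12 | 0.5225 | 0.606 | 67.38 | 0.5268 | 0.61 |
| Tree | J48 | 66.45 | 0.5161 | 0.607 | 69.81 | 0.5639 | 0.646 |
| Tree | Random forest | 58.91 | 0.4306 | 0.566 | 90.02 | 0.8651 | 0.899 |
| Tree | Random tree | 54.65 | 0.3817 | 0.537 | 90.38 | 0.8682 | 0.902 |

CCI: correctly classified instances.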

For LibSVM and Logistic, the self-consistency results were not significantly higher than the 10-fold cross-validation results. In contrast, under self-consistency verification, Random Forest, Random Tree, and IBk correctly classified about 90% of instances, since they have good recognition capability for known information. The prediction performance of Random Forest and IBk was similar in self-consistency, where they achieved the highest MCC value of 0.856. Since the cross-validation and prediction results of the Random Forest algorithm for minor categories were good, we finally chose Random Forest as the third layer classifier in QuaBingo.
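The Kappa statistic helps expose a classifier that ignores minor categories. The sketch below computes Cohen's kappa from a multi-class confusion matrix using the standard definition; the function name and both matrices are invented for illustration and are not results from the study.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows = true, columns = predicted)."""
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    # Agreement expected by chance from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    if expected == 1:
        return 0.0
    return (observed - expected) / (1 - expected)


# A classifier that assigns every sample to the largest class:
# 40% of instances are correct, but agreement is no better than chance.
majority_only = [[40, 0, 0], [35, 0, 0], [25, 0, 0]]
# A classifier that gets every sample right.
perfect = [[40, 0, 0], [0, 35, 0], [0, 0, 25]]

print(cohen_kappa(majority_only))  # 0.0
print(cohen_kappa(perfect))        # 1.0
```

A pattern like the first matrix would be consistent with a row in Table 4 whose CCI is well above zero while its Kappa is 0, which is why CCI alone is not a sufficient criterion when the categories are imbalanced.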

3.4. Performance Analysis

To understand the prediction capabilities of QuaBingo for different functional proteins in the cell, we compared it with QuatIdent [12], a known quaternary structure prediction tool, using an independent test. As shown in Table 5, the average sensitivity of QuaBingo was 51.95%. For the enzyme, gene regulation, membrane protein, signal transduction, and molecular binding categories, ACC was better, ranging from 77% to 80%. For QuatIdent, the average sensitivity was 20.74%. These results illustrate that a prediction method composed of functional domain and PsePSSM cannot correctly identify most protein quaternary structures.
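Table 5

Comparison of results of different functional categories of proteins on QuaBingo and QuatIdent.

| Protein categories | QuaBingo Sn (%) | QuaBingo Sp (%) | QuaBingo ACC (%) | QuaBingo MCC | QuatIdent Sn (%) | QuatIdent Sp (%) | QuatIdent ACC (%) | QuatIdent MCC |
|---|---|---|---|---|---|---|---|---|
| Immunity system | 40.46 | 96.28 | 68.37 | 0.367 | 20.61 | 96.66 | 58.64 | 0.199 |
| Enzyme | 57.21 | 97.33 | 77.27 | 0.545 | 38.18 | 97.98 | 68.08 | 0.426 |
| Cell cycle | 44.44 | 96.53 | 70.49 | 0.410 | 14.82 | 97.80 | 56.31 | 0.176 |
| Chaperone | 45.95 | 96.62 | 71.28 | 0.426 | 20.27 | 98.99 | 59.63 | 0.313 |
| Gene regulation | 58.36 | 97.40 | 77.88 | 0.558 | 21.75 | 98.19 | 59.97 | 0.276 |
| Transport proteins | 57.80 | 97.36 | 77.58 | 0.552 | 21.67 | 97.86 | 59.77 | 0.258 |
| Signal transduction | 59.16 | 97.45 | 78.30 | 0.566 | 11.97 | 98.42 | 55.19 | 0.167 |
| Viral protein | 42.73 | 96.42 | 69.57 | 0.391 | 10.00 | 98.75 | 54.38 | 0.156 |
| Membrane protein | 57.81 | 97.36 | 77.59 | 0.552 | 16.41 | 98.49 | 57.45 | 0.229 |
| Molecular binding | 63.37 | 97.71 | 80.54 | 0.611 | 27.11 | 98.47 | 62.79 | 0.351 |
| Hormone | 36.08 | 96.01 | 66.04 | 0.321 | 28.87 | 97.29 | 63.08 | 0.305 |
| Others | 60.03 | 97.50 | 78.77 | 0.575 | 17.18 | 98.61 | 57.89 | 0.247 |
| Overall | 51.95 | 97.00 | 74.47 | 0.490 | 20.74 | 98.13 | 59.43 | 0.259 |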

3.5. The Top Five Features of Block Composition of Oligomer on Oli8444

The feature extraction method of block composition is simple, yet it can yield a lot of useful information for discovering the mechanisms of protein aggregation and serial modes. We optimize block composition by feature selection, which assigns each feature an F-score according to its degree of importance. The top five features are shown in Table 6. For example, block IPB006052A, one of the top five features for homotrimers, is a conserved sequence of the TNF (Tumor Necrosis Factor) family; it is found in the trimeric CD40 ligand (PDB ID: 1ALY) in the training data and also in human Collagen X sequences (PDB ID: 1GR3). Human Collagen X relies on the C1q domain to form a stable homotrimer. In existing data annotation, the C1q and TNF-like domains overlap, and a number of important positions in the amino acid sequence show high conservation and similar topology [35]. Much literature has confirmed that these amino acids play an important role in forming the hydrophobic core that stabilizes the trimeric structure and in forming biologically active protein complexes [27, 35, 36]. In addition, many other features of block composition are associated with a particular protein function. Thus, feature selection not only reduces the number of features in block composition but also effectively identifies characteristic patterns that are clearly related to protein aggregation and hence distinguishes the quaternary structures of different oligomers.
Table 6

Top five features of block composition of oligomers.

| Oligomer type | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Monomer | IPB002225A | IPB002347A | IPB000817A | IPB002347D | IPB013549A |
| Homooligomer | | | | | |
| Dimer | IPB000817A | IPB004045 | IPB013572B | IPB001647 | IPB003449A |
| Trimer | IPB007691D | IPB006052A | IPB006056A | IPB006175A | IPB006175B |
| Tetramer | IPB002347D | IPB003560D | IPB002198B | IPB002347B | IPB002347E |
| Pentamer | IPB007334A | IPB001931A | IPB013124E | IPB008681A | IPB012599D |
| Hexamer | IPB001564C | IPB001753C | IPB001980A | IPB001564A | IPB001564B |
| Octamer | IPB001354C | IPB013341B | IPB002682 | IPB001354A | IPB001354B |
| Decamer | IPB000866A | IPB000866B | IPB013740 | IPB003394A | IPB002587G |
| Dodecamer | IPB002177A | IPB002177B | IPB008331B | IPB014035B | IPB007664A |
| Heterooligomer | | | | | |
| Dimer | IPB003026B | IPB008386B | IPB000315A | IPB000219A | IPB012565 |
| Trimer | IPB002353B | IPB012565 | IPB003990A | IPB001003B | IPB003026B |
| Tetramer | IPB003026B | IPB012565 | IPB010004A | IPB001664D | IPB002398F |
| Pentamer | IPB001280E | IPB003484D | IPB012420 | IPB004333C | IPB006711D |
| Hexamer | IPB002919A | IPB003038 | IPB008019A | IPB001591A | IPB001762 |
| Octamer | IPB007659A | IPB004977B | IPB006574B | IPB002971G | IPB003539A |
| Decamer | IPB013124E | IPB002662B | IPB003417A | IPB000732A | IPB000817A |
| Dodecamer | IPB002682 | IPB000353B | IPB001003B | IPB003597B | IPB006217A |

3.6. Case Study

Thymidylate synthase (TS; EC 2.1.1.45) is an enzyme that converts deoxyuridine monophosphate into deoxythymidine monophosphate and plays an important role in cell functions related to DNA replication and damage repair. Inhibition of TS is one approach to cancer treatment: inhibitors interfere with DNA biosynthesis and so disturb the growth of cancer. TS is known to be conserved from E. coli to human. Here, QuaBingo provides testing results for several TS homologs, including 2KCE (E. coli), 4IQQ (C. elegans), 2TSR (rat), 4EB4 (mouse), 1HVY (human), and 1I00 (human). The testing results show that QuaBingo correctly predicts the quaternary structure of these phylogenetically distant TS homologs as homodimer, with a sensitivity of 100%. This demonstrates that QuaBingo may work for phylogenetically homologous proteins.

4. Conclusions

In this study, we propose a feature extraction method based on blocks of conserved protein sequence for the classification of protein quaternary structure. This method can overcome two problems of feature extraction encountered by functional domain composition: (1) some proteins may not contain any known functional domains; and (2) the corresponding known functional domains are too few to represent proteins. It is worth noting that the first problem has not yet been encountered with our proposed method, and the second problem was comprehensively solved using QuaBingo. The 10-fold cross-validation results showed that the overall accuracy of block composition for homo- and heterooligomers is 92.27% and 91.13%, respectively. Moreover, both are 10% higher than those of the functional domain composition. These results demonstrate that block composition can extract important and biologically meaningful features and thus enhance the prediction of protein quaternary structure. Although many proteins exist as monomers, they may interact with another protein to form polymers or may further assemble into a biologically relevant tetramer or octamer. Currently, most of these problems have not been solved through scientific research or verified by adequate information. In the future, as more data are added to the pertinent databases, an accurate prediction system could be established that would greatly assist relevant research. An online web server is freely available at http://predictor.nchu.edu.tw/QuaBingo/.

Table S1. The amount of each protein quaternary structure attribute in different datasets.

Oli8444.zip. Dataset Oli8444.
