Literature DB >> 21526102

Are there any differences between features of proteins expressed in malignant and benign breast cancers?

Mansour Ebrahimi1, Esmaeil Ebrahimie, Narges Shamabadi, Mahdi Ebrahimi.   

Abstract

BACKGROUND: The most common cancer among women is breast cancer and it has been blamed as the second leading cause of cancer death in women; so far many approaches have been used to analyze and detect benign and malignant forms of cancer and understanding the features involved in proteins expressed by various types of breast cancers is crucial.
METHODS: Herein features of proteins expressed in malignant, benign and both cancers were compared using different screening techniques, clustering methods, decision tree models and generalized rule induction (GRI) algorithms to look for patterns of similarity in two benign and malignant breast cancer groups.
RESULTS: The findings showed that the N-terminal amino acid was Met and 57 out of 838 proteins' features ranked as important (p > 0.05). The depth of the trees induced by tree induction models varied from 5 (in the Quest model) to 2 (in the C5.0 model) branches. The best performance evaluation found when C&RT model applied and the worst evaluation found when CHAID model applied. No significant difference in the percentage of correctness, performance evaluation, and mean correctness in tree induction algorithms was found when feature selection applied on datasets, but the number of peer groups reduced significantly (p < 0.05) when feature selection model applied.
CONCLUSIONS: The frequency of Ile-Ile was the most important protein attributes in all tree and rule induction models. The importance of sequence-based classification and the frequency of Ile-Ile in prediction of malignant and benign breast cancer have been discussed here.

Entities:  

Keywords:  Breast Neoplasms; Computational Biology; Decision Support Techniques

Year:  2010        PMID: 21526102      PMCID: PMC3082830     

Source DB:  PubMed          Journal:  J Res Med Sci        ISSN: 1735-1995            Impact factor:   1.852


The most common cancer among women is breast cancer, excluding non-melanoma skin cancers and it is the second leading cause of cancer death in women (exceeded only by lung cancer); although recent studies confirmed death rates from breast cancer declined significantly during last decade.1 These declines may be due to earlier detection or better treatment. If the breast cancer is diagnosed early enough, the cure rate is very high and more than 97% of women can survive at least for 5 years.2 Generally the cancer has been categorized as non-invasive or benign, where the cancer cells are confined to the origin place, do not threaten life and do not spread outside of the breast; and invasive or malignant, where the cancer cells have broken through the duct into the surrounding fatty and connective tissues; this type may lead to death if not detected and cured.3 Although various techniques have been used to distinguish between benign and malignant breast cancers in recent years, use of computer based technologies such as bioinformatics models have attracted huge attentions.4–7 The Support Vector Machine (SVM) classification algorithm shown to be a useful tool to diagnose breast cancer.89 Bioinformatics tools such as feature selection with extensive RNA-pathway analysis on mass spectrometric of metabolites used to identify the important features related to breast cancer pathogenesis.8 In another attempt to introduce a predictive system for non-invasive breast cancer, a combination of another bioinformatics tools (SVM-based feature selection) and mass spectrometric analysis was employed.10 Neural network propagation algorithm with SVMs and other baseline methods were used to identify several markers with clinical or biological relevance with the breast cancers.11 Prediction tasks are attempts to accurately forecast the outcome of a specific situation by using input data obtained from a concrete set of variables that potentially describe the situation.1213 Nowadays, neural networks, as artificial intelligence, have found application in a wide range of problems14 and in many cases resulted as superior to standard statistical models.15 The predictive reliability of an artificial neural networks model in medical diagnosis has been confirmed so far.16 Modeling systems have been used for better prediction of breast and lung carcinoma post-surgery survival using neural networks as suitable tool.1718 When data analysis involve hundreds, or even thousands of variables, data mining tools are being used as one of the most probable candidates.19 It is anticipated that applying a neural network or a decision tree to a set of variables of this quantity may require more time than practice.20 There are many attributes determine the different characteristics of a protein molecule. As a result, the majority of time and effort of artificial modeling algorithms spent in the model-building process involves determining which variables should be included in the model. Attribute weighting or feature selection helps the model to reduce the size of variable set, extracting a more manageable set of attributes for rule or tree induction or getting out meaningful models.21 The value of a discrete dependent variable with a finite set from the values of a set of independent variables is predicted by induction tree algorithms.22 The tree is constructed by looking for regularities in data, determining the features to add at the next level of the tree using an entropy calculation, and then choosing the feature that minimizes the entropy impurity.23 There are many well-known decision tree algorithms available. To better understand the features that contribute to the type of proteins expressed in breast cancer (benign or malignant) and to find a suitable tool to classify the types of cancer according to proteins’ attributes, various clustering, screening, and decision tree models were employed in this study.

Methods

From the UniProt Knowledgebase (Swiss-Prot and TrEMBL) database, sequences from 15 proteins expressed during two distinctive forms of breast cancers (10 benign and 5 malignant) and one common group (with 6 proteins in both benign and malignant groups) were retrieved. The proteins were categorized into B (benign), M (malignant) and C (control) groups. Eight hundred and seventy nine protein attributes or features such as length, weight, isoelectric point, aliphatic index, the count and the frequency of each amino acid and the count and the frequency of dipeptides from all of those proteins were calculated. All attributes were classified as continuous variables, except for the N-terminal amino acid, which was classified as categorical. A dataset of these protein features was imported into Clementine software (Clementine_NLV-11.1.0.95; Integral Solution, Ltd.), and type of cancer variable (B, M and C) was set as the output variable and the other variables were set as input variables. Different tree induction algorithms were applied to the datasets to find the most important attributes and trace the most probable patterns expressed during two forms of cancers. These algorithms allowed the development of classification systems that automatically included in their rules only the attributes that really matter in making a decision. Attributes that did not contribute to the accuracy of the tree were ignored. This process yielded very useful information about the data and could be used to reduce the data to relevant fields only before training another learning technique, such as a neural network. Various algorithms are available for performing classification and segmentation analysis, and herein different decision tree and cluster analysis models were used. To investigate the effects of the attribute weighting algorithm on other models behavior, all models were run both with and without feature selection criteria. Two screening models were used:

a) Anomaly Detection Model:

By examining large numbers of attributes, this model was used to identify outliers or unusual cases in the data.

b) Attribute Weighting Algorithm:

This model identifies the features that have a strong correlation with the type of cancers and labels the attributes as important, marginal, and unimportant, with values more than 0.95, between 0.95 and 0.90, and less than 0.90, respectively. Two clustering models applied:

a) K-Means:

This model clusters data into distinct groups when clustering groups are unknown. Records are grouped so that those within a group or cluster tend to be similar to each other, whereas records in different groups are dissimilar.

b) Two-Step Cluster:

In two-step cluster, the first step scans the data and compresses them into a manageable set of subclusters and in the second step a hierarchical clustering method applies to merge subclusters into larger clusters. Five different tree induction models applied:

a) Classification and Regression Tree (C&RT):

This algorithm uses recursive partitioning to split the training records into segments by minimizing the impurity at each step.

b) CHAID:

Decision trees generated by using chi-square statistics to identify optimal splits.

c) Exhaustive CHAID:

A modification of CHAID with examining all possible splits.

d) QUEST:

A binary classification method generates and reduces the processing time.

e) C5.0:

A tree or a rule set induces by splitting the sample based on the field that provides the maximum information gain at each level. Generalized rule induction (GRI) model or association model discovers association rules in the data by extracting a set of rules from the data using an index that takes both the generality (support) and accuracy (confidence) of rules into account.

Results

The average length, weight, isoelectric point, and aliphatic indices of proteins studied here were 794.524 ± 749.528, 108.873 ± 112.782, 6.625 ± 1.238, and 84.914 ± 10.832 (mean ± SD), respectively. The average counts of sulfur, carbon, nitrogen, oxygen, and hydrogen were 33.810 ± 25.390, 4022.333 ± 3826.498, 1105.952 ± 1076.559, 1207.905 ± 1177.811, and 6373.667 ± 6128.062, respectively, and the average counts of hydrophobic, hydrophilic, and other residues were 359.857 ± 313.313, 201.952 ± 208.667, and 1160.000 ± 232.714, respectively. The frequencies of hydrogen, carbon, oxygen, nitrogen, and sulfur in all enzymes were 0.490 ± 0.009, 0.314 ± 0.054, 0.099 ± 0.006, 0.086 ± 0.004 and 0.002 ± 0.001, respectively, and the frequencies of hydrophobic, hydrophilic, other (amphoteric) residues, and negatively, positively, and other charged residues were 0.454 ± 0.059, 0.243 ± 0.037, 0.302 ± 0.081, 112.429 ± 116.015, 101.095 ± 99.609 and 581.000 ± 538.844, respectively. The mean count of amino acids ranged from a minimum of 11.095 ± 15.251 for Try to a maximum of 85.095 ± 92.529 for Leu and the same order found for amino acids frequencies (from 0.012 ± 0.007 for Try to 0.103 ± 0.023 for Leu). The N-terminal amino acid was Met among all proteins studied in this paper. The average non-reduced Cys extinction coefficient at 280 nm was 92462.381 ± 104156.36, non-reduced Cys absorption was 3380.877 ± 15488.905, the reduced Cys extinction coefficient was 88248.143 ± 105517.497, and the reduced Cys absorption was 0.915 ± 0.385.

Screening Models

Two peer groups with an anomaly index cutoff of 1.352 were generated. No anomalous record found in the first peer group of 5 records, while 1 anomaly record found in the second peer group of 16 records. Two peer groups with an anomaly index cutoff of 1.53 and just 1 anomalous record in the second peer group created when feature selection algorithm applied on dataset. Fifty seven out of 838 attributes had p value higher than 0.95 in classification of cancer proteins (Table 1), and 84 attributes with weight between 0.90 and 0.95 marked as marginal when feature selection model applied.
Table 1

Results of feature selection on important (and one marginal) features contributing to the two types of breast cancers proteins. Higher values indicate that protein feature is more important.

NoProtein featureValueRankNoProtein featureValueRank
1Count of Leu-Ile0.998Important31Count of Thr-Ile0.972Important
2Count of Met-Ser0.997Important32Freq of Asp-Asp0.972Important
3Freq of Asn-Ile0.996Important33Freq of Asn-Gln0.972Important
4Count of Ile-Ile0.995Important34Count of Thr-Tyr0.972Important
5Count of Ile-Cys0.995Important35Count of Leu-Cys0.971Important
6Freq of Met-Ser0.995Important36Freq of Cys-Tyr0.971Important
7Count of Ser-Phe0.993Important37Count of Asp-Tyr0.971Important
8Count of Phe-Leu0.992Important38Count of Ile0.968Important
9Freq of Gln-Val0.991Important39Count of Phe-Lys0.967Important
10Count of Gly-Phe0.99Important40Count of Cys-Val0.966Important
11Count of Gly-Val0.989Important41Count of Ser-Val0.964Important
12Freq of Gly-Phe0.988Important42Count of Phe-Phe0.963Important
13Count of Ala-Ile0.987Important43Freq of Glu-Trp0.963Important
14Count of Tyr-Cys0.986Important44Count of Phe-Ile0.962Important
15Count of Ile-Pro0.986Important45Freq of Phe-Met0.962Important
16Freq of Asp-Leu0.985Important46Count of Gln-Val0.961Important
17Count of Val-Ala0.985Important47Count of Asn-Arg0.96Important
18Freq of Ile0.984Important48Count of Tyr-Trp0.96Important
19Count of Phe (F)0.981Important49Count of Cys0.959Important
20Count of Tyr-Pro0.981Important50Freq of Ala-Arg0.958Important
21Freq of Asp-Ala (1)0.98Important51Count of Glu-Asn0.956Important
22Count of Ala-Gly0.979Important52Freq of Ala-Ala0.953Important
23Count of Val-Tyr0.979Important53Count of Ile-Asn0.952Important
24Count of Lys-Phe0.978Important54Count of Asp-Arg0.952Important
25Freq of His-Met0.977Important55Freq of Gln-Phe0.951Important
26Count of Ile-Phe0.976Important56Count of Gly-Ile0.951Important
27Freq of Ala-Leu0.974Important57Count of Asp-Lys0.951Important
28Freq of Lys-His0.973Important58Count of Trp-Tyr0.948Important
29Count of Phe-Asp0.973Important59Freq of Asn-Phe0.947Marginal
30Freq of Asp(D)0.973Important60Count of Tyrosine (Y)0.947Marginal
Results of feature selection on important (and one marginal) features contributing to the two types of breast cancers proteins. Higher values indicate that protein feature is more important.

Clustering Models

Six records (more than 28%) put into the first and the fourth clusters and 1, 5, and 3 records were put into the second, third, and fifth clusters, respectively when K-Means algorithm applied on the dataset and five clusters with 8, 1, 8, 3, and 1 records in each cluster, respectively, generated when feature selection filtering applied on dataset. Two clusters with 1 and 20 records in each group, respectively, generated when Two-Step clustering applied on dataset without feature selection and again two clusters (with 5 and 16 records in each cluster) created when feature selection algorithm applied.

Decision Tree Models

A tree with a depth of 2 and cross-validation of 45.0 ± 9.0 induced in C5.0 model and the most important attribute employed to build the tree was the count of Ile-Ile. If the value of this feature was equal to or less than 2, the proteins fell into the malignant (M) category; otherwise they were put into the benign (B) category. In the M subgroup, the frequency of Arg-Cys was used to create the next tree branches, with value equal to 0 as M mode and more than 0 as common (C) mode. When 10-fold cross-validation was applied to the same dataset, again a tree with a depth of 2 and cross-validation of 56.7 ± 11.7 was created. The same protein features and values were used to create tree branches. When the same models were applied to datasets using feature selection filtering, a tree with a depth of 3 and cross-validation of 58.3 ± 10.3 and 58.3 ± 7.1 were generated for C5.0 and C5.0 with 10-fold cross-validation, respectively. Again the count of Ile-Ile (with value of 2) was used to create the first tree branches while count of Ile-Cys was the feature used for second subgroups classification with values equal to or greater than 0 (Figure 1).
Figure 1

A decision tree generated by the CHAID modeling method without feature selection filtering showing protein features used to build the decision tree. M = Malignant cancer; B = Benign; C = Common proteins

A decision tree generated by the CHAID modeling method without feature selection filtering showing protein features used to build the decision tree. M = Malignant cancer; B = Benign; C = Common proteins A tree with a depth of 3 induced when C& RT model applied and the most important attribute to build the tree was the count of Ile-Ile (value ≥ 2.500 for M and > 2.500 for B groups). The frequency Asp-Ser was used to create the second level for M subgroups (with value of 0.004). The same results were obtained when feature selection was used. A tree with a depth of 5 generated when Quest model applied and the frequency of His-Met (with a value of 0) was the most important feature to create the first tree branches. In the M subgroup, count of Cys-Gln (0), the count of Ile-Met (value 0) and the count of Gln-His (value 0.691) were the most important features in creating the subsequent branches of M group decision tree. Nearly the same results were obtained when feature selection filtering was applied. In CHAID model with and without feature selection, a tree with a depth of 3 induced and again the same protein feature (count of Ile-Ile) with the same values as C5.0 model used to create the tree. The same trees with the same features and values were generated when exhaustive CHAID models were applied on data-sets with and without feature selection. The best percentage of correctness, performance evaluation, and mean correctness in the tree induction models belonged to C5.0 model, followed by the CR&T, CHAID, and finally the Quest models (Table 2 and Figure 2).
Table 2

The association rules found in the data by the generalized rule induction (GRI) method, showing 100 most important rules created by GRI algorithm in classifying benign (B), malignant (M) and common (C) proteins expressed in breast cancers.

AntecedentSupport (%)
Count of Ile-Ile > 2.50042.86
Count of Ile > 27.000 and Freq of Ala-Ala < 0.00442.86
Count of Phe-Lys > 0.500 and Count of Ile > 28.50038.1
Count of Ala-Gly > 0.500 and Count of Ile-Cys > 0.50038.1
Count of Ile > 27.000 and Count of Tyr-Pro > 0.50038.1
Freq of Met-Ser < 0.000 and Freq of Gln-Val < 0.002 and Freq of Asp < 0.07823.81
Count of Asp-Arg > 1.50033.33
Count of Asp-Lys > 2.500 and Freq of Ala-Ala < 0.00433.33
Count of Cys-Val > 1.500 and Count of Ile > 30.00033.33
Count of Ala-Gly > 0.500 and Count of Asp-Arg > 1.50033.33
Count of Ala-Gly > 0.500 and Freq of Asp < 0.059 and Freq of Ala-Ala < 0.00433.33
Count of Ala-Gly > 0.500 and Count of Ile > 28.500 and Freq of Ala-Leu < 0.00833.33
Count of Ile > 27.000 and Count of Tyr-Cys < 1.500 and Freq of Ala-Ala < 0.00433.33
Count of Ile > 27.000 and Count of Val-Ala < 7.000 and Freq of Ala-Ala < 0.00433.33
Count of Ile > 27.000 and Count of Ser-Val < 5.500 and Freq of Ala-Ala < 0.00433.33
Count of Ile > 27.000 and Count of Gln-Val < 6.000 and Freq of Ala-Ala < 0.00433.33
Count of Ile > 27.000 and Count of Asn-Arg < 4.500 and Freq of Ala-Ala < 0.00433.33
Count of Ile > 27.000 and Count of Met-Ser < 2.500 and Freq of Ala-Ala < 0.00433.33
Count of Ile > 27.000 and Count of Lys-Phe < 4.500 and Freq of Ala-Ala < 0.00433.33
Count of Ile > 27.000 and Count of Ile-Pro < 2.500 and Freq of Ala-Ala < 0.00433.33
Count of Ile > 27.000 and Count of Ile-Ile < 5.500 and Freq of Ala-Ala < 0.00433.33
Count of Ile > 27.000 and Count of Gly-Ile < 5.500 and Freq of Ala-Ala < 0.00433.33
Count of Ile > 27.000 and Count of Gly-Phe < 3.500 and Freq of Asp-Leu < 0.00633.33
Count of Ile > 27.000 and Count of Phe-Lys < 4.000 and Freq of Ala-Ala < 0.00433.33
Count of Ile > 27.000 and Count of Phe-Phe > 0.500 and Freq of Ala-Leu < 0.00833.33
Count of Ile > 27.000 and Count of Glu-Asn < 3.500 and Freq of Ala-Ala < 0.00433.33
Count of Ile > 27.000 and Count of Asp-Arg < 4.500 and Freq of Ala-Ala < 0.00433.33
Count of Ile > 27.000 and Count of Asp-Lys < 6.500 and Freq of Ala-Ala < 0.00433.33
Count of Ile > 27.000 and Count of Ala-Ile < 7.000 and Freq of Ala-Ala < 0.00433.33
Count of Ile > 27.000 and Count of Ala-Gly < 3.500 and Freq of Ala-Ala < 0.00433.33
Count of Ile > 27.000 and Freq of Asp < 0.056 and Freq of Ala-Ala < 0.00433.33
Count of Ile > 27.000 and Count of Phe < 64.500 and Freq of Ala-Ala < 0.00433.33
Freq of Asn-Ile < 0.000 and Freq of Ala-Ala < 0.008 and Count of Asp-Lys > 1.50014.29
Freq of Asn-Ile < 0.000 and Count of Asp-Lys > 1.500 and Freq of Ala-Ala < 0.00814.29
Freq of Asn-Ile < 0.000 and Freq of Ile > 0.040 and Freq of Ala-Leu < 0.00814.29
Freq of Asn-Ile < 0.000 and Count of Ile > 11.500 and Freq of Ala-Leu < 0.00814.29
Freq of Gly-Phe < 0.000 and Freq of Ala-Ala < 0.008 and Count of Ile > 11.50014.29
Freq of Gly-Phe < 0.000 and Count of Asp-Lys > 1.500 and Freq of Ala-Ala < 0.00814.29
Count of Ala-Ile < 1.500 and Freq of Ala-Ala < 0.008 and Count of Ile > 11.50014.29
Count of Ala-Ile < 1.500 and Count of Ile-Ile > 0.500 and Freq of Ala-Ala < 0.00814.29
Count of Ala-Ile < 1.500 and Count of Asp-Lys > 1.500 and Freq of Ala-Ala < 0.00814.29
Count of Ile < 25.500 and Freq of Ala-Ala < 0.008 and Count of Ile > 11.50014.29
Count of Ile < 25.500 and Count of Ile-Ile > 0.500 and Freq of Ala-Ala < 0.00814.29
Count of Ile < 25.500 and Count of Asp-Lys > 1.500 and Freq of Ala-Ala < 0.00814.29
Count of Asp-Lys > 2.500 and Count of Asp-Tyr > 1.50028.57
Count of Asp-Lys > 2.500 and Freq of Ile < 0.066 and Freq of Ala-Ala < 0.00428.57
Count of Ala-Ile > 2.500 and Freq of Asp-Leu < 0.00628.57
Count of Ala-Gly > 0.500 and Freq of Ile > 0.05928.57
Count of Ala-Gly > 0.500 and Freq of Asp < 0.059 and Freq of Asp > 0.04428.57
Count of Ala-Gly > 0.500 and Count of Cys > 14.000 and Freq of Ala-Leu < 0.00628.57
Freq of Ile > 0.059 and Count of Ala-Gly > 0.50028.57

Count of Ile > 27.000 and Count of Gly-Ile < 5.500 and Count of Tyr-Pro > 0.50028.57

Count of Ile > 27.000 and Count of Gly-Phe < 3.500 and Count of Phe-Lys > 1.50028.57
Count of Ile > 27.000 and Count of Phe-Lys < 4.000 and Count of Cys-Val > 1.50028.57
Count of Ile > 27.000 and Count of Phe-Asp < 4.000 and Freq of Ala-Ala < 0.00428.57
Count of Ile > 27.000 and Count of Glu-Asn < 3.500 and Count of Cys-Val > 1.50028.57
Count of Ile > 27.000 and Count of Asp-Arg < 4.500 and Count of Glu-Asn > 1.50028.57
Count of Ile > 27.000 and Count of Asp-Lys < 6.500 and Count of Cys-Val > 1.50028.57
Count of Ile > 27.000 and Count of Cys-Val < 2.500 and Freq of Ala-Ala < 0.00428.57
Count of Ile > 27.000 and Count of Ala-Ile < 7.000 and Count of Cys-Val > 1.50028.57
Count of Ile > 27.000 and Freq of Asp < 0.056 and Count of Asn-Arg > 1.50028.57
Count of Ile > 27.000 and Count of Phe < 64.500 and Count of Cys-Val > 1.50028.57
Count of Ile > 27.000 and Count of Cys < 28.500 and Freq of Ala-Ala < 0.00428.57
Freq of Met-Ser < 0.000 and Freq of Lys-His < 0.002 and Count of Thr-Ile > 0.50019.05
Freq of Met-Ser < 0.000 and Freq of Phe-Met < 0.002 and Count of Thr-Ile > 0.50019.05
Freq of Met-Ser < 0.000 and Freq of Asp-Leu < 0.013 and Count of Asn-Arg < 0.50019.05
Freq of Met-Ser < 0.000 and Freq of Asp-Ala (1) < 0.014 and Count of Asn-Arg < 0.50019.05
Freq of Met-Ser < 0.000 and Freq of Asp-Asp < 0.004 and Count of Thr-Ile > 0.50019.05
Freq of Met-Ser < 0.000 and Count of Thr-Ile > 0.500 and Count of Asn-Arg < 0.50019.05
Freq of Met-Ser < 0.000 and Count of Asn-Arg < 0.500 and Count of Cys > 1.50019.05
Freq of Met-Ser < 0.000 and Count of Leu-Ile > 0.500 and Count of Asn-Arg < 0.50019.05
Freq of Met-Ser < 0.000 and Freq of Asp < 0.070 and Count of Asn-Arg < 0.50019.05
Freq of Cys-Tyr < 0.000 and Freq of Lys-His < 0.002 and Count of Thr-Ile > 0.50019.05
Freq of Cys-Tyr < 0.000 and Freq of Phe-Met < 0.002 and Count of Cys > 1.50019.05
Freq of Cys-Tyr < 0.000 and Freq of Asp-Leu < 0.013 and Count of Asn-Arg < 0.50019.05
Freq of Cys-Tyr < 0.000 and Freq of Asp-Ala (1) < 0.014 and Count of Asn-Arg < 0.50019.05
Freq of Cys-Tyr < 0.000 and Freq of Asp-Asp < 0.004 and Count of Thr-Ile > 0.50019.05
Freq of Cys-Tyr < 0.000 and Count of Thr-Ile > 0.500 and Count of Asn-Arg < 0.50019.05
Freq of Cys-Tyr < 0.000 and Count of Asn-Arg < 0.500 and Count of Cys > 1.50019.05
Freq of Cys-Tyr < 0.000 and Count of Leu-Ile > 0.500 and Count of Asp-Arg < 0.50019.05
Freq of Cys-Tyr < 0.000 and Freq of Asp < 0.070 and Count of Asn-Arg < 0.50019.05
Count of Met-Ser < 0.500 and Freq of Ala-Leu > 0.00819.05
Count of Met-Ser < 0.500 and Count of Asn-Arg < 0.500 and Freq of Ala-Leu > 0.00819.05
Count of Met-Ser < 0.500 and Freq of Asp < 0.070 and Freq of Ala-Leu > 0.00819.05
Count of Asp-Lys > 2.500 and Count of Asp-Arg < 4.500 and Freq of Ile > 0.04823.81
Count of Asp-Lys > 2.500 and Count of Ala-Ile < 7.000 and Count of Leu-Ile > 3.50023.81
Count of Asp-Lys > 2.500 and Count of Ala-Gly < 3.500 and Freq of Ala-Ala < 0.00423.81
Count of Asp-Lys > 2.500 and Freq of Ile < 0.066 and Count of Asp-Tyr > 1.50023.81
Count of Asp-Lys > 2.500 and Freq of Asp < 0.056 and Freq of Asp > 0.04423.81
Count of Asp-Lys > 2.500 and Count of Ile < 85.500 and Freq of Ile > 0.04823.81
Count of Asp-Lys > 2.500 and Count of Phe < 64.500 and Count of Leu-Ile > 3.50023.81
Count of Asp-Lys > 2.500 and Count of Cys > 14.000 and Freq of Ala-Leu < 0.00623.81
Count of Cys-Val > 1.500 and Count of Cys > 19.00023.81
Count of Ala-Ile > 2.500 and Count of Val-Tyr > 2.00023.81
Count of Ala-Ile > 2.500 and Count of Val-Ala < 9.500 and Freq of Asp-Leu < 0.00623.81
Count of Ala-Ile > 2.500 and Count of Thr-Tyr < 2.500 and Freq of Asp-Leu < 0.00623.81
Count of Ala-Ile > 2.500 and Count of Thr-Ile < 6.500 and Freq of Asp-Leu < 0.00623.81
Count of Ala-Ile > 2.500 and Count of Ser-Val < 8.000 and Freq of Asp-Leu < 0.00623.81
Count of Ala-Ile > 2.500 and Count of Ser-Phe < 5.500 and Freq of Asp-Leu < 0.00623.81
Figure 2

Percentage of correctness and wrongness in various decision tree models, in datasets without feature selection (a) and with feature selection (b); showing C5.0 model had the best performance, followed by CR&T, CHAID, and Quest models

The association rules found in the data by the generalized rule induction (GRI) method, showing 100 most important rules created by GRI algorithm in classifying benign (B), malignant (M) and common (C) proteins expressed in breast cancers. Percentage of correctness and wrongness in various decision tree models, in datasets without feature selection (a) and with feature selection (b); showing C5.0 model had the best performance, followed by CR&T, CHAID, and Quest models

Association Model

One hundred rules with 21 valid transactions and minimum and maximum support of 47.62% and 83.23% and maximum confidence of 100 % generated when GRI model applied. When feature selection was used, minimum support, maximum support, maximum confidence, and minimum confidence changed to 14.29%, 42.86%, 100%, and 87.5%, respectively. When the most accurate model (C&RT) was run on another dataset of 30 proteins from other cancers, the accuracy of the model in predicting the right group was 95.24%, while its wrongness was just 7.76% showing very suitable performance of this model in prediction.

Discussion

Nowadays, incredible amount of data produced each year because cancer research is a worldwide enterprise and the application of computational tools in cancer research has become an important and rapidly developing field. Bioinformatics as an emerging tool has developed primary to address the analysis of huge data generated from genomics and proteomics; however large datasets are also produced in cell biology, physiology, pathology, therapeutics, clinical trials and epidemiology. To utilize and improve the extraction of valuable results generated by researchers on patients’ diagnosis and treatment, collaboration between various sections of sciences such as software engineering, data mining knowledge and clinical studies seems essential.24–26 Early diagnosis of breast cancer is much more significant than any treatment, therefore, more attention should be paid to the early diagnosis of breast cancer.27 Self-examination, clinical examination, physical examination and mammography are main diagnostic tools but these classical methods are useful when tumours are large or palpable and mammography, as an efficient tool, is mainly suitable in western countries. Use of serum markers such as CA15.3, CA27.29 and CEA, without enough sensitivity and specificity has not been accepted in clinical diagnostics; especially in the early stages of breast cancers. Although Food and Drug Administration (FDA) of the United States recommended some markers only for monitoring therapy or recurrence of advanced breast cancer; it is been highly recommended to find new diagnostic tools; and some researchers have proposed proteomics and bioinformatics approaches as emerging tools for breast cancer detection.2628 These tools in conjunction with bioinformatics applications could greatly facilitate the discovery of new and better biomarkers.25 Various modelling tools (Screening Models, Clustering Models, Decision Tree Models and Association Model) applied on more than 800 protein attributes expressed in benign, malignant and both types of breast cancers simultaneously to find different protein features in each class of breast cancers. The screening, clustering, and decision tree models applied on datasets with and without feature selection filtering. Although 85 attributes (with value greater than 0.95) were marked as “important”, more than 95% of them were the frequencies or the counts of dipeptides. The number of peer groups with anomalies did not change when feature selection algorithm were applied, showing the neutral effects of attribute weighting on removing outliers in this case; although in another study we showed feature selection significantly improves the performances of the modelling in classifying mesostable and thermostable proteins.29 In K-Means modelling, the number of clusters did not show any differences when models run on dataset with and without feature selection, although the number of records in the clusters changed. The depth of trees varied from 5 (in the Quest model) to 2 (in the C5.0 with and without 10-fold cross validation models) branches when tree induction models applied. The best performance evaluation belonged to C&RT model and the worst to C5.0 and C5.0 models with 10-fold cross validation. The percentage of correctness, performance evaluation, and mean correctness of tree induction models applied here showed no significant differences (p > 0.95) with and without feature selection filtering on datasets, but when feature selection datasets used the percentage of correctness of CAHID model decreased. In all tree induction models, the count of Ile-Ile chose as the most important attribute and also in all GRI association rules (100 rules) the count of this feature was used as an antecedent to support the rules. A consistent difference exists in the pattern of synonymous codon usage between benign and malignant protein cancers,3031 and there is strong evidence that this difference is the result of selections linked to malignancy coming out from amino acid sequences.3233 In addition, malignant proteins can be distinguished based on the amino acid composition of their proteomes, and several authors have tried to relate these differences to structural differences.34–38 The importance of sequence-based classification in detection of various proteins expressed in breast cancer and the importance of Ile-Ile dipeptide in clustering of proteins, for the first time, reported in this paper. As Ile is a non-polar and hydrophobic amino acid, when it forms a dipeptide bond, it clearly can change the confirmation of proteins so that it has been used as the most important feature in all decision tree models applied in this paper. The performance of different bioinformatics tools (such as screening, clustering, and decision tree algorithms) for discriminating between proteins expressed in malignant and benign types of breast cancer examined here. The results confirmed that amino acid composition can be used to discriminate between proteins groups expressed in two forms of breast cancer. The results also confirmed that most of algorithms employed here can be used to discriminate between proteins expressed in two main forms of breast cancers with an accuracy of 86-100%. No significant difference was found in performance of different models used in this paper. Interestingly, the CHAID and exhaustive CHAID methods showed lower performance in comparison with other decision tree models as we anticipated to be more accurate, because they use the most sophisticated neural network architecture and trim it down to desired level, so the number of hidden layers and the number of neurons in layers 1 and 2 are usually higher than other decision tree models. When feature selection applied no significant differences (p > 0.05) noticed between analyses. The best performance and results were obtained with C&RT algorithms. Thus, it is suggested that this decision tree model can be used as an effective tool to discriminate malignant and benign proteins of breast cancer.

Conclusions

In this study a new approach has been employed for the first time to look at the protein attributes’ variations in malignant and benign breast cancers. The frequency of Ile-Ile was the most important protein attributes in all tree and rule induction models.
  37 in total

Review 1.  Recent progress in biomolecular engineering.

Authors:  D D Ryu; D H Nam
Journal:  Biotechnol Prog       Date:  2000 Jan-Feb

2.  There may never be a final cure for breast cancer.

Authors:  L Barr; D G Evans
Journal:  Eur J Surg Oncol       Date:  2001-06       Impact factor: 4.424

3.  Diagnosing breast cancer based on support vector machines.

Authors:  H X Liu; R S Zhang; F Luan; X J Yao; M C Liu; Z D Hu; B T Fan
Journal:  J Chem Inf Comput Sci       Date:  2003 May-Jun

Review 4.  A guide to template based structure prediction.

Authors:  Xiaotao Qu; Rosemarie Swanson; Ryan Day; Jerry Tsai
Journal:  Curr Protein Pept Sci       Date:  2009-06       Impact factor: 3.272

5.  Proteome changes in ovarian epithelial cells derived from women with BRCA1 mutations and family histories of cancer.

Authors:  Diana M Smith-Beckerman; Kit W Fung; Katherine E Williams; Nelly Auersperg; Andrew K Godwin; Alma L Burlingame
Journal:  Mol Cell Proteomics       Date:  2004-12-09       Impact factor: 5.911

Review 6.  Modelling in surgical oncology--part III: massive data sets and complex systems.

Authors:  D A Rew
Journal:  Eur J Surg Oncol       Date:  2000-12       Impact factor: 4.424

Review 7.  Recent advances in molecular genetics of breast cancer.

Authors:  K Pavelić; K Gall-Troselj
Journal:  J Mol Med (Berl)       Date:  2001-10       Impact factor: 4.599

8.  Proteomics and bioinformatics approaches for identification of serum biomarkers to detect breast cancer.

Authors:  Jinong Li; Zhen Zhang; Jason Rosenzweig; Young Y Wang; Daniel W Chan
Journal:  Clin Chem       Date:  2002-08       Impact factor: 8.327

Review 9.  Breast cancer: beyond the cutting edge.

Authors:  Gerald M Higa
Journal:  Expert Opin Pharmacother       Date:  2009-10       Impact factor: 3.889

10.  Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data.

Authors:  Xuegong Zhang; Xin Lu; Qian Shi; Xiu-Qin Xu; Hon-Chiu E Leung; Lyndsay N Harris; James D Iglehart; Alexander Miron; Jun S Liu; Wing H Wong
Journal:  BMC Bioinformatics       Date:  2006-04-10       Impact factor: 3.169

View more
  10 in total

1.  Computational approaches for classification and prediction of P-type ATPase substrate specificity in Arabidopsis.

Authors:  Zahra Zinati; Abbas Alemzadeh; Amir Hossein KayvanJoo
Journal:  Physiol Mol Biol Plants       Date:  2016-04-07

2.  Classification of lung cancer tumors based on structural and physicochemical properties of proteins by bioinformatics models.

Authors:  Faezeh Hosseinzadeh; Mansour Ebrahimi; Bahram Goliaei; Narges Shamabadi
Journal:  PLoS One       Date:  2012-07-19       Impact factor: 3.240

3.  Prediction of thermostability from amino acid attributes by combination of clustering with attribute weighting: a new vista in engineering enzymes.

Authors:  Mansour Ebrahimi; Amir Lakizadeh; Parisa Agha-Golzadeh; Esmaeil Ebrahimie; Mahdi Ebrahimi
Journal:  PLoS One       Date:  2011-08-10       Impact factor: 3.240

4.  Discovery of EST-SSRs in lung cancer: tagged ESTs with SSRs lead to differential amino acid and protein expression patterns in cancerous tissues.

Authors:  Mohammad Reza Bakhtiarizadeh; Mansour Ebrahimi; Esmaeil Ebrahimie
Journal:  PLoS One       Date:  2011-11-04       Impact factor: 3.240

5.  Prediction of hepatitis C virus interferon/ribavirin therapy outcome based on viral nucleotide attributes using machine learning algorithms.

Authors:  Amir Hossein KayvanJoo; Mansour Ebrahimi; Gholamreza Haqshenas
Journal:  BMC Res Notes       Date:  2014-08-23

6.  Novel approach for identification of influenza virus host range and zoonotic transmissible sequences by determination of host-related associative positions in viral genome segments.

Authors:  Fatemeh Kargarfard; Ashkan Sami; Manijeh Mohammadi-Dehcheshmeh; Esmaeil Ebrahimie
Journal:  BMC Genomics       Date:  2016-11-16       Impact factor: 3.969

7.  Wogonoside inhibits invasion and migration through suppressing TRAF2/4 expression in breast cancer.

Authors:  Yuyuan Yao; Kai Zhao; Zhou Yu; Haochuan Ren; Li Zhao; Zhiyu Li; Qinglong Guo; Na Lu
Journal:  J Exp Clin Cancer Res       Date:  2017-08-03

8.  A new avenue for classification and prediction of olive cultivars using supervised and unsupervised algorithms.

Authors:  Amir H Beiki; Saba Saboor; Mansour Ebrahimi
Journal:  PLoS One       Date:  2012-09-05       Impact factor: 3.240

9.  Improved classification of lung cancer tumors based on structural and physicochemical properties of proteins using data mining models.

Authors:  R Geetha Ramani; Shomona Gracia Jacob
Journal:  PLoS One       Date:  2013-03-07       Impact factor: 3.240

10.  Understanding the undelaying mechanism of HA-subtyping in the level of physic-chemical characteristics of protein.

Authors:  Mansour Ebrahimi; Parisa Aghagolzadeh; Narges Shamabadi; Ahmad Tahmasebi; Mohammed Alsharifi; David L Adelson; Farhid Hemmatzadeh; Esmaeil Ebrahimie
Journal:  PLoS One       Date:  2014-05-08       Impact factor: 3.240

  10 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.