Abstract
Software defects are a major problem, and predicting them is a difficult task. Researchers have developed many software defect prediction techniques, but there remains a need for a method that predicts defects with higher accuracy and lower time and space complexity. Much of the previous research worked on the data without feature reduction, leading to the curse of dimensionality. We propose a hybrid machine learning approach that combines Principal Component Analysis (PCA) and Support Vector Machines (SVM) to address this problem. We used PROMISE data (CM1: 344 observations, KC1: 2109 observations) from the NASA repository and split each dataset into a training set (CM1: 240 observations, KC1: 1476 observations) and a testing set (CM1: 104 observations, KC1: 633 observations). Using PCA, we extract principal components for feature optimization, which reduces time complexity. We then apply SVM for classification, owing to its advantages over traditional and conventional methods, and use GridSearchCV for hyperparameter tuning. The proposed hybrid model achieves better accuracy (CM1: 95.2%, KC1: 86.6%) than other methods and also scores higher on the other evaluation criteria. As a limitation, SVM provides no probabilistic explanation for its classifications, which can make them rigid. Future work may introduce a method that overcomes this limitation and keeps a soft, probability-based margin around the optimal hyperplane.
Keywords: Classification; Feature optimization; PCA; PROMISE dataset; SVM; Software defects detection
Year: 2021 PMID: 33880074 PMCID: PMC8050160 DOI: 10.1007/s10586-021-03282-8
Source DB: PubMed Journal: Cluster Comput ISSN: 1386-7857 Impact factor: 1.809
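A minimal sketch of the described PC-SVM workflow, using scikit-learn. Since the PROMISE CM1/KC1 files are not bundled with this record, `make_classification` stands in for a 21-metric defect dataset; the split ratio mirrors CM1's 240/104 partition, and the parameter grid values are illustrative, not the paper's exact settings.

```python
# Hedged sketch of the PCA + SVM + GridSearchCV pipeline on synthetic stand-in data
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Synthetic stand-in for CM1 (344 rows, 21 software metrics)
X, y = make_classification(n_samples=344, n_features=21,
                           n_informative=8, random_state=0)
# ~70/30 split, mirroring CM1's 240/104 partition
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),      # PCA is scale-sensitive
    ("pca", PCA(n_components=0.95)),  # keep components covering 95% of variance
    ("svm", SVC()),
])
# Hyperparameter tuning with GridSearchCV, as in the paper (grid values assumed)
grid = GridSearchCV(pipe, {"svm__C": [0.1, 1, 10],
                           "svm__gamma": ["scale", 0.01, 0.1]}, cv=5)
grid.fit(X_train, y_train)
acc = grid.score(X_test, y_test)
print(round(acc, 3))
```

On the real PROMISE data, the same pipeline would simply replace the synthetic `X, y` with the loaded CM1 or KC1 feature matrix and defect labels.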
Comparative analysis of PCA with other feature optimization approaches
| S. No | Algorithm | Drawback |
|---|---|---|
| 1 | Genetic algorithm | The genetic algorithm has higher complexity and is not worthwhile on our dataset, whereas PCA offers faster execution and built-in feature selection. The genetic algorithm also requires a large dataset, while ours is small, calling for a versatile and easy-to-implement technique such as PCA |
| 2 | Correlation threshold | Selecting a correlation threshold manually is risky because it may drop important features and useful information, so we used PCA's built-in feature selection instead. Redundant features may occur, but PCA can eliminate them |
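The "built-in feature selection" claim for PCA can be illustrated with a small sketch: on redundant data, keeping only the components that explain 95% of the variance collapses correlated columns into a handful of dimensions. The data here is synthetic and the 95% threshold is the common convention, not a value taken from the paper.

```python
# Sketch: PCA compresses redundant features via the explained-variance criterion
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))                  # 2 underlying signals
X = np.hstack([latent @ rng.normal(size=(2, 8)),    # 8 redundant columns
               rng.normal(size=(200, 2))])          # 2 independent columns
Xs = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95).fit(Xs)  # keep 95% of the variance
print(pca.n_components_)              # far fewer than the 10 raw columns
```

This is exactly what a manually chosen correlation threshold risks getting wrong: PCA decides how many directions to keep from the data itself.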
Comparative analysis of SVM with other classification approaches
| S. No | Algorithm | Drawback |
|---|---|---|
| 1 | Decision tree classification | This technique performs poorly on small datasets, such as those used in our proposed model, where SVM is more effective. It is also prone to overfitting, whereas SVM is more robust to overfitting |
| 2 | Random forest classification | The random forest algorithm is likewise prone to overfitting, while SVM is less sensitive to it |
Comparison of algorithms' shortcomings with the PC-SVM model
| S. No | Algorithms | Shortcoming |
|---|---|---|
| 1 | AdaBoost | Sensitive to outliers and less effective at predicting errors, whereas SVM is robust to outliers and effective at predicting software defects |
| 2 | CART | Models linear data poorly, while SVM can work with both linear and non-linear datasets |
| 3 | KNN | Has a higher misclassification rate in comparison with SVM |
| 4 | Neural Network | Works well with large datasets but takes time to train, while our datasets are small, for which PC-SVM is better suited |
| 5 | Chao Genetic | Has difficulty providing an optimal solution, while PCA can provide one |
| 6 | EM model | Also does not guarantee an optimal solution, while our PC-SVM model can |
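The CART row claims SVM handles non-linear as well as linear datasets. A quick sketch on synthetic XOR-style data (not from the paper) shows why: a linear kernel cannot separate the quadrants, while an RBF kernel can.

```python
# Sketch: kernel choice lets SVM fit data no linear boundary can separate
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # XOR-style labels, not linearly separable

linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)
print(round(linear_acc, 2), round(rbf_acc, 2))
```

The linear kernel hovers near chance on this pattern; the RBF kernel fits it almost perfectly.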
Attributes description
| Attribute name | Description of attribute |
|---|---|
| LOC | Counts the total number of lines in the module |
| Iv(g) | Design complexity analysis (McCabe) |
| Ev(g) | McCabe essential complexity |
| N | Number of operators present in the software module |
| v(g) | Cyclomatic complexity measurement (McCabe) |
| D | Difficulty measure (Halstead) |
| B | Estimated number of delivered bugs (Halstead) |
| L | Program level (Halstead) |
| V | Volume |
| I | Intelligence content measure |
| E | Effort measure |
| Locomment | Lines of comments in the software module |
| Loblank | Total number of blank lines in the module |
| uniq_op | Total number of unique operators |
| uniq_opnd | Total number of unique operands |
| T | Time estimator |
| Branchcount | Total number of branches in the software module |
| total_op | Total number of operators |
| Total_opnd | Total number of operands |
| Locodeandcomment | Total number of lines of code and comments |
| Defects/Problems | Information on whether a defect is present or not |
Fig. 1 Decision boundary
Fig. 2 Effects of the C parameter on SVM model fitting
Fig. 3 Gamma hyperparameter
Fig. 4 PC-SVM model representation
Fig. 5 2D graph hyperplane
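Figures 2 and 3 concern the C and gamma hyperparameters tuned via GridSearchCV. One way to see C's effect, sketched below on synthetic data (values illustrative, not the paper's): a small C tolerates margin violations, so many points end up as support vectors; a large C penalizes violations harder and narrows the margin.

```python
# Sketch: support-vector counts shrink as C tightens the soft margin
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, class_sep=1.0, random_state=1)
n_sv = {C: SVC(kernel="rbf", C=C, gamma="scale").fit(X, y)
             .support_vectors_.shape[0]
        for C in (0.01, 1, 100)}
print(n_sv)
```

Gamma plays the complementary role for the RBF kernel: it controls how far each support vector's influence reaches, which is why both are swept jointly in the grid search.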
Confusion matrix demonstration
| | Predicted negative | Predicted positive |
|---|---|---|
| Actual negative | True Negative (TN) | False Positive (FP) |
| Actual positive | False Negative (FN) | True Positive (TP) |
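The evaluation criteria reported below all derive from these four confusion-matrix counts. A small sketch with illustrative counts (not taken from the paper's matrices):

```python
# Precision, recall, F-measure and accuracy from confusion-matrix counts
def metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)            # of predicted positives, how many were right
    recall = tp / (tp + fn)               # of actual positives, how many were found
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f_measure, accuracy

# Illustrative counts, not from the paper
p, r, f, a = metrics(tp=90, fp=10, fn=5, tn=95)
print(round(p, 3), round(r, 3), round(f, 3), round(a, 3))
```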
Fig. 6 Confusion matrix for KC1 dataset using the proposed hybrid classifier
Fig. 7 a ROC curve analysis, b Precision–recall analysis for KC1 dataset
Fig. 8 Confusion matrix for CM1 dataset using the proposed hybrid classifier
Fig. 9 a Precision–recall analysis for CM1 dataset, b ROC curve analysis
Statistical performance analysis for CM1 and KC1 datasets using PCA and SVM
| Dataset | Precision (%) | Recall (%) | F-Measure (%) | Accuracy (%) |
|---|---|---|---|---|
| CM1 | 96.1 | 99.0 | 97.5 | 95.2 |
| KC1 | 86.8 | 99.6 | 92.8 | 86.6 |
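The reported F-measures are internally consistent with the precision and recall values via the harmonic-mean formula, which can be checked directly:

```python
# Consistency check: F = 2PR / (P + R) reproduces the table's F-measures
def f_measure(p, r):
    return 2 * p * r / (p + r)

print(round(f_measure(96.1, 99.0), 1))  # CM1
print(round(f_measure(86.8, 99.6), 1))  # KC1
```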
Fig. 10 a Precision for CM1 & KC1 datasets, b Recall analysis for CM1 & KC1 datasets, c F-M analysis for CM1 & KC1 datasets, d Accuracy analysis for CM1 & KC1 datasets
Performance comparison of the proposed model and previously proposed models
| NASA dataset | Techniques | Precision (%) | Recall (%) | F-M (%) | Accuracy (%) |
|---|---|---|---|---|---|
| KC1 | Naïve Bayes | 86.2 | 74.33 | 35.71 | 65.87 |
| | Random Forest | – | 75.89 | 37.91 | 67.99 |
| | C4.5 Miner | – | 75.64 | 34.05 | 68.01 |
| | Immunos | – | 72.91 | 36.92 | 63.55 |
| | ANN-ABC | – | 77 | 33 | 69 |
| | Hybrid self-organizing map | – | 80.94 | 35.67 | 78.43 |
| | SVM | 81.2 | 81.27 | 28.96 | 79.24 |
| | Majority vote | – | 85.62 | 30.98 | 79.66 |
| | AntMiner+ | – | 84.99 | 26.11 | 80.51 |
| | ADBBO-RBFNN | – | 87.95 | 20.24 | 84.96 |
| | NN GAPO + B | – | – | – | 79.4 |
| | Decision Tree | 83.3 | 94.1 | 87.78 | 86.35 |
| | KNN | 83.9 | 84.7 | 84.3 | – |
| | Proposed Model (PC-SVM) | 86.8 | 99.6 | 92.8 | 86.6 |
| CM1 | Naïve Bayes | 86.2 | 78.65 | 34.09 | 64.57 |
| | Random Forest | – | 71.29 | 32.17 | 60.98 |
| | C4.5 Miner | – | 74.66 | 27.68 | 66.71 |
| | Immunos | – | 75.02 | 30.99 | 66.03 |
| | ANN-ABC | – | 81 | 33 | 68 |
| | Hybrid self-organizing map | – | 78.96 | 30.65 | 72.37 |
| | SVM | 81.2 | 79.08 | 31.27 | 78.69 |
| | Majority vote | – | 80 | 30.46 | 77.01 |
| | AntMiner+ | – | 78.88 | 30.9 | 73.43 |
| | ADBBO-RBFNN | – | 80.96 | 29.71 | 82.57 |
| | NN GAPO + B | – | – | – | 74.4 |
| | Decision Tree | 83.3 | 74.23 | 81.2 | 73.49 |
| | KNN | 83.9 | 84.7 | 84.3 | – |
| | Proposed Model (PC-SVM) | 96.1 | 99.0 | 97.5 | 95.2 |