Aderibigbe Israel Adekitan, Odunayo Salau.
Abstract
Research on educational data mining is increasing because the knowledge gained from machine learning helps improve decision making in higher institutions of learning. In this study, predictive analysis was carried out to determine the extent to which the fifth-year (final) Cumulative Grade Point Average (CGPA) of engineering students in a Nigerian university can be predicted from the program of study, the year of entry, and the Grade Point Average (GPA) for the first three years of study, used as inputs to a Konstanz Information Miner (KNIME) based data mining model. Six data mining algorithms were considered, and a maximum accuracy of 89.15% was achieved. The result was verified using both linear and pure quadratic regression models, which gave R² values of 0.955 and 0.957, respectively. This creates an opportunity to identify students who may graduate with poor results or may not graduate at all, so that early intervention can be deployed.
Keywords: Computer science; Education; Information science
Year: 2019 PMID: 30886917 PMCID: PMC6395785 DOI: 10.1016/j.heliyon.2019.e01250
Source DB: PubMed Journal: Heliyon ISSN: 2405-8440
Descriptive statistics of 1841 students' yearly GPA and final CGPA.
| | Min | Max | Mean | Std. deviation | Variance | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|
| First Year GPA | 1.6000 | 4.9600 | 3.7977 | 0.6591 | 0.4344 | -0.6254 | 0.0265 |
| Second Year GPA | 1.1900 | 4.9600 | 3.3070 | 0.7435 | 0.5528 | -0.0407 | -0.5667 |
| Third Year GPA | 0.9700 | 5.0000 | 3.3935 | 0.8535 | 0.7285 | -0.3226 | -0.6562 |
| Final CGPA | 1.8000 | 4.9300 | 3.5605 | 0.6599 | 0.4355 | -0.2190 | -0.5777 |
Fig. 1. First year GPA plots (a) Probability density function (b) Cumulative probability function.
Fig. 2. Second year GPA plots (a) Probability density function (b) Cumulative probability function.
Fig. 3. Third year GPA plots (a) Probability density function (b) Cumulative probability function.
Fig. 4. Final year CGPA plots (a) Probability density function (b) Cumulative probability function.
Confusion matrix for the Probabilistic Neural Network (PNN) Predictor.
| | 2\|2 | 3rd | 2\|1 | 1st |
|---|---|---|---|---|
| 2\|2 | 181 | 7 | 23 | 0 |
| 3rd | 16 | 16 | 0 | 0 |
| 2\|1 | 18 | 0 | 242 | 1 |
| 1st | 0 | 0 | 13 | 36 |
Confusion matrix for the Random Forest predictor.
| | 2\|2 | 3rd | 2\|1 | 1st |
|---|---|---|---|---|
| 2\|2 | 179 | 7 | 25 | 0 |
| 3rd | 14 | 18 | 0 | 0 |
| 2\|1 | 13 | 0 | 244 | 4 |
| 1st | 0 | 0 | 5 | 44 |
Confusion matrix for the Decision Tree predictor.
| | 2\|2 | 3rd | 2\|1 | 1st |
|---|---|---|---|---|
| 2\|2 | 176 | 9 | 21 | 0 |
| 3rd | 12 | 20 | 0 | 0 |
| 2\|1 | 16 | 0 | 237 | 3 |
| 1st | 0 | 0 | 5 | 44 |
Confusion matrix for the Naïve Bayes predictor.
| | 2\|2 | 3rd | 2\|1 | 1st |
|---|---|---|---|---|
| 2\|2 | 177 | 16 | 18 | 0 |
| 3rd | 7 | 25 | 0 | 0 |
| 2\|1 | 23 | 0 | 236 | 2 |
| 1st | 0 | 0 | 9 | 40 |
Confusion matrix for the Tree Ensemble predictor.
| | 2\|2 | 3rd | 2\|1 | 1st |
|---|---|---|---|---|
| 2\|2 | 178 | 7 | 26 | 0 |
| 3rd | 12 | 20 | 0 | 0 |
| 2\|1 | 13 | 0 | 244 | 4 |
| 1st | 0 | 0 | 5 | 44 |
Confusion matrix for the Logistic Regression predictor.
| | 2\|2 | 3rd | 2\|1 | 1st |
|---|---|---|---|---|
| 2\|2 | 184 | 7 | 20 | 0 |
| 3rd | 11 | 21 | 0 | 0 |
| 2\|1 | 13 | 0 | 246 | 2 |
| 1st | 0 | 0 | 7 | 42 |
Model performance comparison.
| | PNN | Random Forest | Decision Tree | Naive Bayes | Tree Ensemble | Logistic Regression |
|---|---|---|---|---|---|---|
| Correctly classified | 475 | 485 | 477 | 478 | 486 | 493 |
| Accuracy | 85.895% | 87.70% | 87.85% | 86.438% | 87.884% | 89.15% |
| Cohen's kappa (κ) | 0.767 | 0.799 | 0.803 | 0.782 | 0.803 | 0.823 |
| Wrongly classified | 78 | 68 | 66 | 75 | 67 | 60 |
| Error | 14.105% | 12.297% | 12.155% | 13.562% | 12.116% | 10.85% |
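The accuracy and Cohen's kappa rows can be recomputed from any of the confusion matrices above. The following is a minimal sketch in plain Python, not the KNIME workflow used in the study; the function name `accuracy_and_kappa` is illustrative. It takes the Logistic Regression matrix as input and should agree with the 89.15% accuracy and κ of 0.823 reported in the table.

```python
# Logistic Regression confusion matrix, copied from the table above.
# Class order along both axes: 2|2, 3rd, 2|1, 1st.
confusion = [
    [184, 7, 20, 0],
    [11, 21, 0, 0],
    [13, 0, 246, 2],
    [0, 0, 7, 42],
]


def accuracy_and_kappa(matrix):
    """Return (observed accuracy, Cohen's kappa) for a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]

    observed = correct / total
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total**2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


accuracy, kappa = accuracy_and_kappa(confusion)
print(f"accuracy = {accuracy:.2%}, kappa = {kappa:.3f}")
```

Kappa corrects the raw accuracy for the agreement expected by chance, which matters here because the two second-class categories hold most of the 553 test records.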
Prediction confusion of the six data mining predictors.
| Class | PNN TP | PNN FP | Random Forest TP | Random Forest FP | Decision Tree TP | Decision Tree FP | Naive Bayes TP | Naive Bayes FP | Tree Ensemble TP | Tree Ensemble FP | Logistic TP | Logistic FP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2\|2 | 181 | 34 | 179 | 27 | 176 | 28 | 177 | 30 | 178 | 25 | 184 | 24 |
| 3rd | 16 | 7 | 18 | 7 | 20 | 9 | 25 | 16 | 20 | 7 | 21 | 7 |
| 2\|1 | 242 | 36 | 244 | 30 | 237 | 26 | 236 | 27 | 244 | 31 | 246 | 27 |
| 1st | 36 | 1 | 44 | 4 | 44 | 3 | 40 | 2 | 44 | 4 | 42 | 2 |
| Overall | 475 | 78 | 485 | 68 | 477 | 66 | 478 | 75 | 486 | 67 | 493 | 60 |
TP – True Positive.
FP – False Positive.
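The per-class TP and FP counts follow directly from the confusion matrices. The sketch below assumes that rows are actual classes and columns are predicted classes; that orientation is inferred from the fact that the tabulated FP values equal the off-diagonal column sums, and it is not stated explicitly in this record. The function name `tp_fp_per_class` is illustrative.

```python
# PNN confusion matrix from the table above.
# Assumed orientation: rows = actual class, columns = predicted class.
CLASSES = ["2|2", "3rd", "2|1", "1st"]
pnn = [
    [181, 7, 23, 0],
    [16, 16, 0, 0],
    [18, 0, 242, 1],
    [0, 0, 13, 36],
]


def tp_fp_per_class(matrix):
    """Return a list of (true positives, false positives), one pair per class."""
    size = len(matrix)
    counts = []
    for j in range(size):
        tp = matrix[j][j]
        # Everything predicted as class j that actually belongs elsewhere.
        fp = sum(matrix[i][j] for i in range(size) if i != j)
        counts.append((tp, fp))
    return counts


for label, (tp, fp) in zip(CLASSES, tp_fp_per_class(pnn)):
    print(f"{label}: TP={tp}, FP={fp}")
```

Summing the pairs gives the "Overall" row: the TP total is the number of correctly classified records and the FP total is the number of misclassified ones.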
Linear regression model results.
| | Estimate | Standard Error (SE) | Beta (β) | tStat | pValue |
|---|---|---|---|---|---|
| (Intercept) | 0.4865 | 0.0197 | - | 24.6470 | 4.18E-116 |
| Program (X1) | -0.0057 | 0.0017 | -0.0164 | -3.2952 | 0.0010 |
| Year of Entry (X2) | -0.0016 | 0.0018 | -0.0053 | -0.8765 | 0.3809 |
| First Year GPA (X3) | 0.1811 | 0.0090 | 0.1809 | 20.1100 | 1.90E-81 |
| Second Year GPA (X4) | 0.2788 | 0.0089 | 0.3141 | 31.3090 | 9.00E-173 |
| Third Year GPA (X5) | 0.4404 | 0.0064 | 0.5696 | 68.8750 | 0 |
Number of observations: 1841, Error degrees of freedom (EDoF): 1835.
Root mean square (RMS) Error: 0.140.
R²: 0.955, Adjusted R²: 0.955.
F-statistic vs. constant model: 7.78e+03, p-value = 0.
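As an illustration, the fitted linear model can be written as a small prediction function using the estimates in the table. This is a sketch under a clear assumption: the numeric coding of the program (X1) and year of entry (X2) is not given in this record, so `program_code` and `entry_code` are hypothetical placeholders and the function is only meaningful with the coding used by the authors.

```python
# Coefficient estimates copied from the linear regression table above.
INTERCEPT = 0.4865
COEF_PROGRAM = -0.0057   # X1; numeric coding not given here (assumption)
COEF_ENTRY = -0.0016     # X2; numeric coding not given here (assumption)
COEF_GPA1 = 0.1811       # X3, first year GPA
COEF_GPA2 = 0.2788       # X4, second year GPA
COEF_GPA3 = 0.4404       # X5, third year GPA


def predict_final_cgpa(program_code, entry_code, gpa1, gpa2, gpa3):
    """Evaluate the fitted linear model for one student."""
    return (
        INTERCEPT
        + COEF_PROGRAM * program_code
        + COEF_ENTRY * entry_code
        + COEF_GPA1 * gpa1
        + COEF_GPA2 * gpa2
        + COEF_GPA3 * gpa3
    )


# Contribution of the three GPA terms at the sample mean GPAs, with both
# categorical codes set to 0 purely for illustration.
print(round(predict_final_cgpa(0, 0, 3.7977, 3.3070, 3.3935), 4))
```

The third year GPA carries the largest coefficient and the largest standardized β (0.5696), so it dominates the prediction, while the year of entry term is not statistically significant (p = 0.3809).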
F-statistic values for the components, except for the constant term.
| | Sum of Squares (Sum Sq.) | Degrees of Freedom (DF) | Mean Square (Mean Sq.) | F | pValue |
|---|---|---|---|---|---|
| Program | 0.2135 | 1 | 0.2135 | 10.8580 | 0.0010 |
| Year of Entry | 0.0151 | 1 | 0.0151 | 0.7683 | 0.3809 |
| First Year GPA | 7.9512 | 1 | 7.9512 | 404.41 | 1.90E-81 |
| Second Year GPA | 19.2730 | 1 | 19.273 | 980.24 | 9.00E-173 |
| Third Year GPA | 93.2670 | 1 | 93.267 | 4743.8 | 0 |
| Error | 36.0780 | 1835 | 0.0197 | | |
ANOVA for the linear regression model.
| | Sum Sq. | DF | Mean Sq. | F | pValue |
|---|---|---|---|---|---|
| Total | 801.38 | 1840 | 0.4355 | | |
| Model | 765.3 | 5 | 153.06 | 7784.9 | 0 |
| Residual | 36.078 | 1835 | 0.0197 | | |
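The summary statistics quoted for the linear model (R², adjusted R², RMS error, and the F-statistic against a constant model) can all be derived from the sums of squares in the ANOVA table. The sketch below uses the standard textbook formulas with the tabulated values; the variable names are illustrative.

```python
import math

# Values copied from the ANOVA table for the linear regression model.
ss_total = 801.38      # total sum of squares
ss_residual = 36.078   # residual (error) sum of squares
n_obs = 1841           # number of observations
n_predictors = 5       # program, year of entry, three yearly GPAs

df_total = n_obs - 1                    # 1840
df_residual = n_obs - n_predictors - 1  # 1835
ms_residual = ss_residual / df_residual

r_squared = 1 - ss_residual / ss_total
adj_r_squared = 1 - ms_residual / (ss_total / df_total)
rms_error = math.sqrt(ms_residual)
f_statistic = ((ss_total - ss_residual) / n_predictors) / ms_residual

print(f"R^2 = {r_squared:.3f}, adjusted R^2 = {adj_r_squared:.3f}")
print(f"RMS error = {rms_error:.3f}, F = {f_statistic:.1f}")
```

With these inputs the results should round to the reported R² of 0.955, RMS error of 0.140, and an F-statistic of about 7.78e+03.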
Fig. 5. (a) Added variable plot for the whole model (b) Adjusted response plot using the first year GPA.
Fig. 6. (a) Adjusted response plot using the second year GPA (b) Adjusted response plot using the third year GPA.
Quadratic regression model results.
| | Estimate | SE | Beta | tStat | pValue |
|---|---|---|---|---|---|
| (Intercept) | 0.7380 | 0.0788 | - | 9.3629 | 2.20E-20 |
| Program (X1) | -0.0570 | 0.0069 | -0.012941 | -8.209 | 4.16E-16 |
| Year of Entry (X2) | -0.0439 | 0.0079 | 0.012463 | -5.5939 | 2.56E-08 |
| First Year GPA (X3) | 0.1043 | 0.0497 | 0.19571 | 2.0989 | 0.035958 |
| Second Year GPA (X4) | 0.2522 | 0.0432 | 0.32351 | 5.8433 | 6.04E-09 |
| Third Year GPA (X5) | 0.4806 | 0.0319 | 0.55104 | 15.066 | 1.93E-48 |
| Program² (X1²) | 0.0069 | 0.0009 | 0.037883 | 7.7051 | 2.13E-14 |
| Year of Entry² (X2²) | 0.0044 | 0.0008 | 0.031763 | 5.4661 | 5.23E-08 |
| First Year GPA² (X3²) | 0.0121 | 0.0070 | 0.0079419 | 1.7162 | 0.0863 |
| Second Year GPA² (X4²) | 0.0053 | 0.0067 | 0.0044 | 0.7869 | 0.4315 |
| Third Year GPA² (X5²) | -0.0080 | 0.0050 | -0.0088762 | -1.6088 | 0.1078 |
Number of observations: 1841, EDoF: 1830.
RMS Error: 0.137.
R²: 0.957, Adjusted R²: 0.957.
F-statistic vs. constant model: 4.11e+03, p-value = 0.
F-statistic values for the components, except for the constant term.
| | Sum Sq. | DF | Mean Sq. | F | pValue |
|---|---|---|---|---|---|
| Program | 0.1711 | 1 | 0.1711 | 9.1635 | 0.0025 |
| Year of Entry | 0.0267 | 1 | 0.0267 | 1.4279 | 0.2323 |
| First Year GPA | 8.4678 | 1 | 8.4678 | 453.42 | 4.54E-90 |
| Second Year GPA | 18.556 | 1 | 18.556 | 993.61 | 1.42E-174 |
| Third Year GPA | 81.8 | 1 | 81.8 | 4380.1 | 0 |
| Program² | 1.1087 | 1 | 1.1087 | 59.369 | 2.13E-14 |
| Year of Entry² | 0.5580 | 1 | 0.5580 | 29.879 | 5.23E-08 |
| First Year GPA² | 0.0550 | 1 | 0.0550 | 2.9455 | 0.0863 |
| Second Year GPA² | 0.0116 | 1 | 0.0116 | 0.6192 | 0.4315 |
| Third Year GPA² | 0.0483 | 1 | 0.0483 | 2.5882 | 0.1078 |
| Error | 34.176 | 1830 | 0.0187 | | |
ANOVA for the quadratic regression model.
| | Sum Sq. | DF | Mean Sq. | F | pValue |
|---|---|---|---|---|---|
| Total | 801.38 | 1840 | 0.4355 | | |
| Model | 767.2 | 10 | 76.72 | 4108.1 | 0 |
| Linear | 765.3 | 5 | 153.06 | 8195.9 | 0 |
| Nonlinear | 1.9023 | 5 | 0.38046 | 20.372 | 7.76E-20 |
| Residual | 34.176 | 1830 | 0.0187 | | |
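The "Nonlinear" row is an extra-sum-of-squares (partial F) comparison between the linear and the pure quadratic models: it asks whether the five squared terms reduce the residual sum of squares by more than chance would explain. The sketch below recomputes that F-statistic from the two residual sums of squares reported in the ANOVA tables; the variable names are illustrative, and the p-value is left out because it would need an F-distribution routine from outside the standard library.

```python
# Residual sums of squares from the two ANOVA tables.
ss_residual_linear = 36.078      # linear model, 1835 error degrees of freedom
ss_residual_quadratic = 34.176   # pure quadratic model, 1830 error degrees of freedom
extra_terms = 5                  # one squared term per predictor
df_residual_quadratic = 1830

# Reduction in unexplained variation gained by adding the squared terms.
ss_nonlinear = ss_residual_linear - ss_residual_quadratic
ms_nonlinear = ss_nonlinear / extra_terms
ms_residual = ss_residual_quadratic / df_residual_quadratic

partial_f = ms_nonlinear / ms_residual
r_squared_gain = ss_nonlinear / 801.38  # share of the total sum of squares

print(f"nonlinear SS = {ss_nonlinear:.4f}, partial F = {partial_f:.2f}")
print(f"R^2 gain from squared terms = {r_squared_gain:.4f}")
```

The squared terms are jointly significant, yet they account for only about 0.2% of the total variation, which is consistent with R² moving from 0.955 to 0.957.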
Fig. 7. (a) Added variable plot for the whole model (b) Adjusted response plot using the first year GPA.
Fig. 8. (a) Adjusted response plot using the second year GPA (b) Adjusted response plot using the third year GPA.