
The impact of engineering students' performance in the first three years on their graduation result using educational data mining.

Aderibigbe Israel Adekitan¹, Odunayo Salau².

Abstract

Research studies on educational data mining are on the increase because the knowledge acquired from machine learning processes helps to improve decision making in higher institutions of learning. In this study, predictive analysis was carried out to determine the extent to which the fifth year and final Cumulative Grade Point Average (CGPA) of engineering students in a Nigerian university can be determined using the program of study, the year of entry and the Grade Point Average (GPA) for the first three years of study as inputs into a Konstanz Information Miner (KNIME) based data mining model. Six data mining algorithms were considered, and a maximum accuracy of 89.15% was achieved. The result was verified using linear and pure quadratic regression models, for which R² values of 0.955 and 0.957, respectively, were recorded. This creates an opportunity for identifying students who may graduate with poor results or may not graduate at all, so that early intervention may be deployed.


Keywords: Computer science; Education; Information science

Year: 2019 PMID: 30886917 PMCID: PMC6395785 DOI: 10.1016/j.heliyon.2019.e01250

Source DB: PubMed Journal: Heliyon ISSN: 2405-8440

Introduction

Higher institutions are set up to provide quality education capable of transforming the level of awareness, knowledge, and the capacity of the human mind. In Africa, the educational sector has suffered many setbacks due to underdevelopment, economic hardship, insufficient budgets and corruption. Nigeria is the most populous country in Africa, and the country has the largest university system in the sub-Saharan region (Olaleye and Oyewole, 2016). Engineering education is a driver of global sustainable development and innovation (Agboola and Elinwa, 2013). In Nigeria, engineering education is regulated by the National Universities Commission (NUC), which acts as a proxy for the government (Mahmud et al., 2012). NUC and the Council for the Regulation of Engineering in Nigeria (COREN), together with relevant stakeholders, provide regulatory oversight on the activities and operations of Nigerian universities via relevant policies aimed at ensuring quality in the educational system (Agboola and Elinwa, 2013). NUC provides the regulatory framework for university education in Nigeria and ensures, along with other stakeholders (Akerele, 2008), the implementation of all quality assurance policies and the periodic accreditation of Nigerian universities (Ajayi and Ekundayo, 2008; Akerele, 2008) toward turning out quality students (Idachaba, 2018; Obadara and Alaka, 2013). A prospective student is required to satisfy the minimum entry requirements stipulated by NUC to qualify for admission consideration by any Nigerian university (Adeyemi, 2001; Odukoya et al., 2018; Oladipo et al., 2009). The engineering curriculum in Nigerian universities runs for five years, comprising ten academic semesters. In the second semester of the fourth year of an engineering program, students go on an internship termed the Students' Industrial Work Experience Scheme (SIWES) for the full semester and return the next session as final year students. 
The internship program is aimed at bridging the gap between theory and practice, as it exposes students to the realities of the engineering profession. In the first and second years of an engineering program, students are taught the sciences and the basics of engineering, as a continuation of their secondary school education and as an introduction to general engineering. In the third year, the curriculum is more focused on the core discipline of each engineering student, that is, electrical engineering, mechanical engineering, civil engineering, and so forth. By the end of the third year, engineering students are already grounded in the basics of their profession. The academic performance of engineering students from their first year to the third year is vital in terms of the acquisition of foundational knowledge and its impact on their final graduation Cumulative Grade Point Average (CGPA). It is often said that beyond the third year it is very challenging for a student to move from the current class of grade (first class – 1st, second class upper division – 2|1, second class lower division – 2|2, and third class – 3rd) to a higher one, due to the nature of the academic courses at 400L (fourth year) and 500L (fifth year), which are more robust and touch the core foundations of the engineering disciplines. The Nigerian educational curriculum has also been criticized as being overloaded and not adequately meeting the societal knowledge needs of the Nigerian learner (Obadara and Alaka, 2013; Omoregie, 2008). The students' grades are classified as follows: first class for a CGPA greater than 4.50, second class upper division for a CGPA of 3.50–4.49, second class lower division for a CGPA of 2.50–3.49, and third class for a CGPA of 1.50–2.49. A student with a CGPA of less than 1.50 is placed on academic probation in the following semester; if the student is already in the final semester, the result is classified as a fail grade at Covenant University. 
A student who does not satisfy the requirements for the honours category but has a CGPA of at least 1.0 may be awarded a pass degree in some Nigerian universities. To graduate with a good class of grade, students must work towards it from 100L (first year), because by 300L (third year) the CGPA will have taken form and may be hard to improve. This study considers the impact of the Grade Point Average (GPA) of engineering students from 100L to 300L on the final graduation CGPA, using Covenant University in Nigeria as a case study. An attempt was made to establish the extent to which the final year CGPA of an engineering student can be predicted using the GPA of the first three years of study. The academic performance dataset of 1841 students who were admitted and graduated within the period 2002–2014 across seven engineering programs in Covenant University was analysed using a predictive Konstanz Information Miner (KNIME) based data mining model and regression analysis in MATLAB.
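The grade bands described in the introduction can be written as a small lookup. The sketch below is illustrative only: the function name is hypothetical, and it assumes that each band includes its lower bound (so a CGPA of exactly 4.50 is treated as first class), which the text does not state explicitly.

```python
def class_of_degree(cgpa: float) -> str:
    """Map a final CGPA on the 5-point scale to a class of degree.

    Assumption (not stated in the text): each band includes its lower
    bound, so 4.50 -> "1st", 3.50 -> "2|1", and so on.
    """
    if not 0.0 <= cgpa <= 5.0:
        raise ValueError("CGPA must lie between 0.0 and 5.0")
    if cgpa >= 4.50:
        return "1st"
    if cgpa >= 3.50:
        return "2|1"
    if cgpa >= 2.50:
        return "2|2"
    if cgpa >= 1.50:
        return "3rd"
    # Below 1.50: probation, or a fail grade in the final semester.
    return "fail"


for value in (4.93, 3.56, 2.75, 1.80, 1.20):
    print(value, class_of_degree(value))
```

The four honours labels returned here are the same ones used as the target classes in the KNIME model.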

Background

The quality of student-teacher interaction, student participation and engagement processes are vital indices that affect student overall performance, failure and dropout rate. For optimal learning and for an educational system to be able to improve on its practices and gaps, it is vital to have a means of collecting performance related feedback data (Daradoumis et al., 2019) and deploying systemic appraisal towards identifying weaknesses in the delivery of education to the student populace. Educational data mining is one of the methods that can be used for pattern recognition and trend identification. Data mining is the extraction of hidden useful information from a dataset through scientific analysis and methods that identify data trends, and hidden patterns within the given dataset, and as such, data mining can be referred to as knowledge discovery (Azevedo, 2018; Hussain et al., 2018). The discovery of hidden information is achieved by running data mining algorithms that combine statistics with computer science to mine valuable information from a seemingly meaningless data jumble. Data mining can be applied in various fields such as engineering (Adekitan et al., 2019; Saini and Aggarwal, 2018), business management (Zuo et al., 2016), marketing and product design (Jin et al., 2019), computer science (Mahendra et al., 2019), education (Ibrahim et al., 2019; Porouhan, 2018), genetics (Noreña et al., 2018), biological studies (Gu et al., 2018), facility maintenance management (Miguel-Cruz et al., 2019), health and drug development studies (Keserci et al., 2017), chemistry and toxicity analysis (Saini and Srivastava, 2019), meteorology (Kovalchuk et al., 2019), transportation safety (Divya et al., 2019) and traffic management (Amiruzzaman, 2019), fraud detection (Vardhani et al., 2019), and so forth. In the educational sector, volumes of data are daily generated from various teaching and learning activities within an institution. 
Educational data mining is the use of data mining techniques to extract vital information from a dataset generated within the educational context. Such information provides clues on previously unknown trends that relate to student performance (Adekitan and Noma-Osaghae, 2018; Roy and Garg, 2018) and learning behaviours (Kim et al., 2018), the efficiency and quality of teaching, student potential prediction (Yang and Li, 2018), the propensity of students to drop out, etc. This provides an opportunity for identifying quality gaps and for proposing improvements and policies that raise the quality and the delivery of educational systems (Agarwal et al., 2012; Baepler and Murdoch, 2010; Osmanbegović and Suljic, 2012). Educational data mining is a machine learning process that has been applied for studying and predicting student performance (Asif et al., 2017; Gasevic et al., 2014; Kostopoulos et al., 2018), for evaluating the integration of learning technologies (Angeli et al., 2017), and for identifying learning challenges. A relevant educational dataset that has accumulated over time and depicts the operational process under evaluation must be available and adequately processed for a data mining analysis to achieve a reasonable accuracy (Almarabeh, 2017). Research studies have been carried out to enable the prediction of the final graduation CGPA of students using their lower level performance grades. In the study by Pelumi Oguntunde et al. (2018), the final CGPA of students was predicted using multiple linear regression and correlation to analyse the yearly GPA, and various inferential statistics were developed. The study determined the correlation between the first-year result and the final-year result of the student. 
With the aid of a regression plot, the students' GPA for the five years of study was fitted using multiple linear regression in order to explain how the GPA for each year contributed to the variations in the final CGPA of the students at graduation. In Bucos and Drăgulescu (2018), features such as student attendance, average scores, relevant course data and the level of student participation in class were deployed in a data mining model for predicting the performance of 908 students. A decision tree model was applied by Ahmed and Elaraby (2014) to predict the probability of failure of 1,547 students, so that the management team could deploy adequate and early intervention. In the study, the student grades were classified into five categories: excellent, very good, good, acceptable and fail. Ten input features, namely the student's department, high school grades, level of participation in class, attendance, midterm scores, lab reports, homework grades, seminar score, completion of assignments and the overall grades, were applied in the decision tree model developed. Similarly, in Yadav et al. (2012), decision tree classifiers were used to predict the likelihood of a student dropping out of an institution. In Tair and El-Halees (2012), association, classification, clustering and outlier detection data mining techniques were applied to analyse 3,314 graduate student performance records over a fifteen-year period. The dataset was analysed using Rule Induction, a Naïve Bayesian classifier and the K-Means clustering algorithm, followed by density-based and distance-based outlier detection methods. Eighteen attributes of the student dataset were considered, and only six attributes were selected for the data mining analysis: matriculation GPA, gender, specialty of the student, the city of the student, the grade and the type of secondary school attended. 
The remaining 12 attributes were dropped due to their large variances and because some of the attributes are personal information that did not provide useful knowledge. The unsupervised clustering analysis identified four unique clusters in the dataset using the k-means algorithm. A data mining method was applied by Al-Radaideh et al. (2006) to evaluate student data towards identifying the key attributes that influence the academic performance of students. This provides an opportunity for improving the quality of higher education. In Kabakchieva (2013), data mining techniques were applied to analyse student data at a Bulgarian university. The student dataset that was analysed contained the personal and pre-admission attributes of each student. The Decision Tree classifier (J48), k-Nearest Neighbour, Bayesian and Naïve Bayes classifiers, and the OneR and JRip rule learners were applied to extract knowledge from the student dataset, and an accuracy of 52–67% was achieved. The result showed that the number of courses failed in the first academic year and the admission score of the student are two of the most influential features in the classification analysis.

Analysis

The dataset investigated in this study contains the GPA data for the first three academic years and the final CGPA of 1,841 students from 2002 to 2014, across seven engineering departments: Information and Communication Engineering (ICE), Chemical Engineering (CHE), Computer Engineering (CE), Mechanical Engineering (MECH), Electrical and Electronics Engineering (ELECT), Civil Engineering (CVE), and Petroleum Engineering (PE). The student performance dataset of the engineering students of Covenant University in Nigeria was obtained from the study by Popoola et al. (2018). The dataset file in spreadsheet format is named “Supplementary material” and is available from Popoola et al. (2018) via a link in appendix A. The statistical attributes of the dataset were determined and are presented in Table 1 as descriptive statistics. The GPA data for the first year, the second year, the third year and the final CGPA are fitted using the Gamma, Normal, Weibull and Logistic distributions respectively. Fig. 1(a) shows the probability density function plot and Fig. 1(b) shows the cumulative probability function plot for the GPA of the engineering students in the first year. Fig. 2(a) shows the probability density function plot and Fig. 2(b) shows the cumulative probability function plot for the GPA of the engineering students in the second year. The probability density function plot and the cumulative probability function plot for the third year GPA are presented in Fig. 3(a) and (b) respectively, while those for the final year CGPA are presented in Fig. 4(a) and (b) respectively.

Table 1

Descriptive statistics of 1841 students' yearly GPA and final CGPA.

	Min	Max	Mean	Std. deviation	Variance	Skewness	Kurtosis
First Year GPA	1.6000	4.9600	3.7977	0.6591	0.4344	-0.6254	0.0265
Second Year GPA	1.1900	4.9600	3.3070	0.7435	0.5528	-0.0407	-0.5667
Third Year GPA	0.9700	5.0000	3.3935	0.8535	0.7285	-0.3226	-0.6562
Final CGPA	1.8000	4.9300	3.5605	0.6599	0.4355	-0.2190	-0.5777
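As a quick consistency check on Table 1, the reported variance of each column should equal the square of its standard deviation, up to rounding. The short sketch below uses only the values printed in the table; the helper name and the tolerance are illustrative choices.

```python
# (standard deviation, reported variance) copied from Table 1.
table1 = {
    "First Year GPA": (0.6591, 0.4344),
    "Second Year GPA": (0.7435, 0.5528),
    "Third Year GPA": (0.8535, 0.7285),
    "Final CGPA": (0.6599, 0.4355),
}


def variance_matches(std: float, variance: float, tol: float = 5e-4) -> bool:
    """True when std squared agrees with the reported variance within tol."""
    return abs(std ** 2 - variance) <= tol


checks = {name: variance_matches(sd, var) for name, (sd, var) in table1.items()}
for name, ok in checks.items():
    print(f"{name}: {'consistent' if ok else 'inconsistent'}")
```

The Final CGPA variance of 0.4355 reappears later as the total mean square in the ANOVA tables, which is what one would expect for the variance of the dependent variable.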

Fig. 1

First year GPA plots (a) Probability density function (b) Cumulative probability function.

Fig. 2

Second year GPA plots (a) Probability density function (b) Cumulative probability function.

Fig. 3

Third year GPA plots (a) Probability density function (b) Cumulative probability function.

Fig. 4

Final year CGPA plots (a) Probability density function (b) Cumulative probability function.

The key features of the dataset comprise the course of study or program, the year of entry, the first-year GPA, the second-year GPA, and the third-year GPA. The class of degree (1st, 2|1, 2|2 and 3rd class) was applied as the target in the KNIME model developed while in the regression models, the class of degree was replaced with the actual CGPA as the model target.

Methodology

KNIME (Konstanz Information Miner) is a well-known, Java-based, modular data mining application that facilitates the interactive, visual assembly, testing and running of data mining pipelines. The graphical workflow in KNIME is made possible by means of an Eclipse plug-in (Berthold et al., 2009), and the KNIME application is available under the open source GPLv3 license (KNIME, 2018b). The application has been applied in many data mining analyses and projects (Wahbeh et al., 2011; Yu et al., 2016) by setting up workflows on the KNIME platform. For the classification analysis on the KNIME data mining platform, the following inputs were applied: the engineering department or program of study, the year of entry, and the students' GPA in the first, second and third years. The seven engineering programs are ICE, CHE, CE, MECH, ELECT, CVE and PE, while the year or session of admission stretched from the 2002/2003 session to the 2009/2010 session. A predictive model was developed on the KNIME analytics platform to identify the hidden relationships among the features of the dataset that may enable a reasonable prediction of the class of grade of the final year CGPA of the engineering students, using their GPA for the first three academic years out of the total five years of study for engineering programs in Nigeria. In the KNIME workflow, the data were imported into the platform using the Excel reader, and the statistical properties of the dataset were extracted with the statistics node. The data were pre-processed by splitting the samples into two using stratified sampling: 70% for training and 30% for testing. The class of grade is colour coded to enable visual evaluation using scatter plots. The samples were normalized, and principal component analysis was carried out to reduce the dimensionality of the variables. 
The learners Meta node contains six data mining learner algorithms, while the predictors Meta node contains their corresponding predictors. The performance of each algorithm was evaluated using scorers, and the results were exported through the Excel writers. The arrows indicate the direction of data flow from one stage to another in the KNIME workflow. The forward feature selection Meta node made it possible to rank the influence of each input variable. Six main data mining algorithms were applied in the KNIME model for predicting the class of grade of the final CGPA of the students at graduation: the Probabilistic Neural Network (PNN) based on DDA (Dynamic Decay Adjustment), the Random Forest Predictor, the Decision Tree Predictor, the Naïve Bayes Predictor, the Tree Ensemble Predictor, and the Logistic Regression Predictor. The data mining algorithms were applied as pre-configured by the developers, with the class of grade selected as the target column; further information on the algorithms and their configuration can be found in the KNIME and NodePit online documentation (KNIME, 2018a; NodePit, 2018). The data were partitioned in the ratio 70:30 using stratified sampling: 70% of the students' performance data samples were used to train the model, while the remaining 30% were used to test the performance of the predictive algorithms. The dataset was normalized, and principal component analysis was carried out to transform the possibly correlated features into uncorrelated principal components in order to improve the predictive accuracy of the model. For a comparative analysis, and to further validate the performance of the data mining model, linear and pure quadratic regression models were also developed to analyse the students' performance dataset.
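The 70:30 stratified partitioning described above keeps the share of each class of grade roughly the same in the training and test sets. The following is a minimal standard-library sketch of that idea, not the KNIME partitioning node itself; the function name, the seed and the toy labels are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)  # random order within the class
        cut = round(train_fraction * len(members))
        train_idx.extend(members[:cut])
        test_idx.extend(members[cut:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (not the study data): 100 students in four classes.
labels = ["1st"] * 10 + ["2|1"] * 50 + ["2|2"] * 30 + ["3rd"] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting within each class, rather than over the whole dataset at once, is what stops a small class such as 3rd from being under-represented in the test set by chance.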

Results & discussion

This section presents the results of the predictive models using both KNIME and regression-based models.

Results using KNIME-based model

The predictive capability of each of the five input variables was evaluated using the forward feature selection Meta node on the KNIME server. The result shows that, in terms of accuracy, the third year GPA is the most influential variable, followed by the second year GPA, the first year GPA and the program, while the year of entry is the least influential variable in the classification experiment. The performance of each algorithm is presented as a confusion matrix in Table 2 for the PNN predictor, in Table 3 for the Random Forest predictor, in Table 4 for the Decision Tree predictor, in Table 5 for the Naïve Bayes predictor, in Table 6 for the Tree Ensemble predictor and in Table 7 for the Logistic Regression predictor. The confusion matrix shows the predictive performance of an algorithm for each class of the classification analysis, identifying the True Positive, False Positive, True Negative and False Negative predictions (Davis and Goadrich, 2006; Parker, 2001; Visa et al., 2011) for the 1st, 2|1, 2|2, and 3rd class grade classifications. A comparative performance analysis for the six algorithms is presented in Table 8. In terms of prediction accuracy, the Logistic Regression predictor had the highest accuracy of 89.15%, followed by the Tree Ensemble with an accuracy of 87.884%. The Decision Tree predictor had the third best accuracy of 87.85%, and the Random Forest predictor had the fourth best accuracy of 87.70%. The Naive Bayes predictor had an accuracy of 86.438%, while the PNN predictor had the lowest accuracy of 85.89%. The performance of the algorithms in terms of the number of True Positive and False Positive predictions is presented comparatively in Table 9.

Table 2

Confusion matrix for the Probabilistic Neural Network (PNN) Predictor.

	2\|2	3rd	2\|1	1st
2\|2	181	7	23	0
3rd	16	16	0	0
2\|1	18	0	242	1
1st	0	0	13	36

Table 3

Confusion matrix for the Random Forest predictor.

	2\|2	3rd	2\|1	1st
2\|2	179	7	25	0
3rd	14	18	0	0
2\|1	13	0	244	4
1st	0	0	5	44

Table 4

Confusion matrix for the Decision Tree predictor.

	2\|2	3rd	2\|1	1st
2\|2	176	9	21	0
3rd	12	20	0	0
2\|1	16	0	237	3
1st	0	0	5	44

Table 5

Confusion matrix for the Naïve Bayes predictor.

	2\|2	3rd	2\|1	1st
2\|2	177	16	18	0
3rd	7	25	0	0
2\|1	23	0	236	2
1st	0	0	9	40

Table 6

Confusion matrix for the Tree Ensemble predictor.

	2\|2	3rd	2\|1	1st
2\|2	178	7	26	0
3rd	12	20	0	0
2\|1	13	0	244	4
1st	0	0	5	44

Table 7

Confusion matrix for the Logistic Regression predictor.

	2\|2	3rd	2\|1	1st
2\|2	184	7	20	0
3rd	11	21	0	0
2\|1	13	0	246	2
1st	0	0	7	42

Table 8

Model performance comparison.

	PNN	Random Forest	Decision Tree	Naive Bayes	Tree Ensemble	Logistic Regression
Correct Classified	475	485	477	478	486	493
Accuracy	85.895%	87.70%	87.85%	86.438%	87.884%	89.15%
Cohen's Kappa (k)	0.767	0.799	0.803	0.782	0.803	0.823
Wrong Classified	78	68	66	75	67	60
Error	14.105%	12.297%	12.155%	13.562%	12.116%	10.85%
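The accuracy and Cohen's kappa reported in Table 8 can be recomputed from a confusion matrix. The sketch below does this for the Logistic Regression matrix in Table 7, using the standard definitions (observed agreement on the diagonal, chance agreement from the row and column totals); the function name is illustrative.

```python
def accuracy_and_kappa(matrix):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]

    observed = correct / total
    # Agreement expected by chance from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Table 7 (Logistic Regression); rows and columns ordered 2|2, 3rd, 2|1, 1st.
logistic = [
    [184, 7, 20, 0],
    [11, 21, 0, 0],
    [13, 0, 246, 2],
    [0, 0, 7, 42],
]
acc, kappa = accuracy_and_kappa(logistic)
print(f"accuracy = {acc:.2%}, kappa = {kappa:.3f}")
```

For this matrix the function reproduces the 89.15% accuracy and the kappa of 0.823 listed for Logistic Regression in Table 8.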

Table 9

Prediction confusion of the six data mining predictors.

	PNN		Random Forest		Decision Tree		Naive Bayes		Tree Ensemble		Logistic Regression
	TP	FP	TP	FP	TP	FP	TP	FP	TP	FP	TP	FP
2\|2	181	34	179	27	176	28	177	30	178	25	184	24
3rd	16	7	18	7	20	9	25	16	20	7	21	7
2\|1	242	36	244	30	237	26	236	27	244	31	246	27
1st	36	1	44	4	44	3	40	2	44	4	42	2
Overall	475	78	485	68	477	66	478	75	486	67	493	60

TP – True Positive.

FP – False Positive.
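The per-class counts in Table 9 follow from the confusion matrices: the True Positives for a class are the diagonal entry, and the False Positives are the remaining entries in that class's column. This reading assumes that the columns of Tables 2 to 7 hold the predicted class, which is an inference from the numbers rather than something the tables state. A short sketch using the PNN matrix from Table 2:

```python
def tp_fp_per_class(matrix, class_names):
    """Return {class: (TP, FP)}, treating columns as predicted classes."""
    result = {}
    for j, name in enumerate(class_names):
        tp = matrix[j][j]
        # Everything else predicted as this class is a false positive.
        fp = sum(matrix[i][j] for i in range(len(matrix)) if i != j)
        result[name] = (tp, fp)
    return result


classes = ["2|2", "3rd", "2|1", "1st"]
# Table 2 (PNN); rows and columns ordered as in `classes`.
pnn = [
    [181, 7, 23, 0],
    [16, 16, 0, 0],
    [18, 0, 242, 1],
    [0, 0, 13, 36],
]
pnn_counts = tp_fp_per_class(pnn, classes)
for name, (tp, fp) in pnn_counts.items():
    print(f"{name}: TP={tp}, FP={fp}")
```

The output matches the PNN columns of Table 9, including the overall totals of 475 True Positives and 78 False Positives.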


Results using regression-based models

For a comparative analysis, and to further validate the performance of the data mining model, whose lowest accuracy was 85.89%, linear and pure quadratic regression models were developed to analyse the students' performance dataset using MATLAB (Matrix Laboratory). The regression analysis was not performed on KNIME so that the regression model could incorporate all the data samples, rather than splitting the data into two for training and testing as required by the KNIME data mining nodes. In addition, the Analysis of Variance (ANOVA), the F statistics, and the response plot of the dependent variable to any of the independent variables can be easily produced in MATLAB. The following independent variables were considered for the regression analysis: the program of study, coded numerically from 1 to 7; the year of entry, coded numerically from 1 to 8 for the 2002/2003 to 2009/2010 academic sessions; and the GPA of the first, second and third years. These are denoted X₁, X₂, X₃, X₄ and X₅ respectively. The relationship between the independent variables and the final CGPA as the dependent variable is measured using the coefficient of determination (R²) of the regression model. Using the standardized coefficient (β) in Table 10, the third year GPA has the largest effect size or influence on the predicted CGPA, followed by the second year GPA, the first year GPA and the program of study, while the year of entry has the least influence on the dependent variable.

Table 10

Linear regression model results.

	Estimate	Standard Error (SE)	Beta (β)	tStat	pValue
(Intercept)	0.4865	0.0197	-	24.6470	4.18E-116
Program (X₁)	-0.0057	0.0017	-0.0164	-3.2952	0.0010
Year of Entry (X₂)	-0.0016	0.0018	-0.0053	-0.8765	0.3809
First Year GPA (X₃)	0.1811	0.0090	0.1809	20.1100	1.90E-81
Second Year GPA (X₄)	0.2788	0.0089	0.3141	31.3090	9.00E-173
Third Year GPA (X₅)	0.4404	0.0064	0.5696	68.8750	0

Number of observations: 1841, Error degrees of freedom (EDoF): 1835.

Root mean square (RMS) Error: 0.140.

R²: 0.955, Adjusted R²: 0.955.

F-statistic vs. constant model: 7.78e+03, p-value = 0.
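The standardized coefficients (β) in Table 10 can be related to the unstandardized estimates through the usual rescaling β = b · s_x / s_y, where s_x and s_y are the standard deviations of the predictor and of the final CGPA. The sketch below applies this to the three GPA predictors using the standard deviations from Table 1. It assumes this is how β was obtained, which the text does not state.

```python
# Standard deviations from Table 1.
SD_FINAL_CGPA = 0.6599
SD_GPA = {"first": 0.6591, "second": 0.7435, "third": 0.8535}

# Unstandardized estimates and reported betas from Table 10.
ESTIMATE = {"first": 0.1811, "second": 0.2788, "third": 0.4404}
REPORTED_BETA = {"first": 0.1809, "second": 0.3141, "third": 0.5696}


def standardized_beta(estimate: float, sd_x: float, sd_y: float) -> float:
    """Rescale a regression coefficient to standard-deviation units."""
    return estimate * sd_x / sd_y


betas = {
    year: standardized_beta(ESTIMATE[year], SD_GPA[year], SD_FINAL_CGPA)
    for year in ESTIMATE
}
for year, beta in betas.items():
    print(f"{year} year GPA: computed {beta:.4f}, reported {REPORTED_BETA[year]}")
```

Under that assumption the computed values agree with the reported β column to four decimal places, which supports reading β as the effect on the final CGPA, in standard deviations, of a one standard deviation change in the predictor.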


Linear regression model

The equation for the linear regression model obtained from the regression analysis is shown in Eq. (1). The summary of the linear regression model estimates is presented in Table 10. The model has an R² value of 0.955, which indicates that the final CGPA of engineering students in Covenant University can be reasonably determined by their performance (GPA) in the first three years of the five-year study. The F-statistic values for the components are shown in Table 11, while Table 12 displays the ANOVA for the linear regression model. The overall model prediction is presented as an added variable plot in Fig. 5(a), while Figs. 5(b), 6(a) and 6(b) present the adjusted response of the model to the first year GPA, the second year GPA and the third year GPA respectively.
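Eq. (1) itself is not reproduced in this copy of the text. Assuming it is the fitted model whose estimates are listed in Table 10, it can be written as follows, with X₁ the program code, X₂ the year-of-entry code, and X₃, X₄ and X₅ the first, second and third year GPA:

```latex
\begin{equation}
\widehat{\mathrm{CGPA}} = 0.4865 - 0.0057\,X_1 - 0.0016\,X_2
  + 0.1811\,X_3 + 0.2788\,X_4 + 0.4404\,X_5
\tag{1}
\end{equation}
```

As a worked illustration with hypothetical inputs, a student with X₁ = 1, X₂ = 1, X₃ = 4.0, X₄ = 3.5 and X₅ = 3.5 would have a predicted final CGPA of 0.4865 − 0.0057 − 0.0016 + 0.7244 + 0.9758 + 1.5414 ≈ 3.72.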

Table 11

F-statistic values for the components, except for the constant term.

	Sum of Squares (Sum Sq.)	Degrees of Freedom (DF)	Mean Square (Mean Sq.)	F	pValue
Program	0.2135	1	0.2135	10.8580	0.0010
Year of Entry	0.0151	1	0.0151	0.7683	0.3809
First Year GPA	7.9512	1	7.9512	404.41	1.90E-81
Second Year GPA	19.2730	1	19.273	980.24	9.00E-173
Third Year GPA	93.2670	1	93.267	4743.8	0
Error	36.0780	1835	0.0197

Table 12

ANOVA for the linear regression model.

	Sum Sq.	DF	Mean Sq.	F	pValue
Total	801.38	1840	0.4355
Model	765.3	5	153.06	7784.9	0
Residual	36.078	1835	0.0197
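The summary statistics quoted for the linear model follow directly from the sums of squares in Table 12. The sketch below recomputes R², the F-statistic and the root mean square error from those entries, using the standard ANOVA definitions; the variable names are illustrative.

```python
import math

# Sums of squares and degrees of freedom from Table 12.
ss_total, ss_model, ss_resid = 801.38, 765.3, 36.078
df_model, df_resid = 5, 1835

r_squared = ss_model / ss_total          # share of variance explained
ms_model = ss_model / df_model           # mean square of the model
ms_resid = ss_resid / df_resid           # mean square of the residual
f_stat = ms_model / ms_resid             # F-statistic vs. constant model
rmse = math.sqrt(ms_resid)               # root mean square error

print(f"R^2  = {r_squared:.3f}")
print(f"F    = {f_stat:.1f}")
print(f"RMSE = {rmse:.3f}")
```

These reproduce the reported R² of 0.955, the F-statistic of about 7.78e+03 and the RMS error of 0.140.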

Fig. 5

(a) Added variable plot for the whole model (b) Adjusted response plot using the first year GPA.

Fig. 6

(a) Adjusted response plot using the second year GPA (b) Adjusted response plot using the third year GPA.


Pure quadratic regression model

The equation for the pure quadratic regression model obtained from the regression analysis is shown in Eq. (2). The summary of the quadratic regression model estimates is presented in Table 13; the model has an R² value of 0.957. The F-statistic values for the components are shown in Table 14, while Table 15 displays the ANOVA for the quadratic regression model. The overall model prediction is presented as an added variable plot in Fig. 7(a), while Figs. 7(b), 8(a) and 8(b) present the adjusted response of the model to the first year GPA, the second year GPA and the third year GPA respectively.
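Eq. (2) is likewise not reproduced in this copy of the text. Assuming it is the fitted model whose estimates are listed in Table 13, with the same variable definitions as in the linear model (X₁ program, X₂ year of entry, X₃ to X₅ the first to third year GPA), it can be written as:

```latex
\begin{equation}
\begin{aligned}
\widehat{\mathrm{CGPA}} ={}& 0.7380 - 0.0570\,X_1 - 0.0439\,X_2
  + 0.1043\,X_3 + 0.2522\,X_4 + 0.4806\,X_5 \\
 & + 0.0069\,X_1^2 + 0.0044\,X_2^2 + 0.0121\,X_3^2
  + 0.0053\,X_4^2 - 0.0080\,X_5^2
\end{aligned}
\tag{2}
\end{equation}
```

A pure quadratic model contains the linear and squared terms only, with no cross-product (interaction) terms, which is why it has 10 model degrees of freedom.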

Table 13

Quadratic regression model results.

	Estimate	SE	Beta	tStat	pValue
(Intercept)	0.7380	0.0788	-	9.3629	2.20E-20
Program (X₁)	-0.0570	0.0069	-0.012941	-8.209	4.16E-16
Year of Entry (X₂)	-0.0439	0.0079	0.012463	-5.5939	2.56E-08
First Year GPA (X₃)	0.1043	0.0497	0.19571	2.0989	0.035958
Second Year GPA (X₄)	0.2522	0.0432	0.32351	5.8433	6.04E-09
Third Year GPA (X₅)	0.4806	0.0319	0.55104	15.066	1.93E-48
Program² (X₁²)	0.0069	0.0009	0.037883	7.7051	2.13E-14
Year of Entry² (X₂²)	0.0044	0.0008	0.031763	5.4661	5.23E-08
First Year GPA² (X₃²)	0.0121	0.0070	0.0079419	1.7162	0.0863
Second Year GPA² (X₄²)	0.0053	0.0067	0.0044	0.7869	0.4315
Third Year GPA² (X₅²)	-0.0080	0.0050	-0.0088762	-1.6088	0.1078

Number of observations: 1841, EDoF: 1830.

RMS Error: 0.137.

R²: 0.957, Adjusted R²: 0.957.

F-statistic vs. constant model: 4.11e+03, p-value = 0.

Table 14

F-statistic values for the components, except for the constant term.

	Sum Sq.	DF	Mean Sq.	F	pValue
Program	0.1711	1	0.1711	9.1635	0.0025
Year of Entry	0.0267	1	0.0267	1.4279	0.2323
First Year GPA	8.4678	1	8.4678	453.42	4.54E-90
Second Year GPA	18.556	1	18.556	993.61	1.42E-174
Third Year GPA	81.8	1	81.8	4380.1	0
Program²	1.1087	1	1.1087	59.369	2.13E-14
Year of Entry²	0.5580	1	0.5580	29.879	5.23E-08
First Year GPA²	0.0550	1	0.0550	2.9455	0.0863
Second Year GPA²	0.0116	1	0.0116	0.6192	0.4315
Third Year GPA²	0.0483	1	0.0483	2.5882	0.1078
Error	34.176	1830	0.0187

Table 15

ANOVA for the quadratic regression model.

	Sum Sq.	DF	Mean Sq.	F	pValue
Total	801.38	1840	0.4355
Model	767.2	10	76.72	4108.1	0
Linear	765.3	5	153.06	8195.9	0
Nonlinear	1.9023	5	0.38046	20.372	7.76E-20
Residual	34.176	1830	0.0187
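The "Nonlinear" row of Table 15 is effectively a nested-model F test: it asks whether the five squared terms, taken together, explain more variance than would be expected by chance once the linear terms are in the model. The sketch below recomputes that F value and the resulting gain in R² from the table entries; the variable names are illustrative.

```python
# Entries from Table 15.
ss_total = 801.38
ss_linear, ss_nonlinear, ss_resid = 765.3, 1.9023, 34.176
df_nonlinear, df_resid = 5, 1830

ms_nonlinear = ss_nonlinear / df_nonlinear
ms_resid = ss_resid / df_resid
f_nonlinear = ms_nonlinear / ms_resid    # F for the squared terms jointly

r2_linear = ss_linear / ss_total
r2_quadratic = (ss_linear + ss_nonlinear) / ss_total
r2_gain = r2_quadratic - r2_linear

print(f"F (nonlinear terms) = {f_nonlinear:.3f}")
print(f"R^2 linear = {r2_linear:.3f}, quadratic = {r2_quadratic:.3f}")
print(f"R^2 gain   = {r2_gain:.4f}")
```

The squared terms are jointly significant (F of about 20.37), yet they raise R² by only about 0.002, which is consistent with the two models reporting nearly identical fits.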

Fig. 7

(a) Added variable plot for the whole model (b) Adjusted response plot using the first year GPA.

Fig. 8

(a) Adjusted response plot using the second year GPA (b) Adjusted response plot using the third year GPA.

The results of the KNIME based predictive model show that the class of grade of a student's final, fifth-year graduation result can be reasonably predicted using the student's GPA for the first three years of study. A maximum accuracy of 89.15% was achieved using the Logistic Regression algorithm, while the PNN algorithm had the lowest accuracy of 85.895%. Likewise, R² values of 0.955 for the linear and 0.957 for the pure quadratic regression model were achieved, which indicates that the final CGPA of engineering students in Covenant University can be reasonably determined by their performance (GPA) in the first three years of the five-year study.

Conclusion

The management of higher educational systems has transformed over the years from being reactive to being proactive in decision making and system performance analysis. Ensuring adequate quality in the educational system is vital to students' performance and to the overall value of the knowledge being imparted. There are many benefits to detecting student issues and learning difficulties early, because this presents an opportunity to address the causal factors in time to prevent student failure and dropout. The performance of engineering students within the first three years of study is often said to be the most important in determining the final CGPA and class of grade of students in Nigeria, due to the difficulty of improving grades significantly at higher levels as the courses become more robust and intensive. In this study, a data mining approach was applied to evaluate the validity of this assumption by performing a predictive analysis to determine the final graduation CGPA and the class of grade of students in their final year using their GPA for the first three years of study. The program and the year of entry were also applied as predictive inputs into a KNIME workflow, in which six independent data mining algorithms were executed separately so that their results could be compared. A maximum accuracy of 89.15% was achieved and, using regression models for performance validation, R² values of 0.955 and 0.957 were achieved with the linear and pure quadratic regression models respectively. This indicates that the graduating results of engineering students in Nigeria, in the fifth and final year of study, can indeed be reasonably predicted using their performance in the first three academic sessions. 
For future studies, as an alternative to running six data mining algorithms separately as implemented in this study, the KNIME workflow could be modified to incorporate the six data mining algorithms together in a model using a voting system such that the benefits of each algorithm can be combined.
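The voting system proposed for future work could take several forms. One simple possibility is a majority vote over the six predicted classes for each student. The sketch below is hypothetical and not part of the study: the function name is illustrative, and the tie-breaking rule (prefer the label chosen by the model listed first, for example the most accurate one) is an assumption made for the example.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among the model predictions.

    `predictions` is ordered by model priority (assumed here to be
    descending accuracy); ties go to the label that appears first.
    """
    counts = Counter(predictions)
    best = max(counts.values())
    tied = [label for label, n in counts.items() if n == best]
    return min(tied, key=predictions.index)


# Hypothetical predictions for one student from six models.
clear_case = ["2|1", "2|1", "2|2", "2|1", "1st", "2|1"]
tied_case = ["2|2", "2|1", "2|1", "2|2", "2|1", "2|2"]

print(majority_vote(clear_case))
print(majority_vote(tied_case))
```

In the tied case each of two labels receives three votes, so the label predicted by the first-listed model is returned; weighting votes by each model's accuracy would be another reasonable design.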

Declarations

Author contribution statement

Aderibigbe Israel Adekitan: Conceived and designed the experiments; Performed the experiments; Analyzed and interpreted the data; Contributed reagents, materials, analysis tools or data; Wrote the paper. Odunayo Salau: Performed the experiments; Analyzed and interpreted the data; Contributed reagents, materials, analysis tools or data; Wrote the paper.

Funding statement

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Competing interest statement

The authors declare no conflict of interest.

Additional information

There is no additional information for this paper.
