Abstract
Due to the increasing use of technology-enhanced educational assessment, data mining methods have been explored to analyse process data in log files from such assessments. However, most studies have been limited to one data mining technique under one specific scenario. The current study demonstrates the use of four frequently used supervised techniques, namely Classification and Regression Trees (CART), gradient boosting, random forest, and support vector machine (SVM), and two unsupervised methods, self-organizing map (SOM) and k-means, fitted to one assessment dataset. The USA sample (N = 426) from the 2012 Program for International Student Assessment (PISA), responding to problem-solving items, is extracted to demonstrate the methods. After feature generation and feature selection, classifier development procedures are implemented using the illustrated techniques. Results show satisfactory classification accuracy for all the techniques. Suggestions for the selection of classifiers are presented based on the research questions and on the interpretability and simplicity of the classifiers. Interpretations of the results from both supervised and unsupervised learning methods are provided.
Keywords: data mining; educational assessment; log file; process data; psychometric
Year: 2018 PMID: 30532716 PMCID: PMC6265513 DOI: 10.3389/fpsyg.2018.02231
Source DB: PubMed Journal: Front Psychol ISSN: 1664-1078
Figure 1. PISA 2012 problem-solving question TICKETS task 2 (CP038Q01) screenshots. (For a clearer view, see http://www.oecd.org/pisa/test-2012/testquestions/question5/).
Figure 2. Screenshot of the log file for one student.
15 raw event values and 36 generated features.
| Feature group | Items |
| Event_value (15) | Start, End, city_subway, concession, full_fare, daily, Cancel, country_trains, individual, Buy, trip_1, trip_2, trip_3, trip_4, trip_5 |
| Time features (4) | T_time, A_time, S_time, E_time |
| Single actions (12) | All raw event_values except Start, End, and Buy |
| Two actions coded together (18) | e.g., S_city (Start → city_subway) |
| Four actions coded together (2) | city_con_ind_4 (city_subway → concession → individual → trip_4); city_con_daily_cancel (city_subway → concession → daily → Cancel) |
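The single-action and multi-action features above are essentially unigram and bigram counts over each student's ordered event sequence. A minimal sketch of this kind of feature generation, assuming a plain list of event values per student (the helper function and the sample log are hypothetical, not from the paper's code):

```python
from collections import Counter

def action_features(events):
    """Count single actions and adjacent action pairs ("two actions
    coded together") from one student's ordered event sequence."""
    feats = Counter(events)  # single-action counts
    # adjacent pairs, e.g. "Start->city_subway"
    feats.update(f"{a}->{b}" for a, b in zip(events, events[1:]))
    return dict(feats)

# Hypothetical log for one student on the TICKETS task
log = ["Start", "city_subway", "concession", "individual", "trip_4", "Buy", "End"]
feats = action_features(log)
```

Longer patterns, such as the four-action sequences in the table, follow the same idea with a wider sliding window.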
Figure 3. Feature importance indicated by tree-based methods.
Figure 4. The CART tuning results for the cost-complexity parameter (cp).
Figure 5. The gradient boosting tuning results.
Figure 6. The random forest tuning results (peak point corresponds to m = 4).
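The tuning in Figures 4 to 6 searches a grid of candidate hyperparameter values by cross-validation, e.g. the number of features m tried at each random-forest split (mtry in R's randomForest). A hedged scikit-learn sketch of the same procedure on synthetic stand-in data, where max_features plays the role of m:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the 36 generated features (not the PISA data)
X, y = make_classification(n_samples=400, n_features=36, random_state=0)

# 5-fold cross-validated grid search over m (max_features)
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid={"max_features": [2, 4, 8, 16, 32]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)
best_m = grid.best_params_["max_features"]
```

The same pattern applies to cp for CART and to the learning rate and tree depth for gradient boosting.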
Average of accuracy measures of the scores.
| CART | rpart | 0.92 | 0.95 | 0.89 | 0.97 | 0.97 | 0.98 | 0.96 | 0.99 | 0.93 | 0.96 | 0.98 |
| Random Forest | randomForest | 0.92 | 0.95 | 0.89 | 0.95 | 1.00 | 0.99 | 0.96 | 0.97 | 0.94 | 0.95 | 0.99 |
| Gradient Boosting | gbm | 0.94 | 0.96 | 0.89 | 0.97 | 1.00 | 0.99 | 0.96 | 0.99 | 0.94 | 0.96 | 0.99 |
| Support Vector Machine | kernlab | 0.92 | 0.95 | 0.94 | 0.93 | 1.00 | 0.98 | 0.98 | 0.97 | 0.96 | 0.96 | 0.99 |
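The table's second column names the R package used for each classifier (rpart, randomForest, gbm, kernlab). A hedged sketch of the same comparison using scikit-learn analogues on synthetic stand-in data, reporting held-out accuracy; it illustrates the workflow, not the paper's exact pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in sized like the study (N = 426, 36 features)
X, y = make_classification(n_samples=426, n_features=36, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

models = {
    "CART": DecisionTreeClassifier(random_state=1),
    "Random Forest": RandomForestClassifier(random_state=1),
    "Gradient Boosting": GradientBoostingClassifier(random_state=1),
    "SVM": SVC(kernel="rbf"),
}
scores = {
    name: accuracy_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
    for name, m in models.items()
}
```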
Figure 7. The CART classification.
Clustering Algorithms' Fit (DBI) and Agreement (Cohen's Kappa).
| 3 | 1.427 | 1.54 | 0.037 | 1.741 | 1.696 | 0.900 |
| 4 | 1.792 | 1.447 | 0.061 | 1.444 | 1.178 | 0.078 |
| 5 | 1.296 | 1.098 | 1.133 | | | |
| 6 | 1.448 | 1.087 | 0.934 | 1.057 | 1.171 | 0.390 |
| 7 | 1.413 | 1.023 | 0.835 | 1.177 | 0.920 | 0.891 |
| 8 | 0.198 | 1.057 | 0.753 | 1.063 | 1.034 | 0.894 |
| 9 | 1.099 | 1.288 | 0.979 | 0.831 | | |
| 10 | 1.442 | 0.251 | 0.884 | 1.288 | 0.816 | 0.627 |
Best-fitting solution with the training dataset but a lower Kappa value with the test dataset, indicating disagreement between k-means and SOM.
Final chosen solution. Bold values indicate potential final clustering solution and are discussed in the text.
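The two diagnostics in the table, the Davies-Bouldin index (DBI) for cluster fit and Cohen's kappa for agreement between the k-means and SOM labelings, can both be computed with scikit-learn. A sketch on synthetic data, with a second k-means run standing in for the SOM (an assumption; the paper used an actual SOM):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import cohen_kappa_score, davies_bouldin_score

# Synthetic stand-in data with a known cluster structure
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
alt = KMeans(n_clusters=4, n_init=10, random_state=1).fit(X)  # stand-in for SOM

dbi = davies_bouldin_score(X, km.labels_)            # lower = tighter, better-separated clusters
kappa = cohen_kappa_score(km.labels_, alt.labels_)   # agreement between the two labelings
```

Note that kappa compares label identities directly, so in practice the two algorithms' cluster labels must first be aligned (e.g., by majority matching); otherwise identical partitions with permuted labels yield a low kappa.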
Figure 8. Percentage in each score category in the final SOM clustering solution with 9 clusters from the training dataset.