Abstract
Access to education is the first step to benefiting from it. Although cumulative online learning experience is linked to academic learning gains, between-country inequalities mean that large populations are prevented from accumulating such experience. Low- and middle-income countries are affected by disadvantages in infrastructure such as internet access, uncontextualised learning content, and parents who are less available and less well-resourced than in high-income countries. COVID-19 has exacerbated these global inequalities, with girls affected more than boys in these regions. Therefore, the present research mined online learning data to identify features that are important for access to online learning. Data mining of 54,842,787 initial (random subsample n = 5000) data points from one online learning platform was conducted by partnering theory with data in model development. Following examination of a theory-led machine learning model, a data-led approach was taken to reach a final model. The final model was used to derive Shapley values for feature importance. As expected, country differences, gender, and COVID-19 were important features in access to online learning. The data-led model development resulted in additional insights not examined in the initial, theory-led model: namely, the importance of Math ability, year of birth, session difficulty level, month of birth, and time taken to complete a session.
Keywords: COVID-19; Country inequalities; Educational access; Machine learning; Online learning
Year: 2022 PMID: 36210911 PMCID: PMC9530424 DOI: 10.1007/s10639-022-11280-5
Source DB: PubMed Journal: Educ Inf Technol (Dordr) ISSN: 1360-2357
Fig. 1 Screenshots from Year 6 Maths Games demo. Screenshots progress from left to right, first the top row, then the bottom row. (See https://www.whizz.com/maths-games/year-6-maths-games.)
Feature dictionary, with all features initially included in analysis (i.e., Phase 2 model development)
| Feature name | Feature explanation | Feature engineering process | |
|---|---|---|---|
| 1 | Topic identifier (22 topics in total) | None; from log file | |
| 2 | Academic difficulty of the | None; from log file | |
| 3 | Within each quarter, exercises were sequenced in order of difficulty and ranged from 100 to 1000 (i.e., 100, 200, 300, etc.), incrementing at intervals of 100 within each quarter then resetting at the next quarter | None; from log file | |
| 4 | The feature, stackDepth, related to the lesson’s mode, with the default value being stackDepth = 1 to signify progression; if a learner failed a default, progression lesson, they would regress to a simpler exercise in a lesson mode with stackDepth = 2; failing that, the learner would be regressed further to an even simpler exercise at stackDepth = 3. If the learner passed the stackDepth = 3 exercise and test, they would move back to complete the exercise and test at stackDepth = 2 then, if they passed that test, return to the lesson at stackDepth = 1 | None; from log file | |
| 5 | How long the learner took to progress from beginning of lesson to the end, including the exercises and test | None; from log file | |
| 6 | How long the learner took to complete the exercise questions | None; from log file | |
| 7 | How long the learner took to complete the tutorial as a whole | None; from log file | |
| 8 | The number of questions that the learner attempted in that lesson | None; from log file | |
| 9 | The default progression tutor exercise, regression tutor exercise, replay exercise, tutor test | None; from log file | |
| 10 | The number of times help was sought by the learner | None; from log file | |
| 11 | A summary feature indicating whether the lesson was a standard, progression one, or whether the learner was repeating the lesson for whatever reason | None; from log file | |
| 12 | Computed from log file variable, | ||
| 13 | Computed from log file variable, | ||
| 14 | Computed from log file variable, | ||
| 15 | Computed from log file variable, | ||
| 16 | Dummy variable (or one-hot coding) | Dummy generated from | |
| 17 | Dummy variable (or one-hot coding) | Dummy generated from | |
| 18 | The total number of lessons completed by each learner | Computed from log file variable, | |
| 19 | Year of birth | Computed from log file variable, | |
| 20 | Month of birth | Computed from log file variable, | |
| 21 | The learner’s age in quarters. That is, year + quarter, e.g., 12.25 for 12 years and a quarter; births between January and March were quarter = 0, births between April and June were quarter = 0.25, etc | Computed from log file variable, | |
| 22 | Learner academic age. For example, a learner with pupil_ageQuart = 12.25 years who is attempting a lesson with mathLevel = 9.25 will be showing the mathAbility of + 3 years | Computation: | |
| 23 | Dummy variable (or one-hot coding). 1 = Kenya, 0 = UK or Thailand | Computed using | |
| 24 | Dummy variable (or one-hot coding). 1 = UK, 0 = Kenya or Thailand | Computed using | |
| 25 | Dummy variable (or one-hot coding). 1 = LMIC (Kenya or Thailand), 0 = HIC (UK) | Computed from | |
| 26 | 1 (least deprived) to 3 (most deprived) using country-specific deprivation codes as applied at school level. Missing data were replaced by the sample-level mean (i.e., 2.08) and rounded to the nearest integer (i.e., 2) Additional notes: The UK deprivation status was calculated using the Index of Multiple Deprivation 2019 (IMD2019, Penney, | Computed from log file variable, |
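Three of the engineering steps described in the feature dictionary (the birth-quarter offset, the country dummies, and the mean imputation of deprivation status) can be illustrated with a minimal sketch. The function names and the input format are hypothetical and are not taken from the study's code; only the rules themselves (January to March = 0, April to June = 0.25, and so on; 1 = Kenya, 1 = UK, 1 = LMIC; missing deprivation replaced by the rounded sample mean) come from the table.

```python
def birth_quarter(birth_month):
    """Map a birth month (1-12) to the quarter offset used in the age-in-quarters feature.

    January-March -> 0, April-June -> 0.25, July-September -> 0.5, October-December -> 0.75.
    """
    if not 1 <= birth_month <= 12:
        raise ValueError("birth_month must be between 1 and 12")
    return ((birth_month - 1) // 3) * 0.25


def country_dummies(country):
    """One-hot style dummies as described for the Kenya, UK and LMIC features."""
    return {
        "Kenya": int(country == "Kenya"),
        "UK": int(country == "UK"),
        # LMIC groups Kenya and Thailand; the UK is the only HIC in the sample.
        "LMIC": int(country in ("Kenya", "Thailand")),
    }


def impute_deprivation(values):
    """Replace missing deprivation codes (None) with the rounded mean of the observed codes."""
    observed = [v for v in values if v is not None]
    # Note: Python's round() rounds exact halves to the nearest even integer.
    fill = round(sum(observed) / len(observed))
    return [fill if v is None else v for v in values]


# Hypothetical toy inputs, not data from the study.
example_quarter = birth_quarter(5)
example_dummies = country_dummies("Thailand")
example_deprivation = impute_deprivation([1, 2, 3, 3, None])
```

With the study's reported sample mean of 2.08, the same rounding rule yields the fill value of 2 stated in the table.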
Fig. 2 Correlations between potential features and learning outcome (play_count). Transformed data are represented here
Fig. 3 Collective force plot showing the overall effect of all features included in the final model, using mean absolute Shapley values. As the graph progresses to the right, effects of the most important features for each individual learner are shown. Features that push the prediction higher (to the right) are shown in red, and those pushing the prediction lower are in blue. The x-axis shows participant number, ordered by similarity for this plot. Panel A gives a snapshot of the features that generally reduce play_count; Panel B shows a snapshot of features that increase play_count. Transformed data are represented here
Fig. 4 Decision plot of feature importance for global interpretation, using mean absolute Shapley values. The model output value is the learning outcome (play_count). Features that push the prediction higher (to the right) are shown in red, and those pushing the prediction lower are in blue. The fainter a line, the fewer learners it represents. Transformed data are represented here
Fig. 5 Summary plot of feature importance in final model, using mean absolute Shapley values. Features that push the prediction higher (to the right) are shown in red, and those pushing the prediction lower are in blue. Transformed data are represented here
Data-led model development. Coefficients (i.e., weights) that emerged from the regularised regression with the Elastic Net penalty when predicting access to online learning (play_count), in descending order of absolute coefficient size
| # | Feature | Coefficient |
|---|---|---|
| 1 | mathAbility | -0.28722 |
| 2 | InCountryDep | -0.14912 |
| 3 | birthMonth | 0.127049 |
| 4 | birthYear | 0.117863 |
| 5 | mathLevel | 0.110936 |
| 6 | Kenya | -0.08101 |
| 7 | exerciseId | 0.066699 |
| 8 | tutorialTime | -0.06467 |
| 9 | total_help | 0.056369 |
| 10 | totalQuestions | 0.054672 |
| 11 | timeTaken | -0.05035 |
| 12 | since_covid | 0.049945 |
| 13 | markedYear | 0.045554 |
| 14 | pupil_ageQuart | -0.04543 |
| 15 | stackDepth | 0.03957 |
| 16 | markedWeek | -0.03924 |
| 17 | topicId | 0.032182 |
| 18 | Male | 0.02713 |
| 19 | lesson_mark | 0.019211 |
| 20 | replay | 0.018695 |
| 21 | questionTime | -0.01032 |
| 22 | UK | -0.00947 |
| 23 | LMIC | 0.008953 |
| 24 | lesson_type | 0.001756 |
| 25 | markedMonth | 0 |
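For reference, one common way of writing the Elastic Net objective is given below. This is the standard textbook parametrisation, not necessarily the one used in the study: the source does not report the penalty strength $\lambda$ or the mixing weight $\alpha$, so both symbols are assumptions introduced here for illustration.

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\;
\frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
\;+\; \lambda \left( \alpha \,\lVert \beta \rVert_1
\;+\; \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2 \right)
```

Here $y$ is the outcome (play_count), $X$ the feature matrix, and $n$ the number of observations. Whenever $\alpha > 0$, the $\ell_1$ term can shrink a coefficient to exactly zero, which is consistent with the coefficient of 0 reported for markedMonth.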
Shapley values for all the features in the final model for predicting play_count
| Feature | M | SD | min | max |
|---|---|---|---|---|
| mathAbility | 0.263367 | 0.219041 | 0.001813 | 1.967418 |
| birthYear | 0.173876 | 0.136386 | 0.000677 | 0.933374 |
| mathLevel | 0.163275 | 0.128059 | 2.05E-05 | 0.709157 |
| birthMonth | 0.132435 | 0.098237 | 0.00017 | 0.93519 |
| timeTaken | 0.132227 | 0.126895 | 0.000171 | 0.913804 |
| markedWeek | 0.097876 | 0.102419 | 0.000113 | 0.752981 |
| questionTime | 0.095164 | 0.081499 | 3.77E-06 | 0.602196 |
| markedYear | 0.094058 | 0.076906 | 3.66E-05 | 0.452359 |
| pupil_ageQuart | 0.092162 | 0.085771 | 0.000123 | 0.543479 |
| lesson_mark | 0.074343 | 0.082897 | 8.61E-05 | 1.109786 |
| topicId | 0.067408 | 0.056775 | 4.13E-05 | 0.461668 |
| totalQuestions | 0.06661 | 0.088144 | 0.000104 | 0.800889 |
| tutorialTime | 0.063569 | 0.077716 | 8.35E-05 | 0.689795 |
| exerciseId | 0.048519 | 0.051968 | 1.13E-05 | 0.511054 |
| total_help | 0.047218 | 0.075228 | 2.12E-05 | 0.85206 |
| UK | 0.045384 | 0.034569 | 3.30E-06 | 0.198564 |
| markedMonth | 0.033839 | 0.032279 | 6.75E-05 | 0.23072 |
| Kenya | 0.032443 | 0.03541 | 0.000127 | 0.607252 |
| lesson_type | 0.030549 | 0.028919 | 2.24E-05 | 0.28909 |
| Male | 0.02577 | 0.028436 | 2.99E-05 | 0.275197 |
| since_covid | 0.02499 | 0.043463 | 6.24E-05 | 0.469803 |
| replay | 0.018683 | 0.046888 | 6.79E-05 | 0.4945 |
| stackDepth | 0.008345 | 0.02338 | 0.000152 | 0.420604 |
| InCountryDep | 0.006133 | 0.024946 | 6.03E-07 | 0.338753 |
| LMIC | 0 | 0 | 0 | 0 |
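The idea of ranking features by mean absolute Shapley value can be made concrete with a small, self-contained sketch. It computes exact Shapley values by enumerating feature coalitions for a toy three-feature model, then averages their absolute values across rows. The model, the data, and the single-baseline value function are all assumptions chosen for illustration; the study's own values were derived from its final model, and this sketch does not reproduce that pipeline.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a single baseline point.

    Features inside a coalition take their observed value from x; all other
    features are held at the baseline value (an assumption of this sketch).
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without_i = [x[j] if j in subset else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def mean_abs_importance(predict, rows, baseline):
    """Average the absolute Shapley value of each feature over all rows."""
    totals = [0.0] * len(baseline)
    for x in rows:
        for i, value in enumerate(shapley_values(predict, x, baseline)):
            totals[i] += abs(value)
    return [total / len(rows) for total in totals]


def toy_model(row):
    """Hypothetical model with two main effects and one interaction."""
    return 2.0 * row[0] - 1.0 * row[1] + 0.5 * row[0] * row[2]


# Hypothetical toy data, not data from the study.
rows = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [2.0, 0.0, 1.0]]
baseline = [0.0, 0.0, 0.0]
importance = mean_abs_importance(toy_model, rows, baseline)
```

Because absolute values are averaged, a feature whose contributions are large but of mixed sign still ranks as important; the direction of each effect has to be read from the per-learner plots rather than from the table.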
Fig. 7 Line graphs showing how ‘country’ (Kenya, Thailand, and the UK) as well as LMIC status related to ‘access to online learning’ (play_count). Transformed data are represented here
Fig. 6 Bar plot of feature importance of features in the final model, using mean absolute Shapley values. Panel A shows the features ordered from the most important to the least, in the final model. Panel B shows the features are generally ordered in the same way, but with clustering where features are related to each other. Transformed data are represented here
Fig. 8 The role of gender (Male, dummy variable) in predicting access to online learning (play_count)
Fig. 9 Access to online learning (play_count) as the years (markedYear) progress, with the final time point representing the year 2020 (i.e., from the onset of Covid). Transformed data are represented here
Fig. 10 Top five most important features in predicting online learning access (play_count). Transformed data are represented here
Fig. 11 Scatter plot showing how birthYear was related to online learning access (play_count). Untransformed data are represented here
The top 70 SHAP interaction values
| # | Feature | SHAP interaction value | cum_diff |
|---|---|---|---|
| 1 | questionTime * timeTaken | 0.12 | NA |
| 2 | birthYear * mathAbility | 0.08 | -0.04 |
| 3 | pupil_ageQuart * mathLevel | 0.08 | -0.01 |
| 4 | mathLevel * mathAbility | 0.07 | 0 |
| 5 | mathLevel * birthMonth | 0.07 | 0 |
| 6 | markedYear * birthYear | 0.07 | 0 |
| 7 | mathLevel * birthYear | 0.06 | -0.01 |
| 8 | birthMonth * mathAbility | 0.06 | 0 |
| 9 | timeTaken * mathLevel | 0.06 | -0.01 |
| 10 | lesson_mark * mathAbility | 0.06 | 0 |
| 11 | timeTaken * mathAbility | 0.05 | 0 |
| 12 | pupil_ageQuart * birthYear | 0.05 | 0 |
| 13 | markedWeek * timeTaken | 0.05 | 0 |
| 14 | questionTime * mathLevel | 0.05 | 0 |
| 15 | questionTime * markedWeek | 0.05 | 0 |
| 16 | markedYear * mathLevel | 0.05 | 0 |
| 17 | pupil_ageQuart * mathAbility | 0.05 | 0 |
| 18 | markedWeek * mathAbility | 0.05 | 0 |
| 19 | questionTime * mathAbility | 0.04 | 0 |
| 20 | timeTaken * birthMonth | 0.04 | 0 |
| 21 | markedYear * mathAbility | 0.04 | 0 |
| 22 | timeTaken * tutorialTime | 0.04 | 0 |
| 23 | lesson_mark * totalQuestions | 0.04 | 0 |
| 24 | pupil_ageQuart * timeTaken | 0.04 | 0 |
| 25 | questionTime * pupil_ageQuart | 0.04 | 0 |
| 26 | lesson_mark * birthYear | 0.04 | 0 |
| 27 | questionTime * birthMonth | 0.04 | 0 |
| 28 | UK * birthYear | 0.04 | 0 |
| 29 | markedWeek * mathLevel | 0.04 | 0 |
| 30 | total_help * tutorialTime | 0.04 | 0 |
| 31 | topicId * mathAbility | 0.04 | 0 |
| 32 | timeTaken * totalQuestions | 0.04 | 0 |
| 33 | topicId * timeTaken | 0.03 | 0 |
| 34 | birthYear * birthMonth | 0.03 | 0 |
| 35 | topicId * birthYear | 0.03 | 0 |
| 36 | timeTaken * birthYear | 0.03 | 0 |
| 37 | markedWeek * birthMonth | 0.03 | 0 |
| 38 | tutorialTime * mathAbility | 0.03 | 0 |
| 39 | lesson_mark * mathLevel | 0.03 | 0 |
| 40 | questionTime * topicId | 0.03 | 0 |
| 41 | pupil_ageQuart * birthMonth | 0.03 | 0 |
| 42 | tutorialTime * birthMonth | 0.03 | 0 |
| 43 | lesson_mark * timeTaken | 0.03 | 0 |
| 44 | totalQuestions * tutorialTime | 0.03 | 0 |
| 45 | markedWeek * pupil_ageQuart | 0.03 | 0 |
| 46 | pupil_ageQuart * markedYear | 0.03 | 0 |
| 47 | markedWeek * birthYear | 0.03 | 0 |
| 48 | Kenya * mathLevel | 0.03 | 0 |
| 49 | totalQuestions * mathAbility | 0.03 | 0 |
| 50 | markedYear * birthMonth | 0.03 | 0 |
| 51 | tutorialTime * mathLevel | 0.03 | 0 |
| 52 | timeTaken * total_help | 0.03 | 0 |
| 53 | lesson_mark * birthMonth | 0.03 | 0 |
| 54 | topicId * mathLevel | 0.03 | 0 |
| 55 | questionTime * tutorialTime | 0.03 | 0 |
| 56 | Male * timeTaken | 0.03 | 0 |
| 57 | questionTime * totalQuestions | 0.03 | 0 |
| 58 | markedWeek * totalQuestions | 0.03 | 0 |
| 59 | questionTime * birthYear | 0.03 | 0 |
| 60 | markedWeek * tutorialTime | 0.03 | 0 |
| 61 | markedWeek * markedYear | 0.03 | 0 |
| 62 | UK * mathLevel | 0.03 | 0 |
| 63 | questionTime * lesson_mark | 0.03 | 0 |
| 64 | lesson_mark * topicId | 0.03 | 0 |
| 65 | timeTaken * exerciseId | 0.03 | 0 |
| 66 | topicId * markedWeek | 0.02 | 0 |
| 67 | lesson_mark * markedWeek | 0.02 | 0 |
| 68 | topicId * Kenya | 0.02 | 0 |
| 69 | topicId * pupil_ageQuart | 0.02 | 0 |
| 70 | markedWeek * exerciseId | 0.02 | 0 |
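A pairwise SHAP interaction value measures how much of a prediction is attributable to two features acting together, beyond their separate contributions. The sketch below implements the Shapley interaction index for one feature pair by brute-force enumeration. The single-baseline value function and the toy model are assumptions made for illustration; in practice such values are usually obtained from a tree-based explainer rather than by enumeration, and the source does not state how its values were computed.

```python
from itertools import combinations
from math import factorial


def shap_interaction(predict, x, baseline, i, j):
    """Shapley interaction index for the feature pair (i, j), with i != j.

    Features inside a coalition take their value from x; all others are held
    at the baseline value (an assumption of this sketch).
    """
    n = len(x)
    others = [k for k in range(n) if k not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        # Weight for coalitions of this size that exclude both i and j.
        weight = factorial(size) * factorial(n - size - 2) / (2 * factorial(n - 1))
        for subset in combinations(others, size):

            def value(extra):
                active = set(subset) | set(extra)
                return predict([x[k] if k in active else baseline[k] for k in range(n)])

            # Joint effect of i and j minus their separate effects.
            total += weight * (value((i, j)) - value((i,)) - value((j,)) + value(()))
    return total


def product_model(row):
    """Hypothetical model: features 0 and 1 interact, feature 2 is purely additive."""
    return row[0] * row[1] + row[2]


# Hypothetical toy point and baseline, not data from the study.
point = [2.0, 3.0, 5.0]
reference = [0.0, 0.0, 0.0]
interacting_pair = shap_interaction(product_model, point, reference, 0, 1)
additive_pair = shap_interaction(product_model, point, reference, 0, 2)
```

In this convention the interaction is split evenly between the two orderings of a pair, so the value for (i, j) equals the value for (j, i), and a purely additive feature has zero interaction with every other feature.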
Fig. 12 SHAP interaction values for predicting access to online learning (play_count), from the strongest interaction to the weakest. Only the top 20 interactions are shown here. Transformed data are represented here
Fig. 13 Dependence plots for the six strongest interactants to emerge from the final model
Fig. 14 Dependence plots of the importance of the features that interact with Country (either UK or Kenya) in predicting learning outcomes (lesson mark), according to mean absolute Shapley values. Transformed data are represented here
Fig. 15 The interaction between gender (Male) and timeTaken to complete each lesson. Panel A represents transformed data and relates to feature importance via absolute Shapley values; Panel B represents untransformed data and reflects associations between the variables
Fig. 16 Dependence plots of the importance of the features that interact with Covid, as measured by markedYear, in predicting access to online learning (play_count), according to mean absolute Shapley values. Transformed data are represented here
Fig. 17 The interaction between learner age (pupil_ageQuart) and Covid (i.e., markedYear; pre-covid = 2015–2019, since covid = 2020 onwards) in predicting access to online learning. Untransformed data are represented here
Fig. 18 The interaction between birthMonth and Covid (i.e., markedYear; pre-covid = 2015–2019, since covid = 2020 onwards) in predicting access to online learning (play_count). Untransformed data are represented here
Fig. 19 The interaction between markedWeek and Covid (i.e., markedYear; pre-covid = 2015–2019, since covid = 2020 onwards) in predicting access to online learning (play_count). Untransformed data are represented here
Fig. 20 Dependence plots for the six strongest interactants to emerge from the final model when predicting access to online learning (play_count)