Liping Yang, Tao Xin, Canxi Cao.
Abstract
How to evaluate students' essays effectively against a set of relatively objective writing criteria has long been a topic of discussion. With the development of automatic essay scoring, a key question is whether writing quality can be evaluated systematically based on the scoring rubric. To address this issue, we used an innovative set of graph-based features to predict the quality of Chinese middle school students' essays. These features are divided into four sub-dimensions: basic characteristics, main idea, essay content, and essay development. The results show that graph-based features were significantly better at predicting human essay scores than the baseline features. This indicates that graph-based features can be used to reliably and systematically evaluate the quality of an essay based on the scoring rubric, and that they can serve as an alternative tool to replace or supplement human evaluation.
Keywords: automatic essay scoring; graph-based features; reliability; scoring rubric; writing ability
Year: 2020 PMID: 33281655 PMCID: PMC7689217 DOI: 10.3389/fpsyg.2020.531262
Source DB: PubMed Journal: Front Psychol ISSN: 1664-1078
FIGURE 1. AECE architecture.
The characteristics of the datasets from the six essay prompts.
| Statistic | Prompt 1 | Prompt 2 | Prompt 3 | Prompt 4 | Prompt 5 | Prompt 6 |
| Number of essays | 3,000 | 3,000 | 3,000 | 2,000 | 2,000 | 2,000 |
| Mean number of characters | 433.29 | 426.89 | 414.62 | 158.06 | 128.88 | 154.20 |
| SD of number of characters | 153.67 | 150.12 | 137.56 | 47.24 | 47.94 | 51.36 |
| Range of the rubric | 0–6 | 0–6 | 0–6 | 0–4 | 0–4 | 0–4 |
| Mean of human score | 4.07 | 4.07 | 3.68 | 2.91 | 2.68 | 3.07 |
| SD of human score | 1.14 | 1.22 | 0.86 | 0.51 | 0.72 | 0.62 |
Description of the training and test sets.
| Set | Statistic | Prompt 1 | Prompt 2 | Prompt 3 | Prompt 4 | Prompt 5 | Prompt 6 |
| Training set | Number of essays | 2,100 | 2,100 | 2,100 | 1,400 | 1,400 | 1,400 |
| | Mean number of characters | 433.60 | 426.20 | 414.10 | 158.47 | 129.00 | 154.36 |
| | SD of number of characters | 153.38 | 150.22 | 139.20 | 47.03 | 47.35 | 51.11 |
| | Range of the rubric | 0–6 | 0–6 | 0–6 | 0–4 | 0–4 | 0–4 |
| | Mean of human score (H1) | 4.06 | 4.08 | 3.67 | 2.91 | 2.69 | 3.07 |
| | SD of human score (H1) | 1.15 | 1.22 | 0.86 | 0.50 | 0.70 | 0.60 |
| Test set | Number of essays | 900 | 900 | 900 | 600 | 600 | 600 |
| | Mean number of characters | 432.58 | 428.50 | 415.84 | 157.11 | 128.61 | 153.84 |
| | SD of number of characters | 154.42 | 149.97 | 133.74 | 47.75 | 49.34 | 51.97 |
| | Mean of human score (H1) | 4.08 | 4.05 | 3.70 | 2.90 | 2.65 | 3.08 |
| | SD of human score (H1) | 1.12 | 1.21 | 0.86 | 0.52 | 0.77 | 0.65 |
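The set sizes in the table above correspond to a 70/30 split of each prompt's essays (2,100/900 of 3,000 and 1,400/600 of 2,000). The following is a minimal sketch of such a split, assuming a simple seeded random shuffle; the page does not state how essays were assigned to the two sets, and the function name `split_train_test` and its parameters are hypothetical.

```python
import random


def split_train_test(essays, train_fraction=0.7, seed=0):
    """Shuffle the essays and split them into training and test sets.

    Assumption: a plain random split; the source does not describe
    the actual assignment procedure.
    """
    shuffled = list(essays)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Essay IDs stand in for the essays themselves.
train, test = split_train_test(range(3000))
print(len(train), len(test))
```

With 3,000 items the sketch yields 2,100 training and 900 test items, matching the counts reported for prompts 1 to 3.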
QWK and exact agreement between the human raters (H1/H2) and between H2 and the AECE holistic score; p values are computed for QWK.
| Comparison with H2 | Metric | Prompt 1 | Prompt 2 | Prompt 3 | Prompt 4 | Prompt 5 | Prompt 6 |
| H1 | QWK | 0.72 | 0.73 | 0.69 | 0.74 | 0.73 | 0.77 |
| | Exact agreement | 0.64 | 0.60 | 0.61 | 0.73 | 0.72 | 0.75 |
| Baseline | QWK | 0.77 | 0.78 | 0.70 | 0.78 | 0.76 | 0.79 |
| | Exact agreement | 0.67 | 0.66 | 0.63 | 0.78 | 0.74 | 0.79 |
| AECE | QWK | 0.86 | 0.81 | 0.79 | 0.88 | 0.87 | 0.89 |
| | Exact agreement | 0.71 | 0.68 | 0.61 | 0.84 | 0.81 | 0.88 |
| Baseline + AECE | QWK | 0.88 | 0.84 | 0.81 | 0.91 | 0.91 | 0.93 |
| | Exact agreement | 0.73 | 0.70 | 0.63 | 0.89 | 0.83 | 0.90 |
| AECE/baseline | p value | 0.00* | 0.00* | 0.00* | 0.00* | 0.00* | 0.00* |
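For readers unfamiliar with the two agreement metrics in the table above, here is a minimal sketch of quadratic weighted kappa (QWK) and exact agreement using their standard textbook definitions. It is not the authors' code; the function names and the toy scores are illustrative assumptions.

```python
def quadratic_weighted_kappa(a, b, min_score, max_score):
    """Standard QWK between two equal-length lists of integer scores."""
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and of equal length")
    k = max_score - min_score + 1  # number of score categories
    if k < 2:
        raise ValueError("the rubric needs at least two score points")
    n = len(a)

    # Observed confusion matrix: rows are rater a, columns are rater b.
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - min_score][y - min_score] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2  # quadratic penalty
            num += w * observed[i][j]
            den += w * hist_a[i] * hist_b[j] / n
    return 1.0 - num / den if den else 1.0


def exact_agreement(a, b):
    """Proportion of essays on which both raters give the same score."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Toy scores on a 0-4 rubric (made up for illustration only).
h2 = [3, 2, 4, 3, 1, 3]
machine = [3, 2, 3, 3, 2, 3]
print(quadratic_weighted_kappa(h2, machine, 0, 4))
print(exact_agreement(h2, machine))
```

QWK penalizes large disagreements more heavily than adjacent ones, which is why it can diverge from exact agreement on the same pair of score lists.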
QWK of feature sets on trait scoring (main idea, content, and development) on the six data sets.
| Trait | Feature set (compared with H2) | Prompt 1 | Prompt 2 | Prompt 3 | Prompt 4 | Prompt 5 | Prompt 6 |
| Main idea | H1 | 0.62 | 0.59 | 0.57 | 0.65 | 0.64 | 0.69 |
| | Baseline | 0.69 | 0.64 | 0.61 | 0.62 | 0.60 | 0.66 |
| | Main idea | 0.78* | 0.72* | 0.70* | 0.81* | 0.79* | 0.83* |
| | Baseline + AECE | 0.83* | 0.74* | 0.73* | 0.87* | 0.82* | 0.87* |
| Content | H1 | 0.64 | 0.61 | 0.60 | 0.66 | 0.67 | 0.70 |
| | Baseline | 0.60 | 0.55 | 0.61 | 0.69 | 0.66 | 0.72 |
| | Content | 0.72* | 0.69* | 0.65* | 0.75* | 0.73* | 0.78* |
| | Baseline + AECE | 0.76* | 0.74* | 0.69* | 0.81* | 0.76* | 0.82* |
| Development | H1 | 0.55 | 0.52 | 0.50 | 0.58 | 0.53 | 0.61 |
| | Baseline | 0.61 | 0.56 | 0.58 | 0.67 | 0.65 | 0.69 |
| | Development | 0.67* | 0.64* | 0.63* | 0.73* | 0.71* | 0.75* |
| | Baseline + AECE | 0.71* | 0.67* | 0.68* | 0.78* | 0.74* | 0.81* |
Descriptive statistics of the features.
| Prompt | Statistic | NN | ND | AD | SC | MPR | MI | WDAP | WDNP | SHSE |
| 1 | Mean | 30.15 | 123.59 | 12.23 | 1.68 | 4.07 | 0.61 | 9.18 | 7.93 | 0.68 |
| | STD | 4.73 | 15.54 | 3.17 | 0.05 | 0.19 | 0.10 | 1.02 | 1.38 | 0.07 |
| | Skewness | −0.89 | −0.34 | −0.78 | 0.04 | 0.19 | 0.02 | 0.23 | 0.16 | 0.11 |
| 2 | Mean | 28.72 | 107.29 | 11.49 | 1.59 | 4.01 | 0.59 | 9.07 | 7.89 | 0.61 |
| | STD | 4.65 | 13.64 | 2.83 | 0.06 | 0.13 | 0.07 | 1.05 | 1.24 | 0.08 |
| | Skewness | −0.57 | −0.30 | −0.53 | 0.03 | 0.12 | 0.03 | 0.12 | 0.17 | 0.13 |
| 3 | Mean | 24.61 | 103.73 | 11.02 | 1.65 | 3.94 | 0.60 | 9.01 | 7.86 | 0.60 |
| | STD | 4.31 | 14.21 | 2.91 | 0.03 | 0.21 | 0.08 | 1.04 | 1.21 | 0.05 |
| | Skewness | −0.74 | −0.19 | −0.77 | 0.02 | 0.13 | 0.04 | 0.19 | 0.23 | 0.09 |
| 4 | Mean | 16.54 | 61.87 | 11.98 | 1.21 | 3.15 | 0.58 | 8.78 | 7.47 | 0.71 |
| | STD | 4.64 | 5.43 | 1.08 | 0.07 | 0.11 | 0.09 | 0.96 | 1.22 | 0.04 |
| | Skewness | −0.33 | −0.13 | −0.72 | 0.07 | 0.09 | 0.03 | 0.17 | 0.14 | 0.08 |
| 5 | Mean | 13.87 | 56.26 | 10.81 | 1.15 | 2.97 | 0.51 | 8.69 | 7.30 | 0.69 |
| | STD | 3.19 | 4.03 | 0.07 | 0.03 | 0.08 | 0.05 | 0.68 | 1.11 | 0.03 |
| | Skewness | −0.25 | −0.17 | −0.63 | 0.02 | 0.06 | 0.01 | 0.10 | 0.14 | 0.05 |
| 6 | Mean | 16.32 | 59.53 | 11.43 | 1.36 | 3.03 | 0.55 | 8.72 | 7.35 | 0.70 |
| | STD | 4.07 | 5.36 | 0.09 | 0.08 | 0.10 | 0.06 | 0.92 | 1.19 | 0.06 |
| | Skewness | −0.35 | −0.25 | −0.56 | 0.04 | 0.07 | 0.02 | 0.13 | 0.16 | 0.03 |
Average correlations (across all prompts) of feature values with H2.
| Feature | Holistic | Main idea | Content | Development |
| NN | 0.49 | 0.40 | 0.43 | 0.38 |
| ND | 0.43 | 0.27 | 0.35 | 0.27 |
| AD | 0.31 | 0.25 | 0.24 | 0.21 |
| SC | 0.23 | 0.14 | 0.20 | 0.18 |
| MPR | 0.68 | 0.78 | 0.52 | 0.42 |
| MI | 0.31 | 0.54 | 0.46 | 0.24 |
| WDAP | 0.39 | 0.23 | 0.25 | 0.63 |
| WDNP | 0.38 | 0.20 | 0.24 | 0.51 |
| SHSE | 0.55 | 0.48 | 0.64 | 0.47 |
FIGURE 2. High-level main idea.
FIGURE 3. Low-level main idea.
FIGURE 4. High-level content.
FIGURE 5. Low-level content.
FIGURE 6. High-level development.
FIGURE 7. Low-level development.
The relative importance of features (expressed as percent of total weights) from regression for the prediction of H1.
| Features | Holistic (Average) | Holistic (Coef. Var.) | Main idea (Average) | Main idea (Coef. Var.) | Content (Average) | Content (Coef. Var.) | Development (Average) | Development (Coef. Var.) |
| NN | 13 | 64 | 6 | 47 | 14 | 54 | 10 | 39 |
| ND | 9 | 39 | 5 | 67 | 8 | 34 | 8 | 54 |
| AD | 8 | 21 | 5 | 44 | 6 | 20 | 9 | 42 |
| SC | 7 | 26 | 5 | 29 | 4 | 32 | 5 | 31 |
| MPR | 26 | 20 | 37 | 23 | 21 | 36 | 13 | 25 |
| MI | 8 | 23 | 15 | 32 | 6 | 61 | 5 | 39 |
| WDAP | 7 | 31 | 8 | 25 | 6 | 25 | 18 | 27 |
| WDNP | 6 | 28 | 7 | 22 | 4 | 47 | 17 | 23 |
| SHSE | 16 | 47 | 12 | 36 | 31 | 24 | 15 | 35 |
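The table above reports each feature's weight as a percent of total weights, together with a coefficient of variation. The sketch below shows one plausible way to compute both quantities. It rests on assumptions that the page does not confirm: that the weights are absolute standardized regression coefficients, that "Average" and "Coef. Var." are taken across the six prompts, and that the sample standard deviation is used. All names and the example coefficients are hypothetical.

```python
from statistics import mean, stdev


def relative_importance(weights):
    """Express each feature weight as a percent of the total absolute weight.

    Assumption: `weights` maps feature names to standardized regression
    coefficients; signs are dropped before normalizing.
    """
    total = sum(abs(w) for w in weights.values())
    return {name: 100.0 * abs(w) / total for name, w in weights.items()}


def coefficient_of_variation(values):
    """Sample standard deviation as a percent of the mean."""
    return 100.0 * stdev(values) / mean(values)


# Made-up coefficients for two prompts, for illustration only.
per_prompt = [
    relative_importance({"MPR": 0.50, "SHSE": 0.30, "NN": 0.20}),
    relative_importance({"MPR": 0.40, "SHSE": 0.40, "NN": 0.20}),
]
mpr_shares = [p["MPR"] for p in per_prompt]
print(mean(mpr_shares), coefficient_of_variation(mpr_shares))
```

A low coefficient of variation under this reading means that a feature's share of the total weight is stable from prompt to prompt.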
SMD of AECE for different genders and economic development areas.
| Prompt | Group | Holistic (Baseline) | Holistic (AECE) | Main idea (Baseline) | Main idea (AECE) | Content (Baseline) | Content (AECE) | Development (Baseline) | Development (AECE) |
| 1 | Male | −0.13 | −0.05 | −0.09 | −0.04 | 0.19 | 0.12 | −0.27 | −0.18 |
| | Female | −0.11 | −0.04 | −0.14 | −0.09 | 0.24 | 0.11 | −0.19 | −0.07 |
| 2 | Male | −0.16 | −0.11 | −0.18 | −0.07 | 0.23 | 0.15 | −0.31 | −0.19 |
| | Female | −0.15 | −0.08 | −0.14 | −0.10 | 0.19 | 0.13 | −0.25 | −0.09 |
| 3 | Male | −0.14 | −0.08 | −0.12 | −0.05 | 0.25 | 0.14 | −0.28 | −0.22 |
| | Female | −0.12 | −0.07 | −0.15 | −0.08 | 0.29 | 0.13 | −0.20 | −0.20 |
| 4 | Male | −0.03 | −0.01 | −0.07 | −0.03 | 0.08 | 0.04 | 0.21 | 0.10 |
| | Female | −0.02 | −0.01 | −0.05 | −0.01 | 0.09 | 0.02 | 0.13 | 0.09 |
| 5 | Male | −0.06 | −0.04 | −0.09 | −0.05 | 0.14 | 0.06 | 0.25 | 0.15 |
| | Female | −0.05 | −0.03 | −0.07 | −0.04 | 0.13 | 0.05 | 0.19 | 0.14 |
| 6 | Male | −0.04 | −0.02 | −0.08 | −0.03 | 0.09 | 0.04 | 0.15 | 0.12 |
| | Female | −0.02 | −0.02 | −0.07 | −0.03 | 0.07 | 0.04 | 0.14 | 0.11 |
| 1 | Developed | −0.15 | −0.09 | −0.17 | −0.09 | 0.19 | 0.13 | −0.23 | −0.14 |
| | Developing | −0.14 | −0.08 | −0.14 | −0.05 | 0.14 | 0.09 | −0.17 | −0.13 |
| | Underdeveloped | −0.17 | −0.10 | −0.21 | −0.08 | 0.15 | 0.17 | −0.19 | −0.15 |
| 2 | Developed | −0.17 | −0.12 | −0.19 | −0.12 | 0.23 | 0.14 | −0.25 | −0.17 |
| | Developing | −0.16 | −0.09 | −0.16 | −0.10 | 0.19 | 0.10 | −0.20 | −0.15 |
| | Underdeveloped | −0.19 | −0.13 | −0.23 | −0.14 | 0.27 | 0.19 | −0.27 | −0.18 |
| 3 | Developed | −0.16 | −0.10 | −0.20 | −0.13 | 0.20 | 0.14 | −0.26 | −0.16 |
| | Developing | −0.14 | −0.07 | −0.17 | −0.09 | 0.16 | 0.10 | −0.21 | −0.14 |
| | Underdeveloped | −0.18 | −0.12 | −0.21 | −0.16 | 0.18 | 0.13 | −0.28 | −0.19 |
| 4 | Developed | −0.09 | −0.02 | −0.13 | −0.04 | 0.13 | 0.04 | −0.12 | −0.09 |
| | Developing | −0.04 | −0.01 | −0.11 | −0.03 | 0.14 | 0.02 | −0.13 | −0.07 |
| | Underdeveloped | −0.11 | −0.04 | −0.16 | −0.09 | 0.17 | 0.06 | −0.21 | −0.13 |
| 5 | Developed | −0.12 | −0.07 | −0.17 | −0.05 | 0.18 | 0.07 | −0.17 | −0.12 |
| | Developing | −0.10 | −0.03 | −0.15 | −0.02 | 0.14 | 0.05 | −0.15 | −0.10 |
| | Underdeveloped | −0.15 | −0.09 | −0.19 | −0.09 | 0.19 | 0.09 | −0.29 | −0.15 |
| 6 | Developed | −0.11 | −0.03 | −0.14 | −0.05 | 0.15 | 0.05 | −0.16 | −0.10 |
| | Developing | −0.08 | −0.02 | −0.10 | −0.02 | 0.12 | 0.02 | −0.14 | −0.08 |
| | Underdeveloped | −0.13 | −0.05 | −0.15 | −0.06 | 0.16 | 0.08 | −0.18 | −0.13 |
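The page does not define the standardized mean difference (SMD) used in the table above. A common definition in automated scoring evaluations, assumed here, divides the difference between mean machine and mean human scores within a subgroup by the pooled standard deviation of the two score sets. The sketch below follows that assumption; the function name and the toy scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def standardized_mean_difference(machine_scores, human_scores):
    """SMD = (mean machine - mean human) / pooled SD.

    Assumption: the pooled SD is the square root of the average of the
    two sample variances; the source does not give its exact formula.
    """
    pooled_sd = sqrt((stdev(machine_scores) ** 2 + stdev(human_scores) ** 2) / 2)
    return (mean(machine_scores) - mean(human_scores)) / pooled_sd


# Toy subgroup scores (made up for illustration only).
machine = [3, 4, 3, 2, 4, 3]
human = [3, 4, 4, 2, 4, 3]
print(standardized_mean_difference(machine, human))
```

Under this definition, a negative value means the machine scores a subgroup lower on average than the human rater does, and values closer to zero indicate less systematic difference for that subgroup.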