Yuni Susanti, Takenobu Tokunaga, Hitoshi Nishikawa, Hiroyuki Obari
Abstract
This paper describes the evaluation experiments for questions created by an automatic question generation system. Given a target word and one of its word senses, the system generates a multiple-choice English vocabulary question that asks for the option closest in meaning to the target word as it is used in the reading passage. Two kinds of evaluation were conducted, considering two aspects: (1) how well the questions measure English learners' proficiency and (2) their similarity to human-made questions. The first evaluation is based on the responses of English learners to whom both the machine-generated and the human-made questions were administered; the second is based on subjective judgements by English teachers. Both evaluations showed that the machine-generated questions reached a level comparable to the human-made questions, both in measuring English proficiency and in similarity.
Keywords: Automatic question generation; English vocabulary question; Evaluation of question items; Language learning; Multiple-choice question; Neural test theory
Year: 2017 PMID: 30613260 PMCID: PMC6302865 DOI: 10.1186/s41039-017-0051-y
Source DB: PubMed Journal: Res Pract Technol Enhanc Learn ISSN: 1793-2068
Fig. 1 Four components in a vocabulary question asking for the closest in meaning of a word
Fig. 2 Architecture of the automatic question generation system
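To make the four components of Fig. 1 concrete, here is a minimal sketch of how such a question could be represented and assembled. The data class, the toy sense inventory, and all names (`VocabQuestion`, `generate_question`, the example entry for "bright") are illustrative assumptions, not the paper's actual implementation, which selects answers and distractors automatically from lexical resources.

```python
from dataclasses import dataclass, field

@dataclass
class VocabQuestion:
    """The four components of a closest-in-meaning vocabulary question (cf. Fig. 1)."""
    reading_passage: str        # passage showing the target word in context
    target_word: str            # the word being tested
    correct_answer: str         # option closest in meaning to the intended sense
    distractors: list = field(default_factory=list)  # plausible wrong options

    def options(self):
        """All answer options (correct answer first; shuffling omitted here)."""
        return [self.correct_answer] + self.distractors

# Toy sense inventory: (word, sense) -> a synonym and same-POS distractors.
# A real system would derive these from a sense-tagged lexical resource.
SENSE_INVENTORY = {
    ("bright", "intelligent"): {
        "synonym": "clever",
        "distractors": ["shiny", "cheerful", "loud"],
    },
}

def generate_question(passage, target_word, sense):
    """Assemble a multiple-choice question from the toy inventory."""
    entry = SENSE_INVENTORY[(target_word, sense)]
    return VocabQuestion(passage, target_word,
                         entry["synonym"], list(entry["distractors"]))
```

For example, `generate_question("She is a bright student who learns quickly.", "bright", "intelligent")` yields a question whose four options include the synonym "clever" for the intended sense rather than the literal sense "shiny".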
Configuration of evaluation sets (Exp. 1)
| Eval. set | Contents (HQs) | Contents (MQs) | Test taker |
|---|---|---|---|
| A1 | TW#01–13 | TW#14–25 | C |
| B1 | TW#14–25 | TW#01–13 | C |
| A2 | TW#26–37 | TW#38–50 | C |
| B2 | TW#38–50 | TW#26–37 | C |
Pearson correlation coefficients between test scores
| Commercial tests | MQs | HQs | n |
|---|---|---|---|
| TOEFL | 0.71 | 0.60 | 21 |
| TOEIC | 0.68 | 0.60 | 21 |
| CASEC (total) | 0.57 | 0.59 | 73 |
| CASEC (vocabulary) | 0.55 | 0.68 | 73 |
All p values are less than 0.05
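The correlations in the table above can be reproduced with the standard Pearson product-moment formula. A self-contained sketch (the paper does not give code; this is simply the textbook definition):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two paired lists of test scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Applied to paired score lists (e.g. each student's MQ score against their TOEFL score), values near 1 indicate that the generated questions rank students similarly to the commercial test.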
Fig. 3 Difficulty index distribution
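The difficulty index behind Fig. 3 is, in classical test theory, simply the proportion of test takers who answer an item correctly; the paper's exact computation is assumed to match this standard definition:

```python
def difficulty_index(responses):
    """Proportion of test takers who answered the item correctly.

    `responses` is a list of 0/1 flags, one per test taker.
    Higher values mean an easier item.
    """
    return sum(responses) / len(responses)
```

For instance, an item answered correctly by half the test takers has a difficulty index of 0.5.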
Latent rank estimation for MQs and HQs (cells show the no. of students in each rank)
| Eval. set | Low | Medium | High | Total |
|---|---|---|---|---|
| MQs(A) | 12 | 15 | 13 | 40 |
| MQs(B) | 13 | 12 | 14 | 39 |
| HQs(A) | 12 | 12 | 16 | 40 |
| HQs(B) | 12 | 13 | 14 | 39 |
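The latent ranks in the table above come from neural test theory, which fits a self-organizing map to the students' response patterns. As a deliberately crude stand-in that only illustrates the three-rank output format, one could split test takers into score tertiles; this is an assumption-laden simplification, not the paper's method:

```python
def tertile_ranks(scores):
    """Assign 'low'/'medium'/'high' labels by score tertiles.

    A rough illustration only: neural test theory estimates latent ranks
    from full response patterns, not from raw score cut-offs.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(scores)
    ranks = [None] * n
    for pos, i in enumerate(order):
        if pos < n / 3:
            ranks[i] = "low"
        elif pos < 2 * n / 3:
            ranks[i] = "medium"
        else:
            ranks[i] = "high"
    return ranks
```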
Fig. 4 ICRP categories
… CD2 and CD1 options. Based on the number of correctly represented probability relations between ranks, we can say that the CU2 and CD2 options are better than the CU1 and CD1 options as correct answers for measuring test-taker proficiency.

Distribution of correct answers across ICRP categories
| Eval. set | MI | CU2 | CD2 | CU1 | CD1 | MD | Total |
|---|---|---|---|---|---|---|---|
| MQs(A) | 13 | 2 | 4 | 1 | 2 | 3 | 25 |
| MQs(B) | 17 | 1 | 1 | 2 | 1 | 3 | 25 |
| HQs(A) | 19 | 3 | 1 | 0 | 0 | 2 | 25 |
| HQs(B) | 11 | 6 | 2 | 2 | 2 | 2 | 25 |
Fig. 5 Test reference profile
Distribution of distractors across ICRP categories
| Eval. set | MI | CU2 | CD2 | CU1 | CD1 | MD | Total |
|---|---|---|---|---|---|---|---|
| MQs(A) | 9 | 9 | 5 | 12 | 7 | 33 | 75 |
| MQs(B) | 14 | 3 | 2 | 10 | 4 | 40 | 73 |
| HQs(A) | 13 | 4 | 5 | 8 | 6 | 39 | 75 |
| HQs(B) | 15 | 2 | 5 | 6 | 7 | 34 | 69 |
Distribution of “bad” option types
| Multiple correct answers (MCA) | Unfamiliar word sense (UWS) | Collocationally odd word (COW) | More reasonable word (MRW) | Other |
|---|---|---|---|---|
| 3 | 4 | 2 | 6 | 6 |
Configuration of evaluation sets (Exp. 2)
| Eval. set | HQ | MQ |
|---|---|---|
| Set 1 | 4 | 6 |
| Set 2 | 4 | 6 |
| Set 3 | 6 | 4 |
| Set 4 | 7 | 3 |
| Set 5 | 4 | 6 |
Fig. 6 A questionnaire for each question item
Fig. 7 Distinguishing MQ and HQ
Rationale behind MQ-HQ judgement of MQs
| Component | Human-made | Machine-generated | Total |
|---|---|---|---|
| Reading passage | 82 | 53 | 135 |
| Correct answer | 76 | 39 | 115 |
| Distractor | 44 | 43 | 87 |
Fig. 8 Usability in a real test
Distribution of general comments from the human experts
| Type | Positive | Negative | Positive+negative | Neutral | Total |
|---|---|---|---|---|---|
| HQs | 27 | 17 | 13 | 18 | 75 |
| MQs | 14 | 45 | 11 | 15 | 85 |
Distribution of the ICRP categories for correct answers in good- and bad-rated items
| Question items | MI | CU2 | CD2 | CU1 | CD1 | MD | Total |
|---|---|---|---|---|---|---|---|
| Good-rated | 13 | 0 | 2 | 0 | 1 | 0 | 16 |
| Bad-rated | 4 | 0 | 1 | 0 | 1 | 3 | 9 |
Distribution of the ICRP categories for distractors in good- and bad-rated items
| Question items | MD | CU1 | CD1 | CU2 | CD2 | MI | Total |
|---|---|---|---|---|---|---|---|
| Good-rated | 28 | 8 | 3 | 1 | 1 | 6 | 47 |
| Bad-rated | 12 | 2 | 1 | 2 | 1 | 8 | 26 |