Filipe Falcão, Patrício Costa, José M Pêgo.
Abstract
BACKGROUND: Current demand for multiple-choice questions (MCQs) in medical assessment is greater than the supply. Consequently, new item development methods are urgently needed. Automatic Item Generation (AIG) promises to overcome this burden by generating calibrated items with computer algorithms. Despite this promise, there is still no evidence to encourage a general application of AIG in medical assessment. It is therefore important to evaluate AIG regarding its feasibility, validity and item quality.
Keywords: Assessment; Automatic item generation; Computer-based testing; Medical Assessment; Multiple-choice questions
Year: 2022 PMID: 35230589 PMCID: PMC8886703 DOI: 10.1007/s10459-022-10092-z
Source DB: PubMed Journal: Adv Health Sci Educ Theory Pract ISSN: 1382-4996 Impact factor: 3.629
Fig. 1. AIG three-step process for generating medical MCQs
Fig. 2. Flow chart of the included studies
Quality assessment of included studies
Quality assessment topics: i) item modelling: definition and related concepts; ii) developing item models; iii) item model taxonomy; iv) using item models to automatically generate items; v) benefits of item modelling; vi) item model bank; vii) estimation of the statistical characteristics of generated items; viii) description of the three-step process for conducting AIG; ix) assessment of AIG’s capacity to generate new items; x) quality assessment of generated items, cognitive model and/or item model; xi) comparison of AIG with traditional methods of item development; xii) limitations of AIG.

| Reference | i (FF) | i (PC) | ii (FF) | ii (PC) | iii (FF) | iii (PC) | iv (FF) | iv (PC) | v (FF) | v (PC) | vi (FF) | vi (PC) | vii (FF) | vii (PC) | viii (FF) | viii (PC) | ix (FF) | ix (PC) | x (FF) | x (PC) | xi (FF) | xi (PC) | xii (FF) | xii (PC) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (Gierl et al., | 2 | 2 | 2 | 2 | 0 | 0 | 2 | 2 | 2 | 2 | 0 | 1 | 0 | 0 | 2 | 2 | 2 | 2 | 0 | 1 | 1 | 0 | 2 | 2 |
| (Gierl & Lai, | 2 | 2 | 2 | 2 | 0 | 0 | 2 | 2 | 1 | 1 | 2 | 2 | 0 | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| (Gierl & Lai, | 2 | 2 | 2 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 1 | 0 | 1 | 0 | 0 |
| (Gierl et al., | 2 | 2 | 2 | 2 | 0 | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 2 |
| (Gierl & Lai, | 2 | 2 | 2 | 2 | 0 | 0 | 2 | 2 | 2 | 2 | 2 | 1 | 2 | 1 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 1 | 2 | 2 |
| (Gierl & Lai, | 2 | 2 | 2 | 2 | 0 | 0 | 2 | 2 | 1 | 1 | 2 | 2 | 0 | 0 | 2 | 2 | 1 | 1 | 2 | 2 | 0 | 0 | 1 | 1 |
| (Lai et al., | 2 | 2 | 2 | 2 | 0 | 0 | 2 | 2 | 2 | 1 | 0 | 0 | 2 | 2 | 2 | 2 | 2 | 1 | 2 | 2 | 0 | 0 | 2 | 1 |
| (Pugh et al., | 2 | 2 | 2 | 2 | 0 | 0 | 2 | 2 | 2 | 2 | 0 | 0 | 0 | 0 | 2 | 2 | 1 | 1 | 2 | 1 | 2 | 0 | 0 | 0 |
| (Pugh et al., | 2 | 2 | 2 | 2 | 0 | 0 | 2 | 2 | 2 | 2 | 0 | 0 | 0 | 0 | 2 | 2 | 1 | 2 | 2 | 2 | 0 | 0 | 2 | 2 |
| (Shappell et al., | 2 | 2 | 2 | 2 | 0 | 0 | 1 | 1 | 2 | 2 | 2 | 2 | 0 | 0 | 2 | 2 | 2 | 2 | 0 | 0 | 0 | 0 | 1 | 1 |
Note. 0: ‘topic not covered in the study’; 1: ‘topic coverage was unclear’; 2: ‘topic was covered in the study’
Data synthesis
| Reference | Country | Purpose | Method | Results |
|---|---|---|---|---|
| (Gierl et al., | Canada | Present a methodology to generate MCQs. | AIG was used to generate MCQs. | In 6 h, 1248 items were generated from one item model: Stage 1 (3 h); Stage 2 (2 h), and Stage 3 (1 h). |
| (Gierl & Lai, | Canada | Determine whether AIG generates high-quality items. | Items generated by AIG and items developed using traditional item development methods were blindly rated for quality by experts. Independent-samples Student t-tests were conducted to assess differences in quality between the two sets of items. Subsequently, experts classified each item as generated by AIG or as developed using traditional methods. | Specialists developed 25 items using traditional item development methods. The same specialists then created 9496 items using AIG. A second group of specialists developed 25 items using traditional item development methods. One t-test produced a statistically significant result ( |
| (Gierl & Lai, | Canada | Describe a method for generating test items. | AIG was conducted to generate MCQs using two types of item models: (i) 1-layer item model and (ii) n-layer item model. | 256 items were generated with the 1-layer item model; 16,384 items were generated with the n-layer item model. |
| (Gierl et al., | Canada | Assess the psychometric characteristics of items generated by AIG. | Items generated by AIG were distributed within nine tests. Students responded to the items across different forms. Item analysis was conducted using CTT. | 465 items were generated using AIG. For the correct options, the items measured examinees’ performance across a broad range of ability levels and provided strong levels of discrimination. For the incorrect options, items consistently differentiated the low from the high performing examinees. |
| (Gierl & Lai, | Canada | Assess the quality of items generated by AIG. | Authors describe a method to evaluate the quality of items generated by AIG. | If the instructions for item generation in the models are adequate, the generated items will be appropriate for testing. |
| (Gierl & Lai, | Canada | Describe a method for generating items using AIG and rationales required for formative testing. | AIG was used to generate MCQ and the corresponding rationale for each item. | 48 items were generated using the content from the cognitive model. Rationale generation added extra time to the AIG process. Rationales satisfied the required characteristics of feedback. |
| (Lai et al., | Canada | Describe and validate a method of generating distractors using AIG ( | Systematic distractor generation was integrated with AIG’s 3-step process. 13 items were selected for field test. Generated items were distributed across examination forms. 455 medical students responded to the items. Item analysis was conducted following indices from CTT. | Results for the correct option: items measured a wide range of difficulty from the same model and presented consistent levels of discrimination. Results for the incorrect options: generated distractors were effective alternatives as they contained information that consistently appealed to lower performing candidates. |
| (Pugh et al., | Canada | Provide a framework for the development of quality MCQs. | Authors detail a framework for the development of high-quality MCQs using cognitive models. | The approach allowed the efficient generation of MCQs. Authors found that even a group of novices could apply the method to create a complete cognitive model within about 2 h, resulting in 5–10 new items. |
| (Pugh et al., | Canada | Compare the quality of items developed using AIG and the quality of items developed with traditional methods. | Items developed using AIG and traditional methods were blindly reviewed by content experts. A Wilcoxon two-sample test was employed for each quality metric rating scale as well as for the overall cognitive domain judgment scale. | AIG generated between 80 and 100 items. The entire process required 90–120 min. 51 items created with traditional methods and 51 items generated using AIG were evaluated for quality; AIG items were not perceived as differing from traditionally developed items. |
| (Shappell et al., | USA | Investigate an approach to item generation for mastery learning tests. | 47 residents of an emergency medicine program took a mastery learning test. 20 item models were created and reviewed by educators. Two versions of the test were created. Consistency was evaluated using the test-retest k statistic and decision-consistency classification indices. | 912 MCQs were developed using AIG. Unique iterations per item model ranged from 24 to 128, offering millions of unique 20-question tests. No significant differences in mean learner performance, mean item difficulty and item discriminations across the tests were found. |
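
The generation counts reported in this table follow from combining the values of an item model's variable elements. The sketch below is a minimal illustration, assuming a hypothetical 1-layer item model with four variable elements of four values each. The stem, element names and clinical values are invented placeholders and are not taken from any included study. With these assumed sizes the template expands into 4 × 4 × 4 × 4 = 256 items; the actual structure of the models used in the studies is not described here.

```python
from itertools import product

# Hypothetical item model: a stem template plus the values each element may take.
# All content below is placeholder text for illustration only.
STEM = (
    "A {age}-year-old patient presents with {symptom} after {event}. "
    "Examination shows {finding}. What is the most likely diagnosis?"
)

ELEMENTS = {
    "age": [25, 40, 60, 75],
    "symptom": ["fever", "pain", "swelling", "redness"],
    "event": ["surgery", "a fall", "travel", "an insect bite"],
    "finding": ["tenderness", "warmth", "discharge", "induration"],
}


def generate_items(stem, elements):
    """Return one stem per combination of element values (no constraints applied)."""
    names = list(elements)
    return [
        stem.format(**dict(zip(names, combo)))
        for combo in product(*(elements[name] for name in names))
    ]


items = generate_items(STEM, ELEMENTS)
print(len(items))  # 4 * 4 * 4 * 4 = 256
```

In practice, AIG also applies constraints derived from the cognitive model to discard implausible combinations, and generates the answer options alongside the stem; this sketch omits both steps.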
AIG validity assessment
| Inferences | (Gierl et al., | (Gierl & Lai, | (Gierl & Lai, | (Gierl et al., | (Gierl & Lai, | (Gierl & Lai, | (Lai et al., | (Pugh et al., | (Pugh et al., | (Shappell et al., |
|---|---|---|---|---|---|---|---|---|---|---|
| | Generate MCQs for medical licensure testing. | Generate MCQs for medical licensure testing. | Generate MCQs for medical licensure testing. | Generate MCQs for medical licensure testing. | Generate MCQs for medical assessment. | Generate MCQs and rationales for medical formative testing. | Generate MCQs and distractors for medical licensure testing. | Generate MCQs for medical assessment. | Generate MCQs for medical assessment. | Generate MCQs for medical mastery learning assessment. |
| | Cognitive and item models were developed and reviewed by specialists. | Items were blindly evaluated for quality by a panel of experts. | Cognitive and item models were developed and reviewed by specialists. | Cognitive and item models were developed and reviewed by specialists. | Experts evaluated the content and the logic specified in the cognitive model and in the item model. | Experts blindly reviewed the rationales generated for formative testing. | Cognitive and item models were developed and reviewed by specialists. | Cognitive and item models were developed and reviewed by specialists. | Quality of items generated was evaluated by experts. | Item models were developed and reviewed by specialists. |
| | UN | UN | UN | Item response theory was used, but not reported. CTT was used. Generated items measured a broad range of difficulty levels. | UN | UN | CTT was used. Generated items measured a broad range of difficulty levels. | UN | UN | No significant differences in item difficulty between tests were found. |
| | UN | UN | UN | Consistent levels of item discrimination. | UN | UN | Consistent levels of item discrimination. | UN | UN | No significant differences in mean item discrimination between tests were found. |
| | UN | UN | UN | UN | UN | UN | UN | UN | UN | UN |
*UN - Unclear / Unreported
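
The difficulty and discrimination findings reported above come from classical test theory (CTT) item analysis. As a rough sketch, assuming dichotomously scored items and using the corrected point-biserial correlation (item score against the total of the remaining items) as the discrimination index, the two statistics can be computed as shown below. The included studies may have used other indices, and the response matrix is made-up illustrative data, not data from any study.

```python
from statistics import mean, pstdev


def item_difficulty(item_scores):
    """CTT difficulty: proportion of examinees who answered the item correctly."""
    return mean(item_scores)


def point_biserial(item_scores, criterion_scores):
    """Pearson correlation between a 0/1 item score and a criterion score."""
    mi, mc = mean(item_scores), mean(criterion_scores)
    si, sc = pstdev(item_scores), pstdev(criterion_scores)
    if si == 0 or sc == 0:
        return float("nan")  # undefined when either variable has no variance
    cov = mean((i - mi) * (c - mc) for i, c in zip(item_scores, criterion_scores))
    return cov / (si * sc)


def rest_scores(responses, item):
    """Total score on all other items, so the item is not correlated with itself."""
    return [sum(row) - row[item] for row in responses]


# Made-up data: rows are examinees, columns are items (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

item = 0
scores = [row[item] for row in responses]
difficulty = item_difficulty(scores)
discrimination = point_biserial(scores, rest_scores(responses, item))
print(f"difficulty={difficulty:.2f}, discrimination={discrimination:.2f}")
```

A positive discrimination value means that examinees who did well on the rest of the test were more likely to answer the item correctly. Applying the same correlation to a distractor (coded 1 if chosen) should give a negative value when the distractor mainly attracts lower-performing examinees.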