Applying Computerized-Scoring Models of Written Biological Explanations across Courses and Colleges: Prospects and Limitations
Minsu Ha, Ross H. Nehm, Mark Urban-Lurain, John E. Merrill
Abstract
Our study explored the prospects and limitations of using machine-learning software to score introductory biology students' written explanations of evolutionary change. We investigated three research questions: 1) Do scoring models built using student responses at one university function effectively at another university? 2) How many human-scored student responses are needed to build scoring models suitable for cross-institutional application? 3) What factors limit computer-scoring efficacy, and how can these factors be mitigated? To answer these questions, two biology experts scored a corpus of 2556 short-answer explanations (from biology majors and nonmajors) at two universities for the presence or absence of five key concepts of evolution. Human- and computer-generated scores were compared using kappa agreement statistics. We found that machine-learning software was capable in most cases of accurately evaluating the degree of scientific sophistication in undergraduate majors' and nonmajors' written explanations of evolutionary change. In cases in which the software did not perform at the benchmark of "near-perfect" agreement (kappa > 0.80), we located the causes of poor performance and identified a series of strategies for their mitigation. Machine-learning software holds promise as an assessment tool for use in undergraduate biology education, but like most assessment tools, it is also characterized by limitations.
Year: 2011 PMID: 22135372 PMCID: PMC3228656 DOI: 10.1187/cbe.11-08-0081
Source DB: PubMed Journal: CBE Life Sci Educ ISSN: 1931-7913 Impact factor: 3.325
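The cross-institutional question in the abstract (train a scoring model on one university's responses, apply it at another) can be illustrated with a generic text-classification sketch. This is not the authors' SIDE implementation; all texts, labels, and the bag-of-words classifier below are hypothetical stand-ins.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.pipeline import make_pipeline

# Hypothetical human-scored responses for ONE key concept (1 = present, 0 = absent).
# The study's real models were trained on 500-1056 human-scored responses per sample.
train_texts = [
    "the shrews with longer incisors survived and reproduced more",
    "a random mutation produced variation in incisor length",
    "the shrews needed longer teeth so they grew them",
    "individuals vary and the fittest variants leave more offspring",
]
train_labels = [1, 1, 0, 1]

test_texts = [
    "variation in fin size already existed in the population",
    "the fish wanted smaller fins to swim faster",
]
test_labels = [1, 0]

# Bag-of-words text classifier standing in for a SIDE scoring model:
# train on one institution's sample, score the other's.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

predicted = model.predict(test_texts)
print("kappa:", cohen_kappa_score(test_labels, predicted))
```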
Sample information
| Institutiona | Major | Participants (n)b | Ethnicity: White (%) | Ethnicity: Minority (%) | Ethnicity: None mentioned (%) | Gender: Male (%) | Gender: Female (%) | Age (yr) |
|---|---|---|---|---|---|---|---|---|
| OSU | Nonmajor | 264 | 79.1 | 14.4 | 6.5 | 42 | 58 | 20.1 |
| MSU | Nonmajor | 146 | 66.4 | 13.7 | 19.9 | 40 | 60 | 19.4 |
| MSU | Major | 440 | 79.1 | 11.8 | 9.1 | 42 | 58 | 19.6 |
aOSU = Ohio State University; MSU = Michigan State University.
bNote that n refers to subsampled data sets (see Sample).
Selected examples of students’ written explanations of evolutionary change and corresponding human and computer scores
| Taxon/trait/polarity | Student's explanation of evolutionary change | Human score (number of key concepts) | Computer score (number of key concepts) |
|---|---|---|---|
| Shrew incisors | “Incisors may have developed on shrews due to […]” | 4 | 4 |
| Snail feet | “They would explain that once all the snails had small feet. Then one day there was a […]” | 3 | 3 |
| Fish fins | “There was […]” | 2 | 2 |
| Fly wings | “The evolution of a fly species with a large wing from an ancestral fly with small wings could be through the process of natural selection or from […]” | 1 | 1 |
Figure 1. Magnitudes of agreement between human-scored and computer-scored explanations of evolutionary change from three samples (OSU nonmajor, MSU nonmajor, and MSU major; n = 500 responses each). Five key concepts of evolutionary change were examined separately (e.g., variation, heredity). Arrows indicate which sample was used to train the models and which sample was used to test them. Kappa values compensate for chance agreement, whereas agreement values are raw percentages. (A) OSU training and MSU nonmajor cross-validation. (B) MSU nonmajor training and OSU cross-validation. (C) OSU training and MSU major cross-validation. (D) MSU major training and OSU cross-validation. (E) MSU major training and MSU nonmajor cross-validation. (F) MSU nonmajor training and MSU major cross-validation.
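As the caption notes, kappa corrects raw percent agreement for chance. A minimal sketch of Cohen's kappa for binary presence/absence scores (the data here are hypothetical, not drawn from the study):

```python
def cohen_kappa(human, computer):
    """Cohen's kappa for two binary raters: (p_o - p_e) / (1 - p_e)."""
    n = len(human)
    p_o = sum(h == c for h, c in zip(human, computer)) / n  # observed agreement
    p_h1 = sum(human) / n      # rater 1 marginal P(score = 1)
    p_c1 = sum(computer) / n   # rater 2 marginal P(score = 1)
    p_e = p_h1 * p_c1 + (1 - p_h1) * (1 - p_c1)  # expected chance agreement
    return (p_o - p_e) / (1 - p_e)

human    = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
computer = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
# Raw agreement is 80%, but kappa is ~0.58 after chance correction,
# below the paper's "near-perfect" benchmark of kappa > 0.80.
print(round(cohen_kappa(human, computer), 2))
```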
Figure 2. Frequencies (0–100%) of key concepts among samples and between human- and computer-generated scores. Blue bars = human-detected frequencies; red bars = frequencies detected using the MSU major computer-generated scoring model; green bars = frequencies detected using the MSU nonmajor computer-generated scoring model. In each of the three samples (OSU nonmajor, MSU major, MSU nonmajor), 500 responses were used. Error bars represent the SEM.
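For a binary presence/absence concept, the plotted frequency is a proportion, and a common SEM estimate for a proportion is sqrt(p(1 - p)/n). A short sketch under that assumption (the scores are hypothetical):

```python
import math

def concept_frequency_and_sem(scores):
    """Frequency (%) of a binary concept and its standard error of the mean."""
    n = len(scores)
    p = sum(scores) / n
    sem = math.sqrt(p * (1 - p) / n)
    return 100 * p, 100 * sem

# Hypothetical presence/absence scores for "variation" in 500 responses.
scores = [1] * 230 + [0] * 270
freq, sem = concept_frequency_and_sem(scores)
print(f"{freq:.1f}% ± {sem:.1f}%")  # -> 46.0% ± 2.2%
```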
Figure 3. Cross-validation of the impact of training sample size on model performance. Four samples were used in the analysis (OSU nonmajor: n = 500; OSU nonmajor: n = 1056; MSU nonmajor: n = 500; MSU major: n = 500). Five key concepts of evolutionary change were examined separately (e.g., variation, heredity). Arrows indicate which sample was used to train the models and which sample was used to test them. Kappa values compensate for chance agreement, whereas agreement values are raw percentages. (A) OSU (n = 500) training and MSU nonmajor cross-validation. (B) OSU (n = 1056) training and MSU nonmajor cross-validation. (C) OSU (n = 500) training and MSU major cross-validation. (D) OSU (n = 1056) training and MSU major cross-validation.
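The training-sample-size analysis in Figure 3 amounts to a learning curve: train on growing human-scored subsets of one sample and track cross-sample kappa. A generic sketch of that procedure (the classifier is a stand-in, not SIDE, and the intermediate sizes are illustrative; the study itself compared OSU training sets of n = 500 and n = 1056):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.pipeline import make_pipeline

def kappa_vs_training_size(train_texts, train_labels, test_texts, test_labels,
                           sizes=(100, 250, 500, 1056)):
    """Train on growing subsets of one sample; report kappa on another."""
    results = {}
    for n in sizes:
        n = min(n, len(train_texts))  # cap at available human-scored responses
        model = make_pipeline(CountVectorizer(),
                              LogisticRegression(max_iter=1000))
        model.fit(train_texts[:n], train_labels[:n])
        results[n] = cohen_kappa_score(test_labels, model.predict(test_texts))
    return results
```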
Figure 4. Holistic patterns of human–computer scoring correspondence (each row), taking into account all five key concepts. Circle sizes represent the frequencies of concepts; gray bars indicate the percentages of concept co-occurrence. D = differential survival; V = variation; H = heredity; R = limited resources; C = competition.
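The co-occurrence percentages can be derived from the same per-response binary concept matrix. The sketch below takes "co-occurrence" to mean the percentage of responses containing both concepts of a pair, which is one plausible reading of the caption (the exact formula is not reproduced here, and the data are hypothetical):

```python
from itertools import combinations

# Rows = responses; columns = binary scores for the five key concepts:
# D (differential survival), V (variation), H (heredity),
# R (limited resources), C (competition). Hypothetical data.
CONCEPTS = ["D", "V", "H", "R", "C"]
matrix = [
    [1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1],
]

for (i, a), (j, b) in combinations(enumerate(CONCEPTS), 2):
    both = sum(row[i] and row[j] for row in matrix)  # responses with both concepts
    print(f"{a}+{b}: {100 * both / len(matrix):.0f}% of responses")
```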
Correlation coefficients between human-scored and SIDE-scored student explanations for KCSa
| Training sample | Testing sample | Human vs. SIDE KCS correlation (**P < 0.001) |
|---|---|---|
| OSU nonmajor (n = 500) | MSU nonmajor (n = 500) | 0.79** |
| MSU nonmajor (n = 500) | OSU nonmajor (n = 500) | 0.80** |
| OSU nonmajor (n = 500) | MSU major (n = 500) | 0.87** |
| MSU major (n = 500) | OSU nonmajor (n = 500) | 0.85** |
| MSU major (n = 500) | MSU nonmajor (n = 500) | 0.82** |
| MSU nonmajor (n = 500) | MSU major (n = 500) | 0.82** |
aIn all cases, associations were strong and significant (**P < 0.001). KCS represents the number of different scientific key concepts in a student's response. For details, see Methods and Nehm and Reilly (2007).
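Because KCS is a count of distinct key concepts per response, the table's correlations can be reproduced by summing the five binary concept scores for each response and correlating the human and computer sums. A sketch using scipy (all scores hypothetical):

```python
from scipy.stats import pearsonr

# Per-response binary scores for the five key concepts
# (differential survival, variation, heredity, limited resources, competition).
human_concepts = [[1, 1, 0, 0, 1], [1, 0, 0, 0, 0], [1, 1, 1, 1, 1], [0, 0, 0, 0, 1]]
side_concepts  = [[1, 1, 0, 0, 1], [1, 0, 0, 0, 1], [1, 1, 1, 0, 1], [0, 0, 0, 0, 1]]

human_kcs = [sum(row) for row in human_concepts]  # KCS per response, e.g., [3, 1, 5, 1]
side_kcs  = [sum(row) for row in side_concepts]   # e.g., [3, 2, 4, 1]

r, p = pearsonr(human_kcs, side_kcs)
print(f"r = {r:.2f} (P = {p:.3f})")
```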
Examples of the types of disagreements between human-scored and computer-scored explanationsa
| Scoring pattern | Category | Examples 1 to 5 | Solution |
|---|---|---|---|
| Positive computer score but negative human score | Many key terms used, but important aspects were missing | (1) “The original Shrew, who didn't have incisors, may have not been a […]” | Put a weight on core terms |
| | Key terms not adjacent, but scattered throughout the response | (2) “For the fly species with wings to […]” | Human augmentation of SIDE-scoring models |
| Negative computer score but positive human score | Very uncommonly used expression | (3) “The fish was filling a niche in an area that required a fish with smaller fins. Generations passed and […]” | Increase concept frequencies in training sample; human augmentation of SIDE-scoring models |
| | Complex expressions | (4) “Variation of living fish species may leads [sic] to random mutation. It creates new sequences of DNA that will code for new or different protein. […]” | Human augmentation of SIDE-scoring models |
| | Spelling errors and spacing errors | (5) preditor [predator], servive [survive], springoffs [offspring], foodso [food so] | Incorporate a spell-check program during data collection |
aCategories: types of scoring problems; examples: specific student responses; solutions: approaches used to correct the computer–human disagreement.
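One mitigation listed above is a spell-check pass before scoring. A minimal sketch that normalizes the specific misspellings from example (5) with a lookup table; a real pipeline would use a general-purpose spell-checker rather than this hand-built dictionary:

```python
import re

# Corrections drawn from example (5); extend as new misspellings are observed.
CORRECTIONS = {
    "preditor": "predator",
    "servive": "survive",
    "springoffs": "offspring",
    "foodso": "food so",
}

def normalize_spelling(text: str) -> str:
    """Replace known misspellings (case-insensitive, whole words) before scoring."""
    pattern = re.compile(r"\b(" + "|".join(CORRECTIONS) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: CORRECTIONS[m.group(0).lower()], text)

print(normalize_spelling("The preditor could not servive"))
# -> "The predator could not survive"
```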