| Literature DB >> 21633066 |
Scott Freeman, David Haak, Mary Pat Wenderoth.
Abstract
We tested the hypothesis that highly structured course designs, which implement reading quizzes and/or extensive in-class active-learning activities and weekly practice exams, can lower failure rates in an introductory biology course for majors, compared with low-structure course designs that are based on lecturing and a few high-risk assessments. We controlled for 1) instructor effects by analyzing data from quarters when the same instructor taught the course, 2) exam equivalence with new assessments called the Weighted Bloom's Index and Predicted Exam Score, and 3) student equivalence using a regression-based Predicted Grade. We also tested the hypothesis that points from reading quizzes, clicker questions, and other "practice" assessments in highly structured courses inflate grades and confound comparisons with low-structure course designs. We found no evidence that points from active-learning exercises inflate grades or reduce the impact of exams on final grades. When we controlled for variation in student ability, failure rates were lower in a moderately structured course design and were dramatically lower in a highly structured course design. This result supports the hypothesis that active-learning exercises can make students more skilled learners and help bridge the gap between poorly prepared students and their better-prepared peers.
Year: 2011 PMID: 21633066 PMCID: PMC3105924 DOI: 10.1187/cbe.10-08-0105
Source DB: PubMed Journal: CBE Life Sci Educ ISSN: 1931-7913 Impact factor: 3.325
Failure rates in some gateway STEM courses
| Field | Course | Failure rate | Failure criterion | Reference |
|---|---|---|---|---|
| Biology | Intro-majors | 56% | Average proportion of Ds and Fs on exams | |
| Biology | Intro-majors | >25% | Course outcome: D, F, or drop | |
| Biology | Intro-nonmajors | 27% | Course outcome: D, F, or drop | |
| Biology | Biochemistry | 85% | F on first exam | |
| Biology | Medical Microbiology | 30% | Course outcome: D or F | |
| Chemistry | Intro-majors | ∼50% | Course outcome: D, F, or drop | |
| Chemistry | Intro-nonmajors | ≥30% | Course outcome (“at most institutions”): fail or drop | |
| Computer science | Intro to programming | 33% | Course outcome (international survey): F or drop | |
| Engineering | Intro to chemical engineering | 32% | Course outcome: D, F, or drop | |
| Mathematics | First-year calculus | 42% | Course outcome (U.S. national average): failure | |
| Physics | Intro-majors | 33% | Course outcome: D, F, or drop | |
Variation in course format
| | Spring 2002 | Spring 2003 | Spring 2005 | Autumn 2005 | Autumn 2007 | Autumn 2009 |
|---|---|---|---|---|---|---|
| Class size (10-d class list) | 331 | 345 | Two sections of 173a | Two sections of 173a | 342 | 699 |
| Elements of course design | ||||||
| Socratic lecturing | X | X | X | X | ||
| Ungraded active learning | X | |||||
| Clickers | Xb | Xc | X | X | ||
| Practice exams | Xd | Xe | Xd | Xe | ||
| Reading quizzes | X | X | ||||
| Class notes summaries | X | |||||
| In-class group exercises | X | X | ||||
| Exams | Two 100-point midterms, 200-point comprehensive final | Two 100-point midterms, 200-point comprehensive final | Two 100-point midterms, 200-point comprehensive final | Two 100-point midterms, 200-point comprehensive final | Two 100-point midterms, 200-point comprehensive final | Four 100-point exams |
| Total course points | 550 | 550 | 720b; 620 | 720 | 793 | 741 |
aThe sections were taught back-to-back, with identical lecture notes. They took similar or identical midterms and an identical final exam.
bOne section answered questions with clickers; one section answered identical questions with cards (see Freeman ). Card responses were not graded.
cIn one section, clicker questions were graded for participation only; in one section, identical clicker questions were graded right/wrong (see Freeman ).
dAt random, half the students did practice exams individually; half did the same exercise in a four-person group structured by predicted grade.
eAll students did practice exams individually.
Figure 1. The Weighted Bloom's Index “Scale.” The Weighted Bloom's Index can be interpreted by comparing indices from actual exams to the values shown here, which are expected if all exam questions were at a given level of Bloom's taxonomy of learning. Levels 1 and 2 in Bloom's taxonomy are considered lower-order cognitive skills; Levels 3–6 are considered higher-order cognitive skills.
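Under one plausible reading of the caption above (the index is a point-weighted mean Bloom's level, rescaled so an exam made entirely of Level-6 questions scores 100 and an all-Level-3 exam scores 50), the computation can be sketched as follows. The function name and the question data are illustrative assumptions, not taken from the paper.

```python
# Sketch of a Weighted Bloom's Index under the assumption described above:
# a point-weighted average Bloom's level per exam point, rescaled to 0-100
# by dividing by the maximum Bloom's level (6). Question data are invented.

def weighted_blooms_index(questions):
    """questions: list of (points, bloom_level) pairs, bloom_level in 1..6."""
    total_points = sum(pts for pts, _ in questions)
    weighted_level = sum(pts * level for pts, level in questions)
    return 100.0 * weighted_level / (6.0 * total_points)

# A hypothetical exam whose questions all sit at Bloom's Level 3 (application)
exam = [(10, 3), (20, 3), (10, 3)]
print(weighted_blooms_index(exam))  # 50.0, mid-scale on Figure 1
```

On this reading, the course averages of roughly 46–54 in the exam-equivalence table correspond to exams pitched, on average, near Bloom's Level 3.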
Exam equivalence analyses
a. Percentage agreement among Bloom's taxonomy raters
| | Discussed-consensus | Independent: all three agree | Independent: two of three agree | Independent: sequential ratings | Independent: nonsequential ratings |
|---|---|---|---|---|---|
| Percentage of total ratings | 7.3 | 26.4 | 51.1 | 9.9 | 5.2 |

“Discussed-consensus” means that questions were rated independently and then discussed to reach a consensus; “Independent ratings” were not discussed among raters. “Sequential ratings” were questions that received three ratings that differed by one Bloom's level (e.g., a 2, 3, and 4); “Nonsequential ratings” were questions that received three ratings that differed by more than one Bloom's level (e.g., a 2, 3, and 5).

b. Weighted Bloom's Indices
| | Spring 2002 | Spring 2003 | Spring 2005 | Autumn 2005 | Autumn 2007 | Autumn 2009 |
|---|---|---|---|---|---|---|
| Midterm 1 | 50.8 | 48.5 | 45.3 | 58.9 | 54.4 | 53.7 |
| Midterm 2 | 36.1 | 51.6a | 51.6a | 46.8 | 50.8 | 54.7 |
| Midterm 3 | | | | | | 50.2 |
| Final (or Midterm 4) | 48.1 | 54.3 | 45.4 | 51.6 | 51.6 | 55.3 |
| Course averageb | 45.8 | 52.1 | 46.9 | 52.2 | 52.1 | 53.5 |

c. PES values (predicted percent correct)
| | Spring 2002 | Spring 2003 | Spring 2005 | Autumn 2005 | Autumn 2007 | Autumn 2009 |
|---|---|---|---|---|---|---|
| Midterm 1 | 70.8 | 73.0 | 71.9 | 67.8 | 64.9 | 66.0 |
| Midterm 2 | 73.0 | 68.0a | 68.0a | 72.1 | 67.7 | 67.0 |
| Midterm 3 | | | | | | 68.5 |
| Final (or Midterm 4) | 69.4 | 70.0 | 71.8 | 71.0 | 69.6 | 68.6 |
| Course averageb | 70.6 | 70.2 | 70.9 | 70.5 | 68.0 | 67.5 |
aIdentical exams.
bCourse averages were computed from data on all exam questions from that quarter. (They are not the averages of the indices from each exam.)
Figure 2. Weighted Bloom's Indices and PES values are negatively correlated. The Weighted Bloom's Index summarizes the average Bloom's level per point on an exam; the PES summarizes expert-grader predictions for the average points that a class will receive on an exam. Regression statistics are reported in the text.
Regression analyses: Total exam points as a predictor of final course grade. Data from the two sections in Spring 2005 were analyzed separately because the clicker and card sections in that quarter (see Materials and Methods and Table 2) had different total course points and thus a different scale for computing final grade.

a. Regression statistics
| | Low: Spring 2002 | Low: Spring 2003 | Moderate: Spring 2005 (no clickers) | Moderate: Spring 2005 (clickers) | Moderate: Autumn 2005 | Moderate: Autumn 2007 | High: Autumn 2009 |
|---|---|---|---|---|---|---|---|
| Adjusted R² | 0.96 | 0.95 | 0.97 | 0.88 | 0.96 | 0.89 | 0.89 |
| Intercept | −2.29a | −2.35a | −2.12 | −2.30a | −2.57 | −2.32a | −2.71 |
| Slope (β) | 0.0184b | 0.0186b | 0.0172 | 0.0180b | 0.0189b | 0.0187b | 0.0198 |
| 1.5 cutoff predicted by regression | 206.4 | 206.6 | 210.1 | 211.1 | 215.8 | 204.0 | 213.0 |
| n | 323 | 335 | 174 | 156 | 333 | 336 | 653 |

b. ANCOVA fit by GLMs incorporating the effects of exam points, quarter, and the interaction exam points × quarter; the response variable is actual grade (GLM with binomial error distribution). Analysis of deviance shows that slope and intercept do not vary significantly across quarters.
| Model | Residual df | Residual deviance | df | Deviance | p(χ²) |
|---|---|---|---|---|---|
| Exam points only | 2303 | 82.840 | | | |
| Exam points + quarter | 2298 | 79.366 | 5 | 3.473 | 0.6274 |
| Exam points × quarter | 2293 | 78.475 | 5 | 0.892 | 0.9708 |
aThese intercept values have 95% confidence intervals that overlap.
bThese slope values have 95% confidence intervals that overlap.
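The “1.5 cutoff” row of the regression table above follows directly from the fitted line grade = intercept + slope × exam points, solved at the failing grade of 1.5. A minimal check using Spring 2002's coefficients (the rounded values from the table, so the result differs slightly from the published 206.4, which was computed from unrounded fits):

```python
# Invert the fitted line grade = intercept + slope * exam_points to find the
# exam-point total at which the predicted grade equals the 1.5 failure cutoff.
# Coefficients are the rounded Spring 2002 values from the table above.

def points_at_grade(target_grade, intercept, slope):
    return (target_grade - intercept) / slope

cutoff = points_at_grade(1.5, intercept=-2.29, slope=0.0184)
print(round(cutoff, 1))  # 206.0 of the 400 available exam points
```

The small gap between 206.0 here and 206.4 in the table is rounding in the published coefficients, not a different model.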
Robustness of the Predicted Grade model. ANCOVA fit by GLMs incorporating the effects of predicted grade, quarter, and the interaction predicted grade × quarter; the response variable is actual grade (GLM with binomial error distribution). Analysis of deviance shows that slope and intercept do not vary significantly across quarters.
| Model | Residual df | Residual deviance | df | Deviance | p(χ²) |
|---|---|---|---|---|---|
| Predicted grade only | 2276 | 306.616 | | | |
| Predicted grade + quarter | 2271 | 298.432 | 5 | 8.184 | 0.1464 |
| Predicted grade × quarter | 2266 | 297.386 | 5 | 1.047 | 0.9587 |
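The analysis-of-deviance p values in the table above come from referring the drop in residual deviance between nested GLMs to a chi-square distribution on the change in degrees of freedom (5 extra parameters when quarter enters). A stdlib-only sketch (in practice one would call `scipy.stats.chi2.sf`; the series-based survival function here is an illustrative helper, not the paper's code):

```python
# Chi-square test for a nested-GLM deviance comparison, using only the
# standard library. chi2_sf evaluates the upper tail of a chi-square
# distribution via the regularized lower incomplete gamma series.
import math

def chi2_sf(x, k):
    """Upper-tail probability of a chi-square variate x with k df (x > 0)."""
    a, t = k / 2.0, x / 2.0       # gamma shape and scaled argument
    term = 1.0 / a
    total = term
    n = 0
    while term > total * 1e-15:   # sum the series until terms are negligible
        n += 1
        term *= t / (a + n)
        total += term
    # regularized lower incomplete gamma P(a, t), then its complement
    p_lower = total * math.exp(-t + a * math.log(t) - math.lgamma(a))
    return 1.0 - p_lower

# Predicted grade only vs. predicted grade + quarter: 5 extra parameters
p_quarter = chi2_sf(306.616 - 298.432, 5)
print(round(p_quarter, 4))  # 0.1464, matching the table
```

The same call on the interaction row, `chi2_sf(1.047, 5)`, reproduces the table's 0.9587.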
Average predicted grades across quarters
| | Spring 2002 | Spring 2003 | Spring 2005 | Autumn 2005 | Autumn 2007 | Autumn 2009 |
|---|---|---|---|---|---|---|
| Mean ± SD | 2.46 ± 0.72 | 2.57 ± 0.73 | 2.64 ± 0.70 | 2.67 ± 0.60 | 2.85 ± 0.66 | 2.70 ± 0.61 |
| n | 327 | 338 | 334 | 328 | 339 | 691 |
Failure rates across quarters
| | Low: Spring 2002 | Low: Spring 2003 | Moderate: Spring 2005 | Moderate: Autumn 2005 | Moderate: Autumn 2007 | High: Autumn 2009 |
|---|---|---|---|---|---|---|
| Percentage of students < 1.5 | 18.2 | 15.8 | 10.9 | 11.7 | 7.4 | 6.3 |
| n | 324 | 333 | 330 | 333 | 336 | 653 |
Multimodel inference (MMI): models and comparison criteria. Best-fit models are recognized by 1) the conservative LRT p value of the lowest-AIC model or 2) a ΔAIC > 2. Note that the LRTs are hierarchical: the p value reported on each row is from a test comparing the model in that row with the model in the row below it.
| Model | K | AIC | ΔAIC | ω | Log-likelihood | LRT (p) |
|---|---|---|---|---|---|---|
| Structure + predicted | 5 | 1610.27 | — | 0.47 | −800.14 | 0.027 |
| Predicted | 3 | 1613.5 | 3.22 | 0.09 | −803.75 | 0.098 |
| Structure × predicted | 7 | 1613.68 | 3.4 | 0.43 | −799.84 | 2.2e-16 |
| Structure | 4 | 1903.27 | 293 | 0 | −947.64 | 0.0015 |
| Null | 2 | 1912.3 | 302.02 | 0.01 | −954.15 | |
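Two columns of the MMI table above can be cross-checked from the others: AIC = 2K − 2 × log-likelihood, and each row's LRT statistic is twice the log-likelihood difference between that model and the one below it, referred to a chi-square on the difference in K. Both tests sketched here happen to add 2 parameters, and for 2 df the chi-square upper tail reduces to exp(−x/2). The helper names are illustrative:

```python
# Cross-checks on the multimodel-inference table: recompute AIC from K and
# log-likelihood, and the hierarchical LRT p values for the 2-df comparisons.
import math

def aic(k, loglik):
    """Akaike information criterion: 2K - 2 * log-likelihood."""
    return 2 * k - 2 * loglik

def lrt_p_2df(loglik_full, loglik_reduced):
    """LRT p value with 2 df, where the chi-square upper tail is exp(-x/2)."""
    stat = 2 * (loglik_full - loglik_reduced)
    return math.exp(-stat / 2)

print(round(aic(5, -800.14), 2))              # 1610.28, matching 1610.27 up to rounding
print(round(lrt_p_2df(-800.14, -803.75), 3))  # 0.027: structure + predicted vs. predicted
print(round(lrt_p_2df(-947.64, -954.15), 4))  # 0.0015: structure vs. null
```

The remaining rows use 4-df and 3-df comparisons, which need a general chi-square survival function rather than the exp(−x/2) shortcut.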
Figure 3. Failure rates, controlled for Predicted Grade, as a function of course structure. In this study, low-, moderate-, and high-structure courses rely primarily on Socratic lecturing; some active learning and formative assessment; and extensive active learning (no lecturing) and formative assessment, respectively. The difference between the proportion of students predicted to fail and the actual proportion failing decreases with increasing structure (GLMM with binomial error, n = 2267; *p = 0.06, **p = 0.0004).