
Meta-analysis of Gender Performance Gaps in Undergraduate Natural Science Courses.

Sara Odom1, Halle Boso1, Scott Bowling1, Sara Brownell2, Sehoya Cotner3, Catherine Creech4, Abby Grace Drake5, Sarah Eddy6, Sheritta Fagbodun7, Sadie Hebert3, Avis C James8, Jan Just9, Justin R St Juliana5, Michele Shuster8, Seth K Thompson3, Richard Whittington7, Bill D Wills1, Alan E Wilson1, Kelly R Zamudio5, Min Zhong1, Cissy J Ballen1.   

Abstract

To investigate patterns of gender-based performance gaps, we conducted a meta-analysis of published studies and unpublished data collected across 169 undergraduate biology and chemistry courses. While we did not detect an overall gender gap in performance, heterogeneity analyses suggested further analysis was warranted, so we investigated whether attributes of the learning environment impacted performance disparities on the basis of gender. Several factors moderated performance differences, including class size, assessment type, and pedagogy. Specifically, we found evidence that larger classes, reliance on exams, and undisrupted, traditional lecture were associated with lower grades for women. We discuss our results in the context of natural science courses and conclude by making recommendations for instructional practices and future research to promote gender equity.


Year:  2021        PMID: 34283633      PMCID: PMC8715812          DOI: 10.1187/cbe.20-11-0260

Source DB:  PubMed          Journal:  CBE Life Sci Educ        ISSN: 1931-7913            Impact factor:   3.325


INTRODUCTION

Extensive research on the experiences of women in science, technology, engineering, and mathematics (STEM) fields has revealed several common patterns of inequalities that reduce the retention of women in STEM (Eddy and Brownell, 2016). Such systemic challenges include gender stereotypes about STEM careers (DiDonato and Strough, 2013), poor mentorship (Newsome, 2008), unconscious bias against women (Moss-Racusin ), and inadequate institutional support to help balance family demands (Goulden ). Beyond these systemic challenges, institutional and pedagogical choices can also have negative impacts on metrics of performance for women. Examples include large class sizes (Ballen ), biased in-class participation (Aguillon ; Bailey ), and reliance on multiple-choice exams (Stanger-Hall, 2012). Due in part to these challenges, women are less likely than men to complete science-related college majors and join the STEM workforce (Chen, 2013). Binary gender performance gaps in science are well documented in a variety of STEM courses (Brooks and Mercincavage, 1991; Grandy, 1994; Tai and Sadler, 2001; Rauschenberger and Sweeder, 2010; Creech and Sweeder, 2012; Sonnert and Fox, 2012; Lauer ; McCullough, 2013; Peters, 2013; Hansen and Birol, 2014; Matz ), including studies that control for measures of incoming student ability (Eddy ; Eddy and Brownell, 2016; Wright ; Salehi ). In some higher education STEM studies that do not control for prior ability, there are cases in which there is no performance gap or one that favors women (Eddy and Brownell, 2016). Controlling for incoming student ability or preparation can account for differences in student performance that arise from factors correlated with demographic characteristics (e.g., gender, race/ethnicity, first-generation status; Salehi ), so that one can compare students with similar ability or preparation.
These controlled differences are interesting to researchers, as they can point to classroom issues that create observed underperformance, defined as “not performing to ability” (Salehi ). The “raw” performance outcomes in STEM course work (not controlling for incoming preparation) can have lasting repercussions on future STEM careers; analyzing these raw outcomes is the approach we took in the current study. For example, Wang found that 12th-grade math scores—on which girls underperformed relative to boys—mediated students’ selection of STEM occupations in their early to mid-30s. In other cases, the impact is immediate. Many undergraduate students start out in introductory STEM courses that serve as required prerequisites for continuing in their majors. If women receive low grades in these introductory STEM courses, then they are less likely than men with similar grades and academic preparation to retake the course, more likely to drop out, and less likely to advance (Rask and Tiefenthaler, 2008; Seymour and Hunter, 2019; Harris ). Thus, research that addresses the factors driving observed performance gaps has the potential to minimize these inequities and enhance the persistence of women in STEM. Understanding factors that lead to inequities requires first investigating ways that instructional practices affect student performance. Previous research has investigated a number of non–mutually exclusive course elements hypothesized to impact gender performance gaps. For example, many introductory courses are taught in large classrooms (Matz ), despite evidence that large courses may negatively affect women’s performance (Ho and Kelman, 2014; Ballen ) and participation (Ballen ; Bailey ). Assessment strategies have also been proposed to have an impact on binary gender gaps.
Especially in large introductory courses, student performance is often assessed primarily through the use of timed, multiple-choice exams (Matz ), despite research that shows this approach is not a meaningful measure of critical thinking or learning (Martinez, 1999; Dufresne ; Simkin and Kuechler, 2005) and may specifically disadvantage women (Ballen ). Performance gaps have been shown to be higher on high-stakes exams than on other proxies for performance, such as overall grade point average (GPA) or lower-stakes exams (Stanger-Hall, 2012; Kling ). Finally, the instructor’s pedagogical approach in the classroom might impact performance. Substantial evidence now confirms that active learning improves student outcomes in STEM courses (Freeman ), and it may offer disproportionate benefits for groups often underrepresented in STEM, such as underrepresented minority students (Eddy and Hogan, 2014; Ballen ; Casper ; Theobald ) and first-generation students (Eddy and Hogan, 2014). However, when it comes to gender gaps, evidence for the effectiveness of active learning has been mixed. While some studies claim reduced gender gaps in active-learning courses (Lorenzo ), other studies have been unable to reproduce the same effect (Pollock ; Madsen ; Ballen ). Yet other studies have noted the potential for active learning to exacerbate inequities that could influence performance for students with anxiety (England ; Cohen ), which would disproportionately affect women (Cooper ; Downing ) and could lead to gender gaps. To test the hypothesis that gender impacts performance in natural science courses and to test the impact of moderators on relative performance outcomes, we conducted a meta-analysis of a wide selection of published and unpublished data (Glass, 1976).
Focusing on undergraduate-level biology and chemistry (e.g., general biology, cell biology, biochemistry, general chemistry; see Supplementary Material for more information), we analyzed student scores from a large number of courses and institutions to identify factors that impact gender equity. Specifically, we address the following questions: Is there a performance gap between men and women in undergraduate biology and chemistry courses? What classroom factors (e.g., class size, assessment type, pedagogy) narrow historic gender gaps by promoting women’s performance?

METHODS

Study Identification

We identified studies following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) protocol (Moher ; Supplemental Figure S1). On February 27, 2019, we searched three online education research–affiliated databases—ERIC, Education Research Complete, and PsycINFO—with results limited to journal articles, theses, and dissertations. We used the following search terms, limited to subject descriptors: (biology OR STEM OR science OR medical OR chemistry) AND (education OR achievement OR test OR performance OR outcomes OR examinations OR student) AND (university OR college OR higher education OR adulthood) AND (sex OR gender OR female OR gap) NOT foreign countries NOT admission NOT readiness NOT high school NOT career. We used the following inclusion criteria to determine whether the studies identified by this search would be included in the final data set:

- Data were collected in undergraduate-level courses at colleges and universities in the United States.
- Data came from a course within the biological and chemical sciences.
- Data could be aggregated across multiple sections of the same course but could not be combined across different courses.
- Data included exam scores (average score on one or more exams), course grades (the final grade that students received in a course), or science concept inventory (CI) scores disaggregated by gender.

Data from published studies were screened and coded by authors S.O. and H.B. We screened the studies first by reading the abstracts. To increase the number and scope of classroom scores in our analysis, we carried over into full-text screenings both studies that focused on student academic performance and studies that focused on other classroom elements, in case student scores were provided as context for those studies. Studies that were not disqualified based on the abstract were downloaded and the full text screened.
Studies were included in our final data set only if we could ensure that all study criteria were met. When studies suggested that data that met our criteria were collected but not included in the publication, we emailed the study’s author(s) to request additional data. Our original search identified 2822 studies. Abstract screening and exclusion of duplicate studies removed 2689 studies, leaving 133 studies for full-text evaluation. Of these, 25 studies could not be accessed, 39 were not conducted in the appropriate setting, and 51 lacked the appropriate data needed for this study (study did not provide grades, scores were not disaggregated by gender, etc.; Supplemental Figure S1). For 22 studies, the text suggested that the authors collected data that fit our criteria, but the data were not included in the published paper. In these cases, we requested data directly from the authors. We were unable to get in contact with the authors of 10 studies. For nine studies, we were able to contact authors, but they were unable to provide us with data because of privacy concerns or because they could no longer access it. Authors of three studies shared data, which we included in the final data set. In total, 18 published studies met all of the required criteria for inclusion (see Supplementary Material for full list of published studies included in the analysis). These studies included 89 different courses. Of these courses, 35 included aggregate data for multiple sections. We note that class size was not calculated based on sample size in these cases; for aggregated data, class size was either missing, or we used average class size. Additionally, we collected course grades and descriptions of 80 individual courses from institutions across the United States in conjunction with the Equity and Diversity in Undergraduate STEM Research Coordination Network (i.e., unpublished data; Thompson ). 
These data were collected during the course of normal academic classes, with the intention of using them in education research studies that focus on different aspects of equity. Because the instructors who collected these data were involved in this study, we were able to directly follow up on any questions about these data and how they fit this study’s criteria. Data were provided in the form of raw grades, which authors S.O. and C.J.B. used to calculate mean scores and SD for men and women students. We had multiple comparisons from a subset of the n = 169 courses (e.g., both exam score and course grade), and so our data set included n = 246 comparisons and more than 28,000 students.
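The screening and sample counts reported above are internally consistent, as a quick arithmetic check shows:

```python
# Sanity check of the PRISMA screening flow and course counts reported above.
identified = 2822                                   # records from the database search
remaining_after_screening = identified - 2689       # abstract screening + duplicate removal
inaccessible, wrong_setting, lacking_data = 25, 39, 51
included_published = (remaining_after_screening
                      - inaccessible - wrong_setting - lacking_data)  # published studies kept

# Courses contributing data: 89 from published studies, 80 unpublished
published_courses, unpublished_courses = 89, 80
total_courses = published_courses + unpublished_courses
```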

Data Collection

The research reported here was determined to be exempt by Auburn University’s Institutional Review Board (protocol 19-355 EX 1908). From each course, we collected sample size, mean scores, and SD for men and women for whichever of the three specified assessment types (exam scores, final grades, or science CI) were available. If SD and other measures of variance were not included (9.76% of studies), we imputed them based on the average SD of the other scores in each assessment category (Furukawa ). To account for the possibility that these studies had larger SDs than the average, we also ran a sensitivity analysis by using a larger SD (75th percentile). Because this did not change any of the outcomes (see Supplemental Table S4), we present only the results calculated using average SD. For three studies, gender differences in scores were available only in the form of z-scores, which we converted to effect sizes (see Statistical Analyses). Additionally, we collected the following information as it was available (Supplemental Table S1): institution name and/or type (Supplemental Table S2), course title (e.g., Introduction to Biology), broader topic (biology or chemistry), intended student audience (natural science major or nonmajor), number of sections (one or multiple sections), class size, instructor(s) gender, pedagogy (lecture-based or active learning), assessment type (exam score, final course grade, or CI), and course level (introductory/lower division or upper division; Figure 1). We categorized introductory courses as those with a course title that included the terms “introductory” or “principles” or when the description of the course included this information. We categorized upper-level courses as those that had prerequisites or when the study specified that upper-level students typically took them. To include pedagogy in a quantitative model, we categorized descriptions provided by instructors or the literature into either lecture based or active learning (Supplemental Table S3).
In “lecture” courses, the majority of course time was dedicated to instruction by the teacher, with few if any alternative activities occurring during a normal class period. “Active learning,” a broad category describing approaches designed to increase student engagement (Freeman ; Driessen ), included courses that incorporated interactive and student-focused activities into the course structure. Using the descriptions provided in each study, two authors (S.O. and C.J.B.) individually categorized the pedagogy of each course, with initial interrater agreement of 83.3%. The primary source of disagreement was in cases in which a course incorporated activities as part of a required laboratory component. We decided to focus only on the lecture component of the course, and following this discussion, we achieved 100% agreement.
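The SD imputation described above can be sketched as follows. This is an illustrative sketch with hypothetical values and a hypothetical function name (the actual analysis was performed in R): missing SDs within an assessment category are filled with the category's mean SD, with a 75th-percentile variant for the sensitivity analysis.

```python
import statistics

def impute_missing_sds(sds, use_percentile_75=False):
    """Fill missing SDs (None) with the mean of observed SDs in the same
    assessment category; optionally use the 75th percentile of observed SDs
    as a conservative sensitivity check."""
    observed = [s for s in sds if s is not None]
    if use_percentile_75:
        fill = statistics.quantiles(observed, n=4)[2]  # 75th percentile
    else:
        fill = statistics.mean(observed)
    return [fill if s is None else s for s in sds]

# Hypothetical exam-score SDs for five courses, two of which are missing
sds = [8.0, 9.5, None, 10.5, None]
filled = impute_missing_sds(sds)                        # mean imputation
filled75 = impute_missing_sds(sds, use_percentile_75=True)  # sensitivity variant
```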
FIGURE 1.

Descriptive summary of classes in meta-analysis. (A) Histogram of class sizes; (B) number of comparisons for each assessment type: science CIs, course grade, exam grade; (C) classes by broad subject (biology or chemistry) and level (intro or non-intro); and (D) percentage of pedagogy categories.


Statistical Analyses

We ran all statistical analyses using R v. 3.6.2 (R Core Team, 2019) within RStudio v. 1.2.5033 (RStudio Team, 2019). We used the metafor package (Viechtbauer, 2010) for effect size calculations, models, and checking for publication bias; the MuMIn package (Barton, 2020) for model selection by Akaike information criterion (AIC); the multcomp package (Hothorn ) for pairwise comparisons; and the tidyverse package (Wickham ) to streamline coding and create some of the graphs. To account for differences in grade distributions across different courses, we quantified gender gaps by calculating a standardized mean difference for each course in the form of Hedges’s g (Hedges, 1981): g = [(M_women − M_men) / SD_pooled] × J, where SD_pooled is the pooled standard deviation of the two groups and J = 1 − 3/(4(n_women + n_men) − 9) is the small-sample bias correction. For the studies that reported gender differences as z-scores rather than means, we converted the z-scores to Hedges’s g before analysis. We set up these calculations so that a positive Hedges’s g indicates that women scored higher than men, while a negative Hedges’s g indicates that men scored higher than women. The degree of difference is based on the absolute value of the effect size. While interpretations of effect size impact vary depending on the context of comparisons, within education, Hedges and Hedberg (2007) suggest that Hedges’s g values of 0.2 and greater can indicate differences that should be of interest to policy makers. We used a random effects model, with university and subject as nested random effects (Konstantopoulos, 2011), to calculate the overall effect size based on the Hedges’s g estimates and sampling variances of all of the grade comparisons, using the Hedges estimator to account for heterogeneity. Some studies provided both course grade and average exam score. In these cases, we used course grade in calculating the overall effect size (we obtained the same results when exam grades were prioritized; see Supplemental Table S4). We checked for publication bias by generating a funnel plot, running a trim-and-fill analysis, and calculating a fail-safe n using the Rosenberg method (Rosenberg, 2005).
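The effect-size calculation can be illustrated with a short sketch. The numbers below are hypothetical (not from any course in our data set); the function implements the standard Hedges's g with small-sample correction, signed so that positive values favor women.

```python
import math

def hedges_g(mean_w, mean_m, sd_w, sd_m, n_w, n_m):
    """Standardized mean difference (Hedges's g) between women and men.

    Positive g -> women scored higher; negative g -> men scored higher.
    """
    # Pooled standard deviation of the two groups
    sd_pooled = math.sqrt(((n_w - 1) * sd_w**2 + (n_m - 1) * sd_m**2)
                          / (n_w + n_m - 2))
    d = (mean_w - mean_m) / sd_pooled
    # Small-sample bias correction (Hedges, 1981)
    j = 1 - 3 / (4 * (n_w + n_m) - 9)
    return d * j

# Hypothetical course: women average 82 (SD 8, n = 60), men average 80 (SD 9, n = 55)
g = hedges_g(82, 80, 8, 9, 60, 55)
```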
Based on initial results, we used a mixed effects model to measure the impact of course factors on gender gaps. We selected models based on AIC (Arnold, 2010; Theobald, 2018), considering the following as potential fixed effects: class size, assessment type (science CIs, exam scores, and course grade), pedagogy category (active or traditional lecture), course level (introductory or upper level), and broad topic (biology or chemistry). University and subject were included as nested random effects (Konstantopoulos, 2011). Because we took this approach, we advise readers to interpret our moderators in an “all-else-equal” context, with the “all” consisting of our other variables. Because assessment type contained three factors, we performed post hoc pairwise comparisons on assessment type using Tukey and Holm adjustments to compare each of the factors against each other.
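The AIC-based selection can be sketched as follows. Using the rounded AIC values reported in Table 1, the Δi values come out near the table's 1.18 and 1.61 (which were computed from unrounded AICs); the Akaike weights computed here differ from the reported wi because the table's weights are normalized over the full candidate set, not just the top three models.

```python
import math

def delta_aic_and_weights(aics):
    """Delta-AIC values and Akaike weights for a set of candidate models.

    Weights are normalized over the models supplied, so they match a
    published table only when the full candidate set is included.
    """
    best = min(aics)
    deltas = [a - best for a in aics]
    # Relative likelihood of each model: exp(-delta/2)
    rel = [math.exp(-d / 2) for d in deltas]
    total = sum(rel)
    return deltas, [r / total for r in rel]

# Rounded AIC values of the three best models from Table 1
deltas, weights = delta_aic_and_weights([531.1, 532.3, 532.7])
```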

RESULTS

We did not identify a significant gender gap in performance across all published studies and unpublished data (Hedges’s g = −0.2268, p value = 0.4119; Supplemental Table S4 and Supplemental Figure S2). This model had a high degree of heterogeneity (I2 = 97.00%), suggesting other factors may play a role in explaining variation in the data, which we describe in detail below. We found a negligible impact of publication bias in this data set. While some points fell outside the expected distribution cone in the funnel plot, the distribution of data was relatively symmetrical (Figure 2). Furthermore, a trim-and-fill analysis did not add any points, indicating no gaps in the data distribution. The fail-safe n calculation predicted that 7768 “missing” studies would need to exist to invalidate the study’s conclusions. Based on these results, we proceeded with the remaining analyses without any publication bias correction.
FIGURE 2.

Standard error funnel plot addressing publication bias. In a study with minimal publication bias, data should be symmetrically spread, with the majority of data within the indicated cone.

We conducted further analyses using mixed models to identify classroom factors that may explain variation in our data (Figure 3). Using AIC, we identified several candidate models within ΔAIC < 2 (Table 1); of the three equivalent models with the lowest AIC values, we selected the most parsimonious. This model included class size, assessment type, and pedagogy as fixed effects and university and subject as random effects. The final model excluded other potential variables of interest, such as whether the class was an introductory or upper-level course and the subject (biology or chemistry).
FIGURE 3.

Predicted gender gaps across different class sizes and combinations of pedagogies (active, lecture) and assessment types (course grade, exam scores, and CIs) in units of Hedges’s g. Assuming grades are assigned on a bell curve, the difference of one Hedges’s g is approximately the difference of one letter grade (though interpretations will vary based on the class grade distribution).

TABLE 1.

Model selection by AIC values, with the model selected for remaining analyses in bold type

Model (random effects = university/subject)            AIC      Δi      wi
Assessment + pedagogy + class.size                     531.1    0.00    0.413
Assessment + pedagogy + intro.or.upper + class.size    532.3    1.18    0.228
Assessment + pedagogy + biol.or.chem + class.size      532.7    1.61    0.184
Class size was significantly associated with gender gaps (p value < 0.001), with women’s relative performance dropping as class size increased (Figures 1 and 3). We examined three different assessment types: CIs, exam scores, and course grade (Table 2). Of these assessment types, model-based estimates predicted that, on average, women perform better on course grades than on exams; pairwise comparisons showed the gap between women and men widening by 0.142 SDs when exam scores were considered instead of course grades (Table 3). CI scores were not significantly different from either exam scores or course grades (Table 3 and Figure 3). Finally, we found that, on average, active-learning strategies benefited women’s performance compared with traditional lecture (Table 3).
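The Holm adjustment applied to the pairwise assessment-type contrasts follows a simple step-down rule, which can be sketched as below. The raw p-values here are hypothetical (the analysis itself relied on the multcomp package in R).

```python
def holm_adjust(pvals):
    """Holm step-down adjustment of raw p-values for multiple comparisons."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Multiply the k-th smallest p-value by (m - k + 1), cap at 1,
        # and enforce monotonicity over the step-down sequence
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted

# Three hypothetical raw p-values for the pairwise contrasts
adjusted = holm_adjust([0.01, 0.04, 0.03])
```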
TABLE 2.

Model estimates, with factors with significant slopes in bold type

Regression coefficient                      Estimate ± SE     p value
Intercept                                   0.273 ± 0.416     0.512
Class size                                  −0.002 ± 0.000    <0.001
Assessment type (reference level: exams)
  Course grade                              0.142 ± 0.040     <0.001
  CI                                        −0.661 ± 1.631    0.685
Pedagogy (reference level: lecture)
  Active                                    0.262 ± 0.089     0.003
TABLE 3.

Pairwise comparison between multileveled assessment type, with pairs with a significant difference in bold type

Comparison            Estimate    SE       z value    Pr(>|z|)
Assessment type
  CI – exams          −0.661      1.631    −0.405     1.000
  Course – exams      0.142       0.040    3.546      0.001
  Course – CI         0.803       1.631    0.492      1.000

DISCUSSION

Across all classes, we did not detect a statistically significant gender gap within biology and chemistry courses. Due to the high degree of heterogeneity within the data, we explored a number of factors that might be associated with our outcomes. We identified three course elements that predicted gender performance differences—class size, assessment type, and pedagogy. We explored how these factors might impact the historic underrepresentation of women in STEM. Specifically, larger courses and high-stakes exams were associated with underperformance of women relative to men in natural science courses. We also found that, relative to traditional lecture, the incorporation of active-learning strategies was associated with higher performance outcomes among women. Surprisingly, we did not observe differences in gender gaps based on whether the classes in question were biology or chemistry, despite the disciplines’ differences in coverage and culture. We discuss the implications of each impactful factor in the following sections.

Class Size

Our results add to a chorus of studies calling for a decrease in class size to promote student learning and performance. Based on our model, an increase in class size from 50 to 250 students increases gender gaps by ∼0.4 SDs. Prior studies note the association of smaller courses with increased student performance (Achilles, 2012; Ballen ), satisfaction with course experience (Cuseo, 2007), and equitable participation (Ballen ). However, large courses remain common in undergraduate studies, especially for introductory-level courses (Matz ). While institutional demands limit the availability of small classrooms (Saiz, 2014), instructors should be aware of this effect and implement strategies to counter some of the depersonalized, didactic, threat-promoting aspects of the large-lecture environment, such as using group work (Springer ; Chaplin, 2009), learning assistants (Knight ), names (Cooper ), humor (Cooper ), in-class formative assessment techniques (Lowry ; Knight ), and strategic use of role models (Schinske ; Yonas ). Because it is unlikely large classes will become smaller any time soon, future research would profit from an explicit focus on the elements of large classes (other than literal class size) that contribute to gaps in performance. Two examples include research that compares the effectiveness of active-learning strategies between small and large classrooms or tests the impact of two different assessment strategies within large classes. Additionally, descriptive work that isolates certain practices unique to and frequently used in large classes, but not in smaller classes, would build a foundational understanding of factors that may hinder or promote subsets of students.
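As a check on the figure above, the ∼0.4 SD estimate follows directly from the class-size coefficient in Table 2:

```python
# Predicted change in the gender gap (in units of Hedges's g) when class
# size grows from 50 to 250 students, using the class-size slope from
# Table 2 (-0.002 per additional student).
slope_per_student = -0.002
small_class, large_class = 50, 250
predicted_change = slope_per_student * (large_class - small_class)
# A negative change means the gap widens against women as class size grows.
```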

Assessment Type

We found exams contributed to gender gaps favoring men in introductory science. Based on our model, focusing on course grades, rather than only exam scores, results in a decrease in gender gaps by ∼0.14 SDs. This supports previous research showing that, while exam scores disadvantage women, other assessments in students’ final course grades contribute to more equitable outcomes (Salehi ). While it is common for courses—especially large, introductory courses—to rely heavily on exams to assess students (Koester ), this approach may not always provide an accurate reflection of students’ knowledge or critical-thinking skills (Martinez, 1999; Dufresne ). It is also unlikely that a student’s exam score is a reflection of that student’s ability to conduct tasks proficiently as a disciplinary scientist. Furthermore, previous research in undergraduate science classrooms shows women are disproportionately affected by test anxiety, leading to lower exam scores (Ballen ; Salehi ). Instructors can promote equity by clearly outlining learning objectives and aligning exam and homework questions (Feldman, 2018) and by integrating affirmation exercises before exams (Miyake ; Harris ). Instructors can lower the sense of risk in exams by allowing students to retake exams (Nijenkamp ; Sullivan, 2017), lowering the stakes of exams (Cotner and Ballen, 2017), or avoiding multiple-choice exams altogether (Stanger-Hall, 2012). The majority of course grades in our sample included exam scores in their calculation; however, course grades also typically incorporate other types of assessments, such as participation, homework, quizzes, or in-class assignments. While some of the assessments included in this study may have incorporated one or more of the recommendations listed, their effects are outside of the scope of this analysis. Future research with an explicit focus on the impact of lowering the stakes of exams will clarify effective methods. 
We found the association of CIs on gender gaps did not differ significantly from other assessment types. CIs are unique, because they probe student understanding of fundamental concepts using systematic classroom assessment techniques (Smith and Tanner, 2010). Because of sample size limitations, we caution readers as they interpret our results, and encourage future work to address the impacts of CIs on performance gaps in more depth.

Pedagogy

Active learning is increasingly implemented in undergraduate classrooms, and for good reason: plenty of research has demonstrated its advantages in regard to improving student grades (Smith ; Freeman ). We show that active-learning practices, as opposed to traditional lecture, increased women’s performance in natural science courses, with our model predicting a decrease in gender gaps by ∼0.26 SDs in active-learning classes compared with traditional lecture. This relationship may hinge on one of the following factors associated with active learning: the development of self-efficacy through scaffolded interactions and consistent, low-stakes assessment (Ballen ); increased sense of belonging through the development of in-group relationships (Eddy ; Eddy and Hogan, 2014); and the use of metacognition to normalize student perceptions of challenges in the course curriculum (Tanner, 2012). However, we encourage readers to interpret these results with caution due to varied implementation of active-learning practices across our categories (see Limitations section).

Limitations

One factor this analysis did not control for was incoming preparation. Due to the format and availability of the data included in the analyses, we focused on raw outcomes, without accounting for any initial differences in performance between men and women when they entered the courses. This is a limitation, because previous work identifies incoming preparation (often in the form of ACT/Scholastic Aptitude Test scores or high school GPA) as a key predictor of a student’s outcome in a course (Lopez ; Rodriguez ). Thus, it is difficult to address the extent of inequality in the classroom without controlling for these differences. We identified published studies primarily through our database search of ERIC, Education Research Complete, and PsycINFO. We chose to focus on these three databases in order to identify a broad range of education papers without pulling a high volume of duplicate studies. We acknowledge that there are other education databases and search engines that we did not explore that may have yielded additional studies. Furthermore, we did not hand search any journals or “snowball” additional papers from studies. However, we believe that our data set is comprehensive and representative because of the high number of studies yielded by the searches we did perform, as well as the high fail-safe n calculated in our checks for publication bias. Our investigations were limited by the fundamental nature of meta-analytic methods, which are based entirely on published or previously collected data. The factors we investigated were chosen based on the general availability of adequate descriptions in the educational research studies we included. Often, descriptions of certain course elements were limited to studies specifically investigating that effect, and some factors that we originally wished to investigate had to be abandoned due to limited data.
For example, we were interested in how institution type may play a role, but we did not possess comprehensive data across all institution types, such as small liberal arts colleges. Additionally, high-stakes exams may be more common in larger courses, so course size may not be the problem, but rather the reliance on high-stakes exams to assess students. Unfortunately, we did not have access in the study sample to examples of large courses that used other types of assessments. Another limitation was our broad categorization of active classrooms versus traditional lecture classrooms. Active learning is broadly defined in the literature: Freeman solicited responses from 338 biology seminar audience members and defined active learning as that which “engages students in the process of learning through activities and/or discussion in class, as opposed to passively listening to an expert. It emphasizes higher-order thinking and often involves group work” (Freeman , pp. 8413–8414). Based on biology education literature (n = 148 articles) and feedback from biology instructors (n = 105 individuals), Driessen defined active learning as “an interactive and engaging process for students that may be implemented through the employment of strategies that involve metacognition, discussion, group work, formative assessment, practicing core competencies, live-action visuals, conceptual course design, worksheets, and/or games” (p. 6). These definitions make clear that what is encompassed under the term “active learning” is extensive. It is used to describe a wide variety of different instructional practices that are infrequently detailed in scholarly publications (Driessen ).
Although some studies have assessed the effects of specific strategies, such as audience response questions (Caldwell, 2007; Smith ; Knight ), group discussions (Miller and Tanner, 2015), case studies (Allen and Tanner, 2005; Miller and Tanner, 2015), and flipped classrooms (Tucker, 2012; van Vliet ; Rahman and Lewis, 2020), among others, our results add urgency to the need to move beyond coarse categorizations of active learning to more fine-grained work, as these distinctions clearly matter to marginalized groups (Thompson ). Our research was limited by the published or instructor-provided descriptions of each course. When descriptions were available, they ranged from highly specific accounts of the class period to simple designations (e.g., "this was an active-learning course" or "a traditional lecture course"). We acknowledge that these categories are not precise and do not fully reflect the range and nuance of what occurs inside each classroom. And while approximately 60% of the courses we included in our analysis were considered active-learning courses, we recognize that, nationally, far fewer classrooms include active learning (Stains ), and published studies on active learning are likely biased toward instructors who are more proficient at implementing it. An instructor's experience with and understanding of how to implement active learning likely affects its effectiveness (Andrews ), meaning that a strategy that works in some classrooms might not show the same effects in others. Finally, we recognize that binary gender is noninclusive language. However, because the gender binary has been heavily relied upon in prior studies, this analysis follows the model laid out in the studies we included, meaning that at this time we cannot address how gender identities outside the binary affect student performance in different settings. We also recognize that gender is not the only identity-related factor that affects student performance.
Many other elements of identity, such as race/ethnicity (Beichner ; Ballen ), socioeconomic status (Haak ), and LGBTQ+ status (Cooper and Brownell, 2016; Henning ), can affect a student's experiences in a course, and it is likely that these factors interact with gendered expectations in ways that produce patterns within certain subgroups that differ from those we report.

Final Remarks

Our results point to multiple ways that instructors and administrators can work to promote equitable outcomes in undergraduate classrooms. Particularly in introductory gateway courses, where students appraise their fit in a field based on performance relative to their peers, reducing class sizes when possible, decreasing reliance on high-stakes exams, and incorporating active-learning strategies into every lecture are possible avenues for promoting equity. By using informed, data-driven solutions, instructors and institutions can create more inclusive classrooms.
References (10 of 51 shown)

1.  Increased structure and active learning reduce the achievement gap in introductory biology.

Authors:  David C Haak; Janneke HilleRisLambers; Emile Pitre; Scott Freeman
Journal:  Science       Date:  2011-06-03       Impact factor: 47.728

2.  Promoting student metacognition.

Authors:  Kimberly D Tanner
Journal:  CBE Life Sci Educ       Date:  2012       Impact factor: 3.325

3.  To be funny or not to be funny: Gender differences in student perceptions of instructor humor in college science courses.

Authors:  Katelyn M Cooper; Taija Hendrix; Michelle D Stephens; Jacqueline M Cala; Kali Mahrer; Anna Krieg; Ashley C M Agloro; Giovani V Badini; M Elizabeth Barnes; Bradley Eledge; Roxann Jones; Edmond C Lemon; Nicholas C Massimo; Annette Martin; Thomas Ruberto; Kailey Simonson; Emily A Webb; Joseph Weaver; Yi Zheng; Sara E Brownell
Journal:  PLoS One       Date:  2018-08-15       Impact factor: 3.240

4.  True Grit: Passion and persistence make an innovative course design work.

Authors:  Anne M Casper; Sarah L Eddy; Scott Freeman
Journal:  PLoS Biol       Date:  2019-07-18       Impact factor: 8.029

5.  In a "Scientist Spotlight" Intervention, Diverse Student Identities Matter.

Authors:  Azariah Yonas; Margaret Sleeth; Sehoya Cotner
Journal:  J Microbiol Biol Educ       Date:  2020-04-10

6.  Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement.

Authors:  David Moher; Alessandro Liberati; Jennifer Tetzlaff; Douglas G Altman
Journal:  PLoS Med       Date:  2009-07-21       Impact factor: 11.069

7.  Analysis of student performance in large-enrollment life science courses.

Authors:  Leah Renée Creech; Ryan D Sweeder
Journal:  CBE Life Sci Educ       Date:  2012       Impact factor: 3.325

8.  Multiple-choice exams: an obstacle for higher-level thinking in introductory science classes.

Authors:  Kathrin F Stanger-Hall
Journal:  CBE Life Sci Educ       Date:  2012       Impact factor: 3.325

9.  A Call for Data-Driven Networks to Address Equity in the Context of Undergraduate Biology.

Authors:  Seth K Thompson; Sadie Hebert; Sara Berk; Rebecca Brunelli; Catherine Creech; Abby Grace Drake; Sheritta Fagbodun; Marcos E Garcia-Ojeda; Carrie Hall; Jordan Harshman; Todd Lamb; Rachael Robnett; Michèle Shuster; Sehoya Cotner; Cissy J Ballen
Journal:  CBE Life Sci Educ       Date:  2020-12       Impact factor: 3.325

10.  Female In-Class Participation and Performance Increase with More Female Peers and/or a Female Instructor in Life Sciences Courses.

Authors:  E G Bailey; R F Greenall; D M Baek; C Morris; N Nelson; T M Quirante; N S Rice; S Rose; K R Williams
Journal:  CBE Life Sci Educ       Date:  2020-09       Impact factor: 3.325

Cited by (3 in total)

1.  The effects of course format, sex, semester, and institution on student performance in an undergraduate animal science course.

Authors:  James R Vinyard; Francisco Peñagaricano; Antonio P Faciola
Journal:  Transl Anim Sci       Date:  2022-01-12

2.  Potential for urban agriculture to support accessible and impactful undergraduate biology education.

Authors:  Adam D Kay; Eric J Chapman; Jelagat D Cheruiyot; Sue Lowery; Susan R Singer; Gaston Small; Anne M Stone; Ray Warthen; Wendy Westbroek
Journal:  Ecol Evol       Date:  2022-03-14       Impact factor: 2.912

3.  Participation and Performance by Gender in Synchronous Online Lectures: Three Unique Case Studies during Emergency Remote Teaching.

Authors:  Sierra C Nichols; Yongyong Y Xia; Mikaylie Parco; Elizabeth G Bailey
Journal:  J Microbiol Biol Educ       Date:  2022-03-28

Coyote Bioscience (Beijing) Co., Ltd. © 2022-2023.