Literature DB >> 35648777

Can the GRE predict valued outcomes? Dropout and writing skill.

Brent Bridgeman, Frederick Cline

Abstract

Graduate school programs that are considering dropping the GRE as an admissions tool often focus on claims that the test is biased and does not predict valued outcomes. This paper addresses the bias issue and provides evidence related to the prediction of valued outcomes. Two studies are included. The first study used data from chemistry (N = 315) and computer engineering (N = 389) programs at a flagship state university and an Ivy League university to demonstrate the ability of the GRE to predict dropout. Dropout prediction for the chemistry programs was both statistically and practically significant for the GRE quantitative (GRE-Q) scores, but not for the verbal or analytical writing scores. In the computer engineering programs, significant dropout prediction by GRE-Q was evident only for domestic students. In the second study, GRE Analytical Writing scores for 217 students were related to writing produced as part of graduate school coursework, and the observed relationships were both practically and statistically significant.

Entities:  

Mesh:

Year:  2022        PMID: 35648777      PMCID: PMC9159619          DOI: 10.1371/journal.pone.0268738

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.752


Introduction

A number of graduate programs that formerly required all applicants to submit scores on the GRE General Test (GRE) have recently dropped that requirement [1]. Two of the most commonly cited reasons for this change are that the GRE is biased and that it does not predict the outcomes that graduate faculty value most, especially completion of PhD programs [2, 3]. The bias claim is based on an inappropriate equating of bias with the demonstration of group differences. In general, measurement instruments that reveal group differences are not said to be biased. A tape measure that shows that adult men are on average taller than adult women is not said to exhibit gender bias, and a thermometer that shows that people with the flu typically have a higher temperature than healthy people is not said to be biased against people with the flu. But an educational test that shows that students who, on average, had fewer educational opportunities and poorer schools get lower scores than more privileged students is sometimes said to be biased. This definition of bias has serious unintended consequences, as it suggests that merely doing away with the test will address the underlying societal problem that the test reveals. But dropping the test will be as effective in more fairly allocating educational resources as destroying thermometers would be in combating the flu. This is not to say that biased test questions cannot exist, but that merely showing a group difference is not evidence of bias in the test. Bias and fairness training for test question writers and statistical tools for uncovering biased questions are still essential in the creation of fair and valid tests [4]. Trained fairness reviewers, including representatives of minority groups, review all GRE questions, and statistical differential item functioning (DIF) procedures are used to identify any test questions that are unusually difficult (or easy) for a particular racial/ethnic or gender group.

The second reason cited for dropping a GRE requirement is that the scores do not predict valued outcomes. Although there is ample evidence that the scores can predict graduate school readiness as indexed by grades in the first one or two years (e.g., [5, 6]), mere grade prediction is of limited value. Faculty would really like to know which applicants are likely to complete their programs and which applicants demonstrate evidence of research skills (e.g., [2, 3]). A number of studies have explored the extent to which GRE scores can predict research publications or program completion. One study was entitled, “The limitations of the GRE in predicting success in biomedical graduate school” [2]. GRE scores were used as one criterion for admission to this program, and the authors noted that GRE Quantitative (GRE-Q) scores were essentially uncorrelated with first author publication count. Another study was entitled, “Multi-institutional study of GRE scores as predictors of STEM PhD degree completion: GRE gets a low mark” [3]. The abstract of this study noted, “Remarkably, GRE scores were significantly higher for men who left than counterparts who completed STEM PhD degrees.” A study of first author publications conducted in the biomedical program at the University of North Carolina-Chapel Hill [7] likewise noted that GRE-Q scores were uncorrelated with first author publications. Initially, these studies may appear to provide a compelling case against the use of GRE scores in graduate admission.
But these studies all share a common problem: they were conducted in highly selective programs in which all admitted and enrolled students had already been screened for the kind of reasoning skills that are assessed by the GRE and undergraduate grades. The only reasonable conclusion from these studies is that among students with strong reasoning skills, other factors will determine who graduates or is a highly productive researcher; these studies cannot support inferences concerning the likely success of students with less developed reasoning abilities. The study that concluded that “GRE scores were significantly higher for men who left than counterparts who completed STEM PhD degrees” [3] was conducted in four flagship state universities. The problem with this conclusion is that virtually no one in these selective universities had low GRE-Q scores. The men who left had average scores of 742 and the completers had average scores of 723 (or about the 65th percentile on the old 200–800 GRE scale). So what does this tell us about the likely success of students with low or mediocre GRE scores, or about the potential value of the GRE? Absolutely nothing.

The study entitled, “The limitations of the GRE in predicting success in biomedical graduate school” [2] relied on data from the highly selective biomedical graduate program at Vanderbilt University Medical School. While the authors correctly noted the lack of a correlation between GRE scores and publication count, none of the students in their sample had low GRE scores. Indeed, none of the students with three or more publications had GRE-Q scores below 550, and half had scores of at least 700. So, while the conclusion of a near zero correlation is correct, it is also true, and arguably much more relevant, that none of the students with three or more publications had low GRE-Q scores. Similarly, another study of first author publications was conducted in the highly selective biomedical program at the University of North Carolina-Chapel Hill [7]. Once again, it was true that GRE-Q scores were uncorrelated with first author publications. But a closer look reveals that the data can tell a different story. All of the students in this select program had high GRE scores; of the students with three or more first author publications, 84% had GRE scores at the 60th percentile or higher, and half scored at the 80th percentile or higher. This leads to the clear conclusion that students with a strong record of first author publications tend to have high GRE-Q scores.

Another recent study [8] had the very provocative title, “Typical physics Ph.D. admissions criteria limit access to underrepresented groups but fail to predict doctoral completion,” but the actual text indicated that completion is difficult to predict from either test scores or undergraduate grades. Nevertheless, the authors concluded that significant associations exist. Using a multivariate logistic model with the 3,692 physics students in their sample, they noted in their abstract, “Significant associations with completion were found for undergraduate GPA in all models and for GRE Quantitative in two of four studied models.” The GRE-Q result is likely a substantial underestimate of the actual predictive power of the GRE because of a number of technical issues in the analysis [9].
Specifically, the abstract of this critique noted, “The paper makes numerous elementary statistics errors, including introduction of unnecessary collider-like stratification bias, variance inflation by collinearity and range restriction, omission of needed data (some subsequently provided), a peculiar choice of null hypothesis on subgroups, blurring the distinction between failure to reject a null and accepting a null, and an extraordinary procedure for radically inflating confidence intervals in a figure.”

It is almost impossible to find studies predicting valued outcomes in programs that have admitted students with low GRE scores and/or other indicators of less developed reasoning skills. Sealy, Saunders, Blume, and Chalkley [10] acknowledged “the typical biases of most GRE investigations of performance where primarily high-achievers on the GRE were admitted” (Abstract). They indicated a further limitation of many of the studies (including their own) with essentially null results relating GRE scores to publications or first author publications. Specifically, they note, “We are well aware that counting papers, either first author or total, has limitations–especially since neither metric captures the quality and/or impact of the publications” (p. 9). Their study followed a small (32-student) cohort of students who were carefully screened for admission on a number of relevant criteria that did not include GRE scores. Statistical tests of relationships with GRE scores are not very meaningful in such a small sample, but it is worth noting that 28 of the 32 students in the program obtained PhD degrees. With careful selection on multiple criteria and intensive mentoring after enrollment, it is certainly true that students with relatively low GRE scores can succeed. But these data can neither support nor refute the possible relevance of GRE scores as part of a holistic review process. That is, strong GRE scores could still boost the admissions chances for a candidate who was slightly lower on some of the other admissions criteria, such as the quality of the undergraduate institution attended.

In addition to program completion, another valued outcome for graduate programs is writing skill. Strong writing skills are required in many graduate courses and in all doctoral programs with a thesis requirement. There is evidence that the GRE Analytical Writing (GRE-AW) test predicts graduate grades across a number of graduate programs. Indeed, in a comprehensive study using data from over 25,000 students from 10 universities in the Florida state system, GRE-AW was a significant predictor of the graduate grade point average across a number of different programs [6]. GRE-AW was frequently a better predictor than either the GRE Verbal (GRE-V) or GRE-Q scores, perhaps surprisingly predicting grades in master’s engineering programs and biomedical PhD programs better than GRE-Q did. Because many factors in addition to writing skill are important in determining the overall grade point average, this study could not provide a direct link between GRE-AW and the writing demands of graduate courses.

Study 1

Successful completion of a graduate program is highly valued and may be the most important criterion for validation of pre-admission scores [e.g., 2, 3]. But any research using this criterion is problematic because students drop out for many reasons that are unrelated to reasoning skills and could not reasonably be expected to be predicted by test scores or undergraduate grades. A survey of students who left graduate school was conducted by researchers at the National Center for Education Statistics [11]. The survey indicated that the top eight reasons for leaving were: change in family status, conflict with job or military, dissatisfied with program, needed to work, personal problems, other financial reasons, taking time off, and other career interests; note that lack of necessary reasoning skills is not on this list. Nevertheless, if test scores are to be used as part of an admissions decision, it is reasonable to investigate whether there is any relationship of scores to program completion.

Materials and methods

We requested data from graduate programs representing a variety of selectivity levels but were ultimately successful in obtaining data from only two universities. GRE scores and program completion data were obtained from four highly selective PhD programs at a large flagship state university and at a highly selective Ivy League university. We understood that finding significant relationships to dropout in highly selective programs would likely be challenging, but even in these selective programs there was some variation in GRE scores, albeit near the top of the score scales. We intended to look at large programs in the social sciences and STEM; specifically, we targeted programs in Chemistry, Electrical and Computer Engineering (ECE), History, and Psychology. The History and Psychology programs were eliminated from the analyses because only a handful of students in these programs left without a degree.

We computed the correlation of GRE scores with dropout (0–1), but because the practical significance of correlations is frequently misunderstood [12, 13], we focused on more intuitive quartile comparisons that contrast the dropout rates of students with bottom quartile GRE scores against those of students with top quartile scores. We omit the two middle quartiles to simplify the tables and focus attention on the contrast of high-scoring and low-scoring examinees. Quartiles were defined within programs within universities, so the bottom quartile in the Chemistry programs is not necessarily the same as the bottom quartile in the Electrical and Computer Engineering programs. For the fields in which there were substantial numbers of students who left without their intended degree (Chemistry and ECE), we noted the number of students at different GRE score levels (25th and 75th percentiles) who left before obtaining the degree and computed logistic regressions predicting the 0–1 outcome (dropped out or stayed) from the three GRE scores (Verbal [GRE-V], GRE-Q, and GRE-AW). We labelled the students who had not dropped out as “stayed,” but note that at the time of the retrospective data collection most of the students who had not dropped out had already attained their PhD degrees, although a few were still enrolled. Students who enrolled in the PhD program but exited with a master’s degree are counted as dropouts from the PhD program. None of the analyses for the GRE-V and GRE-AW scores indicated any significant differences by enrollment status; that is, in these samples GRE-V and GRE-AW were not significant predictors of program completion. Therefore, we focused primarily on the GRE-Q scores.

This study was reviewed and approved by the ETS IRB (Committee for the Prior Review of Research; FWA00003247). The data file contained no personally identifiable information, so individual consent was not required.
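To make the analytic approach concrete, the following minimal sketch shows both the quartile contrast and the logistic regression. This is not the authors' code; it assumes a hypothetical per-student file with columns gre_v, gre_q, gre_aw, and a 0–1 dropout flag.

```python
# Sketch of the Study 1 analyses on a hypothetical per-student data file.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("students.csv")  # hypothetical file: gre_v, gre_q, gre_aw, dropout

# Quartile contrast: dropout rate in the bottom vs. top GRE-Q quartile.
q25, q75 = df["gre_q"].quantile([0.25, 0.75])
lo = df[df["gre_q"] <= q25]
hi = df[df["gre_q"] >= q75]
print(f"LoQ % dropped: {100 * lo['dropout'].mean():.0f}")
print(f"HiQ % dropped: {100 * hi['dropout'].mean():.0f}")

# Logistic regression predicting the 0-1 dropout outcome from all three
# GRE scores, analogous to the maximum likelihood estimates in Tables 2, 4, and 5.
X = sm.add_constant(df[["gre_v", "gre_q", "gre_aw"]])
print(sm.Logit(df["dropout"], X).fit().summary())
```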

Results for chemistry programs

The Chemistry programs in both universities were highly selective, with mean GRE-Q scores of 160 (SD = 5.7) for the 117 students in the flagship state university and 163 (SD = 5.8) for the 198 students in the Ivy League university. The 25th and 75th percentiles were 157 and 164 respectively at one university, and 158 and 167 at the other. In the tables we refer to the 25th percentile scores as “LoQ,” but recall that although these are relatively low scores in these highly selective universities, they are still well above the average for all examinees who took the GRE (among all GRE test takers: Mean = 154, SD = 9.5 [14]). Table 1 presents the Chemistry PhD dropouts by GRE-Q quartile.
Table 1

Chemistry dropouts by GRE-Q quartile.

GRE-Q Quartile              Stayed   Dropped Out   Total   % dropped
HiQ (75th %ile and above)     68          11          79       14
LoQ (25th %ile and below)     55          24          79       30
Although the correlation of GRE-Q scores with 0–1 dropout was “only” -0.18, there were twice as many dropouts in the low GRE-Q group as in the high GRE-Q group. This practically significant difference is also statistically significant: chi-square, with Yates correction for 2x2 tables, is 5.28, p<.03. Although the quartile comparison is dramatic and easily understood, it does not use all of the data. The maximum likelihood estimates from the logistic regression using all GRE scores are in Table 2. Similar to an ordinary least squares regression, the logistic regression provides an estimate of the importance and statistical significance of each predictor in the equation, but is appropriate when the criterion is dichotomous (0–1 dropout or stay).
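The reported chi-square can be reproduced directly from the Table 1 cell counts; a quick check using SciPy, whose chi2_contingency applies the Yates correction to 2x2 tables by default:

```python
# Reproduce the Yates-corrected chi-square for Table 1.
from scipy.stats import chi2_contingency

table1 = [[68, 11],   # HiQ: stayed, dropped out
          [55, 24]]   # LoQ: stayed, dropped out
chi2, p, dof, expected = chi2_contingency(table1)  # Yates correction by default
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")  # chi2 = 5.28, p = 0.022
```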
Table 2

Maximum likelihood estimates for dropout from chemistry programs (N = 315).

Analysis of Maximum Likelihood Estimates
Parameter          DF   Estimate      SE      Wald Chi-Square   Pr > ChiSq
Intercept           1    5.6234     4.7297        1.4136          0.2345
GRE Verbal          1    0.0345     0.0267        1.6678          0.1966
GRE Quantitative    1   -0.0766     0.0246        9.6965          0.0018
GRE Writing         1    0.0130     0.2339        0.0031          0.9558

Note.—DF is degrees of freedom; SE is standard error; Pr>ChiSq indicates statistical significance of the Chi-Square.

The Wald chi-square tests the statistical significance of each of the three GRE scores in the prediction model. Again, the GRE-Q is shown to be a significant predictor of dropout (p<.002). Note that the sign of the GRE-Q estimate is negative because dropout was coded as 1 and stay was coded as 0, so a negative coefficient indicates that students with low GRE-Q scores were more likely to drop out.
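To give the estimates a practical interpretation, the GRE-Q coefficient can be converted to odds ratios. This is an illustrative reading of Table 2 added here, not a computation reported by the authors:

```python
# Convert the Table 2 GRE-Q coefficient (-0.0766) into odds ratios for dropout.
import math

b_q = -0.0766                      # change in log-odds of dropout per GRE-Q point
per_point = math.exp(-b_q)         # odds multiplier per 1-point *decrease*
seven_points = math.exp(-b_q * 7)  # LoQ-to-HiQ cutoff gap of about 7 points
print(f"odds x {per_point:.2f} per point lower")       # ~1.08
print(f"odds x {seven_points:.2f} per 7 points lower")  # ~1.71
```

On this reading, each point lower on GRE-Q multiplies the estimated odds of dropout by about 1.08, holding GRE-V and GRE-AW constant, which points in the same direction as the quartile contrast in Table 1.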

Results for Electrical and Computer Engineering (ECE) programs

The ECE programs in both universities were highly selective, with mean GRE-Q scores of 166 (SD = 4.0) for the 233 students in the flagship state university and 166 (SD = 4.1) for the 156 students in the Ivy League university. The mean scores at both universities were within 4 points of the maximum (170). The 25th and 75th percentiles were 164 and 170 respectively at both universities. Note that because of ties, the 75th percentile could also be the highest score possible. Table 3 presents the ECE PhD dropouts by GRE-Q quartile.
Table 3

ECE dropouts by GRE-Q quartile.

GRE-Q Quartile              Stayed   Dropped Out   Total   % dropped
HiQ (75th %ile and above)     74          23          97       24
LoQ (25th %ile and below)     61          36          97       37
There were more dropouts in the “low” GRE group, but this difference fell short of the conventional standard for statistical significance (chi-square with Yates correction = 3.51, p = .06). As indicated in Table 4, including all GRE scores (not just the high and low extremes) in the logistic regressions presents an even weaker case for the utility of GRE scores in predicting dropout.
Table 4

Maximum likelihood estimates for drop out from ECE programs (N = 389).

Analysis of Maximum Likelihood Estimates
Parameter          DF   Estimate      SE      Wald Chi-Square   Pr > ChiSq
Intercept           1    3.4207     4.9918        0.4696          0.4932
GRE Verbal          1    0.0080     0.0211        0.1438          0.7045
GRE Quantitative    1   -0.0315     0.0274        1.3153          0.2514
GRE Writing         1   -0.0798     0.1993        0.1604          0.6888
But what this table may actually demonstrate is the folly of trying to make predictions from scores that are clustered at the top of the scale. In one school the median score was 168 (2 points from the top of the 130–170 scale), and at the other school the median was 166. At both schools, the 75th percentile score was the highest score possible (170). At both schools, the domestic population (U.S. citizens and permanent residents) had somewhat lower GRE-Q scores than the international population, but still had very high scores relative to the 154 average for all test takers; the mean score for domestic students was 164 and the mean for international students was 167. The standard deviation of about 4 in both schools was substantially below the 9.5 standard deviation in the total testing population [14]. As indicated in Table 5, within this slightly lower scoring domestic population a significant relationship of GRE-Q scores to dropout emerged in the logistic regression analysis (p<.03).
Table 5

Maximum likelihood estimates for domestic dropout from ECE programs (N = 110).

Analysis of Maximum Likelihood Estimates
Parameter          DF   Estimate      SE      Wald Chi-Square   Pr > ChiSq
Intercept           1   21.3634     9.6258        4.9257          0.0265
GRE Verbal          1   -0.0385     0.0423        0.8317          0.3618
GRE Quantitative    1   -0.1112     0.0493        5.0878          0.0241
GRE Writing         1    0.5268     0.3525        2.2342          0.1350

Study 1 conclusions

Results from the chemistry programs clearly indicate that GRE-Q scores can be effective in identifying students with a higher likelihood of dropping out: there were twice as many dropouts in the bottom GRE-Q quartile as in the top quartile. Results for the ECE programs were more ambiguous given the near-ceiling GRE scores of many of these students. Although some correction for the restricted range of the predictor scores is possible in correlational studies (e.g., Kuncel et al. [5]), the correction depends on fitting a regression line based on data that can be very sparse when the restriction is as severe as it was in this study. When the slightly lower (but still very high) scores of the domestic students were analyzed separately, the ability of GRE-Q scores to predict dropout was again demonstrated. These tendencies should not be confused with destiny, as many students with relatively low scores completed their degrees, just as some students with relatively high scores dropped out. The main difficulty with these analyses is that they cannot speak to the fate of students with average or below-average GRE scores, as such students simply do not exist in this dataset.
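For reference, the kind of range restriction correction alluded to above (Thorndike's Case II formula for direct selection on the predictor, the correction used in meta-analytic work such as Kuncel et al. [5]) can be sketched as follows. The numerical illustration uses Study 1's chemistry figures and is not a result reported in this paper:

```python
# Thorndike Case II correction for direct range restriction on the predictor.
import math

def correct_r(r_restricted: float, sd_sample: float, sd_population: float) -> float:
    """Estimate the unrestricted correlation from the restricted one."""
    u = sd_population / sd_sample
    return (r_restricted * u) / math.sqrt(1 + r_restricted**2 * (u**2 - 1))

# Illustration: chemistry r = -0.18, sample SD ~5.8, population SD 9.5 [14].
print(f"{correct_r(-0.18, 5.8, 9.5):.2f}")  # ~ -0.29
```

As the text notes, such corrections assume the restricted data still trace the regression line well, an assumption that is doubtful when restriction is this severe.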

Study 2

The ability to write logically, clearly, and grammatically is essential in any graduate program and for career success after graduate school. Writing actually done as part of graduate coursework assignments therefore appears to be an important validity criterion for pre-admissions test scores. An initial attempt to obtain data on actual student writing in graduate courses involved contacting students in 15 universities that represented different levels of selectivity from a broad geographical spectrum. This effort was only minimally successful, so a second strategy entailed direct e-mail contact with students who had taken the GRE. Students from over 100 different universities responded to this effort.

Study participants were asked to submit the two most recent examples of writing that they had done in their graduate courses. We asked that the submitted papers be word-processed and approximately ten pages or fewer in length. Participants were permitted to send essays, term papers, book reports, and proposals, for example, but not very brief documents such as poems or other papers that did not contain fairly extended discourse (e.g., papers that consisted primarily of equations). Finally, study participants were encouraged (but not required) to send samples of their writing in which they (a) considered various perspectives and viewpoints or (b) constructed or analyzed arguments. This study was reviewed and approved by the ETS IRB (Committee for the Prior Review of Research; FWA00003247). Written (e-mail) consent was obtained from all participants.

Although grades assigned by professors constitute a readily usable criterion for course-related samples, they are, in all likelihood, based on widely different standards for each professor. Therefore, the approach taken followed procedures used in a previous study of student writing in realistic classroom contexts, and the scoring approach also closely matched the procedures in that study [15]. Student writing samples were evaluated according to a common set of criteria that could be used across submissions from the diverse set of graduate institutions. The criteria that we applied were developed by four external experts in writing instruction/assessment. These criteria incorporate some of the scoring criteria from the GRE Analytical Writing measure for the “Issue” and “Argument” prompts. The Issue prompts ask the examinee to create an argument while the Argument prompts ask the examinee to critique an argument presented in the prompt. For the current study these holistic guides were expanded in order to reflect a concept of critical thinking that was characterized by one of the experts as being indicative of “scholarly habits of mind.”

The writing was scored on a six-point rubric. The description for a “strong” essay (a score of 5, one point below the top score of 6) on this rubric is: A 5 paper displays a generally thoughtful, well-developed treatment of the subject/topic and demonstrates strong control of the elements of writing. A typical paper in this category
- discusses ideas or phenomena in some depth through analysis, synthesis, and/or persuasive reasoning
- develops and supports main points with logical reasons, examples, and/or details
- provides a generally well-focused, well-organized presentation, connecting ideas with clear transitions
- expresses ideas and information clearly, using language and varied sentence structure appropriate for the paper’s context and content
- demonstrates facility with the conventions (i.e., grammar, usage, and mechanics) of standard written English, but may have occasional flaws

Twelve college and university faculty members (all teachers and/or experienced evaluators of writing) evaluated all of the writing samples that were submitted. They were trained with a benchmark set of exemplary papers to represent each score level and a rangefinder set that spanned various disciplines within each set. Of the course-related writing samples submitted by study participants, a total of 434 (two from each participant) were deemed to be scorable for the purposes of the study. As a collection, the samples were extremely varied with respect to numerous dimensions, including but not limited to (a) length, (b) content, (c) purpose, and (d) the conditions under which they were written. Submissions included literature reviews, critical analyses, mid-term essays, take-home examinations, critiques/evaluations, biographies, summaries, and so forth. Some appeared to be the result of semester-long efforts, while others seemed to be only one of numerous similar assignments required during an academic period.

The wide variety of content found in the submissions is perhaps best illustrated simply by mentioning a few titles:
- “Abundance and sensitivities of the Eastern mud snail, Ilyanassa obsoleta, throughout the intertidal zone in Charleston, South Carolina”
- “Public relations: More than just press releases”
- “Living with severe mental illness: What families and friends must know: Evaluation of a one-day psychoeducation workshop”
- “The cognitive etiology of body dysphoric disorder (excessive concern over one’s perceived physical flaws)”

With regard to length, submissions included:
- a 225-word application to participate in a summer math/statistics workshop
- a 2,600-word essay on “How Netflix’s business model changed to meet the demands of the consumer”
- a 5,000-word “numerical investigation” entitled “Optimization of a tandem blade configuration in an axial compressor”
- a 6,000-word “sociological perspective” on the Democratic People’s Republic of Korea entitled “Can North Koreans speak?”

In order to establish rater reliability for the course-required writing tasks, 214 of the 434 writing tasks submitted were randomly selected to be read independently by two readers. For each essay, the score from the first rater (randomly selected from the pool of 12 raters) was correlated with the score from the second randomly selected rater. The inter-reader correlation for a single essay was .70. The score used as the criterion was the average of the scores on both writing tasks, which had a task reliability estimate of .54. This estimate is conservative because there is no expectation that the two writing tasks submitted should be strictly parallel.
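These reliability figures can be connected by standard formulas. The sketch below is an illustration under the assumption that the two-task criterion reliability follows the Spearman-Brown relation; the paper does not spell out its exact computation, and the between-task correlation of roughly .37 is inferred here, not reported:

```python
# Illustrative reliability formulas for the writing criterion.
import numpy as np

def interrater_r(rater1_scores, rater2_scores):
    """Pearson correlation between two independent ratings of the same essays;
    the reported inter-reader correlation for a single essay was .70."""
    return np.corrcoef(rater1_scores, rater2_scores)[0, 1]

def spearman_brown(r_between, k=2):
    """Reliability of the average of k parallel measures, given the
    correlation between single measures."""
    return k * r_between / (1 + (k - 1) * r_between)

# A two-task average reliability of .54 corresponds, under this formula,
# to a between-task correlation of roughly .37 (an inferred value).
print(f"{spearman_brown(0.37):.2f}")  # 0.54
```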

Results and discussion

Correlations of the scores on the class writing samples with GRE scores are presented in Table 6.
Table 6

Correlation of GRE scores with scores from course-required writing samples.

Scores            Mean    SD    Correlation with Class Writing Samples
GRE-V              155     8                   0.37*
GRE-Q              152     8                   0.04
GRE-AW             4.2   0.7                   0.35*
Course Writing       4   0.9                    --

Note. N = 217; * p<.00001

Because of the diversity in GRE scores of the students submitting writing samples, the standard deviations in the sample were only slightly lower than the standard deviations in the population (9, 9, and 0.9 for V, Q, and AW respectively [14]), so corrections for range restriction were not necessary. Not surprisingly, GRE-Q scores were not correlated with this writing criterion. An alternative way of looking at the same data, as shown in Table 7 (and as we did in Study 1), is to note the percent of students who are low on the predictor and high on the criterion, or vice versa.
Table 7

Percent of students with relatively high or low scores on course-required writing by GRE-AW score categories.

Course-required Writing    GRE-AW Low (3.5 and below)   GRE-AW High (5.0 and above)
High (5.0 and above)                4 (4%)                       14 (18%)
Low (3.5 and below)                29 (29%)                       4 (5%)
Total n                               99                             80

Note.—To simplify the table, students scoring between 3.5 and 5.0 were omitted.

Among students with low GRE-AW scores, there are seven times as many students with low scores on the criterion as with high scores on the criterion. And among students with high GRE-AW scores there are more than three times as many students with high criterion scores as with low criterion scores. Chi-square, with Yates correction, is 19.2, p<.00001. These analyses demonstrate that GRE-AW scores are both statistically and practically significant indicators of writing skills in actual samples from graduate courses.
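As with Study 1, the reported chi-square can be checked directly from the four extreme cells of Table 7:

```python
# Reproduce the Yates-corrected chi-square for Table 7's extreme cells.
from scipy.stats import chi2_contingency

table7 = [[4, 14],    # high criterion score: GRE-AW low, GRE-AW high
          [29, 4]]    # low criterion score:  GRE-AW low, GRE-AW high
chi2, p, dof, expected = chi2_contingency(table7)
print(f"chi2 = {chi2:.1f}, p = {p:.6f}")  # chi2 = 19.2, p ~ 1e-5
```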

Conclusion

The two studies summarized here clearly demonstrate that even in highly selective programs GRE scores can indeed predict meaningful criteria that go beyond graduate GPA. Additional research is needed to better understand the role of GRE scores in less selective programs, and to evaluate possible differential effects in racial/ethnic and gender groups that could not be evaluated in this study because of the limited sample sizes for these groups. More elaborate regression models, or random tree models, that account for additional predictors or covariates such as undergraduate grades or socioeconomic status should also be considered, but note that such variables are often difficult or impossible to interpret in populations with large numbers of international students, whose undergraduate grades are on different scales and whose socioeconomic indicators may have different meanings internationally. Future research should also provide more detailed analyses related to the prediction of writing skills. Specifically, the match between the rater’s area of expertise and the assigned writing task should be explored, and with a larger sample, analyses within specific program areas should be feasible.

But predicting dropout or writing ability in meaningful classroom contexts is only part of the story. A critic of tests such as the GRE could argue that placing too much emphasis on a single predictor could be detrimental to enrolling a diverse class. And we agree. But overreliance on a test score should not be confused with totally ignoring test scores. Without test scores, too much emphasis might be placed on other criteria, especially the undergraduate institution attended. Although accepting students from only top-ranked universities may help with enrolling a qualified class, it would be detrimental to enrolling a diverse class. This is because of a well-known problem with undermatch for many students from underrepresented minority groups. That is, because of financial or family considerations many minority students do not attend selective undergraduate schools for which they are fully qualified, as indexed by SAT scores or high school grades [16]. Institutions that dropped test scores and focused primarily on the undergraduate institution attended would miss these students. For these students, a GRE score could be an important opportunity, and possibly the only opportunity, to convincingly demonstrate their readiness for graduate school. With ample evidence available on the value of enrolling a diverse array of students [17], it would be unfortunate to ignore any measures that could help with this effort. While it is certainly wise to guard against too much reliance on test scores, it would be unwise to ignore scores that are related to meaningful indicators of success in graduate school and that may be the only way for some students to open the door to a graduate education.
PONE-D-21-06540
Does the GRE General Test predict more than just first year graduate GPA? PLOS ONE Dear Dr. Bridgeman, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Although the two reviewers acknowledged the merits of this paper, it still needs improvement in various aspects, such as establishing reliability, strengthening argument and explicating the context of each study. Please submit your revised manuscript by Jan 01 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript: A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols. We look forward to receiving your revised manuscript. Kind regards, Mingming Zhou, Ph.D. Academic Editor PLOS ONE Journal Requirements: When submitting your revision, we need you to address these additional requirements. 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf 2. Please consider changing the title so as to meet our title format requirement (https://journals.plos.org/plosone/s/submission-guidelines). In particular, the title should be "Specific, descriptive, concise, and comprehensible to readers outside the field" and in this case it is not informative and specific about your study's scope, methodology, or its findings. 3. Thank you for including your ethics statement:  "Ethics statement for study 1: This study was reviewed and approved by the ETS IRB (FWA00003247). 
The data file contained no personally identifiable information so individual consent was not required. For study 2:This study was reviewed and approved by the ETS IRB (FWA00003247). Written (e-mail) consent was obtained from all participants.". Please amend your current ethics statement to include the full name of the ethics committee/institutional review board(s) that approved your specific study. Once you have amended this/these statement(s) in the Methods section of the manuscript, please add the same text to the “Ethics Statement” field of the submission form (via “Edit Submission”). For additional information about PLOS ONE ethical requirements for human subjects research, please refer to http://journals.plos.org/plosone/s/submission-guidelines#loc-human-subjects-research. 4. Thank you for stating the following in the Competing Interests section: "I have read the journal's policy and the authors of this manuscript have the following competing interests:  Financial support for this research was provided by Educational Testing Service which encourages the independence of researchers; nothing in this report represents an official position of ETS."" Please confirm that this does not alter your adherence to all PLOS ONE policies on sharing data and materials, by including the following statement: "This does not alter our adherence to  PLOS ONE policies on sharing data and materials.” (as detailed online in our guide for authors http://journals.plos.org/plosone/s/competing-interests).  If there are restrictions on sharing of data and/or materials, please state these. Please note that we cannot proceed with consideration of your article until this information has been declared. Please include your updated Competing Interests statement in your cover letter; we will change the online submission form on your behalf. 5. We note that you have indicated that data from this study are available upon request. PLOS only allows data to be available upon request if there are legal or ethical restrictions on sharing data publicly. For more information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. In your revised cover letter, please address the following prompts: a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially sensitive information, data are owned by a third-party organization, etc.) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent. b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings as either Supporting Information files or to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. We will update your Data Availability statement on your behalf to reflect the information you provide. [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. 
Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes Reviewer #2: Partly ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes Reviewer #2: Yes ********** 3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: Overall, this is a clear, concise, and relevant report presenting two studies. The first explored the utility of the GRE in predicting drop out of students in highly selective graduate programs in chemistry and computer engineering. The second explored the relationship between student GRE scores and writing ability in graduate school. In my opinion, this is a valuable contribution to the literature. I recommend several modest clarifications/additions below. p. 2 – Last Paragraph and the first paragraph of page 3 discuss the difference between bias and demonstration of group differences. This discussion is both valuable and relevant. Of course, this does not establish that bias is not present in the GRE, only that observed group differences are not necessarily owing to bias. This discussion would be strengthened if the authors were able to briefly discuss the way that the development of the GRE minimizes the possibility of bias and/or share the results of studies establishing that group differences in GRE scores are not likely owing to bias. p. 11 -- Study 1 Conclusions discuss the fact that these analyses “cannot speak to the fate of students with average or below GRE scores” because such students don’t exist in the dataset. While you did not use this analytic approach, it may be worth mentioning in the discussion that others such as Donald E. Powers 2004 paper in Journal of Applied Psychology: “Validity of Graduate Record Examinations (GRE) General Test Scores for Admissions to Colleges of Veterinary Medicine” have seen effects of range restriction for GRE scores. p. 15. – I believe that a little more detail would be helpful to describe the process for establishing rater reliability. 
It appears that 214 of the 434 writing tasks were read by two readers (presumably 2 who were randomly paired from the pool of 12, and, I imagine, a separately randomly-assigned pair in each case, but that needs to be clarified, please. Also, while perhaps it should be, it is not clear to me how the inter-rater correlation was derived. (0.70). Please clarify. Conclusion: The introduction and conclusion of the paper both discuss one of the current primary criticisms of the GRE – that it may show bias in favor of men and certain racial groups. The paper addresses those criticisms by providing validity evidence for the GRE, and arguing that if the GRE is abandoned, schools will be left to rely on fewer admissions criteria (such as selectivity of the undergraduate institution) which are, themselves likely to produce biased admissions results. I personally find this argument compelling. However, the paper does not directly compare/report on potential (or evident) bias in the data it examines. I understand that, if group differences were discovered, it would be difficult or impossible to establish whether they were owing to bias or existing sub-group differences. Nonetheless, I would be interested in the authors discussing why they chose not to compare group differences, given the purpose of the paper. Reviewer #2: --- The manuscript “Does the GRE General Test predict more than just first year graduate GPA” analyzes the extent to which different components of the GRE are predictive of PhD completion (Study 1) and future quality of graduate writing (Study 2). Study 1 utilizes GRE score and completion data from 4 PhD programs (History, Psych, Chemistry, and ECE) at 2 institutions (a large state school and an Ivy league university). However, only the data from Chemistry and ECE are analyzed. The authors first split the GRE scores into quartiles, and removed the middle two in order to create two groups of GRE scorers, “high” and “low.” Subsequent Chi-square tests revealed a statistically significant association between GRE-Q score category (“low” or “high”) and PhD completion for Chemistry majors but not ECE majors. The authors also perform a logistic regression using GRE-V, GRE-Q, and GRE-AW as independent variables and PhD completion as the dependent variable. Again, GRE-Q is found to be statistically significant for Chemistry majors but not ECE majors. However, among US ECE majors only, a significant relationship exists between GRE-Q and completion. In all of Study 1’s analyses, GRE-V and GRE-AW are not significantly associated with completion at the 0.05 level. Study 2 utilizes writing samples from students “from over 100 different universities” to analyze whether the GRE is associated with quality of graduate writing. Four “external experts in writing instruction/assessment” generated a common set of criteria by which to grade the samples. 12 college and university faculty then evaluated the writing samples on a six-point rubric. Correlations between the scored samples and the students’ GRE-V scores, as well as the students’ GRE-AW scores, were statistically significant. From Study 1 and Study 2, the authors conclude that GRE scores can “predict meaningful criteria that go beyond graduate GPA.” --- The issues surrounding the use of GRE scores in graduate admissions processes are both topical and extremely important to the academic community, making this manuscript a valuable contribution to the ongoing discourse in the literature. 
Indeed, as noted in the manuscript, studies predicting valued outcomes in PhD programs admitted with a broad array of GRE scores are rare. The data presented in Study 2 is especially unique and should provide a tremendous starting point for similar studies in the future. The statistical analyses performed are reasonable for the data collected and are executed well. However, I believe the manuscript would be strengthened by the consideration of several edits/additions by the authors. Notably, the issues outlined in the “Overall Conclusions” section are the primary reason for indicating that the data “Partly” support the conclusions. Literature review - 1) The authors appropriately note that studies of GRE scores often do not address issues related to range restriction. However, they use phrases like “low” and “mediocre” GRE scores to describe the ranges of students “missing” from these datasets, but provide no context for what is actually considered “low.” For instance, noting that the average GRE-Q score of completers in [3] was 723 means little without knowing where that sits in context (a quick search of ETS indicates that’s around a 65th percentile, which some faculty likely would not consider a “high” GRE-Q score). The issue of keeping track of which studies have larger ranges of scores than others is further muddled by the fact that the authors use the new scoring scale in their analysis, but use the old scale in the literature review. How does the data in this manuscript compare to previous studies, which the manuscript describes as saying “absolutely nothing” about the value of the GRE? If anything, a limited range in Study 1 would actually serve to bolster the correlation of GRE-Q and completion that the authors find, but I think that point is lost. Perhaps a table summarizing all of the data, with percentiles and old/new score conversions, would be effective. 2) A reference on range restriction (e.g. https://arxiv.org/abs/1709.02895) might help readers unfamiliar with these issues. 3) Authors should consider citing Kuncel et al. (2001), which actually corrects for range restriction issues and is often cited on the ETS website, but finds no significant correlation between GRE-Q (or -V) and PhD completion in their meta-analysis. How do the authors reconcile this result with their current manuscript? 4) A follow-up to Ref [8] (https://doi.org/10.1103/PhysRevPhysEducRes.17.020115), in general, finds the same basic conclusions as this manuscript (that GRE-Q is correlated with PhD completion), and addresses several concerns laid out in [9]. As such it is likely worth citing. 5) Lastly, the literature review is entirely dedicated to studies most related to Study 1, in particular the existing literature surrounding GRE-Q. Something needs to be said about Study 2 in order to frame it within a larger context. Has there been previous work on standardized writing (on SATs? APs?) and success? Study 1 - 6) The authors must either discuss in more detail the data collected on the History and Psychology programs or entirely remove the comment referring to their elimination from the study. Presumably the History majors had the largest spread of GRE-Q scores (as a non-STEM field), yet they all graduated anyway? Was this just a small N problem or are we seeing essentially no correlation between GRE and completion within history majors (since they’re all graduating independent of score)? This deserves further comment. 
7) While it is understandable to want “intuitive” results, some readers will immediately note that by splitting data into quartiles, the authors are “throwing away” some information in the data. Consider including both the correlation and 2x2 tables even if some readers may "misinterpret" correlations. 8) With regard to interpretability, the audience of faculty on admissions committees for whom this paper would most likely be targeted may be unfamiliar with logistic regression using independent variables that have been split into quartiles. A brief comment on what the estimates in Tables 2/4/5 mean in terms of practical score changes and probability of completion would be helpful. Study 2 - 9) As mentioned in comment 5), Study 2 lacks the same context as Study 1. As it stands, essentially the entire discussion in the introduction is dedicated to the research landscape as it pertains to GRE-Q, with little to no mention of GRE-V or writing. Hence Study 2 feels only loosely connected to the rest of the paper 10) In the same spirit as the previous comment, Study 2’s methods and data should be discussed in more detail. For example, - How many students were included? What are the demographics of the students who submitted writing samples? What are their majors? What can the study say about majors not included? - What are the demographics of the faculty who graded the responses? Do these align, and does it matter? - For instance, the rubric appears to be constructed so that faculty from any discipline can grade papers from any other discipline (ie, a psychology faculty member might grade a history paper and evaluate its issue and argument) - is this justified? Can a faculty member determine whether an argument is good when it is out of disciplinary context? Perhaps, but this decision is not discussed. What information was lost by stripping disciplinary context? - One possible avenue to explore this is to examine the scores given by the students’ instructors in comparison to those given by the independent graders. - Have previous studies used a similar methodology to evaluate written work? - What do the distributions of scores look like on the graded papers? What about the original grades? What do these say about the sample of students? - Were there specific requirements for a paper to be “deemed scorable” by the scorers? This list is surely non-exhaustive, and I believe there is much room for the authors to elaborate on how this very interesting data was collected and analyzed. Overall conclusions - This manuscript presents two studies supporting two separate but related claims (as written in the abstract): one is that this manuscript “demonstrate(s) the ability of the GRE to predict dropout” and the other is that this manuscript “shows the relationship of the GRE Analytical Writing scores to writing produced as part of graduate school coursework.” The first claim, that “the GRE” predicts dropout, is only partly supported by the data, and lacks important context. First, as made clear by the manuscript’s focus on the GRE-Q, the authors only find evidence that the quantitative part of the exam correlates with completion - little can be said about the GRE-V. Since students have to take (and pay for) both, the authors should distinguish which parts of the exam they found to be significantly associated with completion, and be consistent about this throughout the manuscript. Amending the abstract to include details of the study (e.g. 
4 programs studied at 2 universities, N = students, GRE-Q found significantly associated with completion for one of the programs) avoids any possible obfuscation of the findings. More importantly however, the authors should clarify the implications of their findings. As noted earlier, several other papers have found a correlation between GRE-Q and PhD completion in a STEM discipline. But does finding a correlation between GRE-Q and completion automatically justify its use in admissions? It is in this area where the authors must elaborate. This manuscript notes that the most frequently cited reasons for dropping out of graduate school are known to be completely non-academic: financial issues, family, etc. It also notes that different groups are known to score differently on the GRE exam. Could it be that the GRE is simply picking up on some socioeconomic conditions related to, as the authors say “fewer educational opportunities and poorer schools,” which then make it more difficult to persist in graduate school when tempted with higher paying jobs or confronted with family emergency? Or is the GRE really picking up on reasoning skills that faculty value and help students persist? The authors provide no evidence that this is indeed the case, and actually provide a more compelling argument for the former interpretation. Put another way, why should faculty believe that a students’ ability to do high school math questions make them more able to complete a PhD? Or are other variables confounding this relationship? Does it matter, so long as the test is indicating whether students will drop out? I suspect critics of the GRE would wonder why admissions committees should force students to pay to take a test that largely measures the socioeconomic status of the test takers. Lastly (again related to the fact that Study 2 is largely ignored throughout the introduction), the authors must reconcile statements such as “Although there is ample evidence that the scores can predict graduate school readiness as indexed by grades in the first one or two years (e.g., [5,6]) mere grade prediction is of limited value.” with the fact that Study 2 uses writing samples produced in graduate coursework. Is that not merely a prediction of graduate grades? Does this really link to valued outcomes such as “research ability” and “completion”? The authors tenuously make this link in the first sentence of Study 2, but they must make this link clearer in order for Study 2 to be used in support of the paper’s overall conclusion that the GRE predicts more than grades. ********** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: Yes: Jared A. Danielson Reviewer #2: No [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] 
While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

25 Jan 2022

Response to reviewers

Reviewer 1

p. 2 – …This discussion would be strengthened if the authors were able to briefly discuss the way that the development of the GRE minimizes the possibility of bias and/or share the results of studies establishing that group differences in GRE scores are not likely owing to bias.

Added: "Trained fairness reviewers, including representatives of minority groups, review all GRE questions and statistical differential item functioning (DIF) procedures are used to identify any test questions that are unusually difficult (or easy) for a particular racial/ethnic or gender group."

p. 11 – …While you did not use this analytic approach, it may be worth mentioning in the discussion that others, such as Donald E. Powers' 2004 paper in the Journal of Applied Psychology, "Validity of Graduate Record Examinations (GRE) General Test Scores for Admissions to Colleges of Veterinary Medicine," have seen effects of range restriction for GRE scores.

I now mention range restriction in the context of the Kuncel et al. study that was mentioned in the introduction, but note that the correction can be problematic when the restriction is severe. Although Powers (2004) is a good study, it is not as relevant because the restriction on both predictors and criterion was modest in the sample of veterinary schools. Added: "Although some correction for the restricted range of the predictor scores is possible in correlational studies (e.g., Kuncel et al. [6]), the correction depends on fitting a regression line based on data that can be very sparse when the restriction is as severe as it was in this study."

p. 15 – I believe that a little more detail would be helpful to describe the process for establishing rater reliability. It appears that 214 of the 434 writing tasks were read by two readers (presumably 2 who were randomly paired from the pool of 12, and, I imagine, a separately randomly assigned pair in each case), but that needs to be clarified, please. Also, while perhaps it should be, it is not clear to me how the inter-rater correlation (0.70) was derived. Please clarify.

Added: "For each essay, the score from the first rater (randomly selected from the pool of 12 raters) was correlated with the score from the second randomly selected rater."

Conclusion: …Nonetheless, I would be interested in the authors discussing why they chose not to compare group differences, given the purpose of the paper.

Sample sizes were too small to compare group differences.
Added in the discussion: "Additional research is needed to better understand the role of GRE scores in less selective programs, and to evaluate possible differential effects in racial/ethnic and gender groups that could not be evaluated in this study because of the limited sample sizes for these groups."

Reviewer 2

1) …For instance, noting that the average GRE-Q score of completers in [3] was 723 means little without knowing where that sits in context (a quick search of ETS indicates that's around the 65th percentile, which some faculty likely would not consider a "high" GRE-Q score).

Added that 723 is about the 65th percentile. Also toned down the comment on "high" scores on the GRE to say simply "none of the students with three or more publications had low GRE-Q scores."

2) A reference on range restriction (e.g. https://arxiv.org/abs/1709.02895) might help readers unfamiliar with these issues.

Referred the reader back to the Kuncel study mentioned in the introduction in the context of range restriction.

3) Authors should consider citing Kuncel et al. (2001), which actually corrects for range restriction issues and is often cited on the ETS website, but finds no significant correlation between GRE-Q (or -V) and PhD completion in their meta-analysis. How do the authors reconcile this result with their current manuscript?

Kuncel was cited in the introduction, but it does not deserve too much attention. Many of the studies in the meta-analysis are over 30 years old, from a time when the test itself as well as the populations taking it were quite different from what they are today. This is why the current study was needed.

4) A follow-up to Ref [8] (https://doi.org/10.1103/PhysRevPhysEducRes.17.020115), in general, finds the same basic conclusions as this manuscript (that GRE-Q is correlated with PhD completion), and addresses several concerns laid out in [9]. As such it is likely worth citing.

The follow-up states, "Students' undergraduate GPA (UGPA) and GRE Physics (GRE-P) scores are small but statistically significant predictors of graduate course grades, while GRE quantitative and GRE verbal scores are not," but corrects only a few of the problems outlined in [9], so it does not seem to be worth citing.

5) Lastly, the literature review is entirely dedicated to studies most related to Study 1, in particular the existing literature surrounding GRE-Q. Something needs to be said about Study 2 in order to frame it within a larger context. Has there been previous work on standardized writing (on SATs? APs?) and success?

Added: "In addition to program completion, another valued outcome for graduate programs is writing skill. Strong writing skills are required in many graduate courses and in all doctoral programs with a thesis requirement. There is evidence that the GRE Analytical Writing test predicts graduate grades across a number of graduate programs. Indeed, in a comprehensive study using data from over 25,000 students from 10 universities in the Florida state system the GRE Analytical Writing (GRE-AW) test was a significant predictor of the graduate grade point average across a number of different programs [6]. GRE-AW was frequently a better predictor than either the GRE-V or GRE-Q scores, perhaps surprisingly predicting grades in master's engineering programs and biomedical PhD programs better than predictions from GRE-Q.
Because many factors in addition to writing skill are important in determining the overall grade point average, this study could not provide a direct link between GRE-AW and writing demands in graduate courses."

6) The authors must either discuss in more detail the data collected on the History and Psychology programs or entirely remove the comment referring to their elimination from the study. Presumably the History majors had the largest spread of GRE-Q scores (as a non-STEM field), yet they all graduated anyway? Was this just a small-N problem, or are we seeing essentially no correlation between GRE and completion within history majors (since they're all graduating independent of score)? This deserves further comment.

Because our plan was to analyze results in four programs, dropping mention of the history and psychology programs does not seem appropriate. It is not exactly a small-N problem but a problem caused by the very small N of students who drop out of these programs. It is not possible to predict dropout when students are not dropping out.

7) While it is understandable to want "intuitive" results, some readers will immediately note that by splitting data into quartiles, the authors are "throwing away" some information in the data. Consider including both the correlation and 2x2 tables even if some readers may "misinterpret" correlations.

Added the correlation.

8) With regard to interpretability, the audience of faculty on admissions committees for whom this paper would most likely be targeted may be unfamiliar with logistic regression using independent variables that have been split into quartiles. A brief comment on what the estimates in Tables 2/4/5 mean in terms of practical score changes and probability of completion would be helpful.

Note that the logistic regression used the original data before it had been split into quartiles. Added: "Similar to an ordinary least squares regression, the logistic regression provides an estimate of the importance and statistical significance of each predictor in the equation but is appropriate when the criterion is dichotomous (0-1, drop out or stay)."

9) As mentioned in comment 5), Study 2 lacks the same context as Study 1. As it stands, essentially the entire discussion in the introduction is dedicated to the research landscape as it pertains to GRE-Q, with little to no mention of GRE-V or writing. Hence Study 2 feels only loosely connected to the rest of the paper.

See the response to comment 5.

10) In the same spirit as the previous comment, Study 2's methods and data should be discussed in more detail. For example:
- How many students were included? What are the demographics of the students who submitted writing samples? What are their majors? What can the study say about majors not included?
- What are the demographics of the faculty who graded the responses? Do these align, and does it matter?
- For instance, the rubric appears to be constructed so that faculty from any discipline can grade papers from any other discipline (i.e., a psychology faculty member might grade a history paper and evaluate its issue and argument). Is this justified? Can a faculty member determine whether an argument is good when it is out of disciplinary context? Perhaps, but this decision is not discussed. What information was lost by stripping disciplinary context?
- One possible avenue to explore this is to examine the scores given by the students' instructors in comparison to those given by the independent graders.
- Have previous studies used a similar methodology to evaluate written work?
- What do the distributions of scores look like on the graded papers? What about the original grades? What do these say about the sample of students?
- Were there specific requirements for a paper to be "deemed scorable" by the scorers?

Some of these questions have been addressed, but data were not available for many of them. The similarity of the approach taken here to previous research [15] was acknowledged. Added: "Therefore, the approach taken followed procedures used in a previous study of student writing in realistic classroom contexts, and the scoring approach also closely matched the procedures in the previous study [15]." The need for additional work in this area was also acknowledged. Added: "Future research should provide more detailed analyses related to the prediction of writing skills. Specifically, the match between the rater's area of expertise and the assigned writing task should be explored, and with a larger sample analyses within specific program areas should be feasible."

…Amending the abstract to include details of the study (e.g. 4 programs studied at 2 universities, N = students, GRE-Q found significantly associated with completion for one of the programs) avoids any possible obfuscation of the findings.

Amended abstract: "…Two studies are included. The first study used data from chemistry (N=320) and computer engineering (N=389) programs from a flagship state university and an Ivy League university to demonstrate the ability of the GRE to predict dropout. Dropout prediction for the chemistry programs was both statistically and practically significant for the GRE quantitative (GRE-Q) scores, but not for the verbal or analytical writing scores. In the computer engineering programs, significant dropout prediction by GRE-Q was evident only for domestic students. The second study showed the relationship of GRE Analytical Writing scores to writing produced as part of graduate school coursework by 217 students."

Put another way, why should faculty believe that a student's ability to do high school math questions makes them more able to complete a PhD?

Although GRE-Q does not require computational skills beyond high school math, the reasoning required is far beyond "ability to do high school math." If only high school knowledge were required, half of the examinees would not have scores below 154, and GRE-Q could not predict either grades or dropout.

Or are other variables confounding this relationship? Does it matter, so long as the test is indicating whether students will drop out?

There may be many intervening variables, but this does not alter the conclusion that GRE-Q scores are a useful indicator of students who are more likely to drop out.

…the authors must reconcile statements such as "Although there is ample evidence that the scores can predict graduate school readiness as indexed by grades in the first one or two years (e.g., [5,6]) mere grade prediction is of limited value." with the fact that Study 2 uses writing samples produced in graduate coursework. Is that not merely a prediction of graduate grades?

No; writing is just one aspect of graduate grades, and this study permits a focus on just the writing-skill component.

Submitted filename: response to PLOS reviewers.docx

31 Mar 2022
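As context for the rater-reliability exchange above: the authors report correlating the first and second randomly assigned raters' scores for each doubly scored essay. A standard psychometric identity, the Spearman-Brown formula (not invoked in the response itself, so its application here is an added note), gives the expected reliability of the average of two raters' scores from the single-rater correlation r_1:

$$ r_{2} = \frac{2\,r_{1}}{1 + r_{1}} = \frac{2(0.70)}{1 + 0.70} \approx 0.82 $$

So with the reported single-essay inter-reader correlation of .70, the two-rater average would be expected to be noticeably more reliable than a single rating.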
PONE-D-21-06540R1
Can the GRE Predict Valued Outcomes? Dropout and Writing Skill

PLOS ONE

Dear Dr. Bridgeman,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. As you can read from the reviewers' comments, there are still some concerns to be addressed. More importantly, the consistency between the response letter and the revised text needs to be ensured.

Please submit your revised manuscript by May 08 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:
- A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
- A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
- An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,
Mingming Zhou, Ph.D.
Academic Editor
PLOS ONE

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the "Comments to the Author" section, enter your conflict of interest statement in the "Confidential to Editor" section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed
Reviewer #3: (No Response)
Reviewer #4: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.
Reviewer #1: Yes
Reviewer #3: Partly
Reviewer #4: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes
Reviewer #3: Yes
Reviewer #4: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes
Reviewer #3: Yes
Reviewer #4: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes
Reviewer #3: Yes
Reviewer #4: Yes

**********

6. Review Comments to the Author. Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Please note that the changes described in response to the reviewers are acceptable from my perspective. However, the change described in the author response regarding rater reliability does not appear in the new manuscript. I assume this is a simple oversight, and I assume it will be added.

Reviewer #3: This paper deals with the important issue of dropout from graduate programs and the ability to predict dropouts and graduates based on GRE scores. It adopts a differentiated approach, based on the different sub-sections of the GRE, which is a welcome contribution. The paper also presents the findings of not one but two studies. The first is a classic prediction study (focusing on GRE-Q) and the second presents a novel approach to assessing the worth of the GRE-AW. With that said, the paper also has some weaknesses. As I enter the review at the R&R stage, I offer the following comments as food for thought, perhaps helpful for improving the introduction and conclusion.

Reading the paper, there is an unsettling dissonance. The authors criticize former studies for focusing on highly selective institutions and programs (with slight score variation). Yet their study is based on data from highly selective institutions and openly states that history and psychology were eliminated because of minor score variation. It reads almost as if they are repeating the pitfalls they warn against. Relatedly, can the authors explain the source of this variation? One would think that in highly selective institutions only top-performing students would be admitted. So what is unique about the admission criteria in these programs that enabled this variation in the first place?
One possible answer is that in some fields there is greater willingness to admit "everyone," starting with big classrooms and seeing who will pass the first-year exams. If this is the case, there is no point in pre-admission tests in the first place. Alternatively, both "good" and "excellent" students were admitted. In that case, the fact that the logistic regression includes only the GRE scores with no other background characteristics (undergraduate GPA, socioeconomic status, parental education, and so forth) is puzzling. With more parameters in the models, they are very likely to show different results. Yet we are left in the dark.

The analysis itself is OK. It would be good to explain why ML was chosen over other options for the logistic regression. And as there are a growing number of studies that turn to machine learning models that are ideal for prediction purposes (random forests is but one option), this statistical approach seems valid but outdated. The same applies to the sample size. I am well aware of the difficulty of collecting data on these issues, and the N is acceptable to achieve statistical significance, but still, it looks a bit low by today's standards.

Turning to Study 2: Overall I found this part very interesting, dealing with an issue that is less "objective" and harder to measure, which probably explains why there is little literature on the topic. Is there a way to tell the readers at what stage in their programs the respondents were when submitting the writing samples? 20% in their first year, 50% in the second year, etc. It is reasonable to assume that over time grad students will improve their writing skills, and the association with the GRE-AW will weaken.

I appreciate the quartile approach. Maybe a more detailed one would also be informative. I understand that you don't have the data on ethnicity, gender, and so forth, but even just using a more fine-tuned classification will do. A good example is provided in Bowen and Bok's book, The Shape of the River (Appendix Table D.3.3). I was surprised to see no reference to Bowen and Bok's work. It is also possible to think about the connection between your argument (which is really somewhat of a "don't throw the baby out with the bathwater" argument) and the ongoing debates about affirmative action. I don't think that your data can provide direct confirmation for either side of the debate, but it would be beneficial to at least acknowledge it. When thinking about it, even the fact that you do not present any other variables in the regression models means something.

To conclude: address the dissonance I pointed out above, and explain the limitation and the decision to focus solely on the GRE in the regression. A bit of a conversation with the literature could help, too.

Reviewer #4:

Page 1: Please also include results of the second study in the abstract. For example: "Our analyses demonstrate that GRE-AW scores are both statistically and practically significant indicators of writing skills in actual samples from graduate courses."

Page 5: [… GRE because of a number of technical issues in the analysis [9].] > List the technical issues briefly.

Page 8: Please mention that, now, you also include correlation coefficients.

Page 10: The numbers of students included in the study on page 9 (117 from a flagship state university and 198 from an Ivy League university) sum to 315 and not to 320 as indicated in the caption of Table 2. Please check and correct the numbers if necessary.
Page 10: There is a space missing in the last paragraph: "were 164 and170".

Page 12: In the results section you claim "Results from the chemistry programs clearly indicate that GRE-Q scores can be effective in identifying students most likely to drop out". Given that you show that 30% of students with a low GRE-Q score drop out and that 14% with a high GRE-Q score drop out, I would suggest claiming that "… in identifying students with a higher likelihood to drop out".

Page 21, reference 5: no space in "meta-analytic".

Page 22, reference 12: ";" not in italic font & wrong dash in page numbers.

Page 22, reference 13: "Rubin, D" instead of "Rubin ,D" & ", 74; 1982," not in italic font.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Jared A. Danielson
Reviewer #3: No
Reviewer #4: No

4 Apr 2022

Response to reviewers (April 2022)

…the author response to the reviewers regarding rater reliability does not appear in the new manuscript.

Added: "For each essay, the score from the first rater (randomly selected from the pool of 12 raters) was correlated with the score from the second randomly selected rater. The inter-reader correlation for a single essay was .70."

Reading the paper, there is an unsettling dissonance. The authors criticize former studies for focusing on highly selective institutions and programs (with slight score variation). Yet their study is based on data from highly selective institutions…

This was addressed by modifying the methods and materials for Study 1 as follows: "We had requested data from graduate programs representing a variety of selectivity levels but were ultimately successful in obtaining data from only two universities. GRE scores and program completion data were obtained from four highly selective PhD programs at a large flagship state university and at a highly selective Ivy League university. We understood that finding significant relationships to dropout in highly selective programs would likely be challenging, but even in these selective programs there was some variation in GRE scores, albeit near the top of the score scales."

The analysis itself is OK. It would be good to explain why ML was chosen over other options for the logistic regression.
And as there are a growing number of studies that turn to machine learning models that are ideal for prediction purposes (random forests is but one option), this statistical approach seems valid but outdated… I understand that you don't have the data on ethnicity, gender, and so forth, but even just using a more fine-tuned classification will do… explain the limitation and decision to focus solely on GRE in the regression.

In the conclusion we acknowledged that other analytic approaches were possible, and by implication why we chose not to use some covariates (e.g., undergraduate GPA) that might make sense in other contexts but were less relevant to the current context. Added: "More elaborate regression models, or random forest models, that account for additional predictors or covariates such as undergraduate grades or socioeconomic status should also be considered, but note that such variables are often difficult or impossible to interpret in populations with large numbers of international students with undergraduate grades on different scales and with socioeconomic indicators that may have different meanings internationally."

I was surprised to see no reference to Bowen and Bok's work. It is also possible to think about the connection between your argument (which is really somewhat of a "don't throw the baby out with the bathwater" argument) and the ongoing debates about affirmative action.

Bowen & Bok reference added in the Conclusion: "With ample evidence available on the value of enrolling a diverse array of students [17], it would be unfortunate to ignore any measures that could help with this effort."

Page 1: Please also include results of the second study in the abstract.

Added to abstract: "In the second study, GRE Analytical Writing scores for 217 students were related to writing produced as part of graduate school coursework and relationships were noted that were both practically and statistically significant."

Page 5: [… GRE because of a number of technical issues in the analysis [9].] > List the technical issues briefly.

Added: "Specifically, the Abstract of this critique noted, 'The paper makes numerous elementary statistics errors, including introduction of unnecessary collider-like stratification bias, variance inflation by collinearity and range restriction, omission of needed data (some subsequently provided), a peculiar choice of null hypothesis on subgroups, blurring the distinction between failure to reject a null and accepting a null, and an extraordinary procedure for radically inflating confidence intervals in a figure.'"

Page 8: Please mention that, now, you also include correlation coefficients.

Added: "We computed the correlation of GRE scores to dropout (0-1)…"

Page 10: The numbers of students included in the study on page 9 (117 from a flagship state university and 198 from an Ivy League university) sum to 315 and not to 320 as indicated in the caption of Table 2. Please check and correct the numbers if necessary.

Fixed.

Page 10: There is a space missing in the last paragraph: "were 164 and170".

Fixed.

Page 12: In the results section you claim "Results from the chemistry programs clearly indicate that GRE-Q scores can be effective in identifying students most likely to drop out". Given that you show that 30% of students with a low GRE-Q score drop out and that 14% with a high GRE-Q score drop out, I would suggest claiming that "… in identifying students with a higher likelihood to drop out".

Fixed: "… in identifying students with a higher likelihood to drop out."
Page 21, reference 5: no space in "meta-analytic".

? I did not see any extra space. The spelling is exactly as in the original reference.

Page 22, reference 12: ";" not in italic font & wrong dash in page numbers. Page 22, reference 13: "Rubin, D" instead of "Rubin ,D" & ", 74; 1982," not in italic font.

Fixed.

Submitted filename: response to plosone reviews April.docx

9 May 2022

Can the GRE Predict Valued Outcomes? Dropout and Writing Skill

PONE-D-21-06540R2

Dear Dr. Bridgeman,

We're pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you'll receive an e-mail detailing the required amendments. When these have been addressed, you'll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance.

To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they'll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,
Mingming Zhou, Ph.D.
Section Editor
PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the "Comments to the Author" section, enter your conflict of interest statement in the "Confidential to Editor" section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed
Reviewer #3: All comments have been addressed
Reviewer #4: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes
Reviewer #3: Yes
Reviewer #4: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes
Reviewer #3: Yes
Reviewer #4: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository.
For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes
Reviewer #3: Yes
Reviewer #4: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes
Reviewer #3: Yes
Reviewer #4: Yes

**********

6. Review Comments to the Author. Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (No Response)

Reviewer #3: I thank the authors for the care with which they responded to my comments in the previous round of revisions. In my opinion, this article is now ready for publication. Minor clarification: on page 7, "Because many factors in addition to writing skill are important in determining the overall grade point average, this study could not provide a direct link between GRE-AW and writing demands in graduate courses." Does "this study" refer to citation [6]? Please clarify.

Reviewer #4: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Jared A. Danielson
Reviewer #3: No
Reviewer #4: No

12 May 2022

PONE-D-21-06540R2

Can the GRE Predict Valued Outcomes? Dropout and Writing Skill

Dear Dr. Bridgeman:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department. If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org. If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of Dr. Mingming Zhou
Section Editor
PLOS ONE
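A closing technical note on the range-restriction discussion in the response letters above. One widely used correction for restriction of range on the predictor is Thorndike's Case II (whether this is the form used in the Kuncel et al. meta-analysis is an assumption here, added only for context). It rescales an observed correlation r using the ratio u = S/s of the unrestricted to the restricted predictor standard deviation:

$$ r_{c} = \frac{r\,u}{\sqrt{1 - r^{2} + r^{2}u^{2}}}, \qquad u = \frac{S}{s} $$

Because u enters multiplicatively, the corrected value is very sensitive to the estimate of s; when nearly all admitted students sit in a narrow band near the top of the score scale, s must be estimated from sparse data, which is exactly the caveat the authors attach to such corrections.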
Related articles: 6 in total

1.  The Limitations of the GRE in Predicting Success in Biomedical Graduate School.

Authors:  Liane Moneta-Koehler; Abigail M Brown; Kimberly A Petrie; Brent J Evans; Roger Chalkley
Journal:  PLoS One       Date:  2017-01-11       Impact factor: 3.240

2.  Predictors of Student Productivity in Biomedical Graduate School Applications.

Authors:  Joshua D Hall; Anna B O'Connell; Jeanette G Cook
Journal:  PLoS One       Date:  2017-01-11       Impact factor: 3.240

3.  Typical physics Ph.D. admissions criteria limit access to underrepresented groups but fail to predict doctoral completion.

Authors:  Casey W Miller; Benjamin M Zwickl; Julie R Posselt; Rachel T Silvestrini; Theodore Hodapp
Journal:  Sci Adv       Date:  2019-01-23       Impact factor: 14.136

4.  The GRE over the entire range of scores lacks predictive ability for PhD outcomes in the biomedical sciences.

Authors:  Linda Sealy; Christina Saunders; Jeffrey Blume; Roger Chalkley
Journal:  PLoS One       Date:  2019-03-21       Impact factor: 3.240

5.  Do GRE scores help predict getting a physics Ph.D.? A comment on a paper by Miller et al.

Authors:  M B Weissman
Journal:  Sci Adv       Date:  2020-06-05       Impact factor: 14.136

6.  Multi-institutional study of GRE scores as predictors of STEM PhD degree completion: GRE gets a low mark.

Authors:  Sandra L Petersen; Evelyn S Erenrich; Dovev L Levine; Jim Vigoreaux; Krista Gile
Journal:  PLoS One       Date:  2018-10-29       Impact factor: 3.240
