BACKGROUND: Early identification of students who may have difficulty with the United States Medical Licensing Examination (USMLE) Step 1 is important for medical schools and students. Numerous models that predict Step 1 performance have been proposed, but few have been cross-validated. PURPOSE: To cross-validate different prediction models of Step 1 performance. METHODS: The development sample was 686 students from a Midwestern medical school. The cross-validation sample was 147 different students. Logistic regression was used to develop the multiple model, and Year 1 grade point average (GPA) and Year 2 Fall GPA were used as the simple models. Receiver operating characteristic (ROC) graphs were used to select optimal cutoffs for each model. Kappa coefficients were used to determine the level of agreement, and sensitivity and specificity were used to assess classification accuracy. RESULTS: The Year 1 GPA model had relatively poor agreement with actual Step 1 performance, but the other models showed fair agreement. The multiple and Year 1 GPA models demonstrated a statistically significant loss of classification accuracy on cross-validation, whereas the Year 2 Fall GPA model did not. CONCLUSIONS: Cross-validation is necessary to determine the generalizability and overall utility of prediction models.
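The evaluation steps named in METHODS (a cutoff chosen from the ROC curve, then kappa, sensitivity, and specificity) can be illustrated with a minimal sketch. It is not the study's procedure: the abstract does not state how the "optimal" cutoff was defined, so the sketch assumes Youden's J (sensitivity + specificity - 1) as the criterion, assumes a student is flagged as at risk when the GPA falls below the cutoff, and uses hypothetical function names and made-up toy values rather than study data.

```python
def confusion_counts(scores, failed, cutoff):
    """Flag a student as at risk when score < cutoff, then tally outcomes.

    Returns (tp, fp, fn, tn), where "positive" means flagged as at risk
    and `failed` holds the actual outcome (True = did not pass).
    """
    tp = fp = fn = tn = 0
    for score, did_fail in zip(scores, failed):
        flagged = score < cutoff
        if flagged and did_fail:
            tp += 1
        elif flagged and not did_fail:
            fp += 1
        elif not flagged and did_fail:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def sensitivity_specificity(tp, fp, fn, tn):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return sens, spec


def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa: agreement between predicted and actual outcome,
    corrected for the agreement expected by chance."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0


def best_cutoff(scores, failed):
    """Assumed criterion: the cutoff that maximizes Youden's J."""
    best_c, best_j = None, float("-inf")
    for c in sorted(set(scores)):
        sens, spec = sensitivity_specificity(*confusion_counts(scores, failed, c))
        j = sens + spec - 1
        if j > best_j:
            best_c, best_j = c, j
    return best_c


# Made-up toy values for illustration only; not data from the study.
dev_gpa = [2.0, 2.4, 2.6, 3.0, 3.2, 3.5, 3.8]
dev_failed = [True, True, False, True, False, False, False]

cutoff = best_cutoff(dev_gpa, dev_failed)
counts = confusion_counts(dev_gpa, dev_failed, cutoff)
print(cutoff, sensitivity_specificity(*counts), cohen_kappa(*counts))
```

Cross-validation in this setting would mean fixing the cutoff on the development sample and then calling `confusion_counts` with that same cutoff on the separate cross-validation sample, so that any drop in kappa, sensitivity, or specificity reflects loss of accuracy on new students.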