Carnegie Mellon University - Department of Statistics
Open-ended (i.e., constructed-response) test items have become a stock component of standardized educational assessments. Responses to open-ended items are usually evaluated by human "raters," often with multiple raters judging each response. A pragmatic model of an assessment system using rated responses must accurately capture the additional variability the raters impose on the system and address the performance of the raters, both as a group and individually.

I will contrast the FACETS model (Linacre, 1989), a mixed-effects multivariate logistic regression model that has been a popular tool for modeling data from rated test items, with a fully hierarchical Bayesian model for rating data (the Hierarchical Rater Model, HRM, of Patz, Junker, Johnson and Mariano, 2002). The HRM makes more realistic assumptions about the dependence between multiple ratings of the same student work, and thus provides a more realistic view of the uncertainty of inferences on parameters and latent variables from rated test items. A rigorous treatment of the approach to dependence and uncertainty in each model will be presented, followed by an exploration of the accumulation of information under the HRM under various scenarios of rater performance (especially poor performance).

The HRM will be explored as a flexible tool for diagnosing both between-rater and within-rater effects by incorporating covariates of rater behavior into the hierarchy of the model. The effect of modality (the design for distributing items among raters) on the severity and consistency of individual raters' performance will be used as an illustration. Both the accurate capture of rater variability and the inclusion/exclusion of rater covariates create model selection problems in which parameters easily number in the tens of thousands. The adaptation of computational methods for the Bayes factor via Markov chain Monte Carlo in this high-dimensional setting will be discussed.
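The two-stage structure of the HRM can be sketched with a small simulation. This is a minimal, hypothetical illustration, not the model as specified in Patz et al. (2002): it draws an "ideal" rating for each student from an IRT-style threshold model, then lets each rater report a noisy, possibly biased version of that ideal rating. All parameter values (thresholds, rater severities `phi`, rater unreliabilities `psi`) are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_students, n_raters, n_cats = 5, 3, 4    # rating categories 0..3

theta = rng.normal(0.0, 1.0, n_students)  # latent student proficiencies

# Stage 1: "ideal" rating xi_i for each piece of student work, via
# cumulative thresholds (a graded-response-style discretization).
thresholds = np.array([-1.0, 0.0, 1.0])

def ideal_rating(th: float) -> int:
    # Count how many thresholds the (noisy) proficiency exceeds.
    return int(np.sum(th + rng.logistic() > thresholds))

xi = np.array([ideal_rating(t) for t in theta])

# Stage 2: each rater reports the ideal rating plus a severity bias
# phi_r and noise with sd psi_r, rounded and clipped to the category
# range -- a discrete signal-detection view of rater behavior.
phi = np.array([0.0, 0.5, -0.5])          # rater severity (bias)
psi = np.array([0.3, 0.3, 1.0])           # rater unreliability (sd)

noise = rng.normal(0.0, psi, (n_students, n_raters))
ratings = np.clip(np.rint(xi[:, None] + phi[None, :] + noise),
                  0, n_cats - 1).astype(int)
```

Because both raters' scores of the same response are tied to the same ideal rating `xi`, multiple ratings of one piece of work are dependent even after conditioning on the student's proficiency, which is the dependence structure the abstract contrasts with FACETS.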