Efectos del corrector en las evaluaciones educativas de alto impacto

  1. Pamela Woitschach (1)
  2. Carlota Díaz-Pérez (2)
  3. Daniel Fernández-Argüelles (2)
  4. Jaime Fernández-Castañón (2)
  5. Alba Fernández-Castillo (2)
  6. Lara Fernández-Rodríguez (2)
  7. María Cristina González-Canal (2)
  8. Iris López-Marqués (2)
  9. David Martín-Espinosa (2)
  10. Rubén Navarro-Cabrero (2)
  11. Lara Osendi-Cadenas (2)
  12. Diego Riesgo-Fernández (2)
  13. Zara Suárez-García (2)
  14. Rubén Fernández-Alonso (2)

Affiliations:
  (1) Universidad Complutense de Madrid, Madrid, Spain. ROR: https://ror.org/02p0gd045
  (2) Universidad de Oviedo, Oviedo, Spain. ROR: https://ror.org/006gksa02

Journal: REMA

ISSN: 1135-6855

Year of publication: 2018

Volume: 23

Issue: 1

Pages: 12-27

Type: Article

DOI: 10.17811/REMA.23.1.2018.12-27

Abstract

Background: constructed-response items scored by different raters using rubrics are one of the biggest challenges in high-stakes educational assessments, which are administered to large samples, and rater bias is known to affect the results of the evaluation. In this context, the present study analyzes rater effects in the scoring of written expression. Method: a group of 13 raters scored 375 written productions of 6th-grade students using an analytic rubric composed of 8 scoring criteria. The raters were assigned to 13 scoring groups following a balanced incomplete block design. The first step of the analysis was to confirm the unidimensional structure of the rubric; the next and final step applied several classical methods to study rater effects, intra-rater consistency, and inter-rater agreement. Results: differential effects were found among the raters, and these differences are substantial when rater severity is compared. There are also differences in the internal consistency of each rater and in the agreement between raters, and this last effect is especially marked in some scoring groups. Discussion: differences between raters may have several sources, such as experience and familiarity with the task, the degree of training with the rubric, the nature of the test, and the design of the rubric used.
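
The final analytic step mentioned in the abstract (classical indices of rater severity, intra-rater consistency, and inter-rater agreement) can be illustrated with a short sketch. The Python code below is not the authors' analysis and uses none of their data: it simulates a small, hypothetical ratings matrix (the sizes, severity offsets, and variable names are all assumptions) and computes two classical agreement statistics with plain NumPy, an intraclass correlation ICC(2,1) and a quadratic-weighted kappa, which are typical ways of quantifying rater severity and between-rater agreement in this literature.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sizes and effects (NOT the study's data): 20 scripts, 3 raters,
# scores on a 0-4 rubric scale, with a fixed severity offset per rater.
n_scripts, n_raters = 20, 3
true_quality = rng.integers(0, 5, size=n_scripts)
severity = np.array([0, 1, -1])                 # positive = harsher rater
ratings = np.clip(true_quality[:, None] - severity[None, :], 0, 4)

def icc_2_1(x):
    """ICC(2,1): two-way random effects, absolute agreement, single rater."""
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()   # scripts
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()   # raters
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols
    ms_r = ss_rows / (n - 1)
    ms_c = ss_cols / (k - 1)
    ms_e = ss_err / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

def weighted_kappa(a, b, n_cat=5):
    """Quadratic-weighted Cohen's kappa between two raters on a 0..n_cat-1 scale."""
    obs = np.zeros((n_cat, n_cat))
    for i, j in zip(a, b):
        obs[i, j] += 1
    obs /= obs.sum()
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))
    w = np.subtract.outer(np.arange(n_cat), np.arange(n_cat)) ** 2 / (n_cat - 1) ** 2
    return 1 - (w * obs).sum() / (w * exp).sum()

# Severity shows up directly as shifted rater means; the agreement indices
# summarize how much the offsets and clipping erode absolute agreement.
print("Rater means:", ratings.mean(axis=0).round(2))
print("ICC(2,1):   ", round(icc_2_1(ratings.astype(float)), 3))
print("Weighted kappa, raters 1 vs 2:",
      round(weighted_kappa(ratings[:, 0], ratings[:, 1]), 3))
```

In this sketch, systematic severity appears as a shift in each rater's mean score, while ICC(2,1), which penalizes absolute disagreement, and the weighted kappa summarize how far the raters are from interchangeable; the study itself reports analogous severity, consistency, and agreement contrasts across its 13 raters.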
