Humans vs. AI in Grading Students' Texts -- A Pilot Study with 3 Teachers against One AI

Authors: Felix Weber; Hendrik Hubbertz
Language:
English
Source:
International Association for Development of the Information Society. 2025.
Subject:
2025
Record Type:
Speeches/Meeting Papers
Reports - Research

Additional Information
- Availability:
  International Association for the Development of the Information Society. e-mail: secretariat@iadis.org; Web site: http://www.iadisportal.org
- Peer Reviewed:
  Y
- Source:
  7
- Education Level:
  Secondary Education
- Subject:
  Grading; Artificial Intelligence; Automation; Student Evaluation; Experienced Teachers; Secondary School Teachers; Technology Uses in Education; Reliability; Feedback (Response)
- Abstract:
  Artificial Intelligence (AI) technologies are increasingly being integrated into educational environments, especially in areas such as student feedback and assessment. Among these applications, automated grading tools have garnered both interest and controversy for their potential to streamline evaluation processes while raising questions about accuracy and fairness. In a recent critique, Mühlhoff and Henningsen (2024) raised concerns about the reliability of the Fobizz AI grading tool, highlighting significant inconsistencies in how the AI graded a fixed set of texts. Motivated by their findings, our study sought to replicate and extend this analysis by comparing the AI's grading performance with that of experienced human teachers. We used the same dataset of ten student-written texts originally employed by Mühlhoff and Henningsen, but instead of AI-generated evaluations, we engaged a group of three experienced secondary school teachers to perform the grading. These educators had an average of 20.66 years of teaching experience, providing a robust comparison point for assessing human consistency. To reduce the influence of recognition bias and memory effects, we asked the teachers to grade the texts twice, with a two-month interval between sessions and randomized ordering of the texts each time. The results of our study were striking. Across both rounds of grading, teachers assigned different grades to the same texts in 73% of cases, reflecting a notable degree of inconsistency. In contrast, the Fobizz AI system showed a discrepancy rate of just 30% in the study by Mühlhoff and Henningsen. Furthermore, we observed that the average grade deviation between the two human assessments was 2.1 points on a standard grading scale, while the AI's average deviation was only 0.5 points. These findings suggest that, in this context, the AI grading tool exhibited greater internal consistency than the experienced human teachers. 
It is important to note, however, that the limited sample size of both texts and participants constrains the generalizability of our conclusions. Nevertheless, the results challenge a common assumption in educational discourse: that experienced human teachers inherently provide more stable and reliable assessments than AI-based systems. Our study invites further research into the comparative reliability of human and machine grading, with implications for the future role of AI in educational assessment practices. [For the complete proceedings, "Proceedings of the International Association for Development of the Information Society (IADIS) International Conference on Cognition and Exploratory Learning in the Digital Age (CELDA) (22nd, Porto, Portugal, November 1-3, 2025)," see ED677812.]
- Abstract:
  As Provided
- Subject:
  2026
- Identifier Number:
  ED677814
