Title: HUMANS VS. AI IN GRADING STUDENTS' TEXTS - A PILOT STUDY WITH 3 TEACHERS AGAINST ONE AI

Author(s): Felix Weber and Hendrik Hubbertz

ISBN: 978-989-8704-72-6

Editors: Demetrios G. Sampson, Dirk Ifenthaler and Pedro Isaías

Year: 2025

Edition: Single

Keywords: Artificial Intelligence, Grading, Feedback, LLMs, Educational Assessment

Type: Full Paper

First Page: 128

Last Page: 134

Language: English

Paper Abstract:
Artificial Intelligence (AI) technologies are increasingly being integrated into educational environments, especially in areas such as student feedback and assessment. Among these applications, automated grading tools have garnered both interest and controversy for their potential to streamline evaluation processes while raising questions about accuracy and fairness. In a recent critique, Mühlhoff and Henningsen (2024) raised concerns about the reliability of the Fobizz AI grading tool, highlighting significant inconsistencies in how the AI graded a fixed set of texts. Motivated by their findings, our study sought to replicate and extend this analysis by comparing the AI's grading performance with that of experienced human teachers. We used the same dataset of ten student-written texts originally employed by Mühlhoff and Henningsen, but instead of collecting AI-generated evaluations, we engaged three experienced secondary school teachers to perform the grading. These educators had an average of 20.66 years of teaching experience, providing a robust comparison point for assessing human consistency. To reduce the influence of recognition bias and memory effects, the teachers graded the texts twice, with a two-month interval between sessions and a randomized ordering of the texts each time.
The results of our study were striking. Between the two grading rounds, the teachers assigned different grades to the same text in 73% of cases, reflecting a notable degree of inconsistency. By contrast, the Fobizz AI system showed a discrepancy rate of just 30% in the study by Mühlhoff and Henningsen. Furthermore, the average grade deviation between the two human assessments was 2.1 points on a standard grading scale, while the AI's average deviation was only 0.5 points. These findings suggest that, in this context, the AI grading tool exhibited greater internal consistency than the experienced human teachers.
It is important to note, however, that the limited sample size of both texts and participants constrains the generalizability of our conclusions. Nevertheless, the results challenge a common assumption in educational discourse: that experienced human teachers inherently provide more stable and reliable assessments than AI-based systems. Our study invites further research into the comparative reliability of human and machine grading, with implications for the future role of AI in educational assessment practices.
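
The two consistency measures reported in the abstract, the discrepancy rate and the average grade deviation between rounds, reduce to simple arithmetic over paired grades. The following minimal Python sketch shows how such a test-retest comparison could be computed; the function name and the sample grades are illustrative assumptions, not data from the study.

    from typing import Sequence

    def retest_consistency(round1: Sequence[float], round2: Sequence[float]) -> tuple[float, float]:
        """Return (discrepancy_rate, mean_abs_deviation) for two grading rounds.

        discrepancy_rate: share of texts that received a different grade in
        the two rounds. mean_abs_deviation: average size of the grade change,
        in points on the grading scale.
        """
        if len(round1) != len(round2):
            raise ValueError("Both rounds must cover the same set of texts")
        pairs = list(zip(round1, round2))
        discrepancy_rate = sum(a != b for a, b in pairs) / len(pairs)
        mean_abs_deviation = sum(abs(a - b) for a, b in pairs) / len(pairs)
        return discrepancy_rate, mean_abs_deviation

    # Hypothetical grades for ten texts on a six-point scale (not the study's data).
    round1 = [2, 3, 1, 4, 2, 5, 3, 2, 4, 3]
    round2 = [3, 3, 2, 4, 1, 4, 3, 3, 5, 3]
    rate, deviation = retest_consistency(round1, round2)
    print(f"Discrepancy rate: {rate:.0%}, mean deviation: {deviation:.1f} grade points")

Read this way, the abstract's 73% vs. 30% discrepancy rates and 2.1 vs. 0.5 point deviations are directly comparable per-text quantities.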