Title: Investigating the Quality of AI-Generated Distractors for a Multiple-Choice Vocabulary Test
Author: Wojciech Malec
Institute: Institute of Linguistics, John Paul II Catholic University of Lublin, Al. Racławickie, Lublin, Poland
Keywords: AI-Generated Items, ChatGPT, Vocabulary Assessment, Multiple-Choice Testing, Distractor Analysis
Abstract: This paper evaluates the effectiveness of AI-generated distractors for multiple-choice vocabulary tests. Using OpenAI’s ChatGPT (version 3.5), distractors were created for a test administered to 142 advanced English learners. The study found the test to have relatively low reliability, with several items containing ineffective distractors. Qualitative analysis revealed mismatched options, and follow-up queries often failed to correct initial errors. The study concludes that while AI can enhance test practicality, ChatGPT-generated items require human moderation to be operationally viable.
Introduction: Advances in AI offer significant benefits for educational technology, including personalized learning, automated essay scoring, and intelligent tutoring systems. OpenAI's ChatGPT, a large language model capable of fluent natural-language generation, has been used to generate test items. Despite this potential, ensuring the quality and appropriateness of AI-generated items remains a challenge.
Method: Fifteen multiple-choice vocabulary items were administered to 142 advanced English learners. Context sentences for the item stems were generated with Twee, an AI-powered platform for language teachers, and distractor suggestions were obtained from ChatGPT. The resulting items were then analyzed for reliability and distractor effectiveness.
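For illustration, the kind of request sent to ChatGPT can be reproduced programmatically. The sketch below is a minimal example only, not the study's actual procedure: the study worked through the ChatGPT interface, and the prompt wording, the suggest_distractors helper, and the sample stem are all hypothetical.

# Illustrative sketch only: the study used the ChatGPT interface, but an
# equivalent request via OpenAI's chat completions API might look like this.
# The prompt wording and the example stem are hypothetical, not from the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def suggest_distractors(stem: str, correct_answer: str, n: int = 3) -> str:
    """Ask the model for n plausible-but-wrong options for one MC item."""
    prompt = (
        f"Here is a gapped sentence for a vocabulary test:\n{stem}\n"
        f"The correct answer is '{correct_answer}'. "
        f"Suggest {n} distractors: words of the same part of speech that "
        f"are plausible in form but clearly wrong in this context. "
        f"Return one word per line."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # the ChatGPT version used in the study
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(suggest_distractors(
    stem="The committee decided to ______ the proposal until next month.",
    correct_answer="postpone",
))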
Results: The test showed relatively low reliability, and several items contained distractors that did not perform well. Point-biserial correlations and trace line analysis indicated that many distractors were not functioning as intended. Follow-up attempts to improve the distractors with ChatGPT yielded inconsistent results.
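To make the option-level statistics concrete, the sketch below computes point-biserial correlations on hypothetical response data (not the study's dataset). For each option, the point-biserial is the Pearson correlation between a 0/1 indicator of choosing that option and the total test score; an effective distractor should show a negative value, while the key should show a positive one. Trace line analysis complements this by plotting the proportion of test-takers choosing each option across ability groups, where a functioning distractor's line should fall as ability rises.

# Minimal sketch of distractor point-biserial analysis on hypothetical data.
import numpy as np

def point_biserial(chose_option: np.ndarray, total_score: np.ndarray) -> float:
    """Pearson r between a dichotomous indicator and a continuous score."""
    return float(np.corrcoef(chose_option, total_score)[0, 1])

# Hypothetical data: 10 test-takers' chosen option on one item (key = 'B')
# and their total scores on the 15-item test.
choices = np.array(list("BBADBBCBDB"))
totals = np.array([13, 12, 6, 5, 14, 11, 7, 10, 4, 12])

for option in "ABCD":
    indicator = (choices == option).astype(float)
    r = point_biserial(indicator, totals)
    print(f"option {option}: chosen by {int(indicator.sum())}, r_pb = {r:+.2f}")

On this toy data, the key B comes out positive and the distractors negative; in the study, many distractors failed to show this expected pattern.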
Discussion: The study highlights the need for human oversight when using AI-generated test items. While AI can streamline test development, ChatGPT's current capabilities are insufficient to produce consistently effective distractors without human intervention.
Conclusion: AI tools like ChatGPT hold promise for practical test development but require human moderation to ensure item quality. The study suggests that AI-assisted item generation should be viewed as a supplementary tool rather than a standalone solution.
Acknowledgements: Thanks to the teachers and students of XXI Liceum Ogólnokształcące im. św. Stanisława Kostki in Lublin, Poland, for their participation in the study.