LLMs instead of human judges? A large scale empirical study across 20 NLP evaluation tasks

Bavaresco, A., Bernardi, R., Bertolazzi, L., Elliott, D., Fernández, R., Gatt, A., Ghaleb, E., Giulianelli, M., Hanna, M., Koller, A., Martins, A. F. T., Mondorf, P., Neplenbroek, V., Pezzelle, S., Plank, B., Schlangen, D., Suglia, A., Surikuchi, A. K., Takmaz, E., & Testoni, A. (in press). LLMs instead of human judges? A large scale empirical study across 20 NLP evaluation tasks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025).

Publication type: Proceedings paper
