DemosQA Dataset for Question Answering in Greek

The recent wave of advancements in Natural Language Processing (NLP) and Deep Learning has led to the development of increasingly powerful Large Language Models (LLMs). However, research has primarily focused on high-resource languages, such as English. Only recently has attention shifted towards multilingual models.

The training data of these multilingual models is often skewed towards a limited number of popular languages, or the models rely on transfer learning from high-resource to under-resourced languages. Either can lead to a misrepresentation of the social, cultural, and historical context of under-resourced languages. To address this challenge, monolingual LLMs have been developed for under-resourced languages, but their effectiveness has been studied less than that of their multilingual counterparts.

A new study focuses on Question Answering (QA) in Greek and makes three contributions:

  • DemosQA: a novel dataset constructed using social media user questions and community-reviewed answers to better capture the Greek social and cultural zeitgeist.
  • A memory-efficient LLM evaluation framework adaptable to diverse QA datasets and languages.
  • An extensive evaluation of 11 monolingual and multilingual LLMs on 6 human-curated Greek QA datasets using 3 different prompting strategies.
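
To give a sense of the scale of the evaluation in the last item, the sketch below enumerates the configurations it would involve. It assumes that every model is run on every dataset with every prompting strategy, which the summary does not state explicitly. The model, dataset, and strategy names are hypothetical placeholders, since only the counts are given here.

```python
from itertools import product

# Hypothetical placeholder names: the summary gives only the counts.
models = [f"model_{i}" for i in range(1, 12)]        # 11 LLMs
datasets = [f"dataset_{i}" for i in range(1, 7)]     # 6 Greek QA datasets
strategies = [f"strategy_{i}" for i in range(1, 4)]  # 3 prompting strategies

# Assumption: a full cross product of models, datasets, and strategies.
runs = list(product(models, datasets, strategies))

print(len(runs))  # 11 * 6 * 3 = 198 configurations
```

Under that assumption the grid has 198 configurations.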

The code and data have been released to facilitate reproducibility of the results.