katerina margatina | publications

* denotes equal contribution

An up-to-date list is available on Google Scholar.

2025

ACL

CONFETTI: Conversational Function-Calling Evaluation Through Turn-Level Interactions

Tamer Alkhouli, Katerina Margatina, James Gung, Raphael Shu, Claudia Zaghi, Monica Sunkara, and Yi Zhang

In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. 2025

Abstract arXiv PDF Blog

We introduce Conversational Function-Calling Evaluation Through Turn-Level Interactions (CONFETTI), a conversational benchmark1 designed to evaluate the function-calling capabilities and response quality of large language models (LLMs). Current benchmarks lack comprehensive assessment of LLMs in complex conversational scenarios. CONFETTI addresses this gap through 109 human-simulated conversations, comprising 313 user turns and covering 86 APIs. These conversations explicitly target various conversational complexities, such as follow-ups, goal correction and switching, ambiguous and implicit goals. We perform off-policy turn-level evaluation using this benchmark targeting function-calling. Our benchmark also incorporates dialog act annotations to assess agent responses. We evaluate a series of state-of-the-art LLMs and analyze their performance with respect to the number of available APIs, conversation lengths, and chained function calling. Our results reveal that while some models are able to handle long conversations, and leverage more than 20+ APIs successfully, other models struggle with longer context or when increasing the number of APIs. We also report that the performance on chained function-calls is severely limited across the models. Overall, the top performing models on CONFETTI are Nova Pro (40.01%), Claude Sonnet v3.5 (35.46%) and Llama 3.1 405B (33.19%) followed by command-r-plus (31.18%) and Mistral-Large-2407 (30.07%).
REALM

A Study on Leveraging Search and Self-Feedback for Agent Reasoning

Karthikeyan K, Michelle Yuan, Elman Mansimov, Katerina Margatina, Anurag Pratik, Daniele Bonadiman, Monica Sunkara, Yi Zhang, and Yassine Benajiba

In Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025). 2025

Abstract arXiv PDF

Recent works have demonstrated that incorporating search during inference can significantly improve reasoning capabilities of language agents. Some approaches may make use of the ground truth or rely on model’s own generated feedback. The search algorithm uses this feedback to then produce values that will update its criterion for exploring and exploiting various reasoning paths. In this study, we investigate how search and model’s self-feedback can be leveraged for reasoning tasks. First, we explore differences in ground-truth feedback and self-feedback during search for math reasoning. Second, we observe limitations in applying search techniques to more complex tasks like tool-calling and design domain-specific approaches to address these gaps. Our experiments reveal challenges related to generalization when solely relying on self-feedback during search. For search to work effectively, either access to the ground-truth is needed or feedback mechanisms need to be carefully designed for the specific task.

2024

Thesis

Exploring Active Learning Algorithms for Data Efficient Language Models

Katerina Margatina

2024

Abstract HTML

Supervised learning is based in the premise that models can effectively solve tasks by learning from numerous examples, mapping inputs to outputs through iterative learning. However, contemporary deep learning models often require vast amounts of labeled data, termed training examples, for optimal performance. Unfortunately, not all training examples contribute equally to the learning process, leading to inefficiencies and resource wastage. Active Learning (AL) has emerged as a powerful paradigm for training language models in a data-efficient manner. By iteratively selecting informative unlabeled data points, which are then annotated by humans to form the training set, AL intelligently guides the training process, optimizing data selection for model improvement over random sampling. This thesis investigates various aspects of active learning algorithms for language mod- els, focusing on model training, data selection, in-context learning and simulation. The thesis is structured along four key publications that tackle these topics respectively. The first publication addresses the effective adaptation of pretrained language models for AL, highlighting the importance of task-specific fine-tuning. The second publication introduces a novel acquisition function, Contrastive Active Learning (CAL), which selects contrastive examples to improve AL performance. The third publication explores active learning principles for in-context learning with large language models, emphasizing the selection of informative demonstrations for few-shot learning. Lastly, the fourth publication critically examines the limitations of simulating AL experiments and pro- poses guidelines for future research. Through these contributions, this thesis aims to advance our understanding of AL algorithms for data-efficient language model training.
NeurIPS
🏆 Best Paper

The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models

Hannah Rose Kirk, Alexander Whitefield, Paul Röttger, Andrew Bean, Katerina Margatina, Juan Ciro, Rafael Mosquera, Max Bartolo, Adina Williams, Bertie Vidgen He He, and Scott A. Hale

In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks. 2024

Abstract arXiv HTML Blog TL;DR

Human feedback plays a central role in the alignment of Large Language Models (LLMs). However, open questions remain about the methods (how), domains (where), people (who) and objectives (to what end) of human feedback collection. To navigate these questions, we introduce PRISM, a new dataset which maps the sociodemographics and stated preferences of 1,500 diverse participants from 75 countries, to their contextual preferences and fine-grained feedback in 8,011 live conversations with 21 LLMs. PRISM contributes (i) wide geographic and demographic participation in human feedback data; (ii) two census-representative samples for understanding collective welfare (UK and US); and (iii) individualised feedback where every rating is linked to a detailed participant profile, thus permitting exploration of personalisation and attribution of sample artefacts. We focus on collecting conversations that centre subjective and multicultural perspectives on value-laden and controversial topics, where we expect the most interpersonal and cross-cultural disagreement. We demonstrate the usefulness of PRISM via three case studies of dialogue diversity, preference diversity, and welfare outcomes, showing that it matters which humans set alignment norms. As well as offering a rich community resource, we advocate for broader participation in AI development and a more inclusive approach to technology design.

2023

EMNLP-Findings

Active Learning Principles for In-Context Learning with Large Language Models

Katerina Margatina, Timo Schick, Nikolaos Aletras, and Jane Dwivedi-Yu

In Findings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) 2023 2023

Abstract arXiv TL;DR

The remarkable advancements in large language models (LLMs) have significantly enhanced the performance in few-shot learning settings. By using only a small number of labeled examples, referred to as demonstrations, LLMs can effectively grasp the task at hand through in-context learning. However, the process of selecting appropriate demonstrations has received limited attention in prior work. This paper addresses the issue of identifying the most informative demonstrations for few-shot learning by approaching it as a pool-based Active Learning (AL) problem over a single iteration. Our objective is to investigate how AL algorithms can serve as effective demonstration selection methods for in-context learning. We compare various standard AL algorithms based on uncertainty, diversity, and similarity, and consistently observe that the latter outperforms all other methods, including random sampling. Notably, uncertainty sampling, despite its success in conventional supervised learning scenarios, performs poorly in this context. Our extensive experimentation involving a diverse range of GPT and OPT models across 24 classification and multi-choice tasks, coupled with thorough analysis, unambiguously demonstrates that in-context example selection through AL prioritizes high-quality examples that exhibit low uncertainty and bear similarity to the test examples.
EMNLP

Understanding the Role of Input Token Characters in Language Models: How Does Information Loss Affect Performance?

Ahmed Alajrami, Katerina Margatina, and Nikolaos Aletras

In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) 2023

Abstract arXiv TL;DR

Understanding how and what pre-trained language models (PLMs) learn about language is an open challenge in natural language processing. Previous work has focused on identifying whether they capture semantic and syntactic information, and how the data or the pre-training objective affects their performance. However, to the best of our knowledge, no previous work has specifically examined how information loss in input token characters affects the performance of PLMs. In this study, we address this gap by pre-training language models using small subsets of characters from individual tokens. Surprisingly, we find that pre-training even under extreme settings, i.e. using only one character of each token, the performance retention in standard NLU benchmarks and probing tasks compared to full-token models is high. For instance, a model pre-trained only on single first characters from tokens achieves performance retention of approximately 90% and 77% of the full-token model in SuperGLUE and GLUE tasks, respectively.
ACL-Findings

On the Limitations of Simulating Active Learning

Katerina Margatina, and Nikolaos Aletras

In Findings of the Association for Computational Linguistics (ACL) 2023

Abstract arXiv HTML PDF

Active learning (AL) is a human-and-model-in-the-loop paradigm that iteratively selects informative unlabeled data for human annotation, aiming to improve over random sampling. However, performing AL experiments with human annotations on-the-fly is a laborious and expensive process, thus unrealistic for academic research. An easy fix to this impediment is to simulate AL, by treating an already labeled and publicly available dataset as the pool of unlabeled data. In this position paper, we first survey recent literature and highlight the challenges across all different steps within the AL loop. We further unveil neglected caveats in the experimental setup that can significantly affect the quality of AL research. We continue with an exploration of how the simulation setting can govern empirical findings, arguing that it might be one of the answers behind the ever posed question “why do active learning algorithms sometimes fail to outperform random sampling?”. We argue that evaluating AL algorithms on available labeled datasets might provide a lower bound as to their effectiveness in real data. We believe it is essential to collectively shape the best practices for AL research, particularly as engineering advancements in LLMs push the research focus towards data-driven approaches (e.g., data efficiency, alignment, fairness). In light of this, we have developed guidelines for future work. Our aim is to draw attention to these limitations within the community, in the hope of finding ways to address them.
EACL

Dynamic Benchmarking of Masked Language Models on Temporal Concept Drift with Multiple Views

Katerina Margatina, Shuai Wang, Yogarshi Vyas, Neha Anna John, Yassine Benajiba, and Miguel Ballesteros

In Proceedings of the European Meeting of the Association for Computational Linguistics (EACL) 2023

Abstract arXiv HTML PDF Blog TL;DR Code Poster Slides

Temporal concept drift refers to the problem of data changing over time. In the field of NLP, that would entail that language (e.g. new expressions, meaning shifts) and factual knowledge (e.g. new concepts, updated facts) evolve over time. Focusing on the latter, we benchmark 11 pretrained masked language models (MLMs) on a series of tests designed to evaluate the effect of temporal concept drift, as it is crucial that widely used language models remain up-to-date with the ever-evolving factual updates of the real world. Specifically, we provide a holistic framework that (1) dynamically creates temporal test sets of any time granularity (e.g. month, quarter, year) of factual data from Wikidata, (2) constructs fine-grained splits of tests (e.g. updated, new, unchanged facts) to ensure comprehensive analysis, and (3) evaluates MLMs in three distinct ways (single-token probing, multi-token generation, MLM scoring). In contrast to prior work, our framework aims to unveil how robust an MLM is over time and thus to provide a signal in case it has become outdated, by leveraging multiple views of evaluation.
EACL

Investigating Multi-source Active Learning for Natural Language Inference

Ard Snijders, Douwe Kiela, and Katerina Margatina

In Proceedings of the European Meeting of the Association for Computational Linguistics (EACL) 2023

Abstract arXiv HTML PDF TL;DR Code Poster

In recent years, active learning has been successfully applied to an array of NLP tasks. However, prior work often assumes that training and test data are drawn from the same distribution. This is problematic, as in real-life settings data may stem from several sources of varying relevance and quality. We show that four popular active learning schemes fail to outperform random selection when applied to unlabelled pools comprised of multiple data sources on the task of natural language inference. We reveal that uncertainty-based strategies perform poorly due to the acquisition of collective outliers, i.e., hard-to-learn instances that hamper learning and generalisation. When outliers are removed, strategies are found to recover and outperform random baselines. In further analysis, we find that collective outliers vary in form between sources, and show that hard-to-learn data is not always categorically harmful. Lastly, we leverage dataset cartography to introduce difficulty-stratified testing and find that different strategies are affected differently by example learnability and difficulty.

2022

DADC@NAACL

Proceedings of the First Workshop on Dynamic Adversarial Data Collection

Max Bartolo, Hannah Kirk, Pedro Rodriguez, Katerina Margatina, Tristan Thrush, Robin Jia, Pontus Stenetorp, Adina Williams, and Douwe Kiela

2022

HTML PDF
ACL

On the Importance of Effectively Adapting Pretrained Language Models for Active Learning

Katerina Margatina, Loïc Barrault, and Nikolaos Aletras

In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL) 2022

Abstract arXiv HTML PDF TL;DR Code Poster Slides

Recent Active Learning (AL) approaches in Natural Language Processing (NLP) proposed using off-the-shelf pretrained language models (LMs). In this paper, we argue that these LMs are not adapted effectively to the downstream task during AL and we explore ways to address this issue. We suggest to first adapt the pretrained LM to the target task by continuing training with all the available unlabeled data and then use it for AL. We also propose a simple yet effective fine-tuning method to ensure that the adapted LM is properly trained in both low and high resource scenarios during AL. Our experiments demonstrate that our approach provides substantial data efficiency improvements compared to the standard fine-tuning approach, suggesting that a poor training strategy can be catastrophic for AL.
ACL
🌍 Theme 🌎

Challenges and Strategies in Cross-Cultural NLP

Daniel Hershcovich, Stella Frank, Heather Lent, Miryam Lhoneux, Mostafa Abdou, Stephanie Brandl, Emanuele Bugliarello, Laura Cabello Piqueras, Ilias Chalkidis, Ruixiang Cui, Constanza Fierro, Katerina Margatina, Phillip Rust, and Anders Søgaard

In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL) 2022

Abstract arXiv HTML PDF TL;DR

Various efforts in the Natural Language Processing (NLP) community have been made to accommodate linguistic diversity and serve speakers of many different languages. However, it is important to acknowledge that speakers and the content they produce and require, vary not just by language, but also by culture. Although language and culture are tightly linked, there are important differences. Analogous to cross-lingual and multilingual NLP, cross-cultural and multicultural NLP considers these differences in order to better serve users of NLP systems. We propose a principled framework to frame these efforts, and survey existing and potential strategies.

2021

EMNLP
✨ Oral ✨

Active Learning by Acquiring Contrastive Examples

Katerina Margatina, Giorgos Vernikos, Loïc Barrault, and Nikolaos Aletras

In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) 2021

Abstract arXiv HTML PDF TL;DR Code Poster Slides

Common acquisition functions for active learning use either uncertainty or diversity sampling, aiming to select difficult and diverse data points from the pool of unlabeled data, respectively. In this work, leveraging the best of both worlds, we propose an acquisition function that opts for selecting \textitcontrastive examples, i.e. data points that are similar in the model feature space and yet the model outputs maximally different predictive likelihoods. We compare our approach, CAL (Contrastive Active Learning), with a diverse set of acquisition functions in four natural language understanding tasks and seven datasets. Our experiments show that CAL performs consistently better or equal than the best performing baseline across all tasks, on both in-domain and out-of-domain data. We also conduct an extensive ablation study of our method and we further analyze all actively acquired datasets showing that CAL achieves a better trade-off between uncertainty and diversity compared to other strategies.
EMNLP

Frustratingly Simple Alternatives to Masked Language Modeling

Atsuki Yamaguchi, George Chrysostomou, Katerina Margatina, and Nikolaos Aletras

In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) 2021

Abstract arXiv HTML PDF TL;DR Code

Masked language modeling (MLM), a self-supervised pretraining objective, is widely used in natural language processing for learning text representations. MLM trains a model to predict a random sample of input tokens that have been replaced by a [MASK] placeholder in a multi-class setting over the entire vocabulary. When pretraining, it is common to use alongside MLM other auxiliary objectives on the token or sequence level to improve downstream performance (e.g. next sentence prediction). However, no previous work so far has attempted in examining whether other simpler linguistically intuitive or not objectives can be used standalone as main pretraining objectives. In this paper, we explore five simple pretraining objectives based on token-level classification tasks as replacements of MLM. Empirical results on GLUE and SQuAD show that our proposed methods achieve comparable or better performance to MLM using a BERT-BASE architecture. We further validate our methods using smaller models, showing that pretraining a model with 41% of the BERT-BASE’s parameters, BERT-MEDIUM results in only a 1% drop in GLUE scores with our best objective.

2020

EMNLP-Findings

Domain Adversarial Fine-Tuning as an Effective Regularizer

Giorgos Vernikos, Katerina Margatina, Alexandra Chronopoulou, and Ion Androutsopoulos

In Findings of the Association for Computational Linguistics (EMNLP) 2020

Abstract HTML PDF TL;DR Code

In Natural Language Processing (NLP), pretrained language models (LMs) that are transferred to downstream tasks have been recently shown to achieve state-of-the-art results. However, standard fine-tuning can degrade the general-domain representations captured during pretraining. To address this issue, we introduce a new regularization technique, AFTER; domain Adversarial Fine-Tuning as an Effective Regularizer. Specifically, we complement the task-specific loss used during fine-tuning with an adversarial objective. This additional loss term is related to an adversarial classifier, that aims to discriminate between in-domain and out-of-domain text representations. Indomain refers to the labeled dataset of the task at hand while out-of-domain refers to unlabeled data from a different domain. Intuitively, the adversarial classifier acts as a regularize which prevents the model from overfitting to the task-specific domain. Empirical results on various natural language understanding tasks show that AFTER leads to improved performance compared to standard fine-tuning.

2019

Thesis

Transfer Learning and Attention-based Conditioning Methods for Natural Language Processing

Katerina Margatina

2019

PDF Slides
ACL

Attention-based Conditioning Methods for External Knowledge Integration

Katerina Margatina, Christos Baziotis, and Alexandros Potamianos

In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL) 2019

Abstract HTML PDF Code Poster

In this paper, we present a novel approach for incorporating external knowledge in Recurrent Neural Networks (RNNs). We propose the integration of lexicon features into the self-attention mechanism of RNN-based architectures. This form of conditioning on the attention distribution, enforces the contribution of the most salient words for the task at hand. We introduce three methods, namely attentional concatenation, feature-based gating and affine transformation. Experiments on six benchmark datasets show the effectiveness of our methods. Attentional feature-based gating yields consistent performance improvement across tasks. Our approach is implemented as a simple add-on module for RNN-based models with minimal computational overhead and can be adapted to any deep neural architecture.

2018

WASSA@EMNLP

NTUA-SLP at IEST 2018: Ensemble of Neural Transfer Methods for Implicit Emotion Classification

Alexandra* Chronopoulou, Katerina* Margatina, Christos Baziotis, and Alexandros Potamianos

In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (WASSA) 2018

Abstract HTML PDF Code Slides

In this paper we present our approach to tackle the Implicit Emotion Shared Task (IEST) organized as part of WASSA 2018 at EMNLP 2018. Given a tweet, from which a certain word has been removed, we are asked to predict the emotion of the missing word. In this work, we experiment with neural Transfer Learning (TL) methods. Our models are based on LSTM networks, augmented with a self-attention mechanism. We use the weights of various pretrained models, for initializing specific layers of our networks. We leverage a big collection of unlabeled Twitter messages, for pretraining word2vec word embeddings and a set of diverse language models. Moreover, we utilize a sentiment analysis dataset for pretraining a model, which encodes emotion related information. The submitted model consists of an ensemble of the aforementioned TL models. Our team ranked 3rd out of 30 participants, achieving an F1 score of 0.703.