Now in its 25th edition, the Conference and Labs of the Evaluation Forum (CLEF) continues the highly successful series of evaluation campaigns of the Cross-Language Evaluation Forum, which ran between 2000 and 2009 and established a framework for the systematic evaluation of information access systems, primarily through experimentation on shared tasks.

As a leading annual international conference, CLEF uniquely combines evaluation laboratories and workshops with research presentations, panels, posters and demo sessions.

In 2024, CLEF will take place from 9 to 12 September at the University of Grenoble Alpes, France.

Accepted Labs

  • BioASQ: Large-scale Biomedical Semantic Indexing and Question Answering
  • CheckThat!: Predicting Check-Worthiness, Subjectivity, Persuasion, Roles and Authorities
  • ELOQUENT: Evaluating Generative Language Models
  • eRisk 2024: Early Risk Prediction on the Internet
  • EXIST: sEXism Identification in Social neTworks
  • iDPP: Intelligent Disease Progression Prediction
  • ImageCLEF: Multimodal Challenge in CLEF
  • JOKER Lab: Automatic Wordplay Analysis
  • LifeCLEF: species identification and prediction
  • LongEval: Longitudinal Evaluation of Model Performance
  • PAN: Lab on Stylometry and Digital Text Forensics
  • qCLEF: Quantum Computing at CLEF
  • SimpleText Lab: Automatic Simplification of Scientific Texts
  • Touché: Argumentation Systems
  • Monster CLEF: CLEF meta-challenge; (almost) all tasks in one!

BioASQ: Large-scale Biomedical Semantic Indexing and Question Answering


The aim of the BioASQ Lab is to push the research frontier towards systems that use the diverse and voluminous information available online to respond directly to the information needs of biomedical scientists.


  • Task 1 - b: Biomedical Semantic Question Answering

    Benchmark datasets of biomedical questions, in English, along with gold standard (reference) answers constructed by a team of biomedical experts. The participants have to respond with relevant articles and snippets from designated resources, as well as exact and "ideal" answers.

  • Task 2 - Synergy: Question Answering for developing problems

    Biomedical experts pose unanswered questions for developing problems, such as COVID-19, receive the responses provided by the participating systems, and provide feedback, together with updated questions in an iterative procedure that aims to facilitate the incremental understanding of developing problems in biomedicine and public health.

  • Task 3 - MultiCardioNER: Multiple clinical entity detection in multilingual medical content

    A shared task on the automatic detection and normalization of mentions of four clinical entity types, namely diseases, symptoms, procedures and medications, in cardiology clinical case documents in Spanish, English, Italian and Dutch.

  • Task 4 - BioNNE: Nested NER in Russian and English

    A shared task on nested named entity recognition (NER) in PubMed abstracts in Russian and English. The train/dev datasets include annotated mentions of disorders, anatomical structures, chemicals, diagnostic procedures, and biological functions. Participants are encouraged to apply cross-language (Russian to English) and cross-domain techniques.


  • Anastasia Krithara (NCSR "Demokritos", Greece)
  • Anastasios Nentidis (NCSR "Demokritos", Greece)
  • Martin Krallinger (Barcelona Supercomputing Center, Spain)
  • Elena Tutubalina (Kazan Federal University, Russia & Artificial Intelligence Research Institute, Russia)
  • George Paliouras (NCSR "Demokritos", Greece)


CheckThat! Lab on Checkworthiness, Subjectivity, Persuasion, Roles, Authorities and Adversarial Robustness

The seventh edition of the CheckThat! lab at CLEF provides a diverse collection of challenges to the research community interested in developing technology to support and understand the journalistic verification process. The tasks range from core verification tasks, such as assessing the check-worthiness of a text, to understanding the strategies used to influence the audience and identifying the stance of relevant actors on questionable affairs. For the first time, the lab examines the impact of data purposely crafted to mislead classifiers for several relevant tasks.


  • Task 1 - Check-worthiness estimation

    Assess whether a statement, sourced from either a tweet or a political debate, warrants fact-checking.

  • Task 2 - Subjectivity

    Given a sentence from a news article, determine whether it is subjective or objective.

  • Task 3 - Persuasion Techniques

    Given a news article and a list of 23 persuasion techniques organized into a 2-tier taxonomy, including logical fallacies and emotional manipulation techniques that might be used to support flawed argumentation, identify the spans of texts in which each technique occurs.

  • Task 4 - Detecting hero, villain, and victim from memes

    Determine the roles of entities within memes, categorizing them as “hero”, “villain”, “victim”, or “other” through a multi-class classification approach that considers the systematic modeling of multimodal semiotics.

  • Task 5 - Authority Evidence for Rumor Verification

    Given a rumor expressed in a tweet and a set of authorities for that rumor, retrieve up to 5 evidence tweets from the authorities' timelines, and determine if the rumor is supported, refuted, or unverifiable according to the evidence.

  • Task 6 - Robustness of Credibility Assessment with Adversarial Examples

    The task is realised in five domains: style-based news bias assessment (HN), propaganda detection (PR), fact checking (FC), rumour detection (RD) and COVID-19 misinformation detection (C19). For each domain, the participants are provided with three victim models trained for the corresponding binary classification task, as well as a collection of 400 text fragments. Their aim is to prepare adversarial examples which preserve the meaning of the original examples but are labelled differently by the classifiers.
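As a concrete illustration of the Task 6 setting, the sketch below tries meaning-preserving word substitutions until a victim classifier changes its label. The victim model and the synonym table are hypothetical stand-ins invented for this example; the lab provides the real trained victim models and text fragments.

```python
# Sketch of the adversarial-example setting: produce a meaning-preserving
# rewrite that a victim classifier labels differently. SYNONYMS and `victim`
# below are toy stand-ins, not part of the lab's actual resources.
SYNONYMS = {"huge": "massive", "lie": "falsehood"}

def attack(text, victim, max_edits=2):
    original = victim(text)
    edits = 0
    for word, sub in SYNONYMS.items():
        if edits >= max_edits or word not in text:
            continue
        text = text.replace(word, sub)   # meaning-preserving swap
        edits += 1
        if victim(text) != original:
            return text                  # label flipped: attack succeeded
    return None                          # failed within the edit budget

# Toy victim: flags any text containing the literal word "lie".
victim = lambda t: "lie" in t
adv = attack("that is a huge lie", victim)   # -> "that is a massive falsehood"
```

A real submission would of course attack the provided victim models with stronger, semantics-aware perturbation strategies.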


  • Alberto Barrón-Cedeño
  • Firoj Alam
  • Tanmoy Chakraborty
  • Tamer Elsayed
  • Preslav Nakov
  • Piotr Przybyla
  • Julia Maria Struß


ELOQUENT shared tasks for evaluation of generative language model quality

ELOQUENT provides a set of tasks for evaluating the quality of generative language models.


  • 1 - Topical competence

    This task will test and verify a model's understanding of an application domain and specific topic of interest.

  • 2 - Veracity and hallucination

    This task will test how the truthfulness or veracity of automatically generated text can be assessed.

  • 3 - Robustness

    This task will test the capability of a model to handle input variation -- e.g. dialectal, sociolectal, and cross-cultural -- as represented by a set of equivalent but non-identical varieties of input prompts.

  • 4 - Voight-Kampff

    This task will explore whether automatically-generated text can be distinguished from human-authored text. This task will be organised in collaboration with the PAN lab at CLEF.


  • Jussi Karlgren, SiloGen
  • Luise Dürlich, RISE ICT
  • Evangelia Gogoulou, RISE ICT
  • Liane Guillou, RISE ICT
  • Joakim Nivre, RISE ICT
  • Magnus Sahlgren, AI Sweden
  • Aarne Talman, SiloGen


eRisk: Early Risk Detection on the Internet

eRisk explores the evaluation methodology, effectiveness metrics and practical applications (particularly those related to health and safety) of early risk detection on the Internet.


  • Task 1: Search for symptoms of depression

    It consists of ranking sentences from a collection of user writings according to their relevance to a depression symptom. The participants will have to provide rankings for the 21 symptoms of depression from the BDI Questionnaire.

  • Task 2: Early Detection of Signs of Anorexia

    The challenge consists of sequentially processing pieces of evidence and detecting early traces of anorexia as soon as possible.

  • Task 3: Measuring the severity of the signs of Eating Disorders

    The task consists of estimating the level of features associated with a diagnosis of eating disorders from a thread of user submissions. For each user, the participants will be given a history of postings, and they will have to fill in a standard eating disorder questionnaire.
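The sequential setup described for Task 2 can be sketched as a streaming decision loop. The per-post scorer and the threshold below are hypothetical stand-ins; eRisk defines its own data format and time-aware effectiveness metrics (e.g. ERDE) for the real evaluation.

```python
# Minimal sketch of early risk detection over a stream of user posts.
def early_risk_decision(posts, score_fn, threshold):
    """Process posts in order and alert as soon as the running average
    of per-post risk scores crosses the threshold."""
    running = 0.0
    for i, post in enumerate(posts, start=1):
        running += (score_fn(post) - running) / i  # incremental mean
        if running >= threshold:
            return "alert", i        # decision plus delay (posts consumed)
    return "no-alert", len(posts)

# Toy usage with a keyword lookup standing in for a trained classifier.
score = lambda text: 1.0 if "not eating" in text else 0.0
decision, delay = early_risk_decision(
    ["feeling fine today", "I am not eating anymore", "skipped lunch"],
    score, threshold=0.4)            # -> ("alert", 2)
```

The delay returned alongside the decision matters: early-risk metrics penalize correct decisions that come too late.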


  • Javier Parapar, Universidade da Coruña
  • Patricia Martín-Rodilla, Universidade da Coruña
  • David E. Losada, Universidade de Santiago de Compostela
  • Fabio Crestani, University of Lugano


EXIST: sEXism Identification in Social neTworks


EXIST aims to foster the automatic identification of sexism, in a broad sense, in social networks, covering both textual content (tweets) and memes.


  • Task 1 - Sexism Identification

    The first task is a binary classification. The systems have to decide whether or not a given tweet contains or describes sexist expressions or behaviors (i.e., it is sexist itself, describes a sexist situation or criticizes a sexist behavior).

  • Task 2 - Source Intention

    This task aims to categorize the sexist messages according to the intention of the author in one of the following categories: (i) direct sexist message, (ii) reported sexist message and (iii) judgemental message.

  • Task 3 - Sexism Categorization

    The third task is a multiclass task that aims to categorize the sexist messages according to the type or types of sexism they contain (according to the categorization proposed by experts and that takes into account the different facets of women that are undermined): (i) ideological and inequality, (ii) stereotyping and dominance, (iii) objectification, (iv) sexual violence and (v) misogyny and non-sexual violence.

  • Task 4 - Sexism Identification in Memes

    Similar to Task 1, Task 4 is a binary classification task where participants must determine whether a meme contains or describes sexist expressions or behaviors (i.e., it is sexist itself, describes a sexist situation or criticizes a sexist behavior).

  • Task 5 - Sexism Categorization in Memes

    Finally, Task 5 addresses the problem of categorizing a sexist meme according to the type of sexism it contains: (i) ideological and inequality, (ii) stereotyping and dominance, (iii) objectification, (iv) sexual violence and (v) misogyny and non-sexual violence.


  • Laura Plaza
  • Jorge Carrillo-de-Albornoz
  • Julio Gonzalo
  • Enrique Amigó
  • Damiano Spina
  • Paolo Rosso
  • Roser Morante


iDPP@CLEF: Intelligent Disease Progression Prediction at CLEF


Amyotrophic Lateral Sclerosis (ALS) and Multiple Sclerosis (MS) are chronic diseases characterized by progressive or alternating impairment of neurological functions (motor, sensory, visual, cognitive). Patients have to manage alternating periods in hospital and care at home, experiencing constant uncertainty regarding the timing of the acute phases of the disease and facing a considerable psychological and economic burden that also involves their caregivers. Clinicians, on the other hand, need tools able to support them in all phases of patient treatment, suggest personalized therapeutic decisions, and indicate urgently needed interventions. The goal of iDPP@CLEF is to design and develop an evaluation infrastructure for AI algorithms able to: 1) better describe disease mechanisms; 2) stratify patients according to their phenotype, assessed over the whole course of the disease; 3) predict disease progression in a probabilistic, time-dependent fashion.


  • Task 1 - Predicting ALSFRS-R score from sensor data (ALS)

    It will focus on predicting the ALSFRS-R score (ALS Functional Rating Scale - Revised), assigned by medical doctors roughly every three months, from the sensor data collected via the app. The ALSFRS-R score is a somewhat “subjective” evaluation performed by a medical doctor, and this task will help answer a currently open question in the research community, i.e. whether it can be derived from objective factors.

  • Task 2 - Predicting patient self-assessment score from sensor data (ALS)

    It will focus on predicting the self-assessment score assigned by patients from the sensor data collected via the app. If the self-assessment, which patients perform more frequently than the roughly quarterly assessment by medical doctors, can be reliably predicted from sensor and app data, we can imagine a proactive application that monitors the sensor data and alerts the patient when an assessment is needed.

  • Task 3 - Predicting relapses from EDSS sub-scores and environmental data (MS)

    It will focus on predicting a relapse using environmental data and EDSS (Expanded Disability Status Scale) sub-scores. This task will allow us to assess whether exposure to different pollutants is a useful variable in predicting a relapse.


  • Helena Aidos, University of Lisbon, Portugal
  • Roberto Bergamaschi, University of Pavia, Italy
  • Paola Cavalla, Città della Salute e della Scienza, Turin, Italy
  • Adriano Chio’, University of Turin, Italy
  • Arianna Dagliati, University of Pavia, Italy
  • Barbara Di Camillo, University of Padua, Italy
  • Mamede Alves de Carvalho, University of Lisbon, Portugal
  • Nicola Ferro, University of Padua, Italy
  • Piero Fariselli, University of Turin, Italy
  • Jose Manuel García Dominguez, Gregorio Marañon Hospital in Madrid, Spain
  • Sara C. Madeira, University of Lisbon, Portugal
  • Eleonora Tavazzi, IRCCS Foundation C. Mondino in Pavia, Italy


ImageCLEF: Multimedia Retrieval in CLEF


ImageCLEF 2024 is aimed at evaluating technologies for the annotation, indexing, classification and retrieval of multimodal data. Its main objective is to provide access to large collections of multimodal data for multiple usage scenarios and domains. Building on the experience of the last four successful editions, ImageCLEF 2024 will continue to address a diversity of applications, namely medical, social media and Internet, and recommendation, giving participants the opportunity to deal with interdisciplinary approaches and domains.


  • Task 1 - ImageCLEFmedical

    This task, currently in its 20th edition, will continue all the medical sub-tasks of 2023, namely: (i) the caption task with medical concept detection and caption prediction, (ii) the GAN task on synthetic medical images generated with GANs, (iii) the MEDVQA-GI task on generating medical images from text input, and (iv) the MEDIQA task with a new use case on multimodal dermatology response generation.

  • Task 2 - Image Retrieval/Generation for Arguments

    This task is in its third edition, having run at the Touché workshop since 2022 and now, for the first time, as a joint task of Touché and ImageCLEF. This year, the task is, given a set of arguments, to return for each argument several images that help to convey the argument's premise; that is, suitable images could depict what is described in the argument.

  • Task 3 - ImageCLEFrecommending

    This task focuses on content recommendation for cultural heritage content. Despite current advances in content-based recommendation systems, there is limited understanding of how well they perform and how relevant they are to end-users. This task aims to fill this gap by benchmarking different recommendation systems and methods.

  • Task 4 - ImageCLEFtoPicto

    The objective of ImageCLEFtoPicto is to translate natural language, either (i) text or (ii) speech, into pictograms understandable by the users, in this case people with language impairments. Pictogram generation is an emerging and significant domain in natural language processing, with multiple potential applications: enabling communication with individuals who have disabilities, aiding in medical settings for individuals who do not speak the language of a country, and enhancing user understanding in the service industry.


  • Bogdan Ionescu, National University of Science and Technology Politehnica Bucharest, Romania
  • Henning Müller, University of Applied Sciences Western Switzerland, Sierre, Switzerland
  • Ana-Maria Drăgulinescu, National University of Science and Technology Politehnica Bucharest, Romania


JOKER Lab: Automatic Humour Analysis

The JOKER Lab aims to foster research on automated processing of verbal humour, including tasks such as retrieval, classification, interpretation, generation, and translation.


  • 1 - Humour-aware information retrieval
  • 2 - Fine-grained sentiment analysis of short humorous texts
  • 3 - Humour classification according to genre and technique
  • 4 - Wordplay translation


  • Liana Ermakova, Université de Bretagne Occidentale, HCTI, France
  • Tristan Miller, Department of Computer Science, University of Manitoba, Canada / Austrian Research Institute for Artificial Intelligence (OFAI), Vienna, Austria
  • Anne-Gwenn Bosser, École Nationale d'Ingénieurs de Brest, Lab-STICC CNRS UMR 6285, France
  • Tremaine Thomas, Université de Bretagne Occidentale, HCTI, France
  • Victor Manuel Palma Preciado, Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC), Mexico City, Mexico / Université de Bretagne Occidentale, HCTI, France
  • Grigori Sidorov, Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC), Mexico City, Mexico
  • Adam Jatowt, University of Innsbruck, Austria


LifeCLEF: species identification and prediction

LifeCLEF is one of the oldest labs of CLEF, focused on multimedia and machine learning for biodiversity monitoring.


  • 1 - Bird species identification in soundscape recordings
  • 2 - Revisiting Fungi Recognition Beyond 1-0 Cost
  • 3 - Species presence prediction based on occurrences data and high-resolution remote sensing images
  • 4 - Multi-species plant identification in vegetation plot images
  • 5 - Revisiting Snake Species Identification in Medically Important Scenarios


  • Alexis Joly (Inria, Pl@ntNet)
  • Lukáš Picek (Inria, University of West Bohemia)
  • Hervé Goëau (CIRAD, Pl@ntNet)
  • Stefan Kahl (Cornell Lab of Ornithology, Chemnitz University of Technology)
  • Christophe Botella (Inria, Pl@ntNet)
  • Henning Müller (HES-SO)


LongEval: Longitudinal Evaluation of Model Performance

The LongEval CLEF 2024 shared task continues its exploration of the temporal persistence of Information Retrieval (IR) systems and text classifiers, focusing on the evaluation of system performance degradation over time using evolving data.


  • The aim of LongEval Task 1 is to support the development of Information Retrieval systems able to cope with temporal evolution.

  • The primary aim of LongEval Task 2 is to assess the persistence of stance detection models in the dynamic landscape of social media posts.
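To make the notion of temporal persistence concrete, one simple way to quantify degradation is the relative change in an effectiveness score between a test snapshot contemporary to the training data and a later snapshot. This is only an illustration with made-up numbers; LongEval defines its own official measures.

```python
# score_t0: effectiveness on a test snapshot contemporary to the training
# data; score_t1: effectiveness on a later snapshot of the evolving data.
def relative_drop(score_t0, score_t1):
    """Relative change between snapshots; negative means degradation."""
    return (score_t1 - score_t0) / score_t0

# E.g. an nDCG of 0.42 on the contemporary snapshot falling to 0.36 later:
drop = relative_drop(0.42, 0.36)   # about -0.143, i.e. a 14% relative drop
```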


  • Rabab Alkhalifa, Queen Mary University of London, UK, Imam Abdulrahman Bin Faisal University, SA
  • Hsuvas Borkakoty, Cardiff University, UK
  • Romain Deveaud, Qwant, France
  • Alaa El-Ebshihy, Research Studios Austria, Data Science Studio, Vienna, AT, TU Wien, Austria
  • Luis Espinosa-Anke, Cardiff University, UK, AMPLYFI, UK
  • Tobias Fink, Research Studios Austria, Data Science Studio, Vienna, AT, TU Wien, Austria
  • David Iommi, Research Studios Austria, Data Science Studio, Vienna, AT
  • Gabriela Gonzalez-Saez, Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, Grenoble, France
  • Petra Galuscakova, University of Stavanger, Norway
  • Lorraine Goeuriot, Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, Grenoble, France
  • Elena Kochkina, JPMorgan AI Research
  • Maria Liakata, Queen Mary University of London, UK, Alan Turing Institute, UK, University of Warwick, UK
  • Xiaomo Liu, JPMorgan AI Research
  • Harish Tayyar Madabushi, University of Bath, UK
  • Pablo Medina-Alias, University of Bath, UK
  • Philippe Mulhem, Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, Grenoble, France
  • Florina Piroi, Research Studios Austria, Data Science Studio, Vienna, AT, TU Wien, Austria
  • Martin Popel, Charles University, Prague, Czech Republic
  • Christophe Servan, Qwant, France, Paris-Saclay University, CNRS, LISN, France
  • Arkaitz Zubiaga, Queen Mary University of London, UK


PAN Lab on Digital Text Forensics and Stylometry


PAN is a series of scientific events and shared tasks on digital text forensics and stylometry whose goal is to advance the state of the art and provide for an objective evaluation on newly developed benchmark datasets in those areas.


  • 1 - Multi-Author Writing Style Analysis

    Given an English document, determine at which paragraphs the author changes. Examples vary in difficulty from easy to hard depending on topical homogeneity of the paragraphs.

  • 2 - Multilingual Text Detoxification

    Given a toxic piece of text, rewrite it in a non-toxic way while preserving the main content as much as possible. Texts are provided in 7 languages.

  • 3 - Oppositional Thinking Analysis

    Given an English or Spanish online message, determine whether it is a conspiracy theory or critical thinking. In the former case, find the core elements of the conspiracy narrative.

  • 4 - Generative AI Authorship Verification

    Given a document, determine if the author is a human or a language model. In collaboration with the ELOQUENT lab.


  • Martin Potthast, Leipzig University and ScaDS.AI, Germany
  • Paolo Rosso, Universitat Politècnica de València, Spain
  • Efstathios Stamatatos, University of the Aegean, Greece
  • Benno Stein, Bauhaus-Universität Weimar, Germany


qCLEF: QuantumCLEF

Quantum Computing (QC) is a rapidly growing field, involving an increasing number of researchers and practitioners from different backgrounds who develop new methods that leverage quantum computers to perform faster computations. QuantumCLEF provides an evaluation infrastructure to design and develop QC algorithms, in particular Quantum Annealing (QA) algorithms, for Information Retrieval and Recommender Systems, in order to:

  • a) identify new problem formulations to efficiently and effectively employ quantum annealers for their resolution
  • b) compare the performance of QA approaches with that of their non-quantum counterparts running on traditional hardware


  • Task 1 - Feature Selection

    This task focuses on finding the most relevant subset of features to train a Learning-To-Rank model. It relies on the ISTELLA and LETOR4.0 datasets and targets LambdaMART or a content-based Recommender System.

  • Task 2 - Clustering

    This task focuses on clustering of embeddings, e.g. BERT embeddings or user-item embeddings.
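For readers unfamiliar with Quantum Annealing, combinatorial problems such as Task 1's feature selection are typically cast as a QUBO (Quadratic Unconstrained Binary Optimization), which the annealer then samples. The sketch below is a toy QUBO with made-up relevance and redundancy numbers, solved by classical brute force for clarity; real submissions would sample the QUBO on QA hardware.

```python
import itertools
import numpy as np

# Toy QUBO for feature selection: reward per-feature relevance on the
# diagonal, penalize redundancy between feature pairs off the diagonal.
relevance  = np.array([0.9, 0.8, 0.3, 0.2])
redundancy = np.array([[0., .7, .1, 0.],
                       [.7, 0., 0., .1],
                       [.1, 0., 0., .5],
                       [0., .1, .5, 0.]])
Q = np.diag(-relevance) + redundancy / 2   # minimize x^T Q x over x in {0,1}^n

# Brute force over all 2^4 selection masks (a quantum annealer would sample
# low-energy states of Q instead of enumerating them).
best = min((np.array(bits) for bits in itertools.product([0, 1], repeat=4)),
           key=lambda x: float(x @ Q @ x))
# `best` is the selection mask balancing relevance against redundancy.
```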


  • Paolo Cremonesi, Politecnico di Milano, Italy
  • Maurizio Ferrari Da Crema, Politecnico di Milano, Italy
  • Nicola Ferro, University of Padua, Italy
  • Andrea Pasin, University of Padua, Italy


SimpleText Lab: Improving Access to Scientific Texts for Everyone

The SimpleText Lab addresses the technical and evaluation challenges of making scientific information accessible to a wide audience, from students to experts. We provide appropriate reusable data and benchmarks for scientific text summarization and simplification.


  • 1 - Content Selection

    Retrieving passages to include in a simplified summary

  • 2 - Complexity Spotting

    Identifying and explaining difficult concepts

  • 3 - Text Simplification

    Simplify Scientific Text

  • 4 - SOTA?

    Tracking the State-of-the-Art in Scholarly Publications (Task, Dataset, Metric, Score)


  • Liana Ermakova, HCTI, Université de Bretagne Occidentale (Brest, France)
  • Jaap Kamps, University of Amsterdam (Amsterdam, The Netherlands)
  • Eric SanJuan, Laboratoire Informatique d’Avignon, Avignon Université (Avignon, France)
  • Stéphane Huet, Laboratoire Informatique d’Avignon, Avignon Université (Avignon, France)
  • Hosein Azarbonyad, Elsevier (Amsterdam, The Netherlands)
  • Federica Vezzani, University of Padova (Padova, Italy)
  • Giorgio Maria Di Nunzio, University of Padova (Padova, Italy)
  • Jennifer D’Souza, TIB Leibniz Information Centre for Science and Technology (Hannover, Germany)
  • Salomon Kabongo, L3S Research Center, Leibniz University of Hannover (Hannover, Germany)
  • Hamed Babaei Giglou, TIB Leibniz Information Centre for Science and Technology (Hannover, Germany)
  • Sören Auer, TIB Leibniz Information Centre for Science and Technology (Hannover, Germany)
  • Yue Zhang, Technische Universität Berlin (Germany)


Touché: Argumentation Systems

Touché is a series of scientific events and shared tasks to foster the development of technologies that support people in decision-making and opinion-forming and to improve our understanding of these processes.


  • 1 - Human Value Detection

    Given a text, for each sentence, detect which human values the sentence refers to and whether this reference (partially) attains or (partially) constrains the value.

  • 2 - Ideology and Power Identification in Parliamentary Debates

    Given a parliamentary speech in one of several languages, identify the ideology of the speaker's party and identify whether the speaker's party is currently governing or in opposition.

  • 3 - Image Retrieval for Arguments

    Given an argument, retrieve or generate images that help to convey the argument’s premise.


  • Johannes Kiesel, Bauhaus-Universität Weimar
  • Çağrı Çöltekin, University of Tübingen
  • Maximilian Heinrich, Bauhaus-Universität Weimar
  • Martin Potthast, Leipzig University
  • Henning Wachsmuth, Leibniz University Hannover
  • Benno Stein, Bauhaus-Universität Weimar


mCLEF: Monster CLEF

Our goal is to systematically explore how well a given LLM performs across a number of different real-world application challenges, compared with algorithms specifically trained for each task, while avoiding contamination issues. The MonsterCLEF lab is organized as a meta-challenge across a selection of tasks chosen from the other labs running at CLEF 2024; participants are asked to develop a generative AI/LLM-based system that will be run against all the selected tasks with no or minimal adaptation. For each targeted task we rely on the same dataset, experimental setting, and evaluation measures adopted for that specific task. In this way, the LLM-based systems participating in the MonsterCLEF lab are directly comparable with the specialized systems participating in each targeted task.


    MetaTask: targets tasks from the following labs: CheckThat!, ELOQUENT, EXIST, ImageCLEF, LongEval, PAN, and Touché


  • Nicola Ferro, University of Padua, Italy
  • Julio Gonzalo, UNED, Spain
  • Jussi Karlgren, Silo AI, Sweden
  • Henning Müller, HES-SO, Switzerland
