Labs

Now in its 25th edition, the Conference and Labs of the Evaluation Forum (CLEF) is a continuation of the very successful series of evaluation campaigns of the Cross Language Evaluation Forum (CLEF) which ran between 2000 and 2009, and established a framework of systematic evaluation of information access systems, primarily through experimentation on shared tasks.

As a leading annual international conference, CLEF uniquely combines evaluation laboratories and workshops with research presentations, panels, posters and demo sessions.

In 2024, CLEF will take place on 9-12 September at the University of Grenoble Alpes, France.

Accepted Labs

BioASQ: Large-scale Biomedical Semantic Indexing and Question Answering
CheckThat!: Predicting Check-Worthiness, Subjectivity, Persuasion, Roles and Authorities
ELOQUENT: Evaluating Generative Language Models
eRisk 2024: Early Risk Prediction on the Internet
EXIST: sEXism Identification in Social neTworks
iDPP: Intelligent Disease Progression Prediction
ImageCLEF: Multimodal Challenge in CLEF
JOKER Lab: Automatic Wordplay Analysis
LifeCLEF: species identification and prediction
LongEval: Longitudinal Evaluation of Model Performance
PAN: Lab on Stylometry and Digital Text Forensics
qCLEF: Quantum Computing at CLEF
SimpleText Lab: Automatic Simplification of Scientific Texts
Touché: Argumentation Systems
Monster CLEF: CLEF meta-challenge; (almost) all tasks in one!

BioASQ: Large-scale Biomedical Semantic Indexing and Question Answering

Homepage • @BioASQ

The aim of the BioASQ Lab is to push the research frontier towards systems that use the diverse and voluminous information available online to respond directly to the information needs of biomedical scientists.

Tasks:

Task 1 - b: Biomedical Semantic Question Answering
The task provides benchmark datasets of biomedical questions in English, along with gold standard (reference) answers constructed by a team of biomedical experts. Participants have to respond with relevant articles and snippets from designated resources, as well as exact and "ideal" answers.
Task 2 - Synergy: Question Answering for developing problems
Biomedical experts pose unanswered questions for developing problems, such as COVID-19, receive the responses provided by the participating systems, and provide feedback, together with updated questions in an iterative procedure that aims to facilitate the incremental understanding of developing problems in biomedicine and public health.
Task 3 - MultiCardioNER: Multiple clinical entity detection in multilingual medical content
A shared task on the automatic detection and normalization of mentions of four clinical entity types, namely diseases, symptoms, procedures and medications, in cardiology clinical case documents in Spanish, English, Italian and Dutch.
Task 4 - BioNNE: Nested NER in Russian and English
A shared task on nested named entity recognition (NER) in PubMed abstracts in Russian and English. The train/dev datasets include annotated mentions of disorders, anatomical structures, chemicals, diagnostic procedures, and biological functions. Participants are encouraged to apply cross-language (Russian to English) and cross-domain techniques.

Organizers

Anastasia Krithara (NCSR "Demokritos", Greece)
Anastasios Nentidis (NCSR "Demokritos", Greece)
Martin Krallinger (Barcelona Supercomputing Center, Spain)
Elena Tutubalina (Kazan Federal University, Russia & Artificial Intelligence Research Institute, Russia)
George Paliouras (NCSR "Demokritos", Greece)

Contact

tasosnent@iit.demokritos.gr • akrithara@iit.demokritos.gr • salvador.limalopez@bsc.es • eulalia.farre@bsc.es • lgasco@bsc.es • martin.krallinger@bsc.es • tutubalinaev@gmail.com • paliourg@iit.demokritos.gr

CheckThat! Lab on Checkworthiness, Subjectivity, Persuasion, Roles, Authorities and Adversarial Robustness

Homepage

The seventh edition of the CheckThat! lab at CLEF provides a diverse collection of challenges to the research community interested in developing technology to support and understand the journalistic verification process. The tasks range from core verification tasks, such as assessing the check-worthiness of a text, to understanding the strategies used to influence the audience and identifying the stance of relevant characters on questionable affairs. For the first time, the lab looks at the impact of data purposefully shaped to deceive classifiers for different relevant tasks.

Tasks:

Task 1 - Check-worthiness estimation
Assess whether a statement, sourced from either a tweet or a political debate, warrants fact-checking.
Task 2 - Subjectivity
Given a sentence from a news article, determine whether it is subjective or objective.
Task 3 - Persuasion Techniques
Given a news article and a list of 23 persuasion techniques organized into a 2-tier taxonomy, including logical fallacies and emotional manipulation techniques that might be used to support flawed argumentation, identify the spans of texts in which each technique occurs.
Task 4 - Detecting hero, villain, and victim from memes
Determine the roles of entities within memes, categorizing them as “hero”, “villain”, “victim”, or “other” through a multi-class classification approach that considers the systematic modeling of multimodal semiotics.
Task 5 - Authority Evidence for Rumor Verification
Given a rumor expressed in a tweet and a set of authorities for that rumor, retrieve up to 5 evidence tweets from the authorities' timelines, and determine if the rumor is supported, refuted, or unverifiable according to the evidence.
Task 6 - Robustness of Credibility Assessment with Adversarial Examples
The task is realised in five domains: style-based news bias assessment (HN), propaganda detection (PR), fact checking (FC), rumour detection (RD) and COVID-19 misinformation detection (C19). For each domain, the participants are provided with three victim models, trained for the corresponding binary classification task, as well as a collection of 400 text fragments. Their aim is to prepare adversarial examples, which preserve the meaning of the original examples but are labelled differently by the classifiers.

Organizers

Alberto Barrón-Cedeño
Firoj Alam
Tanmoy Chakraborty
Tamer Elsayed
Preslav Nakov
Piotr Przybyla
Julia Maria Struß

Contact

a.barron@unibo.it

ELOQUENT shared tasks for evaluation of generative language model quality

Homepage

ELOQUENT provides a set of tasks for evaluating the quality of generative language models.

Tasks:

1 - Topical competence
This task will test and verify a model's understanding of an application domain and specific topic of interest.
2 - Veracity and hallucination
This task will test how the truthfulness or veracity of automatically generated text can be assessed.
3 - Robustness
This task will test the capability of a model to handle input variation -- e.g. dialectal, sociolectal, and cross-cultural -- as represented by a set of equivalent but non-identical varieties of input prompts.
4 - Voight Kampff
This task will explore whether automatically-generated text can be distinguished from human-authored text. This task will be organised in collaboration with the PAN lab at CLEF.

Organizers

Jussi Karlgren, SiloGen
Luise Dürlich, RISE ICT
Evangelia Gogoulou, RISE ICT
Liane Guillou, RISE ICT
Joakim Nivre, RISE ICT
Magnus Sahlgren, AI Sweden
Aarne Talman, SiloGen

Contact

jussi.karlgren@silo.ai • aarne.talman@silo.ai • liane.guillou@ri.se • joakim.nivre@ri.se • luise.durlich@ri.se • evangelia.gogoulou@ri.se • magnus.sahlgren@ai.se

eRisk: Early Risk Detection on the Internet

Twitter

eRisk explores the evaluation methodology, effectiveness metrics and practical applications (particularly those related to health and safety) of early risk detection on the Internet.

Tasks:

Task 1: Search for symptoms of depression
The task consists of ranking sentences from a collection of user writings according to their relevance to a depression symptom. The participants will have to provide rankings for the 21 symptoms of depression from the BDI Questionnaire.
Task 2: Early Detection of Signs of Anorexia
The challenge addresses early risk detection of anorexia. It consists of sequentially processing pieces of evidence and detecting early traces of anorexia as soon as possible.
Task 3: Measuring the severity of the signs of Eating Disorders
The task consists of estimating the level of features associated with a diagnosis of eating disorders from a thread of user submissions. For each user, the participants will be given a history of postings and will have to fill in a standard eating disorder questionnaire.

Organizers

Javier Parapar, Universidade da Coruña
Patricia Martín-Rodilla, Universidade da Coruña
David E. Losada, Universidade de Santiago de Compostela
Fabio Crestani, University of Lugano

Contact

javier.parapar@udc.es • parapar@gmail.com • david.losada@usc.es

EXIST: sEXism Identification in Social neTworks

Homepage • @JCAlbornozC

EXIST addresses the identification and categorization of sexism in social networks, covering both tweets and memes.

Tasks:

Task 1 - Sexism Identification
The first task is a binary classification. The systems have to decide whether or not a given tweet contains or describes sexist expressions or behaviors (i.e., it is sexist itself, describes a sexist situation or criticizes a sexist behavior).
Task 2 - Source Intention
This task aims to categorize the sexist messages according to the intention of the author into one of the following categories: (i) direct sexist message, (ii) reported sexist message and (iii) judgemental message.
Task 3 - Sexism Categorization
The third task is a multiclass task that aims to categorize the sexist messages according to the type or types of sexism they contain (according to the categorization proposed by experts and that takes into account the different facets of women that are undermined): (i) ideological and inequality, (ii) stereotyping and dominance, (iii) objectification, (iv) sexual violence and (v) misogyny and non-sexual violence.
Task 4 - Sexism Identification in Memes
Similar to Task 1, Task 4 is a binary classification task where participants must determine whether a meme contains or describes sexist expressions or behaviors (i.e., it is sexist itself, describes a sexist situation or criticizes a sexist behavior).
Task 5 - Sexism Categorization in Memes
Finally, Task 5 addresses the problem of categorizing a sexist meme according to the type of sexism that it contains: (i) ideological and inequality, (ii) stereotyping and dominance, (iii) objectification, (iv) sexual violence and (v) misogyny and non-sexual violence.

Organizers

Laura Plaza
Jorge Carrillo-de-Albornoz
Julio Gonzalo
Enrique Amigó
Damiano Spina
Paolo Rosso
Roser Morante

Contact

jcalbornoz@lsi.uned.es

iDPP@CLEF: Intelligent Disease Progression Prediction at CLEF

Homepage • @brainteaser2020

Amyotrophic Lateral Sclerosis (ALS) and Multiple Sclerosis (MS) are chronic diseases characterized by progressive or alternate impairment of neurological functions (motor, sensory, visual, cognitive). Patients have to manage alternating periods in hospital with care at home, experiencing constant uncertainty regarding the timing of the acute phases of the disease and facing a considerable psychological and economic burden that also involves their caregivers. Clinicians, on the other hand, need tools able to support them in all the phases of the patient treatment, suggest personalized therapeutic decisions, and indicate urgently needed interventions. The goal of iDPP@CLEF is to design and develop an evaluation infrastructure for AI algorithms able to: 1) better describe disease mechanisms; 2) stratify patients according to their phenotype assessed all over the disease evolution; 3) predict disease progression in a probabilistic, time-dependent fashion.

Tasks:

Task 1 - Predicting ALSFRS-R score from sensor data (ALS)
This task will focus on predicting the ALSFRS-R score (ALS Functional Rating Scale - Revised), assigned by medical doctors roughly every three months, from the sensor data collected via the app. The ALSFRS-R score is a somewhat “subjective” evaluation performed by a medical doctor, and this task will help in answering a currently open question in the research community, i.e. whether it could be derived from objective factors.
Task 2 - Predicting patient self-assessment score from sensor data (ALS)
This task will focus on predicting the self-assessment score assigned by patients from the sensor data collected via the app. If the self-assessment, which patients perform more frequently than the assessment performed by medical doctors every three months or so, can be reliably predicted from sensor and app data, we can imagine a proactive application that monitors the sensor data and alerts the patient if an assessment is needed.
Task 3 - Predicting relapses from EDSS sub-scores and environmental data (MS)
This task will focus on predicting a relapse using environmental data and EDSS (Expanded Disability Status Scale) sub-scores. This task will allow us to assess if exposure to different pollutants is a useful variable in predicting a relapse.

Organizers

Helena Aidos, University of Lisbon, Portugal
Roberto Bergamaschi, University of Pavia, Italy
Paola Cavalla, Città della Salute e della Scienza, Turin, Italy
Adriano Chiò, University of Turin, Italy
Arianna Dagliati, University of Pavia, Italy
Barbara Di Camillo, University of Padua, Italy
Mamede Alves de Carvalho, University of Lisbon, Portugal
Nicola Ferro, University of Padua, Italy
Piero Fariselli, University of Turin, Italy
Jose Manuel García Dominguez, Gregorio Marañon Hospital in Madrid, Spain
Sara C. Madeira, University of Lisbon, Portugal
Eleonora Tavazzi, IRCCS Foundation C. Mondino in Pavia, Italy

Contact

barbara.dicamillo@unipd.it • nicola.ferro@unipd.it

ImageCLEF: Multimedia Retrieval in CLEF

Homepage • @imageclef

ImageCLEF 2024 is aimed at evaluating the technologies for annotation, indexing, classification and retrieval of multimodal data. Its main objective is to provide access to large collections of multimodal data for multiple usage scenarios and domains. Building on the experience of the last four successful editions, ImageCLEF 2024 will continue to address a diversity of applications, namely medical, social media and Internet, and recommending, giving participants the opportunity to deal with interdisciplinary approaches and domains.

Tasks:

Task 1 - ImageCLEFmedical
This task is currently in its 20th edition and will continue all the medical sub-tasks from 2023, namely: (i) caption task with medical concept detection and caption prediction, (ii) GAN task on synthetic medical images generated with GANs, (iii) MEDVQA-GI task for medical image generation based on text input, and (iv) Mediqa task with a new use-case on multimodal dermatology response generation.
Task 2 - Image Retrieval/Generation for Arguments
This task is in its third edition, having run at the Touché workshop since 2022 and now, for the first time, as a joint task of Touché and ImageCLEF. This year, given a set of arguments, the task is to return for each argument several images that help to convey the argument's premise; suitable images could, for example, depict what is described in the argument.
Task 3 - ImageCLEFrecommending
This task focuses on content recommendation for cultural heritage content. Despite current advances in content-based recommendation systems, there is limited understanding of how well these perform and how relevant they are for the final end-users. This task aims to fill this gap by benchmarking different recommendation systems and methods.
Task 4 - ImageCLEFtoPicto
The objective of ImageCLEFtoPicto is to translate natural language, from either (i) text or (ii) speech, into pictograms understandable by the users, in this case people with language impairments. Pictogram generation is an emerging and significant domain in natural language processing, with multiple potential applications: enabling communication with individuals who have disabilities, aiding in medical settings for individuals who do not speak the language of a country, and enhancing user understanding in the service industry.

Organizers

Bogdan Ionescu, National University of Science and Technology Politehnica Bucharest, Romania
Henning Müller, University of Applied Sciences Western Switzerland, Sierre, Switzerland
Ana-Maria Drăgulinescu, National University of Science and Technology Politehnica Bucharest, Romania

Contact

bogdanlapi@gmail.com • henning.mueller@hevs.ch • ana.dragulinescu@upb.ro

JOKER Lab: Automatic Humour Analysis

Homepage • Twitter

The JOKER Lab aims to foster research on automated processing of verbal humour, including tasks such as retrieval, classification, interpretation, generation, and translation.

Tasks:

1 - Humour-aware information retrieval
2 - Fine-grained sentiment analysis of short humorous texts
3 - Humour classification according to genre and technique
4 - Wordplay translation

Organizers

Liana Ermakova, Université de Bretagne Occidentale, HCTI, France
Tristan Miller, Department of Computer Science, University of Manitoba, Canada / Austrian Research Institute for Artificial Intelligence (OFAI), Vienna, Austria
Anne-Gwenn Bosser, École Nationale d'Ingénieurs de Brest, Lab-STICC CNRS UMR 6285, France
Tremaine Thomas, Université de Bretagne Occidentale, HCTI, France
Victor Manuel Palma Preciado, Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC), Mexico City, Mexico / Université de Bretagne Occidentale, HCTI, France
Grigori Sidorov, Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC), Mexico City, Mexico
Adam Jatowt, University of Innsbruck, Austria

Contact

contact@joker-project.com • joker.humour.project@gmail.com • liana.ermakova.87@gmail.com • liana.ermakova@univ-brest.fr

LifeCLEF: species identification and prediction

Homepage

LifeCLEF is one of the oldest labs of CLEF, focused on multimedia and machine learning for biodiversity monitoring.

Tasks:

1 - Bird species identification in soundscape recordings
2 - Revisiting Fungi Recognition Beyond 1-0 Cost
3 - Species presence prediction based on occurrences data and high-resolution remote sensing images
4 - Multi-species plant identification in vegetation plot images
5 - Revisiting Snake Species Identification in Medically Important Scenarios

Organizers

Alexis Joly (Inria, Pl@ntNet)
Lukáš Picek (Inria, University of West Bohemia)
Hervé Goëau (CIRAD, Pl@ntNet)
Stefan Kahl (Cornell Lab of Ornithology, Chemnitz University of Technology)
Christophe Botella (Inria, Pl@ntNet)
Henning Müller (HES-SO)

Contact

alexis.joly@inria.fr • herve.goeau@cirad.fr • lukaspicek@gmail.com • stefan.kahl@informatik.tu-chemnitz.de • christophe.botella@gmail.com

LongEval: Longitudinal Evaluation of Model Performance

Homepage

The LongEval CLEF 2024 shared task continues its exploration of the temporal persistence of Information Retrieval (IR) systems and Text Classifiers by evaluating how system performance degrades over time on evolving data.

Tasks:

The aim of LongEval Task 1 is to support the development of Information Retrieval systems able to cope with temporal evolution.
The primary aim of LongEval Task 2 is to assess the persistence of stance detection models in the dynamic landscape of social media posts.

Organizers

Rabab Alkhalifa, Queen Mary University of London, UK, Imam Abdulrahman Bin Faisal University, SA
Hsuvas Borkakoty, Cardiff University, UK
Romain Deveaud, Qwant, France
Alaa El-Ebshihy, Research Studios Austria, Data Science Studio, Vienna, AT, TU Wien, Austria
Luis Espinosa-Anke, Cardiff University, UK, AMPLYFI, UK
Tobias Fink, Research Studios Austria, Data Science Studio, Vienna, AT, TU Wien, Austria
David Iommi, Research Studios Austria, Data Science Studio, Vienna, AT
Gabriela Gonzalez-Saez, Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, Grenoble, France
Petra Galuscakova, University of Stavanger, Norway
Lorraine Goeuriot, Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, Grenoble, France
Elena Kochkina, JPMorgan AI Research
Maria Liakata, Queen Mary University of London, UK, Alan Turing Institute, UK, University of Warwick, UK
Xiaomo Liu, JPMorgan AI Research
Harish Tayyar Madabushi, University of Bath, UK
Pablo Medina-Alias, University of Bath, UK
Philippe Mulhem, Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, Grenoble, France
Florina Piroi, Research Studios Austria, Data Science Studio, Vienna, AT, TU Wien, Austria
Martin Popel, Charles University, Prague, Czech Republic
Christophe Servan, Qwant, France, Paris-Saclay University, CNRS, LISN, France
Arkaitz Zubiaga, Queen Mary University of London, UK

Contact

longeval-organizers@googlegroups.com

PAN Lab on Digital Text Forensics and Stylometry

Homepage • @webis_de

PAN is a series of scientific events and shared tasks on digital text forensics and stylometry whose goal is to advance the state of the art and provide for an objective evaluation on newly developed benchmark datasets in those areas.

Tasks:

1 - Multi-Author Writing Style Analysis
Given an English document, determine at which paragraphs the author changes. Examples vary in difficulty from easy to hard depending on topical homogeneity of the paragraphs.
2 - Multilingual Text Detoxification
Given a toxic piece of text, rewrite it in a non-toxic way while preserving the main content as much as possible. Texts are provided in 7 languages.
3 - Oppositional Thinking Analysis
Given an English or Spanish online message, determine if it is a conspiracy theory or critical thinking. In the former case, find the core elements of the conspiracy narrative.
4 - Generative AI Authorship Verification
Given a document, determine if the author is a human or a language model. In collaboration with the ELOQUENT lab.

Organizers

Martin Potthast, Leipzig University and ScaDS.AI, Germany
Paolo Rosso, Universitat Politècnica de València, Spain
Efstathios Stamatatos, University of the Aegean, Greece
Benno Stein, Bauhaus-Universität Weimar, Germany

Contact

pan@webis.de

qCLEF: QuantumCLEF

Homepage

Quantum Computing (QC) is a rapidly growing field, involving an increasing number of researchers and practitioners from different backgrounds to develop new methods that leverage quantum computers to perform faster computations. QuantumCLEF provides an evaluation infrastructure to design and develop QC algorithms, in particular Quantum Annealing (QA) algorithms, for Information Retrieval and Recommender Systems in order to:

a) identify new problem formulations to efficiently and effectively employ quantum annealers for their resolution
b) compare the performance of QA approaches with respect to their non-quantum counterparts using traditional hardware

Tasks:

Task 1 - Feature Selection
This task focuses on finding the most relevant subset of features to train a Learning-To-Rank model. It relies on the ISTELLA and LETOR4.0 datasets and targets LambdaMART or a content-based Recommender System.
Task 2 - Clustering
This task focuses on clustering of embeddings, e.g. BERT embeddings or user-item embeddings.

Organizers

Paolo Cremonesi, Politecnico di Milano, Italy
Maurizio Ferrari Dacrema, Politecnico di Milano, Italy
Nicola Ferro, University of Padua, Italy
Andrea Pasin, University of Padua, Italy

Contact

paolo.cremonesi@polimi.it • maurizio.ferrari@polimi.it • nicola.ferro@unipd.it • andrea.pasin.1@phd.unipd.it

SimpleText Lab: Improving Access to Scientific Texts for Everyone

Homepage • Twitter

SimpleText Lab addresses technical and evaluation challenges associated with making scientific information accessible to a wide audience, students, and experts. We provide appropriate reusable data and benchmarks for scientific text summarization and simplification.

Tasks:

1 - Content Selection
Retrieving passages to include in a simplified summary
2 - Complexity Spotting
Identifying and explaining difficult concepts
3 - Text Simplification
Simplifying scientific text
4 - SOTA?
Tracking the State-of-the-Art in Scholarly Publications (Task, Dataset, Metric, Score)

Organizers

Liana Ermakova, HCTI, Université de Bretagne Occidentale (Brest, France)
Jaap Kamps, University of Amsterdam (Amsterdam, The Netherlands)
Eric SanJuan, Laboratoire Informatique d’Avignon, Avignon Université (Avignon, France)
Stéphane Huet, Laboratoire Informatique d’Avignon, Avignon Université (Avignon, France)
Hosein Azarbonyad, Elsevier (Amsterdam, The Netherlands)
Federica Vezzani, University of Padova (Padova, Italy)
Giorgio Maria Di Nunzio, University of Padova (Padova, Italy)
Jennifer D’Souza, TIB Leibniz Information Centre for Science and Technology (Hannover, Germany)
Salomon Kabongo, L3S Research Center, Leibniz University of Hannover (Hannover, Germany)
Hamed Babaei Giglou, TIB Leibniz Information Centre for Science and Technology (Hannover, Germany)
Sören Auer, TIB Leibniz Information Centre for Science and Technology (Hannover, Germany)
Yue Zhang, Technische Universität Berlin (Germany)

Contact

contact@simpletext-project.com • simpletextworkshop@gmail.com • liana.ermakova.87@gmail.com • liana.ermakova@univ-brest.fr • jennifer.dsouza@tib.eu

Touché: Argumentation Systems

Homepage

Touché is a series of scientific events and shared tasks to foster the development of technologies that support people in decision-making and opinion-forming and to improve our understanding of these processes.

Tasks:

1 - Human Value Detection
Given a text, for each sentence, detect which human values the sentence refers to and whether this reference (partially) attains or (partially) constrains the value.
2 - Ideology and Power Identification in Parliamentary Debates
Given a parliamentary speech in one of several languages, identify the ideology of the speaker's party and identify whether the speaker's party is currently governing or in opposition.
3 - Image Retrieval for Arguments
Given an argument, retrieve or generate images that help to convey the argument’s premise.

Organizers

Johannes Kiesel, Bauhaus-Universität Weimar
Çağrı Çöltekin, University of Tübingen
Maximilian Heinrich, Bauhaus-Universität Weimar
Martin Potthast, Leipzig University
Henning Wachsmuth, Leibniz University Hannover
Benno Stein, Bauhaus-Universität Weimar

Contact

touche@webis.de

mCLEF: Monster CLEF

Homepage

Our goal is to systematically explore how well a given LLM performs across a number of different real-world application challenges with respect to algorithms specifically trained for each task, avoiding contamination issues. The MonsterCLEF lab is organized as a meta-challenge across a selection of tasks chosen from the other labs running in CLEF 2024 and participants are asked to develop a generative AI/LLM-based system that will be run against all the selected tasks with no or minimal adaptation. For each targeted task we rely on the same dataset, experimental setting, and evaluation measures adopted for that specific task. In this way, the LLM-based systems participating in the MonsterCLEF lab are directly comparable with the specialized systems participating in each targeted task.

Tasks:

MetaTask:

Organizers

Nicola Ferro, University of Padua, Italy
Julio Gonzalo, UNED, Spain
Jussi Karlgren, Silo AI, Sweden
Henning Müller, HES-SO, Switzerland

Contact

nicola.ferro@unipd.it • julio@lsi.uned.es • jussi.karlgren@silo.ai • Henning.Mueller@hevs.ch
