Nikola Ljubešić

I am a senior researcher at the Department of Knowledge Technologies at the Jožef Stefan Institute, the Laboratory for Cognitive Modeling at the Faculty for Information and Communication Science, University of Ljubljana, and the Institute for Contemporary History, Ljubljana. I mostly work in the areas of natural language processing, computational linguistics and computational social science.

You can contact me via name (nikola) dot surname (ljubesic) at ijs dot si.

Publications

Google Scholar

Curriculum Vitae

Download my CV

Active projects

LLMs4EU: Large Language Models for the European Union (Digital Europe, 2025-2028)

LLM4DH: Large Language Models for Digital Humanities (ARIS, 2024-2027)

ParlaCAP: Comparing agenda settings across parliaments via the ParlaMint dataset (Horizon Europe, 2024-2026)

EMMA: Embeddings-based techniques for Media Monitoring Applications (ARIS, 2023-2026)

MEZZANINE: Basic research for development of spoken resources and technologies for Slovene (ARIS, 2022-2025)

POVEJMO: Adaptive Natural Language Processing with Large Language Models (ARIS, 2023-2026)

News

(October 2025) ParlaCAP v1 has been published - parliamentary discussions from 28 European parliaments, 8 million speeches, each discussion represented via the Comparative Agendas Project topic (obtained from the ParlaCAP model) and sentiment (obtained from the ParlaSent model).
(September 2025) ParlaSpeech v3 has been released, with four languages and five annotation layers (three on the spoken modality). The data are also available for search via CLARIN.SI, with a tutorial available.
(May 2025) We are finalising the third iteration of the ParlaSpeech corpus collection that includes annotation of filled pauses and sentiment in all languages and primary stress information in Croatian and Serbian
(September 2024) We have released the multilingual IPTC news media topic classifier, built via our Teacher-Student framework; the model has 60k+ downloads every month
(March 2024) We have released the Mići Princ text+speech dataset in Chakavian dialects and have successfully adapted Whisper-v3-large to the Chakavian dialects, achieving a ~40% WER reduction and ~60% CER reduction on unseen speakers relative to vanilla Whisper-v3-large
(February 2024) The ParlaSpeech parliamentary text+speech datasets of Croatian (3k hours), Serbian (1k hours) and Polish (1k hours) are finally out, also available as corpora on concordancers and HuggingFace datasets
(February 2024) We are organizing the DIALECT-COPA shared task in causal commonsense reasoning in dialectal texts! Part of VarDial 2024 @NAACL
(October 2023) The official CLASSLA web corpora for all seven South Slavic languages (Slovenian, Croatian, Bosnian, Montenegrin, Serbian, Macedonian, Bulgarian), almost 11 billion words in size, have been published on the CLARIN.SI concordancer
(September 2023) We have released a XLM-R-large model additionally pre-trained on large collections of parliamentary proceedings, named XLM-R-parla
We have published the pilot versions of the new CLASSLA web corpora for Croatian, Serbian and Slovenian, described in this blog post
I recently took on the role of co-president of the Special Interest Group on Web as Corpus of the Association for Computational Linguistics, together with Benoît Sagot, and the co-secretaries Veronika Laippala and Pedro Ortiz Suarez; if you are interested in the topic, join our SIG / mailing list
We participated in preparing the ParlaMint 3.0 corpus, adding the proceedings of parliaments of Croatia, Bosnia and Herzegovina, and Serbia
A training dataset for sentence-level sentiment identification in parliamentary proceedings of Croatia, Bosnia and Herzegovina and Serbia has been published under the name ParlaSent-BCS, with accompanying fine-tuned BERTić models; the dataset construction process and the modelling are described in the pre-print paper
The first set of free speech2text systems for Croatian (based on the XLS-R and the Slavic models) has been published; the 1,816 hours of training data are published as the ParlaSpeech-HR dataset
The FRENK hate speech dataset in Croatian, English and Slovenian has been published, along with transformer models fine-tuned on the data (available under the CLASSLA organization’s HuggingFace repo)
I am co-organizing the WNUT MultiLexNorm shared task on lexical normalization, which includes Croatian, Slovenian and Serbian
The BERTić model fine-tuned on the NER task has been added to HuggingFace
The SotA transformer model for Bosnian, Croatian, Montenegrin and Serbian - BERTić - has been released via HuggingFace
I am co-organizing the VarDial2021 evaluation campaign with the task of Social Media Geolocation, which includes geo-locating tweets written in Croatian, Bosnian, Montenegrin or Serbian
The SotA NLP technologies for South Slavic languages are now available as a Python package
I am co-organizing the WMT2020 shared task on similar language translation, which brings Slovene, Croatian and Serbian to WMT for the first time
I am co-organizing the VarDial2020 evaluation campaign with the task of Social Media Geolocation, which includes geo-locating tweets written in Croatian, Bosnian, Montenegrin or Serbian
I helped set up the CLASSLA knowledge centre for language technologies for South Slavic languages, part of CLARIN ERIC