I am a senior researcher at the Department of Knowledge Technologies at the Jožef Stefan Institute, the Laboratory for Cognitive Modeling at the Faculty for Information and Communication Science, University of Ljubljana, and the Institute for Contemporary History, Ljubljana. I mostly work in the areas of natural language processing, computational linguistics and computational social science.
You can contact me via name dot surname (ljubesic) at ijs dot si.
Publications
Google Scholar
Curriculum Vitae
Download my CV
Active projects
ParlaCAP: Comparing agenda settings across parliaments via the ParlaMint dataset (Horizon Europe, 2024-2026)
MEZZANINE: Basic research for development of spoken resources and technologies for Slovene (ARIS, 2022-2025)
EMMA: Embeddings-based techniques for Media Monitoring Applications (ARIS, 2023-2025)
POVEJMO: Adaptive Natural Language Processing with Large Language Models (ARIS, 2023-2026)
News
- (March 2024) We have released the Mići Princ text+speech dataset in Chakavian dialects and have successfully adapted Whisper-v3-large to the Chakavian dialects, achieving a ~40% WER reduction and a ~60% CER reduction on unseen speakers compared to vanilla Whisper-v3-large
- (February 2024) The ParlaSpeech parliamentary text+speech datasets for Croatian (3k hours), Serbian (1k hours) and Polish (1k hours) are finally out; they are also available as corpora on concordancers and as HuggingFace datasets
- (February 2024) We are organizing the DIALECT-COPA shared task on causal commonsense reasoning in dialectal texts! Part of VarDial 2024 @NAACL
- (October 2023) The official CLASSLA web corpora for all seven South Slavic languages (Slovenian, Croatian, Bosnian, Montenegrin, Serbian, Macedonian, Bulgarian), almost 11 billion words in size, have been published on the CLARIN.SI concordancer
- (September 2023) We have released an XLM-R-large model additionally pre-trained on large collections of parliamentary proceedings, named XLM-R-parla
- We have published the pilot versions of the new CLASSLA web corpora for Croatian, Serbian and Slovenian, described in this blog post
- I recently took on the role of co-president of the Special Interest Group on Web as Corpus of the Association for Computational Linguistics, together with Benoît Sagot and the co-secretaries Veronika Laippala and Pedro Ortiz Suarez; if you are interested in the topic, join our SIG / mailing list
- We participated in preparing the ParlaMint 3.0 corpus, adding the proceedings of parliaments of Croatia, Bosnia and Herzegovina, and Serbia
- A training dataset for sentence-level sentiment identification in parliamentary proceedings of Croatia, Bosnia and Herzegovina, and Serbia has been published under the name ParlaSent-BCS, with accompanying fine-tuned BERTić models; the dataset construction process and the modelling are described in the pre-print paper
- The first set of free speech2text systems for Croatian (based on the XLS-R and the Slavic models) has been published; the 1,816 hours of training data are available as the ParlaSpeech-HR dataset
- The FRENK hate speech dataset in Croatian, English and Slovenian has been published, along with transformer models fine-tuned on the data (available under the CLASSLA organization’s HuggingFace repo)
- I am co-organizing the WNUT MultiLexNorm shared task on lexical normalization, which includes Croatian, Slovenian and Serbian
- The BERTić model fine-tuned on the NER task has been added to HuggingFace
- The SotA transformer model for Bosnian, Croatian, Montenegrin and Serbian - BERTić - has been released via HuggingFace
- I am co-organizing the VarDial2021 evaluation campaign with the task of Social Media Geolocation, which includes geo-locating tweets written in Croatian, Bosnian, Montenegrin or Serbian
- The SotA NLP technologies for South Slavic languages are now available as a Python package
- I am co-organizing the WMT2020 shared task on similar language translation, which includes Slovene, Croatian and Serbian for the first time at WMT
- I am co-organizing the VarDial2020 evaluation campaign with the task of Social Media Geolocation, which includes geo-locating tweets written in Croatian, Bosnian, Montenegrin or Serbian
- I helped set up the CLASSLA knowledge centre for language technologies for South Slavic languages, part of CLARIN ERIC