Project
MEZZANINE EN

The Basic Research for the Development of Spoken Language Resources and Speech Technologies for the Slovenian Language (teMeljnE raZiskave Za rAzvoj govorNih vIrov in tehNologij za slovEnščino, project ID: J7-4642) is a large basic research project financed by the Slovenian Research and Innovation Agency for the period from October 2022 to the end of September 2025. The development of speech resources and technologies requires thoughtful approaches and is based on linguistic and technological expertise. In the MEZZANINE project, we are creating the knowledge that will make this development effective, but we are not producing the speech resources and technologies themselves. We focus on the Slovenian language, as solutions developed for foreign languages and languages with extensive speech resources are not always directly transferable or suitable for Slovenian.

The MEZZANINE project focuses on four thematic areas in creating the knowledge needed for the development of speech resources and technologies for Slovenian.

Acquiring recordings of speech

The first thematic area explores which types of speech data, given the existing resources, are most needed and how to collect and transcribe them as efficiently and automatically as possible. Collecting speech data is much more time-consuming and demanding than collecting written data. In the MEZZANINE project, collaborators from all nine participating institutions, working in diverse scientific disciplines, are researching: (1) the needs for speech data in various scientific fields (ranging from dialectology, lexicography, and syntax to pragmatics and speech technologies); (2) how to effectively involve citizens in data collection; and (3) which speech recognition methods enable the most efficient automatic transcription of speech recordings.

Dialect variation

The second thematic area explores the rich variety of sounds in Slovenian dialects. Spoken Slovenian in different regions of Slovenia and neighboring areas features many more distinct sounds than those described by the grammar of the standard language. In the MEZZANINE project, Slovenian dialectologists, led by the Fran Ramovš Institute of the Slovenian Language at ZRC SAZU, have joined forces to prepare an overview of the spatial distribution of dialectal sounds and to develop an optimal set of sounds suitable for use in automatic speech recognition for Slovenian dialects.

Speech segmentation and annotation

The third thematic area explores annotation schemes and procedures for the automatic annotation of speech. In speech, there are no punctuation marks to indicate sentence boundaries. Words may have different properties compared to their written usage (e.g., the particle "te" in expressions like "kaj te jaz vem"), or they may not exist in written language at all (e.g., "mhm", "betežen"). Sentences and statements are often incomplete, speakers self-correct and pause, and utterances are delivered with specific intentions. In the MEZZANINE project, linguists from the Faculty of Electrical Engineering and Computer Science of the University of Maribor, together with language technologists from the Jožef Stefan Institute, are developing annotation schemes, annotated speech data, and automatic annotation procedures for: (1) basic units of speech derived from prosody (intonation, pitch, tempo, volume); (2) self-corrections and hesitations in speech; (3) speech-adapted automatic annotation of basic word forms, their morphological properties, and syntactic relationships within a sentence or utterance; and (4) the expressed intent of an utterance, or dialogue act, such as conveying information, expressing opinions, sharing feelings, preparing the interlocutor or committing oneself to an action, managing interpersonal relationships, or organizing the flow of conversation.

Spoken lexis

The goal of the fourth thematic area is the development of automated procedures for adding information on spoken vocabulary to Slovene language resources, namely the Digital Dictionary Database of Slovene, which serves as the central database for various Slovene dictionaries. Linguists and language technology experts from the Centre for Language Resources and Technologies of the University of Ljubljana are exploring ways to efficiently and automatically extract, from the reference speech corpus Gos, vocabulary that is absent from written corpora for Slovene or differs from its written counterpart in certain linguistic features. Special emphasis is placed on the automatic processing of the phonetic form of the vocabulary, so that information about the actual pronunciation of words characteristic of everyday speech can be added to dictionaries.

The results of the project will enable more efficient further development of speech resources and technologies for the Slovenian language while also providing new insights into the characteristics of spoken Slovenian and spoken language in general.