Project
MEZZANINE EN
The Basic Research for the Development of Spoken Language Resources and Speech Technologies for the Slovenian Language (teMeljnE raZiskave Za rAzvoj govorNih vIrov in tehNologij za slovEnščino, project ID: J7-4642) is a large basic research project financed by the Slovenian Research and Innovation Agency for the period from October 2022 until the end of September 2025. The development of speech resources and technologies requires thoughtful approaches and is based on linguistic and technological expertise. In the MEZZANINE project, we are creating knowledge that will enable this development to be effective, but we are not producing the speech resources and technologies themselves. We focus on the Slovenian language, as solutions developed for foreign languages and languages with extensive speech resources are not always directly transferable or suitable for Slovenian.
The MEZZANINE project focuses on four thematic areas in creating the knowledge needed for the development of speech resources and technologies for Slovenian.
Acquiring recordings of speech
The first thematic area explores which types of speech data, based on existing resources, are most needed and how to collect and transcribe them as efficiently and automatically as possible. Collecting speech data is much more time-consuming and demanding than collecting written data. In the MEZZANINE project, collaborators from all nine participating institutions, working in diverse scientific disciplines, are researching: (1) the needs for speech data in various scientific fields (ranging from dialectology, lexicography, and syntax to pragmatics and speech technologies); (2) how to effectively involve citizens in data collection; and (3) which speech recognition methods to use to enable the most efficient automatic transcription of speech recordings.
Dialect variation
The second thematic area explores the richness of various sounds in Slovenian dialects. Spoken Slovenian in different regions of Slovenia and neighboring areas features many more distinct sounds than those described by the grammar of the standard language. In the MEZZANINE project, Slovenian dialectologists, led by the Fran Ramovš Institute of the Slovenian Language at ZRC SAZU, have joined forces to prepare an overview of the spatial distribution of dialectal sounds and develop an optimal set of sounds suitable for use in automatic speech recognition for Slovenian dialects.
Speech segmentation and annotation
The third thematic area explores annotation schemes and procedures for the automatic annotation of speech. In speech, there are no punctuation marks to indicate sentence boundaries. Words may have different properties compared to their written usage (e.g., the particle te in expressions like kaj te jaz vem), or they may not exist in written language at all (e.g., mhm, betežen). Sentences and statements are often incomplete, speakers self-correct and pause, and utterances are delivered with specific intentions. In the MEZZANINE project, linguists from the Faculty of Electrical Engineering and Computer Science of the University of Maribor, together with language technologists from the Jožef Stefan Institute, are developing annotation schemes, annotated speech data, and automatic annotation procedures for: (1) basic units of speech derived from prosody (intonation, pitch, tempo, volume); (2) self-corrections and hesitations in speech; (3) speech-adapted automatic annotation of basic word forms, their morphological properties, and syntactic relationships within a sentence or utterance; and (4) the expressed intent of an utterance, or dialogue act, such as conveying information, expressing opinions, sharing feelings, preparing the interlocutor or committing oneself to an action, managing interpersonal relationships, or organizing the flow of conversation.
Spoken lexis
The goal of the fourth thematic area is the development of automated procedures for adding information on spoken vocabulary to Slovene language resources, namely the Digital Dictionary Database of Slovene, which serves as the central database for various Slovene dictionaries. Linguists and language technology experts from the Centre for Language Resources and Technologies of the University of Ljubljana are exploring ways to efficiently and automatically extract vocabulary from the reference speech corpus Gos that is absent from written corpora for Slovene or differs in certain linguistic features. Special emphasis is placed on the automatic processing of the phonetic form of the vocabulary, enabling the addition of information about the actual pronunciation of words characteristic of everyday speech to dictionaries.
The results of the project will enable more efficient further development of speech resources and technologies for the Slovenian language while also providing new insights into the characteristics of spoken Slovenian and spoken language in general.
Speech resources document the spoken use of language. These can include databases with recordings of speech in various situations, ranging from media, the internet, and parliamentary settings to interviews and casual conversations. The speech in the recordings is usually transcribed and often further annotated, for example, with basic forms (lemmas), parts of speech (noun, verb, etc.), morphological features (gender, number, person, etc.), syntactic relationships (predicate, subject, etc.), named entities (proper names), and other linguistic information. Information about the speaker (gender, age, etc.) and the audio (recording environment, recording quality, equipment used, etc.) is also documented. Such a database can be used to study the characteristics of speech, describe it in dictionaries or grammars, and support the development of speech technologies.
Speech technologies are computer tools related to speech. The most recognizable tool of this kind is a speech recognizer, which transcribes spoken language. Such a service is very useful for dictation, automatic subtitling of video content, automatic searching through audio archives, and similar applications. Another widely used tool is a speech synthesizer, which reads written text aloud. Speech-to-speech translation and the ability to communicate with devices via speech, such as having a conversation with a computer, are also highly useful services.