Work Packages

The MEZZANINE project is divided into four work packages, each containing two to four activities. Research in each activity is guided by defined research questions. Experts from linguistics and the technical sciences cooperate in each work package.

Work Package 1 (WP1): Acquiring recordings of speech

Work Package 2 (WP2): Dialect variation

Work Package 3 (WP3): Speech segmentation and annotation

Work Package 4 (WP4): Spoken lexis

WP1: Acquiring recordings of speech

Activities

Research questions

A1.1-I Spoken language resources in linguistics and technical sciences

RQ1.1.1
What are the needs of different linguistic disciplines and technical sciences regarding spoken language resources?

RQ1.1.2
How well balanced are the existing reference speech corpora with regard to the spoken genres they cover?

A1.2-I Advantages and disadvantages of different recording techniques

RQ1.2.1
What recording techniques are used to collect speech data, and what are the characteristics of data collected with particular techniques?

RQ1.2.2
What is the potential of crowdsourcing speech data in small communities, and how can it satisfy the needs of a diverse set of disciplines?

RQ1.2.3
What legal considerations apply when recording speech data or using existing speech data from different sources, and how can they be addressed?

A1.3-T Low-cost limited-domain speech data for training a speech recogniser

RQ1.3.1
How should unsupervised or semi-supervised training of a speech recogniser be designed if speech data are available only for a specific domain?

RQ1.3.2
What is the optimal approach for constructing new speech data from the perspective of available low-cost speech data?

A1.4-T The effectiveness of knowledge transfer for different speech/speaker recognition tasks

RQ1.4
Which speech recognition tasks have the lowest potential for knowledge transfer from high-resource languages to Slovenian?

WP2: Dialect variation

Activities

Research questions

A2.1-L Geolinguistic analysis of non-standard phonemes

RQ2.1
How reliable is the current version of Slovenian dialect phonetic transcription?

A2.2-L A spatial model of basic dialect areas of non-standard phonemes

RQ2.2
How can the spatial distribution of non-standard phonemes be determined?

A2.3-L Creation of diasystemic contrastive tables

RQ2.3
How can a spatial model be created for designing diasystemic contrastive tables of phonemes (dialect vs. standard)?

A2.4-I Definition of an optimal Slovenian phoneme set for ASR

RQ2.4
How can an optimal Slovenian phoneme set be defined that is balanced between the standardised version and the dialectal phoneme version?

WP3: Speech segmentation and annotation

Activities

Research questions

A3.1-I The basic units of speech

RQ3.1.1
How well do manually annotated speech segments (i.e., utterances) in the Slovenian spoken language resources correlate with prosodic units?

RQ3.1.2
How well do manually annotated speech segments in the Slovenian spoken language resources correlate with syntactic units?

A3.2-I Annotating and modelling disfluencies

RQ3.2.1
What is an appropriate scheme for annotation of disfluencies in speech corpora?

RQ3.2.2
What is the optimal approach to automatic disfluency detection in speech corpora?

A3.3-I Morphosyntactic annotation, lemmatisation and dependency parsing

RQ3.3.1
How can disfluency annotations inform and improve linguistic annotation?

RQ3.3.2
How can training data from other domains and modalities be used efficiently for spoken language processing?

RQ3.3.3
What is the impact of linguistic input representation on the results of linguistic annotation?

A3.4-I Dialogue act annotation

RQ3.4.1
How unambiguous, adequate and informative is the GORDAN scheme compared to the ISO 24617-2 standard?

RQ3.4.2
How can the ISO 24617-2 tagset be expanded to make it more adequate and informative?

WP4: Spoken lexis

Activities

Research questions

A4.1-I Canonical forms of (non-standard) spoken lexis

RQ4.1.1
Which types (distinct words in a corpus) interpreting the same or similar phenomena were standardised differently in existing spoken language resources?

RQ4.1.2
What is the appropriate categorisation of the analysed heterogeneously interpreted corpus types, and how are canonical forms classified according to different categories (of types)?

RQ4.1.3
How are canonical forms and types included in the lexicon, or linked with lexicon data?

A4.2-I Lexicographic description of (non-standard) spoken language

RQ4.2.1
What are the characteristics of the spoken lexis, as opposed to written language, and how can these characteristics be analysed automatically (for lexicographic purposes)?

RQ4.2.2
How is the semantic description of spoken language lexis included in semantic (lexicographic) resources for Slovenian?