Talk: Multivariate temporal response functions (mTRFs): A neurophysiology tool for analyzing neural processing of natural, continuous speech
Speaker: Elena Bolt
Date: 13.12.2024
Abstract: Over the past decade, mTRF models have become a widely used approach for studying how the brain processes natural, continuous speech in both research and clinical contexts. These models sit at the intersection of auditory cognitive neuroscience and computational linguistics, enabling researchers to explore neurophysiological activity across various levels, from fundamental auditory functions to complex linguistic concepts. Drawing from my applied experience, I will introduce mTRF modeling, covering how to get started, available toolboxes, and the range of speech features that can be analyzed. Additionally, I will present applied examples to showcase the versatility of mTRF modeling in both experimental and clinical research on natural speech-based computational neurophysiology.
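To make the core idea concrete, below is a minimal sketch (not from the talk) of how a forward TRF can be estimated by time-lagged ridge regression in Python. In practice one would typically use a dedicated toolbox such as the mTRF-Toolbox; the variable names, lag window, and regularization value here are illustrative assumptions.

```python
# Minimal illustrative sketch: estimating a temporal response function (TRF)
# by time-lagged ridge regression with NumPy. All names are placeholders.
import numpy as np

def lagged_design_matrix(stimulus, fs, tmin=-0.1, tmax=0.4):
    """Stack time-shifted copies of a stimulus feature (e.g. the speech envelope)."""
    lags = np.arange(int(tmin * fs), int(tmax * fs) + 1)
    n = len(stimulus)
    X = np.zeros((n, len(lags)))
    for j, lag in enumerate(lags):
        if lag < 0:
            X[:lag, j] = stimulus[-lag:]
        elif lag > 0:
            X[lag:, j] = stimulus[:-lag]
        else:
            X[:, j] = stimulus
    return X, lags

def fit_trf(stimulus, eeg, fs, lambda_reg=1.0):
    """Forward (encoding) model: ridge-regress one EEG channel onto the lagged stimulus."""
    X, lags = lagged_design_matrix(stimulus, fs)
    XtX = X.T @ X + lambda_reg * np.eye(X.shape[1])
    w = np.linalg.solve(XtX, X.T @ eeg)   # TRF weights, one per lag
    return w, lags / fs                   # weights and lag times in seconds

# Toy example: 60 s of a synthetic "envelope" and one synthetic EEG channel at 128 Hz
fs = 128
rng = np.random.default_rng(0)
envelope = rng.standard_normal(60 * fs)
eeg = np.convolve(envelope, rng.standard_normal(32), mode="same") + rng.standard_normal(60 * fs)
weights, lag_times = fit_trf(envelope, eeg, fs, lambda_reg=10.0)
```

A multivariate model follows the same pattern with several stimulus features (envelope, spectrogram bands, linguistic predictors) concatenated column-wise in the design matrix.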
***
Talk: Echoes of the Self: What Does My Voice Sound Like?
Speaker: Pavo Orepic
Date: 15.11.2024
Abstract:
We are all familiar with the discomfort we feel after hearing our voice in a recording. For many, this discomfort is so intense that they avoid voice and video recordings altogether, missing out on vital aspects of modern communication. With contemporary technology increasingly exposing us to our digital selves, finding ways to ease self-voice discomfort is becoming more important.
Previous attempts to adjust self-voice recordings to sound more "natural" yielded mixed results, largely due to 1) uninformed choices of acoustic transformations applied to self-voice recordings and 2) reliance on subjective measures of “what sounds natural”. The unconventional approach taken here is to address voice acoustics using methods inspired by neuroscience. Specifically, I plan to identify the acoustics behind the natural self-voice by investigating how the brain discriminates our own voice from other voices. In this talk, I will present the vision and goals of this SNSF Spark project.
***
Talk: Experimental Design Clinic
Speaker: Huw Swanborough
Date: 21.06.2024
Abstract:
Data collection is the experimental ‘point-of-no-return’: once data have been collected, we can only analyse what is present in the data files; we cannot go back and interpolate missing information that we later want for an analytical model. Biases and confounding effects may be inseparable from the effects of interest, potentially resulting in null results with no way of mitigating them, and leading to lost time, money, and peace of mind. These pitfalls are created during the initial design of the experiment, yet they often only become visible during analysis and frequently lack an intuitive cause-and-effect relationship (e.g. the format in which you save your data may prevent certain post-hoc comparisons).
During this session we will go over some of these potential pitfalls and discuss ways of avoiding or mitigating them, with a particular focus on cognitive and psychoacoustic considerations in experimental design and stimulus presentation that may cause analysis problems down the line. The session will be part presentation, part open-floor clinic, so that we can grapple with the ideas in a more tangible way. Questions and contributions towards the content of the meeting are warmly welcomed: if you have any concerns, current obstacles, or examples of past problems caused during the design stage, please submit them to me by Monday and I will do my best to include them in the open-floor discussion. I will be including examples of the times I have painted myself into a corner, so please don’t worry about harsh critiques of your work; the aim is to be constructive and use shared experience to avoid repeated mistakes as a group.
***
Talk: Large pre-trained self-supervised models for automatic speech processing
Speaker: Srikanth Madikeri Raghunathan
Date: 07.06.2024
Abstract:
In this talk, I present our work on applying large pre-trained self-supervised models to different speech processing tasks: low-resource automatic speech recognition (ASR), spoken language understanding, and language identification. The success of wav2vec 2.0-style self-training paved the way for rapid training of ASR systems and was later extended to other speech processing tasks such as speaker recognition and language recognition. Our work combines the strengths of hybrid ASR (so-called HMM/DNN approaches) with pre-trained audio encoders to leverage the best of both systems: from using Lattice-Free Maximum Mutual Information (LF-MMI) as the cost function for acoustic-model fine-tuning to adapters for parameter-efficient training.
With effective ASR training methods in place, the focus of research and development in spoken document processing has shifted towards downstream tasks such as intent detection, slot filling, information retrieval, and dialog structure discovery. In our work, we compare different approaches to combining multiple hypotheses from the ASR system, as opposed to relying on the one-best hypothesis alone.
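As a rough illustration of how a pre-trained audio encoder of this kind is typically used as a feature extractor (this is not the speaker's code), the sketch below loads a wav2vec 2.0 checkpoint with the Hugging Face transformers library and extracts frame-level representations; the checkpoint choice and downstream use are assumptions.

```python
# Illustrative sketch: frame-level representations from a pre-trained
# wav2vec 2.0 encoder, e.g. as input features for a downstream acoustic
# model or language-identification classifier.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

model_name = "facebook/wav2vec2-base"          # any pre-trained checkpoint
extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
encoder = Wav2Vec2Model.from_pretrained(model_name)
encoder.eval()

# waveform: 1-D float signal sampled at 16 kHz (here: 1 s of silence as a stand-in)
waveform = torch.zeros(16000)
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    # Frame-level hidden states, roughly one vector every 20 ms
    features = encoder(inputs.input_values).last_hidden_state

print(features.shape)  # (batch=1, n_frames, hidden_size=768)
```

In a fine-tuning or adapter-based setup, the encoder would be trained jointly with (or adapted to) the downstream objective rather than used as a frozen feature extractor.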
***
Talk: Decoding Visual Attention - From 3D Gaze to Social Gaze Inference in Everyday Scenes
Invited Speakers: Jean-Marc Odobez & Anshul Gupta (IDIAP)
Date: 26.04.2024
Abstract:
Beyond words, non-verbal behaviors (NVB) are known to play important roles in face-to-face interactions. However, decoding NVB is a challenging problem that involves both extracting subtle physical NVB cues and mapping them to higher-level communication behaviors or social constructs. Gaze, in particular, serves as a fundamental indicator of attention and interest, with functions related to communication and social signaling, and plays an important role in many fields, such as intuitive human-computer and human-robot interface design, or in medical diagnosis, for example in assessing Autism Spectrum Disorders (ASD) in children.
However, estimating the visual attention of others - that is, estimating their gaze (3D line of sight) and Visual Focus of Attention (VFOA) - is a challenging task, even for humans. It often requires not only inferring an accurate 3D gaze direction from the person's face and eyes, but also understanding the global context of the scene to decide which object in the field of view is actually being looked at. Context can include the activities of the person or of others, which provide priors about which objects are likely to be looked at, as well as the scene structure, which can reveal obstructions in the line of sight. Hence, two lines of research have been followed recently: the first focused on improving appearance-based 3D gaze estimation from images and videos, while the second investigated gaze following - the task of estimating the 2D pixel location of where a person looks in an image.
In this presentation, we will discuss different methods that address the two cases mentioned above. We will first focus on several methodological ideas for improving 3D gaze estimation, including building personalized models through few-shot learning and gaze-redirection eye synthesis, differential gaze estimation, and taking advantage of priors on social interactions to obtain weak labels for model adaptation. In the second part, we will introduce recent models for estimating gaze targets in the wild, showing how to take advantage of different modalities, including estimates of the 3D field of view, as well as methods for inferring social labels (eye contact, shared attention).
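For readers unfamiliar with appearance-based gaze estimation, the sketch below is a deliberately minimal PyTorch model (not the speakers' architecture) that maps a face or eye crop to a gaze direction parameterized as yaw and pitch; the network layout and all names are illustrative placeholders.

```python
# Illustrative sketch: a minimal appearance-based 3D gaze estimator that
# predicts (yaw, pitch) from an image crop and converts it to a unit gaze vector.
import torch
import torch.nn as nn

class GazeNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(             # small convolutional encoder
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, 2)               # predict (yaw, pitch) in radians

    def forward(self, image):
        feat = self.backbone(image).flatten(1)
        return self.head(feat)

def angles_to_vector(yaw_pitch):
    """Convert (yaw, pitch) angles to a unit 3D gaze direction vector."""
    yaw, pitch = yaw_pitch[:, 0], yaw_pitch[:, 1]
    x = torch.cos(pitch) * torch.sin(yaw)
    y = torch.sin(pitch)
    z = torch.cos(pitch) * torch.cos(yaw)
    return torch.stack([x, y, z], dim=1)

# Usage with a dummy 64x64 crop; training would minimize the angular error
# between predicted and ground-truth gaze vectors.
model = GazeNet()
crop = torch.randn(1, 3, 64, 64)
gaze_vec = angles_to_vector(model(crop))
```

Gaze-following models discussed in the talk additionally take the whole scene image (and possibly depth or field-of-view estimates) as input to localize the attended target, which goes beyond this per-crop sketch.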