Audio-visual Speech Representation Learning and Analysis, Jan 2024 - Now
Impact of cross-modal learning on audio-visual representations
- Comparing, via probing, visual speech representations learned with and without cross-modal learning.
Audio-visual asynchronicity in speech representations
- Compared the temporal dynamics of phonetic information encoding in the audio-visual and audio-only SSL representations by probing.
- Revealed that audio-visual SSL models failed to capture the natural asynchrony between audio and visual modalities.
Visual-only self-supervised learning (SSL) speech model
- Attempted to pre-train a visual-only SSL speech model with visual-only training objectives, i.e. pseudo-labels generated by K-means clustering of visual features.
Neurocognitive Disorder (NCD) Detection, 2021 - Now
- Built a system achieving a state-of-the-art Alzheimer's disease (AD) detection accuracy of 93.75% on the ADReSS English dataset using an ensemble of BERT/RoBERTa models, and transferred the methods to Cantonese data.
- Applied prompt-based fine-tuning to transformer-based AD classifiers, incorporating interpretable features (e.g. disfluency features) into the transformer models.
- Working on transferring the methods across languages (applying them to Cantonese NCD detection data) and incorporating data augmentation.
Alzheimer's Disease (AD) Detection (MPhil thesis), Oct 2021 - Jul 2023
- Built a system achieving a state-of-the-art AD detection accuracy of 93.75% on the ADReSS English dataset using an ensemble of BERT/RoBERTa models, and transferred the methods to Cantonese data.
- Applied prompt-based fine-tuning to transformer-based AD classifiers, incorporating interpretable features (e.g. disfluency features) into the transformer models.
Multimodal Emotion Recognition (Undergraduate dissertation), Oct 2020 - May 2021
- Used visual, audio, and text information to recognize emotions from speech videos, with Res-TDNN models as single-modality feature encoders and decision-level fusion.
Audio-visual Speech Recognition, Jul - Sep 2018
- Conducted image pre-processing for a disordered speech recognition task with audio-visual features.
- Detected and extracted mouth-region images from videos using OpenCV and dlib.
- Built an autoencoder to compress visual data.