Connectionists: PhD Position F/M Multimodal automatic detection of stuttering-related disfluencies
Shakeel Ahmad
shakeelzmail608 at gmail.com
Mon Nov 3 04:46:47 EST 2025
*Contract type:* Fixed-term contract
*Level of qualifications required:* Graduate degree or equivalent
*Position:* PhD Position
Context
*Introduction*
Stuttering, a fluency disorder affecting millions of individuals, is
characterized by stuttering-like disfluencies (blocks, prolongations,
repetitions) linked to dysfunctions in speech motor control. While its
automatic detection has already been explored using audio-based models,
current systems remain limited by low robustness, difficulty in identifying
certain disfluencies such as silent blocks, and reliance on scarce data.
This PhD project proposes a multimodal approach (audio, video, text) to
enhance the accuracy and robustness of disfluency detection, leveraging an
audiovisual corpus of French-speaking individuals who stutter. The analysis
will rely on modality-specific encoders, whose representations are then
fused for the final classification.
*Aims*
The aim of this PhD is to design, develop, and evaluate a multimodal deep
learning approach for the automatic detection of stuttering-like
disfluencies in French, by combining audio, video, and textual modalities.
The work will be based on an annotated audiovisual corpus of
French-speaking people who stutter, with particular focus on disfluencies
that are difficult to detect through audio alone, such as silent blocks,
and on robustness to individual variability.
Assignment
*Missions*
The doctoral candidate’s work will include the following tasks:
- *Audio encoding*: Implement and adapt StutterNet (Sheikh, S. A.,
Sahidullah, M., Hirsch, F., & Ouni, S. – 2021 – *StutterNet: Stuttering
detection using time delay neural network*, in EUSIPCO) to extract
acoustic features relevant to disfluency detection by capturing temporal
dependencies.
- *Video encoding*: Develop and train vision models (e.g., C3D or
Transformers) to analyze video sequences for visual cues of stuttering
(facial tension, blinking, atypical movements). The extraction of facial
landmarks (with OpenFace or MediaPipe) will also be explored as a
complementary or alternative source of features.
- *Text encoding*: Generate automatic transcriptions (via Whisper) and
encode them using pre-trained language models (BERT, RoBERTa) to extract
linguistic context and identify textual patterns characteristic of
disfluencies.
- *Multimodal fusion*: Implement and compare several strategies for fusing
the representations from the three modalities, such as concatenation,
adaptive attention mechanisms, or other approaches leveraging the
complementarity of the data (an illustrative sketch follows this list).
- *Classification and evaluation*: Develop a classifier operating on the
fused representation to predict the presence or absence of stuttering
within a given time window. Evaluation will rely on standard metrics
(precision, recall, F1-score, AUC), and results will be compared to expert
manual annotations. Qualitative analyses will also be conducted to
interpret model errors and refine the approach.
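As a purely illustrative sketch of the fusion and classification steps
described above (not the project's actual architecture), the PyTorch code
below projects pre-computed audio, video, and text embeddings into a shared
space, weights them with a simple attention mechanism, and predicts whether
a disfluency occurs in a given time window; all dimensions, names, and the
specific fusion scheme are assumptions made for this example.

import torch
import torch.nn as nn

class AttentionFusionClassifier(nn.Module):
    """Late fusion of per-window audio, video, and text embeddings."""

    def __init__(self, d_audio=256, d_video=512, d_text=768, d_fused=256):
        super().__init__()
        # Project each modality into a shared space before fusion.
        self.proj_audio = nn.Linear(d_audio, d_fused)
        self.proj_video = nn.Linear(d_video, d_fused)
        self.proj_text = nn.Linear(d_text, d_fused)
        # A small scoring layer weights each modality adaptively.
        self.attn = nn.Linear(d_fused, 1)
        # Binary head: stuttering-like disfluency present or not.
        self.classifier = nn.Sequential(
            nn.Linear(d_fused, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, x_audio, x_video, x_text):
        # Each input: (batch, d_modality), one vector per analysis window.
        feats = torch.stack(
            [self.proj_audio(x_audio),
             self.proj_video(x_video),
             self.proj_text(x_text)],
            dim=1,  # -> (batch, 3, d_fused)
        )
        weights = torch.softmax(self.attn(torch.tanh(feats)), dim=1)
        fused = (weights * feats).sum(dim=1)  # adaptive weighted sum
        return self.classifier(fused).squeeze(-1)  # one logit per window

if __name__ == "__main__":
    model = AttentionFusionClassifier()
    logits = model(torch.randn(4, 256), torch.randn(4, 512), torch.randn(4, 768))
    print(logits.shape)  # torch.Size([4])

Replacing the attention-weighted sum with a plain concatenation of the three
projections gives the concatenation baseline mentioned above; window-level
predictions can then be scored against expert annotations with standard
precision, recall, F1-score, and AUC implementations (e.g., scikit-learn).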
Beyond detection, this PhD aims to contribute methodologically to the field
of multimodal fusion applied to pathological speech, with potential impact
in clinical contexts.
Skills
*Expected Skills*
The candidate should hold a Master's degree in computer science, with
strong skills in machine learning and deep learning, solid proficiency in
Python and frameworks such as PyTorch or TensorFlow, as well as an interest
in signal processing (audio/video) and, ideally, in NLP. Autonomy, rigor,
critical thinking, and analytical abilities are essential, as are good
communication skills and the ability to work effectively in a
multidisciplinary team. An interest in phonetics, linguistics, and speech
disorders, particularly stuttering, will be a plus.
Benefits package
- Subsidized meals
- Partial reimbursement of public transport costs
- Leave: 7 weeks of annual leave + 10 additional days off (RTT, full-time
basis) + possibility of exceptional leave (e.g., sick children, moving
house)
- Possibility of teleworking (after 6 months of employment) and flexible
working hours
- Professional equipment available (videoconferencing, loan of computer
equipment, etc.)
- Social, cultural, and sports benefits (Association de gestion des
œuvres sociales d'Inria)
- Access to vocational training
- Social security coverage
Remuneration
€2300 gross/month
Job Details: https://jobs.inria.fr/public/classic/en/offres/2025-09498/topdf
--
Kind Regards,
Dr. Shakeel A. Sheikh,
Prof Slim Ouni