Connectionists: PhD Position F/M Multimodal automatic detection of stuttering-related disfluencies

Shakeel Ahmad shakeelzmail608 at gmail.com
Mon Nov 3 04:46:47 EST 2025


*Contract type:* Fixed-term contract

*Level of qualifications required:* Graduate degree or equivalent

*Function:* PhD Position
Context

*Introduction*
Stuttering, a fluency disorder affecting millions of individuals, is
characterized by stuttering-like disfluencies (blocks, prolongations,
repetitions) linked to dysfunctions in speech motor control. While its
automatic detection has already been explored using audio-based models,
current systems remain limited by low robustness, difficulty in identifying
certain disfluencies such as silent blocks, and reliance on scarce data.
This PhD project proposes a multimodal approach (audio, video, text) to
enhance the accuracy and robustness of disfluency detection, leveraging an
audiovisual corpus of French-speaking individuals who stutter. The analysis
will rely on modality-specific encoding techniques, followed by a strategic
fusion of their representations for final classification.

*Aims*

The aim of this PhD is to design, develop, and evaluate a multimodal deep
learning approach for the automatic detection of stuttering-like
disfluencies in French, by combining audio, video, and textual modalities.
The work will be based on an annotated audiovisual corpus of
French-speaking people who stutter, with particular focus on disfluencies
that are difficult to detect through audio alone, such as silent blocks,
and on robustness to individual variability.
Assignment

*Missions*

The doctoral candidate’s work will include the following tasks (illustrative code sketches for these steps follow the list):

   - *Audio encoding*: Implement and adapt StutterNet (Sheikh, S. A.,
   Sahidullah, M., Hirsch, F., & Ouni, S., 2021, *StutterNet: Stuttering
   detection using time delay neural network*, EUSIPCO) to extract
   acoustic features relevant to disfluency detection by capturing temporal
   dependencies.
   - *Video encoding*: Develop and train vision models (e.g., C3D or
   Transformers) to analyze video sequences for visual cues of stuttering
   (facial tension, blinking, atypical movements). The extraction of facial
   landmarks (with OpenFace or MediaPipe) will also be explored as a
   complementary or alternative source of features.
   - *Text encoding*: Generate automatic transcriptions (via Whisper) and
   encode them using pre-trained language models (BERT, RoBERTa) to extract
   linguistic context and identify textual patterns characteristic of
   disfluencies.
   - *Multimodal fusion*: Implement and compare several strategies to fuse
   the representations from the three modalities, such as concatenation,
   adaptive attention mechanisms, or other approaches leveraging data
   complementarity.
   - *Classification and evaluation*: Develop a classifier operating on the
   fused representation to predict the presence or absence of stuttering
   within a given time window. Evaluation will rely on standard metrics
   (precision, recall, F1-score, AUC), and results will be compared to expert
   manual annotations. Qualitative analyses will also be conducted to
   interpret model errors and refine the approach.
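
As a rough illustration of the audio branch, the sketch below shows a
generic TDNN-style encoder built from dilated 1-D convolutions over MFCC
frames. The layer sizes, temporal contexts, and embedding dimension are
illustrative assumptions and do not reproduce the published StutterNet
configuration.

    # Sketch of a TDNN-style acoustic encoder (PyTorch); hyperparameters
    # are placeholder assumptions, not the published StutterNet settings.
    import torch
    import torch.nn as nn

    class TDNNEncoder(nn.Module):
        def __init__(self, n_mfcc: int = 20, hidden: int = 256, emb_dim: int = 128):
            super().__init__()
            # Each Conv1d acts as a TDNN layer; kernel_size and dilation
            # set the temporal context it covers.
            self.net = nn.Sequential(
                nn.Conv1d(n_mfcc, hidden, kernel_size=5, dilation=1),
                nn.ReLU(), nn.BatchNorm1d(hidden),
                nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2),
                nn.ReLU(), nn.BatchNorm1d(hidden),
                nn.Conv1d(hidden, hidden, kernel_size=3, dilation=3),
                nn.ReLU(), nn.BatchNorm1d(hidden),
            )
            self.proj = nn.Linear(2 * hidden, emb_dim)  # stats pooling -> fixed-size embedding

        def forward(self, mfcc: torch.Tensor) -> torch.Tensor:
            # mfcc: (batch, n_mfcc, n_frames)
            h = self.net(mfcc)                                       # (batch, hidden, n_frames')
            stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)  # pool over time
            return self.proj(stats)                                  # (batch, emb_dim)

    # Example: a batch of 3-second windows at 100 frames/s with 20 MFCCs.
    print(TDNNEncoder()(torch.randn(4, 20, 300)).shape)  # torch.Size([4, 128])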
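
For the landmark-based visual features, a possible starting point is
per-frame facial landmark extraction with MediaPipe Face Mesh (video
decoding via OpenCV). This is only a sketch under the assumption that a
single face is visible; the video path is a placeholder, and frames
without a detected face are zero-filled.

    # Sketch: per-frame facial landmark extraction with MediaPipe Face Mesh.
    import cv2
    import mediapipe as mp
    import numpy as np

    def extract_face_landmarks(video_path: str) -> np.ndarray:
        """Return an array of shape (n_frames, 468 * 3) of (x, y, z) coordinates."""
        cap = cv2.VideoCapture(video_path)
        frames = []
        with mp.solutions.face_mesh.FaceMesh(static_image_mode=False,
                                             max_num_faces=1) as mesh:
            while True:
                ok, frame = cap.read()
                if not ok:
                    break
                result = mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
                if result.multi_face_landmarks:
                    lm = result.multi_face_landmarks[0].landmark
                    frames.append([c for p in lm for c in (p.x, p.y, p.z)])
                else:
                    frames.append([0.0] * (468 * 3))  # no face detected in this frame
        cap.release()
        return np.asarray(frames, dtype=np.float32)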
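
For the text branch, one plausible pipeline is transcription with
openai-whisper followed by a sentence-level embedding from a pre-trained
BERT via Hugging Face transformers. The model names ("base",
"bert-base-multilingual-cased") are illustrative choices; a French model
such as CamemBERT or FlauBERT could be substituted.

    # Sketch: Whisper transcription + BERT [CLS] embedding for one audio window.
    import torch
    import whisper
    from transformers import AutoModel, AutoTokenizer

    asr = whisper.load_model("base")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")

    def encode_utterance(wav_path: str) -> torch.Tensor:
        """Transcribe one window and return its [CLS] embedding, shape (768,)."""
        text = asr.transcribe(wav_path, language="fr")["text"]
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = encoder(**inputs).last_hidden_state  # (1, n_tokens, 768)
        return hidden[0, 0]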
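
For the fusion and classification steps, one simple baseline among the
strategies to compare is late fusion by concatenation of the three modality
embeddings, followed by a small MLP predicting stuttering vs. fluent speech
for each time window. The sketch below uses placeholder embedding dimensions
and random tensors in place of real held-out windows; evaluation uses the
standard scikit-learn metrics listed above.

    # Sketch: concatenation-based fusion + window-level classification + evaluation.
    import torch
    import torch.nn as nn
    from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

    class FusionClassifier(nn.Module):
        def __init__(self, d_audio=128, d_video=128, d_text=768, hidden=256):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(d_audio + d_video + d_text, hidden),
                nn.ReLU(),
                nn.Dropout(0.3),
                nn.Linear(hidden, 1),  # one logit per window: disfluent vs. fluent
            )

        def forward(self, a, v, t):
            return self.mlp(torch.cat([a, v, t], dim=-1)).squeeze(-1)

    # Toy evaluation on random tensors standing in for held-out windows.
    model = FusionClassifier()
    a, v, t = torch.randn(32, 128), torch.randn(32, 128), torch.randn(32, 768)
    labels = torch.randint(0, 2, (32,))
    with torch.no_grad():
        scores = torch.sigmoid(model(a, v, t))
    preds = (scores > 0.5).int()
    p, r, f1, _ = precision_recall_fscore_support(labels, preds,
                                                  average="binary", zero_division=0)
    auc = roc_auc_score(labels, scores)
    print(f"precision={p:.2f}  recall={r:.2f}  F1={f1:.2f}  AUC={auc:.2f}")

An attention-based weighting of the modalities, or cross-modal Transformer
layers, would replace the concatenation step while keeping the same
window-level classification head.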

Beyond detection, this PhD aims to contribute methodologically to the field
of multimodal fusion applied to pathological speech, with potential impact
in clinical contexts.
Skills



*Expected Skills*

The candidate should hold a Master’s degree in computer science, with
strong skills in machine learning and deep learning, solid proficiency in
Python and frameworks such as PyTorch or TensorFlow, and an interest in
signal processing (audio/video) and, ideally, NLP. Autonomy, rigor,
critical thinking, and analytical abilities are essential, along with good
communication skills and the ability to work effectively in a
multidisciplinary team. An interest in phonetics, linguistics, and speech
disorders, particularly stuttering, will be a plus.
Benefits package

   - Subsidized meals
   - Partial reimbursement of public transport costs
   - Leave: 7 weeks of annual leave + 10 additional days off (RTT,
   full-time basis) + possibility of exceptional leave (e.g., sick
   children, moving house)
   - Possibility of teleworking (after 6 months of employment) and
   flexible working-time arrangements
   - Professional equipment available (videoconferencing, loan of
   computer equipment, etc.)
   - Social, cultural, and sports benefits (Association de gestion des
   œuvres sociales d'Inria)
   - Access to vocational training
   - Social security coverage

Remuneration

€2300 gross/month

Job Details: https://jobs.inria.fr/public/classic/en/offres/2025-09498/topdf


-- 
Kind Regards,
Dr. Shakeel A. Sheikh,
Prof Slim Ouni