Connectionists: Open PhD Position in Deep Probabilistic Generative Models for Audio-Visual Temporal Data at INRIA Grenoble

Chris Reinke c.reinke85 at gmail.com
Sat Jun 13 08:34:31 EDT 2020


Open PhD Position in Deep Probabilistic Generative Models for Audio-Visual
Temporal Data at INRIA Grenoble

More information and application procedure:
https://jobs.inria.fr/public/classic/en/offres/2020-02718

Starting date: 2020-10-01
Duration of contract: 3 years
Deadline to apply: 2020-07-08


Description:

The overall goal of the proposed PhD topic is to develop deep generative
models for the automatic analysis of audio-visual temporal data. In the
context of human-robot interaction, we want to automatically estimate how
many people participate in a conversation, where they are, what they are
saying and to whom, and which gestures they perform [1,2]. The developed
algorithms are expected to be implemented in a companion humanoid robot for
human-robot social interaction. In this PhD work, we will explore the
development of deep probabilistic generative models [3,4].

Learning perception models in multi-person scenarios is challenging because
we need to properly fuse multi-sensory (mainly audio-visual) data,
efficiently solve the combinatorial observation-to-person assignment
problem and account for a time-varying number of people. Therefore, we have
to conceptualize parametric models that are able to faithfully and
efficiently represent a scene, and develop and evaluate the associated
parameter estimation algorithms. Importantly, for the sake of
interpretability, the representation should be structured into a set of
individual cues per person plus a set of collective cues. We will draw on
state-of-the-art techniques for visual person (body, face) detection and
person description (appearance, pose, orientation) on the one hand, and on
speech processing (speech enhancement, automatic speech and speaker
recognition) on the other. Some of the informative features will be
extracted with learnable parametric methods, e.g. deep neural networks
(DNNs). We will need to investigate how to fine-tune these architectures to
satisfy the goals of the project, and to adapt to the data distribution of
multi-person conversational scenarios. Once these features are conceived
and learned, we will be able to perform joint inference of individual and
collective cues, and define and address the combinatorial assignment
problem.
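
To make the observation-to-person assignment problem above concrete, here is a small illustrative sketch (my own toy code, not part of the announcement; the cost values are made up, whereas in practice they would come from learned audio-visual features). It finds the lowest-cost assignment of observations to persons by brute force over all permutations, which is exactly what becomes intractable as the number of persons grows:

```python
import itertools
import math

def best_assignment(cost):
    """Brute-force observation-to-person assignment.

    cost[i][j] = cost of assigning observation i to person j (hypothetical
    numbers for illustration). Enumerates all n! permutations, so it is
    only feasible for tiny n -- which is the combinatorial explosion the
    text refers to.
    """
    n = len(cost)
    best_perm, best_cost = None, math.inf
    for perm in itertools.permutations(range(n)):
        c = sum(cost[i][perm[i]] for i in range(n))
        if c < best_cost:
            best_perm, best_cost = perm, c
    return best_perm, best_cost

# 3 observations x 3 persons, made-up costs
cost = [[4.0, 1.0, 3.0],
        [2.0, 0.0, 5.0],
        [3.0, 2.0, 2.0]]
perm, total = best_assignment(cost)
print(perm, total)  # perm[i] is the person assigned to observation i
```

For a single frame, polynomial-time solvers exist (e.g. the Hungarian algorithm, available as scipy.optimize.linear_sum_assignment); the hard part addressed in the PhD topic is that the assignment interacts with the temporal dynamics across frames.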

The fact that multiple features are extracted and can be assigned to
multiple persons, together with the impact this assignment has on the
temporal dynamics, leads to a combinatorial problem growing exponentially
with time. Clear examples of this are found in multi-person tracking [5]
and in sound separation [6], for which we have explored
generative/Bayesian probabilistic models and associated solutions based on
variational inference. In the present PhD work, these models will be
combined with DNNs. The joint training of probabilistic and deep neural
models is difficult and has to be done with care [7]. The PhD student will
be expected to design deep architectures able to extract cues from raw
data, to conceive their combination with probabilistic models, and to
develop optimization frameworks and algorithms able to soundly estimate
the overall set of parameters.
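
As background on the variational-autoencoder machinery used in [3,4], the following minimal sketch (my own illustrative code, not the team's implementation) shows two standard ingredients of VAE training: the closed-form KL divergence between a diagonal Gaussian posterior q(z|x) = N(mu, exp(log_var)) and the standard normal prior, and the reparameterization trick that keeps sampling differentiable:

```python
import math
import random

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, exp(log_var)) || N(0, I) ), summed over latent dimensions.

    This is the closed-form regularization term in the VAE objective (ELBO).
    """
    return sum(0.5 * (math.exp(lv) + m * m - 1.0 - lv)
               for m, lv in zip(mu, log_var))

def reparameterize(mu, log_var, rng=random):
    """Sample z ~ N(mu, exp(log_var)) as z = mu + sigma * eps, eps ~ N(0, 1).

    Writing the sample this way makes it a deterministic function of
    (mu, log_var) plus independent noise, so gradients can flow through
    the encoder during end-to-end training.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

# A posterior equal to the prior incurs zero KL cost:
kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
z = reparameterize([1.0, -1.0], [0.0, 0.0])
print(kl_zero, z)
```

In the dynamical models targeted by this PhD topic, the same two ingredients appear inside recurrent/temporal architectures rather than a static encoder-decoder pair.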

The PhD work will take place at Inria Grenoble, in Montbonnot-Saint-Martin,
in the Perception Team, headed by Radu Horaud. It will be supervised by
Laurent Girin (Professor Grenoble-INP) & Xavier Alameda-Pineda (Inria
Research Scientist).


Skills:

Research Master's degree, or equivalent, in a discipline connected to
signal and information processing, computer vision and machine learning.
Experience in probabilistic models, specifically variational auto-encoders
is highly welcome. A particular interest/experience in speech/audio
processing, visual recognition, and/or multimodal fusion is a plus. Strong
motivation for research work. Ability both to work independently and to
collaborate within a small team. Computer skills: MATLAB, Python, and deep
learning toolkits (e.g. Keras, PyTorch).


Remuneration:

 - 1st and 2nd year: 1982 euros gross/month
 - 3rd year: 2085 euros gross/month


Benefits package:

 - Subsidized meals
 - Partial reimbursement of public transport costs
 - Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory
reduction in working hours) + possibility of exceptional leave (sick
children, moving home, etc.)
 - Possibility of teleworking (after 6 months of employment) and flexible
organization of working hours
 - Professional equipment available (videoconferencing, loan of computer
equipment, etc.)
 - Social, cultural and sports events and activities
 - Access to vocational training
 - Social security coverage


All the best,
Chris Reinke

--
Postdoctoral Researcher
Perception Unit
Inria Grenoble
www.scirei.net



References

[1] X. Alameda-Pineda, Y. Yan, E. Ricci, O. Lanz, and N. Sebe, "Analyzing
Free-standing Conversational Groups: A Multimodal Approach," in ACM
International Conference on Multimedia, 2015, pp. 4-15.
[2] X. Alameda-Pineda, J. Staiano, R. Subramanian, L. M. Batrinca, E.
Ricci, B. Lepri, O. Lanz, and N. Sebe, "SALSA: A Novel Dataset for
Multimodal Group Behavior Analysis," IEEE Transactions on Pattern Analysis
and Machine Intelligence, vol. 38, no. 8, 2016.
[3] S. Leglaive, X. Alameda-Pineda, L. Girin, and R. Horaud, "A Recurrent
Variational Autoencoder for Speech Enhancement," in IEEE International
Conference on Acoustics, Speech and Signal Processing, 2020.
[4] M. Sadeghi, S. Leglaive, X. Alameda-Pineda, L. Girin, and R. Horaud,
"Audio-visual Speech Enhancement Using Conditional Variational
Auto-Encoders," to appear in IEEE/ACM Transactions on Audio, Speech and
Language Processing, 2020.
[5] Y. Ban, X. Alameda-Pineda, L. Girin, and R. Horaud, "Variational
Bayesian Inference for Audio-Visual Tracking of Multiple Speakers," IEEE
Transactions on Pattern Analysis and Machine Intelligence, 2020.
[6] D. Kounades-Bastian, L. Girin, X. Alameda-Pineda, S. Gannot, and R.
Horaud, "A Variational EM Algorithm for the Separation of Time-Varying
Convolutive Audio Mixtures," IEEE/ACM Transactions on Audio, Speech and
Language Processing, vol. 24, no. 8, pp. 1408-1423, 2016.
[7] S. Lathuilière, P. Mesejo, X. Alameda-Pineda, and R. Horaud, "DeepGUM:
Learning Deep Robust Regression with a Gaussian-Uniform Mixture Model," in
Proceedings of the European Conference on Computer Vision, 2018.