<div dir="ltr">Open PhD Position in Deep Probabilistic Generative Models for Audio-Visual Temporal Data at Inria Grenoble<br><br>More information and application procedure:<br><a href="https://jobs.inria.fr/public/classic/en/offres/2020-02718">https://jobs.inria.fr/public/classic/en/offres/2020-02718</a><br><br>Starting date: 2020-10-01<br>Duration of contract: 3 years<br>Deadline to apply: 2020-07-08<br><br><br>Description:<br><br>The overall goal of the proposed PhD topic is to develop deep generative models for the automatic analysis of audio-visual temporal data. In the context of human-robot interaction, we want to automatically estimate how many people participate in a conversation, where they are, what they are saying, to whom, and which gestures they perform; see [1,2]. The developed algorithms are expected to be implemented in a companion humanoid robot for human-robot social interaction. In this PhD work, we will explore the development of deep probabilistic generative models [3,4].<br><br>Learning perception models in multi-person scenarios is challenging because we need to properly fuse multi-sensory (mainly audio-visual) data, efficiently solve the combinatorial observation-to-person assignment problem, and account for a time-varying number of people. Therefore, we have to conceptualize parametric models that are able to faithfully and efficiently represent a scene, and develop and evaluate the associated parameter estimation algorithms. Importantly, for the sake of interpretability, the representation should be structured into a set of individual cues per person plus a set of collective cues. We will draw on state-of-the-art techniques for visual person (body, face) detection and person description (appearance, pose, orientation) on the one hand, and for speech processing (speech enhancement, automatic speech and speaker recognition) on the other. Part of the informative features will be extracted using learnable parametric methods, e.g. 
deep neural networks (DNNs). We will need to investigate how to fine-tune these architectures to satisfy the goals of the project, and to adapt to the data distribution of multi-person conversational scenarios. Once these features are designed and learned, we will be able to perform joint inference of individual and collective cues, and define and address the combinatorial assignment problem.<br><br>The fact that multiple features are extracted and can be assigned to multiple persons, together with the impact this assignment has on the temporal dynamics, leads to a combinatorial problem that grows exponentially with time. Clear examples are found in multi-person tracking [5] and in sound separation [6], for which we have explored generative/Bayesian probabilistic models and associated solutions based on variational inference. In the present PhD work, models of this kind will be combined with DNNs. The joint training of probabilistic and deep neural models is difficult and has to be done with care [7]. The PhD student will be expected to design deep architectures able to extract cues from raw data, to conceive their combination with probabilistic models, and to develop the optimization frameworks and algorithms able to soundly optimize the overall set of parameters.<br><br>The PhD work will take place at Inria Grenoble, in Montbonnot-Saint-Martin, in the Perception Team, headed by Radu Horaud. It will be supervised by Laurent Girin (Professor Grenoble-INP) and Xavier Alameda-Pineda (Inria Research Scientist).<br><br><br>Skills:<br><br>Research Master's degree, or equivalent, in a discipline connected to signal and information processing, computer vision and machine learning. Experience in probabilistic models, specifically variational auto-encoders, is highly welcome. A particular interest/experience in speech/audio processing, visual recognition, and/or multimodal fusion is a plus. Strong motivation for the research work. 
Ability to work independently and to collaborate within a small team. Computer skills: MATLAB, Python, deep learning toolkits (e.g. Keras, PyTorch).<br><br><br>Remuneration:<br><br> - 1st and 2nd year: 1982 euros gross / month<br> - 3rd year: 2085 euros gross / month<br><br> <br>Benefits package:<br><br> - Subsidized meals<br> - Partial reimbursement of public transport costs<br> - Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)<br> - Possibility of teleworking (after 6 months of employment) and flexible organization of working hours<br> - Professional equipment available (videoconferencing, loan of computer equipment, etc.)<br> - Social, cultural and sports events and activities<br> - Access to vocational training<br> - Social security coverage<br><br> <br>All the best,<br>Chris Reinke<br><br>--<br>Postdoctoral Researcher<br>Perception Unit<br>Inria Grenoble<br><a href="http://www.scirei.net">www.scirei.net</a><br> <br> <br> <br>References<br><br>[1] X. Alameda-Pineda, Y. Yan, E. Ricci, O. Lanz, and N. Sebe, “Analyzing Free-standing Conversational Groups: A Multimodal Approach,” in ACM International Conference on Multimedia, 2015, pp. 4-15.<br>[2] X. Alameda-Pineda, J. Staiano, R. Subramanian, L. M. Batrinca, E. Ricci, B. Lepri, O. Lanz, and N. Sebe, “SALSA: A Novel Dataset for Multimodal Group Behavior Analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 8, 2016.<br>[3] S. Leglaive, X. Alameda-Pineda, L. Girin, and R. Horaud, “A Recurrent Variational Autoencoder for Speech Enhancement,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2020.<br>[4] M. Sadeghi, S. Leglaive, X. Alameda-Pineda, L. Girin, and R. 
Horaud, “Audio-visual Speech Enhancement Using Conditional Variational Auto-Encoders,” to appear in IEEE/ACM Transactions on Audio, Speech and Language Processing, 2020.<br>[5] Y. Ban, X. Alameda-Pineda, L. Girin, and R. Horaud, “Variational Bayesian Inference for Audio-Visual Tracking of Multiple Speakers,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.<br>[6] D. Kounades-Bastian, L. Girin, X. Alameda-Pineda, S. Gannot, and R. Horaud, “A variational EM algorithm for the separation of time-varying convolutive audio mixtures,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, no. 8, pp. 1408–1423, 2016.<br>[7] S. Lathuilière, P. Mesejo, X. Alameda-Pineda, and R. Horaud, “DeepGUM: Learning deep robust regression with a Gaussian-uniform mixture model,” in Proceedings of European Conference on Computer Vision, 2018.</div>