<html><body><div style="font-family: arial, helvetica, sans-serif; font-size: 12pt; color: #000000"><div data-marker="__SIG_POST__"> <!--StartFragment--><h1 class="post-title entry-title" style="box-sizing: border-box; margin: 0px; font-size: 30px; font-family: Lato, sans-serif; font-weight: bold; line-height: normal; color: #af1917; border: 0px none; padding: 0px; font-style: normal; overflow-wrap: break-word; font-variant-ligatures: normal; font-variant-caps: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: #dddddd; text-decoration-style: initial; text-decoration-color: initial;" data-mce-style="box-sizing: border-box; 
margin: 0px; font-size: 30px; font-family: Lato, sans-serif; font-weight: bold; line-height: normal; color: #af1917; border: 0px none; padding: 0px; font-style: normal; overflow-wrap: break-word; font-variant-ligatures: normal; font-variant-caps: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: #dddddd; text-decoration-style: initial; text-decoration-color: initial;">Master R2 Internship in Natural Language Processing: weakly supervised learning for hate speech detection</h1><ul class="post-meta" style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px; position: relative; font-size: 14px; color: #4a474b; font-family: Lato, sans-serif; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: #dddddd; text-decoration-style: initial; text-decoration-color: initial;" data-mce-style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px; position: relative; font-size: 14px; color: #4a474b; font-family: Lato, sans-serif; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: #dddddd; text-decoration-style: initial; text-decoration-color: initial;"><li class="byline" style="box-sizing: border-box; border: none; margin: 5px 0px 0px; padding: 0px 5px 0px 0px; float: left; list-style: none; line-height: normal;" data-mce-style="box-sizing: border-box; border: none; margin: 5px 0px 0px; padding: 0px 5px 0px 0px; float: left; 
list-style: none; line-height: normal;"><br></li></ul><div class="entry-content clearfix" style="box-sizing: border-box; zoom: 1; clear: both; padding-top: 1.5em; color: #4a474b; font-family: Lato, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: #dddddd; text-decoration-style: initial; text-decoration-color: initial;" data-mce-style="box-sizing: border-box; zoom: 1; clear: both; padding-top: 1.5em; color: #4a474b; font-family: Lato, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: #dddddd; text-decoration-style: initial; text-decoration-color: initial;"><p style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;" data-mce-style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;"><strong style="box-sizing: border-box; font-weight: bold;" data-mce-style="box-sizing: border-box; font-weight: bold;">Supervisors:</strong><span> </span>Irina Illina, MdC, Dominique Fohr, CR CNRS</p><p style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;" data-mce-style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;"><strong style="box-sizing: border-box; font-weight: bold;" data-mce-style="box-sizing: border-box; font-weight: bold;">Team</strong>: Multispeech, LORIA-INRIA</p><p style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;" data-mce-style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;"><strong style="box-sizing: 
border-box; font-weight: bold;" data-mce-style="box-sizing: border-box; font-weight: bold;">Contact:</strong><span> </span>illina@loria.fr, dominique.fohr@loria.fr</p><p style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;" data-mce-style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;"><strong style="box-sizing: border-box; font-weight: bold;" data-mce-style="box-sizing: border-box; font-weight: bold;">Duration:</strong><span> </span>5-6 months</p><p style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;" data-mce-style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;"><strong style="box-sizing: border-box; font-weight: bold;" data-mce-style="box-sizing: border-box; font-weight: bold;">Deadline to apply:</strong><span> </span>March 1st, 2020</p><p style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;" data-mce-style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;"><strong style="box-sizing: border-box; font-weight: bold;" data-mce-style="box-sizing: border-box; font-weight: bold;">Required skills</strong>: background in statistics and natural language processing, and programming skills (Perl, Python). Candidates should email a detailed CV and a copy of their diploma.</p><p style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;" data-mce-style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;"><strong style="box-sizing: border-box; font-weight: bold;" data-mce-style="box-sizing: border-box; font-weight: bold;">Motivations and context</strong></p><p style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;" data-mce-style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;">Recent years have seen tremendous growth of the Internet and social networks. Unfortunately, the dark side of this growth is an increase in hate speech. 
Only a small percentage of people use the Internet for harmful activities such as hate speech. However, the impact of this small group of users is extremely damaging.</p><p style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;" data-mce-style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;"><strong style="box-sizing: border-box; font-weight: bold;" data-mce-style="box-sizing: border-box; font-weight: bold;">Hate speech</strong><span> </span>is the subject of different national and international legal frameworks. Manually monitoring and moderating Internet and social media content to identify and remove hate speech is extremely expensive. This internship aims at<span> </span><strong style="box-sizing: border-box; font-weight: bold;" data-mce-style="box-sizing: border-box; font-weight: bold;">designing methods for automatic learning of hate speech detection systems</strong><span> </span>for Internet and social media data. Despite the studies already published on this subject, the results show that the task remains very difficult (Schmidt and Wiegand, 2017; Zhang and Luo, 2018).</p><p style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;" data-mce-style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;">In text classification, documents are usually represented in a vector space and then assigned to predefined classes through supervised machine learning. Each document is represented as a numerical vector computed from the words of the document. How to represent terms numerically in an appropriate way is a basic problem in text classification and directly affects classification accuracy. Developments in neural networks have led to renewed interest in distributional semantics, and more specifically in learning word embeddings (representations of words in a continuous space). 
Computational efficiency was a major factor in popularizing word embeddings. Word embeddings capture syntactic as well as semantic properties of words (Mikolov et al., 2013). As a result, they have outperformed several other word vector representations on different tasks (Baroni et al., 2014).</p><p style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;" data-mce-style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;">Our methodology for hate speech detection builds on recent approaches to text classification with neural networks and word embeddings. In this context, fully connected feed-forward networks, Convolutional Neural Networks (CNN) and Recurrent/Recursive Neural Networks (RNN) have been applied. On the one hand, approaches based on CNNs and RNNs capture rich compositional information and have outperformed previous state-of-the-art results in text classification; on the other hand, they are computationally intensive and require a huge corpus of training data.</p><p style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;" data-mce-style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;">Training these deep neural network (DNN) hate speech detection systems requires a very large corpus of training data. This corpus must contain several thousand social media comments, each labeled as hate or not hate. It is easy to collect social media and Internet comments automatically. However, labeling a huge corpus is time-consuming and very costly. For a few hundred comments, human annotators can do this work manually, but this is not feasible for a huge corpus of comments. 
In this case<span> </span><strong style="box-sizing: border-box; font-weight: bold;" data-mce-style="box-sizing: border-box; font-weight: bold;">weakly supervised learning</strong><span> </span>can be used:<span> </span><strong style="box-sizing: border-box; font-weight: bold;" data-mce-style="box-sizing: border-box; font-weight: bold;">the idea is to train a deep neural network with<span> </span></strong><strong style="box-sizing: border-box; font-weight: bold;" data-mce-style="box-sizing: border-box; font-weight: bold;">a limited amount of labeled data.</strong></p><p style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;" data-mce-style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;"><strong style="box-sizing: border-box; font-weight: bold;" data-mce-style="box-sizing: border-box; font-weight: bold;">The goal of this master internship is to develop a methodology for weakly supervised learning of a hate speech detection system using social network data (Twitter, YouTube, etc.).</strong></p><p style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;" data-mce-style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;"><strong style="box-sizing: border-box; font-weight: bold;" data-mce-style="box-sizing: border-box; font-weight: bold;">Objectives</strong></p><p style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;" data-mce-style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;">In the Multispeech team, we have developed a baseline system for automatic hate speech detection. This system is based on fastText and BERT embeddings (Bojanowski et al., 2017; Devlin et al., 2018) and on CNN/RNN architectures. 
During this internship, the master student will work on this system in the following directions:</p><ul style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0.5em 0px 0.5em 1.5em;" data-mce-style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0.5em 0px 0.5em 1.5em;"><li style="box-sizing: border-box; border: 0px none; margin: 0px 0px 0.5em; padding: 0px;" data-mce-style="box-sizing: border-box; border: 0px none; margin: 0px 0px 0.5em; padding: 0px;">Study of the state-of-the-art approaches in the field of weakly supervised learning;</li><li style="box-sizing: border-box; border: 0px none; margin: 0px 0px 0.5em; padding: 0px;" data-mce-style="box-sizing: border-box; border: 0px none; margin: 0px 0px 0.5em; padding: 0px;">Implementation of a baseline method of weakly supervised learning for our system;</li><li style="box-sizing: border-box; border: 0px none; margin: 0px 0px 0.5em; padding: 0px;" data-mce-style="box-sizing: border-box; border: 0px none; margin: 0px 0px 0.5em; padding: 0px;">Development of a new methodology for weakly supervised learning. Two cases will be studied. In the first case, we train the hate speech detection system using a small labeled corpus. Then, we proceed incrementally. 
We use this first system to label more data, retrain the system, and use it to label new data. In the second case, we consider learning with noisy labels (labels that may be incorrect or that come from several annotators who disagree).</li></ul><p style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;" data-mce-style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;"><strong style="box-sizing: border-box; font-weight: bold;" data-mce-style="box-sizing: border-box; font-weight: bold;">References</strong></p><p style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;" data-mce-style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;">Baroni, M., Dinu, G., and Kruszewski, G. “Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors”. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Volume 1, pages 238-247, 2014.</p><p style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;" data-mce-style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;">Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. “Enriching word vectors with subword information”. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.</p><p style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;" data-mce-style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;">Dai, A. M. and Le, Q. V. “Semi-supervised sequence learning”. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems 28, pages 3061-3069. Curran Associates, Inc, 2015.</p><p style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;" data-mce-style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;">Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. 
“BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. arXiv:1810.04805v1, 2018.</p><p style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;" data-mce-style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;">Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. “Distributed representations of words and phrases and their compositionality”. In Advances in Neural Information Processing Systems 26, pages 3111-3119. Curran Associates, Inc, 2013.</p><p style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;" data-mce-style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;">Schmidt, A. and Wiegand, M. “A Survey on Hate Speech Detection using Natural Language Processing”. Workshop on Natural Language Processing for Social Media, 2017.</p><p style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;" data-mce-style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;">Zhang, Z. and Luo, L. “Hate speech detection: a solved problem? The Challenging Case of Long Tail on Twitter”. arxiv.org/pdf/1803.03662, 2018.</p><p style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;" data-mce-style="box-sizing: border-box; margin: 0px; border: 0px none; padding: 0px;"> </p></div><!--EndFragment--><div style="clear: both;" data-mce-style="clear: both;"><br></div></div><div data-marker="__SIG_POST__"><br data-mce-bogus="1"></div><div data-marker="__SIG_POST__">-- <br></div><div>Irina Illina<br><br>Associate Professor <br>Lorraine University<br>LORIA-INRIA<br>Multispeech Team<br>office C147 <br>Building C <br>615 rue du Jardin Botanique<br>54600 Villers-les-Nancy Cedex<br>Tel:+ 33 3 54 95 84 90</div></div></body></html>