<div dir="ltr"><div style="font-size:12.8px">Dear faculty and students,</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">We look forward to seeing you next Tuesday, October 17, at noon in NSH 3305 for the AI Seminar, sponsored by Apple. To learn more about the seminar series, please visit the AI Seminar <a href="http://www.cs.cmu.edu/~aiseminar/" target="_blank">webpage</a>.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">On Tuesday, <a href="http://www.cs.cmu.edu/~xiaolonw/">Xiaolong Wang</a><span style="font-size:12.8px"> will give the following talk: </span></div><div style="font-size:12.8px"><br></div><div><div><span style="font-size:12.8px">Title: Learning Visual Representations for Object Detection</span></div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px"><span style="font-size:12.8px">Abstract:</span></div><div style="font-size:12.8px"><span style="font-size:12.8px"><br></span></div><div style="font-size:12.8px"><span style="font-size:12.8px"><div style="font-size:12.8px">Object detection is at the center of many computer vision applications. The current pipeline for training object detectors includes ConvNet pre-training and fine-tuning. In this talk, I am going to cover our work on self-supervised/unsupervised ConvNet pre-training as well as optimization strategies for fine-tuning.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">For ConvNet pre-training, instead of using millions of labeled images, we explored learning visual representations using supervision from the data itself without any human labels, i.e., self-supervised learning. Specifically, we proposed to exploit different self-supervised approaches to learn representations invariant to (i) inter-instance variations (two objects in the same class should have similar features) and (ii) intra-instance variations (viewpoint, pose, deformations, illumination). 
Instead of combining the two approaches with multi-task learning, we organized the data with multiple variations in a graph and applied simple transitive rules to generate pairs of images with richer visual invariance for training. This approach brings object detection accuracy on the MSCOCO dataset to within 1% of methods that use large amounts of labeled data (e.g., ImageNet).</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">For object detection fine-tuning, we proposed to train object detectors invariant to occlusions and deformations. The common solution is to use a data-driven strategy -- collect large-scale datasets which have object instances under different conditions. However, like categories, occlusions and object deformations also follow a long-tail distribution. Some occlusions and deformations are so rare that they hardly ever occur; yet we want to learn a model invariant to such occurrences. In this talk, we propose to learn an adversarial network that generates examples with occlusions and deformations. The goal of the adversary is to generate examples that are difficult for the object detector to classify. In our framework, both the original detector and the adversary are learned jointly. We show significant improvements on different datasets (VOC, COCO) with different network architectures (AlexNet, VGG16, ResNet101).</div></span></div><div style="font-size:12.8px"><br></div></div></div>