From donghanw at cs.cmu.edu Fri Jun 1 12:50:50 2012
From: donghanw at cs.cmu.edu (Donghan (Jarod) Wang)
Date: Fri, 1 Jun 2012 12:50:50 -0400
Subject: [auton-users] LOU1 restored
Message-ID:

Hi everyone,

LOU1, a compute node, has been restored after it became unresponsive due to
memory over-commitment. All jobs were terminated (by SIGTERM). All services
are back up and running. Please check your jobs.

Description
----------------
User programs over-committed memory on LOU1, which caused it to stop
responding to any commands.

Date/Time
---------------
Jun 1, 11:40 AM

Solution
------------
The following steps were taken to resolve the issue:
1. Issued an OOM (out-of-memory) kill, but it failed to reclaim memory.
2. Sent SIGTERM to all processes except init.
3. Restored all services.

Thanks,
- Jarod

--
Donghan (Jarod) Wang
Research Programmer
Robotics Institute
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213
Email: donghanw at cs.cmu.edu
Tel: +1 412 268 1238