[auton-users] LOU1 restored

Donghan (Jarod) Wang donghanw at cs.cmu.edu
Fri Jun 1 12:50:50 EDT 2012


Hi everyone,

LOU1, a compute node, stopped responding because of memory over-commitment
and has been restored. All jobs were terminated (by SIGTERM). All services
are back up and running. Please check your jobs.

Description
----------------
User programs over-committed memory on LOU1, causing it to stop
responding to commands.
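
As an illustration only (this was not part of the fix on LOU1), a job can
cap its own address space so that a runaway allocation fails inside the job
instead of exhausting the node. The sketch below uses Python's standard
resource module on Linux; the helper name cap_address_space and the 8 GiB
figure are assumptions chosen for the example.

```python
import resource


def cap_address_space(max_bytes):
    """Set this process's soft address-space limit (hypothetical helper)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    # The soft limit may never exceed the existing hard limit.
    if hard != resource.RLIM_INFINITY:
        max_bytes = min(max_bytes, hard)
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, hard))
    return max_bytes


if __name__ == "__main__":
    # Assumed budget of 8 GiB; pick a value that suits the job.
    limit = cap_address_space(8 * 1024**3)
    print("soft address-space limit is now", limit, "bytes")
```

Once the limit is in place, an allocation that would exceed it typically
raises MemoryError in the job rather than pushing the whole node into swap.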

Date/Time
---------------
Jun 1, 11:40 AM

Solution
------------
The following steps were taken to resolve the issue:
1. Triggered the OOM (out-of-memory) killer, but it failed to reclaim memory.
2. Sent SIGTERM to all processes except init.
3. Restored all services.


Thanks,
- Jarod




-- 
Donghan (Jarod) Wang
Research Programmer
Robotics Institute
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213
Email: donghanw at cs.cmu.edu
Tel: +1 412 268 1238