[auton-users] LOU1 restored
Donghan (Jarod) Wang
donghanw at cs.cmu.edu
Fri Jun 1 12:50:50 EDT 2012
Hi everyone,
LOU1, a compute node, has been restored after it became unresponsive due to
memory over-commitment. All jobs were terminated (by SIGTERM). All services
are back up and running. Please check your jobs.
Description
----------------
User programs over-committed memory on LOU1, causing it to stop
responding to any commands.
Date/Time
---------------
Jun 1, 11:40 AM
Solution
------------
The following steps were taken to resolve the issue:
1. Invoked the OOM (out-of-memory) killer, but it failed to reclaim memory.
2. Sent SIGTERM to all processes except init.
3. Restored all services.
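
If you launch jobs from a Python wrapper, the sketch below shows one way to tell whether a job was ended by SIGTERM, as in step 2. It is only an illustration, not part of the recovery procedure: the stand-in "job" and the helper name terminated_by_sigterm are hypothetical. It relies on the documented behavior that, on POSIX systems, subprocess reports a negative return code equal to the signal number when a child is killed by a signal.

```python
import signal
import subprocess
import sys
import time


def terminated_by_sigterm(returncode):
    """Return True if a subprocess return code indicates death by SIGTERM.

    Hypothetical helper: on POSIX, subprocess reports -N when the child
    was killed by signal N.
    """
    return returncode == -signal.SIGTERM


# Stand-in "job" (an assumption for this example): sleeps for a minute.
job = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

# Give the child a moment to start, then terminate it the way step 2 did.
time.sleep(0.5)
job.send_signal(signal.SIGTERM)
returncode = job.wait()

print("return code:", returncode)
print("killed by SIGTERM:", terminated_by_sigterm(returncode))
```

Jobs started directly from a shell usually report the same event as exit status 143 (128 + 15) instead.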
Thanks,
- Jarod
--
Donghan (Jarod) Wang
Research Programmer
Robotics Institute
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213
Email: donghanw at cs.cmu.edu
Tel: +1 412 268 1238