[auton-users] LOU1 restored
    Donghan (Jarod) Wang 
    donghanw at cs.cmu.edu
       
    Tue Aug  7 17:08:03 EDT 2012
    
    
  
Hi everyone,
LOU1, a compute node, was rebooted due to a kernel crash. All jobs were
terminated. All services are back up and running now. Please check your jobs.
Description
----------------
Multiple jobs exhausted memory on the host while others were hitting the
limit of CPU resources. This resulted in very expensive swap I/O and
eventually a kernel panic.
Date/Time
---------------
Kernel crashed on Aug. 7, 3:37PM
Reboot took place on Aug. 7, 3:50PM
Suggestion
----------------
It's advisable to
   1. check the CPU and memory consumption before running your job(s);
   the *top* command will give you a rough idea of the system load
   2. be mindful of the shared environment
   3. reduce resource consumption in the code if possible
   4. request a reservation if necessary (not guaranteed, though :)
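
For those who would rather script the check in item 1 than eyeball *top*, here is a minimal sketch in Python. It assumes a Linux host with /proc/meminfo. The function names and both thresholds (max_load_per_cpu, min_free_fraction) are made up for illustration and are not a lab policy, so adjust them to taste.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def host_looks_busy(load1, ncpu, mem_total_kb, mem_free_kb,
                    max_load_per_cpu=0.8, min_free_fraction=0.2):
    """Return True if load or memory pressure exceeds the thresholds.

    Both thresholds are hypothetical defaults chosen for this sketch.
    """
    cpu_busy = load1 > max_load_per_cpu * ncpu
    mem_busy = mem_free_kb < min_free_fraction * mem_total_kb
    return cpu_busy or mem_busy


def check_host():
    """Read the 1-minute load average and memory figures from this host."""
    load1, _, _ = os.getloadavg()
    ncpu = os.cpu_count() or 1
    with open("/proc/meminfo") as f:
        mem = parse_meminfo(f.read())
    # MemAvailable is missing on older kernels; fall back to MemFree.
    free_kb = mem.get("MemAvailable", mem.get("MemFree", 0))
    return host_looks_busy(load1, ncpu, mem["MemTotal"], free_kb)


if __name__ == "__main__":
    print("host looks busy, consider waiting" if check_host() else "host looks ok")
```

Running it before submitting a job gives the same rough picture as *top*; it does not account for jobs that other users are about to start.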
Please let me know if you have any questions.
Thanks,
- Jarod
-- 
Donghan (Jarod) Wang
Research Programmer
Robotics Institute
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213
Email: donghanw at cs.cmu.edu
Tel: +1 412 268 1238