[auton-users] Fwd: LOU1 restored

Donghan (Jarod) Wang donghanw at cs.cmu.edu
Fri Feb 22 09:05:59 EST 2013


Dear Auton users,

LOU1, a compute node, has been rebooted due to system overload. All jobs
were terminated gracefully. All services on the node are back up and
running now. Please check your jobs.

Date/Time
---------------
Rebooted on Feb. 22 at 8:56 AM

Description
----------------
A bug in a user program exhausted both RAM and swap, which caused the
server to stop responding. This happened around 22:13 on Feb. 21.

Before the reboot, all jobs were terminated gracefully so that they had a
chance to save their data to disk. It is strongly recommended that you
check your data to ensure consistency.

I understand some of you are working towards deadlines and want every drop
of computing power. However, it's important to keep in mind that this is a
shared environment and to keep your jobs' resource consumption reasonable.
Please don't hesitate to contact me if you have any questions or concerns.
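
One possible way to keep a runaway job from exhausting RAM and swap is to
have the job cap its own memory at startup. The sketch below assumes a
Python job running on Linux; the helper name cap_memory and the 8 GiB
figure are illustrative only, not a lab policy. It uses the standard
library's resource module to set the soft address-space limit, so an
oversized allocation fails inside the job (typically as a MemoryError)
instead of pushing the node into swap.

```python
import resource


def cap_memory(max_bytes):
    """Set this process's soft address-space limit to at most max_bytes.

    Hypothetical helper for illustration. The hard limit is left
    unchanged, and the soft limit is never set above it.
    """
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    if hard == resource.RLIM_INFINITY:
        soft = max_bytes
    else:
        soft = min(max_bytes, hard)
    resource.setrlimit(resource.RLIMIT_AS, (soft, hard))
    return soft


if __name__ == "__main__":
    # Assumed budget of 8 GiB; choose a value that suits the job.
    cap_memory(8 * 1024**3)
    # ... the rest of the job runs under the cap ...
```

Non-Python jobs can get a similar effect by running "ulimit -v" (value in
KiB) in the shell before launching the program.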

Thanks,
Jarod