From donghanw at cs.cmu.edu  Fri Feb  1 14:30:37 2013
From: donghanw at cs.cmu.edu (Donghan (Jarod) Wang)
Date: Fri, 1 Feb 2013 14:30:37 -0500
Subject: [auton-users] Fwd: LOT2 restored
In-Reply-To: <CA+N7t7N0oYCmCmZ7dkMLsNRN7txsbjpkayEJjuJNRJUgQn99BA@mail.gmail.com>
References: <CA+N7t7MgVgz6=mnpwP5xEhUCBjA3GTCxqA-KMj+zHJPnPFn0xQ@mail.gmail.com>
	<CA+N7t7N0oYCmCmZ7dkMLsNRN7txsbjpkayEJjuJNRJUgQn99BA@mail.gmail.com>
Message-ID: <CA+N7t7O32td6aC-OHjy0bPjth1di4kNMEpwFMcoJPjdHrdN2JA@mail.gmail.com>

Hi everyone,

LOT2, a compute node, rebooted unexpectedly after a kernel panic. All jobs
were terminated. All services are back up and running now. Please check
your jobs.

Description
----------------
A user job exhausted the memory and overloaded the system, resulting in a
system crash.

Date/Time
---------------
Crashed on Feb. 1 1:15 PM
Rebooted on Feb. 1 2:15 PM

For the next few hours, users are strongly advised to avoid running jobs
that may overload the system. A faulty disk was replaced this morning and
the system is still syncing the RAID array, so any system crash will delay
the sync process. The recovery is expected to finish in 12 hours.

Please let me know if you have any questions/concerns.

Thanks,
Jarod

-- 
Donghan (Jarod) Wang
Research Programmer
Robotics Institute
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213
Email: donghanw at cs.cmu.edu
Tel: +1 412 268 1238

From donghanw at cs.cmu.edu  Tue Feb  5 17:39:30 2013
From: donghanw at cs.cmu.edu (Donghan (Jarod) Wang)
Date: Tue, 5 Feb 2013 17:39:30 -0500
Subject: [auton-users] R 2.15.2
Message-ID: <CA+N7t7MScWNC6kHz29N6S9sGKG1Gc00oUreJrzVFzC5_r7PLNg@mail.gmail.com>

Hello everyone,

R has been upgraded to 2.15.2 on all compute nodes. Simply run the command
R as you normally would, and you are ready to go.

The old version, R 2.14, will remain available for a while. To launch it,
run the command R_2_14.

Cheers,
Jarod

From donghanw at cs.cmu.edu  Wed Feb  6 09:04:50 2013
From: donghanw at cs.cmu.edu (Donghan (Jarod) Wang)
Date: Wed, 6 Feb 2013 09:04:50 -0500
Subject: [auton-users] LOU1 restored
Message-ID: <CA+N7t7MiVftFjyVgfO5-duR0G2=2DJ4wsa7pZzoOAXNr_sn0HQ@mail.gmail.com>

Hi everyone,

LOU1, a compute node, has been rebooted after a kernel panic. All jobs were
terminated. All services on the node are back up and running now. Please
check your jobs.

Date/Time
---------------
Crashed on Feb. 6 8:15 AM
Rebooted on Feb. 6 8:54 AM

Description
----------------
Most likely the system was overloaded by an I/O-heavy user program running
alongside compute jobs.

Please let me know if you have any questions/concerns.

Thanks,
Jarod

From donghanw at cs.cmu.edu  Mon Feb 11 10:35:19 2013
From: donghanw at cs.cmu.edu (Donghan (Jarod) Wang)
Date: Mon, 11 Feb 2013 10:35:19 -0500
Subject: [auton-users] Fwd: LOU1 restored
In-Reply-To: <CA+N7t7MiVftFjyVgfO5-duR0G2=2DJ4wsa7pZzoOAXNr_sn0HQ@mail.gmail.com>
References: <CA+N7t7MiVftFjyVgfO5-duR0G2=2DJ4wsa7pZzoOAXNr_sn0HQ@mail.gmail.com>
Message-ID: <CA+N7t7NXVH+DzFPK6Yk-TiHATFj0DxX42Z58KxPeLV3AjwsZ-A@mail.gmail.com>

Hi everyone,

LOU1, a compute node, has been rebooted after running out of memory. All
jobs were terminated gracefully. All services on the node are back up and
running now. Please check your jobs.

Date/Time
---------------
Rebooted on Feb. 11 10:09 AM

Description
----------------
A user job exhausted both RAM and swap, which caused the server to stop
responding.
Before rebooting, all jobs were terminated gracefully so that they had a
chance to save their data to disk. It's strongly recommended that you check
your jobs to ensure consistency.

Please let me know if you have any questions/concerns.

Thanks,
Jarod

From donghanw at cs.cmu.edu  Wed Feb 13 08:19:20 2013
From: donghanw at cs.cmu.edu (Donghan (Jarod) Wang)
Date: Wed, 13 Feb 2013 08:19:20 -0500
Subject: [auton-users] Rebooting LOP1, LOP2, LOS1, LOU1 at 1pm, Feb. 13
Message-ID: <CA+N7t7NZga5P0g16km4YT+6nQ9N+B+hH74zSOqVYVUSmoUPx=w@mail.gmail.com>

Dear Auton users,

LOP1, LOP2, LOS1, and LOU1 will be rebooted at *1:00 PM today (Feb. 13)* to
restore the NFS service (/auton). I'll send a notification as soon as the
servers are up and running.

It's important that you save your data, stop running jobs, and log out of
those nodes before 1:00 PM.

Please notify me ASAP if you have objections/concerns/questions regarding
the reboot.


Thanks!

Jarod

From donghanw at cs.cmu.edu  Wed Feb 13 13:37:18 2013
From: donghanw at cs.cmu.edu (Donghan (Jarod) Wang)
Date: Wed, 13 Feb 2013 13:37:18 -0500
Subject: [auton-users] Rebooting LOP1, LOP2, LOS1, LOU1 at 1pm, Feb. 13
In-Reply-To: <CA+N7t7NZga5P0g16km4YT+6nQ9N+B+hH74zSOqVYVUSmoUPx=w@mail.gmail.com>
References: <CA+N7t7NZga5P0g16km4YT+6nQ9N+B+hH74zSOqVYVUSmoUPx=w@mail.gmail.com>
Message-ID: <CA+N7t7PrF7dkLFedvGRK9mpDtxU+89gn3ZYRJ5vNspgN9NDcEg@mail.gmail.com>

Dear Auton users,

LOP1, LOP2, LOS1, LOU1 are back online. All services on the nodes are up
and running now.

Please let me know if you notice anything odd and/or have
questions/concerns.

Thanks,
Jarod



From donghanw at cs.cmu.edu  Thu Feb 14 18:22:48 2013
From: donghanw at cs.cmu.edu (Donghan (Jarod) Wang)
Date: Thu, 14 Feb 2013 18:22:48 -0500
Subject: [auton-users] Neill1 Restored
Message-ID: <CA+N7t7MCMEbvmV0vDif6Lt4J+0WK00tw+V7euSaj5mF8AnO6NQ@mail.gmail.com>

Dear Neill users,

Neill1, a compute node, has been rebooted due to system overload. All jobs
were terminated gracefully. All services on the node are back up and
running now.

Date/Time
---------------
Rebooted on Feb. 14 6:10 PM

Description
----------------
A user launched over 100 jobs, which far exceeded the CPU capacity of the
node (it has 4 CPU cores). The system then stopped responding.
Before rebooting, all jobs were terminated gracefully so that they had a
chance to save their data to disk. It's strongly recommended that you check
your jobs to ensure consistency.
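
One way to avoid this kind of overload is to cap the number of jobs running
at once at the number of CPU cores. The following is only a minimal Python
sketch under that assumption, not something installed on Neill1; the names
worker_count, run_task, and run_all are hypothetical.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def worker_count(requested, cores):
    """Return how many workers to start: at least 1, at most `cores`."""
    return max(1, min(requested, cores))


def run_task(n):
    # Hypothetical stand-in for one unit of real work.
    return n * n


def run_all(tasks, requested_workers):
    # Queue every task, but run only as many at once as there are cores.
    cores = os.cpu_count() or 1
    workers = worker_count(requested_workers, cores)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_task, tasks))


if __name__ == "__main__":
    # 100 tasks are submitted; on a 4-core node at most 4 run at a time.
    results = run_all(range(100), requested_workers=100)
    print(len(results))
```

With this pattern the remaining tasks wait in a queue instead of competing
for the same cores.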

Please let me know if you have any questions/concerns.

Thanks,
Jarod

From donghanw at cs.cmu.edu  Fri Feb 22 09:05:59 2013
From: donghanw at cs.cmu.edu (Donghan (Jarod) Wang)
Date: Fri, 22 Feb 2013 09:05:59 -0500
Subject: [auton-users] Fwd: LOU1 restored
In-Reply-To: <CA+N7t7NXVH+DzFPK6Yk-TiHATFj0DxX42Z58KxPeLV3AjwsZ-A@mail.gmail.com>
References: <CA+N7t7MiVftFjyVgfO5-duR0G2=2DJ4wsa7pZzoOAXNr_sn0HQ@mail.gmail.com>
	<CA+N7t7NXVH+DzFPK6Yk-TiHATFj0DxX42Z58KxPeLV3AjwsZ-A@mail.gmail.com>
Message-ID: <CA+N7t7MVtLfG+MM184ZUknEu78_JPDMmXD=BLSq440Aadj6NpQ@mail.gmail.com>

Dear Auton users,

LOU1, a compute node, has been rebooted due to system overload. All jobs
were terminated gracefully. All services on the node are back up and
running now. Please check your jobs.

Date/Time
---------------
Rebooted on Feb. 22 8:56 AM

Description
----------------
A bug in a user program exhausted both RAM and swap, which caused the
server to stop responding. This happened around 10:13 PM on Feb. 21.

Before rebooting, all jobs were terminated gracefully so that they had a
chance to save their data to disk. It's strongly recommended that you check
your data to ensure consistency.

I understand some of you are working towards deadlines and want every drop
of the computing power. However, it's important to keep in mind that this
is a shared environment and to keep your jobs' resource consumption
reasonable. Please don't hesitate to contact me if you have any
questions/concerns.
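
As one possible safeguard, a job can cap its own memory so that a bug makes
the job fail instead of exhausting RAM and swap on the node. The following
is a minimal sketch assuming a Linux node and Python's standard resource
module; the function name cap_address_space and the 8 GiB figure are
illustrative assumptions, not a site policy.

```python
import resource


def cap_address_space(max_bytes):
    """Lower this process's soft address-space limit to `max_bytes`.

    The value is clamped to the existing hard limit, which an
    unprivileged process cannot raise. Returns the limit applied.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    if hard != resource.RLIM_INFINITY:
        max_bytes = min(max_bytes, hard)
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, hard))
    return max_bytes


if __name__ == "__main__":
    # Illustrative cap of 8 GiB; choose a value that suits the node.
    cap_address_space(8 * 1024**3)
    try:
        # An oversized request is refused instead of pushing into swap.
        data = bytearray(16 * 1024**3)
    except MemoryError:
        print("allocation refused by the memory cap")
```

The limit is inherited by child processes, so setting it at the top of a
launcher script also covers the jobs it starts.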

Thanks,
Jarod