From donghanw at cs.cmu.edu Fri Feb 1 14:30:37 2013
From: donghanw at cs.cmu.edu (Donghan (Jarod) Wang)
Date: Fri, 1 Feb 2013 14:30:37 -0500
Subject: [auton-users] Fwd: LOT2 restored
In-Reply-To:
References:
Message-ID:

Hi everyone,

LOT2, a compute node, rebooted unexpectedly due to a kernel panic. All jobs were terminated. All services are back up and running now. Please check your jobs.

Description
----------------
A user job exhausted the memory and overloaded the system, which caused the crash.

Date/Time
---------------
Crashed on Feb. 1, 1:15 PM
Rebooted on Feb. 1, 2:15 PM

For the next few hours, users are strongly advised to avoid running jobs that may overload the system. A faulty disk was replaced this morning and the system is still syncing the RAID array; any system crash will delay the sync. The recovery is expected to finish in 12 hours.

Please let me know if you have any questions/concerns.

Thanks,
Jarod

--
Donghan (Jarod) Wang
Research Programmer
Robotics Institute
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213
Email: donghanw at cs.cmu.edu
Tel: +1 412 268 1238

From donghanw at cs.cmu.edu Tue Feb 5 17:39:30 2013
From: donghanw at cs.cmu.edu (Donghan (Jarod) Wang)
Date: Tue, 5 Feb 2013 17:39:30 -0500
Subject: [auton-users] R 2.15.2
Message-ID:

Hello everyone,

R has been upgraded to 2.15.2 on all compute nodes. Run the command R as you normally would, and you are ready to go.

The old version, R 2.14, will remain available for a while. To launch it, run the command R_2_14.

Cheers,
Jarod

From donghanw at cs.cmu.edu Wed Feb 6 09:04:50 2013
From: donghanw at cs.cmu.edu (Donghan (Jarod) Wang)
Date: Wed, 6 Feb 2013 09:04:50 -0500
Subject: [auton-users] LOU1 restored
Message-ID:

Hi everyone,

LOU1, a compute node, was rebooted due to a kernel panic. All jobs were terminated. All services on the node are back up and running now. Please check your jobs.

Date/Time
---------------
Crashed on Feb. 6, 8:15 AM
Rebooted on Feb. 6, 8:54 AM

Description
----------------
Most likely the system was overloaded by an I/O-heavy user program running alongside compute jobs.

Please let me know if you have any questions/concerns.

Thanks,
Jarod

From donghanw at cs.cmu.edu Mon Feb 11 10:35:19 2013
From: donghanw at cs.cmu.edu (Donghan (Jarod) Wang)
Date: Mon, 11 Feb 2013 10:35:19 -0500
Subject: [auton-users] Fwd: LOU1 restored
In-Reply-To:
References:
Message-ID:

Hi everyone,

LOU1, a compute node, was rebooted because it ran out of memory. All jobs were terminated gracefully. All services on the node are back up and running now. Please check your jobs.

Date/Time
---------------
Rebooted on Feb. 11, 10:09 AM

Description
----------------
A user job exhausted both RAM and swap, which led the server to stop responding to the world.
Before the reboot, all jobs were terminated gracefully so that they had a chance to save their data to disk. You are strongly advised to check your jobs to ensure consistency.

Please let me know if you have any questions/concerns.

Thanks,
Jarod

From donghanw at cs.cmu.edu Wed Feb 13 08:19:20 2013
From: donghanw at cs.cmu.edu (Donghan (Jarod) Wang)
Date: Wed, 13 Feb 2013 08:19:20 -0500
Subject: [auton-users] Rebooting LOP1, LOP2, LOS1, LOU1 at 1pm, Feb. 13
Message-ID:

Dear Auton users,

LOP1, LOP2, LOS1, LOU1 will be rebooted at *1:00 PM today (Feb. 13)* to restore the NFS service (/auton).
I'll send a notification as soon as the servers are up and running.

It's important that you save your data, stop running jobs, and log out of those nodes before 1:00 PM.

Please notify me ASAP if you have objections/concerns/questions regarding the reboot.

Thanks!
Jarod

From donghanw at cs.cmu.edu Wed Feb 13 13:37:18 2013
From: donghanw at cs.cmu.edu (Donghan (Jarod) Wang)
Date: Wed, 13 Feb 2013 13:37:18 -0500
Subject: [auton-users] Rebooting LOP1, LOP2, LOS1, LOU1 at 1pm, Feb. 13
In-Reply-To:
References:
Message-ID:

Dear Auton users,

LOP1, LOP2, LOS1, LOU1 are back online. All services on the nodes are up and running now.

Please let me know if you notice anything odd and/or have questions/concerns.

Thanks,
Jarod

From donghanw at cs.cmu.edu Thu Feb 14 18:22:48 2013
From: donghanw at cs.cmu.edu (Donghan (Jarod) Wang)
Date: Thu, 14 Feb 2013 18:22:48 -0500
Subject: [auton-users] Neill1 Restored
Message-ID:

Dear Neill users,

Neill1, a compute node, was rebooted due to system overload. All jobs were terminated gracefully. All services on the node are back up and running now.

Date/Time
---------------
Rebooted on Feb. 14, 6:10 PM

Description
----------------
A user launched over 100 jobs, which far exceeded the CPU capacity (the node has 4 CPU cores). The system then stopped responding to the world.
Before the reboot, all jobs were terminated gracefully so that they had a chance to save their data to disk. You are strongly advised to check your jobs to ensure consistency.

Please let me know if you have any questions/concerns.

Thanks,
Jarod

From donghanw at cs.cmu.edu Fri Feb 22 09:05:59 2013
From: donghanw at cs.cmu.edu (Donghan (Jarod) Wang)
Date: Fri, 22 Feb 2013 09:05:59 -0500
Subject: [auton-users] Fwd: LOU1 restored
In-Reply-To:
References:
Message-ID:

Dear Auton users,

LOU1, a compute node, was rebooted due to system overload. All jobs were terminated gracefully. All services on the node are back up and running now. Please check your jobs.

Date/Time
---------------
Rebooted on Feb. 22, 8:56 AM

Description
----------------
There is a bug in a user program. It exhausted both RAM and swap, which led the server to stop responding to the world. This happened around 10:13 PM on Feb. 21.
Before the reboot, all jobs were terminated gracefully so that they had a chance to save their data to disk. You are strongly advised to check your data to ensure consistency.

I understand some of you are working towards deadlines and want every drop of computing power. However, it's important to keep in mind that this is a shared environment and to keep your jobs' resource consumption reasonable.

Please don't hesitate to contact me if you have any questions/concerns.

Thanks,
Jarod