From donghanw at cs.cmu.edu Wed Aug 1 08:50:05 2012
From: donghanw at cs.cmu.edu (Donghan (Jarod) Wang)
Date: Wed, 1 Aug 2012 08:50:05 -0400
Subject: [auton-users] LOT2 restored

Hi everyone,

LOT2, a compute node, has been restored after a kernel crash. All jobs were terminated. All services are back and running now. Please check your jobs.

Description
----------------
Based on the log, there are two possible causes:
1. A faulty disk
2. System overload
Further analysis and inspection will be conducted to identify the issue.

Date/Time
---------------
Tue Jul 31 16:30:47 EDT 2012

Please let me know if you have any questions.

Thanks,
- Jarod

--
Donghan (Jarod) Wang
Research Programmer
Robotics Institute
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213
Email: donghanw at cs.cmu.edu
Tel: +1 412 268 1238

From donghanw at cs.cmu.edu Mon Aug 6 10:54:12 2012
From: donghanw at cs.cmu.edu (Donghan (Jarod) Wang)
Date: Mon, 6 Aug 2012 10:54:12 -0400
Subject: [auton-users] LOS1 restored

Hi everyone,

LOS1, a compute node, has been rebooted after being overloaded. All jobs were terminated. All services are back and running now. Please check your jobs.

Description
----------------
One or more runaway jobs overloaded the machine and caused it to stop responding.

Date/Time
---------------
Misbehavior started on Aug. 4
Reboot took place on Mon Aug 6 09:22:25 EDT 2012

Please let me know if you have any questions.

Thanks,
- Jarod

From donghanw at cs.cmu.edu Tue Aug 7 17:08:03 2012
From: donghanw at cs.cmu.edu (Donghan (Jarod) Wang)
Date: Tue, 7 Aug 2012 17:08:03 -0400
Subject: [auton-users] LOU1 restored

Hi everyone,

LOU1, a compute node, has been rebooted after a kernel crash. All jobs were terminated. All services are back and running now.
Please check your jobs.

Description
----------------
Multiple jobs exhausted memory on the host, while some hit the limit of CPU resources. This resulted in very expensive swap I/O and eventually a kernel panic.

Date/Time
---------------
Kernel crashed on Aug. 7, 3:37 PM
Reboot took place on Aug. 7, 3:50 PM

Suggestion
----------------
It's advisable to
1. check the CPU and memory consumption before running your job(s); the command *top* will give you a rough idea of the system load
2. be mindful of the shared environment
3. reduce resource consumption in the code if possible
4. request a reservation if necessary (not guaranteed though :)

Please let me know if you have any questions.

Thanks,
- Jarod

From donghanw at cs.cmu.edu Sat Aug 18 16:16:53 2012
From: donghanw at cs.cmu.edu (Donghan (Jarod) Wang)
Date: Sat, 18 Aug 2012 16:16:53 -0400
Subject: [auton-users] LOU1 restored

Hi everyone,

LOU1, a compute node, has been rebooted after a kernel crash. All jobs were terminated. All services are back and running now. Please check your jobs.

Description
----------------
System overload.

Date/Time
---------------
Crashed on Aug. 18, 8:36 AM
Rebooted on Aug. 18, 4:00 PM

Please let me know if you have any questions.

Thanks,
- Jarod

From donghanw at cs.cmu.edu Sat Aug 18 16:18:19 2012
From: donghanw at cs.cmu.edu (Donghan (Jarod) Wang)
Date: Sat, 18 Aug 2012 16:18:19 -0400
Subject: [auton-users] LOT2 restored

Hi everyone,

LOT2, a compute node, has been rebooted after a kernel crash. All jobs were terminated. All services are back and running now. Please check your jobs.

Description
----------------
System overload.

Date/Time
---------------
Crashed on Aug. 17, 5:51 PM
Rebooted on Aug. 18, 4:00 PM

Please let me know if you have any questions.

Thanks,
- Jarod

From donghanw at cs.cmu.edu Tue Aug 21 13:31:29 2012
From: donghanw at cs.cmu.edu (Donghan (Jarod) Wang)
Date: Tue, 21 Aug 2012 13:31:29 -0400
Subject: [auton-users] LOW1 restored

Hi everyone,

LOW1, a compute node, has been rebooted after a system overload. All jobs were terminated. All services are back and running now. Please check your jobs.

Description
----------------
Memory overcommitment led to crashes when the machine ran out of memory.

Date/Time
---------------
Happened on Aug. 21, 12:51 AM
Rebooted on Aug. 21, 1:23 PM

Please let me know if you have any questions.

Thanks,
- Jarod
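
Suggestion 1 in the Aug. 7 LOU1 notice (check CPU and memory consumption before running your jobs) could be scripted roughly as below. This is only a sketch, assuming a Linux host that exposes /proc/meminfo. The function names, the 0.8 load-per-CPU threshold, and the job's memory requirement are all hypothetical choices for illustration, not a policy stated in the notices.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def available_kb(info):
    """Estimate memory available to new jobs, in kB."""
    # MemAvailable is only reported by newer kernels; otherwise fall back
    # to the rough estimate free + buffers + page cache.
    if "MemAvailable" in info:
        return info["MemAvailable"]
    return info.get("MemFree", 0) + info.get("Buffers", 0) + info.get("Cached", 0)


def ok_to_start(load1, ncpu, avail_kb, need_kb, max_load_per_cpu=0.8):
    """Return True if the host looks idle enough for one more job.

    max_load_per_cpu is a hypothetical threshold, not a site rule.
    """
    return load1 <= max_load_per_cpu * ncpu and avail_kb >= need_kb


def check_host(need_kb):
    """Read live load and memory figures from the current (Linux) host."""
    load1, _, _ = os.getloadavg()      # 1-minute load average
    ncpu = os.cpu_count() or 1
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    return ok_to_start(load1, ncpu, available_kb(info), need_kb)
```

Calling check_host(4 * 1024 * 1024) before launching a job that needs about 4 GB would return False when the node is already busy or short on memory. Like *top*, it only gives a rough snapshot of the system load.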