From mjbaysek at cs.cmu.edu Tue Oct 4 15:47:38 2011
From: mjbaysek at cs.cmu.edu (Michael J. Baysek)
Date: Tue, 04 Oct 2011 15:47:38 -0400
Subject: [auton-users] LOU1 Tweaks
Message-ID: <4E8B62DA.5060804@cs.cmu.edu>

Hi Lab,

I have changed the way the vm subsystem on LOU1 handles memory allocation. I have made these changes to reduce the chance that future compute jobs can take down the server.

Specifically, I have disabled memory overcommit completely and I have removed swap space. (Swap is not going to help you if you have already used 128 GB of RAM...)

In the future, expect your process to die if it asks for more RAM than is available on the system. Also, if multiple users are running concurrently, know that if the machine exceeds the 128 GB mark, somebody's process will die.

This may sound harsh, but it is preferable to the entire server going down, possibly for hours until it can be rebooted manually.

As always, it is important to be mindful of other users when launching big jobs. If you are about to start a big job while someone is already busy on a node, it is common courtesy to drop that user an email to communicate your intentions and learn the details of their currently running job. Nobody wants to have a job that's been running for a week die because someone launched something that overloaded the node.

Please let me know if there are any questions or concerns regarding this change.

Best,
Mike

--
Michael J. Baysek
Systems Analyst
Carnegie Mellon University / Auton Lab
412-268-8939 - mjbaysek at cs.cmu.edu
http://www.autonlab.org

From mjbaysek at cs.cmu.edu Tue Oct 4 16:29:14 2011
From: mjbaysek at cs.cmu.edu (Michael J.
Baysek)
Date: Tue, 04 Oct 2011 16:29:14 -0400
Subject: [auton-users] Auton Lab Server Outage Oct 21-22
In-Reply-To: <000301cc82cc$ef957330$cec05990$@cs.cmu.edu>
References: <000301cc82cc$ef957330$cec05990$@cs.cmu.edu>
Message-ID: <4E8B6C9A.1050505@cs.cmu.edu>

Hi Lab,

Please put it on your calendar: Due to the power upgrade of the WeH 3611 machine room, all Auton Lab servers will be down starting at 8 PM Friday October 21. You can expect the system to be back up by 3 PM Saturday, barring any complications with the power.

Best,
Mike

-------- Original Message --------
Subject: Power Outage in Wean October 22, 2011
Date: Tue, 4 Oct 2011 15:36:39 -0400
From: SCS Help Desk
To: Help Desk

There is a scheduled power outage on Saturday October 22, 2011 that will affect Wean Hall corridors 3500, 3600, 3700, 4600 and 4700. The outage is necessary for FMS and an electrical contractor to service switchgear that provides power to those areas.

AFFECTED AREAS in SCS: The SCS Data Center in Wean Hall 3611

START DATE: Friday October 21, 2011 9 P.M. EST
END DATE: Saturday October 22, 2011 11 A.M. EST

SCS COMPUTING SERVICES AFFECTED:

All servers located in the SCS data center in Wean Hall will be shut down starting Friday at 9 P.M. This includes:

1) All project servers located in the Wean Hall data center
2) All High Performance Compute Clusters
3) Facility backup servers
4) Poster printing in SCS Operations

Servers located in SCS data center in the Gates-Hillman Center, where most of the core servers for SCS reside, WILL NOT BE AFFECTED. There should be no interruption of SCS network connectivity, email service, AFS or printing.

Please contact the SCS Help Desk at x8-4231 or send mail to help+ at cs.cmu.edu with any questions or concerns regarding this power outage.

Thank you for your attention,
SCS Help Desk

From mjbaysek at cs.cmu.edu Sun Oct 9 10:25:46 2011
From: mjbaysek at cs.cmu.edu (Michael J.
Baysek)
Date: Sun, 09 Oct 2011 10:25:46 -0400
Subject: [auton-users] LOP1 down
Message-ID: <4E91AEEA.4060008@cs.cmu.edu>

Hi Lab,

Please use lop2.autonlab.org for the time being to get into the compute system. LOP1 is down at the moment.

Additionally, you should become familiar with GNU screen if you are not already. It will keep your jobs from dying if connections are interrupted.

Why screen?
Using Screen
Screen Cheat Sheet

- Mike

From mjbaysek at cs.cmu.edu Mon Oct 10 16:01:09 2011
From: mjbaysek at cs.cmu.edu (Michael J. Baysek)
Date: Mon, 10 Oct 2011 16:01:09 -0400
Subject: [auton-users] LOP1 restored
Message-ID: <4E934F05.7080201@cs.cmu.edu>

Hi Lab,

LOP1 is back online. The (very old) disk in it crashed and has been replaced. Let me know if you have any problems with the rebuilt LOP1.

Best,
Mike

--
Michael J. Baysek
Systems Analyst
Carnegie Mellon University / Auton Lab
412-268-8939 - mjbaysek at cs.cmu.edu
http://www.autonlab.org

From mjbaysek at cs.cmu.edu Fri Oct 21 13:53:28 2011
From: mjbaysek at cs.cmu.edu (mbaysek@gmail.com)
Date: Fri, 21 Oct 2011 13:53:28 -0400
Subject: [auton-users] Auton Computing System Shutdown 8pm
Message-ID: <37ee33c5-b9bc-4d6f-9047-0d8b3fc918de@email.android.com>

This is just a reminder that tonight at 8 pm, SCS Computing Facilities is forcing a complete shutdown of the Wean Hall machine room where our gear resides. This is the first of two planned outages that aim to add 600 amps of power capacity to the machine room.

They anticipate approximately a 12 hour window of downtime. If all goes as planned on their end, I will be energizing our equipment starting around 9 am Saturday. I expect complete system availability by 11 am.

During the outage, all Auton Lab equipment, services, and websites will be unavailable, as well as some core SCS services.
Since the mailing list server will be unavailable during the outage, I will manually send out a mail detailing any exceptions that may occur with this schedule.

--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

From mjbaysek at cs.cmu.edu Sat Oct 22 11:13:34 2011
From: mjbaysek at cs.cmu.edu (Michael J. Baysek)
Date: Sat, 22 Oct 2011 11:13:34 -0400
Subject: [auton-users] Auton System Status
Message-ID: <4EA2DD9E.3030601@cs.cmu.edu>

Hi Lab,

The system status has been restored. Have a great weekend.

Mike

From mjbaysek at cs.cmu.edu Sun Oct 23 11:55:01 2011
From: mjbaysek at cs.cmu.edu (Michael J. Baysek)
Date: Sun, 23 Oct 2011 11:55:01 -0400
Subject: [auton-users] Server is Down
Message-ID: <4EA438D5.4090706@cs.cmu.edu>

Hi Lab,

Last night, around 2 am, our file server went down. I believe the problem to be a kernel bug relating to the XFS filesystem we use for our storage. There seems to be an as-yet-unresolved XFS bug in the 3.0.4 kernel.

I am looking into this and will let you know once I have the system back up and running, probably under an older kernel. However, I upgraded to kernel 3.0.4 to fix another problem, so I will need a bit of time to find a kernel least likely to present issues.

-Mike

From mjbaysek at cs.cmu.edu Sun Oct 23 15:43:33 2011
From: mjbaysek at cs.cmu.edu (Michael J. Baysek)
Date: Sun, 23 Oct 2011 15:43:33 -0400
Subject: [auton-users] Server Restored
Message-ID: <4EA46E65.8020806@cs.cmu.edu>

The system is operational once more.

--
Michael J. Baysek
Systems Analyst
Carnegie Mellon University / Auton Lab
412-268-8939 - mjbaysek at cs.cmu.edu
http://www.autonlab.org