From mjbaysek at cs.cmu.edu Fri Feb 6 23:44:46 2009
From: mjbaysek at cs.cmu.edu (Michael J. Baysek)
Date: Fri, 06 Feb 2009 23:44:46 -0500
Subject: [auton-users] LOP1
In-Reply-To: <200902062310.n16NASvq018997@lmon.autonlab.org>
References: <200902062310.n16NASvq018997@lmon.autonlab.org>
Message-ID: <498D11BE.7080505@cs.cmu.edu>

Hi everyone,

This is my semi-regular reminder: please do not run memory-heavy jobs on LOP1. Use *any* other machine instead. If you inadvertently bring LOP1 down, you will also take down CVS for about half of the lab.

Thanks,
Mike

nagios wrote:
> ***** Nagios *****
>
> Notification Type: PROBLEM
>
> Service: SWAP
> Host: lop1
> Address: 192.168.1.12
> State: CRITICAL
>
> Date/Time: Fri Feb 6 18:10:28 EST 2009
>
> Additional Info:
>
> (Service Check Timed Out)
>

From mjbaysek at cs.cmu.edu Mon Feb 16 16:24:49 2009
From: mjbaysek at cs.cmu.edu (Michael J. Baysek)
Date: Mon, 16 Feb 2009 16:24:49 -0500
Subject: [auton-users] LOP1 policy change: Please Read
Message-ID: <4999D9A1.4030901@cs.cmu.edu>

Hi everyone,

This message is to notify you of a change now in effect on LOP1 that may affect how you can run processes there.

*SUMMARY*

Use of LOP1 is now limited to processes that require less than 2 GB of memory (including overcommit). Any process using more than this will be terminated automatically and without warning. Be aware that this may cause problems with software packages like Matlab or ASL, even under normal use.

*BACKGROUND*

Because many of you use LOP1 as the default server for accessing CVS, it has a special role in the lab: it needs to remain available for that purpose even when overall lab CPU load is high. When LOP1 is overloaded, particularly by high-memory jobs that cause swap-related disk thrashing, the result is effectively a denial of service for many CVS users. Until now it was all too easy to cause this, even completely by accident (which is typically how it happens).
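
For scripts that launch many runs, the 2 GB limit summarized above and the return-code check recommended in this message can be combined in a small wrapper. The following is a minimal sketch, assuming Linux and Python 3.8 or later; it is not the mechanism actually used on LOP1. It caps a single job's virtual address space at 2 GB with "RLIMIT_AS" (the limit that "ulimit -v" sets) instead of applying it at login, and appends the launch arguments of any run that segfaults to a log file so the run can be repeated on another machine. The names "apply_vm_limit", "run_limited" and "failed_runs.txt" are hypothetical.

```python
import resource
import shlex
import signal
import subprocess

# Hypothetical cap mirroring the 2 GB VIRT ulimit described for LOP1.
LIMIT_BYTES = 2 * 1024 ** 3


def apply_vm_limit():
    """Cap the calling process's virtual address space, like `ulimit -v`.

    Used as preexec_fn, so it runs in the child just before the job starts.
    """
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    # An unprivileged process may lower its hard limit but not raise it.
    cap = LIMIT_BYTES if hard == resource.RLIM_INFINITY else min(LIMIT_BYTES, hard)
    resource.setrlimit(resource.RLIMIT_AS, (cap, cap))


def run_limited(args, failed_log="failed_runs.txt"):
    """Run a job under the cap; if it segfaults, append its launch arguments to failed_log."""
    code = subprocess.run(args, preexec_fn=apply_vm_limit).returncode
    # subprocess reports death by a signal as a negative number (-11 for SIGSEGV),
    # while a shell stores the same event in $? as 128 + 11 = 139.
    if code in (-signal.SIGSEGV, 128 + signal.SIGSEGV):
        with open(failed_log, "a") as log:
            log.write(shlex.join(args) + "\n")
    return code
```

Under such a cap an oversized allocation fails outright (malloc returns NULL in C, and Python raises MemoryError), so only programs that do not handle the failure end up segfaulting. The wrapper logs only segfaults, as the message describes; a job that handles the failed allocation itself and exits with an ordinary error status would need its own check.
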

*WHAT'S CHANGED?*

All new logins on LOP1 are subject to a ulimit on virtual (VIRT) memory. When a single process asks for more than 2 GB of memory, the request will fail and the process will typically segfault and terminate. Jobs that need less than 2 GB will still run as expected.

*WHY THE HARD LIMIT?*

The limit ensures that CVS stays accessible during heavy load, and a hard limit is the only reliable way to guarantee that.

*I STILL NEED TO USE LOP1, WHAT CAN I DO?*

It is highly recommended that you DO NOT run applications like Matlab on LOP1; use another machine instead. Warnings have been added to the "matlab" and "asl" commands on LOP1 to help prevent problems, but be aware that not all applications can issue such a warning.

If you are concerned about how the limit might affect your processes, you have two options:

1) Run on any other machine, or
2) Have your script check the shell variable "$?" immediately after the job exits for return code 139, which is what the shell reports for a segfault (128 + 11, the signal number of SIGSEGV). If a particular run segfaults on LOP1, you can simply record its launch arguments to a file and execute it on a different machine.

--
Michael J. Baysek, Systems Analyst
Carnegie Mellon University - Auton Lab
www.cmu.edu - www.autonlab.org
412-268-8939

From mjbaysek at cs.cmu.edu Wed Feb 25 12:54:34 2009
From: mjbaysek at cs.cmu.edu (Michael J. Baysek)
Date: Wed, 25 Feb 2009 12:54:34 -0500
Subject: [auton-users] LOQ1 and LOQ3
Message-ID: <49A585DA.9050802@cs.cmu.edu>

LOQ1 and LOQ3 were just rebooted after they became non-responsive. Please check your jobs.

--
Michael J. Baysek, Systems Analyst
Carnegie Mellon University - Auton Lab
www.cmu.edu - www.autonlab.org
412-268-8939

From mjbaysek at cs.cmu.edu Wed Feb 25 14:21:04 2009
From: mjbaysek at cs.cmu.edu (Michael J. Baysek)
Date: Wed, 25 Feb 2009 14:21:04 -0500
Subject: [auton-users] LOP4
Message-ID: <49A59A20.8080602@cs.cmu.edu>

LOP4 was just rebooted after it became non-responsive. Please check your jobs.

--
Michael J. Baysek, Systems Analyst
Carnegie Mellon University - Auton Lab
www.cmu.edu - www.autonlab.org
412-268-8939

From mjbaysek at cs.cmu.edu Wed Feb 25 16:08:40 2009
From: mjbaysek at cs.cmu.edu (Michael J. Baysek)
Date: Wed, 25 Feb 2009 16:08:40 -0500
Subject: [auton-users] NSH network change tonight
Message-ID: <49A5B358.8060607@cs.cmu.edu>

This is a reminder that SCS is changing NSH over to a routed segment tonight.

Those of you using NSH machines that I maintain have already been switched over to DHCP. This change should not require any new configuration on your PC.

Be aware that the switchover may disrupt any connections you leave open when you leave the office tonight, so be sure to save everything you are working on, just in case.

--
Michael J. Baysek, Systems Analyst
Carnegie Mellon University - Auton Lab
www.cmu.edu - www.autonlab.org
412-268-8939

From mjbaysek at cs.cmu.edu Thu Feb 26 10:08:45 2009
From: mjbaysek at cs.cmu.edu (Michael J. Baysek)
Date: Thu, 26 Feb 2009 10:08:45 -0500
Subject: [auton-users] NSH Network Upgrade Aftermath
Message-ID: <49A6B07D.2090209@cs.cmu.edu>

This message is intended for those of you in NSH.

The network change SCS made last night is causing very slow access times. The help desk has already received calls from floors 1-4 of NSH, and the network team has been dispatched.

If you are unable to access any vital resources, I can provide workarounds until the network is back to normal. In the meantime, I recommend using the wireless network if possible.

--
Michael J. Baysek, Systems Analyst
Carnegie Mellon University - Auton Lab
www.cmu.edu - www.autonlab.org
412-268-8939