[auton-users] LOP1 policy change: Please Read

Michael J. Baysek mjbaysek at cs.cmu.edu
Mon Feb 16 16:24:49 EST 2009


Hi everyone.  This mail is to notify you of a change that has taken 
effect on LOP1 that may affect how you can run processes there. 


*SUMMARY*


Use of LOP1 is now limited to processes which require less than 2 GB of 
memory (including overcommit) to run.  Any process using more than this 
will be terminated automatically, and without warning.  Be aware that 
this may cause problems with software packages like Matlab or ASL even 
under normal use cases.


*BACKGROUND*


Because LOP1 is used by many as the default server to access CVS, it has 
a special role in the lab.  It needs to be available for that purpose, 
even when overall lab CPU load is high.  When LOP1 is overloaded, 
particularly in the case of high memory jobs and swap related 
disk-thrashing, it creates what could be called a denial-of-service for 
many users of CVS.  Until now, it was all too easy to do this, even 
completely by accident (which is typically how it happens).


*WHAT'S CHANGED?*


All new logins on LOP1 are now subject to a ulimit on virtual memory 
(the VIRT column in top).  When a single process requests more than 2 GB 
of virtual memory, it will segfault and terminate.  Jobs that stay under 
2 GB will still run as expected.
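
To see which limit your current session is running under, you can query 
it from a script.  The following is only a minimal sketch in Python.  It 
assumes the limit is enforced through the address-space resource limit 
(the value "ulimit -v" reports), which this mail does not state 
explicitly.  The function names are illustrative.

```python
import resource


def describe_limit(limit_bytes):
    """Render a resource limit given in bytes as a human-readable string."""
    if limit_bytes == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit_bytes / 2**30:.2f} GiB"


def address_space_limit():
    """Return the (soft, hard) address-space limits of this process."""
    # Assumption: the LOP1 limit shows up as RLIMIT_AS.
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    return describe_limit(soft), describe_limit(hard)


if __name__ == "__main__":
    soft, hard = address_space_limit()
    print(f"soft: {soft}, hard: {hard}")
```

If the sketch prints "unlimited" on LOP1, the session probably predates 
the change, since the limit applies to new logins.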


*WHY THE HARD LIMIT?*


Imposing a hard limit is the only way to help ensure that CVS stays 
accessible during heavy load.


*I STILL NEED TO USE LOP1, WHAT CAN I DO?*


It is highly recommended that you DO NOT run applications like Matlab on 
LOP1, and that you use another machine instead.  Warnings have been 
added to the "matlab" and "asl" commands on LOP1 in hopes of preventing 
problems.  Be aware that not all applications can issue this warning.


Finally, if you are concerned about how the limits might affect your 
process, you have two options:

1) Run on any other machine, or
2) Have your script check the shell variable "$?" immediately after the 
   job exits.  A return code of 139 indicates a segfault.

If a particular run segfaults on LOP1, you could simply record its 
launch arguments to a file for execution on a different machine. 
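
If you launch jobs from a Python script rather than a shell, the 
segfault check might look like the sketch below.  Python's subprocess 
module reports a child killed by SIGSEGV as -11, while a shell reports 
the same event as 139 (128 + 11), so the check accepts both.  The file 
name "deferred_jobs.txt" and the function names are hypothetical, chosen 
only for illustration.

```python
import shlex
import signal
import subprocess
import sys

# Exit status a shell reports in "$?" for a job killed by SIGSEGV.
SHELL_SEGFAULT_STATUS = 128 + signal.SIGSEGV


def died_of_segfault(returncode):
    """True if the return code indicates the job was killed by SIGSEGV."""
    return returncode in (-signal.SIGSEGV, SHELL_SEGFAULT_STATUS)


def run_or_defer(argv, deferred_path="deferred_jobs.txt"):
    """Run argv; if it segfaults, append its command line to deferred_path."""
    returncode = subprocess.run(argv).returncode
    if died_of_segfault(returncode):
        # Record the launch arguments so the job can be rerun elsewhere.
        with open(deferred_path, "a") as fh:
            fh.write(shlex.join(argv) + "\n")
    return returncode


if __name__ == "__main__" and len(sys.argv) > 1:
    code = run_or_defer(sys.argv[1:])
    if died_of_segfault(code):
        print("job segfaulted; arguments saved for another machine")
```

Saved as, say, run_job.py (a made-up name), it would be invoked as 
"python run_job.py <command> <args...>".  Keep in mind that a segfault 
can have causes other than the memory limit, so a deferred job is not 
guaranteed to succeed elsewhere.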


-- 
--
Michael J. Baysek, Systems Analyst
Carnegie Mellon University - Auton Lab
www.cmu.edu - www.autonlab.org
412-268-8939



