[auton-users] LOU1 Tweaks

Michael J. Baysek mjbaysek at cs.cmu.edu
Tue Oct 4 15:47:38 EDT 2011


Hi Lab,

I have changed how the VM subsystem on LOU1 handles memory allocation, 
to reduce the chance that a future compute job takes down the server.  
Specifically, I have disabled memory overcommit completely and removed 
swap space.  (Swap is not going to help you if you have already used 
128 GB of RAM...)
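
If you want to see what this looks like on a node, here is a minimal 
Python sketch that reads the kernel's commit accounting from 
/proc/meminfo.  It assumes a Linux system that exposes the CommitLimit, 
Committed_AS, and SwapTotal fields, and it assumes that "overcommit 
disabled" corresponds to strict accounting (vm.overcommit_memory = 2).  
The helper names parse_meminfo and commit_headroom_kb are made up for 
illustration; they are not part of any tool installed on LOU1.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def commit_headroom_kb(fields):
    """kB that can still be committed before reaching CommitLimit."""
    return fields["CommitLimit"] - fields["Committed_AS"]


if __name__ == "__main__":
    path = "/proc/meminfo"  # Linux-specific; absent on other systems
    if os.path.exists(path):
        with open(path) as f:
            info = parse_meminfo(f.read())
        print("SwapTotal (kB):   ", info.get("SwapTotal"))
        print("CommitLimit (kB): ", info.get("CommitLimit"))
        print("Committed_AS (kB):", info.get("Committed_AS"))
        if "CommitLimit" in info and "Committed_AS" in info:
            print("Headroom (kB):    ", commit_headroom_kb(info))
```

Under that assumption, requests that would push Committed_AS past 
CommitLimit are expected to be refused, so the headroom figure is a 
rough guide to how much more memory can be requested across all users.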

From now on, expect your process to die if it asks for more RAM than 
is available on the system.  Also, if multiple users are running jobs 
concurrently and total usage exceeds the 128 GB mark, somebody's 
process will die.
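
For long-running jobs it may be worth handling a refused allocation 
rather than crashing outright.  The following is only a sketch, 
assuming your job is written in Python, where a failed allocation 
typically surfaces as a MemoryError (native code would instead see 
malloc return NULL).  The function name try_allocate is hypothetical.

```python
import sys


def try_allocate(n_bytes):
    """Return a zero-filled buffer of n_bytes, or None if the request is refused."""
    try:
        return bytearray(n_bytes)
    except MemoryError:
        # The allocator could not satisfy the request; let the caller decide
        # whether to checkpoint, shrink the problem, or exit.
        return None


if __name__ == "__main__":
    buf = try_allocate(1024 * 1024)  # 1 MiB, small enough to succeed normally
    if buf is None:
        print("allocation refused; saving partial results and exiting",
              file=sys.stderr)
        sys.exit(1)
    print("allocated", len(buf), "bytes")
```

Catching the error gives the job a chance to write out partial results 
before exiting.  It does not protect a process that is killed for 
other reasons.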

This may sound harsh, but it is preferable to the entire server going 
down, possibly for hours until it can be rebooted manually.

As always, it is important to be mindful of other users when launching 
big jobs.  If you are about to start a big job while someone is already 
busy on a node, it is common courtesy to drop that user an email to 
communicate your intentions and learn the details about their currently 
running job.  Nobody wants to have a job that's been running for a week 
die because someone launched something that overloaded the node.

Please let me know if there are any questions or concerns regarding this 
change.

Best,

Mike

-- 
Michael J. Baysek
Systems Analyst
Carnegie Mellon University / Auton Lab
412-268-8939 - mjbaysek at cs.cmu.edu
http://www.autonlab.org