[auton-users] LOU1 Tweaks
Michael J. Baysek
mjbaysek at cs.cmu.edu
Tue Oct 4 15:47:38 EDT 2011
Hi Lab,
I have changed the way the VM subsystem on LOU1 handles memory
allocation, to reduce the chance that a future compute job takes
down the server. Specifically, I have disabled memory overcommit
completely and removed swap space. (Swap is not going to help you
if you have already used 128 GB of RAM...)
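If you want to see these settings for yourself, here is a minimal Python sketch that reports the overcommit policy and the configured swap on a Linux host. It assumes that "overcommit disabled" corresponds to vm.overcommit_memory = 2 (strict accounting), which this message does not state explicitly. The function names parse_meminfo and describe are made up for illustration.

```python
from pathlib import Path

# Meaning of the three documented values of vm.overcommit_memory.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit",
    1: "always overcommit",
    2: "strict accounting, no overcommit",
}


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def describe(overcommit_text, meminfo_text):
    """Summarize the overcommit mode and total swap in one line."""
    mode = int(overcommit_text.strip())
    swap_kb = parse_meminfo(meminfo_text).get("SwapTotal", 0)
    label = OVERCOMMIT_MODES.get(mode, "unknown")
    return f"overcommit_memory={mode} ({label}), swap={swap_kb} kB"


if __name__ == "__main__":
    overcommit_file = Path("/proc/sys/vm/overcommit_memory")
    meminfo_file = Path("/proc/meminfo")
    if overcommit_file.exists() and meminfo_file.exists():
        print(describe(overcommit_file.read_text(), meminfo_file.read_text()))
    else:
        print("Linux /proc files not found; nothing to report.")
```

The parsing is kept separate from the file reads so it can be tried on saved /proc/meminfo text from any machine.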
In the future, expect your process to die if it asks for more RAM than
is available on the system. Also, if multiple users are running jobs
concurrently and their combined usage reaches the 128 GB mark,
somebody's process will die.
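To estimate whether a big job is likely to fit before launching it, one option is to compare the kernel's CommitLimit and Committed_AS fields in /proc/meminfo. Under strict accounting, the kernel refuses allocations that would push Committed_AS past CommitLimit. The following is a rough sketch, not a guarantee: the numbers are a snapshot, other users can allocate memory a moment later, and the helper names commit_headroom_kb and would_fit are hypothetical.

```python
import re


def commit_headroom_kb(meminfo_text):
    """Return CommitLimit - Committed_AS (in kB) from /proc/meminfo-style text."""
    values = {}
    for name in ("CommitLimit", "Committed_AS"):
        match = re.search(rf"^{name}:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
        if match is None:
            raise ValueError(f"{name} not found in meminfo text")
        values[name] = int(match.group(1))
    return values["CommitLimit"] - values["Committed_AS"]


def would_fit(request_gb, meminfo_text):
    """True if a request of request_gb GiB fits in the current commit headroom."""
    request_kb = int(request_gb * 1024 * 1024)
    return request_kb <= commit_headroom_kb(meminfo_text)


if __name__ == "__main__":
    # Made-up sample values for illustration: 128 GiB limit, 96 GiB committed.
    sample = (
        "CommitLimit:    134217728 kB\n"
        "Committed_AS:   100663296 kB\n"
    )
    print(commit_headroom_kb(sample))  # 33554432 kB, i.e. 32 GiB of headroom
    print(would_fit(16, sample))       # True
    print(would_fit(64, sample))       # False
```

On a real node, pass the contents of /proc/meminfo instead of the sample string. Even when the check passes, the job can still fail if another user allocates memory first.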
This may sound harsh, but it is preferable to the entire server going
down, possibly for hours until it can be rebooted manually.
As always, it is important to be mindful of other users when launching
big jobs. If you are about to start a big job while someone is already
busy on a node, it is common courtesy to drop that user an email to
communicate your intentions and learn the details about their currently
running job. Nobody wants to have a job that's been running for a week
die because someone launched something that overloaded the node.
Please let me know if there are any questions or concerns regarding this
change.
Best,
Mike
--
Michael J. Baysek
Systems Analyst
Carnegie Mellon University / Auton Lab
412-268-8939 - mjbaysek at cs.cmu.edu
http://www.autonlab.org