[auton-users] Auton Lab Compute Resources: Choosing the Right Machine
Michael J. Baysek
mjbaysek at cs.cmu.edu
Thu Feb 28 11:21:03 EST 2008
Hello Lab,
This is a general announcement in response to an event last night that
resulted in people's jobs being terminated. This message is intended to
help prevent it from happening again, so that nobody else loses work.
** I apologize in advance for the length of this message. Nonetheless,
it is important that everyone reads it fully. **
BACKGROUND:
Early this morning, numerous processes were launched on machines that
were being shared by two or more users (LOP1, LOP2, and LOQ2), and those
processes maxed out the memory. When this happens, the system tries to
preserve itself by indiscriminately killing processes to free up memory.
If this happens on a machine that only one person is using, the damage
is limited to whatever that user was working on. However, if it happens
on a shared machine, the user who exhausted the memory not only loses
their own work but also causes others to lose theirs, as was the case
last night on LOP1, LOP2, and LOQ2.
PREVENTION:
First, before you run a job on a machine, you should always check the
status page. The status page shows which machines are in use, the
current memory usage on each machine (as top output), and how much
total memory each machine has.
The status page is located at: http://www.autonlab.org/status/ . If
you cannot access it, please contact me and I will reset your
credentials. Please bookmark this page and check it every time you run
a job on the lab machines.
I find that it is very easy to sign into a machine and work there for a
day or two before launching a big job. The machine status may change in
the meantime, so it is good practice to check the machine status
immediately before you launch your process, to make sure someone else
hasn't already started something.
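As a complement to the status page (not a replacement for it), the
last-minute check can also be done on the machine itself. The following
is only a rough sketch, not a lab tool: it assumes a Linux machine that
exposes /proc/meminfo, and the function names and the 25% reserve are
hypothetical choices made for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number}.

    Most fields are reported by the kernel in kB.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def available_kb(info):
    """Estimate how much memory a new job could use, in kB."""
    # Newer kernels report MemAvailable directly; older ones do not,
    # so fall back to a rough estimate from free memory plus caches.
    if "MemAvailable" in info:
        return info["MemAvailable"]
    return info.get("MemFree", 0) + info.get("Buffers", 0) + info.get("Cached", 0)


def safe_to_launch(info, job_kb, reserve_fraction=0.25):
    """True if the job fits while leaving reserve_fraction of RAM spare."""
    reserve = int(info.get("MemTotal", 0) * reserve_fraction)
    return available_kb(info) - job_kb >= reserve


def read_local_meminfo(path="/proc/meminfo"):
    """Read and parse memory information for the local machine."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

For example, safe_to_launch(read_local_meminfo(), job_kb=2 * 1024 * 1024)
asks whether a job expected to need about 2GB would still leave a
quarter of the machine's RAM spare. It only sees the current moment on
one machine, so it cannot tell you what another user is about to start.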
Second, if all machines are already in use, you should always prefer a
machine with more memory. It really is a very bad idea to share a
machine that has only 4GB of RAM with another user unless you know for
certain that your memory requirements are negligible. You should also
try to contact me and let me know you are planning to share a machine.
I may be able to direct you to a machine that is better suited.
Third, if there is any chance that what you are about to do will
exhaust the memory on a machine, please make sure nobody else is
running something on that machine before you do so. Any scripts you
have that fork other processes should be written and tested carefully
on a machine nobody else is using.
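One way to make such tests less risky is to put a ceiling on how much
memory the script can request, so that a runaway job fails on its own
instead of pushing the whole machine out of memory. The sketch below is
an illustration only, not a lab requirement: it assumes a Linux machine
and Python's standard resource module, and the helper names
cap_address_space and run_capped are hypothetical.

```python
import resource
import subprocess
import sys


def cap_address_space(max_bytes):
    """Lower this process's soft virtual-memory limit to max_bytes."""
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    if hard != resource.RLIM_INFINITY:
        # The soft limit may not exceed the existing hard limit.
        max_bytes = min(max_bytes, hard)
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, hard))


def run_capped(cmd, max_bytes):
    """Run cmd under a memory cap and return its exit code.

    Processes that cmd forks inherit the same per-process cap.
    """
    # preexec_fn runs in the child after fork and before exec.
    proc = subprocess.run(cmd, preexec_fn=lambda: cap_address_space(max_bytes))
    return proc.returncode
```

For example, run_capped(["./my_job.sh"], 2 * 1024**3) would run a
(hypothetical) script with each process limited to 2GB of address
space. Note that the limit applies to each process separately: a script
that forks ten children can still use up to ten times the cap in total,
so this does not replace testing on an otherwise idle machine.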
Fourth, a caveat when running jobs on LOP1: LOP1, among other things,
is used by many for access to CVS. If the machine is too heavily taxed,
or its memory is exhausted, CVS access may be affected for everyone.
Only lightweight jobs should be run on LOP1. If your job is
long-running or more demanding, please use any other machine first.
Finally, if you want to protect your mission-critical jobs from being
preempted by another user, contact me to request a reservation for a
machine. If you reserve a machine, it is not possible for another user
to start a job there, and your job will be safe. Reservations are
normally used for project, conference, or paper deadlines, timing
experiments, and long-running jobs. It is neither necessary nor useful
to request a reservation for casual use of a machine.
I hope that this email has been useful and informative. Again, the
intention is to help prevent future loss of work for everyone. Please
contact me if you have any questions or concerns about the guidelines here.
--
Michael J. Baysek, Systems Analyst
Carnegie Mellon University - Auton Lab
www.cmu.edu - www.autonlab.org
412-268-8939
For full contact information, including IM handles, visit
http://www.autonlab.org/auton_intranet/admin.html