[auton-users] Auton Lab Compute Resources: Choosing the Right Machine

Michael J. Baysek mjbaysek at cs.cmu.edu
Thu Feb 28 11:21:03 EST 2008


Hello Lab,


This is a general announcement in response to an incident last night 
that resulted in people's jobs being terminated.  This message is 
intended to help prevent a recurrence, so that nobody else loses work. 


** I apologize in advance for the length of this message.  Nonetheless, 
it is important that everyone reads it fully.   **


BACKGROUND:


Early this morning, numerous processes were launched on LOP1, LOP2, and 
LOQ2, each of which was being shared by two or more users, and those 
processes maxed out the memory.  When this happens, the system tries to 
preserve itself by indiscriminately killing processes to free up memory.  
If this happens on a machine that only one person is using, the damage 
is limited to whatever that user was working on.  However, if it happens 
on a shared machine, the user responsible will not only lose their own 
work but will also cause someone else to lose theirs, as was the case 
last night on LOP1, LOP2, and LOQ2.


PREVENTION:


First, before you run a job on a machine, you should always check the 
status page.  The status page shows which machines are in use, the 
current memory usage on each machine (as top output), and how much 
total memory each machine has.


The status page is located at:  http://www.autonlab.org/status/ .  If 
you cannot access it, please contact me and I will reset your 
credentials.  You should bookmark this page and check it every 
time you run a job on the lab machines. 


I find that it is very easy to sign into a machine and work there for a 
day or two before launching a big job.  The machine status may change 
in the meantime, so it is good practice to check the status page 
immediately before you launch your process to make sure someone else 
hasn't already started something. 
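
As a rough illustration of that last-minute check, here is a minimal 
Python sketch that estimates how much memory is currently free on the 
machine you are logged into.  It assumes the lab machines run Linux and 
expose /proc/meminfo.  The function names and the 1 GB safety reserve 
are hypothetical choices for this example, not an existing lab tool, and 
the script complements the status page rather than replacing it.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: integer value}.

    Most fields are reported in kB; a few (such as HugePages_Total) are
    plain counts, which this sketch stores but never uses.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def available_kb(info):
    """Estimate the memory (in kB) that a new job could use."""
    # Newer kernels report MemAvailable directly. Older ones do not, so
    # fall back to free memory plus reclaimable buffers and page cache.
    if "MemAvailable" in info:
        return info["MemAvailable"]
    return info.get("MemFree", 0) + info.get("Buffers", 0) + info.get("Cached", 0)


def has_headroom(info, needed_gb, reserve_gb=1.0):
    """Return True if needed_gb plus a safety reserve fits in available memory."""
    needed_kb = (needed_gb + reserve_gb) * 1024 * 1024
    return available_kb(info) >= needed_kb


def machine_has_headroom(needed_gb, path="/proc/meminfo"):
    """Run the check against the machine this script is executed on."""
    with open(path) as f:
        return has_headroom(parse_meminfo(f.read()), needed_gb)
```

For example, machine_has_headroom(2.0) returns False if a 2 GB job plus 
the reserve would not fit in the memory that is currently free.  The 
result is only a snapshot, so another user's job may still start after 
you have checked.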


Second, if all machines are already in use, you should always prefer a 
machine with more memory.  It really is a very bad idea to share a 
machine that has only 4GB of RAM with another user unless you know for 
certain that your memory requirements are negligible.  You should also 
try to contact me to let me know you are planning to share a machine.  
I may be able to direct you to one that is better suited. 


Third, if you have any doubt about whether what you are about to do 
will exhaust the memory on a machine, please make sure nobody else is 
running something on that machine before you do so.  Any scripts you 
have that fork other processes should be written and tested carefully on 
a machine nobody else is using. 
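
One way to limit the damage from a runaway script, sketched here under 
the assumption that the machines run Linux and that your job is started 
from Python, is to cap your own process's address space before doing any 
work.  cap_address_space is a hypothetical helper name; the calls it 
makes are from the resource module in the Python standard library.

```python
import resource


def cap_address_space(max_gb):
    """Lower this process's soft address-space limit to at most max_gb.

    Returns the soft limit (in bytes) that was applied. The limit is
    never raised above the current soft or hard limit.
    """
    limit = int(max_gb * 1024 ** 3)
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    # RLIM_INFINITY means "no limit", so only clamp against finite values.
    if hard != resource.RLIM_INFINITY:
        limit = min(limit, hard)
    if soft != resource.RLIM_INFINITY:
        limit = min(limit, soft)
    resource.setrlimit(resource.RLIMIT_AS, (limit, hard))
    return limit
```

With a cap in place, allocations beyond it fail inside your own process 
(in Python, typically as a MemoryError) rather than pushing the whole 
machine out of memory.  The limit applies per process and is inherited 
by forked children, so a script that forks N workers can still use up 
to N times the cap in total.  It is a safety net, not a substitute for 
testing on an unshared machine.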


Fourth, a caveat when running jobs on LOP1:  LOP1, among other things, 
is used by many for access to CVS.  If the machine is too heavily taxed, 
or the memory is overflowed, it may impact CVS access for everyone.  
Only lightweight jobs should be run on LOP1.  If your job is 
long-running, or more demanding, please use any other machine first.


Finally, if you want to protect your mission-critical jobs from being 
preempted by another user, contact me to request a reservation for a 
machine.  If you reserve a machine, it is not possible for another user 
to start a job there, and your job will be safe.  Reservations are 
normally used for project, conference, or paper deadlines, timing 
experiments, and long-running jobs.  It is neither necessary nor useful 
to request a reservation for casual use of a machine.


I hope that this email has been useful and informative.  Again, the 
intention is to help prevent future loss of work for everyone.  Please 
contact me if you have any questions or concerns about the guidelines here.




-- 
Michael J. Baysek, Systems Analyst
Carnegie Mellon University - Auton Lab
www.cmu.edu - www.autonlab.org
412-268-8939

For full contact information, including IM handles, visit 
http://www.autonlab.org/auton_intranet/admin.html
