[auton-users] Auton Lab Compute Resources: Choosing the Right Machine

Michael J. Baysek mjbaysek at cs.cmu.edu
Thu Feb 28 12:44:37 EST 2008


It was pointed out to me that the first sentence of the email can be 
interpreted (incorrectly!) to mean that someone was fired!  That is NOT 
the case!!!! 


By "job" I, of course, meant "CPU process", not employment!


Mike


Michael J. Baysek wrote, On 02/28/2008 11:21 AM:
> Hello Lab,
>
>
> This is a general announcement in response to an event last night which 
> resulted in people's jobs being terminated.  This message is intended 
> to help prevent this from happening in the future, so nobody else 
> loses work.
>
> ** I apologize for the length of this message in advance.  
> Nonetheless, it is important everyone reads it fully.   **
>
>
> BACKGROUND:
>
>
> Early this morning, numerous processes that maxed out the memory were 
> launched on machines being shared by two or more users (LOP1, LOP2, 
> and LOQ2).  When this happens, the system tries to preserve 
> itself by indiscriminately killing processes on the system to free up 
> memory.  If this happens on a machine that only one person is on, the 
> damage is localized to whatever it is that that user was working on.  
> However, if this happens on a machine someone is sharing with another 
> user, they will not only lose their own work, but they will cause 
> someone else to lose theirs - as was the case last night on LOP1, 
> LOP2, and LOQ2.
>
>
> PREVENTION:
>
>
> First, before you run a job on a machine, you should always check the 
> status page.  The status page shows which machines are in use, the 
> current memory usage on each machine (as top output), and how much 
> total memory each machine has.
>
>
> The status page is located at:  http://www.autonlab.org/status/ .  If 
> you cannot access it, please contact me and I will reset your 
> credentials.  You should bookmark this page and check it every time 
> you run a job on the lab machines.
>
> I find that it is very easy to sign into a machine and work there for 
> a day or two before launching a big job there.  The machine status may 
> change in the meantime, so it is good practice to check the machine 
> status directly before you launch your process to make sure someone 
> else hasn't already started something.
>
> Second, if all machines are already in use, you should always prefer a 
> machine with a higher amount of memory.  It really is a very bad idea 
> to share a machine with only 4GB of RAM with another user unless you 
> know for certain that your memory requirements are negligible.  You 
> should also try to contact me and let me know you are planning on 
> sharing a machine.  I may be able to direct you to a machine that is 
> better suited.
>
> Third, if there is any chance that what you are about to do will 
> overflow the memory on a machine, please make sure nobody else is 
> running something on the machine before you do so.  Any scripts 
> you have that fork other processes should be written and tested 
> carefully on a machine nobody else is running on.
>
> Fourth, a caveat when running jobs on LOP1:  LOP1, among other things, 
> is used by many for access to CVS.  If the machine is too heavily 
> taxed, or the memory is overflowed, it may impact CVS access for 
> everyone.  Only lightweight jobs should be run on LOP1.  If your job 
> is long-running, or more demanding, please use any other machine first.
>
>
> Finally, if you want to protect your mission-critical jobs from being 
> preempted by another user, contact me to request a reservation for a 
> machine.  If you reserve a machine, it is not possible for another 
> user to start another job there, and your job will be safe.  
> Reservations are normally used for project, conference or paper 
> deadlines, timing experiments, and long-running jobs.  It is neither 
> necessary nor useful to request a reservation for casual use of a 
> machine.
>
>
> I hope that this email has been useful and informative.  Again, the 
> intention is to help prevent future loss of work for everyone.  Please 
> contact me if you have any questions or concerns about the guidelines 
> here.
>
>
>
>
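
As a complement to checking the status page directly before launching a
job, here is a minimal sketch of a local memory check.  It is not part of
the lab's tooling.  It assumes the machines run Linux and expose
/proc/meminfo, and the function names, the 25% headroom default, and the
job size estimate are all hypothetical choices made for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Name: value" shape
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def available_kb(info):
    """Estimate the memory a new job could use, in kB.

    Uses MemAvailable when the kernel reports it; otherwise falls back
    to MemFree + Buffers + Cached as a rough approximation.
    """
    if "MemAvailable" in info:
        return info["MemAvailable"]
    return info.get("MemFree", 0) + info.get("Buffers", 0) + info.get("Cached", 0)


def job_fits(info, job_kb, headroom=0.25):
    """Return True if a job needing job_kb fits while keeping a fraction
    `headroom` of total RAM in reserve (an assumed safety margin)."""
    reserve = int(info.get("MemTotal", 0) * headroom)
    return job_kb <= available_kb(info) - reserve
```

To use it on a machine, read the file with open("/proc/meminfo").read(),
pass the text to parse_meminfo, and call job_fits with your own estimate
of the job's memory need in kB.  A passing check only describes the
moment it ran, so it does not replace looking at the status page or
asking for a reservation.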
