gpu20 and gpu21 run into the ground

Tue Sep 8 19:47:17 EDT 2020

I owe an apology to all of you. Lin from operations went to our
servers and found them unplugged from power. That has never happened
to me in 7 years at the Auton Lab. There is no point powering them up
tonight. I will drive to CMU tomorrow and move the servers to our rack.

Predrag

On Tue, Sep 8, 2020, 6:34 PM Predrag Punosevac <predragp at andrew.cmu.edu>
wrote:

> Dear Autonians,
>
> gpu20 and gpu21 just crashed. They are not even connected to IPMI,
> so they can't be restarted remotely. Connecting them was what I was planning
> to do on Friday. I am trying to reach people currently in the machine room
> who can restart them for us.
>
> We have to stop this cycle of overloading machines to the point that they die
> and become useless to everyone. I hope the people who do this realize that they
> are just wasting their own time, as the machines are stateless and the work is
> not recoverable. Benedikt Boecking was kind enough to spend some time creating
> this little write-up, which will be included in our Wiki. In the coming
> weeks I will look into technical solutions which will hopefully
>
> Automatic multiprocessing issue: some libraries automatically use all
> cores for some of their underlying routines. In particular, this happens
> with Python (numpy, scipy, …). Automatic multiprocessing on the servers can
> slow your code down and also impact your colleagues.
> a. Spawning as many threads as the server has cores can be more expensive
> than just executing the routine on one thread.
> b. Running your process on all cores will impact the ability of your
> colleagues to use the server.
> c. If you already do your own multiprocessing, these effects can multiply
> and you might unwittingly try to parallelize across thousands of threads.
>
> Please monitor your resource usage via top/htop and make sure you are not
> flooding the server with too many parallel jobs. To fix the issue you can:
> 1. Set environment variables
> $ export MKL_NUM_THREADS=1
> $ export NUMEXPR_NUM_THREADS=1
> $ export OMP_NUM_THREADS=1
> $ export OPENBLAS_NUM_THREADS=1
> $ export VECLIB_MAXIMUM_THREADS=1
>
> 2. Set these variables in Python before importing any other libraries
> import os
> os.environ["OMP_NUM_THREADS"] = "1"
> os.environ["OPENBLAS_NUM_THREADS"] = "1"
> os.environ["MKL_NUM_THREADS"] = "1"
> os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
> os.environ["NUMEXPR_NUM_THREADS"] = "1"
>
>
> Best,
> Predrag
>
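
To make point (c) of the quoted write-up concrete, here is a minimal sketch of how the thread count multiplies when you combine your own multiprocessing with library threading, and one way to keep it bounded. The helper names, the 64-core figure, and the `share=0.5` default are assumptions chosen for illustration; they are not lab policy and not part of Benedikt's write-up. Only the five environment variables come from the message above.

```python
import os

# The five variables listed in the write-up above.
THREAD_ENV_VARS = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)


def limit_library_threads(threads=1):
    """Set the thread-count variables; call this before importing numpy/scipy."""
    for name in THREAD_ENV_VARS:
        os.environ[name] = str(threads)


def total_threads(workers, threads_per_worker):
    """Threads in flight when every worker process runs its own thread pool."""
    return workers * threads_per_worker


def capped_workers(requested, threads_per_worker=1, cores=None, share=0.5):
    """Hypothetical helper: cap the number of worker processes so that
    workers * threads_per_worker stays within a share of the cores.

    The share is an illustrative assumption, not a lab rule.
    """
    cores = cores or os.cpu_count() or 1
    budget = max(1, int(cores * share))
    return max(1, min(requested, budget // max(1, threads_per_worker)))
```

For example, on a hypothetical 64-core server, 64 worker processes that each let the numerical library use all 64 cores give `total_threads(64, 64) == 4096`, which is the "thousands of threads" case. After `limit_library_threads(1)`, the same job runs one thread per worker, and `capped_workers(64, 1, cores=64)` would further reduce it to 32 workers so that half the machine stays free for colleagues.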