<div dir="auto">I owe all of you an apology.<div dir="auto">Lin from operations went to our servers and found them unplugged from power. That has never happened to me in 7 years at the Auton Lab. There is no point powering them up tonight. I will drive to CMU tomorrow and move the servers to our rack.</div><div dir="auto"><br></div><div dir="auto">Predrag</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Sep 8, 2020, 6:34 PM Predrag Punosevac <<a href="mailto:predragp@andrew.cmu.edu">predragp@andrew.cmu.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div dir="ltr">Dear Autonians,<div><br></div><div>gpu20 and gpu21 just crashed. They are not even connected to the IPMI, so they can't be restarted remotely. That is what I was planning to do on Friday. I am trying to reach people currently in the machine room who can restart them for us. </div><div><br></div><div>We have to stop this cycle of overloading machines to the point where they die and become useless to everyone. I hope the people who do this realize that they are only wasting their own time, as the machines are stateless and the work is not recoverable. Benedikt Boecking was kind enough to spend some time creating this short write-up, which will be included in our Wiki. In the coming weeks I will look into technical solutions which will hopefully </div><div><br></div><div><div>Automatic multiprocessing issue: some libraries automatically use all cores for some of their underlying routines. In particular, this happens with Python (numpy, scipy, …). Automatic multiprocessing on the servers can slow your code down and also impact your colleagues. </div><div><span style="white-space:pre-wrap"> </span>a. Spawning as many threads as the server has cores can be more expensive than just executing the routine on one thread. </div><div><span style="white-space:pre-wrap"> </span>b. 
Running your process on all cores will impact the ability of your colleagues to use the server. </div><div><span style="white-space:pre-wrap"> </span>c. If you already do your own multiprocessing, these effects can multiply and you might unwittingly try to parallelize across thousands of threads.</div><div><br></div><div>Please monitor your resource usage via top/htop and make sure you are not flooding the server with too many parallel jobs. To fix the issue you can:</div><div>1. Set environment variables in your shell</div><div>$ export MKL_NUM_THREADS=1<br>$ export NUMEXPR_NUM_THREADS=1<br>$ export OMP_NUM_THREADS=1</div><div>$ export OPENBLAS_NUM_THREADS=1</div><div>$ export VECLIB_MAXIMUM_THREADS=1</div><div><br></div><div>2. Set these variables in Python before importing any other libraries</div><div>import os<div>os.environ["OMP_NUM_THREADS"] = "1" </div><div>os.environ["OPENBLAS_NUM_THREADS"] = "1" </div><div>os.environ["MKL_NUM_THREADS"] = "1" </div><div>os.environ["VECLIB_MAXIMUM_THREADS"] = "1"</div><div>os.environ["NUMEXPR_NUM_THREADS"] = "1" </div></div></div><div><br></div><div><br></div><div>Best,</div><div>Predrag</div></div></div>
</blockquote></div>
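<div dir="auto"><br></div><div dir="auto">A minimal sketch of the Python option from the quoted write-up, assuming a job that also runs its own pool of worker processes. The names THREAD_VARS, limit_library_threads and worst_case_threads are made up for this illustration and are not part of any library. It sets the five variables from the message and then prints how quickly the thread count multiplies when one worker per core is combined with one library thread per core.</div>

```python
import os

# The five variables named in the quoted message. Each one is typically read
# by a different numerical backend when that backend is first loaded.
THREAD_VARS = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)


def limit_library_threads(n=1):
    """Cap implicit threading; call before importing numpy, scipy, etc."""
    for name in THREAD_VARS:
        os.environ[name] = str(n)


def worst_case_threads(workers, threads_per_worker):
    """Threads in flight if every worker process uses its full thread budget."""
    return workers * threads_per_worker


limit_library_threads(1)
# import numpy as np  # heavy imports belong after the call above

cores = os.cpu_count() or 1
# Without a cap: one worker per core, each spawning one library thread per core.
uncapped = worst_case_threads(cores, cores)
# With the cap: the same pool stays at one thread per worker.
capped = worst_case_threads(cores, int(os.environ["OMP_NUM_THREADS"]))
print(f"{cores} cores: uncapped worst case {uncapped}, capped {capped}")
```

<div dir="auto">The cap only helps if it is set before the numerical libraries are imported, which is why the call sits at the top of the script. If you run your own pool, pick the number of workers deliberately instead of defaulting to every core on a shared server.</div>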