gpu20 and gpu21 run into the ground
Predrag Punosevac
predragp at andrew.cmu.edu
Tue Sep 8 18:34:40 EDT 2020
Dear Autonians,
gpu20 and gpu21 just crashed. They are not connected to the IPMI, so they
can't be restarted remotely; connecting them is what I was planning to do on
Friday. I am trying to reach people currently in the machine room who can
restart them for us.
We have to stop this cycle of overloading machines to the point where they
die and become useless to everyone. I hope the people who do this realize
that they are just wasting their own time, as the machines are stateless and
the work is not recoverable. Benedikt Boecking was kind enough to spend some
time creating this little write-up, which will be included in our Wiki. In
the coming weeks I will look into technical solutions which will hopefully
Automatic multiprocessing issue: some libraries automatically use all
cores for some of their underlying routines. In particular, this happens
with Python (numpy, scipy, …). Automatic multiprocessing on the servers can
slow your code down and also impact your colleagues.
a. Spawning as many threads as the server has cores can be more expensive
than just executing the routine on one thread.
b. Running your process on all cores will impact the ability of your
colleagues to use the server.
c. If you already do your own multi-processing, these effects can multiply
and you might unwittingly try to parallelize across thousands of threads.
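To make the last point concrete, here is a rough sketch of the arithmetic. The helper name total_threads and the one-worker-per-core job are made up for illustration; the actual defaults depend on the library and how it was built.

```python
import os


def total_threads(worker_processes, threads_per_worker):
    """Worst-case thread count when every worker process also lets its
    numeric library start its own pool of threads."""
    return worker_processes * threads_per_worker


cores = os.cpu_count() or 1  # cores visible on this machine

# Hypothetical job: one worker process per core, and inside each worker
# the numeric library also defaults to one thread per core.
uncapped = total_threads(cores, cores)  # grows as cores squared
capped = total_threads(cores, 1)        # library threads limited to 1

print(f"{cores} cores: {uncapped} threads uncapped, {capped} capped")
```

On a 64-core server this works out to 4096 threads without a cap versus 64 with the library limited to one thread per process.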
Please monitor your resource usage via top/htop and make sure you are not
flooding the server with too many parallel jobs. To fix the issue you can:
1. Set environment variables
$ export MKL_NUM_THREADS=1
$ export NUMEXPR_NUM_THREADS=1
$ export OMP_NUM_THREADS=1
$ export OPENBLAS_NUM_THREADS=1
$ export VECLIB_MAXIMUM_THREADS=1
2. Set these variables in Python before importing any other libraries
import os
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
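If you would rather not repeat the five assignments in every script, the same idea can be wrapped in a small helper. This is only a sketch: THREAD_VARS and limit_library_threads are names invented here, not part of any library, and the helper still has to run before numpy, scipy, etc. are imported.

```python
import os

# The same five variables as listed above.
THREAD_VARS = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)


def limit_library_threads(n=1):
    """Set every thread-count variable to n and return what was set.

    Hypothetical helper: call it before importing any numeric library,
    since these variables are typically read when the library loads.
    """
    for name in THREAD_VARS:
        os.environ[name] = str(n)
    return {name: os.environ[name] for name in THREAD_VARS}


limits = limit_library_threads(1)

# import numpy as np  # import numeric libraries only after this point
```

Passing a larger n (for example 4) is a reasonable middle ground when you run a single process and the server is otherwise idle.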
Best,
Predrag