gpu20 and gpu21 run into the ground

Benedikt Boecking boecking at andrew.cmu.edu
Tue Sep 8 18:50:51 EDT 2020


In addition to always using htop or top to monitor whether you are running as many parallel jobs as intended, please please please also keep an eye on your memory usage. Check right after you start your job and a few times in between to make sure everything is running as you intended.

Just FYI: when using htop, you can press u and search for your username. If you press t, you get a tree view of your processes, which can be helpful.
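
For example, a few standard commands that help with this kind of monitoring (a sketch, assuming a typical Linux setup on our servers):
$ htop -u $USER    # show only your own processes in htop
$ top -u $USER     # the same filtering with plain top
$ free -h          # overall memory usage on the machine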



> On Sep 8, 2020, at 5:34 PM, Predrag Punosevac <predragp at andrew.cmu.edu> wrote:
> 
> Dear Autonians,
> 
> gpu20 and gpu21 just crashed. They are not even connected to the IPMI, so they cannot be restarted remotely; connecting them is what I was planning to do on Friday. I am trying to reach people currently in the machine room who can restart them for us.
> 
> We have to stop this cycle of overloading machines to the point where they die and become useless to everyone. I hope the people who do this realize that they are only wasting their own time, since the machines are stateless and the work is not recoverable. Benedikt Boecking was kind enough to spend some time creating the little write-up below, which will be included in our Wiki. In the coming weeks I will look into technical solutions which will hopefully prevent this from happening again.
> 
> Automatic multiprocessing issue: some libraries automatically use all cores for some of their underlying routines. In particular, this happens with Python (numpy, scipy, …). Automatic multiprocessing on the servers can slow your code down and also impact your colleagues (a quick way to check your own thread counts is shown after the list below).
> 	a. Spawning as many threads as the server has cores can be more expensive than just executing the routine on one thread.
> 	b. Running your process on all cores will impact the ability of your colleagues to use the server.
> 	c. If you already do your own multiprocessing, these effects can multiply and you might unwittingly try to parallelize across thousands of threads.
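> 
> As a quick check (a sketch using standard Linux ps output fields; NLWP is the number of threads each process has spawned), you can list your own processes with their thread counts:
> $ ps -u $USER -o pid,nlwp,%cpu,%mem,args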
> 
> Please monitor your resource usage via top/htop and make sure you are not flooding the server with too many parallel jobs. To fix the issue, you can do one of the following (a per-job example is shown after option 2):
> 1. Set the following environment variables in your shell before starting your job:
> $ export MKL_NUM_THREADS=1
> $ export NUMEXPR_NUM_THREADS=1
> $ export OMP_NUM_THREADS=1
> $ export OPENBLAS_NUM_THREADS=1
> $ export VECLIB_MAXIMUM_THREADS=1
> 
> 2. Alternatively, set these variables in Python before importing any other libraries:
> import os
> os.environ["OMP_NUM_THREADS"] = "1" 
> os.environ["OPENBLAS_NUM_THREADS"] = "1" 
> os.environ["MKL_NUM_THREADS"] = "1" 
> os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
> os.environ["NUMEXPR_NUM_THREADS"] = "1" 
> 
> 
> Best,
> Predrag



