Computing node resources

Benedikt Boecking boecking at andrew.cmu.edu
Wed May 17 13:40:27 EDT 2017


All,

Right now there are over 80 threads running on lov4, a machine that only has 64 cores, so jobs are competing for CPU time and everyone's experiments slow down. The same happened on lov3 earlier today. I know that deadlines are approaching, but please try to follow some reasonable person principles. Here is a non-exhaustive list of things you should do before running experiments on our servers:

1. Before starting a new job, check the amount of available memory and how many other jobs are currently running. The easiest way to do this is to use htop. 
2. If a computing node is at its limit, check if any other nodes are underutilized (http://monit.autonlab.org:8080/status/hosts/).
3. “nice” your jobs if they require a lot of resources and will be running for a long time (https://en.wikipedia.org/wiki/Nice_(Unix)).
4. Use a reasonable number of threads and limit excessive memory usage.
5. Close your Jupyter notebooks, MATLAB sessions, etc. that you don’t need anymore.
6. Move files from scratch to your home directory on zfsauton if you don’t need them anymore for your current experiments.
7. If you are using GPUs, use nvidia-smi to check utilization and make sure your code does not automatically allocate all GPUs and all GPU memory to your experiment.
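
As a rough illustration of items 1, 3, 4 and 7, here is a minimal Python sketch of a pre-flight check you could run at the top of an experiment script. The function names, the reserve of 2 cores, the niceness of 10 and the default GPU id "0" are all assumptions made up for this example, not lab policy. The load average is only a rough proxy for busy cores (on Linux it also counts tasks waiting on I/O), so treat the result as a hint and still look at htop. Also note that CUDA_VISIBLE_DEVICES only limits which GPUs your process can see; capping memory per GPU depends on the framework you use.

```python
import os


def free_cores(load_1min, n_cores):
    """Estimate idle cores from the 1-minute load average."""
    return max(0, n_cores - round(load_1min))


def pick_thread_count(requested, load_1min, n_cores, reserve=2):
    """Cap the request at the idle cores minus a reserve, but never below 1."""
    available = free_cores(load_1min, n_cores) - reserve
    return max(1, min(requested, available))


def suggest_threads(requested):
    """Read the current load and core count from the OS (Unix only)."""
    load_1min, _, _ = os.getloadavg()
    return pick_thread_count(requested, load_1min, os.cpu_count() or 1)


def prepare_environment(n_threads, gpu_ids="0", niceness=10):
    """Apply thread, GPU, and priority limits to the current process."""
    # OpenMP, MKL, and OpenBLAS read these variables when they are loaded,
    # so set them before importing numpy or similar libraries.
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        os.environ[var] = str(n_threads)
    # Expose only the listed GPUs to CUDA-based frameworks.
    os.environ["CUDA_VISIBLE_DEVICES"] = gpu_ids
    # Lower the scheduling priority of this process (Unix only).
    os.nice(niceness)
```

With the situation described above (a load of about 80 on a 64-core machine), pick_thread_count(16, 80.0, 64) falls back to a single thread, which is a strong hint to pick another node instead. In a script you would call prepare_environment(suggest_threads(16)) before importing your numeric libraries.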

Please respond to this email if you have any additional recommendations for your fellow lab members. 

Best,
Ben


