Misuse of resources
Tanmay Agarwal
tanmaya at andrew.cmu.edu
Wed Mar 11 17:15:00 EDT 2020
Hi Autonians,
Adding to Predrag's long list of misuses, I would also ask everyone
running TensorFlow jobs on GPUs to check the following.
1. Limit GPU usage to only the cards your job requires. *BY
DEFAULT*, TensorFlow reserves GPU memory on *ALL* *GPU cards*
visible on the machine, even cards that then sit idle most of the time. You
can restrict it by setting CUDA_VISIBLE_DEVICES
<https://stackoverflow.com/questions/37893755/tensorflow-set-cuda-visible-devices-within-jupyter>
in your environment or by calling
tf.config.experimental.set_visible_devices
<https://www.tensorflow.org/api_docs/python/tf/config/set_visible_devices>
within TensorFlow.
2. You may also want to *SET* the environment variable
*TF_FORCE_GPU_ALLOW_GROWTH* to true, or call tf.config.experimental.set_memory_growth
<https://www.tensorflow.org/api_docs/python/tf/config/experimental/set_memory_growth>.
Either one makes TensorFlow allocate GPU memory as it is needed instead of
reserving it all up front.
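For anyone who wants a starting point, here is a minimal sketch of both settings in Python. It assumes TensorFlow 2.1 or later; the helper name restrict_gpus and the choice of card 0 are only illustrative, so pick the card(s) you were actually given. The environment variables have to be set before TensorFlow is imported.

```python
import os


def restrict_gpus(device_ids, allow_growth=True):
    """Set GPU-related environment variables (hypothetical helper).

    Must run before TensorFlow is imported, otherwise it has no effect.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in device_ids)
    if allow_growth:
        os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return os.environ["CUDA_VISIBLE_DEVICES"]


# Assumption for illustration: this job only needs card 0.
restrict_gpus([0])

try:
    import tensorflow as tf
except ImportError:
    tf = None  # lets the sketch run on a machine without TensorFlow

if tf is not None:
    # In-process alternative to the environment variables above.
    # Both calls must happen before any GPU has been initialized.
    gpus = tf.config.list_physical_devices("GPU")
    tf.config.set_visible_devices(gpus[:1], "GPU")
    for gpu in gpus[:1]:
        tf.config.experimental.set_memory_growth(gpu, True)
```

Setting the environment variables in your shell or job script before launching Python works just as well and needs no code changes.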
I hope following the above practices will help us utilize our shared
resources more efficiently.
Reference: https://www.tensorflow.org/guide/gpu
Thanking you,
Warm Regards,
Tanmay Agarwal | MSR Graduate Student
Robotics Institute @ CMU
mailto: tanmaya at andrew.cmu.edu
On Wed, Mar 11, 2020 at 1:09 AM Predrag Punosevac <predragp at andrew.cmu.edu>
wrote:
> Dear Autonians,
>
> This is a short list of commonly observed misuses of our lab resources.
>
> 1. Using GPU nodes for CPU jobs. As I type this email, 6 out of 19 GPU
> nodes are running CPU jobs. This has to
> stop immediately!
>
> 2. Using a cache to avoid recomputing data or accessing a slow database
> can provide you with a great performance boost. Do not under any
> circumstance use your home directory for caching. Do not use the /tmp
> partition for caching. /tmp is part of the / slice, which is limited to
> 50-60GB and will quickly fill up, rendering the machine unusable for
> everyone.
>
> 3. Don't put the Jupyter sqlite database in your home directory. It is
> likely to become incoherent due to NFS properties. Please use
>
> /home/scratch/$username
>
> for the sqlite database, caching, and volatile data in particular.
>
> 4. Please make sure you release GPU cards once you are done running your
> Python scripts. Typically the easiest way for me to deal with those, as
> well as zombie processes, is a reboot. The chances of comp nodes
> experiencing a hardware problem grow exponentially with each reboot (dead
> RAM typically). Those are very time consuming to fix.
>
> 5. Please clean your scratch directories regularly. I can't emphasize
> enough how important this is.
>
> 6. If you have an Auton Lab issued desktop which is VPN connected to
> the computing nodes, please don't use shell gateways under any
> circumstances to connect to comp nodes. Your desktops are your private
> shell gateways and they are ssh reachable from anywhere in the world.
>
> 7. Do not store non-essential things in your home directories. An
> example would be putting your conda or R packages. Please put that in
> scratch. Write a small script which can recreate scratch directories for
> you.
>
> 8. Please don't transfer large amounts of data via shell gateways. Log
> into the comp nodes and use outgoing ssh connections to pull the data
> onto the server from outside the lab.
>
> Likely to be continued after a good night's sleep...
>
>
> Cheers,
> Predrag
>