Memory taken
Dougal Sutherland
dougal at gmail.com
Wed Feb 15 10:28:55 EST 2017
I think the problem is just that tensorflow by default claims all GPU
memory on the machine. There shouldn't be any need to reboot; the memory
should be freed when the process holding it ends or is killed.
I sent an email about this behavior a while ago, which it's maybe worth
re-sending:
Something that's not necessarily obvious to everyone about tensorflow: if
you just run something with tensorflow, it will by default allocate all of
the memory on *all* GPUs on the machine. It's pretty unlikely that whatever
model you're running is going to need all 48 GB in all 4 cards on gpu{2,3}.
:)
To stop this behavior, set the environment variable CUDA_VISIBLE_DEVICES to
only show tensorflow the relevant devices. For example,
"CUDA_VISIBLE_DEVICES=0 python" will then have that tensorflow session use
only gpu0. You can check what devices are free with nvidia-smi. Theano will
pick a single gpu to use by default; to choose a specific one, you can
either use CUDA_VISIBLE_DEVICES in the same way or use
THEANO_FLAGS=device=gpu0.
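A minimal sketch of doing the same thing from inside a Python script
(assuming the variable gets set before tensorflow is imported, which is the
safe place for it):

    import os

    # Make only gpu0 visible to this process; the setting is picked up when
    # the CUDA runtime is initialized, so set it before the import.
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"

    import tensorflow as tf  # this process will now see (and grab) only gpu0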
If you're running small models and want to run more than one on a single
gpu, you can tell tensorflow to avoid allocating all of a GPU's memory with
the methods discussed here
<http://stackoverflow.com/q/34199233/344821>. Setting
per_process_gpu_memory_fraction lets it allocate a certain portion of the
GPU's memory; setting allow_growth=True makes it only claim memory as it
needs it.
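For reference, a rough sketch of what that looks like with the options from
that thread (TF 1.x-style API; adjust the fraction to whatever your model
actually needs):

    import tensorflow as tf

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True                      # claim memory only as needed
    # config.gpu_options.per_process_gpu_memory_fraction = 0.3  # or: cap at ~30% of one GPU

    sess = tf.Session(config=config)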
Theano's default behavior is similar to allow_growth=True; you can make it
preallocate memory (and often get substantial speedups) with e.g.
THEANO_FLAGS=device=gpu0,lib.cnmem=1. (lib.cnmem=.5 will allocate half the
GPU's memory; lib.cnmem=1024 will allocate 1GB.)
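If you'd rather set the Theano flags from inside a script than on the
command line, something like this should work (a sketch assuming an older
Theano where lib.cnmem is still the relevant flag, as above; it has to be
set before theano is imported):

    import os

    # Theano reads THEANO_FLAGS at import time, so set it first.
    os.environ["THEANO_FLAGS"] = "device=gpu0,lib.cnmem=1"

    import theano  # older Theano prints the device and CNMeM status here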
On Wed, Feb 15, 2017 at 5:12 AM Predrag Punosevac <predragp at cs.cmu.edu>
wrote:
> Barnabas Poczos <bapoczos at cs.cmu.edu> wrote:
>
> > Hi All,
> >
> > If there is no other solution, my recommendation is to reboot those
> > GPU nodes which got affected by tensorflow's memory taking "feature".
> >
> > Best,
> > Barnabas
>
> The only problem is that I have to figure out which memory is taken by
> the currently running TensorFlow jobs and which is taken by the buggy
> software.
>
> Predrag
>
>
> >
> > ======================
> > Barnabas Poczos, PhD
> > Assistant Professor
> > Machine Learning Department
> > Carnegie Mellon University
> >
> >
> > On Tue, Feb 14, 2017 at 10:23 PM, Predrag Punosevac <predragp at cs.cmu.edu> wrote:
> > > Kaylan Burleigh <kburleigh at lbl.gov> wrote:
> > >
> > >> Hi Predrag,
> > >>
> > >> Yes, I do know how to use unix. All the machines I'm used to run
> > >> slurm, and since users are not root, the bashrcs are renamed to
> > >> something else and the users edit those.
> > >>
> > >
> > > SLURM is a queueing system used to manage and control jobs on
> > > clusters, including GPU clusters. In the Auton Lab we don't operate a
> > > single cluster at this point because we typically buy equipment from
> > > smaller general-purpose grants. If we score a 300K grant for equipment
> > > this year we will buy a cluster and run SLURM as well. SLURM is used
> > > on most CMU clusters.
> > >
> > >
> > >
> > >
> > >> Anyway, gpu1-3 each have a few GPUs that aren't being used but 99%
> > >> of the memory is taken. See attached. Can we fix that?
> > >>
> > >
> > > Please see this thread
> > >
> > > https://github.com/tensorflow/tensorflow/issues/1578
> > >
> > > In short, it is a well-known TensorFlow "feature". The only way for
> > > me to "clear" the memory is to reboot the node.
> > >
> > >
> > > Best,
> > > Predrag
> > >
> > >
> > >> Thanks,
> > >> Kaylan
> > >>
>