<div dir="ltr">I think the problem is just that tensorflow by default claims all GPU memory on the machine. Shouldn't be any need to reboot, the memory should be freed when the process doing it ends / is killed.<div><br></div><div>I sent an email about this behavior a while ago, which it's maybe worth re-sending:<br><div><br><br><br>Something that's not necessarily obvious to everyone about tensorflow: if you just run something with tensorflow, it will by default allocate all of the memory on <b>all</b> GPUs on the machine. It's pretty unlikely that whatever model you're running is going to need all 48 GB in all 4 cards on gpu{2,3}. :)<br><br>To stop this behavior, set the environment variable CUDA_VISIBLE_DEVICES to only show tensorflow the relevant devices. For example, "CUDA_VISIBLE_DEVICES=0 python" will then have that tensorflow session use only gpu0. You can check what devices are free with nvidia-smi. Theano will pick a single gpu to use by default; to choose a specific one, you can either use CUDA_VISIBLE_DEVICES in the same way or use THEANO_FLAGS=device=gpu0.<br><br>If you're running small models and want to run more than one on a single gpu, you can tell tensorflow to avoid allocating all of a GPU's memory with the methods discus<span style="color:rgb(33,33,33)">sed</span><span class="inbox-inbox-Apple-converted-space" style="color:rgb(33,33,33)"> </span><a href="http://stackoverflow.com/q/34199233/344821" class="gmail_msg" target="_blank">here</a><span style="color:rgb(33,33,33)">. S</span>etting per_process_gpu_memory_fraction lets it allocate a certain portion of the GPU's memory; setting allow_growth=True makes it only claim memory as it needs it.<br><br>Theano's default behavior is similar to allow_growth=True; you can make it preallocate memory (and often get substantial speedups) with e.g. THEANO_FLAGS=device=gpu0,lib.cnmem=1. (lib.cnmem=.5 will allocate half the GPU's memory; lib.cnmem=1024 will allocate 1GB.)</div></div></div><br><div class="gmail_quote"><div dir="ltr">On Wed, Feb 15, 2017 at 5:12 AM Predrag Punosevac <<a href="mailto:predragp@cs.cmu.edu">predragp@cs.cmu.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Barnabas Poczos <<a href="mailto:bapoczos@cs.cmu.edu" class="gmail_msg" target="_blank">bapoczos@cs.cmu.edu</a>> wrote:<br class="gmail_msg">
<br class="gmail_msg">
> Hi All,<br class="gmail_msg">
><br class="gmail_msg">
> If there is no other solution, my recommendation is to reboot those<br class="gmail_msg">
> GPU nodes which got affected by tensorflow's memory taking "feature".<br class="gmail_msg">
><br class="gmail_msg">
> Best,<br class="gmail_msg">
> Barnabas<br class="gmail_msg">
<br class="gmail_msg">
The only problem is that I have to figure out how to find out which<br class="gmail_msg">
memory is taken due to the currently running TensorFlow and which one is<br class="gmail_msg">
taken due to the buggy software.<br class="gmail_msg">
<br class="gmail_msg">
Predrag<br class="gmail_msg">
<br class="gmail_msg">
<br class="gmail_msg">
><br class="gmail_msg">
><br class="gmail_msg">
><br class="gmail_msg">
><br class="gmail_msg">
><br class="gmail_msg">
><br class="gmail_msg">
> ======================<br class="gmail_msg">
> Barnabas Poczos, PhD<br class="gmail_msg">
> Assistant Professor<br class="gmail_msg">
> Machine Learning Department<br class="gmail_msg">
> Carnegie Mellon University<br class="gmail_msg">
><br class="gmail_msg">
><br class="gmail_msg">
> On Tue, Feb 14, 2017 at 10:23 PM, Predrag Punosevac <<a href="mailto:predragp@cs.cmu.edu" class="gmail_msg" target="_blank">predragp@cs.cmu.edu</a>> wrote:<br class="gmail_msg">
> > Kaylan Burleigh <<a href="mailto:kburleigh@lbl.gov" class="gmail_msg" target="_blank">kburleigh@lbl.gov</a>> wrote:<br class="gmail_msg">
> ><br class="gmail_msg">
> >> Hi Predrag,<br class="gmail_msg">
> >><br class="gmail_msg">
> >> Yes, I do know how to use unix. All the machines I'm used to run slurm and<br class="gmail_msg">
> >> users are not root so the bashrc's are renamed to something else and the<br class="gmail_msg">
> >> users edit those.<br class="gmail_msg">
> >><br class="gmail_msg">
> ><br class="gmail_msg">
> > SLURM is queueing system used to manage and control jobs on clusters<br class="gmail_msg">
> > including GPU clusters. In Auton Lab at this point we don't operate a<br class="gmail_msg">
> > single cluster due to the fact that we typically buy equipment from<br class="gmail_msg">
> > smaller general purpose grants. If we score a 300K grant for the<br class="gmail_msg">
> > equipment this year we will buy a cluster and we will run SLURM as well.<br class="gmail_msg">
> > SLURM is used on most CMU clusters.<br class="gmail_msg">
> ><br class="gmail_msg">
> ><br class="gmail_msg">
> ><br class="gmail_msg">
> ><br class="gmail_msg">
> >> Anyway, gpu1-3 each have a few gpu's that aren't being used but 99% of the<br class="gmail_msg">
> >> memory is taken. See attached. Can we fix that?<br class="gmail_msg">
> >><br class="gmail_msg">
> ><br class="gmail_msg">
> > Please see this thread<br class="gmail_msg">
> ><br class="gmail_msg">
> > <a href="https://github.com/tensorflow/tensorflow/issues/1578" rel="noreferrer" class="gmail_msg" target="_blank">https://github.com/tensorflow/tensorflow/issues/1578</a><br class="gmail_msg">
> ><br class="gmail_msg">
> > In short it is well known TensorFlow "feature". The only way for me to<br class="gmail_msg">
> > "clear" memory is to reboot the node.<br class="gmail_msg">
> ><br class="gmail_msg">
> ><br class="gmail_msg">
> > Best,<br class="gmail_msg">
> > Predrag<br class="gmail_msg">
> ><br class="gmail_msg">
> ><br class="gmail_msg">
> >> Thanks,<br class="gmail_msg">
> >> Kaylan<br class="gmail_msg">
> >><br class="gmail_msg">
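As promised above, here are the sketches. First the tensorflow side: a
minimal example assuming the 1.x session-style API; the script name in the
comment and the tiny computation at the end are just placeholders.

    import os

    # Restrict which GPUs tensorflow can see. This has to happen before
    # tensorflow initializes CUDA, so set it at the very top of the script
    # (the shell form "CUDA_VISIBLE_DEVICES=0 python my_script.py" is
    # equivalent; my_script.py is just an example name).
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # only gpu0 is visible to this process

    import tensorflow as tf

    config = tf.ConfigProto()
    # Claim GPU memory only as it's actually needed...
    config.gpu_options.allow_growth = True
    # ...or, alternatively, cap this process at a fixed share of the GPU's memory
    # (uncomment to use instead of allow_growth):
    # config.gpu_options.per_process_gpu_memory_fraction = 0.25

    with tf.Session(config=config) as sess:
        total = tf.reduce_sum(tf.constant([1.0, 2.0, 3.0]))
        print(sess.run(total))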
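And the Theano side, a sketch assuming the older CUDA backend where the
lib.cnmem flag applies (as in the flags above); the little function at the
end is only there to confirm the flags were picked up.

    import os

    # Theano reads THEANO_FLAGS when it is first imported, so set it beforehand.
    # device=gpu0 pins this process to gpu0; lib.cnmem=1 makes the CNMeM
    # allocator preallocate (nearly) all of that GPU's memory up front.
    os.environ["THEANO_FLAGS"] = "device=gpu0,lib.cnmem=1"

    import theano
    import theano.tensor as T

    x = T.vector("x")
    f = theano.function([x], (x ** 2).sum())
    print(f([1.0, 2.0, 3.0]))   # ~14.0, computed on gpu0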