Memory taken
Predrag Punosevac
predragp at cs.cmu.edu
Wed Feb 15 00:11:11 EST 2017
Barnabas Poczos <bapoczos at cs.cmu.edu> wrote:
> Hi All,
>
> If there is no other solution, my recommendation is to reboot those
> GPU nodes which got affected by tensorflow's memory taking "feature".
>
> Best,
> Barnabas
The only problem is that I have to figure out which memory is held by
currently running TensorFlow jobs and which is held by the buggy
software.
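A sketch of how one might tell the two apart (my assumption, not a tested admin tool): ask nvidia-smi which PIDs currently hold GPU memory, then compare that list against the PIDs of known live jobs; memory shown by nvidia-smi but not attached to any live process would be the leaked kind that only a reboot clears.

```python
import subprocess

def parse_compute_apps(csv_text):
    """Parse `nvidia-smi --query-compute-apps=... --format=csv` output
    into a list of (pid, process_name, used_mib) tuples."""
    apps = []
    for line in csv_text.strip().splitlines()[1:]:  # skip the CSV header row
        pid, name, mem = [f.strip() for f in line.split(",")]
        # used_memory is reported like "10993 MiB"; keep just the number
        apps.append((int(pid), name, int(mem.split()[0])))
    return apps

def gpu_memory_holders():
    """Return the processes currently holding GPU memory on this node."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv"]).decode()
    return parse_compute_apps(out)
```

Running `gpu_memory_holders()` on a GPU node and cross-checking the PIDs with `ps` would show which allocations still belong to a live TensorFlow process.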
Predrag
>
> ======================
> Barnabas Poczos, PhD
> Assistant Professor
> Machine Learning Department
> Carnegie Mellon University
>
>
> On Tue, Feb 14, 2017 at 10:23 PM, Predrag Punosevac <predragp at cs.cmu.edu> wrote:
> > Kaylan Burleigh <kburleigh at lbl.gov> wrote:
> >
> >> Hi Predrag,
> >>
> >> Yes, I do know how to use Unix. All the machines I'm used to run SLURM,
> >> and since users are not root, the bashrc files are renamed to something
> >> else and the users edit those instead.
> >>
> >
> > SLURM is a queueing system used to manage and control jobs on clusters,
> > including GPU clusters. At this point the Auton Lab does not operate a
> > single cluster, because we typically buy equipment from smaller
> > general-purpose grants. If we score a $300K equipment grant this year we
> > will buy a cluster and run SLURM as well. SLURM is used on most CMU
> > clusters.
> >
> >
> >
> >
> >> Anyway, gpu1-3 each have a few GPUs that aren't being used, but 99% of
> >> the memory is taken. See attached. Can we fix that?
> >>
> >
> > Please see this thread
> >
> > https://github.com/tensorflow/tensorflow/issues/1578
> >
> > In short, it is a well-known TensorFlow "feature". The only way for me
> > to "clear" the memory is to reboot the node.
> >
> >
> > Best,
> > Predrag
> >
> >
> >> Thanks,
> >> Kaylan
> >>
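For reference, the workaround discussed in the linked TensorFlow issue is to stop each job from pre-allocating nearly the whole GPU at startup. A minimal sketch using the TensorFlow 1.x session API of that era (note this only constrains new sessions; it does not free memory already held by dead processes):

```python
import tensorflow as tf

config = tf.ConfigProto()
# Grow GPU allocations on demand instead of grabbing ~all memory up front
config.gpu_options.allow_growth = True
# Alternatively, cap the fraction of GPU memory one session may claim:
# config.gpu_options.per_process_gpu_memory_fraction = 0.4

sess = tf.Session(config=config)
```

If every user passed such a config, the "99% of memory taken by idle GPUs" symptom would be much less common, though hung processes would still need to be killed or the node rebooted.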
More information about the Autonlab-users mailing list