Memory taken

Barnabas Poczos bapoczos at cs.cmu.edu
Tue Feb 14 22:41:36 EST 2017


Hi All,

If there is no other solution, my recommendation is to reboot the GPU
nodes that were affected by TensorFlow's memory-grabbing "feature".
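
For future jobs, TensorFlow 1.x sessions can also be told to allocate GPU
memory on demand instead of reserving nearly all of it up front. A minimal
sketch (untested here; it assumes a Python TensorFlow 1.x script, and the
device index and memory fraction are just example values):

    import os
    import tensorflow as tf

    # Optionally restrict the process to specific GPUs before TensorFlow starts.
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"

    config = tf.ConfigProto()
    # Allocate GPU memory on demand instead of grabbing it all at startup.
    config.gpu_options.allow_growth = True
    # Or cap the per-process share of each GPU's memory (example value).
    # config.gpu_options.per_process_gpu_memory_fraction = 0.3

    sess = tf.Session(config=config)

This only keeps new processes from hogging the cards; memory already held
by a running (or hung) TensorFlow process is not released until that
process exits.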

Best,
Barnabas






======================
Barnabas Poczos, PhD
Assistant Professor
Machine Learning Department
Carnegie Mellon University


On Tue, Feb 14, 2017 at 10:23 PM, Predrag Punosevac <predragp at cs.cmu.edu> wrote:
> Kaylan Burleigh <kburleigh at lbl.gov> wrote:
>
>> Hi Predrag,
>>
>> Yes, I do know how to use Unix. All the machines I'm used to run SLURM,
>> and since users are not root there, the bashrc files are renamed to
>> something else and users edit those copies.
>>
>
> SLURM is a queueing system used to manage and control jobs on clusters,
> including GPU clusters. At this point the Auton Lab does not operate a
> single cluster, because we typically buy equipment from smaller
> general-purpose grants. If we score a 300K equipment grant this year, we
> will buy a cluster and run SLURM on it as well. SLURM is used on most
> CMU clusters.
>
>
>
>
>> Anyway, gpu1-3 each have a few GPUs that aren't being used, but 99% of
>> the memory is taken. See attached. Can we fix that?
>>
>
> Please see this thread
>
> https://github.com/tensorflow/tensorflow/issues/1578
>
> In short, it is a well-known TensorFlow "feature". The only way for me to
> "clear" the memory is to reboot the node.
>
>
> Best,
> Predrag
>
>
>> Thanks,
>> Kaylan
>>

