Naive Tensorflow/GPU1 question

Fri May 12 13:19:56 EDT 2017

hey Dougal,

I can run python and import tensorflow on GPU1; the issue only shows up
when I run my command.
Could it be that GPU4 is still using the older version of tensorflow?

I can run my stuff on GPU4 without much of an issue but not on GPU1. Here's
what LD_LIBRARY_PATH gives me on GPU4 and GPU1:

kkandasa@gpu4$ echo $LD_LIBRARY_PATH
/zfsauton/home/kkandasa/torch/install/lib:/zfsauton/home/kkandasa/torch/install/lib:/zfsauton/home/kkandasa/torch/install/lib:

kkandasa@gpu1$ echo $LD_LIBRARY_PATH
/zfsauton/home/kkandasa/torch/install/lib:/zfsauton/home/kkandasa/torch/install/lib:/zfsauton/home/kkandasa/torch/install/lib:/usr/local/cuda/lib64:/usr/lib64/mpich/lib
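
For reference, the check suggested in the quoted message below (only
/usr/local and /usr/lib64 entries, nothing from your own directories) could
be scripted roughly as follows. This is only a sketch: the function name
non_system_entries and the SYSTEM_PREFIXES tuple are made up for
illustration, and the prefix list is an assumption based on that advice.

```python
import os

# Assumption: prefixes treated as "system" locations, taken from the advice
# in the quoted message (/usr/local and /usr/lib64). Adjust for your cluster.
SYSTEM_PREFIXES = ("/usr/local", "/usr/lib64")


def non_system_entries(ld_library_path, prefixes=SYSTEM_PREFIXES):
    """Return the unique LD_LIBRARY_PATH entries not under a system prefix."""
    suspicious = []
    for entry in ld_library_path.split(":"):
        if not entry:
            continue  # skip empty entries left by leading/trailing colons
        normalized = os.path.normpath(entry)
        under_system = any(
            normalized == p or normalized.startswith(p + "/") for p in prefixes
        )
        if not under_system and entry not in suspicious:
            suspicious.append(entry)
    return suspicious


if __name__ == "__main__":
    # Report anything in the current environment that lives outside
    # the assumed system prefixes.
    for entry in non_system_entries(os.environ.get("LD_LIBRARY_PATH", "")):
        print("non-system entry:", entry)
```

Run against the gpu1 value above, this would flag only the
torch/install/lib entry.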

On Fri, May 12, 2017 at 11:59 AM, Dougal Sutherland <dougal at gmail.com>
wrote:

> It's possible that you followed some instructions I sent a while ago and
> are using your own version of cudnn. Try "echo $LD_LIBRARY_PATH" and make
> sure it only has things in /usr/local, /usr/lib64 (nothing in your own
> directories), and make sure that your python code doesn't change that....
>
> The Anaconda python distribution now distributes cudnn and tensorflow-gpu,
> so you could also install that in your scratch dir to have your own
> install. But they only have tensorflow 1.0 and higher, so your old code
> would require some changes (system install on gpu1 is 0.10, and there were
> breaking changes in both 1.0 and 1.1).
>
> On Fri, May 12, 2017 at 4:55 PM Dougal Sutherland <dougal at gmail.com>
> wrote:
>
>> It works for me too, not in IPython. Try this:
>>
>> CUDA_VISIBLE_DEVICES=5 python -c 'import tensorflow as tf; tf.InteractiveSession()'
>>
>> On Fri, May 12, 2017 at 4:55 PM Kirthevasan Kandasamy <kandasamy at cmu.edu>
>> wrote:
>>
>>> No, I don't use iPython.
>>>
>>> On Fri, May 12, 2017 at 11:22 AM, <chiragn at andrew.cmu.edu> wrote:
>>>
>>>> Have you tried running it from within an iPython notebook as an
>>>> interactive session?
>>>>
>>>> I am doing that right now and it works.
>>>>
>>>> Chirag
>>>>
>>>>
>>>> > Kirthevasan Kandasamy <kandasamy at cmu.edu> wrote:
>>>> >
>>>> >> Hi Predrag,
>>>> >>
>>>> >> I am re-running a tensorflow project on GPU1 - I haven't touched it in
>>>> >> 4/5 months, and the last time I ran it it worked fine, but when I try
>>>> >> now I seem to be getting the following error.
>>>> >>
>>>> >
>>>> > This is the first time I have heard about it. I was under the
>>>> > impression that GPU nodes were usable.  I am redirecting your e-mail
>>>> > to users at autonlab.org in the hope that somebody who is using
>>>> > TensorFlow on a regular basis can be of more help.
>>>> >
>>>> > Predrag
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >> Can you please tell me what the issue might be or direct me to someone
>>>> >> who might know?
>>>> >>
>>>> >> This is for the NIPS deadline, so I would appreciate a quick response.
>>>> >>
>>>> >> thanks,
>>>> >> Samy
>>>> >>
>>>> >>
>>>> >> I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
>>>> >> name: Tesla K80
>>>> >> major: 3 minor: 7 memoryClockRate (GHz) 0.8235
>>>> >> pciBusID 0000:05:00.0
>>>> >> Total memory: 11.17GiB
>>>> >> Free memory: 11.11GiB
>>>> >> I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
>>>> >> I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y
>>>> >> I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:05:00.0)
>>>> >> E tensorflow/stream_executor/cuda/cuda_dnn.cc:347] Loaded runtime CuDNN library: 4007 (compatibility version 4000) but source was compiled with 5103 (compatibility version 5100).  If using a binary install, upgrade your CuDNN library to match.  If building from sources, make sure the library loaded at runtime matches a compatible version specified during compile configuration.
>>>> >> F tensorflow/core/kernels/conv_ops.cc:457] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms)
>>>> >> run_resnet.sh: line 49: 22665 Aborted                 (core dumped) CUDA_VISIBLE_DEVICES=$GPU python ../resnettf/resnet_main.py --data_dir $DATA_DIR --max_batch_iters $NUM_ITERS --report_results_every $REPORT_RESULTS_EVERY --log_root $LOG_ROOT --dataset $DATASET --num_gpus 1 --save_model_dir $SAVE_MODEL_DIR --save_model_every $SAVE_MODEL_EVERY --skip_add_method $SKIP_ADD_METHOD --architecture $ARCHITECTURE --skip_size $SKIP_SIZE
>>>> >
>>>>
>>>>
>>>>
>>>
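
As a side note on the CuDNN error quoted above: the two numbers in the log
(4007 and 5103) can be read as version codes. A rough sketch of decoding
them, assuming the encoding major * 1000 + minor * 100 + patch (an
assumption that fits the "compatibility version" values 4000 and 5100 in the
log); the helper names are hypothetical and this is not TensorFlow's own
check:

```python
def decode_cudnn_version(code):
    """Split an integer cuDNN version code into (major, minor, patch).

    Assumes the encoding major * 1000 + minor * 100 + patch.
    """
    major, rest = divmod(code, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def same_major_minor(loaded, compiled):
    """Illustrative comparison: True if major and minor both match."""
    return decode_cudnn_version(loaded)[:2] == decode_cudnn_version(compiled)[:2]


if __name__ == "__main__":
    loaded, compiled = 4007, 5103  # values taken from the error message
    print("loaded:", decode_cudnn_version(loaded))
    print("compiled against:", decode_cudnn_version(compiled))
    print("major/minor match:", same_major_minor(loaded, compiled))
```

Under that assumption, the runtime library found through LD_LIBRARY_PATH is
a 4.0.x build while the installed tensorflow was compiled against 5.1.x.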