Naive Tensorflow/GPU1 question
Dougal Sutherland
dougal at gmail.com
Fri May 12 13:24:14 EDT 2017
The error you showed *should* be triggered by starting a session (not just
by importing tensorflow; the command I sent earlier does start one).
It could be that your torch install in your home directory is messing with
things. Try exporting
LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/lib64/mpich/lib before starting
python.
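Concretely, that would look like the following (a sketch: this replaces the
variable outright, dropping the ~/torch entries, rather than prepending):

```shell
# Point the dynamic linker at the system CUDA and MPICH libraries only,
# so the cuDNN that TensorFlow was built against is the one that loads.
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/lib64/mpich/lib
echo "$LD_LIBRARY_PATH"
```

Then start python from the same shell so it inherits the cleaned path.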
On Fri, May 12, 2017 at 6:20 PM Kirthevasan Kandasamy <kandasamy at cmu.edu>
wrote:
> hey Dougal,
>
> I could run python and import tensorflow on GPU1 but the issue is when I
> run my command.
> Could it be that GPU4 is still using the older version of tensorflow?
>
> I can run my stuff on GPU4 without much of an issue but not on GPU1.
> Here's what LD_LIBRARY_PATH gives me on GPU4 and GPU1:
>
> kkandasa at gpu4$ echo $LD_LIBRARY_PATH
>
> /zfsauton/home/kkandasa/torch/install/lib:/zfsauton/home/kkandasa/torch/install/lib:/zfsauton/home/kkandasa/torch/install/lib:
>
> kkandasa at gpu1$ echo $LD_LIBRARY_PATH
>
> /zfsauton/home/kkandasa/torch/install/lib:/zfsauton/home/kkandasa/torch/install/lib:/zfsauton/home/kkandasa/torch/install/lib:/usr/local/cuda/lib64:/usr/lib64/mpich/lib
>
>
>
> On Fri, May 12, 2017 at 11:59 AM, Dougal Sutherland <dougal at gmail.com>
> wrote:
>
>> It's possible that you followed some instructions I sent a while ago and
>> are using your own version of cudnn. Try "echo $LD_LIBRARY_PATH" and make
>> sure it only has things in /usr/local or /usr/lib64 (nothing in your own
>> directories), and make sure that your python code doesn't change that.
>>
>> The Anaconda python distribution now distributes cudnn and
>> tensorflow-gpu, so you could also install that in your scratch dir to have
>> your own install. But they only have tensorflow 1.0 and higher, so your old
>> code would require some changes (system install on gpu1 is 0.10, and there
>> were breaking changes in both 1.0 and 1.1).
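One way to script the check Dougal describes (a hypothetical helper, not part
of any existing tooling): flag LD_LIBRARY_PATH entries that live outside
/usr/local and /usr/lib64, since those can shadow the system cuDNN.

```python
def suspicious_entries(ld_library_path):
    """Return path entries that are not under /usr/local or /usr/lib64,
    e.g. a cuDNN copy in a home directory that shadows the system one."""
    entries = [p for p in ld_library_path.split(":") if p]
    return [p for p in entries
            if not (p.startswith("/usr/local") or p.startswith("/usr/lib64"))]

# The GPU1 path quoted earlier in this thread, for illustration:
path = ("/zfsauton/home/kkandasa/torch/install/lib:"
        "/usr/local/cuda/lib64:/usr/lib64/mpich/lib")
print(suspicious_entries(path))  # -> ['/zfsauton/home/kkandasa/torch/install/lib']
```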
>>
>> On Fri, May 12, 2017 at 4:55 PM Dougal Sutherland <dougal at gmail.com>
>> wrote:
>>
>>> It works for me too, and I'm not in IPython. Try this:
>>>
>>> CUDA_VISIBLE_DEVICES=5 python -c 'import tensorflow as tf; tf.InteractiveSession()'
>>>
>>> On Fri, May 12, 2017 at 4:55 PM Kirthevasan Kandasamy <kandasamy at cmu.edu>
>>> wrote:
>>>
>>>> No, I don't use iPython.
>>>>
>>>> On Fri, May 12, 2017 at 11:22 AM, <chiragn at andrew.cmu.edu> wrote:
>>>>
>>>>> Have you tried running it from within an iPython notebook as an
>>>>> interactive session?
>>>>>
>>>>> I am doing that right now and it works.
>>>>>
>>>>> Chirag
>>>>>
>>>>>
>>>>> > Kirthevasan Kandasamy <kandasamy at cmu.edu> wrote:
>>>>> >
>>>>> >> Hi Predrag,
>>>>> >>
>>>>> >> I am re-running a tensorflow project on GPU1 - I haven't touched it
>>>>> >> in 4/5 months, and the last time I ran it it worked fine, but when I
>>>>> >> try now I seem to be getting the following error.
>>>>> >>
>>>>> >
>>>>> > This is the first time I have heard about it. I was under the
>>>>> > impression that the GPU nodes were usable. I am redirecting your
>>>>> > e-mail to users at autonlab.org in the hope that somebody who is
>>>>> > using TensorFlow on a regular basis can be of more help.
>>>>> >
>>>>> > Predrag
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >> Can you please tell me what the issue might be, or direct me to
>>>>> >> someone who might know?
>>>>> >>
>>>>> >> This is for the NIPS deadline, so I would appreciate a quick
>>>>> >> response.
>>>>> >>
>>>>> >> thanks,
>>>>> >> Samy
>>>>> >>
>>>>> >>
>>>>> >> I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
>>>>> >> name: Tesla K80
>>>>> >> major: 3 minor: 7 memoryClockRate (GHz) 0.8235
>>>>> >> pciBusID 0000:05:00.0
>>>>> >> Total memory: 11.17GiB
>>>>> >> Free memory: 11.11GiB
>>>>> >> I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
>>>>> >> I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
>>>>> >> I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:05:00.0)
>>>>> >> E tensorflow/stream_executor/cuda/cuda_dnn.cc:347] Loaded runtime CuDNN library: 4007 (compatibility version 4000) but source was compiled with 5103 (compatibility version 5100). If using a binary install, upgrade your CuDNN library to match. If building from sources, make sure the library loaded at runtime matches a compatible version specified during compile configuration.
>>>>> >> F tensorflow/core/kernels/conv_ops.cc:457] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms)
>>>>> >> run_resnet.sh: line 49: 22665 Aborted (core dumped)
>>>>> >> CUDA_VISIBLE_DEVICES=$GPU python ../resnettf/resnet_main.py --data_dir $DATA_DIR --max_batch_iters $NUM_ITERS --report_results_every $REPORT_RESULTS_EVERY --log_root $LOG_ROOT --dataset $DATASET --num_gpus 1 --save_model_dir $SAVE_MODEL_DIR --save_model_every $SAVE_MODEL_EVERY --skip_add_method $SKIP_ADD_METHOD --architecture $ARCHITECTURE --skip_size $SKIP_SIZE
>>>>> >
>>>>>
>>>>>
>>>>>
>>>>
>
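For reference, the numbers in the cuDNN error above follow cuDNN's version
encoding, MAJOR*1000 + MINOR*100 + PATCHLEVEL; the "compatibility version"
in the message drops the patch level. A small sketch of that reading (my
interpretation of the log, not an official API):

```python
def cudnn_compat(version):
    # cuDNN encodes versions as MAJOR*1000 + MINOR*100 + PATCHLEVEL;
    # the compatibility version keeps only the major and minor parts.
    return (version // 100) * 100

# The runtime loaded cuDNN 4007 (compat 4000), but TF was compiled
# against 5103 (compat 5100), so the compatibility check fails.
print(cudnn_compat(4007), cudnn_compat(5103))  # -> 4000 5100
```

This is why swapping in the system cuDNN 5.1 via LD_LIBRARY_PATH resolves
the crash: the loaded library's compatibility version then matches 5100.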