Naive Tensorflow/GPU1 question
Kirthevasan Kandasamy
kandasamy at cmu.edu
Fri May 12 13:32:39 EDT 2017
Actually, I could run your command too:
kkandasa at gpu1$ CUDA_VISIBLE_DEVICES=5 python -c 'import tensorflow as tf;
tf.InteractiveSession()'
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA
library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA
library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA
library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA
library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA
library libcurand.so locally
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with
properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:85:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow
device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:85:00.0)
Here is the error I get:
kkandasa at gpu1$ bash run_resnet2.sh
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA
library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA
library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA
library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA
library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA
library libcurand.so locally
./cifar10py/train/data_batch_3
./cifar10py/train/data_batch_4
./cifar10py/train/train_file
Could not read file train_file.
./cifar10py/train/data_batch_2
./cifar10py/train/data_batch_1
./cifar10py/valid/data_batch_5
./cifar10py/valid/valid_file
Could not read file valid_file.
--- Architecture ---
initial filters: 64
residual groups: (11) [64, 64], [64, 64], [128, 128], [128, 128],
[128, 128], [256, 256], [256, 256], [256, 256], [256, 256], [512,
512], [512, 512]
final fc nodes: 1000
skip add method: linear
# total model params: 29703429 (29,703,429)
# trainable model params: 14846530 (14,846,530)
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with
properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:85:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow
device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:85:00.0)
E tensorflow/stream_executor/cuda/cuda_dnn.cc:347] Loaded runtime CuDNN
library: 4007 (compatibility version 4000) but source was compiled with
5103 (compatibility version 5100). If using a binary install, upgrade your
CuDNN library to match. If building from sources, make sure the library
loaded at runtime matches a compatible version specified during compile
configuration.
F tensorflow/core/kernels/conv_ops.cc:457] Check failed:
stream->parent()->GetConvolveAlgorithms(&algorithms)
run_resnet2.sh: line 33: 9035 Aborted (core dumped)
CUDA_VISIBLE_DEVICES=$GPU python ../resnettf/resnet_main.py --data_dir
$DATA_DIR --max_batch_iters $NUM_ITERS --report_results_every
$REPORT_RESULTS_EVERY --log_root $LOG_ROOT --dataset $DATASET --num_gpus 1
--save_model_dir $SAVE_MODEL_DIR --save_model_every $SAVE_MODEL_EVERY
--architecture $ARCHITECTURE --skip_size $SKIP_SIZE
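For what it's worth, cuDNN packs its version into a single integer as major*1000 + minor*100 + patch, so the mismatch in the error above is a cuDNN 4.0.7 library found at runtime against a TensorFlow build compiled for 5.1.3. A small sketch decoding the two numbers:

```shell
# Decode cuDNN's packed version integer (major*1000 + minor*100 + patch).
decode_cudnn() {
  v=$1
  echo "$((v / 1000)).$((v % 1000 / 100)).$((v % 100))"
}
decode_cudnn 4007   # library loaded at runtime -> 4.0.7
decode_cudnn 5103   # version TensorFlow was compiled against -> 5.1.3
```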
kkandasa at gpu1$ echo $LD_LIBRARY_PATH
/zfsauton/home/kkandasa/torch/install/lib:/zfsauton/home/kkandasa/torch/install/lib:/zfsauton/home/kkandasa/torch/install/lib:/usr/local/cuda/lib64:/usr/lib64/mpich/lib
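Note that the torch/install/lib entries sit ahead of /usr/local/cuda/lib64 in that path, so the dynamic loader will try any libcudnn in the torch tree first. One way to strip the home-directory entries before launching python is a sketch like this (clean_ld_path is a hypothetical helper, assuming the stray libraries all live under $HOME):

```shell
# Drop LD_LIBRARY_PATH entries under $HOME so system dirs such as
# /usr/local/cuda/lib64 are searched first.
clean_ld_path() {
  echo "$1" | tr ':' '\n' | grep -v "^$HOME" | paste -sd: -
}
export LD_LIBRARY_PATH="$(clean_ld_path "$LD_LIBRARY_PATH")"
echo "$LD_LIBRARY_PATH"
```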
On Fri, May 12, 2017 at 1:24 PM, Dougal Sutherland <dougal at gmail.com> wrote:
> The error you showed *should* be triggered by starting a session (not
> just by importing tensorflow, but the command I sent earlier does that).
>
> It could be that your torch install in your home directory is messing with
> things. Try exporting LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/lib64/mpich/lib
> before starting python.
>
> On Fri, May 12, 2017 at 6:20 PM Kirthevasan Kandasamy <kandasamy at cmu.edu>
> wrote:
>
>> hey Dougal,
>>
>> I could run python and import tensorflow on GPU1 but the issue is when I
>> run my command.
>> Could it be that GPU4 is still using the older version of tensorflow?
>>
>> I can run my stuff on GPU4 without much of an issue but not on GPU1.
>> Here's what LD_LIBRARY_PATH gives me on GPU4 and GPU1:
>>
>> kkandasa at gpu4$ echo $LD_LIBRARY_PATH
>> /zfsauton/home/kkandasa/torch/install/lib:/zfsauton/home/kkandasa/torch/install/lib:/zfsauton/home/kkandasa/torch/install/lib:
>>
>> kkandasa at gpu1$ echo $LD_LIBRARY_PATH
>> /zfsauton/home/kkandasa/torch/install/lib:/zfsauton/home/kkandasa/torch/install/lib:/zfsauton/home/kkandasa/torch/install/lib:/usr/local/cuda/lib64:/usr/lib64/mpich/lib
>>
>>
>>
>> On Fri, May 12, 2017 at 11:59 AM, Dougal Sutherland <dougal at gmail.com>
>> wrote:
>>
>>> It's possible that you followed some instructions I sent a while ago and
>>> are using your own version of cudnn. Try "echo $LD_LIBRARY_PATH" and make
>>> sure it only has things in /usr/local, /usr/lib64 (nothing in your own
>>> directories), and make sure that your python code doesn't change that....
>>>
>>> The Anaconda python distribution now distributes cudnn and
>>> tensorflow-gpu, so you could also install that in your scratch dir to have
>>> your own install. But they only have tensorflow 1.0 and higher, so your old
>>> code would require some changes (system install on gpu1 is 0.10, and there
>>> were breaking changes in both 1.0 and 1.1).
>>>
>>> On Fri, May 12, 2017 at 4:55 PM Dougal Sutherland <dougal at gmail.com>
>>> wrote:
>>>
>>>> It works for me too, not in IPython. Try this:
>>>>
>>>> CUDA_VISIBLE_DEVICES=5 python -c 'import tensorflow as tf;
>>>> tf.InteractiveSession()'
>>>>
>>>> On Fri, May 12, 2017 at 4:55 PM Kirthevasan Kandasamy <
>>>> kandasamy at cmu.edu> wrote:
>>>>
>>>>> No, I don't use iPython.
>>>>>
>>>>> On Fri, May 12, 2017 at 11:22 AM, <chiragn at andrew.cmu.edu> wrote:
>>>>>
>>>>>> Have you tried running it from within an iPython notebook as an
>>>>>> interactive session?
>>>>>>
>>>>>> I am doing that right now and it works.
>>>>>>
>>>>>> Chirag
>>>>>>
>>>>>>
>>>>>> > Kirthevasan Kandasamy <kandasamy at cmu.edu> wrote:
>>>>>> >
>>>>>> >> Hi Predrag,
>>>>>> >>
>>>>>> >> I am re-running a tensorflow project on GPU1 - I haven't touched it
>>>>>> >> in 4/5 months, and the last time I ran it, it worked fine, but when
>>>>>> >> I try now I seem to be getting the following error.
>>>>>> >>
>>>>>> >
>>>>>> > This is the first time I have heard about it. I was under the
>>>>>> > impression that the GPU nodes were usable. I am redirecting your
>>>>>> > e-mail to users at autonlab.org in the hope that somebody who is
>>>>>> > using TensorFlow on a regular basis can be of more help.
>>>>>> >
>>>>>> > Predrag
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >> Can you please tell me what the issue might be or direct me to
>>>>>> >> someone who might know?
>>>>>> >>
>>>>>> >> This is for the NIPS deadline, so I would appreciate a quick
>>>>>> >> response.
>>>>>> >>
>>>>>> >> thanks,
>>>>>> >> Samy
>>>>>> >>
>>>>>> >>
>>>>>> >> I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device
>>>>>> >> 0 with properties:
>>>>>> >> name: Tesla K80
>>>>>> >> major: 3 minor: 7 memoryClockRate (GHz) 0.8235
>>>>>> >> pciBusID 0000:05:00.0
>>>>>> >> Total memory: 11.17GiB
>>>>>> >> Free memory: 11.11GiB
>>>>>> >> I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
>>>>>> >> I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
>>>>>> >> I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating
>>>>>> >> TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus
>>>>>> >> id: 0000:05:00.0)
>>>>>> >> E tensorflow/stream_executor/cuda/cuda_dnn.cc:347] Loaded runtime
>>>>>> >> CuDNN library: 4007 (compatibility version 4000) but source was
>>>>>> >> compiled with 5103 (compatibility version 5100). If using a binary
>>>>>> >> install, upgrade your CuDNN library to match. If building from
>>>>>> >> sources, make sure the library loaded at runtime matches a
>>>>>> >> compatible version specified during compile configuration.
>>>>>> >> F tensorflow/core/kernels/conv_ops.cc:457] Check failed:
>>>>>> >> stream->parent()->GetConvolveAlgorithms(&algorithms)
>>>>>> >> run_resnet.sh: line 49: 22665 Aborted (core dumped)
>>>>>> >> CUDA_VISIBLE_DEVICES=$GPU python ../resnettf/resnet_main.py
>>>>>> >> --data_dir $DATA_DIR --max_batch_iters $NUM_ITERS
>>>>>> >> --report_results_every $REPORT_RESULTS_EVERY --log_root $LOG_ROOT
>>>>>> >> --dataset $DATASET --num_gpus 1 --save_model_dir $SAVE_MODEL_DIR
>>>>>> >> --save_model_every $SAVE_MODEL_EVERY --skip_add_method
>>>>>> >> $SKIP_ADD_METHOD --architecture $ARCHITECTURE --skip_size $SKIP_SIZE
>>>>>> >
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>
More information about the Autonlab-users mailing list