Possible CUDA problem

Tue Jan 8 18:32:21 EST 2019

Many of the other GPU servers have no symbolic link from /usr/local/cuda
to the installed version of CUDA because multiple CUDA versions are
installed. That is an artifact of the fact that we started with CUDA
8.0 and went through a bunch of upgrades. You should be able to work
without the symbolic link.
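
For a machine without the /usr/local/cuda link, something along these
lines could show which CUDA directories exist and build a library path
for one of them. This is only a rough sketch: the function names are
made up for illustration, and it assumes the installs live under
/usr/local in directories named like cuda-9.0 or cuda9.1, each with a
lib64 subdirectory.

```python
import os
import re
from pathlib import Path


def find_cuda_installs(root="/usr/local"):
    """Return {version: path} for directories named like cuda-9.0 or cuda9.1.

    Hypothetical helper: the naming pattern is an assumption based on the
    directory names mentioned in this thread.
    """
    installs = {}
    root = Path(root)
    if not root.is_dir():
        return installs
    for entry in root.iterdir():
        match = re.fullmatch(r"cuda-?(\d+\.\d+)", entry.name)
        # A bare "cuda" symlink does not match the pattern and is skipped.
        if match and entry.is_dir():
            installs[match.group(1)] = entry
    return installs


def library_path_for(version, root="/usr/local"):
    """Build an LD_LIBRARY_PATH value with the chosen CUDA's lib64 first."""
    installs = find_cuda_installs(root)
    if version not in installs:
        raise LookupError(
            f"CUDA {version} not found under {root}; have {sorted(installs)}"
        )
    lib64 = str(installs[version] / "lib64")
    current = os.environ.get("LD_LIBRARY_PATH", "")
    return lib64 if not current else f"{lib64}:{current}"


if __name__ == "__main__":
    for version, path in sorted(find_cuda_installs().items()):
        print(version, path)
```

The value returned by library_path_for() has to be exported in the shell
before Python starts. The dynamic loader reads LD_LIBRARY_PATH at process
startup, so changing os.environ inside an already running interpreter
will not help the TensorFlow import.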

cuDNN is proprietary software (although not an Intel library, as I said
in an earlier e-mail). I don't have an account to download it:

https://developer.nvidia.com/rdp/form/cudnn-download-survey

Please download it if you have an NVidia dev account and put it
somewhere I can access it. I forgot how it works. We had similar issues
with the Intel proprietary compiler and optimization libraries. They
can't be downloaded for free, but there are a bunch of hoops we have to
jump through to get them.
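
Once the files are in place, a check like the one below might help
confirm whether the dynamic loader can actually find the libraries named
in the ImportError messages further down this thread. It is a minimal
sketch, not a full diagnosis: can_load() and report() are hypothetical
helper names, and the only thing it tests is whether the library can be
opened with the current LD_LIBRARY_PATH.

```python
import ctypes


def can_load(soname):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(soname)
    except OSError:
        # Raised when the library cannot be found or opened.
        return False
    return True


def report(sonames):
    """Map each library name to whether it could be loaded."""
    return {name: can_load(name) for name in sonames}


if __name__ == "__main__":
    # Both names are copied from the ImportError messages in this thread.
    for name, ok in report(["libcublas.so.9.0", "libcudnn.so.7"]).items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Run it with the same environment that is used to start Python for
TensorFlow, otherwise the result says little about the import failure.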

Predrag

On Tue, Jan 8, 2019 at 12:46 PM Yotam Hechtlinger
<yhechtli at andrew.cmu.edu> wrote:
>
> With GPU5 & 6 the problem is that /usr/local is missing a symbolic link.
> It has cuda9.0 and cuda9.1 but not /usr/local/cuda.
>
> Regarding GPU10 - I also think it would be useful to keep the CUDA versions consistent.
> The TensorFlow nightly build has supported CUDA 10 since mid-December, see:
> https://github.com/tensorflow/tensorflow/issues/22706
>
> but using it would require switching TensorFlow versions between the servers, because the stable version only supports CUDA 9.
>
> Regarding cuDNN - not sure I understand, but I can't debug the nightly version on GPU10 until it's installed.
>
> Yotam.
>
> On Tue, Jan 8, 2019 at 4:42 PM Predrag Punosevac <predragp at andrew.cmu.edu> wrote:
>>
>> It is not. That is a proprietary Intel optimization library. One of you with an active Intel dev account needs to download it and put it somewhere I can find it. Please read the documentation to check which CUDA version is supported. In the past we had to downgrade CUDA to use cuDNN. I would not be surprised if we had to go back to CUDA 9.2 to use it.
>>
>> On Jan 8, 2019 3:49 AM, Yotam Hechtlinger <yhechtli at andrew.cmu.edu> wrote:
>>
>> Hi Predrag,
>>
>> Is cuDNN properly installed?
>> I can't see it inside the /usr/local/cuda.
>>
>> Also, import tensorflow gives:
>>
>> ImportError: libcudnn.so.7: cannot open shared object file: No such file or directory
>>
>> Thanks,
>> Yotam.
>>
>> On Mon, Jan 7, 2019 at 10:43 PM Predrag Punosevac <predragp at andrew.cmu.edu> wrote:
>>
>> Ok. I found one problem. CUDA 10 was not properly installed on GPU10
>> due to dependency problems. I had to disable the rpmfusion repos
>> (both free and non-free), which I considered safe in the past. Now CUDA
>> 10 is installed from the NVidia repo and is in /usr/local, and
>> /usr/local/cuda is a symbolic link to the actual /usr/local/cuda-10.0
>> folder. Please try now.
>>
>> Predrag
>>
>> On Mon, Jan 7, 2019 at 2:28 PM Yotam Hechtlinger
>> <yhechtli at andrew.cmu.edu> wrote:
>> >
>> > Hi Predrag,
>> >
>> > With GPU10 the problem is probably that LD_LIBRARY_PATH points to /usr/local/cuda/lib64, but that's not where CUDA is installed (where is it?).
>> >
>> > Yotam.
>> >
>> >
>> > On Mon, Jan 7, 2019 at 9:00 PM Predrag Punosevac <predragp at andrew.cmu.edu> wrote:
>> >>
>> >> Yotam,
>> >>
>> >> Thank you so much for this report! I am CC-ing users at autonlab.org so
>> >> that everyone is on the same page. Could you please work with me on
>> >> this one? Let's try to fix GPU10 first. GPU10 was recently
>> >> provisioned. It has three GeForce 1080Ti cards (one was DoA). I am
>> >> running the latest NVIDIA-Linux-x86_64-410.78 driver and the latest
>> >> cuda-10.0.130-1. You have two versions of Python: /opt/rh/rh-python36
>> >> will give you the latest 3.6.7, while /opt/miniconda3 will install
>> >> python-3.7.2. Once we fix GPU10 we will move to the other machines. Note
>> >> that the other machines are still running an older version of the NVidia
>> >> driver and CUDA-9.2. I have changed nothing on them, so whatever is
>> >> broken is broken upstream (Python, TensorFlow, NVidia, or CUDA).
>> >>
>> >> Please keep CC-ing users to this discussion so that people know what
>> >> is going on.
>> >>
>> >> Predrag
>> >>
>> >>
>> >> On Mon, Jan 7, 2019 at 8:02 AM Yotam Hechtlinger
>> >> <yhechtli at andrew.cmu.edu> wrote:
>> >> >
>> >> > Hi Predrag,
>> >> >
>> >> > There might be a CUDA problem on GPU5, GPU6 & GPU10.
>> >> > I get the following message when I try to import tensorflow:
>> >> >
>> >> >
>> >> >
>> >> > >>> import tensorflow
>> >> > Traceback (most recent call last):
>> >> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
>> >> >     from tensorflow.python.pywrap_tensorflow_internal import *
>> >> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
>> >> >     _pywrap_tensorflow_internal = swig_import_helper()
>> >> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
>> >> >     _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
>> >> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/imp.py", line 243, in load_module
>> >> >     return load_dynamic(name, filename, file)
>> >> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/imp.py", line 343, in load_dynamic
>> >> >     return _load(spec)
>> >> > ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory
>> >> >
>> >> > During handling of the above exception, another exception occurred:
>> >> >
>> >> > Traceback (most recent call last):
>> >> >   File "<stdin>", line 1, in <module>
>> >> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/site-packages/tensorflow/__init__.py", line 24, in <module>
>> >> >     from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
>> >> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/site-packages/tensorflow/python/__init__.py", line 49, in <module>
>> >> >     from tensorflow.python import pywrap_tensorflow
>> >> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 74, in <module>
>> >> >     raise ImportError(msg)
>> >> > ImportError: Traceback (most recent call last):
>> >> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
>> >> >     from tensorflow.python.pywrap_tensorflow_internal import *
>> >> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
>> >> >     _pywrap_tensorflow_internal = swig_import_helper()
>> >> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
>> >> >     _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
>> >> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/imp.py", line 243, in load_module
>> >> >     return load_dynamic(name, filename, file)
>> >> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/imp.py", line 343, in load_dynamic
>> >> >     return _load(spec)
>> >> > ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory
>> >> >
>> >> >
>> >> > Failed to load the native TensorFlow runtime.
>> >> >
>> >> > See https://www.tensorflow.org/install/errors
>> >> >
>> >> > for some common reasons and solutions.  Include the entire stack trace
>> >> > above this error message when asking for help.
>> >> >
>>
>>