Possible CUDA problem
Yotam Hechtlinger
yhechtli at andrew.cmu.edu
Tue Jan 8 12:45:43 EST 2019
With GPU5 & 6 the problem is that /usr/local is missing a symbolic link.
It has cuda9.0 and cuda9.1 but not /usr/local/cuda.
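A rough way to check this on any of the machines is sketched below. It only assumes that the toolkits live under /usr/local with names starting with "cuda"; the function name describe_cuda_install is made up for this example and is not part of any CUDA tooling.

```python
import os
from pathlib import Path


def describe_cuda_install(root="/usr/local"):
    """Report the CUDA toolkit directories under `root` and where the
    `cuda` symlink points, if it exists at all."""
    root_path = Path(root)
    # "cuda?*" matches cuda9.0, cuda-9.1, cuda-10.0, ... but not the bare "cuda" link.
    toolkits = sorted(
        p.name
        for p in root_path.glob("cuda?*")
        if p.is_dir() and not p.is_symlink()
    )
    link = root_path / "cuda"
    # Only read the target when the entry really is a symlink.
    target = os.readlink(str(link)) if link.is_symlink() else None
    return {
        "toolkits": toolkits,
        "link_exists": link.exists(),
        "link_target": target,
    }


if __name__ == "__main__":
    print(describe_cuda_install())
```

On a machine like GPU5 or GPU6 I would expect this to list the toolkit directories while reporting that the link does not exist.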
Regarding GPU10 - I also think it would be useful to keep the CUDA
versions consistent.
The TensorFlow nightly build has supported CUDA 10 since mid-December, see:
https://github.com/tensorflow/tensorflow/issues/22706
but using it would require switching TensorFlow versions between the
servers, because the stable version only supports CUDA 9.
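The ImportError messages quoted further down in this thread name the exact shared library the installed TensorFlow build was linked against (libcublas.so.9.0, libcudnn.so.7), which is a quick hint at which CUDA version that build expects. A minimal sketch for pulling that out of an error message follows; the helper name missing_shared_library is hypothetical, and reading the cuBLAS suffix as the toolkit version is an assumption that appears to hold for the 9.0 build seen here.

```python
import re

# Matches messages such as
# "libcublas.so.9.0: cannot open shared object file: No such file or directory".
_MISSING_LIB = re.compile(
    r"(lib[\w+-]+)\.so\.([\d.]+): cannot open shared object file"
)


def missing_shared_library(message):
    """Return (library_name, version_suffix) for the first missing shared
    library named in `message`, or None if the message has no such entry."""
    match = _MISSING_LIB.search(message)
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    msg = ("ImportError: libcublas.so.9.0: cannot open shared object file: "
           "No such file or directory")
    print(missing_shared_library(msg))
```

This could be wrapped around "import tensorflow" in a try/except ImportError to see on each server which library version the installed build is asking for.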
Regarding cuDNN - not sure I understand, but I can't debug the nightly
version on GPU10 until it's installed.
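To see whether a given library file is visible at all, one option is to walk the directories in LD_LIBRARY_PATH, as in the sketch below. The function name find_in_library_path is invented for this example. Note the assumption: it only looks at LD_LIBRARY_PATH, while the dynamic loader also consults the ldconfig cache and the default system directories, so an empty result here is a hint rather than proof.

```python
import os


def find_in_library_path(filename, library_path=None):
    """Return every path under the LD_LIBRARY_PATH directories where
    `filename` exists. `library_path` overrides the environment variable."""
    if library_path is None:
        library_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for directory in library_path.split(os.pathsep):
        if not directory:
            # Skip empty entries such as a trailing ":".
            continue
        candidate = os.path.join(directory, filename)
        if os.path.exists(candidate):
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    for name in ("libcudnn.so.7", "libcublas.so.9.0"):
        print(name, find_in_library_path(name))
```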
Yotam.
On Tue, Jan 8, 2019 at 4:42 PM Predrag Punosevac <predragp at andrew.cmu.edu>
wrote:
> It is not. That is a proprietary NVIDIA optimization library. One of you
> with an active NVIDIA dev account needs to download it and put it somewhere I can
> find it. Please read the documentation to check which CUDA version is
> supported. In the past we had to downgrade CUDA to use cuDNN. I would not be
> surprised if we had to go back to CUDA 9.2 to use it.
>
> On Jan 8, 2019 3:49 AM, Yotam Hechtlinger <yhechtli at andrew.cmu.edu> wrote:
>
> Hi Predrag,
>
> Is cuDNN properly installed?
> I can't see it inside /usr/local/cuda.
>
> Also, *import tensorflow* gives:
>
> *ImportError: libcudnn.so.7: cannot open shared object file: No such file or directory*
>
> Thanks,
> Yotam.
>
> On Mon, Jan 7, 2019 at 10:43 PM Predrag Punosevac <predragp at andrew.cmu.edu>
> wrote:
>
> Ok. I found one problem. CUDA 10 was not properly installed on GPU10
> due to dependency problems. I had to disable the rpmfusion repos
> (both free and non-free), which I considered safe in the past. Now CUDA
> 10 is installed from the NVIDIA repo in /usr/local, and
> /usr/local/cuda is a symbolic link to the actual /usr/local/cuda-10.0
> folder. Please try now.
>
> Predrag
>
> On Mon, Jan 7, 2019 at 2:28 PM Yotam Hechtlinger
> <yhechtli at andrew.cmu.edu> wrote:
> >
> > Hi Predrag,
> >
> > With GPU10 the problem is probably that LD_LIBRARY_PATH points to /usr/local/cuda/lib64, but that's not where CUDA is installed (where is it?).
> >
> > Yotam.
> >
> >
> > On Mon, Jan 7, 2019 at 9:00 PM Predrag Punosevac <predragp at andrew.cmu.edu> wrote:
> >>
> >> Yotam,
> >>
> >> Thank you so much for this report! I am CC-ing users at autonlab.org so
> >> that everyone is on the same page. Could you please work with me on
> >> this one? Let's try to fix GPU10 first. GPU10 was recently
> >> provisioned. It has three (one was DoA) GeForce 1080Ti. I am running
> >> the latest NVIDIA-Linux-x86_64-410.78 driver and the latest
> >> cuda-10.0.130-1. You have two versions of Python: /opt/rh/rh-python36
> >> will give you the latest 3.6.7, while /opt/miniconda3 will install
> >> python-3.7.2. Once we fix GPU10 we will move on to the other machines. Note
> >> that the other machines are still running an older version of the NVIDIA
> >> driver and CUDA-9.2. I have changed nothing on them, so whatever is broken
> >> is broken upstream (Python, TensorFlow, NVIDIA, or CUDA).
> >>
> >> Please keep CC-ing users to this discussion so that people know what
> >> is going on.
> >>
> >> Predrag
> >>
> >>
> >> On Mon, Jan 7, 2019 at 8:02 AM Yotam Hechtlinger
> >> <yhechtli at andrew.cmu.edu> wrote:
> >> >
> >> > Hi Predrag,
> >> >
> >> > There might be a CUDA problem on GPU5, GPU6 and GPU10.
> >> > I get the following message when I try to import tensorflow:
> >> >
> >> >
> >> >
> >> > >>> import tensorflow
> >> > Traceback (most recent call last):
> >> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
> >> >     from tensorflow.python.pywrap_tensorflow_internal import *
> >> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
> >> >     _pywrap_tensorflow_internal = swig_import_helper()
> >> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
> >> >     _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
> >> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/imp.py", line 243, in load_module
> >> >     return load_dynamic(name, filename, file)
> >> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/imp.py", line 343, in load_dynamic
> >> >     return _load(spec)
> >> > ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory
> >> >
> >> > During handling of the above exception, another exception occurred:
> >> >
> >> > Traceback (most recent call last):
> >> >   File "<stdin>", line 1, in <module>
> >> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/site-packages/tensorflow/__init__.py", line 24, in <module>
> >> >     from tensorflow.python import pywrap_tensorflow # pylint: disable=unused-import
> >> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/site-packages/tensorflow/python/__init__.py", line 49, in <module>
> >> >     from tensorflow.python import pywrap_tensorflow
> >> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 74, in <module>
> >> >     raise ImportError(msg)
> >> > ImportError: Traceback (most recent call last):
> >> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
> >> >     from tensorflow.python.pywrap_tensorflow_internal import *
> >> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
> >> >     _pywrap_tensorflow_internal = swig_import_helper()
> >> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
> >> >     _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
> >> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/imp.py", line 243, in load_module
> >> >     return load_dynamic(name, filename, file)
> >> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/imp.py", line 343, in load_dynamic
> >> >     return _load(spec)
> >> > ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory
> >> >
> >> >
> >> > Failed to load the native TensorFlow runtime.
> >> >
> >> > See https://www.tensorflow.org/install/errors
> >> >
> >> > for some common reasons and solutions. Include the entire stack trace
> >> > above this error message when asking for help.
> >> >