Possible CUDA problem

Predrag Punosevac predragp at andrew.cmu.edu
Mon Jan 7 15:43:19 EST 2019


OK, I found one problem. CUDA 10 was not properly installed on GPU10
because of dependency problems. I had to disable the rpmfusion repos
(both free and non-free), which I had considered safe in the past. CUDA
10 is now installed from the NVIDIA repo under /usr/local, and
/usr/local/cuda is a symbolic link to the actual /usr/local/cuda-10.0
folder. Please try now.
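
If it still fails, a quick check along the following lines may help narrow
it down. This is only a sketch: the helper names find_shared_libs and
ld_library_dirs are made up for illustration, and the fallback directory
/usr/local/cuda/lib64 is an assumption taken from the paths mentioned in
this thread. It resolves each LD_LIBRARY_PATH entry through any symlinks
and lists the libcublas files it finds there, so you can compare them with
the libcublas.so.9.0 that the traceback asks for.

```python
import glob
import os


def find_shared_libs(stem, search_dirs):
    """Return sorted paths of files named like '<stem>.so*' in the given directories."""
    found = []
    for directory in search_dirs:
        # realpath follows symlinks, e.g. /usr/local/cuda -> /usr/local/cuda-10.0
        real_dir = os.path.realpath(directory)
        found.extend(glob.glob(os.path.join(real_dir, stem + ".so*")))
    return sorted(set(found))


def ld_library_dirs(environ=None):
    """Split LD_LIBRARY_PATH into its non-empty entries."""
    environ = os.environ if environ is None else environ
    return [d for d in environ.get("LD_LIBRARY_PATH", "").split(":") if d]


if __name__ == "__main__":
    # Assumed fallback if LD_LIBRARY_PATH is unset or empty.
    dirs = ld_library_dirs() or ["/usr/local/cuda/lib64"]
    for d in dirs:
        status = "exists" if os.path.isdir(d) else "missing"
        print(d, "->", os.path.realpath(d), "(" + status + ")")
    libs = find_shared_libs("libcublas", dirs)
    print("libcublas candidates:", libs or "none found")
```

If the only candidates listed carry a different version suffix than the one
in the traceback, the TensorFlow build and the installed CUDA probably do
not match, and the path itself is not the problem.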

Predrag

On Mon, Jan 7, 2019 at 2:28 PM Yotam Hechtlinger
<yhechtli at andrew.cmu.edu> wrote:
>
> Hi Predrag,
>
> With GPU10 the problem is probably that LD_LIBRARY_PATH points to /usr/local/cuda/lib64, but that's not where CUDA is installed (where is it?).
>
> Yotam.
>
>
> On Mon, Jan 7, 2019 at 9:00 PM Predrag Punosevac <predragp at andrew.cmu.edu> wrote:
>>
>> Yotam,
>>
>> Thank you so much for this report! I am CC-ing users at autonlab.org so
>> that everyone is on the same page. Could you please work with me on
>> this one? Let's try to fix GPU10 first. GPU10 was recently
>> provisioned. It has three GeForce 1080 Ti cards (one was DoA). I am
>> running the latest NVIDIA-Linux-x86_64-410.78 driver and the latest
>> cuda-10.0.130-1. You have two versions of Python: /opt/rh/rh-python36
>> gives you the latest 3.6.7, while /opt/miniconda3 installs
>> python-3.7.2. Once we fix GPU10 we will move on to the other machines.
>> Note that the other machines are still running an older version of the
>> NVIDIA driver and CUDA-9.2. I have changed nothing on them, so whatever
>> is broken is broken upstream (Python, TensorFlow, NVIDIA, or CUDA).
>>
>> Please keep CC-ing users on this discussion so that people know what
>> is going on.
>>
>> Predrag
>>
>>
>> On Mon, Jan 7, 2019 at 8:02 AM Yotam Hechtlinger
>> <yhechtli at andrew.cmu.edu> wrote:
>> >
>> > Hi Predrag,
>> >
>> > There might be a CUDA problem on GPU5, GPU6, and GPU10.
>> > I get the following message when I try to import tensorflow:
>> >
>> >
>> >
>> > >>> import tensorflow
>> > Traceback (most recent call last):
>> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
>> >     from tensorflow.python.pywrap_tensorflow_internal import *
>> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
>> >     _pywrap_tensorflow_internal = swig_import_helper()
>> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
>> >     _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
>> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/imp.py", line 243, in load_module
>> >     return load_dynamic(name, filename, file)
>> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/imp.py", line 343, in load_dynamic
>> >     return _load(spec)
>> > ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory
>> >
>> > During handling of the above exception, another exception occurred:
>> >
>> > Traceback (most recent call last):
>> >   File "<stdin>", line 1, in <module>
>> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/site-packages/tensorflow/__init__.py", line 24, in <module>
>> >     from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
>> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/site-packages/tensorflow/python/__init__.py", line 49, in <module>
>> >     from tensorflow.python import pywrap_tensorflow
>> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 74, in <module>
>> >     raise ImportError(msg)
>> > ImportError: Traceback (most recent call last):
>> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
>> >     from tensorflow.python.pywrap_tensorflow_internal import *
>> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
>> >     _pywrap_tensorflow_internal = swig_import_helper()
>> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
>> >     _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
>> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/imp.py", line 243, in load_module
>> >     return load_dynamic(name, filename, file)
>> >   File "/zfsauton/home/yhechtli/anaconda3/lib/python3.6/imp.py", line 343, in load_dynamic
>> >     return _load(spec)
>> > ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory
>> >
>> >
>> > Failed to load the native TensorFlow runtime.
>> >
>> > See https://www.tensorflow.org/install/errors
>> >
>> > for some common reasons and solutions.  Include the entire stack trace
>> > above this error message when asking for help.
>> >

