PyTorch

Wed Mar 28 01:34:56 EDT 2018

Has anyone been able to run either TensorFlow or PyTorch on GPU machines 5,
6, or 9?
Both give CUDA_ERROR_UNKNOWN errors.
I tried pointing PATH at each of cuda-8.0, cuda-9.0, and cuda-9.1 (with
LD_LIBRARY_PATH set to the corresponding lib64), reinstalling PyTorch for
each of those CUDA versions using both virtualenv and the system miniconda,
and reinstalling TensorFlow.
Unfortunately, nothing seems to work.
IIRC, these errors first appeared when the systems were rebooted after
spring break, and they have persisted ever since.
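
For anyone retrying the same steps, here is a minimal sketch of the
environment switching described above. It assumes the toolkits are installed
under /usr/local/cuda-X.Y (that path is a guess, not something confirmed in
this thread), and the helper names cuda_env and probe are made up for
illustration. Changing these variables is unlikely to fix a failure inside
the driver itself, so this mainly rules out a mixed-toolkit environment.

```python
import os
import subprocess
import sys


def cuda_env(version, base="/usr/local", environ=None):
    """Return a copy of the environment that points at one CUDA toolkit.

    `base` is an assumed install prefix; adjust it to the real location.
    """
    env = dict(os.environ if environ is None else environ)
    root = os.path.join(base, "cuda-" + version)
    bin_dir = os.path.join(root, "bin")
    lib_dir = os.path.join(root, "lib64")

    def prepend(new, old):
        # Drop entries that mention another cuda-X.Y so versions do not mix.
        kept = [p for p in old.split(os.pathsep) if p and "cuda-" not in p]
        return os.pathsep.join([new] + kept)

    env["PATH"] = prepend(bin_dir, env.get("PATH", ""))
    env["LD_LIBRARY_PATH"] = prepend(lib_dir, env.get("LD_LIBRARY_PATH", ""))
    return env


def probe(version):
    """Run a one-line PyTorch check in a child process under that toolkit."""
    code = "import torch; print(torch.cuda.is_available())"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=cuda_env(version),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.returncode, result.stdout.strip()
```

Calling probe("8.0"), probe("9.0"), and probe("9.1") in turn returns each
child's exit code and combined output, so the three toolkits can be compared
side by side on the same machine.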

Any help in the matter would be appreciated!

On Tue, Mar 27, 2018 at 5:35 PM, Predrag Punosevac <predragp at andrew.cmu.edu>
wrote:

> Matthew Barnes <mbarnes1 at andrew.cmu.edu> wrote:
>
> > I think this is an issue with the CUDA install. I'm unable to run
> > TensorFlow jobs on GPU9 as of last night (have not checked the others,
> > but I suspect similar).
>
> Nothing has changed since last night. The error you are seeing is
> TensorFlow complaining about the 390.30 NVIDIA driver, but we upgraded the
> driver last week across all servers and IIRC you were able to use
> TensorFlow on GPU2, GPU3, and GPU4 after the upgrade.
>
> The main problem seems to be the cuDNN library, as TensorFlow and PyTorch
> seem to expect older libraries. Look for them in the CUDA-9.0 directory.
>
> Predrag
>
> >
> > 2018-03-26 14:54:49.214493: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_UNKNOWN
> > 2018-03-26 14:54:49.214599: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: gpu9.int.autonlab.org
> > 2018-03-26 14:54:49.214617: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: gpu9.int.autonlab.org
> > 2018-03-26 14:54:49.214685: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 390.30.0
> > 2018-03-26 14:54:49.214747: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 390.30.0
> > 2018-03-26 14:54:49.214762: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 390.30.0
> >
> >
> > On Tue, Mar 27, 2018 at 1:47 AM Manzil Zaheer <manzil at cmu.edu> wrote:
> >
> > > Hi Predrag,
> > >
> > > Thanks for pointing out the links. From the link you provided, we can
> > > see that FB engineers mention that "error 30 is usually unrelated to
> > > pytorch issues (or your code change)".
> > >
> > > Thanks,
> > > Manzil
> > > ________________________________________
> > > From: Predrag Punosevac <predragp at andrew.cmu.edu>
> > > Sent: 27 March 2018 01:31
> > > To: Manzil Zaheer
> > > Cc: Barnabas Poczos; users at autonlab.org
> > > Subject: Re: PyTorch
> > >
> > > Manzil Zaheer <manzil at cmu.edu> wrote:
> > >
> > > > Hi Predrag,
> > > >
> > > > Thanks again for your help. But I still cannot get anything running
> > > > on GPU5,6,7,9. Also notice that GPU1,2,3,4,8 are almost all full,
> > > > while no one is using GPU5,6,7,9. This might mean no one else is able
> > > > to run anything either.
> > > >
> > >
> > > 7 is off limits; it is used for the special project. How did you
> > > figure out that nobody is using it when you can't even log in there?
> > >
> > > > So I tried many things. Everything installs without issue. But when I
> > > > try to run simple code like:
> > > >
> > >
> > > PyTorch is research-grade software. They have a mailing list. Three
> > > seconds of Googling reveals
> > >
> > >
> > > https://github.com/pytorch/pytorch/issues/2527
> > >
> > > also
> > >
> > >
> > > https://stackoverflow.com/questions/45861767/pytorch-giving-cuda-runtime-error
> > >
> > > I will look at this more, but it would be helpful if you got on the
> > > PyTorch mailing list and asked the developers what they think. I see
> > > this once every 9 months; they are looking at these bugs every day.
> > >
> > > Predrag
> > >
> > > > import torch
> > > > x = torch.cuda.FloatTensor(2,3,4)
> > > > print(x)
> > > >
> > > >
> > > > I get the following error:
> > > > THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 error=30 : unknown error
> > > > Traceback (most recent call last):
> > > >   File "<stdin>", line 1, in <module>
> > > >   File "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/_utils.py", line 69, in _cuda
> > > >     return new_type(self.size()).copy_(self, async)
> > > >   File "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 384, in _lazy_new
> > > >     _lazy_init()
> > > >   File "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 142, in _lazy_init
> > > >     torch._C._cuda_init()
> > > > RuntimeError: cuda runtime error (30) : unknown error at /pytorch/torch/lib/THC/THCGeneral.c:70
> > > >
> > > > Thanks,
> > > > Manzil
> > > >
> > > > ________________________________________
> > > > From: Predrag Punosevac <predragp at andrew.cmu.edu>
> > > > Sent: 26 March 2018 22:50
> > > > To: Manzil Zaheer
> > > > Cc: Barnabas Poczos; users at autonlab.org
> > > > Subject: Re: PyTorch
> > > >
> > > > Manzil Zaheer <manzil at cmu.edu> wrote:
> > > >
> > > > > Thanks for the detailed analysis. But I am using pytorch. I have
> > > > > not tried Lua torch. Can you please check? Thanks again!
> > > > >
> > > >
> > > > I did. You have Python 3.6.4 in /opt/miniconda3/bin/python3.6
> > > >
> > > > predrag at gpu3$ /opt/miniconda3/bin/python3.6
> > > > Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
> > > > [GCC 7.2.0] on linux
> > > > Type "help", "copyright", "credits" or "license" for more information.
> > > >
> > > >
> > > > Try reinstalling things in your scratch directory with
> > > >
> > > > /opt/miniconda3/bin/conda install pytorch torchvision cuda91 -c pytorch
> > > >
> > > > You should see something like
> > > >
> > > > The following packages will be downloaded:
> > > >
> > > >     package                    |            build
> > > >     ---------------------------|-----------------
> > > >     pillow-5.0.0               |   py36h3deb7b8_0         561 KB
> > > >     mkl-2018.0.2               |                1       205.2 MB
> > > >     cuda91-1.0                 |       h4c16780_0           3 KB  pytorch
> > > >     libpng-1.6.34              |       hb9fc6fc_0         334 KB
> > > >     freetype-2.8               |       hab7d2ae_1         804 KB
> > > >     libgfortran-ng-7.2.0       |       hdf63c60_3         1.2 MB
> > > >     intel-openmp-2018.0.0      |                8         620 KB
> > > >     libtiff-4.0.9              |       h28f6b97_0         586 KB
> > > >     pytorch-0.3.1              |py36_cuda9.1.85_cudnn7.0.5_2       475.0 MB  pytorch
> > > >     torchvision-0.2.0          |   py36h17b6947_1         102 KB  pytorch
> > > >     jpeg-9b                    |       h024ee3a_2         248 KB
> > > >     numpy-1.14.2               |   py36hdbf6ddf_0         4.0 MB
> > > >     olefile-0.45.1             |           py36_0          47 KB
> > > >     ------------------------------------------------------------
> > > >                                            Total:       688.7 MB
> > > >
> > > >
> > > > Make sure you use your scratch directory as the path, since the file
> > > > server is full. I got a clean installation but I didn't play further.
> > > > One thing that worries me is this line
> > > >
> > > > pytorch-0.3.1              |py36_cuda9.1.85_cudnn7.0.5_2       475.0 MB  pytorch
> > > >
> > > > We had problems with cudnn on 9.1 apparently because the upstream was
> > > > assuming 7.0.5 when in reality I have 7.1.1 CUDA 9 or even 7.1.5. CUDA
> > > > 9.1
> > > >
> > > > GPU3 has the CUDNN library 7.0.5 in cuda-9.0, so try adjusting the conda
> > > > command accordingly.
> > > >
> > > >
> > > > Best,
> > > > Predrag
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > >
> > > > >
> > > > > Sent from my Samsung Galaxy smartphone.
> > > > >
> > > > >
> > > > > -------- Original message --------
> > > > > From: Predrag Punosevac <predragp at andrew.cmu.edu>
> > > > > Date: 3/26/18 9:00 PM (GMT-05:00)
> > > > > To: Manzil Zaheer <manzil at cmu.edu>
> > > > > Cc: Barnabas Poczos <bapoczos at andrew.cmu.edu>, users at autonlab.org
> > > > > Subject: Re: Lua Torch
> > > > >
> > > > > Manzil Zaheer <manzil at cmu.edu> wrote:
> > > > >
> > > > > > Hi Predrag,
> > > > > >
> > > > > > I am not able to use any GPUs on gpu5,6,7,9. I tried all 3
> > > > > > versions of cuda, but I get the following error:
> > > > > >
> > > > >
> > > > >
> > > > > I was able to build it after adding this
> > > > >
> > > > > export TORCH_NVCC_FLAGS="-D__CUDA_NO_HALF_OPERATORS__"
> > > > >
> > > > > per
> > > > >
> > > > > https://github.com/torch/torch7/issues/1086
> > > > >
> > > > > When I try to run it I get errors that Lua packages are missing
> > > > > (probably due to my path variables). I have a vague recollection that
> > > > > Simon and I helped you once with this thing in the past. IIRC it was
> > > > > very picky about the version of some Lua package and required their
> > > > > version, not the one which comes with yum.
> > > > >
> > > > > Anyhow I am forwarding this to users at autonlab in the hope that
> > > > > somebody is using it and might be of more help. Please stop by NSH 3119
> > > > > and let us try to debug this.
> > > > >
> > > > > Predrag
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > > THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 error=30 : unknown error
> > > > > > Traceback (most recent call last):
> > > > > >   File "<stdin>", line 1, in <module>
> > > > > >   File "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 384, in _lazy_new
> > > > > >     _lazy_init()
> > > > > >   File "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 142, in _lazy_init
> > > > > >     torch._C._cuda_init()
> > > > > > RuntimeError: cuda runtime error (30) : unknown error at /pytorch/torch/lib/THC/THCGeneral.c:70
> > > > > >
> > > > > > Can you kindly look into it?
> > > > > >
> > > > > > Thanks,
> > > > > > Manzil
> > >
> > >
>

-- 
Barun Patra
Master's Student
Machine Learning Department
Carnegie Mellon University