<div dir="ltr">Has anyone been able to run either TensorFlow or PyTorch on GPU machines 5, 6, or 9?<div>Both give CUDA_ERROR_UNKNOWN errors.</div><div>I tried pointing my PATH and LD_LIBRARY_PATH variables at cuda-8.0, cuda-9.0, and cuda-9.1 (with LD_LIBRARY_PATH set to the corresponding lib64), reinstalling PyTorch for cuda-8.0, cuda-9.0, and cuda-9.1 using both virtualenv and the system miniconda, and reinstalling TensorFlow.</div><div>Unfortunately, nothing seems to work.</div><div>IIRC, these errors first appeared when the systems were rebooted after spring break, and they have persisted ever since.</div><div><br></div><div>Any help would be appreciated!</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Mar 27, 2018 at 5:35 PM, Predrag Punosevac <span dir="ltr"><<a href="mailto:predragp@andrew.cmu.edu" target="_blank">predragp@andrew.cmu.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">Matthew Barnes <<a href="mailto:mbarnes1@andrew.cmu.edu">mbarnes1@andrew.cmu.edu</a>> wrote:<br>

<br>

> I think this is an issue with the CUDA install. I'm unable to run<br>

> Tensorflow jobs on GPU9 as of last night (have not checked the others, but<br>

> I suspect similar).<br>

<br>

</span>Nothing has changed since last night. The error you are seeing is<br>

TensorFlow complaining about the 390.30 NVIDIA driver, but we upgraded the driver<br>

last week across all servers and IIRC you were able to use TensorFlow<br>

on GPU2, GPU3, and GPU4 after the upgrade.<br>

<br>

The main problem seems to be the cuDNN library, as TensorFlow and PyTorch seem to<br>

expect older versions. Look for them in the CUDA-9.0 directory.<br>

<span class="HOEnZb"><font color="#888888"><br>

Predrag<br>

</font></span><div class="HOEnZb"><div class="h5"><br>

><br>

> 2018-03-26 14:54:49.214493: E<br>

> tensorflow/stream_executor/<wbr>cuda/cuda_driver.cc:406] failed call to cuInit:<br>

> CUDA_ERROR_UNKNOWN<br>

> 2018-03-26 14:54:49.214599: I<br>

> tensorflow/stream_executor/<wbr>cuda/cuda_diagnostics.cc:158] retrieving CUDA<br>

> diagnostic information for host: <a href="http://gpu9.int.autonlab.org" rel="noreferrer" target="_blank">gpu9.int.autonlab.org</a><br>

> 2018-03-26 14:54:49.214617: I<br>

> tensorflow/stream_executor/<wbr>cuda/cuda_diagnostics.cc:165] hostname:<br>

> <a href="http://gpu9.int.autonlab.org" rel="noreferrer" target="_blank">gpu9.int.autonlab.org</a><br>

> 2018-03-26 14:54:49.214685: I<br>

> tensorflow/stream_executor/<wbr>cuda/cuda_diagnostics.cc:189] libcuda reported<br>

> version is: 390.30.0<br>

> 2018-03-26 14:54:49.214747: I<br>

> tensorflow/stream_executor/<wbr>cuda/cuda_diagnostics.cc:193] kernel reported<br>

> version is: 390.30.0<br>

> 2018-03-26 14:54:49.214762: I<br>

> tensorflow/stream_executor/<wbr>cuda/cuda_diagnostics.cc:300] kernel version<br>

> seems to match DSO: 390.30.0<br>

><br>

><br>

> On Tue, Mar 27, 2018 at 1:47 AM Manzil Zaheer <<a href="mailto:manzil@cmu.edu">manzil@cmu.edu</a>> wrote:<br>

><br>

> > Hi Predrag,<br>

> ><br>

> > Thanks for pointing out the links. From the link you provided, we can see<br>

> > that FB engineers mention that "error 30 is usually unrelated to pytorch<br>

> > issues (or your code change)".<br>

> ><br>

> > Thanks,<br>

> > Manzil<br>

> > ______________________________<wbr>__________<br>

> > From: Predrag Punosevac <<a href="mailto:predragp@andrew.cmu.edu">predragp@andrew.cmu.edu</a>><br>

> > Sent: 27 March 2018 01:31<br>

> > To: Manzil Zaheer<br>

> > Cc: Barnabas Poczos; <a href="mailto:users@autonlab.org">users@autonlab.org</a><br>

> > Subject: Re: PyTorch<br>

> ><br>

> > Manzil Zaheer <<a href="mailto:manzil@cmu.edu">manzil@cmu.edu</a>> wrote:<br>

> ><br>

> > > Hi Predrag,<br>

> > ><br>

> > > Thanks again for your help. But I still cannot get anything running on<br>

> > > GPU5, 6, 7, or 9. Also notice that on GPU1, 2, 3, 4, and 8 almost all GPUs are full, while<br>

> > > no one is using GPU5, 6, 7, or 9. This might mean that no one else is able to run<br>

> > > anything either.<br>

> > ><br>

> ><br>

> > GPU7 is off limits; it is used for a special project. How did you figure out that<br>

> > nobody is using it when<br>

> > you can't even log in there?<br>

> ><br>

> > > So I tried many things. Everything installs without issue. But when I<br>

> > > try to run simple code like:<br>

> > ><br>

> ><br>

> > PyTorch is research-grade software. They have a mailing list. Three seconds of<br>

> > Googling reveals<br>

> ><br>

> ><br>

> > <a href="https://github.com/pytorch/pytorch/issues/2527" rel="noreferrer" target="_blank">https://github.com/pytorch/<wbr>pytorch/issues/2527</a><br>

> ><br>

> > also<br>

> ><br>

> ><br>

> > <a href="https://stackoverflow.com/questions/45861767/pytorch-giving-cuda-runtime-error" rel="noreferrer" target="_blank">https://stackoverflow.com/<wbr>questions/45861767/pytorch-<wbr>giving-cuda-runtime-error</a><br>

> ><br>

> > I will look at this more, but it would be helpful if you got on the PyTorch<br>

> > mailing list and asked<br>

> > the developers what they think. I see this once every 9 months; they are<br>

> > looking at these bugs every<br>

> > day.<br>

> ><br>

> > Predrag<br>

> ><br>

> > > import torch<br>

> > > x = torch.cuda.FloatTensor(2,3,4)<br>

> > > print(x)<br>

> > ><br>

> > ><br>

> > > I get the following error:<br>

> > > THCudaCheck FAIL file=/pytorch/torch/lib/THC/<wbr>THCGeneral.c line=70<br>

> > error=30 : unknown error<br>

> > > Traceback (most recent call last):<br>

> > >   File "<stdin>", line 1, in <module><br>

> > >   File<br>

> > "/zfsauton/home/manzilz/.<wbr>local/lib/python3.6/site-<wbr>packages/torch/_utils.py",<br>

> > line 69, in _cuda<br>

> > >     return new_type(self.size()).copy_(<wbr>self, async)<br>

> > >   File<br>

> > "/zfsauton/home/manzilz/.<wbr>local/lib/python3.6/site-<wbr>packages/torch/cuda/__init__.<wbr>py",<br>

> > line 384, in _lazy_new<br>

> > >     _lazy_init()<br>

> > >   File<br>

> > "/zfsauton/home/manzilz/.<wbr>local/lib/python3.6/site-<wbr>packages/torch/cuda/__init__.<wbr>py",<br>

> > line 142, in _lazy_init<br>

> > >     torch._C._cuda_init()<br>

> > > RuntimeError: cuda runtime error (30) : unknown error at<br>

> > /pytorch/torch/lib/THC/<wbr>THCGeneral.c:70<br>

> > ><br>

> > > Thanks,<br>

> > > Manzil<br>

> > ><br>

> > > ______________________________<wbr>__________<br>

> > > From: Predrag Punosevac <<a href="mailto:predragp@andrew.cmu.edu">predragp@andrew.cmu.edu</a>><br>

> > > Sent: 26 March 2018 22:50<br>

> > > To: Manzil Zaheer<br>

> > > Cc: Barnabas Poczos; <a href="mailto:users@autonlab.org">users@autonlab.org</a><br>

> > > Subject: Re: PyTorch<br>

> > ><br>

> > > Manzil Zaheer <<a href="mailto:manzil@cmu.edu">manzil@cmu.edu</a>> wrote:<br>

> > ><br>

> > > > Thanks for the detailed analysis. But I am using PyTorch. I have not<br>

> > > > tried Lua Torch. Can you please check? Thanks again!<br>

> > > ><br>

> > ><br>

> > > I did. You have Python 3.6.4 in /opt/miniconda3/bin/python3.6<br>

> > ><br>

> > > predrag@gpu3$ /opt/miniconda3/bin/python3.6<br>

> > > Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)<br>

> > > [GCC 7.2.0] on linux<br>

> > > Type "help", "copyright", "credits" or "license" for more information.<br>

> > ><br>

> > ><br>

> > > Try reinstalling things in your scratch directory with<br>

> > ><br>

> > > /opt/miniconda3/bin/conda  install pytorch torchvision cuda91 -c pytorch<br>

> > ><br>

> > > You should see something like<br>

> > ><br>

> > > The following packages will be downloaded:<br>

> > ><br>

> > >     package                    |            build<br>

> > >     ---------------------------|--<wbr>---------------<br>

> > >     pillow-5.0.0               |   py36h3deb7b8_0         561 KB<br>

> > >     mkl-2018.0.2               |                1       205.2 MB<br>

> > >     cuda91-1.0                 |       h4c16780_0           3 KB  pytorch<br>

> > >     libpng-1.6.34              |       hb9fc6fc_0         334 KB<br>

> > >     freetype-2.8               |       hab7d2ae_1         804 KB<br>

> > >     libgfortran-ng-7.2.0       |       hdf63c60_3         1.2 MB<br>

> > >     intel-openmp-2018.0.0      |                8         620 KB<br>

> > >     libtiff-4.0.9              |       h28f6b97_0         586 KB<br>

> > >     pytorch-0.3.1              |py36_cuda9.1.85_cudnn7.0.5_2       475.0 MB  pytorch<br>

> > >     torchvision-0.2.0          |   py36h17b6947_1         102 KB  pytorch<br>

> > >     jpeg-9b                    |       h024ee3a_2         248 KB<br>

> > >     numpy-1.14.2               |   py36hdbf6ddf_0         4.0 MB<br>

> > >     olefile-0.45.1             |           py36_0          47 KB<br>

> > >     ------------------------------<wbr>------------------------------<br>

> > >                                            Total:       688.7 MB<br>

> > ><br>

> > ><br>

> > > Make sure you use your scratch directory as the path, since the file server is full. I<br>

> > > got a clean installation but didn't test it further. One thing that worries<br>

> > > me is this line<br>

> > ><br>

> > > pytorch-0.3.1              |py36_cuda9.1.85_cudnn7.0.5_2       475.0 MB  pytorch<br>

> > ><br>

> > > We had problems with cuDNN on 9.1, apparently because upstream was<br>

> > > assuming 7.0.5 when in reality I have 7.1.1 (CUDA 9) or even 7.1.5 (CUDA<br>

> > > 9.1).<br>

> > ><br>

> > > GPU3 has cuDNN library 7.0.5 in cuda-9.0, so try adjusting the conda command<br>

> > > accordingly.<br>

> > ><br>

> > ><br>

> > > Best,<br>

> > > Predrag<br>

> > ><br>

> > ><br>

> > ><br>

> > ><br>

> > ><br>

> > ><br>

> > > ><br>

> > > ><br>


> > > ><br>

> > > ><br>

> > > > -------- Original message --------<br>

> > > > From: Predrag Punosevac <<a href="mailto:predragp@andrew.cmu.edu">predragp@andrew.cmu.edu</a>><br>

> > > > Date: 3/26/18 9:00 PM (GMT-05:00)<br>

> > > > To: Manzil Zaheer <<a href="mailto:manzil@cmu.edu">manzil@cmu.edu</a>><br>

> > > > Cc: Barnabas Poczos <<a href="mailto:bapoczos@andrew.cmu.edu">bapoczos@andrew.cmu.edu</a>>, <a href="mailto:users@autonlab.org">users@autonlab.org</a><br>

> > > > Subject: Re: Lua Torch<br>

> > > ><br>

> > > > Manzil Zaheer <<a href="mailto:manzil@cmu.edu">manzil@cmu.edu</a>> wrote:<br>

> > > ><br>

> > > > > Hi Predrag,<br>

> > > > ><br>

> > > > > I am not able to use any GPUs on gpu5, 6, 7, or 9. I tried all 3 versions<br>

> > > > > of CUDA, but I get the following error:<br>

> > > > ><br>

> > > ><br>

> > > ><br>

> > > > I was able to build it after adding this<br>

> > > ><br>

> > > > export TORCH_NVCC_FLAGS="-D__CUDA_NO_<wbr>HALF_OPERATORS__"<br>

> > > ><br>

> > > > per<br>

> > > ><br>

> > > > <a href="https://github.com/torch/torch7/issues/1086" rel="noreferrer" target="_blank">https://github.com/torch/<wbr>torch7/issues/1086</a><br>

> > > ><br>

> > > > When I try to run it, I get errors that Lua packages are missing (probably<br>

> > > > due to my path variables). I have a vague recollection that Simon and I<br>

> > > > helped you once with this in the past. IIRC it was very picky about<br>

> > > > the version of some Lua package and required their version, not the one<br>

> > > > that comes with yum.<br>

> > > ><br>

> > > > Anyhow, I am forwarding this to users@autonlab in the hope that somebody is using<br>

> > > > it and might be of more help. Please stop by NSH 3119 and let us try to<br>

> > > > debug this.<br>

> > > ><br>

> > > > Predrag<br>

> > > ><br>

> > > ><br>

> > > ><br>

> > > ><br>

> > > > > THCudaCheck FAIL file=/pytorch/torch/lib/THC/<wbr>THCGeneral.c line=70<br>

> > error=30 : unknown error<br>

> > > > > Traceback (most recent call last):<br>

> > > > >   File "<stdin>", line 1, in <module><br>

> > > > >   File<br>

> > "/zfsauton/home/manzilz/local/<wbr>lib/python3.6/site-packages/<wbr>torch/cuda/__init__.py",<br>

> > line 384, in _lazy_new<br>

> > > > >     _lazy_init()<br>

> > > > >   File<br>

> > "/zfsauton/home/manzilz/local/<wbr>lib/python3.6/site-packages/<wbr>torch/cuda/__init__.py",<br>

> > line 142, in _lazy_init<br>

> > > > >     torch._C._cuda_init()<br>

> > > > > RuntimeError: cuda runtime error (30) : unknown error at<br>

> > /pytorch/torch/lib/THC/<wbr>THCGeneral.c:70<br>

> > > > ><br>

> > > > > Can you kindly look into it?<br>

> > > > ><br>

> > > > > Thanks,<br>

> > > > > Manzil<br>

> ><br>

> ><br>

</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div dir="ltr">Barun Patra <div>Master's Student </div><div>Machine Learning Department</div><div>Carnegie Mellon University</div></div></div></div></div></div>

</div>