<div dir="ltr">Has anyone been able to run either TensorFlow or PyTorch on GPU machines 5, 6, or 9?<div>Both give CUDA_ERROR_UNKNOWN errors.</div><div>I tried pointing my PATH at cuda-8.0, cuda-9.0, and cuda-9.1 (and my LD_LIBRARY_PATH at the corresponding lib64), reinstalling PyTorch for cuda-8.0, cuda-9.0, and cuda-9.1 using both virtualenv and the system miniconda, and reinstalling TensorFlow.</div><div>Unfortunately, nothing seems to work.</div><div>IIRC, these errors first appeared when the systems were rebooted after spring break, and have persisted ever since.</div><div><br></div><div>Any help in the matter would be appreciated!</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Mar 27, 2018 at 5:35 PM, Predrag Punosevac <span dir="ltr"><<a href="mailto:predragp@andrew.cmu.edu" target="_blank">predragp@andrew.cmu.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">Matthew Barnes <<a href="mailto:mbarnes1@andrew.cmu.edu">mbarnes1@andrew.cmu.edu</a>> wrote:<br>
<br>
> I think this is an issue with the CUDA install. I'm unable to run<br>
> Tensorflow jobs on GPU9 as of last night (have not checked the others, but<br>
> I suspect similar).<br>
<br>
</span>Nothing has changed since last night. The error you are seeing is<br>
TensorFlow complaining about the 390.30 NVIDIA driver, but we upgraded the driver<br>
across all servers last week and, IIRC, you were able to use TensorFlow<br>
on GPU2, GPU3, and GPU4 after the upgrade.<br>
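<br>
One way to tell a driver-level failure from a framework problem is to call cuInit directly, since that is the call reported as failing in the TensorFlow log. Below is a minimal Python sketch, assuming the NVIDIA driver library can be loaded as libcuda.so.1; the helper name cuda_driver_init_status is made up for illustration.<br>

```python
import ctypes
import ctypes.util


def cuda_driver_init_status():
    """Return the integer result of cuInit(0), or None if libcuda is unavailable."""
    # Assumption: the driver library is discoverable as "cuda" or libcuda.so.1.
    name = ctypes.util.find_library("cuda") or "libcuda.so.1"
    try:
        libcuda = ctypes.CDLL(name)
        # cuInit takes a single unsigned int flags argument, which must be 0.
        return int(libcuda.cuInit(0))
    except (OSError, AttributeError):
        # Library missing, or it does not export cuInit.
        return None


if __name__ == "__main__":
    status = cuda_driver_init_status()
    if status is None:
        print("libcuda could not be loaded on this machine")
    elif status == 0:
        print("cuInit succeeded (CUDA_SUCCESS)")
    else:
        print("cuInit failed with driver error code", status)
```

A result of 0 is CUDA_SUCCESS; 999 is the driver API's CUDA_ERROR_UNKNOWN. If this call already fails outside any framework, reinstalling TensorFlow or PyTorch is unlikely to help.<br>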
<br>
The main problem seems to be the cuDNN library, as TensorFlow and PyTorch seem to<br>
expect older versions. Look for them in the CUDA-9.0 directory.<br>
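<br>
To see which cuDNN a given CUDA directory actually provides, a short Python sketch along these lines may help. It assumes the install lives under /usr/local/cuda-9.0 and that cudnn.h defines CUDNN_MAJOR, CUDNN_MINOR, and CUDNN_PATCHLEVEL as cuDNN 7 headers do; the path and both function names are illustrative, not taken from these machines.<br>

```python
import glob
import os
import re


def parse_cudnn_version(header_text):
    """Extract (major, minor, patch) from the text of cudnn.h, or None if absent."""
    parts = []
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"#define\s+" + key + r"\s+(\d+)", header_text)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


def find_cudnn_libs(cuda_root):
    """List libcudnn shared objects under cuda_root/lib64 (empty list if none)."""
    return sorted(glob.glob(os.path.join(cuda_root, "lib64", "libcudnn.so*")))


if __name__ == "__main__":
    cuda_root = "/usr/local/cuda-9.0"  # assumed path; adjust to the local install
    print(find_cudnn_libs(cuda_root))
    header = os.path.join(cuda_root, "include", "cudnn.h")
    if os.path.exists(header):
        with open(header) as handle:
            print(parse_cudnn_version(handle.read()))
```

Comparing the reported version with the cudnn tag in the framework's build string shows whether the two agree.<br>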
<span class="HOEnZb"><font color="#888888"><br>
Predrag<br>
</font></span><div class="HOEnZb"><div class="h5"><br>
><br>
> 2018-03-26 14:54:49.214493: E<br>
> tensorflow/stream_executor/<wbr>cuda/cuda_driver.cc:406] failed call to cuInit:<br>
> CUDA_ERROR_UNKNOWN<br>
> 2018-03-26 14:54:49.214599: I<br>
> tensorflow/stream_executor/<wbr>cuda/cuda_diagnostics.cc:158] retrieving CUDA<br>
> diagnostic information for host: <a href="http://gpu9.int.autonlab.org" rel="noreferrer" target="_blank">gpu9.int.autonlab.org</a><br>
> 2018-03-26 14:54:49.214617: I<br>
> tensorflow/stream_executor/<wbr>cuda/cuda_diagnostics.cc:165] hostname:<br>
> <a href="http://gpu9.int.autonlab.org" rel="noreferrer" target="_blank">gpu9.int.autonlab.org</a><br>
> 2018-03-26 14:54:49.214685: I<br>
> tensorflow/stream_executor/<wbr>cuda/cuda_diagnostics.cc:189] libcuda reported<br>
> version is: 390.30.0<br>
> 2018-03-26 14:54:49.214747: I<br>
> tensorflow/stream_executor/<wbr>cuda/cuda_diagnostics.cc:193] kernel reported<br>
> version is: 390.30.0<br>
> 2018-03-26 14:54:49.214762: I<br>
> tensorflow/stream_executor/<wbr>cuda/cuda_diagnostics.cc:300] kernel version<br>
> seems to match DSO: 390.30.0<br>
><br>
><br>
> On Tue, Mar 27, 2018 at 1:47 AM Manzil Zaheer <<a href="mailto:manzil@cmu.edu">manzil@cmu.edu</a>> wrote:<br>
><br>
> > Hi Predrag,<br>
> ><br>
> > Thanks for pointing out the links. From the link you provided, we can see<br>
> > that FB engineers mention that "error 30 is usually unrelated to pytorch<br>
> > issues (or your code change)".<br>
> ><br>
> > Thanks,<br>
> > Manzil<br>
> > ______________________________<wbr>__________<br>
> > From: Predrag Punosevac <<a href="mailto:predragp@andrew.cmu.edu">predragp@andrew.cmu.edu</a>><br>
> > Sent: 27 March 2018 01:31<br>
> > To: Manzil Zaheer<br>
> > Cc: Barnabas Poczos; <a href="mailto:users@autonlab.org">users@autonlab.org</a><br>
> > Subject: Re: PyTorch<br>
> ><br>
> > Manzil Zaheer <<a href="mailto:manzil@cmu.edu">manzil@cmu.edu</a>> wrote:<br>
> ><br>
> > > Hi Predrag,<br>
> > ><br>
> > > Thanks again for your help. But I still cannot get anything running on<br>
> > GPU5,6,7,9. Also notice that on GPU1,2,3,4,8 almost all GPUs are full, while<br>
> > no one is using GPU5,6,7,9. This might mean that no one else is able to run<br>
> > anything there either.<br>
> > ><br>
> ><br>
> > GPU7 is off limits; it is reserved for a special project. How did you figure out that<br>
> > nobody is using it when<br>
> > you can't even log in there?<br>
> ><br>
> > > So I tried many things. Everything installs without issue. But when I<br>
> > try to run simple code like:<br>
> > ><br>
> ><br>
> > PyTorch is research-grade software. They have a mailing list. Three seconds of<br>
> > Googling reveals<br>
> ><br>
> ><br>
> > <a href="https://github.com/pytorch/pytorch/issues/2527" rel="noreferrer" target="_blank">https://github.com/pytorch/<wbr>pytorch/issues/2527</a><br>
> ><br>
> > also<br>
> ><br>
> ><br>
> > <a href="https://stackoverflow.com/questions/45861767/pytorch-giving-cuda-runtime-error" rel="noreferrer" target="_blank">https://stackoverflow.com/<wbr>questions/45861767/pytorch-<wbr>giving-cuda-runtime-error</a><br>
> ><br>
> > I will look at this more, but it would be helpful if you got on the PyTorch<br>
> > mailing list and asked<br>
> > the developers what they think. I see this once every 9 months; they are<br>
> > looking at these bugs every<br>
> > day.<br>
> ><br>
> > Predrag<br>
> ><br>
> > > import torch<br>
> > > x = torch.cuda.FloatTensor(2,3,4)<br>
> > > print(x)<br>
> > ><br>
> > ><br>
> > > I get the following error:<br>
> > > THCudaCheck FAIL file=/pytorch/torch/lib/THC/<wbr>THCGeneral.c line=70<br>
> > error=30 : unknown error<br>
> > > Traceback (most recent call last):<br>
> > > File "<stdin>", line 1, in <module><br>
> > > File<br>
> > "/zfsauton/home/manzilz/.<wbr>local/lib/python3.6/site-<wbr>packages/torch/_utils.py",<br>
> > line 69, in _cuda<br>
> > > return new_type(self.size()).copy_(<wbr>self, async)<br>
> > > File<br>
> > "/zfsauton/home/manzilz/.<wbr>local/lib/python3.6/site-<wbr>packages/torch/cuda/__init__.<wbr>py",<br>
> > line 384, in _lazy_new<br>
> > > _lazy_init()<br>
> > > File<br>
> > "/zfsauton/home/manzilz/.<wbr>local/lib/python3.6/site-<wbr>packages/torch/cuda/__init__.<wbr>py",<br>
> > line 142, in _lazy_init<br>
> > > torch._C._cuda_init()<br>
> > > RuntimeError: cuda runtime error (30) : unknown error at<br>
> > /pytorch/torch/lib/THC/<wbr>THCGeneral.c:70<br>
> > ><br>
> > > Thanks,<br>
> > > Manzil<br>
> > ><br>
> > > ______________________________<wbr>__________<br>
> > > From: Predrag Punosevac <<a href="mailto:predragp@andrew.cmu.edu">predragp@andrew.cmu.edu</a>><br>
> > > Sent: 26 March 2018 22:50<br>
> > > To: Manzil Zaheer<br>
> > > Cc: Barnabas Poczos; <a href="mailto:users@autonlab.org">users@autonlab.org</a><br>
> > > Subject: Re: PyTorch<br>
> > ><br>
> > > Manzil Zaheer <<a href="mailto:manzil@cmu.edu">manzil@cmu.edu</a>> wrote:<br>
> > ><br>
> > > > Thanks for the detailed analysis. But I am using pytorch. I have not<br>
> > tried Lua torch. Can you please check? Thanks again!<br>
> > > ><br>
> > ><br>
> > > I did. You have Python 3.6.4 in /opt/miniconda3/bin/python3.6<br>
> > ><br>
> > > predrag@gpu3$ /opt/miniconda3/bin/python3.6<br>
> > > Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)<br>
> > > [GCC 7.2.0] on linux<br>
> > > Type "help", "copyright", "credits" or "license" for more information.<br>
> > ><br>
> > ><br>
> > > Try reinstalling things in your scratch directory with<br>
> > ><br>
> > > /opt/miniconda3/bin/conda install pytorch torchvision cuda91 -c pytorch<br>
> > ><br>
> > > You should see something like<br>
> > ><br>
> > > The following packages will be downloaded:<br>
> > ><br>
> > > package | build<br>
> > > ---------------------------|--<wbr>---------------<br>
> > > pillow-5.0.0 | py36h3deb7b8_0 561 KB<br>
> > > mkl-2018.0.2 | 1 205.2 MB<br>
> > > cuda91-1.0 | h4c16780_0 3 KB pytorch<br>
> > > libpng-1.6.34 | hb9fc6fc_0 334 KB<br>
> > > freetype-2.8 | hab7d2ae_1 804 KB<br>
> > > libgfortran-ng-7.2.0 | hdf63c60_3 1.2 MB<br>
> > > intel-openmp-2018.0.0 | 8 620 KB<br>
> > > libtiff-4.0.9 | h28f6b97_0 586 KB<br>
> > > pytorch-0.3.1 | py36_cuda9.1.85_cudnn7.0.5_2 475.0 MB pytorch<br>
> > > torchvision-0.2.0 | py36h17b6947_1 102 KB pytorch<br>
> > > jpeg-9b | h024ee3a_2 248 KB<br>
> > > numpy-1.14.2 | py36hdbf6ddf_0 4.0 MB<br>
> > > olefile-0.45.1 | py36_0 47 KB<br>
> > > ------------------------------<wbr>------------------------------<br>
> > > Total: 688.7 MB<br>
> > ><br>
> > ><br>
> > > Make sure you use your scratch directory as the install path, since the file server is full. I<br>
> > > got a clean installation but didn't test it further. One thing that worries<br>
> > > me is this line<br>
> > ><br>
> > > pytorch-0.3.1 | py36_cuda9.1.85_cudnn7.0.5_2 475.0 MB pytorch<br>
> > ><br>
> > > We had problems with cuDNN on 9.1, apparently because upstream was<br>
> > > assuming 7.0.5 when in reality I have 7.1.1 for CUDA 9 or even 7.1.5 for CUDA<br>
> > > 9.1.<br>
> > ><br>
> > > GPU3 has cuDNN library 7.0.5 in cuda-9.0, so try adjusting the conda command<br>
> > > accordingly.<br>
> > ><br>
> > ><br>
> > > Best,<br>
> > > Predrag<br>
> > ><br>
> > ><br>
> > ><br>
> > ><br>
> > ><br>
> > ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > -------- Original message --------<br>
> > > > From: Predrag Punosevac <<a href="mailto:predragp@andrew.cmu.edu">predragp@andrew.cmu.edu</a>><br>
> > > > Date: 3/26/18 9:00 PM (GMT-05:00)<br>
> > > > To: Manzil Zaheer <<a href="mailto:manzil@cmu.edu">manzil@cmu.edu</a>><br>
> > > > Cc: Barnabas Poczos <<a href="mailto:bapoczos@andrew.cmu.edu">bapoczos@andrew.cmu.edu</a>>, <a href="mailto:users@autonlab.org">users@autonlab.org</a><br>
> > > > Subject: Re: Lua Torch<br>
> > > ><br>
> > > > Manzil Zaheer <<a href="mailto:manzil@cmu.edu">manzil@cmu.edu</a>> wrote:<br>
> > > ><br>
> > > > > Hi Predrag,<br>
> > > > ><br>
> > > > > I am not able to use any GPUs on gpu5,6,7,9. I tried all 3 versions<br>
> > of cuda, but I get the following error:<br>
> > > > ><br>
> > > ><br>
> > > ><br>
> > > > I was able to build it after adding this<br>
> > > ><br>
> > > > export TORCH_NVCC_FLAGS="-D__CUDA_NO_<wbr>HALF_OPERATORS__"<br>
> > > ><br>
> > > > per<br>
> > > ><br>
> > > > <a href="https://github.com/torch/torch7/issues/1086" rel="noreferrer" target="_blank">https://github.com/torch/<wbr>torch7/issues/1086</a><br>
> > > ><br>
> > > > When I try to run it, I get errors that Lua packages are missing<br>
> > (probably<br>
> > > > due to my path variables). I have a vague recollection that Simon and I<br>
> > > > helped you once with this in the past. IIRC it was very picky<br>
> > about<br>
> > > > the version of some Lua package and required their version, not the one<br>
> > > > that comes with yum.<br>
> > > ><br>
> > > > Anyhow, I am forwarding this to users@autonlab in the hope that somebody is<br>
> > using<br>
> > > > it and might be of more help. Please stop by NSH 3119 and let us try to<br>
> > > > debug this.<br>
> > > ><br>
> > > > Predrag<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > > THCudaCheck FAIL file=/pytorch/torch/lib/THC/<wbr>THCGeneral.c line=70<br>
> > error=30 : unknown error<br>
> > > > > Traceback (most recent call last):<br>
> > > > > File "<stdin>", line 1, in <module><br>
> > > > > File<br>
> > "/zfsauton/home/manzilz/local/<wbr>lib/python3.6/site-packages/<wbr>torch/cuda/__init__.py",<br>
> > line 384, in _lazy_new<br>
> > > > > _lazy_init()<br>
> > > > > File<br>
> > "/zfsauton/home/manzilz/local/<wbr>lib/python3.6/site-packages/<wbr>torch/cuda/__init__.py",<br>
> > line 142, in _lazy_init<br>
> > > > > torch._C._cuda_init()<br>
> > > > > RuntimeError: cuda runtime error (30) : unknown error at<br>
> > /pytorch/torch/lib/THC/<wbr>THCGeneral.c:70<br>
> > > > ><br>
> > > > > Can you kindly look into it?<br>
> > > > ><br>
> > > > > Thanks,<br>
> > > > > Manzil<br>
> ><br>
> ><br>
</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div dir="ltr">Barun Patra <div>Master's Student </div><div>Machine Learning Department</div><div>Carnegie Mellon University</div></div></div></div></div></div>
</div>