PyTorch

Tue Mar 27 08:30:13 EDT 2018

I think this is an issue with the CUDA install. I'm unable to run
TensorFlow jobs on GPU9 as of last night (I have not checked the others,
but I suspect they are affected in the same way).

2018-03-26 14:54:49.214493: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_UNKNOWN
2018-03-26 14:54:49.214599: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: gpu9.int.autonlab.org
2018-03-26 14:54:49.214617: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: gpu9.int.autonlab.org
2018-03-26 14:54:49.214685: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 390.30.0
2018-03-26 14:54:49.214747: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 390.30.0
2018-03-26 14:54:49.214762: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 390.30.0
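
Since both TensorFlow and PyTorch fail at CUDA initialization, one way to
take the frameworks out of the picture is to call cuInit from the driver
library directly. What follows is only a minimal sketch: it assumes the
driver library is loadable as libcuda.so.1 (the usual soname on Linux, not
verified on these hosts), and check_cuinit and describe are hypothetical
helper names of mine. Only a few CUresult codes are listed.

```python
import ctypes

# A few CUresult codes from the CUDA driver API header (cuda.h).
CU_RESULTS = {
    0: "CUDA_SUCCESS",
    100: "CUDA_ERROR_NO_DEVICE",
    999: "CUDA_ERROR_UNKNOWN",
}


def describe(code):
    """Return a readable name for a CUresult code (hypothetical helper)."""
    return CU_RESULTS.get(code, "CUresult %d" % code)


def check_cuinit(libname="libcuda.so.1"):
    """Load the driver library and call cuInit(0).

    Returns ("missing", None) if the library cannot be loaded,
    ("ok", 0) on success, or ("failed", code) otherwise.
    The default library name is an assumption.
    """
    try:
        lib = ctypes.CDLL(libname)
    except OSError:
        return ("missing", None)
    lib.cuInit.argtypes = [ctypes.c_uint]
    lib.cuInit.restype = ctypes.c_int
    code = lib.cuInit(0)
    return ("ok" if code == 0 else "failed", code)


if __name__ == "__main__":
    status, code = check_cuinit()
    if status == "missing":
        print("driver library could not be loaded")
    else:
        print("cuInit:", status, describe(code))
```

If this also reports CUDA_ERROR_UNKNOWN on GPU9, the problem would sit
below both frameworks, in the driver or its kernel module, rather than in
anyone's Python environment.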

On Tue, Mar 27, 2018 at 1:47 AM Manzil Zaheer <manzil at cmu.edu> wrote:

> Hi Predrag,
>
> Thanks for pointing out the links. From the link you provided, we can see
> that FB engineers mention that "error 30 is usually unrelated to pytorch
> issues (or your code change)".
>
> Thanks,
> Manzil
> ________________________________________
> From: Predrag Punosevac <predragp at andrew.cmu.edu>
> Sent: 27 March 2018 01:31
> To: Manzil Zaheer
> Cc: Barnabas Poczos; users at autonlab.org
> Subject: Re: PyTorch
>
> Manzil Zaheer <manzil at cmu.edu> wrote:
>
> > Hi Predrag,
> >
> > Thanks again for your help. But I still cannot get anything running on
> > GPU5, 6, 7, or 9. Also notice that almost all GPUs on GPU1, 2, 3, 4, and 8
> > are full, while no one is using GPU5, 6, 7, or 9. This might mean that no
> > one else is able to run anything there either.
> >
>
> GPU7 is off limits; it is used for a special project. How did you figure
> out that nobody is using it when you can't even log in there?
>
> > So I tried many things. Everything installs without issue. But when I
> > try to run simple code like:
> >
>
> PyTorch is research-grade software. They have a mailing list. Three
> seconds of Googling reveals
>
>
> https://github.com/pytorch/pytorch/issues/2527
>
> also
>
>
> https://stackoverflow.com/questions/45861767/pytorch-giving-cuda-runtime-error
>
> I will look at this more, but it would be helpful if you got on the PyTorch
> mailing list and asked the developers what they think. I see this once
> every 9 months; they are looking at these bugs every day.
>
> Predrag
>
> > import torch
> > x = torch.cuda.FloatTensor(2,3,4)
> > print(x)
> >
> >
> > I get the following error:
> > THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 error=30 : unknown error
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in <module>
> >   File "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/_utils.py", line 69, in _cuda
> >     return new_type(self.size()).copy_(self, async)
> >   File "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 384, in _lazy_new
> >     _lazy_init()
> >   File "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 142, in _lazy_init
> >     torch._C._cuda_init()
> > RuntimeError: cuda runtime error (30) : unknown error at /pytorch/torch/lib/THC/THCGeneral.c:70
> >
> > Thanks,
> > Manzil
> >
> > ________________________________________
> > From: Predrag Punosevac <predragp at andrew.cmu.edu>
> > Sent: 26 March 2018 22:50
> > To: Manzil Zaheer
> > Cc: Barnabas Poczos; users at autonlab.org
> > Subject: Re: PyTorch
> >
> > Manzil Zaheer <manzil at cmu.edu> wrote:
> >
> > > Thanks for the detailed analysis. But I am using PyTorch; I have not
> > > tried Lua Torch. Can you please check? Thanks again!
> > >
> >
> > I did. You have Python 3.6.4 in /opt/miniconda3/bin/python3.6
> >
> > predrag at gpu3$ /opt/miniconda3/bin/python3.6
> > Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
> > [GCC 7.2.0] on linux
> > Type "help", "copyright", "credits" or "license" for more information.
> >
> >
> > Try reinstalling things in your scratch directory with
> >
> > /opt/miniconda3/bin/conda  install pytorch torchvision cuda91 -c pytorch
> >
> > You should see something like
> >
> > The following packages will be downloaded:
> >
> >     package                    |            build
> >     ---------------------------|-----------------
> >     pillow-5.0.0               |   py36h3deb7b8_0         561 KB
> >     mkl-2018.0.2               |                1       205.2 MB
> >     cuda91-1.0                 |       h4c16780_0           3 KB  pytorch
> >     libpng-1.6.34              |       hb9fc6fc_0         334 KB
> >     freetype-2.8               |       hab7d2ae_1         804 KB
> >     libgfortran-ng-7.2.0       |       hdf63c60_3         1.2 MB
> >     intel-openmp-2018.0.0      |                8         620 KB
> >     libtiff-4.0.9              |       h28f6b97_0         586 KB
> >     pytorch-0.3.1              |py36_cuda9.1.85_cudnn7.0.5_2       475.0 MB  pytorch
> >     torchvision-0.2.0          |   py36h17b6947_1         102 KB  pytorch
> >     jpeg-9b                    |       h024ee3a_2         248 KB
> >     numpy-1.14.2               |   py36hdbf6ddf_0         4.0 MB
> >     olefile-0.45.1             |           py36_0          47 KB
> >     ------------------------------------------------------------
> >                                            Total:       688.7 MB
> >
> >
> > Make sure you use your scratch directory as the install path, since the
> > file server is full. I got a clean installation but I didn't test it
> > further. One thing that worries me is this line
> >
> > pytorch-0.3.1              |py36_cuda9.1.85_cudnn7.0.5_2       475.0 MB  pytorch
> >
> > We had problems with cudnn on 9.1 apparently because the upstream was
> > assuming 7.0.5 when in reality I have 7.1.1 CUDA 9 or even 7.1.5.  CUDA
> > 9.1
> >
> > GPU3 has cuDNN library 7.0.5 in cuda-9.0, so try adjusting the conda
> > command accordingly.
> >
> >
> > Best,
> > Predrag
> >
> > > -------- Original message --------
> > > From: Predrag Punosevac <predragp at andrew.cmu.edu>
> > > Date: 3/26/18 9:00 PM (GMT-05:00)
> > > To: Manzil Zaheer <manzil at cmu.edu>
> > > Cc: Barnabas Poczos <bapoczos at andrew.cmu.edu>, users at autonlab.org
> > > Subject: Re: Lua Torch
> > >
> > > Manzil Zaheer <manzil at cmu.edu> wrote:
> > >
> > > > Hi Predrag,
> > > >
> > > > I am not able to use any GPUs on gpu5,6,7,9. I tried all 3 versions
> > > > of CUDA, but I get the following error:
> > > >
> > >
> > >
> > > I was able to build it after adding this
> > >
> > > export TORCH_NVCC_FLAGS="-D__CUDA_NO_HALF_OPERATORS__"
> > >
> > > per
> > >
> > > https://github.com/torch/torch7/issues/1086
> > >
> > > When I try to run it I get errors that Lua packages are missing
> > > (probably due to my path variables). I have a vague recollection that
> > > Simon and I helped you once with this in the past. IIRC it was very
> > > picky about the version of some Lua package and required their version,
> > > not the one that comes with yum.
> > >
> > > Anyhow, I am forwarding this to users at autonlab in the hope that
> > > somebody is using it and might be of more help. Please stop by NSH 3119
> > > and let us try to debug this.
> > >
> > > Predrag
> > >
> > >
> > >
> > >
> > > > THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 error=30 : unknown error
> > > > Traceback (most recent call last):
> > > >   File "<stdin>", line 1, in <module>
> > > >   File "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 384, in _lazy_new
> > > >     _lazy_init()
> > > >   File "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 142, in _lazy_init
> > > >     torch._C._cuda_init()
> > > > RuntimeError: cuda runtime error (30) : unknown error at /pytorch/torch/lib/THC/THCGeneral.c:70
> > > >
> > > > Can you kindly look into it?
> > > >
> > > > Thanks,
> > > > Manzil
>
>
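
On the cuDNN concern in the quoted message (the conda build string
py36_cuda9.1.85_cudnn7.0.5_2 versus what is installed on a given host): the
versions a conda build was compiled against can be read out of the build
string and compared with the local cuDNN version before installing. This is
a rough sketch only. It assumes build strings keep the
py<ver>_cuda<ver>_cudnn<ver>_<n> shape seen in this thread, and
parse_build and cudnn_matches are hypothetical helper names.

```python
import re

# Pattern for build strings shaped like "py36_cuda9.1.85_cudnn7.0.5_2".
# The shape is inferred from the conda output in this thread (assumption).
BUILD_RE = re.compile(
    r"py(?P<python>\d+)_cuda(?P<cuda>[\d.]+)_cudnn(?P<cudnn>[\d.]+)_(?P<build>\d+)$"
)


def parse_build(build):
    """Split a conda build string into its version fields."""
    m = BUILD_RE.match(build)
    if m is None:
        raise ValueError("unrecognized build string: %r" % build)
    return m.groupdict()


def cudnn_matches(build, installed_cudnn):
    """True if the build's cuDNN version equals the installed one exactly."""
    return parse_build(build)["cudnn"] == installed_cudnn


if __name__ == "__main__":
    info = parse_build("py36_cuda9.1.85_cudnn7.0.5_2")
    print(info["cuda"], info["cudnn"])
    # The installed version is passed in by hand here; 7.0.5 is what the
    # quoted message reports for cuda-9.0 on GPU3.
    print(cudnn_matches("py36_cuda9.1.85_cudnn7.0.5_2", "7.0.5"))
```

An exact string comparison is deliberately strict; whether a nearby cuDNN
patch release would still work is a separate question this sketch does not
try to answer.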