PyTorch problem

Thu Sep 6 12:31:51 EDT 2018

Hi,

I have actually been using Manzil's CUDA installation.

While it did help resolve other issues, the recent issue after the file
system shutdown seems to happen with PyTorch only (not with TensorFlow).
Using Manzil's CUDA (the old copy, plus a fresh copy I made after the
problem appeared, just in case) did not resolve the problem.

The problem was only resolved after I went ahead and installed Python and
pip locally.

Given this experience, I suspect that the currently provided conda has
some problem (although the error messages indicate only CUDA errors). Or
maybe it was just a workaround, but it did fix the issue.
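
For anyone comparing setups, here is a rough sketch (not something that was
run as part of this thread) that prints which interpreter is in use and
every cuBLAS library reachable through LD_LIBRARY_PATH, in search order.
The helper name find_shared_libs is made up for illustration, and it only
approximates what the dynamic loader does.

```python
import os
import sys
from pathlib import Path


def find_shared_libs(search_path, prefix="libcublas.so"):
    """Return files whose names start with `prefix`, in search-path order.

    Hypothetical helper: it approximates the loader's directory order only.
    """
    hits = []
    for entry in search_path.split(os.pathsep):
        # An empty entry (e.g. a trailing ':') means the current directory
        # to the dynamic loader; this sketch simply skips it.
        if not entry:
            continue
        directory = Path(entry)
        if not directory.is_dir():
            continue
        for candidate in sorted(directory.iterdir()):
            if candidate.name.startswith(prefix):
                hits.append(candidate)
    return hits


if __name__ == "__main__":
    # Shows whether the conda Python or a locally installed one is running.
    print("interpreter:", sys.executable)
    print("CUDA_HOME:", os.environ.get("CUDA_HOME", "(unset)"))
    for lib in find_shared_libs(os.environ.get("LD_LIBRARY_PATH", "")):
        print("cuBLAS candidate:", lib)
```

Running it once under the conda Python and once under a local install
should show whether the two pick up different interpreters or libraries.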

Cheers!
Jay-Yoon

On Thu, Sep 6, 2018, 11:29 AM Barnabas Poczos <bapoczos at cs.cmu.edu> wrote:

> Hi All,
>
> I'm somewhat confused:
>
> * Do I understand correctly that Manzil is actually using the CUDA
> libraries he installed himself
> (/zfsauton/home/manzilz/local/cuda-9.0/) and not the system libraries
> (/usr/local/cuda/lib64)?
> * Since he is using different CUDA libraries, is that the reason
> pytorch is working for him and not for the other users? If so, should
> we double check the system libraries?
> * Do we know anyone who can use pytorch now with the system CUDA
> libraries? If so, would those users please let us know their system env
> variables?
> * As a quick solution, should we ask Manzil to copy his CUDA libraries
> to a public place where others could access them?
>
> Best,
> Barnabas
>
> ======================
> Barnabas Poczos, PhD
> Associate Professor
> Co-Director of PhD Program
> Machine Learning Department
> Carnegie Mellon University
> On Wed, Sep 5, 2018 at 5:33 PM Manzil Zaheer <manzil at cmu.edu> wrote:
> >
> > Here are my related env variables:
> >
> > CUDA_HOME=/zfsauton/home/manzilz/local/cuda-9.0/
> >
> > LD_LIBRARY_PATH=/zfsauton/home/manzilz/local/lib64:/zfsauton/home/manzilz/local/lib:/zfsauton/home/manzilz/local/cuda-9.0/lib64:/usr/local/cuda/lib64:
> >
> > PATH=/zfsauton/home/manzilz/local/bin:/zfsauton/home/manzilz/.local/bin:/zfsauton/home/manzilz/local/cuda-9.0/bin:/usr/local/cuda/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
> >
> > C_INCLUDE_PATH=/zfsauton/home/manzilz/local/include:
> >
> >
> >
> > From: Biswajit Paria <bparia at cs.cmu.edu>
> > Sent: Wednesday, September 05, 2018 5:29 PM
> > To: Yichong Xu <yichongx at cs.cmu.edu>
> > Cc: Biswajit Paria <bparia at cs.cmu.edu>; eyolcu at cs.cmu.edu; Predrag
> > Punosevac <predragp at andrew.cmu.edu>; Manzil Zaheer <manzil at cmu.edu>;
> > users at autonlab.org
> > Subject: Re: PyTorch problem
> >
> >
> >
> > If the CUDA examples work for anyone, can they share their PATH and
> > LD_LIBRARY_PATH variables?
> >
> >
> >
> > Thanks
> >
> >
> >
> > On Wed, Sep 5, 2018 at 5:27 PM Yichong Xu <yichongx at cs.cmu.edu> wrote:
> >
> > I think, given Biswajit's and my problems with CUDA, we should isolate the
> > problem to just CUDA (and drivers) instead of wandering around Python or
> > pytorch.
> >
> > Predrag, can you test the CUDA examples? I sort of agree with Manzil that
> > this might be a user account problem.
> >
> >
> >
> > Thanks,
> >
> > Yichong
> >
> >
> >
> >
> >
> >
> >
> > On Sep 5, 2018, at 5:14 PM, Biswajit Paria <bparia at cs.cmu.edu> wrote:
> >
> >
> >
> > I just tried Yichong's way of testing cuBLAS, and got the same error as
> > earlier:
> >
> >
> >
> > [Matrix Multiply CUBLAS] - Starting...
> >
> > GPU Device 0: "TITAN Xp" with compute capability 6.1
> >
> >
> >
> > MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
> >
> > CUDA error at matrixMulCUBLAS.cpp:275 code=1(CUBLAS_STATUS_NOT_INITIALIZED) "cublasCreate(&handle)"
> >
> >
> >
> > So I believe it is not a conda error. I also tried removing .nv, but that
> > doesn't help either. Maybe someone can share their PATH env variable?
> >
> >
> >
> > On Wed, Sep 5, 2018 at 5:08 PM Emre Yolcu <eyolcu at cs.cmu.edu> wrote:
> >
> > Manzil, could you share your `conda env export` (or equivalent) output
> > for the environment you use for pytorch? It's still not working for me
> > after reboot, maybe I can try replicating your exact setup and try with
> > that.
> >
> >
> >
> > Thanks,
> >
> >
> >
> > Emre
> >
> >
> >
> > On Wed, Sep 5, 2018 at 4:56 PM, Predrag Punosevac <predragp at andrew.cmu.edu> wrote:
> >
> > Manzil Zaheer <manzil at cmu.edu> wrote:
> >
> > > It was working for me before the reboot as well. PyTorch does work on
> > > all nodes for me.
> >
> > Aha! Gotcha.
> >
> > >
> > > What I am trying to say is that I think it is not an issue at the system
> > > level but at the user account level. I might be wrong though.
> >
> > That was my hunch as well. They were trying to convince me in a
> > 150-e-mail chain over the weekend that pytorch was broken when I replaced a
> > failed HDD on the main file server. That didn't make any sense.
> >
> > Could you please share your binaries and setup with other pytorch
> > users?
> >
> > Cheers,
> > Predrag
> >
> > >
> > >
> > > -------- Original message --------
> > > From: Predrag Punosevac <predragp at andrew.cmu.edu>
> > > Date: 9/5/18 4:44 PM (GMT-05:00)
> > > To: Manzil Zaheer <manzil at cmu.edu>
> > > Cc: Biswajit Paria <bparia at cs.cmu.edu>, Yichong Xu <yichongx at cs.cmu.edu>,
> > > Emre Yolcu <eyolcu at cs.cmu.edu>, users at autonlab.org
> > > Subject: Re: PyTorch problem
> > >
> > > Should I go ahead and reboot all GPU computing nodes? Can somebody
> > > else confirm that a reboot fixes the issue?
> > >
> > > Predrag
> > >
> > > On Wed, Sep 5, 2018 at 4:42 PM, Manzil Zaheer <manzil at cmu.edu> wrote:
> > > It does work for me and my friends
> > >
> > >
> > >
> > >
> > > -------- Original message --------
> > > From: Predrag Punosevac <predragp at andrew.cmu.edu>
> > > Date: 9/5/18 4:40 PM (GMT-05:00)
> > > To: Biswajit Paria <bparia at cs.cmu.edu>
> > > Cc: Manzil Zaheer <manzil at cmu.edu>, Yichong Xu <yichongx at cs.cmu.edu>,
> > > Emre Yolcu <eyolcu at cs.cmu.edu>, users at autonlab.org
> > > Subject: Re: PyTorch problem
> > >
> > > I just rebooted GPU8. All packages are up to date. The NVidia driver
> > > appears to be working properly and I can do GPU computations from MATLAB.
> > > Let's try now to get pytorch working on GPU8.
> > >
> > > Predrag
> > >
> > > On Wed, Sep 5, 2018 at 12:19 AM, Biswajit Paria <bparia at cs.cmu.edu> wrote:
> > > I am facing a similar error on all GPU machines. Did someone find a
> > > solution yet?
> > >
> > >
> > > 2018-09-05 00:27:41.546064: E tensorflow/stream_executor/cuda/cuda_blas.cc:459] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
> > >
> > > On Tue, Sep 4, 2018 at 10:03 PM Manzil Zaheer <manzil at cmu.edu> wrote:
> > > Hi Yichong
> > >
> > > Yes, I am able to run TF and PyTorch on these machines. Recently
> > > someone else also had a similar issue, but it got fixed by reinstalling
> > > some local packages.
> > >
> > > Thanks,
> > > Manzil
> > >
> > >
> > > -------- Original message --------
> > > From: Yichong Xu <yichongx at cs.cmu.edu>
> > > Date: 9/4/18 9:58 PM (GMT-05:00)
> > > To: Emre Yolcu <eyolcu at cs.cmu.edu>, Predrag Punosevac <predragp at andrew.cmu.edu>
> > > Cc: users at autonlab.org
> > > Subject: Re: PyTorch problem
> > >
> > > Just wondering - can TensorFlow run well on these machines? I hope
> > > someone can confirm this so that we can isolate the problem.
> > > OK, so here's a further test: I tried running the CUDA examples from
> > > the CUDA installation (in /usr/local/cuda/sample), on gpu2 in my scratch
> > > directory. Simple jobs like deviceQuery succeed, but simpleCUBLAS failed:
> > > yichongx at gpu2$ cd /home/scratch/yichongx/
> > > yichongx at gpu2$ cd
> > > 0_Simple/        2_Graphics/      4_Finance/       6_Advanced/       bin/             conda/
> > > 1_Utilities/     3_Imaging/       5_Simulations/   7_CUDALibraries/  common/          miniconda3/
> > > yichongx at gpu2$ cd 7_CUDALibraries/
> > > yichongx at gpu2$ cd simpleCUBLAS
> > > yichongx at gpu2$ CUDA_VISIBLE_DEVICES=3 ./simpleCUBLAS
> > > GPU Device 0: "TITAN X (Pascal)" with compute capability 6.1
> > >
> > > simpleCUBLAS test running..
> > > !!!! CUBLAS initialization error
> > > yichongx at gpu2$
> > >
> > >
> > > This is also consistent with our previous errors from pytorch, which
> > > say the cublas library is not initialized.
> > >
> > > So this means there is at least some problem with CUBLAS on gpu2. This
> > > post suggests that using sudo can resolve this problem, which is
> > > probably caused by permission problems on the CUBLAS libraries:
> > >
> > > https://devtalk.nvidia.com/default/topic/1027602/cuda-setup-and-installation/cublas-libraries-with-incorrect-permissions/
> > > @Predrag: Can you try running the simpleCUBLAS example from the CUDA
> > > library, with and without root privilege? I think that might be something
> > > that you are more familiar with. Thank you very much!
> > >
> > >
> > > Thanks,
> > > Yichong
> > >
> >
> > > On Sep 4, 2018, at 3:18 PM, Emre Yolcu <eyolcu at cs.cmu.edu> wrote:
> > >
> > > Hi,
> > >
> > > We are trying to troubleshoot the PyTorch issue with Predrag and were
> > > wondering:
> > >
> > > Is anybody able to run PyTorch GPU models on gpu1-9? If you can, we
> > > would appreciate it if you could respond.
> > >
> > > Also, is it a problem for anyone if gpu8 is rebooted today?
> > >
> > > Thanks,
> > >
> > > Emre
> > >
> > >
> > >
> > > --
> > > Biswajit Paria
> > > PhD in ML @ CMU
> > >
> > >