PyTorch problem

Emre Yolcu eyolcu at cs.cmu.edu
Wed Sep 5 17:07:56 EDT 2018


Manzil, could you share your `conda env export` (or equivalent) output for
the environment you use for pytorch? It's still not working for me after
the reboot, so maybe I can replicate your exact setup and try with that.
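
In case it is useful for the comparison, below is a minimal sketch of what
could be run under both accounts. It is only an illustration: the function
names `collect_env_report` and `cublas_smoke_test` are made up here, and it
assumes a PyTorch version that accepts `device="cuda"` (0.4 or later). The
small matrix multiply is there because PyTorch normally dispatches it to
cuBLAS on the GPU, so it should exercise the same initialization path as the
cublas error reported in this thread.

```python
import os
import platform
import sys


def collect_env_report():
    """Gather version details that matter when comparing two setups."""
    report = {
        "python": platform.python_version(),
        "executable": sys.executable,
        # Per-user settings that can differ between accounts on the same node.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH"),
    }
    try:
        import torch
    except ImportError:
        report["torch"] = None  # PyTorch is not installed in this environment
        return report

    report["torch"] = torch.__version__
    report["torch_cuda"] = torch.version.cuda  # CUDA version PyTorch was built with
    report["cuda_available"] = torch.cuda.is_available()
    return report


def cublas_smoke_test():
    """Run a tiny GPU matrix multiply (assumed to go through cuBLAS)."""
    try:
        import torch
    except ImportError:
        return "skipped: torch not installed"
    if not torch.cuda.is_available():
        return "skipped: no CUDA device visible"
    try:
        a = torch.ones(4, 4, device="cuda")
        total = (a @ a).sum().item()  # every entry is 4, so the sum is 64
    except RuntimeError as err:
        return "failed: {}".format(err)
    return "ok" if total == 64.0 else "unexpected result: {}".format(total)


if __name__ == "__main__":
    for key, value in sorted(collect_env_report().items()):
        print("{}: {}".format(key, value))
    print("cublas:", cublas_smoke_test())
```

Diffing the printed output from a working and a non-working account should
show whether the difference is in the user environment rather than on the
node.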

Thanks,

Emre

On Wed, Sep 5, 2018 at 4:56 PM, Predrag Punosevac <predragp at andrew.cmu.edu>
wrote:

> Manzil Zaheer <manzil at cmu.edu> wrote:
>
> > It was working for me before the reboot as well. PyTorch does work on all
> > nodes for me.
>
> Aha! Gotcha.
>
> >
> > What I am trying to say is that I think it is not an issue at the system
> > level but at the user account level. I might be wrong though.
>
> That was my hunch as well. They were trying to convince me in a
> 150-e-mail chain over the weekend that pytorch broke when I replaced a
> failed HDD on the main file server. That didn't make any sense.
>
> Could you please share your binaries and setup with other pytorch
> users?
>
> Cheers,
> Predrag
>
> >
> >
> > -------- Original message --------
> > From: Predrag Punosevac <predragp at andrew.cmu.edu>
> > Date: 9/5/18 4:44 PM (GMT-05:00)
> > To: Manzil Zaheer <manzil at cmu.edu>
> > Cc: Biswajit Paria <bparia at cs.cmu.edu>, Yichong Xu <yichongx at cs.cmu.edu>,
> > Emre Yolcu <eyolcu at cs.cmu.edu>, users at autonlab.org
> > Subject: Re: PyTorch problem
> >
> > Should I go ahead and reboot all GPU computing nodes? Can somebody else
> > confirm that a reboot fixes the issue?
> >
> > Predrag
> >
> > On Wed, Sep 5, 2018 at 4:42 PM, Manzil Zaheer <manzil at cmu.edu> wrote:
> > It does work for me and my friends.
> >
> >
> >
> >
> > -------- Original message --------
> > From: Predrag Punosevac <predragp at andrew.cmu.edu>
> > Date: 9/5/18 4:40 PM (GMT-05:00)
> > To: Biswajit Paria <bparia at cs.cmu.edu>
> > Cc: Manzil Zaheer <manzil at cmu.edu>, Yichong Xu <yichongx at cs.cmu.edu>,
> > Emre Yolcu <eyolcu at cs.cmu.edu>, users at autonlab.org
> > Subject: Re: PyTorch problem
> >
> > I just rebooted GPU8. All packages are up to date. The NVIDIA driver appears
> > to be working properly and I can do GPU computations from MATLAB. Let's try
> > now to get pytorch working on GPU8.
> >
> > Predrag
> >
> > On Wed, Sep 5, 2018 at 12:19 AM, Biswajit Paria <bparia at cs.cmu.edu> wrote:
> > I am facing a similar error on all GPU machines. Did someone find a
> > solution yet?
> >
> >
> > 2018-09-05 00:27:41.546064: E tensorflow/stream_executor/cuda/cuda_blas.cc:459] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
> >
> > On Tue, Sep 4, 2018 at 10:03 PM Manzil Zaheer <manzil at cmu.edu> wrote:
> > Hi Yichong
> >
> > Yes, I am able to run TF and PyTorch on these machines. Recently someone
> > else also had a similar issue, but it got fixed by reinstalling some local
> > packages.
> >
> > Thanks,
> > Manzil
> >
> >
> > -------- Original message --------
> > From: Yichong Xu <yichongx at cs.cmu.edu>
> > Date: 9/4/18 9:58 PM (GMT-05:00)
> > To: Emre Yolcu <eyolcu at cs.cmu.edu>, Predrag Punosevac <predragp at andrew.cmu.edu>
> > Cc: users at autonlab.org
> > Subject: Re: PyTorch problem
> >
> > Just wondering - can Tensorflow run well on these machines? I hope
> > someone can confirm this so that we can isolate the problem.
> > OK so here's a further test: I tried running the cuda examples from the
> > cuda installation (in /usr/local/cuda/sample), on gpu2 in my scratch
> > directory. Simple jobs like deviceQuery succeed, but simpleCUBLAS failed:
> > yichongx at gpu2$ cd /home/scratch/yichongx/
> > yichongx at gpu2$ cd
> > 0_Simple/        2_Graphics/      4_Finance/       6_Advanced/      bin/             conda/
> > 1_Utilities/     3_Imaging/       5_Simulations/   7_CUDALibraries/ common/          miniconda3/
> > yichongx at gpu2$ cd 7_CUDALibraries/
> > yichongx at gpu2$ cd simpleCUBLAS
> > yichongx at gpu2$ CUDA_VISIBLE_DEVICES=3 ./simpleCUBLAS
> > GPU Device 0: "TITAN X (Pascal)" with compute capability 6.1
> >
> > simpleCUBLAS test running..
> > !!!! CUBLAS initialization error
> > yichongx at gpu2$
> >
> >
> > This is also consistent with our previous errors from pytorch, which say
> > cublas library not initialized.
> >
> > So this means there is at least some problem with CUBLAS on gpu2. This
> > post suggests that using sudo can resolve this problem, and this is
> > probably because of some permission problems on CUBLAS libraries:
> > https://devtalk.nvidia.com/default/topic/1027602/cuda-setup-and-installation/cublas-libraries-with-incorrect-permissions/
> > @Predrag: Can you try running the simpleCUBLAS example from the CUDA
> > library, with and without root privileges? I think that might be something
> > that you are more familiar with. Thank you very much!
> >
> >
> > Thanks,
> > Yichong
> >
> > On Sep 4, 2018, at 3:18 PM, Emre Yolcu <eyolcu at cs.cmu.edu> wrote:
> >
> > Hi,
> >
> > We are trying to troubleshoot the PyTorch issue with Predrag and were
> > wondering:
> >
> > Is anybody able to run PyTorch GPU models on gpu1-9? If you can, we
> > would appreciate it if you could respond.
> >
> > Also, is it a problem for anyone if gpu8 is rebooted today?
> >
> > Thanks,
> >
> > Emre
> >
> >
> >
> > --
> > Biswajit Paria
> > PhD in ML @ CMU
> >
> >
>