PyTorch problem
Biswajit Paria
bparia at cs.cmu.edu
Wed Sep 5 17:28:56 EDT 2018
If the CUDA examples work for anyone, can they share their PATH and
LD_LIBRARY_PATH variables?
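
For anyone replying, here is a minimal sketch (Python, standard library only) that prints both variables and lists any libcublas* files found on LD_LIBRARY_PATH together with their permission bits. The function names are made up for illustration, and the libcublas* file pattern is an assumption about how the libraries are named on our nodes:

```python
import os
import stat
from pathlib import Path


def split_search_path(value):
    """Split a search-path string (colon-separated on Linux) into non-empty entries."""
    return [entry for entry in value.split(os.pathsep) if entry]


def find_cublas(directories):
    """Return (path, permission string) for each libcublas* file in the given directories."""
    found = []
    for directory in directories:
        root = Path(directory)
        if not root.is_dir():
            continue
        for lib in sorted(root.glob("libcublas*")):
            try:
                mode = stat.filemode(lib.stat().st_mode)
            except OSError:
                # Broken symlink or unreadable file: still report it.
                mode = "??????????"
            found.append((str(lib), mode))
    return found


def report(environ=os.environ):
    """Print PATH and LD_LIBRARY_PATH entries, then any cuBLAS libraries found."""
    for name in ("PATH", "LD_LIBRARY_PATH"):
        print(f"{name}:")
        for entry in split_search_path(environ.get(name, "")):
            marker = "" if Path(entry).is_dir() else "  (missing)"
            print(f"  {entry}{marker}")
    libs = find_cublas(split_search_path(environ.get("LD_LIBRARY_PATH", "")))
    print("cuBLAS libraries on LD_LIBRARY_PATH:")
    for path, mode in libs:
        print(f"  {mode} {path}")
    if not libs:
        print("  (none found)")


if __name__ == "__main__":
    report()
```

Running it on a node where the CUDA samples work and on one where they fail, then comparing the two outputs, should show whether the accounts differ in search paths or in library permissions. It only inspects LD_LIBRARY_PATH, so libraries picked up through the system linker cache will not appear.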
Thanks
On Wed, Sep 5, 2018 at 5:27 PM Yichong Xu <yichongx at cs.cmu.edu> wrote:
> Given Biswajit’s and my problems with CUDA, I think we should isolate the
> problem to CUDA itself (and the drivers) instead of wandering around Python
> or PyTorch.
> Predrag, can you test the CUDA examples? I sort of agree with Manzil that
> this might be a user account problem.
>
> *Thanks,*
> *Yichong*
>
>
>
> On Sep 5, 2018, at 5:14 PM, Biswajit Paria <bparia at cs.cmu.edu> wrote:
>
> I just tried Yichong's way of testing cuBLAS, and get the same error as
> earlier:
>
> [Matrix Multiply CUBLAS] - Starting...
> GPU Device 0: "TITAN Xp" with compute capability 6.1
>
> MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
> CUDA error at matrixMulCUBLAS.cpp:275
> code=1(CUBLAS_STATUS_NOT_INITIALIZED) "cublasCreate(&handle)"
>
>
> So I believe it is not a conda error. I also tried removing .nv, which
> doesn't help either. Maybe someone can share their PATH env variable?
>
> On Wed, Sep 5, 2018 at 5:08 PM Emre Yolcu <eyolcu at cs.cmu.edu> wrote:
>
>> Manzil, could you share your `conda env export` (or equivalent) output
>> for the environment you use for pytorch? It's still not working for me
>> after reboot, maybe I can try replicating your exact setup and try with
>> that.
>>
>> Thanks,
>>
>> Emre
>>
>> On Wed, Sep 5, 2018 at 4:56 PM, Predrag Punosevac <
>> predragp at andrew.cmu.edu> wrote:
>>
>>> Manzil Zaheer <manzil at cmu.edu> wrote:
>>>
>>> > It was working for me before the reboot as well. PyTorch does work on
>>> > all nodes for me.
>>>
>>> Aha! Gotcha.
>>>
>>> >
>>> > What I am trying to say is that I think it is not an issue at the
>>> > system level but at the user account level. I might be wrong though.
>>>
>>> That was my hunch as well. They were trying to convince me in a
>>> 150-e-mail chain over the weekend that PyTorch broke when I replaced a
>>> failed HDD on the main file server. That didn't make any sense.
>>>
>>> Could you please share your binaries and setup with other PyTorch
>>> users?
>>>
>>> Cheers,
>>> Predrag
>>>
>>> >
>>> >
>>> > -------- Original message --------
>>> > From: Predrag Punosevac <predragp at andrew.cmu.edu>
>>> > Date: 9/5/18 4:44 PM (GMT-05:00)
>>> > To: Manzil Zaheer <manzil at cmu.edu>
>>> > Cc: Biswajit Paria <bparia at cs.cmu.edu>, Yichong Xu <yichongx at cs.cmu.edu>,
>>> > Emre Yolcu <eyolcu at cs.cmu.edu>, users at autonlab.org
>>> > Subject: Re: PyTorch problem
>>> >
>>> > Should I go ahead and reboot all GPU computing nodes? Can somebody
>>> else confirm that a reboot fixes the issue?
>>> >
>>> > Predrag
>>> >
>>> > On Wed, Sep 5, 2018 at 4:42 PM, Manzil Zaheer <manzil at cmu.edu> wrote:
>>> > It does work for me and my friends
>>> >
>>> >
>>> >
>>> >
>>> > -------- Original message --------
>>> > From: Predrag Punosevac <predragp at andrew.cmu.edu>
>>> > Date: 9/5/18 4:40 PM (GMT-05:00)
>>> > To: Biswajit Paria <bparia at cs.cmu.edu>
>>> > Cc: Manzil Zaheer <manzil at cmu.edu>, Yichong Xu <yichongx at cs.cmu.edu>,
>>> > Emre Yolcu <eyolcu at cs.cmu.edu>, users at autonlab.org
>>> > Subject: Re: PyTorch problem
>>> >
>>> > I just rebooted GPU8. All packages are up to date. The NVIDIA driver
>>> > appears to be working properly and I can do GPU computations from
>>> > MATLAB. Let's try now to get PyTorch working on GPU8.
>>> >
>>> > Predrag
>>> >
>>> > On Wed, Sep 5, 2018 at 12:19 AM, Biswajit Paria <bparia at cs.cmu.edu> wrote:
>>> > I am facing a similar error on all GPU machines. Did someone find a
>>> solution yet?
>>> >
>>> >
>>> > 2018-09-05 00:27:41.546064: E tensorflow/stream_executor/cuda/cuda_blas.cc:459] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
>>> >
>>> > On Tue, Sep 4, 2018 at 10:03 PM Manzil Zaheer <manzil at cmu.edu> wrote:
>>> > Hi Yichong
>>> >
>>> > Yes, I am able to run TF and PyTorch on these machines. Recently
>>> > someone else also had a similar issue, but it got fixed by reinstalling
>>> > some local packages.
>>> >
>>> > Thanks,
>>> > Manzil
>>> >
>>> >
>>> > -------- Original message --------
>>> > From: Yichong Xu <yichongx at cs.cmu.edu>
>>> > Date: 9/4/18 9:58 PM (GMT-05:00)
>>> > To: Emre Yolcu <eyolcu at cs.cmu.edu>, Predrag Punosevac <predragp at andrew.cmu.edu>
>>> > Cc: users at autonlab.org
>>> > Subject: Re: PyTorch problem
>>> >
>>> > Just wondering: can TensorFlow run well on these machines? I hope
>>> > someone can confirm this so that we can isolate the problem.
>>> > OK, so here's a further test: I tried running the CUDA examples from
>>> > the CUDA installation (in /usr/local/cuda/sample), on gpu2 in my scratch
>>> > directory. Simple jobs like deviceQuery succeed, but simpleCUBLAS fails:
>>> > yichongx at gpu2$ cd /home/scratch/yichongx/
>>> > yichongx at gpu2$ cd
>>> > 0_Simple/     2_Graphics/  4_Finance/      6_Advanced/       bin/     conda/
>>> > 1_Utilities/  3_Imaging/   5_Simulations/  7_CUDALibraries/  common/  miniconda3/
>>> > yichongx at gpu2$ cd 7_CUDALibraries/
>>> > yichongx at gpu2$ cd simpleCUBLAS
>>> > yichongx at gpu2$ CUDA_VISIBLE_DEVICES=3 ./simpleCUBLAS
>>> > GPU Device 0: "TITAN X (Pascal)" with compute capability 6.1
>>> >
>>> > simpleCUBLAS test running..
>>> > !!!! CUBLAS initialization error
>>> > yichongx at gpu2$
>>> >
>>> >
>>> > This is also consistent with our previous errors from PyTorch, which
>>> > say the cuBLAS library is not initialized.
>>> >
>>> > So this means there is at least some problem with cuBLAS on gpu2. This
>>> > post suggests that using sudo can resolve the problem, probably because
>>> > of permission problems on the cuBLAS libraries:
>>> > https://devtalk.nvidia.com/default/topic/1027602/cuda-setup-and-installation/cublas-libraries-with-incorrect-permissions/
>>> > @Predrag: Can you try running the simpleCUBLAS example from the CUDA
>>> > library, with and without root privileges? I think that might be
>>> > something that you are more familiar with. Thank you very much!
>>> >
>>> >
>>> > Thanks,
>>> > Yichong
>>> >
>>> > On Sep 4, 2018, at 3:18 PM, Emre Yolcu <eyolcu at cs.cmu.edu> wrote:
>>> >
>>> > Hi,
>>> >
>>> > We are trying to troubleshoot the PyTorch issue with Predrag and were
>>> > wondering:
>>> >
>>> > Is anybody able to run PyTorch GPU models on gpu1-9? If you can, we
>>> > would appreciate it if you could respond.
>>> >
>>> > Also, is it a problem for anyone if gpu8 is rebooted today?
>>> >
>>> > Thanks,
>>> >
>>> > Emre
>>> >
>>> >
>>> >
>>> > --
>>> > Biswajit Paria
>>> > PhD in ML @ CMU
>>> >
>>> >
>>>
>>
>>
>
> --
> Biswajit Paria
> PhD in ML @ CMU
>
>
>
--
Biswajit Paria
PhD in ML @ CMU