PyTorch problem

Biswajit Paria bparia at cs.cmu.edu
Wed Sep 5 17:14:10 EDT 2018


I just tried Yichong's way of testing cuBLAS, and got the same error as
earlier:

[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "TITAN Xp" with compute capability 6.1

MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
CUDA error at matrixMulCUBLAS.cpp:275 code=1(CUBLAS_STATUS_NOT_INITIALIZED)
"cublasCreate(&handle)"


So I believe it is not a conda error. I also tried removing .nv, but that
doesn't help either. Maybe someone for whom it works can share their PATH
environment variable?
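
To make that comparison easier, here is a minimal sketch of a script that
anyone could run and paste the output of. It only uses the Python standard
library. The list of variables it reports (beyond PATH) and the assumption
that .nv lives in the user's home directory are my own guesses for
illustration; the function names are hypothetical.

```python
import os
from pathlib import Path

# Variables that commonly influence which CUDA libraries a process picks up.
# This selection is an assumption for illustration, not a definitive list.
ENV_VARS = ("PATH", "LD_LIBRARY_PATH", "CUDA_HOME", "CUDA_VISIBLE_DEVICES")


def collect_env_report(environ=None, home=None):
    """Return a dict describing the CUDA-related parts of a user environment."""
    environ = os.environ if environ is None else environ
    home = Path.home() if home is None else Path(home)
    report = {}
    for name in ENV_VARS:
        value = environ.get(name)
        if name in ("PATH", "LD_LIBRARY_PATH") and value:
            # Split search paths into ordered entries so two reports diff cleanly.
            report[name] = [entry for entry in value.split(os.pathsep) if entry]
        else:
            # Unset variables are reported as None.
            report[name] = value
    # Assumption: .nv is a directory directly under the user's home directory.
    report["nv_dir_exists"] = (home / ".nv").is_dir()
    return report


def format_report(report):
    """Render the report as plain text suitable for pasting into an e-mail."""
    lines = []
    for key, value in report.items():
        if isinstance(value, list):
            lines.append(f"{key}:")
            lines.extend(f"  {entry}" for entry in value)
        else:
            lines.append(f"{key}: {value}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_report(collect_env_report()))
```

Running it from a working account and from a failing one on the same node
and diffing the two outputs should show whether the difference is at the
user account level.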

On Wed, Sep 5, 2018 at 5:08 PM Emre Yolcu <eyolcu at cs.cmu.edu> wrote:

> Manzil, could you share your `conda env export` (or equivalent) output for
> the environment you use for pytorch? It's still not working for me after
> reboot, maybe I can try replicating your exact setup and try with that.
>
> Thanks,
>
> Emre
>
> On Wed, Sep 5, 2018 at 4:56 PM, Predrag Punosevac <predragp at andrew.cmu.edu
> > wrote:
>
>> Manzil Zaheer <manzil at cmu.edu> wrote:
>>
>> > It was working for me before the reboot as well. PyTorch does work on all
>> > nodes for me.
>>
>> Aha! Gotcha.
>>
>> >
>> > What I am trying to say is that I think it is not an issue at the system
>> > level but at the user account level. I might be wrong though.
>>
>> That was my hunch as well. They were trying to convince me in a
>> 150-e-mail chain over the weekend that pytorch was broken when I replaced a
>> failed HDD on the main file server. That didn't make any sense.
>>
>> Could you please share your binaries and setup with other pytorch
>> users?
>>
>> Cheers,
>> Predrag
>>
>> >
>> >
>> > -------- Original message --------
>> > From: Predrag Punosevac <predragp at andrew.cmu.edu>
>> > Date: 9/5/18 4:44 PM (GMT-05:00)
>> > To: Manzil Zaheer <manzil at cmu.edu>
>> > Cc: Biswajit Paria <bparia at cs.cmu.edu>, Yichong Xu <yichongx at cs.cmu.edu>,
>> Emre Yolcu <eyolcu at cs.cmu.edu>, users at autonlab.org
>> > Subject: Re: PyTorch problem
>> >
>> > Should I go ahead and reboot all GPU computing nodes? Can somebody else
>> confirm that a reboot fixes the issue?
>> >
>> > Predrag
>> >
>> > On Wed, Sep 5, 2018 at 4:42 PM, Manzil Zaheer <manzil at cmu.edu> wrote:
>> > It does work for me and my friends
>> >
>> >
>> >
>> >
>> > -------- Original message --------
>> > From: Predrag Punosevac <predragp at andrew.cmu.edu>
>> > Date: 9/5/18 4:40 PM (GMT-05:00)
>> > To: Biswajit Paria <bparia at cs.cmu.edu>
>> > Cc: Manzil Zaheer <manzil at cmu.edu>, Yichong Xu <yichongx at cs.cmu.edu>,
>> Emre Yolcu <eyolcu at cs.cmu.edu>, users at autonlab.org
>> > Subject: Re: PyTorch problem
>> >
>> > I just rebooted GPU8. All packages are up to date. NVidia driver
>> appears to be working properly and I can do GPU computations from MATLAB.
>> Let's try now to get pytorch working on GPU8.
>> >
>> > Predrag
>> >
>> > On Wed, Sep 5, 2018 at 12:19 AM, Biswajit Paria <bparia at cs.cmu.edu> wrote:
>> > I am facing a similar error on all GPU machines. Did someone find a
>> solution yet?
>> >
>> >
>> > 2018-09-05 00:27:41.546064: E tensorflow/stream_executor/cuda/cuda_blas.cc:459] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
>> >
>> > On Tue, Sep 4, 2018 at 10:03 PM Manzil Zaheer <manzil at cmu.edu> wrote:
>> > Hi Yichong
>> >
>> > Yes, I am able to run TF and PyTorch on these machines. Recently someone
>> else also had a similar issue, but it got fixed by reinstalling some local
>> packages.
>> >
>> > Thanks,
>> > Manzil
>> >
>> >
>> > -------- Original message --------
>> > From: Yichong Xu <yichongx at cs.cmu.edu>
>> > Date: 9/4/18 9:58 PM (GMT-05:00)
>> > To: Emre Yolcu <eyolcu at cs.cmu.edu>, Predrag Punosevac
>> <predragp at andrew.cmu.edu>
>> > Cc: users at autonlab.org
>> > Subject: Re: PyTorch problem
>> >
>> > Just wondering - can Tensorflow run well on these machines? I hope
>> someone can confirm this so that we can isolate the problem.
>> > OK, so here's a further test: I tried running the CUDA examples from the
>> CUDA installation (in /usr/local/cuda/sample), on gpu2 in my scratch
>> directory. Simple jobs like deviceQuery succeed, but simpleCUBLAS fails:
>> > yichongx at gpu2$ cd /home/scratch/yichongx/
>> > yichongx at gpu2$ cd
>> > 0_Simple/        2_Graphics/      4_Finance/       6_Advanced/
>> bin/             conda/
>> > 1_Utilities/     3_Imaging/       5_Simulations/   7_CUDALibraries/
>> common/          miniconda3/
>> > yichongx at gpu2$ cd 7_CUDALibraries/
>> > yichongx at gpu2$ cd simpleCUBLAS
>> > yichongx at gpu2$ CUDA_VISIBLE_DEVICES=3 ./simpleCUBLAS
>> > GPU Device 0: "TITAN X (Pascal)" with compute capability 6.1
>> >
>> > simpleCUBLAS test running..
>> > !!!! CUBLAS initialization error
>> > yichongx at gpu2$
>> >
>> >
>> > This is also consistent with our previous errors from pytorch, which
>> say cublas library not initialized.
>> >
>> > So this means at least there is some problem with CUBLAS on gpu2. This
>> post suggests that using sudo can resolve this problem, and this is
>> probably because of some permission problems on CUBLAS libraries:
>> >
>> https://devtalk.nvidia.com/default/topic/1027602/cuda-setup-and-installation/cublas-libraries-with-incorrect-permissions/
>> > @Predrag: Can you try running the simpleCUBLAS example from the CUDA
>> library, with and without root privilege? I think that might be something
>> that you are more familiar with. Thank you very much!
>> >
>> >
>> > Thanks,
>> > Yichong
>> >
>> > On Sep 4, 2018, at 3:18 PM, Emre Yolcu <eyolcu at cs.cmu.edu> wrote:
>> >
>> > Hi,
>> >
>> > We are trying to troubleshoot the PyTorch issue with Predrag and were
>> wondering:
>> >
>> > Is anybody able to run PyTorch GPU models on gpu1-9? If you can, we
>> would appreciate it if you could respond.
>> >
>> > Also, is it a problem for anyone if gpu8 is rebooted today?
>> >
>> > Thanks,
>> >
>> > Emre
>> >
>> >
>> >
>> > --
>> > Biswajit Paria
>> > PhD in ML @ CMU
>> >
>> >
>>
>
>

-- 
Biswajit Paria
PhD in ML @ CMU