PyTorch problem

Predrag Punosevac predragp at andrew.cmu.edu
Wed Sep 5 16:44:26 EDT 2018


Should I go ahead and reboot all GPU computing nodes? Can somebody else
confirm that a reboot fixes the issue?

Predrag
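
One way for users to confirm whether a given node works after a reboot is a small script that forces a matrix multiply on each visible GPU, since a float matmul on CUDA is normally dispatched to cuBLAS. The following is only a sketch, assuming PyTorch is installed in the user's environment; the helper name check_gpus is hypothetical and not part of any library.

```python
# Minimal sketch: try a cuBLAS-backed operation on every visible GPU.
# check_gpus is a hypothetical helper name, not a PyTorch API.

def check_gpus():
    """Return a {device: status} dict; status is "ok" or an error string."""
    try:
        import torch
    except ImportError:
        return {"torch": "not installed"}
    if not torch.cuda.is_available():
        return {"cuda": "not available"}
    results = {}
    for i in range(torch.cuda.device_count()):
        name = "cuda:%d" % i
        try:
            a = torch.randn(64, 64, device=name)
            # The matrix multiply should fail here if cuBLAS cannot initialize.
            (a @ a).sum().item()
            results[name] = "ok"
        except RuntimeError as exc:
            results[name] = "failed: %s" % exc
    return results


if __name__ == "__main__":
    for device, status in check_gpus().items():
        print(device, status)
```

Running it once per node (optionally with CUDA_VISIBLE_DEVICES set, as in the simpleCUBLAS test quoted below) gives a per-device result that can be pasted back to the list.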

On Wed, Sep 5, 2018 at 4:42 PM, Manzil Zaheer <manzil at cmu.edu> wrote:

> It does work for me and my friends
>
>
>
>
> -------- Original message --------
> From: Predrag Punosevac <predragp at andrew.cmu.edu>
> Date: 9/5/18 4:40 PM (GMT-05:00)
> To: Biswajit Paria <bparia at cs.cmu.edu>
> Cc: Manzil Zaheer <manzil at cmu.edu>, Yichong Xu <yichongx at cs.cmu.edu>,
> Emre Yolcu <eyolcu at cs.cmu.edu>, users at autonlab.org
> Subject: Re: PyTorch problem
>
> I just rebooted GPU8. All packages are up to date. The NVIDIA driver appears
> to be working properly, and I can do GPU computations from MATLAB. Let's try
> now to get PyTorch working on GPU8.
>
> Predrag
>
> On Wed, Sep 5, 2018 at 12:19 AM, Biswajit Paria <bparia at cs.cmu.edu> wrote:
>
>> I am facing a similar error on all GPU machines. Did someone find a
>> solution yet?
>>
>> 2018-09-05 00:27:41.546064: E tensorflow/stream_executor/cuda/cuda_blas.cc:459]
>> failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
>>
>> On Tue, Sep 4, 2018 at 10:03 PM Manzil Zaheer <manzil at cmu.edu> wrote:
>>
>>> Hi Yichong
>>>
>>> Yes, I am able to run TF and PyTorch on these machines. Recently someone
>>> else also had a similar issue, but it got fixed by reinstalling some local
>>> packages.
>>>
>>> Thanks,
>>> Manzil
>>>
>>>
>>> -------- Original message --------
>>> From: Yichong Xu <yichongx at cs.cmu.edu>
>>> Date: 9/4/18 9:58 PM (GMT-05:00)
>>> To: Emre Yolcu <eyolcu at cs.cmu.edu>, Predrag Punosevac <
>>> predragp at andrew.cmu.edu>
>>> Cc: users at autonlab.org
>>> Subject: Re: PyTorch problem
>>>
>>> Just wondering: can TensorFlow run on these machines? I hope someone can
>>> confirm this so that we can isolate the problem.
>>> Here is a further test: I ran the CUDA examples from the CUDA
>>> installation (in /usr/local/cuda/sample) on gpu2, in my scratch
>>> directory. Simple jobs like deviceQuery succeed, but simpleCUBLAS fails:
>>> yichongx at gpu2$ cd /home/scratch/yichongx/
>>> yichongx at gpu2$ cd
>>> 0_Simple/        2_Graphics/      4_Finance/       6_Advanced/      bin/             conda/
>>> 1_Utilities/     3_Imaging/       5_Simulations/   7_CUDALibraries/ common/          miniconda3/
>>> yichongx at gpu2$ cd 7_CUDALibraries/
>>> yichongx at gpu2$ cd simpleCUBLAS
>>> yichongx at gpu2$ CUDA_VISIBLE_DEVICES=3 ./simpleCUBLAS
>>> GPU Device 0: "TITAN X (Pascal)" with compute capability 6.1
>>>
>>> simpleCUBLAS test running..
>>> !!!! CUBLAS initialization error
>>> yichongx at gpu2$
>>>
>>>
>>> This is also consistent with our previous errors from PyTorch, which
>>> report that the CUBLAS library is not initialized.
>>>
>>> So there is at least some problem with CUBLAS on gpu2. This post
>>> suggests that running with sudo can resolve the problem, probably because
>>> of permission problems on the CUBLAS libraries:
>>> https://devtalk.nvidia.com/default/topic/1027602/cuda-setup-and-installation/cublas-libraries-with-incorrect-permissions/
>>> @Predrag: Can you try running the simpleCUBLAS example from the CUDA
>>> library, with and without root privilege? I think that might be something
>>> that you are more familiar with. Thank you very much!
>>>
>>>
>>> *Thanks,*
>>> *Yichong*
>>>
>>> On Sep 4, 2018, at 3:18 PM, Emre Yolcu <eyolcu at cs.cmu.edu> wrote:
>>>
>>> Hi,
>>>
>>> We are trying to troubleshoot the PyTorch issue with Predrag and were
>>> wondering:
>>>
>>> Is anybody able to run PyTorch GPU models on gpu1-9? If you can, we
>>> would appreciate it if you could respond.
>>>
>>> Also, is it a problem for anyone if gpu8 is rebooted today?
>>>
>>> Thanks,
>>>
>>> Emre
>>>
>>>
>>>
>>
>> --
>> Biswajit Paria
>> PhD in ML @ CMU
>>
>
>
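
The permission hypothesis raised in the thread above can be checked without root by looking at the mode bits of the cuBLAS library files. The following is a minimal sketch, not a confirmed diagnosis: the directory /usr/local/cuda/lib64 is an assumption about where the libraries live on these nodes, and world_unreadable is a hypothetical helper name.

```python
import glob
import os
import stat


def world_unreadable(paths):
    """Return the paths whose mode lacks the read bit for 'other' users."""
    bad = []
    for path in paths:
        try:
            mode = os.stat(path).st_mode  # follows symlinks to the real file
        except OSError:
            # Broken symlink or inaccessible parent directory: report it too.
            bad.append(path)
            continue
        if not mode & stat.S_IROTH:
            bad.append(path)
    return bad


if __name__ == "__main__":
    # Assumed location; adjust to wherever CUDA is installed on the node.
    libs = sorted(glob.glob("/usr/local/cuda/lib64/libcublas*"))
    for path in world_unreadable(libs):
        print("not world-readable:", path)
```

If the script prints nothing, the library files themselves are readable by ordinary users, and the difference between root and non-root runs of simpleCUBLAS would have to come from somewhere else.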