PyTorch problem
Manzil Zaheer
manzil at cmu.edu
Wed Sep 5 16:46:15 EDT 2018
It was working for me before the reboot as well. PyTorch does work on all nodes for me.
What I am trying to say is that I think this is not an issue at the system level but at the user account level. I might be wrong though.
-------- Original message --------
From: Predrag Punosevac <predragp at andrew.cmu.edu>
Date: 9/5/18 4:44 PM (GMT-05:00)
To: Manzil Zaheer <manzil at cmu.edu>
Cc: Biswajit Paria <bparia at cs.cmu.edu>, Yichong Xu <yichongx at cs.cmu.edu>, Emre Yolcu <eyolcu at cs.cmu.edu>, users at autonlab.org
Subject: Re: PyTorch problem
Should I go ahead and reboot all GPU computing nodes? Can somebody else confirm that a reboot fixes the issue?
Predrag
On Wed, Sep 5, 2018 at 4:42 PM, Manzil Zaheer <manzil at cmu.edu> wrote:
It does work for me and my friends
-------- Original message --------
From: Predrag Punosevac <predragp at andrew.cmu.edu>
Date: 9/5/18 4:40 PM (GMT-05:00)
To: Biswajit Paria <bparia at cs.cmu.edu>
Cc: Manzil Zaheer <manzil at cmu.edu>, Yichong Xu <yichongx at cs.cmu.edu>, Emre Yolcu <eyolcu at cs.cmu.edu>, users at autonlab.org
Subject: Re: PyTorch problem
I just rebooted GPU8. All packages are up to date. The NVIDIA driver appears to be working properly, and I can do GPU computations from MATLAB. Let's now try to get PyTorch working on GPU8.
Predrag
On Wed, Sep 5, 2018 at 12:19 AM, Biswajit Paria <bparia at cs.cmu.edu> wrote:
I am facing a similar error on all GPU machines. Did someone find a solution yet?
2018-09-05 00:27:41.546064: E tensorflow/stream_executor/cuda/cuda_blas.cc:459] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
On Tue, Sep 4, 2018 at 10:03 PM Manzil Zaheer <manzil at cmu.edu> wrote:
Hi Yichong
Yes, I am able to run TF and PyTorch on these machines. Recently someone else also had a similar issue, but it got fixed by reinstalling some local packages.
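Since the fix mentioned above involved local packages, one way to tell a user-level install from a system-level one is to look at where Python would load a package from. The following is only a sketch: the helper name `package_origin` and the scope labels are made up for illustration, and it assumes user-level installs live either in the per-user site directory or somewhere under the home directory (for example a conda environment).

```python
import importlib.util
import os
import site


def _is_under(path, root):
    """True if path is root itself or lies below root."""
    root = root.rstrip(os.sep)
    return path == root or path.startswith(root + os.sep)


def package_origin(name):
    """Return (path, scope) for a top-level package without importing it.

    scope is a hypothetical label: 'not-found', 'builtin-or-namespace',
    'user-site', 'home', or 'system'.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None, "not-found"
    if not spec.origin or not os.path.isabs(spec.origin):
        # Built-in, frozen, and namespace packages have no file path.
        return spec.origin, "builtin-or-namespace"
    path = os.path.realpath(spec.origin)
    user_site = site.getusersitepackages()
    if user_site and _is_under(path, os.path.realpath(user_site)):
        return path, "user-site"
    home = os.path.realpath(os.path.expanduser("~"))
    if _is_under(path, home):
        return path, "home"
    return path, "system"


if __name__ == "__main__":
    for pkg in ("torch", "tensorflow"):
        print(pkg, package_origin(pkg))
```

If two users on the same node get different paths for `torch`, that would point toward a per-account difference rather than a system-wide one.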
Thanks,
Manzil
-------- Original message --------
From: Yichong Xu <yichongx at cs.cmu.edu>
Date: 9/4/18 9:58 PM (GMT-05:00)
To: Emre Yolcu <eyolcu at cs.cmu.edu>, Predrag Punosevac <predragp at andrew.cmu.edu>
Cc: users at autonlab.org
Subject: Re: PyTorch problem
Just wondering: can TensorFlow run well on these machines? I hope someone can confirm this so that we can isolate the problem.
OK, so here’s a further test: I tried running the CUDA examples from the CUDA installation (in /usr/local/cuda/sample) on gpu2, in my scratch directory. Simple jobs like deviceQuery succeed, but simpleCUBLAS fails:
yichongx at gpu2$ cd /home/scratch/yichongx/
yichongx at gpu2$ cd
0_Simple/ 2_Graphics/ 4_Finance/ 6_Advanced/ bin/ conda/
1_Utilities/ 3_Imaging/ 5_Simulations/ 7_CUDALibraries/ common/ miniconda3/
yichongx at gpu2$ cd 7_CUDALibraries/
yichongx at gpu2$ cd simpleCUBLAS
yichongx at gpu2$ CUDA_VISIBLE_DEVICES=3 ./simpleCUBLAS
GPU Device 0: "TITAN X (Pascal)" with compute capability 6.1
simpleCUBLAS test running..
!!!! CUBLAS initialization error
yichongx at gpu2$
This is also consistent with our previous errors from PyTorch, which say that the CUBLAS library is not initialized.
So this means there is at least some problem with CUBLAS on gpu2. This post suggests that running with sudo can resolve the problem, probably because of a permission problem on the CUBLAS libraries:
https://devtalk.nvidia.com/default/topic/1027602/cuda-setup-and-installation/cublas-libraries-with-incorrect-permissions/
@Predrag: Can you try running the simpleCUBLAS example from the CUDA library, with and without root privilege? I think that might be something that you are more familiar with. Thank you very much!
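As a complement to the root versus non-root comparison, the permission hypothesis could also be checked directly by listing CUBLAS library files that ordinary users cannot read. This is a minimal sketch under stated assumptions: the helper name `unreadable_libs` is hypothetical, the directory /usr/local/cuda/lib64 is a guess at where the libraries live on these nodes, and it only looks at the world-readable bit, which may not be the whole story.

```python
import glob
import os
import stat


def unreadable_libs(lib_dir, pattern="libcublas*"):
    """Return (path, mode string) for matching files lacking the world-readable bit."""
    flagged = []
    for path in sorted(glob.glob(os.path.join(lib_dir, pattern))):
        try:
            mode = os.stat(path).st_mode  # follows symlinks to the real file
        except OSError:
            continue  # e.g. a dangling symlink; skip it
        if not mode & stat.S_IROTH:
            flagged.append((path, stat.filemode(mode)))
    return flagged


if __name__ == "__main__":
    # Assumed location; adjust to wherever the CUDA libraries are installed.
    for path, mode in unreadable_libs("/usr/local/cuda/lib64"):
        print(mode, path)
```

An empty result would suggest the file permissions are fine and the cause lies elsewhere.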
Thanks,
Yichong
On Sep 4, 2018, at 3:18 PM, Emre Yolcu <eyolcu at cs.cmu.edu> wrote:
Hi,
We are trying to troubleshoot the PyTorch issue with Predrag and were wondering:
Is anybody able to run PyTorch GPU models on gpu1-9? If you can, we would appreciate it if you could respond.
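For anyone who wants to answer with a quick check rather than a full model, a small GPU matrix multiply should be enough, since PyTorch hands float matrix products on CUDA to CUBLAS. This is a rough sketch, not a definitive test: the function name `check_cublas` and its status strings are invented for illustration, and it reports a missing PyTorch or missing GPU instead of raising.

```python
def check_cublas(device_index=0, n=64):
    """Return a short status string describing whether a GPU matmul works."""
    try:
        import torch
    except ImportError:
        return "torch-not-installed"
    if not torch.cuda.is_available():
        return "cuda-not-available"
    try:
        device = torch.device("cuda", device_index)
        a = torch.randn(n, n, device=device)
        b = torch.randn(n, n, device=device)
        c = a @ b              # matrix product on the GPU
        float(c.sum())         # copy a scalar back, forcing the work to finish
    except RuntimeError as err:
        return "matmul-failed: {}".format(err)
    return "ok"


if __name__ == "__main__":
    print(check_cublas())
```

Running it as different users on the same node, for example with CUDA_VISIBLE_DEVICES set as in the simpleCUBLAS run, would help separate a machine problem from an account problem.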
Also, is it a problem for anyone if gpu8 is rebooted today?
Thanks,
Emre
--
Biswajit Paria
PhD in ML @ CMU