PyTorch problem
Emre Yolcu
eyolcu at cs.cmu.edu
Thu Sep 6 20:46:57 EDT 2018
I think I got it. If I'm not mistaken, NFS is the root of all our problems
in this thread. Can anyone who is having problems try the equivalent of
`export CUDA_CACHE_PATH=/home/scratch/eyolcu/computecache` (replacing
eyolcu with your Andrew ID) and then try everything again? This seems to
fix it for me.
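
For anyone who wants to script the workaround, here is a minimal sketch.
It assumes /home/scratch is node-local disk on each GPU machine and that
your scratch directory is named after your login name.
`use_local_cuda_cache` is a hypothetical helper name, not part of CUDA.
For context, the CUDA JIT cache defaults to ~/.nv/ComputeCache, which
presumably sits on the NFS-mounted home directory here.

```shell
# Hypothetical helper: point the CUDA JIT cache at a local (non-NFS) directory.
# $1 is the scratch root; /home/scratch is an assumption taken from this thread.
use_local_cuda_cache() {
    cuda_cache_root="${1:-/home/scratch}"
    # Assumes the scratch directory is named after the login name.
    cuda_cache_dir="$cuda_cache_root/${USER:-$(id -un)}/computecache"

    # Create the cache directory; leave CUDA_CACHE_PATH untouched on failure.
    if ! mkdir -p "$cuda_cache_dir"; then
        echo "could not create $cuda_cache_dir" >&2
        return 1
    fi

    # CUDA_CACHE_PATH is the variable the CUDA driver reads for its JIT cache.
    export CUDA_CACHE_PATH="$cuda_cache_dir"
    echo "CUDA_CACHE_PATH=$CUDA_CACHE_PATH"
}
```

Run `use_local_cuda_cache` in the shell (or from your shell startup file)
before launching PyTorch or the CUDA samples, then retry the failing job.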
On Thu, Sep 6, 2018 at 3:28 PM, Yichong Xu <yichongx at cs.cmu.edu> wrote:
> Hi Predrag,
> I just tested the simpleCUBLAS sample in cuda library. It still does not
> work for me with the same error:
> GPU Device 0: "TITAN Xp" with compute capability 6.1
>
> simpleCUBLAS test running..
> !!!! CUBLAS initialization error
>
>
> I’m not sure where exactly the access problem is, but here is what I get
> from ls -all:
> yichongx at gpu8$ ls -all
> total 2136200
> drwxr-xr-x. 3 sheath sheath 8192 May 31 15:16 .
> drwxr-xr-x. 4 root root 32 Sep 2 2017 ..
> lrwxrwxrwx. 1 root root 18 Mar 13 13:05 libaccinj64.so -> libaccinj64.so.9.0
> lrwxrwxrwx. 1 root root 22 Mar 13 13:05 libaccinj64.so.9.0 -> libaccinj64.so.9.0.176
> -rwxr-xr-x. 1 root root 6858944 Sep 2 2017 libaccinj64.so.9.0.176
> -rw-r--r--. 1 root root 71952010 Dec 19 2017 libcublas_device.a
> lrwxrwxrwx. 1 root root 16 Mar 13 13:04 libcublas.so -> libcublas.so.9.0
> lrwxrwxrwx. 1 root root 20 Mar 13 13:04 libcublas.so.9.0 -> libcublas.so.9.0.282
> -rwxr-xr-x. 1 root root 52590576 Dec 19 2017 libcublas.so.9.0.176
> -rwxr-xr-x. 1 root root 55781312 Dec 19 2017 libcublas.so.9.0.282
> -rw-r--r--. 1 root root 62813620 Dec 19 2017 libcublas_static.a
>
>
> *Thanks,*
> *Yichong*
>
> On Sep 6, 2018, at 3:14 PM, Yichong Xu <yichongx at cs.cmu.edu> wrote:
>
> 1. I think yes. Biswajit and I cannot use the system cuda libraries.
> 2. I think yes as well. Predrag said he can run matlab with cuda well
> (probably with root access), so I think there should be some problem with
> the privilege setting of system libraries. We do not have root access on
> our accounts.
> 3. Not so far.
> 4. That can be a solution. Maybe we have a public access library as
> Jay-Yoon did and that can work for us.
>
> Also for gpu8 - I just reinstalled pytorch again on scratch of gpu8 and it
> still does not work. I’m making the cuda libraries right now and trying to
> see if it works.
>
> *Thanks,*
> *Yichong*
>
>
>
> On Sep 6, 2018, at 11:20 AM, Barnabas Poczos <bapoczos at cs.cmu.edu> wrote:
>
> Hi All,
>
> I'm somewhat confused:
>
> * Do I understand correctly that Manzil actually is using the CUDA
> libraries installed by himself
> (/zfsauton/home/manzilz/local/cuda-9.0/) and not the system libraries
> (/usr/local/cuda/lib64) ?
> * Since he is using different CUDA libraries is that the reason that
> pytorch is working for him and not for the other users? If so, should
> we double check the system libraries?
> * Do we know anyone who can use pytorch now with the CUDA system
> libraries ? If so, those users please let us know your system env
> variables.
> * As a quick solution, should we ask Manzil to copy his cuda libraries
> to a public place where others could access them?
>
> Best,
> Barnabas
>
> ======================
> Barnabas Poczos, PhD
> Associate Professor
> Co-Director of PhD Program
> Machine Learning Department
> Carnegie Mellon University
> On Wed, Sep 5, 2018 at 5:33 PM Manzil Zaheer <manzil at cmu.edu> wrote:
>
>
> Here is my related env variables:
>
>
>
> CUDA_HOME=/zfsauton/home/manzilz/local/cuda-9.0/
>
> LD_LIBRARY_PATH=/zfsauton/home/manzilz/local/lib64:/zfsauton/home/manzilz/local/lib:/zfsauton/home/manzilz/local/cuda-9.0/lib64:/usr/local/cuda/lib64:
>
> PATH=/zfsauton/home/manzilz/local/bin:/zfsauton/home/manzilz/.local/bin:/zfsauton/home/manzilz/local/cuda-9.0/bin:/usr/local/cuda/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
>
> C_INCLUDE_PATH=/zfsauton/home/manzilz/local/include:
>
>
>
> From: Biswajit Paria <bparia at cs.cmu.edu>
> Sent: Wednesday, September 05, 2018 5:29 PM
> To: Yichong Xu <yichongx at cs.cmu.edu>
> Cc: Biswajit Paria <bparia at cs.cmu.edu>; eyolcu at cs.cmu.edu; Predrag
> Punosevac <predragp at andrew.cmu.edu>; Manzil Zaheer <manzil at cmu.edu>;
> users at autonlab.org
> Subject: Re: PyTorch problem
>
>
>
> If the CUDA examples work for anyone, can they share their PATH and
> LD_LIBRARY_PATH variables?
>
>
>
> Thanks
>
>
>
> On Wed, Sep 5, 2018 at 5:27 PM Yichong Xu <yichongx at cs.cmu.edu> wrote:
>
> I think with Biswajit’s and my problem with cuda, we should isolate the
> problem with just CUDA (and drivers) instead of wandering around python or
> pytorch.
>
> Predrag can you test the CUDA examples? I sort of agree with Manzil that
> this might be a user account problem.
>
>
>
> Thanks,
>
> Yichong
>
>
>
>
>
>
>
> On Sep 5, 2018, at 5:14 PM, Biswajit Paria <bparia at cs.cmu.edu> wrote:
>
>
>
> I just tried Yichong's way of testing cuBLAS, and get the same error as
> earlier:
>
>
>
> [Matrix Multiply CUBLAS] - Starting...
>
> GPU Device 0: "TITAN Xp" with compute capability 6.1
>
>
>
> MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
>
> CUDA error at matrixMulCUBLAS.cpp:275 code=1(CUBLAS_STATUS_NOT_INITIALIZED) "cublasCreate(&handle)"
>
>
>
> So I believe it is not a conda error. I also tried removing .nv, which
> doesn't help either. Maybe someone can share their PATH env variable?
>
>
>
> On Wed, Sep 5, 2018 at 5:08 PM Emre Yolcu <eyolcu at cs.cmu.edu> wrote:
>
> Manzil, could you share your `conda env export` (or equivalent) output for
> the environment you use for pytorch? It's still not working for me after
> reboot, maybe I can try replicating your exact setup and try with that.
>
>
>
> Thanks,
>
>
>
> Emre
>
>
>
> On Wed, Sep 5, 2018 at 4:56 PM, Predrag Punosevac <predragp at andrew.cmu.edu>
> wrote:
>
> Manzil Zaheer <manzil at cmu.edu> wrote:
>
> It was working for me before the reboot as well. PyTorch does work on all
> nodes for me.
>
>
> Aha! Gotcha.
>
>
> What I am trying to say is that I think it is not an issue at the system
> level but at the user account level. I might be wrong though.
>
>
> That was my hunch as well. They were trying to convince me in a
> 150-e-mail chain over the weekend that pytorch was broken when I replaced
> a failed HDD on the main file server. That didn't make any sense.
>
> Could you please share your binaries and setup with other pytorch
> users?
>
> Cheers,
> Predrag
>
>
>
> -------- Original message --------
> From: Predrag Punosevac <predragp at andrew.cmu.edu>
> Date: 9/5/18 4:44 PM (GMT-05:00)
> To: Manzil Zaheer <manzil at cmu.edu>
> Cc: Biswajit Paria <bparia at cs.cmu.edu>, Yichong Xu <yichongx at cs.cmu.edu>,
> Emre Yolcu <eyolcu at cs.cmu.edu>, users at autonlab.org
> Subject: Re: PyTorch problem
>
> Should I go ahead and reboot all GPU computing nodes? Can somebody else
> confirm that a reboot fixes the issue?
>
> Predrag
>
> On Wed, Sep 5, 2018 at 4:42 PM, Manzil Zaheer <manzil at cmu.edu> wrote:
> It does work for me and my friends
>
>
>
>
> -------- Original message --------
> From: Predrag Punosevac <predragp at andrew.cmu.edu>
> Date: 9/5/18 4:40 PM (GMT-05:00)
> To: Biswajit Paria <bparia at cs.cmu.edu>
> Cc: Manzil Zaheer <manzil at cmu.edu>, Yichong Xu <yichongx at cs.cmu.edu>,
> Emre Yolcu <eyolcu at cs.cmu.edu>, users at autonlab.org
> Subject: Re: PyTorch problem
>
> I just rebooted GPU8. All packages are up to date. NVidia driver appears
> to be working properly and I can do GPU computations from MATLAB. Let's try
> now to get pytorch working on GPU8.
>
> Predrag
>
> On Wed, Sep 5, 2018 at 12:19 AM, Biswajit Paria <bparia at cs.cmu.edu> wrote:
> I am facing a similar error on all GPU machines. Did someone find a
> solution yet?
>
>
> 2018-09-05 00:27:41.546064: E tensorflow/stream_executor/cuda/cuda_blas.cc:459]
> failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
>
> On Tue, Sep 4, 2018 at 10:03 PM Manzil Zaheer <manzil at cmu.edu> wrote:
> Hi Yichong
>
> Yes, I am able to run TF and PyTorch on these machines. Recently someone
> else also had a similar issue, but it got fixed by reinstalling some local
> packages.
>
> Thanks,
> Manzil
>
>
> -------- Original message --------
> From: Yichong Xu <yichongx at cs.cmu.edu>
> Date: 9/4/18 9:58 PM (GMT-05:00)
> To: Emre Yolcu <eyolcu at cs.cmu.edu>, Predrag Punosevac <predragp at andrew.cmu.edu>
> Cc: users at autonlab.org
> Subject: Re: PyTorch problem
>
> Just wondering - can TensorFlow run well on these machines? I hope someone
> can confirm this so that we can isolate the problem.
> OK, so here's a further test: I tried running the CUDA examples from the
> CUDA installation (in /usr/local/cuda/sample), on gpu2 in my scratch
> directory. Simple jobs like deviceQuery succeed, but simpleCUBLAS fails:
> yichongx at gpu2$ cd /home/scratch/yichongx/
> yichongx at gpu2$ cd
> 0_Simple/ 2_Graphics/ 4_Finance/ 6_Advanced/ bin/
> conda/
> 1_Utilities/ 3_Imaging/ 5_Simulations/ 7_CUDALibraries/
> common/ miniconda3/
> yichongx at gpu2$ cd 7_CUDALibraries/
> yichongx at gpu2$ cd simpleCUBLAS
> yichongx at gpu2$ CUDA_VISIBLE_DEVICES=3 ./simpleCUBLAS
> GPU Device 0: "TITAN X (Pascal)" with compute capability 6.1
>
> simpleCUBLAS test running..
> !!!! CUBLAS initialization error
> yichongx at gpu2$
>
>
> This is also consistent with our previous errors from pytorch, which say
> cublas library not initialized.
>
> So this means at least there is some problem with CUBLAS on gpu2. This
> post suggests that using sudo can resolve this problem, and this is
> probably because of some permission problems on CUBLAS libraries:
> https://devtalk.nvidia.com/default/topic/1027602/cuda-setup-and-installation/cublas-libraries-with-incorrect-permissions/
> @Predrag: Can you try running the simpleCUBLAS example from the CUDA
> library, with and without root privilege? I think that might be something
> that you are more familiar with. Thank you very much!
>
>
> Thanks,
> Yichong
>
>
> On Sep 4, 2018, at 3:18 PM, Emre Yolcu <eyolcu at cs.cmu.edu> wrote:
>
> Hi,
>
> We are trying to troubleshoot the PyTorch issue with Predrag and were
> wondering:
>
> Is anybody able to run PyTorch GPU models on gpu1-9? If you can, we would
> appreciate it if you could respond.
>
> Also, is it a problem for anyone if gpu8 is rebooted today?
>
> Thanks,
>
> Emre
>
>
>
> --
> Biswajit Paria
> PhD in ML @ CMU