PyTorch problem

Fri Sep 7 14:19:07 EDT 2018

Thanks Emre! This does resolve it for me.
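
A minimal sketch of the fix Emre describes below, for anyone finding this thread later. It assumes node-local scratch lives at /home/scratch/<your id>, as in his message. `SCRATCH_ROOT` is a hypothetical override added here for illustration and is not something CUDA reads. `CUDA_CACHE_PATH` is the standard CUDA variable that relocates the JIT compute cache, which otherwise defaults to ~/.nv/ComputeCache in the home directory (presumably on NFS here).

```shell
# Assumption: scratch is under /home/scratch/<user>, as in the thread.
# SCRATCH_ROOT is a made-up override for machines laid out differently.
scratch_root="${SCRATCH_ROOT:-/home/scratch}"
cache_dir="$scratch_root/$(id -un)/computecache"

# Try to create the directory first; warn (but continue) if that fails.
mkdir -p "$cache_dir" || echo "warning: could not create $cache_dir" >&2

# Point the CUDA JIT compute cache at local disk instead of the home directory.
export CUDA_CACHE_PATH="$cache_dir"
echo "CUDA_CACHE_PATH=$CUDA_CACHE_PATH"
```

Adding the same lines to ~/.bashrc should keep the setting across logins. Re-running simpleCUBLAS afterwards is a quick way to check whether the cuBLAS initialization error is gone.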

On Thu, Sep 6, 2018 at 8:47 PM Emre Yolcu <eyolcu at cs.cmu.edu> wrote:

> I think I got it. If I'm not mistaken NFS is the root of all our problems
> in this thread. Can anyone having problems try doing the equivalent of
> `export CUDA_CACHE_PATH=/home/scratch/eyolcu/computecache` (replacing
> eyolcu with your andrew id) and try everything again? This seems to fix it
> for me.
>
> On Thu, Sep 6, 2018 at 3:28 PM, Yichong Xu <yichongx at cs.cmu.edu> wrote:
>
>> Hi Predrag,
>> I just tested the simpleCUBLAS sample in cuda library. It still does not
>> work for me with the same error:
>> GPU Device 0: "TITAN Xp" with compute capability 6.1
>>
>> simpleCUBLAS test running..
>> !!!! CUBLAS initialization error
>>
>>
>> I’m not sure where exactly the access problem is, but here is what I get
>> from ls -all:
>> yichongx@gpu8$ ls -all
>> total 2136200
>> drwxr-xr-x. 3 sheath sheath      8192 May 31 15:16 .
>> drwxr-xr-x. 4 root   root          32 Sep  2  2017 ..
>> lrwxrwxrwx. 1 root   root          18 Mar 13 13:05 libaccinj64.so -> libaccinj64.so.9.0
>> lrwxrwxrwx. 1 root   root          22 Mar 13 13:05 libaccinj64.so.9.0 -> libaccinj64.so.9.0.176
>> -rwxr-xr-x. 1 root   root     6858944 Sep  2  2017 libaccinj64.so.9.0.176
>> -rw-r--r--. 1 root   root    71952010 Dec 19  2017 libcublas_device.a
>> lrwxrwxrwx. 1 root   root          16 Mar 13 13:04 libcublas.so -> libcublas.so.9.0
>> lrwxrwxrwx. 1 root   root          20 Mar 13 13:04 libcublas.so.9.0 -> libcublas.so.9.0.282
>> -rwxr-xr-x. 1 root   root    52590576 Dec 19  2017 libcublas.so.9.0.176
>> -rwxr-xr-x. 1 root   root    55781312 Dec 19  2017 libcublas.so.9.0.282
>> -rw-r--r--. 1 root   root    62813620 Dec 19  2017 libcublas_static.a
>>
>>
>> *Thanks,*
>> *Yichong*
>>
>> On Sep 6, 2018, at 3:14 PM, Yichong Xu <yichongx at cs.cmu.edu> wrote:
>>
>> 1. I think yes. Biswajit and I cannot use the system cuda libraries.
>> 2. I think yes as well. Predrag said he can run matlab with cuda well
>> (probably with root access), so I think there should be some problem with
>> the privilege setting of system libraries. We do not have root access on
>> our accounts.
>> 3. Not yet so far.
>> 4. That can be a solution. Maybe we have a public access library as
>> Jay-Yoon did and that can work for us.
>>
>> Also for gpu8 - I just reinstalled pytorch again on scratch of gpu8 and
>> it still does not work. I’m making the cuda libraries right now and trying
>> to see if it works.
>>
>> *Thanks,*
>> *Yichong*
>>
>>
>>
>> On Sep 6, 2018, at 11:20 AM, Barnabas Poczos <bapoczos at cs.cmu.edu> wrote:
>>
>> Hi All,
>>
>> I'm somewhat confused:
>>
>> * Do I understand correctly that Manzil actually is using the CUDA
>> libraries installed by himself
>> (/zfsauton/home/manzilz/local/cuda-9.0/) and not the system libraries
>> (/usr/local/cuda/lib64) ?
>> * Since he is using different CUDA libraries is that the reason that
>> pytorch is working for him and not for the other users? If so, should
>> we double check the system libraries?
>> * Do we know anyone who can use pytorch now with the CUDA system
>> libraries ? If so, those users please let us know your system env
>> variables.
>> * As a quick solution, should we ask Manzil to copy his cuda libraries
>> to a public place where others could access them?
>>
>> Best,
>> Barnabas
>>
>> ======================
>> Barnabas Poczos, PhD
>> Associate Professor
>> Co-Director of PhD Program
>> Machine Learning Department
>> Carnegie Mellon University
>> On Wed, Sep 5, 2018 at 5:33 PM Manzil Zaheer <manzil at cmu.edu> wrote:
>>
>>
>> Here is my related env variables:
>>
>>
>>
>> CUDA_HOME=/zfsauton/home/manzilz/local/cuda-9.0/
>>
>>
>> LD_LIBRARY_PATH=/zfsauton/home/manzilz/local/lib64:/zfsauton/home/manzilz/local/lib:/zfsauton/home/manzilz/local/cuda-9.0/lib64:/usr/local/cuda/lib64:
>>
>>
>> PATH=/zfsauton/home/manzilz/local/bin:/zfsauton/home/manzilz/.local/bin:/zfsauton/home/manzilz/local/cuda-9.0/bin:/usr/local/cuda/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
>>
>> C_INCLUDE_PATH=/zfsauton/home/manzilz/local/include:
>>
>>
>>
>> From: Biswajit Paria <bparia at cs.cmu.edu>
>> Sent: Wednesday, September 05, 2018 5:29 PM
>> To: Yichong Xu <yichongx at cs.cmu.edu>
>> Cc: Biswajit Paria <bparia at cs.cmu.edu>; eyolcu at cs.cmu.edu; Predrag
>> Punosevac <predragp at andrew.cmu.edu>; Manzil Zaheer <manzil at cmu.edu>;
>> users at autonlab.org
>> Subject: Re: PyTorch problem
>>
>>
>>
>> If the CUDA examples work for anyone, can they share their PATH and
>> LD_LIBRARY_PATH variables?
>>
>>
>>
>> Thanks
>>
>>
>>
>> On Wed, Sep 5, 2018 at 5:27 PM Yichong Xu <yichongx at cs.cmu.edu> wrote:
>>
>> I think with Biswajit’s and my problem with cuda, we should isolate the
>> problem with just CUDA (and drivers) instead of wandering around python or
>> pytorch.
>>
>> Predrag can you test the CUDA examples? I sort of agree with Manzil that
>> this might be a user account problem.
>>
>>
>>
>> Thanks,
>>
>> Yichong
>>
>>
>>
>>
>>
>>
>>
>> On Sep 5, 2018, at 5:14 PM, Biswajit Paria <bparia at cs.cmu.edu> wrote:
>>
>>
>>
>> I just tried Yichong's way of testing cuBLAS, and get the same error as
>> earlier:
>>
>>
>>
>> [Matrix Multiply CUBLAS] - Starting...
>>
>> GPU Device 0: "TITAN Xp" with compute capability 6.1
>>
>>
>>
>> MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
>>
>> CUDA error at matrixMulCUBLAS.cpp:275 code=1(CUBLAS_STATUS_NOT_INITIALIZED) "cublasCreate(&handle)"
>>
>>
>>
>> So I believe it is not a conda error. I also tried removing .nv, which
>> doesn't help either. Maybe someone can share the PATH env variable?
>>
>>
>>
>> On Wed, Sep 5, 2018 at 5:08 PM Emre Yolcu <eyolcu at cs.cmu.edu> wrote:
>>
>> Manzil, could you share your `conda env export` (or equivalent) output
>> for the environment you use for pytorch? It's still not working for me
>> after reboot, maybe I can try replicating your exact setup and try with
>> that.
>>
>>
>>
>> Thanks,
>>
>>
>>
>> Emre
>>
>>
>>
>> On Wed, Sep 5, 2018 at 4:56 PM, Predrag Punosevac <predragp at andrew.cmu.edu> wrote:
>>
>> Manzil Zaheer <manzil at cmu.edu> wrote:
>>
>> It was working for me before the reboot as well. PyTorch does work on all
>> nodes for me.
>>
>>
>> Aha! Gotcha.
>>
>>
>> What I am trying to say is that I think it is not an issue at the system
>> level but at the user account level. I might be wrong though.
>>
>>
>> That was my hunch as well. They were trying to convince me in a
>> 150-e-mail chain over the weekend that pytorch was broken when I replaced a
>> failed HDD on the main file server. That didn't make any sense.
>>
>> Could you please share your binaries and setup with other pytorch
>> users?
>>
>> Cheers,
>> Predrag
>>
>>
>>
>> -------- Original message --------
>> From: Predrag Punosevac <predragp at andrew.cmu.edu>
>> Date: 9/5/18 4:44 PM (GMT-05:00)
>> To: Manzil Zaheer <manzil at cmu.edu>
>> Cc: Biswajit Paria <bparia at cs.cmu.edu>, Yichong Xu <yichongx at cs.cmu.edu>,
>> Emre Yolcu <eyolcu at cs.cmu.edu>, users at autonlab.org
>> Subject: Re: PyTorch problem
>>
>> Should I go ahead and reboot all GPU computing nodes? Can somebody else
>> confirm that a reboot fixes the issue?
>>
>> Predrag
>>
>> On Wed, Sep 5, 2018 at 4:42 PM, Manzil Zaheer <manzil at cmu.edu> wrote:
>> It does work for me and my friends
>>
>>
>>
>>
>> -------- Original message --------
>> From: Predrag Punosevac <predragp at andrew.cmu.edu>
>> Date: 9/5/18 4:40 PM (GMT-05:00)
>> To: Biswajit Paria <bparia at cs.cmu.edu>
>> Cc: Manzil Zaheer <manzil at cmu.edu>, Yichong Xu <yichongx at cs.cmu.edu>,
>> Emre Yolcu <eyolcu at cs.cmu.edu>, users at autonlab.org
>> Subject: Re: PyTorch problem
>>
>> I just rebooted GPU8. All packages are up to date. NVidia driver appears
>> to be working properly and I can do GPU computations from MATLAB. Let's try
>> now to get pytorch working on GPU8.
>>
>> Predrag
>>
>> On Wed, Sep 5, 2018 at 12:19 AM, Biswajit Paria <bparia at cs.cmu.edu> wrote:
>> I am facing a similar error on all GPU machines. Did someone find a
>> solution yet?
>>
>>
>> 2018-09-05 00:27:41.546064: E tensorflow/stream_executor/cuda/cuda_blas.cc:459] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
>>
>> On Tue, Sep 4, 2018 at 10:03 PM Manzil Zaheer <manzil at cmu.edu> wrote:
>> Hi Yichong
>>
>> Yes, I am able to run TF and PyTorch on these machines. Recently someone
>> else also had a similar issue, but it got fixed by reinstalling some local
>> packages.
>>
>> Thanks,
>> Manzil
>>
>>
>> -------- Original message --------
>> From: Yichong Xu <yichongx at cs.cmu.edu>
>> Date: 9/4/18 9:58 PM (GMT-05:00)
>> To: Emre Yolcu <eyolcu at cs.cmu.edu>, Predrag Punosevac <predragp at andrew.cmu.edu>
>> Cc: users at autonlab.org
>> Subject: Re: PyTorch problem
>>
>> Just wondering - can TensorFlow run well on these machines? I hope
>> someone can confirm this so that we can isolate the problem.
>> OK, so here's a further test: I tried running the CUDA examples from the
>> CUDA installation (in /usr/local/cuda/sample), on gpu2 in my scratch
>> directory. Simple jobs like deviceQuery succeed, but simpleCUBLAS fails:
>> yichongx@gpu2$ cd /home/scratch/yichongx/
>> yichongx@gpu2$ cd
>> 0_Simple/        2_Graphics/      4_Finance/       6_Advanced/      bin/             conda/
>> 1_Utilities/     3_Imaging/       5_Simulations/   7_CUDALibraries/ common/          miniconda3/
>> yichongx@gpu2$ cd 7_CUDALibraries/
>> yichongx@gpu2$ cd simpleCUBLAS
>> yichongx@gpu2$ CUDA_VISIBLE_DEVICES=3 ./simpleCUBLAS
>> GPU Device 0: "TITAN X (Pascal)" with compute capability 6.1
>>
>> simpleCUBLAS test running..
>> !!!! CUBLAS initialization error
>> yichongx@gpu2$
>>
>>
>> This is also consistent with our previous errors from pytorch, which say
>> cublas library not initialized.
>>
>> So this means at least there is some problem with CUBLAS on gpu2. This
>> post suggests that using sudo can resolve this problem, and this is
>> probably because of some permission problems on CUBLAS libraries:
>>
>> https://devtalk.nvidia.com/default/topic/1027602/cuda-setup-and-installation/cublas-libraries-with-incorrect-permissions/
>> @Predrag: Can you try running the simpleCUBLAS example from the CUDA
>> library, with and without root privilege? I think that might be something
>> that you are more familiar with. Thank you very much!
>>
>>
>> Thanks,
>> Yichong
>>
>>
>> On Sep 4, 2018, at 3:18 PM, Emre Yolcu <eyolcu at cs.cmu.edu> wrote:
>>
>> Hi,
>>
>> We are trying to troubleshoot the PyTorch issue with Predrag and were
>> wondering:
>>
>> Is anybody able to run PyTorch GPU models on gpu1-9? If you can, we would
>> appreciate it if you could respond.
>>
>> Also, is it a problem for anyone if gpu8 is rebooted today?
>>
>> Thanks,
>>
>> Emre
>>
>>
>>
>> --
>> Biswajit Paria
>> PhD in ML @ CMU

-- 
Biswajit Paria
PhD in ML @ CMU