<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
Thank you very much Emre! That’s so clever, and (almost) completely resolves my issue. I only have a small error when using h5py:
<div class="">
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo; background-color: rgba(191, 191, 191, 0.862745);" class="">
<span style="font-variant-ligatures: no-common-ligatures" class="">  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper</span></div>
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo; background-color: rgba(191, 191, 191, 0.862745);" class="">
<span style="font-variant-ligatures: no-common-ligatures" class="">  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper</span></div>
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo; background-color: rgba(191, 191, 191, 0.862745);" class="">
<span style="font-variant-ligatures: no-common-ligatures" class="">  File "h5py/h5f.pyx", line 78, in h5py.h5f.open</span></div>
<div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo; background-color: rgba(191, 191, 191, 0.862745);" class="">
<span style="font-variant-ligatures: no-common-ligatures" class="">OSError: Unable to open file (unable to lock file, errno = 5, error message = 'Input/output error')</span></div>
<br class="">
I resolved the problem by moving the file to scratch space. I think the new disk may have some small problems (either performance or permissions), and that is what causes the error.<br class="">
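<div class="">For anyone who wants to check whether a given directory is affected, here is a minimal sketch of a lock probe. It assumes HDF5 takes flock-style locks on the file it opens, which may not hold for every build; the function name can_lock is hypothetical and only for illustration.</div>
<pre class="">
```python
import fcntl
import os
import tempfile


def can_lock(directory):
    """Return True if an exclusive, non-blocking flock succeeds in directory."""
    # Create a throwaway file on the filesystem we want to probe.
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            # Locking is refused or unsupported here (assumed to be the
            # same family of failure as the h5py error above).
            return False
        fcntl.flock(fd, fcntl.LOCK_UN)
        return True
    finally:
        # Always clean up the probe file, whatever the outcome.
        os.close(fd)
        os.remove(path)
```
</pre>
<div class="">Calling can_lock on the home directory and on the scratch directory should show whether only the network disk refuses locks. HDF5 1.10 and later also read the HDF5_USE_FILE_LOCKING environment variable, so setting it to FALSE before importing h5py may be worth trying if the file has to stay on the network disk.</div>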
<div><br class="">
</div>
<div><br class="">
</div>
<div>
<div><i class="">Thanks,</i></div>
<div><i class="">Yichong</i></div>
<div class=""><br class="">
</div>
</div>
<div><br class="">
</div>
<div><br class="">
<blockquote type="cite" class="">
<div class="">On Sep 6, 2018, at 8:46 PM, Emre Yolcu <<a href="mailto:eyolcu@cs.cmu.edu" class="">eyolcu@cs.cmu.edu</a>> wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div dir="ltr" class="">I think I got it. If I'm not mistaken NFS is the root of all our problems in this thread. Can anyone having problems try doing the equivalent of `export CUDA_CACHE_PATH=/home/scratch/eyolcu/computecache` (replacing eyolcu with your andrew
 id) and try everything again? This seems to fix it for me.<br class="">
</div>
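<div class="">For Python jobs, the same setting can be sketched from inside the script. This is only an illustration: the helper name use_scratch_cuda_cache is hypothetical, the /home/scratch layout is taken from the command above, and the variable is assumed to need setting before CUDA is initialized (for example, before importing torch).</div>
<pre class="">
```python
import os


def use_scratch_cuda_cache(user, scratch_root="/home/scratch"):
    """Point the CUDA compute cache at a per-user directory under scratch_root."""
    cache_dir = os.path.join(scratch_root, user, "computecache")
    # Create the directory up front so the cache location is known to exist.
    os.makedirs(cache_dir, exist_ok=True)
    # Assumption: this has to happen before the first CUDA call in the process.
    os.environ["CUDA_CACHE_PATH"] = cache_dir
    return cache_dir
```
</pre>
<div class="">Calling use_scratch_cuda_cache("eyolcu") at the very top of a script would mirror the export line, with the local scratch disk taking the place of the NFS home directory.</div>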
<div class="gmail_extra"><br class="">
<div class="gmail_quote">On Thu, Sep 6, 2018 at 3:28 PM, Yichong Xu <span dir="ltr" class="">
<<a href="mailto:yichongx@cs.cmu.edu" target="_blank" class="">yichongx@cs.cmu.edu</a>></span> wrote:<br class="">
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div style="word-wrap:break-word;line-break:after-white-space" class="">Hi Predrag,
<div class="">I just tested the simpleCUBLAS sample from the CUDA library. It still fails for me with the same error:</div>
<div class=""><span class="">
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;background-color:rgba(191,191,191,0.862745)" class="">
<span style="font-variant-ligatures:no-common-ligatures" class="">GPU Device 0: "TITAN Xp" with compute capability 6.1</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;background-color:rgba(191,191,191,0.862745);min-height:13px" class="">
<span style="font-variant-ligatures:no-common-ligatures" class=""></span><br class="">
</div>
</span><span class="">
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;background-color:rgba(191,191,191,0.862745)" class="">
<span style="font-variant-ligatures:no-common-ligatures" class="">simpleCUBLAS test running..</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;background-color:rgba(191,191,191,0.862745)" class="">
<span style="font-variant-ligatures:no-common-ligatures" class="">!!!! CUBLAS initialization error</span></div>
<div class=""><br class="m_-4157814231952792245Apple-interchange-newline">
</div>
<div class=""><br class="">
</div>
</span>
<div class="">I’m not sure where exactly the access problem is, but here is what I get from ls -all:</div>
<div class="">
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;background-color:rgba(191,191,191,0.862745)" class="">
<span style="font-variant-ligatures:no-common-ligatures" class="">yichongx@gpu8$ ls -all</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;background-color:rgba(191,191,191,0.862745)" class="">
<span style="font-variant-ligatures:no-common-ligatures" class="">total 2136200</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;background-color:rgba(191,191,191,0.862745)" class="">
<span style="font-variant-ligatures:no-common-ligatures" class="">drwxr-xr-x. 3 sheath sheath      8192 May 31 15:16
</span><span style="font-variant-ligatures:no-common-ligatures;color:#005bfb" class="">.</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;background-color:rgba(191,191,191,0.862745)" class="">
<span style="font-variant-ligatures:no-common-ligatures" class="">drwxr-xr-x. 4 root   root          32 Sep  2  2017
</span><span style="font-variant-ligatures:no-common-ligatures;color:#005bfb" class="">..</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;background-color:rgba(191,191,191,0.862745)" class="">
<span style="font-variant-ligatures:no-common-ligatures" class="">lrwxrwxrwx. 1 root   root          18 Mar 13 13:05
</span><span style="font-variant-ligatures:no-common-ligatures;color:#44ffff" class="">libaccinj64.so</span><span style="font-variant-ligatures:no-common-ligatures" class=""> ->
</span><span style="font-variant-ligatures:no-common-ligatures;color:#009800" class="">libaccinj64.so.9.0</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;background-color:rgba(191,191,191,0.862745)" class="">
<span style="font-variant-ligatures:no-common-ligatures" class="">lrwxrwxrwx. 1 root   root          22 Mar 13 13:05
</span><span style="font-variant-ligatures:no-common-ligatures;color:#44ffff" class="">libaccinj64.so.9.0</span><span style="font-variant-ligatures:no-common-ligatures" class=""> ->
</span><span style="font-variant-ligatures:no-common-ligatures;color:#009800" class="">libaccinj64.so.9.0.176</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;background-color:rgba(191,191,191,0.862745)" class="">
<span style="font-variant-ligatures:no-common-ligatures" class="">-rwxr-xr-x. 1 root   root     6858944 Sep  2  2017
</span><span style="font-variant-ligatures:no-common-ligatures;color:#009800" class="">libaccinj64.so.9.0.176</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;background-color:rgba(191,191,191,0.862745)" class="">
<span style="font-variant-ligatures:no-common-ligatures" class="">-rw-r--r--. 1 root   root    71952010 Dec 19  2017 libcublas_device.a</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;background-color:rgba(191,191,191,0.862745)" class="">
<span style="font-variant-ligatures:no-common-ligatures" class="">lrwxrwxrwx. 1 root   root          16 Mar 13 13:04
</span><span style="font-variant-ligatures:no-common-ligatures;color:#44ffff" class="">libcublas.so</span><span style="font-variant-ligatures:no-common-ligatures" class=""> ->
</span><span style="font-variant-ligatures:no-common-ligatures;color:#009800" class="">libcublas.so.9.0</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;background-color:rgba(191,191,191,0.862745)" class="">
<span style="font-variant-ligatures:no-common-ligatures" class="">lrwxrwxrwx. 1 root   root          20 Mar 13 13:04
</span><span style="font-variant-ligatures:no-common-ligatures;color:#44ffff" class="">libcublas.so.9.0</span><span style="font-variant-ligatures:no-common-ligatures" class=""> ->
</span><span style="font-variant-ligatures:no-common-ligatures;color:#009800" class="">libcublas.so.9.0.282</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;background-color:rgba(191,191,191,0.862745)" class="">
<span style="font-variant-ligatures:no-common-ligatures" class="">-rwxr-xr-x. 1 root   root    52590576 Dec 19  2017
</span><span style="font-variant-ligatures:no-common-ligatures;color:#009800" class="">libcublas.so.9.0.176</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;background-color:rgba(191,191,191,0.862745)" class="">
<span style="font-variant-ligatures:no-common-ligatures" class="">-rwxr-xr-x. 1 root   root    55781312 Dec 19  2017
</span><span style="font-variant-ligatures:no-common-ligatures;color:#009800" class="">libcublas.so.9.0.282</span></div>
<div style="margin:0px;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;background-color:rgba(191,191,191,0.862745)" class="">
<span style="font-variant-ligatures:no-common-ligatures" class="">-rw-r--r--. 1 root   root    62813620 Dec 19  2017 libcublas_static.a</span></div>
</div>
<div class=""><br class="">
</div>
<div class=""><br class="">
<div class="">
<div class=""><i class="">Thanks,</i></div>
<div class=""><i class="">Yichong</i></div>
<div class=""><br class="">
</div>
</div>
<div class="">
<div class="h5">
<blockquote type="cite" class="">
<div class="">On Sep 6, 2018, at 3:14 PM, Yichong Xu <<a href="mailto:yichongx@cs.cmu.edu" target="_blank" class="">yichongx@cs.cmu.edu</a>> wrote:</div>
<br class="m_-4157814231952792245Apple-interchange-newline">
<div class="">
<div style="word-wrap:break-word;line-break:after-white-space" class="">1. I think yes. Biswajit and I cannot use the system cuda libraries.
<div class="">2. I think so as well. Predrag said he can run MATLAB with CUDA without problems (probably with root access), so I suspect there is a problem with the permission settings of the system libraries. We do not have root access on our accounts.</div>
<div class="">3. Not so far.</div>
<div class="">4. That could be a solution. Maybe we could have a publicly accessible library, as Jay-Yoon did, and that could work for us.</div>
<div class=""><br class="">
</div>
<div class="">Also, for gpu8: I just reinstalled PyTorch on the scratch space of gpu8 and it still does not work. I’m building the CUDA libraries right now to see if that works.</div>
<div class=""><br class="">
<div class="">
<div style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px" class="">
<i class="">Thanks,</i></div>
<div style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px" class="">
<i class="">Yichong</i></div>
<div style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px" class="">
<br class="">
</div>
<br class="m_-4157814231952792245Apple-interchange-newline">
</div>
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">On Sep 6, 2018, at 11:20 AM, Barnabas Poczos <<a href="mailto:bapoczos@cs.cmu.edu" target="_blank" class="">bapoczos@cs.cmu.edu</a>> wrote:</div>
<br class="m_-4157814231952792245Apple-interchange-newline">
<div class="">
<div class="">Hi All,<br class="">
<br class="">
I'm somewhat confused:<br class="">
<br class="">
* Do I understand correctly that Manzil actually is using the CUDA<br class="">
libraries installed by himself<br class="">
(/zfsauton/home/manzilz/local/<wbr class="">cuda-9.0/) and not the system libraries<br class="">
(/usr/local/cuda/lib64) ?<br class="">
* Since he is using different CUDA libraries is that the reason that<br class="">
pytorch is working for him and not for the other users? If so, should<br class="">
we double check the system libraries?<br class="">
* Do we know anyone who can use pytorch now with the CUDA system<br class="">
libraries ? If so, those users please let us know your system env<br class="">
variables.<br class="">
* As a quick solution, should we ask Manzil to copy his cuda libraries<br class="">
to a public place where others could access them?<br class="">
<br class="">
Best,<br class="">
Barnabas<br class="">
<br class="">
======================<br class="">
Barnabas Poczos, PhD<br class="">
Associate Professor<br class="">
Co-Director of PhD Program<br class="">
Machine Learning Department<br class="">
Carnegie Mellon University<br class="">
On Wed, Sep 5, 2018 at 5:33 PM Manzil Zaheer <<a href="mailto:manzil@cmu.edu" target="_blank" class="">manzil@cmu.edu</a>> wrote:<br class="">
<blockquote type="cite" class=""><br class="">
Here is my related env variables:<br class="">
<br class="">
<br class="">
<br class="">
CUDA_HOME=/zfsauton/home/<wbr class="">manzilz/local/cuda-9.0/<br class="">
<br class="">
LD_LIBRARY_PATH=/zfsauton/<wbr class="">home/manzilz/local/lib64:/<wbr class="">zfsauton/home/manzilz/local/<wbr class="">lib:/zfsauton/home/manzilz/<wbr class="">local/cuda-9.0/lib64:/usr/<wbr class="">local/cuda/lib64:<br class="">
<br class="">
PATH=/zfsauton/home/manzilz/<wbr class="">local/bin:/zfsauton/home/<wbr class="">manzilz/.local/bin:/zfsauton/<wbr class="">home/manzilz/local/cuda-9.0/<wbr class="">bin:/usr/local/cuda/bin:/usr/<wbr class="">lib64/qt-3.3/bin:/usr/local/<wbr class="">bin:/usr/bin:/usr/local/sbin:/<wbr class="">usr/sbin<br class="">
<br class="">
C_INCLUDE_PATH=/zfsauton/home/<wbr class="">manzilz/local/include:<br class="">
<br class="">
<br class="">
<br class="">
From: Biswajit Paria <<a href="mailto:bparia@cs.cmu.edu" target="_blank" class="">bparia@cs.cmu.edu</a>><br class="">
Sent: Wednesday, September 05, 2018 5:29 PM<br class="">
To: Yichong Xu <<a href="mailto:yichongx@cs.cmu.edu" target="_blank" class="">yichongx@cs.cmu.edu</a>><br class="">
Cc: Biswajit Paria <<a href="mailto:bparia@cs.cmu.edu" target="_blank" class="">bparia@cs.cmu.edu</a>>;
<a href="mailto:eyolcu@cs.cmu.edu" target="_blank" class="">eyolcu@cs.cmu.edu</a>; Predrag Punosevac <<a href="mailto:predragp@andrew.cmu.edu" target="_blank" class="">predragp@andrew.cmu.edu</a>>; Manzil Zaheer <<a href="mailto:manzil@cmu.edu" target="_blank" class="">manzil@cmu.edu</a>>;
<a href="mailto:users@autonlab.org" target="_blank" class="">users@autonlab.org</a><br class="">
Subject: Re: PyTorch problem<br class="">
<br class="">
<br class="">
<br class="">
If the CUDA examples work for anyone, can they share their PATH and LD_LIBRARY_PATH variables?<br class="">
<br class="">
<br class="">
<br class="">
Thanks<br class="">
<br class="">
<br class="">
<br class="">
On Wed, Sep 5, 2018 at 5:27 PM Yichong Xu <<a href="mailto:yichongx@cs.cmu.edu" target="_blank" class="">yichongx@cs.cmu.edu</a>> wrote:<br class="">
<br class="">
Given the problems Biswajit and I are having with CUDA, I think we should isolate the problem to CUDA (and the drivers) instead of wandering around Python or PyTorch.<br class="">
<br class="">
Predrag can you test the CUDA examples? I sort of agree with Manzil that this might be a user account problem.<br class="">
<br class="">
<br class="">
<br class="">
Thanks,<br class="">
<br class="">
Yichong<br class="">
<br class="">
<br class="">
<br class="">
<br class="">
<br class="">
<br class="">
<br class="">
On Sep 5, 2018, at 5:14 PM, Biswajit Paria <<a href="mailto:bparia@cs.cmu.edu" target="_blank" class="">bparia@cs.cmu.edu</a>> wrote:<br class="">
<br class="">
<br class="">
<br class="">
I just tried Yichong's way of testing cuBLAS, and get the same error as earlier:<br class="">
<br class="">
<br class="">
<br class="">
[Matrix Multiply CUBLAS] - Starting...<br class="">
<br class="">
GPU Device 0: "TITAN Xp" with compute capability 6.1<br class="">
<br class="">
<br class="">
<br class="">
MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)<br class="">
<br class="">
CUDA error at matrixMulCUBLAS.cpp:275 code=1(CUBLAS_STATUS_NOT_<wbr class="">INITIALIZED) "cublasCreate(&handle)"<br class="">
<br class="">
<br class="">
<br class="">
So I believe it is not a conda error. I also tried removing .nv, which doesn't help either. Maybe someone can share their PATH env variable?<br class="">
<br class="">
<br class="">
<br class="">
On Wed, Sep 5, 2018 at 5:08 PM Emre Yolcu <<a href="mailto:eyolcu@cs.cmu.edu" target="_blank" class="">eyolcu@cs.cmu.edu</a>> wrote:<br class="">
<br class="">
Manzil, could you share your `conda env export` (or equivalent) output for the environment you use for pytorch? It's still not working for me after reboot, maybe I can try replicating your exact setup and try with that.<br class="">
<br class="">
<br class="">
<br class="">
Thanks,<br class="">
<br class="">
<br class="">
<br class="">
Emre<br class="">
<br class="">
<br class="">
<br class="">
On Wed, Sep 5, 2018 at 4:56 PM, Predrag Punosevac <<a href="mailto:predragp@andrew.cmu.edu" target="_blank" class="">predragp@andrew.cmu.edu</a>> wrote:<br class="">
<br class="">
Manzil Zaheer <<a href="mailto:manzil@cmu.edu" target="_blank" class="">manzil@cmu.edu</a>> wrote:<br class="">
<br class="">
<blockquote type="cite" class="">It was working for me before the reboot as well. PyTorch does work on all<br class="">
nodes for me.<br class="">
</blockquote>
<br class="">
Aha! Gotcha.<br class="">
<br class="">
<blockquote type="cite" class=""><br class="">
What I am trying to say is that I think it is not an issue at the system level but<br class="">
at the user account level. I might be wrong though.<br class="">
</blockquote>
<br class="">
That was my hunch as well. They were trying to convince me in a 150-e-mail<br class="">
chain over the weekend that pytorch was broken when I replaced a<br class="">
failed HDD on the main file server. That didn't make any sense.<br class="">
<br class="">
Could you please share your binaries and setup with other pytorch<br class="">
users?<br class="">
<br class="">
Cheers,<br class="">
Predrag<br class="">
<br class="">
<blockquote type="cite" class=""><br class="">
<br class="">
-------- Original message --------<br class="">
From: Predrag Punosevac <<a href="mailto:predragp@andrew.cmu.edu" target="_blank" class="">predragp@andrew.cmu.edu</a>><br class="">
Date: 9/5/18 4:44 PM (GMT-05:00)<br class="">
To: Manzil Zaheer <<a href="mailto:manzil@cmu.edu" target="_blank" class="">manzil@cmu.edu</a>><br class="">
Cc: Biswajit Paria <<a href="mailto:bparia@cs.cmu.edu" target="_blank" class="">bparia@cs.cmu.edu</a>>, Yichong Xu <<a href="mailto:yichongx@cs.cmu.edu" target="_blank" class="">yichongx@cs.cmu.edu</a>>, Emre Yolcu <<a href="mailto:eyolcu@cs.cmu.edu" target="_blank" class="">eyolcu@cs.cmu.edu</a>>,
<a href="mailto:users@autonlab.org" target="_blank" class="">users@autonlab.org</a><br class="">
Subject: Re: PyTorch problem<br class="">
<br class="">
Should I go ahead and reboot all GPU computing nodes? Can somebody else confirm that a reboot fixes the issue?<br class="">
<br class="">
Predrag<br class="">
<br class="">
On Wed, Sep 5, 2018 at 4:42 PM, Manzil Zaheer <<a href="mailto:manzil@cmu.edu" target="_blank" class="">manzil@cmu.edu</a><<a href="mailto:manzil@cmu.edu" target="_blank" class="">mailto:manzil@<wbr class="">cmu.edu</a>>> wrote:<br class="">
It does work for me and my friends<br class="">
<br class="">
<br class="">
<br class="">
<br class="">
-------- Original message --------<br class="">
From: Predrag Punosevac <<a href="mailto:predragp@andrew.cmu.edu" target="_blank" class="">predragp@andrew.cmu.edu</a><<a href="mailto:predragp@andrew.cmu.edu" target="_blank" class="">mailt<wbr class="">o:predragp@andrew.cmu.edu</a>>><br class="">
Date: 9/5/18 4:40 PM (GMT-05:00)<br class="">
To: Biswajit Paria <<a href="mailto:bparia@cs.cmu.edu" target="_blank" class="">bparia@cs.cmu.edu</a><<a href="mailto:bparia@cs.cmu.edu" target="_blank" class="">mailto:<wbr class="">bparia@cs.cmu.edu</a>>><br class="">
Cc: Manzil Zaheer <<a href="mailto:manzil@cmu.edu" target="_blank" class="">manzil@cmu.edu</a><<a href="mailto:manzil@cmu.edu" target="_blank" class="">mailto:manzil@<wbr class="">cmu.edu</a>>>, Yichong Xu <<a href="mailto:yichongx@cs.cmu.edu" target="_blank" class="">yichongx@cs.cmu.edu</a><<a href="mailto:yichongx@cs.cmu.edu" target="_blank" class="">mailto:<wbr class="">yichongx@cs.cmu.edu</a>>>,
 Emre Yolcu <<a href="mailto:eyolcu@cs.cmu.edu" target="_blank" class="">eyolcu@cs.cmu.edu</a><<a href="mailto:eyolcu@cs.cmu.edu" target="_blank" class="">mailto:<wbr class="">eyolcu@cs.cmu.edu</a>>>,
<a href="mailto:users@autonlab.org" target="_blank" class="">users@autonlab.org</a><<a href="mailto:users@autonlab.org" target="_blank" class="">mailto:<wbr class="">users@autonlab.org</a>><br class="">
Subject: Re: PyTorch problem<br class="">
<br class="">
I just rebooted GPU8. All packages are up to date. NVidia driver appears to be working properly and I can do GPU computations from MATLAB. Let's try now to get pytorch working on GPU8.<br class="">
<br class="">
Predrag<br class="">
<br class="">
On Wed, Sep 5, 2018 at 12:19 AM, Biswajit Paria <<a href="mailto:bparia@cs.cmu.edu" target="_blank" class="">bparia@cs.cmu.edu</a><<a href="mailto:bparia@cs.cmu.edu" target="_blank" class="">mailto:<wbr class="">bparia@cs.cmu.edu</a>>> wrote:<br class="">
I am facing a similar error on all GPU machines. Did someone find a solution yet?<br class="">
<br class="">
<br class="">
2018-09-05 00:27:41.546064: E tensorflow/stream_executor/<wbr class="">cuda/<a href="http://cuda_blas.cc:459/" target="_blank" class="">cuda_blas.cc:459</a>] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED<br class="">
<br class="">
On Tue, Sep 4, 2018 at 10:03 PM Manzil Zaheer <<a href="mailto:manzil@cmu.edu" target="_blank" class="">manzil@cmu.edu</a><<a href="mailto:manzil@cmu.edu" target="_blank" class="">mailto:manzil@<wbr class="">cmu.edu</a>>> wrote:<br class="">
Hi Yichong<br class="">
<br class="">
Yes, I am able to run TF and PyTorch on these machines. Recently someone else had a similar issue, but it got fixed by reinstalling some local packages.<br class="">
<br class="">
Thanks,<br class="">
Manzil<br class="">
<br class="">
<br class="">
-------- Original message --------<br class="">
From: Yichong Xu <<a href="mailto:yichongx@cs.cmu.edu" target="_blank" class="">yichongx@cs.cmu.edu</a><<a href="mailto:yichongx@cs.cmu.edu" target="_blank" class="">mailto:<wbr class="">yichongx@cs.cmu.edu</a>>><br class="">
Date: 9/4/18 9:58 PM (GMT-05:00)<br class="">
To: Emre Yolcu <<a href="mailto:eyolcu@cs.cmu.edu" target="_blank" class="">eyolcu@cs.cmu.edu</a><<a href="mailto:eyolcu@cs.cmu.edu" target="_blank" class="">mailto:<wbr class="">eyolcu@cs.cmu.edu</a>>>, Predrag Punosevac <<a href="mailto:predragp@andrew.cmu.edu" target="_blank" class="">predragp@andrew.cmu.edu</a><<a href="mailto:predragp@andrew.cmu.edu" target="_blank" class="">mailt<wbr class="">o:predragp@andrew.cmu.edu</a>>><br class="">
Cc: <a href="mailto:users@autonlab.org" target="_blank" class="">users@autonlab.org</a><<a href="mailto:users@autonlab.org" target="_blank" class="">mailto:<wbr class="">users@autonlab.org</a>><br class="">
Subject: Re: PyTorch problem<br class="">
<br class="">
Just wondering: can TensorFlow run well on these machines? I hope someone can confirm this so that we can isolate the problem.<br class="">
OK, so here’s a further test: I tried running the CUDA examples from the CUDA installation (in /usr/local/cuda/sample), on gpu2 in my scratch directory. Simple jobs like deviceQuery succeed, but simpleCUBLAS failed:<br class="">
yichongx@gpu2$ cd /home/scratch/yichongx/<br class="">
yichongx@gpu2$ cd<br class="">
0_Simple/        2_Graphics/      4_Finance/       6_Advanced/      bin/             conda/<br class="">
1_Utilities/     3_Imaging/       5_Simulations/   7_CUDALibraries/ common/          miniconda3/<br class="">
yichongx@gpu2$ cd 7_CUDALibraries/<br class="">
yichongx@gpu2$ cd simpleCUBLAS<br class="">
yichongx@gpu2$ CUDA_VISIBLE_DEVICES=3 ./simpleCUBLAS<br class="">
GPU Device 0: "TITAN X (Pascal)" with compute capability 6.1<br class="">
<br class="">
simpleCUBLAS test running..<br class="">
!!!! CUBLAS initialization error<br class="">
yichongx@gpu2$<br class="">
<br class="">
<br class="">
This is also consistent with our previous errors from pytorch, which say cublas library not initialized.<br class="">
<br class="">
So this means at least there is some problem with CUBLAS on gpu2. This post suggests that using sudo can resolve this problem, and this is probably because of some permission problems on CUBLAS libraries:<br class="">
<a href="https://devtalk.nvidia.com/default/topic/1027602/cuda-setup-and-installation/cublas-libraries-with-incorrect-permissions/" target="_blank" class="">https://devtalk.nvidia.com/<wbr class="">default/topic/1027602/cuda-<wbr class="">setup-and-installation/cublas-<wbr class="">libraries-with-incorrect-<wbr class="">permissions/</a><br class="">
@Predrag: Can you try running the simpleCUBLAS example from the CUDA library, with and without root privilege? I think that might be something that you are more familiar with. Thank you very much!<br class="">
<br class="">
<br class="">
Thanks,<br class="">
Yichong<br class="">
<br class="">
</blockquote>
<br class="">
<blockquote type="cite" class="">On Sep 4, 2018, at 3:18 PM, Emre Yolcu <<a href="mailto:eyolcu@cs.cmu.edu" target="_blank" class="">eyolcu@cs.cmu.edu</a><<a href="mailto:eyolcu@cs.cmu.edu" target="_blank" class="">mailto:<wbr class="">eyolcu@cs.cmu.edu</a>>>
 wrote:<br class="">
<br class="">
Hi,<br class="">
<br class="">
We are trying to troubleshoot the PyTorch issue with Predrag and were wondering:<br class="">
<br class="">
Is anybody able to run PyTorch GPU models on gpu1-9? If you can, we would appreciate it if you could respond.<br class="">
<br class="">
Also, is it a problem for anyone if gpu8 is rebooted today?<br class="">
<br class="">
Thanks,<br class="">
<br class="">
Emre<br class="">
<br class="">
<br class="">
<br class="">
--<br class="">
Biswajit Paria<br class="">
PhD in ML @ CMU<br class="">
<br class="">
<br class="">
</blockquote>
<br class="">
<br class="">
<br class="">
<br class="">
<br class="">
<br class="">
</blockquote>
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</div>
</div>
</blockquote>
</div>
</div>
</div>
<br class="">
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</body>
</html>