PyTorch problem
Predrag Punosevac
predragp at andrew.cmu.edu
Wed Sep 5 17:22:49 EDT 2018
People should use /opt/rh/rh-python36.
I did install /opt/miniconda3,
but I am not a big fan of it.
Predrag
On Wed, Sep 5, 2018 at 5:12 PM, Benedikt Boecking <boecking at andrew.cmu.edu>
wrote:
> Not sure this will help, but I (very) recently had issues with software
> installed via conda linking to some of my local python installations.
> Removing and reinstalling the packages did not help. Ultimately, I removed
> all my local installs in ~/.local/lib/python* and installed conda again
> from scratch. It has been working like a charm since then.
>
> Best,
> Ben
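
The situation Ben describes, where packages under ~/.local/lib/python* get picked up instead of the ones in the conda environment, can be checked before removing anything. A minimal sketch using only the standard library; the helper names is_under and find_user_site_shadowing are made up for this note, and the module list at the bottom is just an example:

```python
import importlib.util
import os
import site
import sys


def is_under(path, root):
    """Return True if path is located inside the directory root."""
    path = os.path.realpath(path)
    root = os.path.realpath(root)
    return os.path.commonpath([path, root]) == root


def find_user_site_shadowing(module_names):
    """Map each module name to its file if it resolves inside the user site dir."""
    user_site = site.getusersitepackages()
    shadowed = {}
    for name in module_names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            spec = None
        if spec is None or not spec.origin or not os.path.isabs(spec.origin):
            continue  # not installed, built-in, or a namespace package
        if is_under(spec.origin, user_site):
            shadowed[name] = spec.origin
    return shadowed


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("user site:  ", site.getusersitepackages())
    # Example module list; extend it with whatever the failing job imports.
    for name, origin in find_user_site_shadowing(["torch", "numpy"]).items():
        print(f"{name} resolves to the user site: {origin}")
```

Running the interpreter with `python -s`, or with PYTHONNOUSERSITE set in the environment, keeps the user site directory off sys.path, which is a quick way to test this theory without deleting anything.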
>
>
>
> On Sep 5, 2018, at 5:07 PM, Emre Yolcu <eyolcu at cs.cmu.edu> wrote:
>
> Manzil, could you share your `conda env export` (or equivalent) output for
> the environment you use for pytorch? It's still not working for me after
> reboot, maybe I can try replicating your exact setup and try with that.
>
> Thanks,
>
> Emre
>
> On Wed, Sep 5, 2018 at 4:56 PM, Predrag Punosevac <predragp at andrew.cmu.edu> wrote:
>
>> Manzil Zaheer <manzil at cmu.edu> wrote:
>>
>> > It was working for me before the reboot as well. PyTorch does work on all
>> > nodes for me.
>>
>> Aha! Gotcha.
>>
>> >
>> > What I am trying to say is that I think it is not an issue at the system
>> > level but at the user account level. I might be wrong, though.
>>
>> That was my hunch as well. They were trying to convince me in a
>> 150-e-mail chain over the weekend that pytorch was broken when I replaced a
>> failed HDD on the main file server. That didn't make any sense.
>>
>> Could you please share your binaries and setup with other pytorch
>> users?
>>
>> Cheers,
>> Predrag
>>
>> >
>> >
>> > -------- Original message --------
>> > From: Predrag Punosevac <predragp at andrew.cmu.edu>
>> > Date: 9/5/18 4:44 PM (GMT-05:00)
>> > To: Manzil Zaheer <manzil at cmu.edu>
>> > Cc: Biswajit Paria <bparia at cs.cmu.edu>, Yichong Xu <yichongx at cs.cmu.edu>,
>> Emre Yolcu <eyolcu at cs.cmu.edu>, users at autonlab.org
>> > Subject: Re: PyTorch problem
>> >
>> > Should I go ahead and reboot all GPU computing nodes? Can somebody else
>> confirm that a reboot fixes the issue?
>> >
>> > Predrag
>> >
>> > On Wed, Sep 5, 2018 at 4:42 PM, Manzil Zaheer <manzil at cmu.edu> wrote:
>> > It does work for me and my friends
>> >
>> >
>> >
>> >
>> > -------- Original message --------
>> > From: Predrag Punosevac <predragp at andrew.cmu.edu>
>> > Date: 9/5/18 4:40 PM (GMT-05:00)
>> > To: Biswajit Paria <bparia at cs.cmu.edu>
>> > Cc: Manzil Zaheer <manzil at cmu.edu>, Yichong Xu <yichongx at cs.cmu.edu>,
>> > Emre Yolcu <eyolcu at cs.cmu.edu>, users at autonlab.org
>> > Subject: Re: PyTorch problem
>> >
>> > I just rebooted GPU8. All packages are up to date. NVidia driver
>> appears to be working properly and I can do GPU computations from MATLAB.
>> Let's try now to get pytorch working on GPU8.
>> >
>> > Predrag
>> >
>> > On Wed, Sep 5, 2018 at 12:19 AM, Biswajit Paria <bparia at cs.cmu.edu> wrote:
>> > I am facing a similar error on all GPU machines. Did someone find a
>> solution yet?
>> >
>> >
>> > 2018-09-05 00:27:41.546064: E tensorflow/stream_executor/cuda/cuda_blas.cc:459] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
>> >
>> > On Tue, Sep 4, 2018 at 10:03 PM Manzil Zaheer <manzil at cmu.edu> wrote:
>> > Hi Yichong
>> >
>> > Yes, I am able to run TF and PyTorch on these machines. Recently someone
>> > else also had a similar issue, but it got fixed by reinstalling some local
>> > packages.
>> >
>> > Thanks,
>> > Manzil
>> >
>> >
>> > -------- Original message --------
>> > From: Yichong Xu <yichongx at cs.cmu.edu>
>> > Date: 9/4/18 9:58 PM (GMT-05:00)
>> > To: Emre Yolcu <eyolcu at cs.cmu.edu>, Predrag Punosevac <predragp at andrew.cmu.edu>
>> > Cc: users at autonlab.org
>> > Subject: Re: PyTorch problem
>> >
>> > Just wondering - can Tensorflow run well on these machines? I hope
>> > someone can confirm this so that we can isolate the problem.
>> > OK, so here's a further test: I tried running the CUDA examples from the
>> > CUDA installation (in /usr/local/cuda/sample) on gpu2 in my scratch
>> > directory. Simple jobs like deviceQuery succeed, but simpleCUBLAS failed:
>> > yichongx at gpu2$ cd /home/scratch/yichongx/
>> > yichongx at gpu2$ cd
>> > 0_Simple/     2_Graphics/  4_Finance/      6_Advanced/       bin/     conda/
>> > 1_Utilities/  3_Imaging/   5_Simulations/  7_CUDALibraries/  common/  miniconda3/
>> > yichongx at gpu2$ cd 7_CUDALibraries/
>> > yichongx at gpu2$ cd simpleCUBLAS
>> > yichongx at gpu2$ CUDA_VISIBLE_DEVICES=3 ./simpleCUBLAS
>> > GPU Device 0: "TITAN X (Pascal)" with compute capability 6.1
>> >
>> > simpleCUBLAS test running..
>> > !!!! CUBLAS initialization error
>> > yichongx at gpu2$
>> >
>> >
>> > This is also consistent with our previous errors from pytorch, which
>> say cublas library not initialized.
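
The same check can be made from the PyTorch side, which helps tell a node-level cuBLAS problem from a per-user environment problem. A minimal sketch, assuming PyTorch 0.4 or later; the function name cublas_smoke_test is made up for this note, and the claim that a float32 GPU matmul goes through cuBLAS is the usual behaviour rather than something verified on these nodes:

```python
def cublas_smoke_test(size=256):
    """Run one small GPU matrix multiply and report what happened as a string."""
    try:
        import torch
    except (ImportError, OSError) as err:
        return f"torch could not be imported: {err}"
    if not torch.cuda.is_available():
        return "torch cannot see a CUDA device"
    try:
        a = torch.randn(size, size, device="cuda")
        b = torch.randn(size, size, device="cuda")
        c = a @ b  # a GPU matmul is normally handed to cuBLAS
        torch.cuda.synchronize()  # make asynchronous CUDA errors surface here
    except RuntimeError as err:
        return f"GPU matmul failed: {err}"
    name = torch.cuda.get_device_name(0)
    return f"GPU matmul OK on {name}, result shape {tuple(c.shape)}"


if __name__ == "__main__":
    print(cublas_smoke_test())
```

As with the simpleCUBLAS run above, CUDA_VISIBLE_DEVICES can be set to pick the card. If this fails for one user and passes for another on the same node and card, the system-level install is less likely to be the culprit.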
>> >
>> > So this means at least there is some problem with CUBLAS on gpu2. This
>> post suggests that using sudo can resolve this problem, and this is
>> probably because of some permission problems on CUBLAS libraries:
>> > https://devtalk.nvidia.com/default/topic/1027602/cuda-setup-and-installation/cublas-libraries-with-incorrect-permissions/
>> > @Predrag: Can you try running the simpleCUBLAS example from the CUDA
>> library, with and without root privilege? I think that might be something
>> that you are more familiar with. Thank you very much!
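
The permission theory can also be looked at without root by listing the mode bits on the cuBLAS library files. A small sketch, assuming the libraries live under /usr/local/cuda/lib64 (that directory is a guess based on the install path mentioned above, so adjust it as needed); the function name world_unreadable is hypothetical:

```python
import glob
import os
import stat


def world_unreadable(paths):
    """Return paths that cannot be stat'ed or lack the read bit for other users."""
    flagged = []
    for path in paths:
        try:
            mode = os.stat(path).st_mode  # follows symlinks to the real file
        except OSError:
            flagged.append(path)  # e.g. a dangling symlink or unreadable directory
            continue
        if not mode & stat.S_IROTH:
            flagged.append(path)
    return flagged


if __name__ == "__main__":
    # Assumed location; change the pattern to match the local CUDA install.
    libs = sorted(glob.glob("/usr/local/cuda/lib64/libcublas*"))
    if not libs:
        print("no libcublas files found under the assumed path")
    for path in world_unreadable(libs):
        print("not readable by ordinary users:", path)
```

This only inspects the "other" permission bit, so it will not catch restrictions imposed through ACLs or group membership.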
>> >
>> >
>> > Thanks,
>> > Yichong
>> >
>> > On Sep 4, 2018, at 3:18 PM, Emre Yolcu <eyolcu at cs.cmu.edu> wrote:
>> >
>> > Hi,
>> >
>> > We are trying to troubleshoot the PyTorch issue with Predrag and were
>> wondering:
>> >
>> > Is anybody able to run PyTorch GPU models on gpu1-9? If you can, we
>> > would appreciate it if you could respond.
>> >
>> > Also, is it a problem for anyone if gpu8 is rebooted today?
>> >
>> > Thanks,
>> >
>> > Emre
>> >
>> >
>> >
>> > --
>> > Biswajit Paria
>> > PhD in ML @ CMU
>> >
>> >
>>
>
>
>