cuda problem

Chufan Gao chufang at andrew.cmu.edu
Wed Aug 19 12:39:04 EDT 2020


Hi All,

I ran into a side issue: TensorFlow does indeed detect all of the
GPUs, but PyTorch now doesn't work.
After some fiddling, I figured out that PyTorch installed via conda
doesn't link up with CUDA correctly, but reinstalling it through pip does.

So if anyone else is having this issue, try reinstalling through pip.
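
To tell whether you are hitting the same thing before reinstalling, a
rough sketch like the one below may help. It is only an illustration:
the function names torch_cuda_report and toolkit_report are made up for
this example, and the default path /usr/local/cuda is an assumption
taken from the paths mentioned further down in this thread. It reports
what the installed PyTorch build says about CUDA, and whether the CUDA
directory still contains bin/ptxas and extras/CUPTI.

```python
import os


def torch_cuda_report():
    """Return what the installed PyTorch build reports about CUDA."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment at all.
        return {"installed": False}
    return {
        "installed": True,
        "torch_version": torch.__version__,
        # None here means the build was compiled without CUDA support.
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
    }


def toolkit_report(cuda_home="/usr/local/cuda"):
    """Resolve the (possibly symlinked) CUDA directory and look for key files."""
    real = os.path.realpath(cuda_home)
    return {
        "resolved_path": real,
        "has_ptxas": os.path.isfile(os.path.join(real, "bin", "ptxas")),
        "has_cupti": os.path.isdir(os.path.join(real, "extras", "CUPTI")),
    }


if __name__ == "__main__":
    print(torch_cuda_report())
    print(toolkit_report())
```

If built_with_cuda is None or cuda_available is False on a node where
nvidia-smi shows the GPUs, the PyTorch build itself is the likely
culprit, which would be consistent with what I saw with the conda
install.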

On Tue, Aug 18, 2020 at 10:07 PM Ifigeneia Apostolopoulou <
iapostol at andrew.cmu.edu> wrote:

> At least for me, Friday evening (or beyond) is fine. All these servers are
> currently very underutilized
> (or running very old processes with models probably compiled before the
> issue popped up). I am not sure if this is because
> other people have faced similar problems (with me being the first to
> 'complain'). In the meantime, and for better job scheduling,
> it may be better if anyone who doesn't encounter a similar problem
> prefers one of those nodes (gpu2,10,11,12,13,14,21), though.
>
> thanks again and have a good night!
>
>
>
> On Tue, Aug 18, 2020 at 9:44 PM Predrag Punosevac <predragp at andrew.cmu.edu>
> wrote:
>
>> Ifigeneia Apostolopoulou <iapostol at andrew.cmu.edu> wrote:
>>
>> > Predrag, now it works fine. thanks a million! :-D
>> >
>> > gpu2,10,11,12,13,14,21 seem to have a similar issue.
>>
>> I am going to sit on this info at least until Friday evening. You are
>> not supposed to use more than 2-3 nodes at the same time anyway. If
>> those servers work for other people who might not even use TensorFlow I
>> would prefer not to reboot them. It takes about 1.5h to rebuild each
>> machine. You just listed 7 machines. That is 10.5h of work if everything
>> goes without a hitch.
>>
>> Cheers,
>> Predrag
>>
>> >
>> >
>> >
>> > On Tue, Aug 18, 2020 at 5:23 PM Predrag Punosevac <
>> predragp at andrew.cmu.edu>
>> > wrote:
>> >
>> > > Ifigeneia Apostolopoulou <iapostol at andrew.cmu.edu> wrote:
>> > >
>> > > > yes, but there is still no bin/ptxas in cuda 10.2. actually there's no
>> > > > bin directory. it seems that cuda-10.2 is corrupted?
>> > > >
>> > >
>> > > I took a clue from your message and did a fresh installation of CUDA
>> > > on GPU1 only. I upgraded the kernel and the driver to the latest one
>> > > supporting branch 7.8 of RedHat. The driver works as expected in my
>> > > limited testing. CUDA is upgraded to the newly released 11.0. I really
>> > > hate that NVidia is intentionally breaking previous stable releases as
>> > > soon as the new one is branched out.
>> > >
>> > > Could you please try building TensorFlow on GPU1 and report the
>> > > progress? We will eventually have to upgrade all GPU nodes to CUDA 11
>> > > even if they are fully working now.
>> > >
>> > > Best,
>> > > Predrag
>> > >
>> > >
>> > >
>> > > > On Tue, Aug 18, 2020 at 11:41 AM Predrag Punosevac <
>> > > predragp at andrew.cmu.edu>
>> > > > wrote:
>> > > >
>> > > > > Because the cuda folder is the cuda 10.2 folder. The cuda folder is
>> > > > > typically just a symbolic link to the current version of cuda.
>> > > > >
>> > > > > On Tue, Aug 18, 2020, 11:31 AM Kyle Miller <
>> mille856 at andrew.cmu.edu>
>> > > > > wrote:
>> > > > >
>> > > > >> I see. I ran a few find commands on gpu13, but I couldn't find a cuda
>> > > > >> folder or CUPTI.
>> > > > >>
>> > > > >> On Tue, Aug 18, 2020 at 10:00 AM Ifigeneia Apostolopoulou <
>> > > > >> iapostol at andrew.cmu.edu> wrote:
>> > > > >>
>> > > > >>> Hi Kyle,
>> > > > >>> Thanks a lot for your reply!
>> > > > >>>
>> > > > >>> I also had this issue and I solved it as you did. However, this
>> > > > >>> seems to be another issue:
>> > > > >>> I currently can't see CUPTI in /usr/local/cuda/extras/CUPTI (or
>> > > > >>> anywhere on gpu1 to add it to my path), which causes the issue.
>> > > > >>> I am also attaching the screenshot with the working (gpu3) and
>> > > > >>> not-working (gpu1) case. On gpu1, gpu2, gpu13, it seems that the
>> > > > >>> directory cuda (and all its content) has been moved (and I can't
>> > > > >>> find it in any other directory).
>> > > > >>>
>> > > > >>>
>> > > > >>>
>> > > > >>>
>> > > > >>>
>> > > > >>> On Tue, Aug 18, 2020 at 9:32 AM Kyle Miller <
>> mille856 at andrew.cmu.edu
>> > > >
>> > > > >>> wrote:
>> > > > >>>
>> > > > >>>> Ifi,
>> > > > >>>>    I recently had difficulty on GPU13, having not used it in a
>> > > > >>>> long while. For me, the issue was that miniconda had moved. I added
>> > > > >>>> /opt/miniconda-py38/bin to my path and rebuilt my environment
>> > > > >>>> (not sure if that was necessary). Then it worked.
>> > > > >>>> -Kyle
>> > > > >>>>
>> > > > >>>> On Tue, Aug 18, 2020 at 2:14 AM Predrag Punosevac <
>> > > > >>>> predragp at andrew.cmu.edu> wrote:
>> > > > >>>>
>> > > > >>>>> Ifigeneia Apostolopoulou <iapostol at andrew.cmu.edu> wrote:
>> > > > >>>>>
>> > > > >>>>> > Hi Predrag,
>> > > > >>>>> >
>> > > > >>>>> > I hope that this (weird) summer is going well!
>> > > > >>>>> >
>> > > > >>>>> > I noticed a change in servers gpu1, gpu2, gpu13, gpu14.
>> > > > >>>>> > Specifically, I no longer can find
>> > > > >>>>>
>> > > > >>>>> I have not touched those servers in a very long time. I am
>> > > > >>>>> CC-ing the users mailing list. My brain is shutting down at this
>> > > > >>>>> late hour. Maybe somebody could be of more help tomorrow morning.
>> > > > >>>>>
>> > > > >>>>> >
>> > > > >>>>> > /usr/local/cuda/extras/CUPTI
>> > > > >>>>> >
>> > > > >>>>>
>> > > > >>>>> I believe you.
>> > > > >>>>>
>> > > > >>>>>
>> > > > >>>>> > which results in the error when I'm building my tensorflow
>> > > > >>>>> > models.
>> > > > >>>>> >
>> > > > >>>>> >  Not found: ./bin/ptxas not found. Relying on driver to perform ptx
>> > > > >>>>> > compilation. This message will be only logged once.
>> > > > >>>>> >
>> > > > >>>>> > Any ideas, how could I solve this issue? Would it be possible to
>> > > > >>>>> > restore the cuda directory?
>> > > > >>>>> >
>> > > > >>>>> > Also, I currently do not have access to gpu21.
>> > > > >>>>>
>> > > > >>>>> It is fixed now. I just restarted the sssd daemon. Please don't
>> > > > >>>>> use gpu20 and gpu21 unless you are training 3D neural networks for
>> > > > >>>>> which you need a lot of GPU memory.
>> > > > >>>>>
>> > > > >>>>> Predrag
>> > > > >>>>>
>> > > > >>>>>
>> > > > >>>>> >
>> > > > >>>>> > Thanks a lot in advance!
>> > > > >>>>>
>> > > > >>>>
>> > >
>>
>