cuda problem
Predrag Punosevac
predragp at andrew.cmu.edu
Tue Aug 18 21:44:29 EDT 2020
Ifigeneia Apostolopoulou <iapostol at andrew.cmu.edu> wrote:
> Predrag, now it works fine. thanks a million! :-D
>
> gpu2,10,11,12,13,14,21 seem to have a similar issue.
I am going to sit on this info at least until Friday evening. You are
not supposed to use more than 2-3 nodes at the same time anyway. If
those servers work for other people who might not even use TensorFlow I
would prefer not to reboot them. It takes about 1.5h to rebuild each
machine. You just listed 7 machines. That is 10.5h of work if everything
goes without a hitch.
Cheers,
Predrag
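
Editor's sketch: the checks discussed in the quoted thread below (whether
/usr/local/cuda is a symbolic link to a versioned install, and whether
bin/ptxas and extras/CUPTI exist under it) could be scripted roughly as
follows before deciding which nodes need a rebuild. This is only a minimal
sketch and not the procedure used on these servers; the function name
check_cuda_root and the EXPECTED list are hypothetical, and the paths are
the ones mentioned in the thread.

```python
import os
from pathlib import Path

# Paths (relative to the CUDA root) that the thread reports as missing.
# The list is an assumption for illustration; extend it as needed.
EXPECTED = ("bin/ptxas", "extras/CUPTI")


def check_cuda_root(root="/usr/local/cuda"):
    """Report where a CUDA root points and which expected paths are absent."""
    root = Path(root)
    report = {
        "root": str(root),
        "is_symlink": root.is_symlink(),
        # realpath follows the link even when the target no longer exists
        "resolved": os.path.realpath(root),
        "missing": [],
    }
    for rel in EXPECTED:
        # exists() follows symlinks, so a dangling link counts as missing
        if not (root / rel).exists():
            report["missing"].append(rel)
    return report


if __name__ == "__main__":
    r = check_cuda_root()
    print(f"{r['root']} -> {r['resolved']} (symlink: {r['is_symlink']})")
    if r["missing"]:
        print("missing:", ", ".join(r["missing"]))
    else:
        print("ptxas and CUPTI are present")
```

Running the script on each node in question would show whether the cuda
link is dangling or whether only some files under it are gone.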
>
>
>
> On Tue, Aug 18, 2020 at 5:23 PM Predrag Punosevac <predragp at andrew.cmu.edu>
> wrote:
>
> > Ifigeneia Apostolopoulou <iapostol at andrew.cmu.edu> wrote:
> >
> > > yes, but there is still no bin/ptxas in cuda 10.2. actually there's no bin
> > > directory. it seems that cuda-10.2 is corrupted?
> > >
> >
> > I took a clue from your message and did a fresh installation of CUDA
> > on GPU1 only. I upgraded the kernel and the driver to the latest one
> > supporting branch 7.8 of RedHat. The driver works as expected in my
> > limited testing. CUDA is upgraded to the newly released 11.0. I really
> > hate that NVidia is intentionally breaking previous stable releases as
> > soon as the new one is branched out.
> >
> > Could you please try building TensorFlow on GPU1 and report the
> > progress? We will eventually have to upgrade all GPU nodes to CUDA 11
> > even if they are fully working now.
> >
> > Best,
> > Predrag
> >
> >
> >
> > > On Tue, Aug 18, 2020 at 11:41 AM Predrag Punosevac <predragp at andrew.cmu.edu>
> > > wrote:
> > >
> > > > Because the cuda folder is the cuda 10.2 folder. The cuda folder is
> > > > typically just a symbolic link to the current version of cuda.
> > > >
> > > > On Tue, Aug 18, 2020, 11:31 AM Kyle Miller <mille856 at andrew.cmu.edu>
> > > > wrote:
> > > >
> > > > >> I see. I ran a few find commands on gpu13, I couldn't find a cuda folder
> > > > >> or CUPTI.
> > > >>
> > > >> On Tue, Aug 18, 2020 at 10:00 AM Ifigeneia Apostolopoulou <
> > > >> iapostol at andrew.cmu.edu> wrote:
> > > >>
> > > >>> Hi Kyle,
> > > >>> Thanks a lot for your reply!
> > > >>>
> > > > >>> I also had this issue and I solved it as you did. However, this seems to
> > > > >>> be another issue:
> > > > >>> I currently can't see CUPTI in /usr/local/cuda/extras/CUPTI (or anywhere
> > > > >>> on gpu1 to set it to my path) which causes the issue.
> > > > >>> I am also attaching the screenshot with the working (gpu3) and
> > > > >>> not-working (gpu1) case. On gpu1, gpu2, gpu13, it seems that the directory
> > > > >>> cuda (and all its content) has been moved (and I can't find it in any other
> > > > >>> directory).
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > > >>> On Tue, Aug 18, 2020 at 9:32 AM Kyle Miller <mille856 at andrew.cmu.edu>
> > > > >>> wrote:
> > > >>>
> > > >>>> Ifi,
> > > > >>>> I recently had difficulty on GPU13, having not used it in a long
> > > > >>>> while. For me, the issue was that miniconda had moved. I added
> > > > >>>> /opt/miniconda-py38/bin to my path and rebuilt my environment (not sure if
> > > > >>>> that was necessary). Then it worked.
> > > >>>> -Kyle
> > > >>>>
> > > >>>> On Tue, Aug 18, 2020 at 2:14 AM Predrag Punosevac <
> > > >>>> predragp at andrew.cmu.edu> wrote:
> > > >>>>
> > > >>>>> Ifigeneia Apostolopoulou <iapostol at andrew.cmu.edu> wrote:
> > > >>>>>
> > > >>>>> > Hi Predrag,
> > > >>>>> >
> > > >>>>> > I hope that this (weird) summer is going well!
> > > >>>>> >
> > > >>>>> > I noticed a change in servers gpu1, gpu2, gpu13, gpu14.
> > > >>>>> > Specifically, I no longer can find
> > > >>>>>
> > > > >>>>> I have not touched those servers in a very long time. I am CC-ing the users
> > > > >>>>> mailing list. My brain is shutting down at this late hour. Maybe
> > > > >>>>> somebody could be of more help tomorrow morning.
> > > >>>>>
> > > >>>>> >
> > > >>>>> > /usr/local/cuda/extras/CUPTI
> > > >>>>> >
> > > >>>>>
> > > >>>>> I believe you.
> > > >>>>>
> > > >>>>>
> > > > >>>>> > which results in the error when I'm building my tensorflow models.
> > > > >>>>> >
> > > > >>>>> > Not found: ./bin/ptxas not found. Relying on driver to perform ptx
> > > > >>>>> > compilation. This message will be only logged once.
> > > > >>>>> >
> > > > >>>>> > Any ideas, how could I solve this issue? Would it be possible to restore
> > > > >>>>> > the cuda directory?
> > > >>>>> >
> > > >>>>> > Also, I currently do not have access to gpu21.
> > > >>>>>
> > > > >>>>> It is fixed now. I just restarted the sssd daemon. Please don't use gpu20
> > > > >>>>> and gpu21 unless you are training 3D neural networks for which you
> > > > >>>>> need a lot of GPU memory.
> > > >>>>>
> > > >>>>> Predrag
> > > >>>>>
> > > >>>>>
> > > >>>>> >
> > > >>>>> > Thanks a lot in advance!
> > > >>>>>
> > > >>>>
> >
More information about the Autonlab-users mailing list