cuda problem

Predrag Punosevac predragp at andrew.cmu.edu
Tue Aug 18 17:23:48 EDT 2020


Ifigeneia Apostolopoulou <iapostol at andrew.cmu.edu> wrote:

> Yes, but there is still no bin/ptxas in cuda 10.2. Actually, there's no bin
> directory. It seems that cuda-10.2 is corrupted?
> 

I took a clue from your message and did a fresh installation of CUDA
on GPU1 only. I upgraded the kernel and the driver to the latest ones
supporting the 7.8 branch of Red Hat. The driver works as expected in my
limited testing. CUDA is upgraded to the newly released 11.0. I really
hate that NVIDIA is intentionally breaking previous stable releases as
soon as a new one is branched out.

Could you please try building TensorFlow on GPU1 and report the
progress? We will eventually have to upgrade all GPU nodes to CUDA 11
even if they are fully working now.
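
Before building, it may help to confirm that the pieces reported missing
in this thread are actually present on the node. Below is a minimal
sketch, not a tested site script. It assumes the usual layout where
/usr/local/cuda is a symbolic link to the versioned install and that
ptxas and CUPTI live under bin/ and extras/CUPTI respectively. The
function name check_cuda_layout is made up for illustration.

```python
import os


def check_cuda_layout(cuda_root="/usr/local/cuda"):
    """Report where cuda_root points and whether ptxas and CUPTI exist.

    The default path and the bin/ptxas and extras/CUPTI locations are
    assumptions taken from the paths mentioned in this thread.
    """
    # Follow the symbolic link (if any) to the versioned install directory.
    resolved = os.path.realpath(cuda_root)
    return {
        "root": cuda_root,
        "resolved": resolved,
        "is_symlink": os.path.islink(cuda_root),
        "has_ptxas": os.path.isfile(os.path.join(resolved, "bin", "ptxas")),
        "has_cupti": os.path.isdir(os.path.join(resolved, "extras", "CUPTI")),
    }


if __name__ == "__main__":
    for key, value in check_cuda_layout().items():
        print(f"{key}: {value}")
```

If has_ptxas or has_cupti comes back False, the install on that node is
probably incomplete, and the resolved path shows which CUDA version the
link currently points to.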

Best,
Predrag



> On Tue, Aug 18, 2020 at 11:41 AM Predrag Punosevac <predragp at andrew.cmu.edu>
> wrote:
> 
> > Because the cuda folder is the cuda 10.2 folder. The cuda folder is
> > typically just a symbolic link to the current version of CUDA.
> >
> > On Tue, Aug 18, 2020, 11:31 AM Kyle Miller <mille856 at andrew.cmu.edu>
> > wrote:
> >
> >> I see. I ran a few find commands on gpu13 and couldn't find a cuda
> >> folder or CUPTI.
> >>
> >> On Tue, Aug 18, 2020 at 10:00 AM Ifigeneia Apostolopoulou <
> >> iapostol at andrew.cmu.edu> wrote:
> >>
> >>> Hi Kyle,
> >>> Thanks a lot for your reply!
> >>>
> >>> I also had this issue and I solved it as you did. However, this seems to
> >>> be another issue:
> >>> I currently can't see CUPTI in /usr/local/cuda/extras/CUPTI (or anywhere
> >>> on gpu1 to add it to my path), which causes the issue.
> >>> I am also attaching the screenshot with the working (gpu3) and
> >>> not-working (gpu1) case. In gpu1, gpu2, gpu13, it seems that the directory
> >>> cuda (and all its content) has been moved (and I can't find it in any other
> >>> directory).
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Tue, Aug 18, 2020 at 9:32 AM Kyle Miller <mille856 at andrew.cmu.edu>
> >>> wrote:
> >>>
> >>>> Ifi,
> >>>>    I recently had difficulty on GPU13, having not used it in a long
> >>>> while. For me, the issue was that miniconda had moved. I added
> >>>> /opt/miniconda-py38/bin to my path and rebuilt my environment (not sure if
> >>>> that was necessary). Then it worked.
> >>>> -Kyle
> >>>>
> >>>> On Tue, Aug 18, 2020 at 2:14 AM Predrag Punosevac <
> >>>> predragp at andrew.cmu.edu> wrote:
> >>>>
> >>>>> Ifigeneia Apostolopoulou <iapostol at andrew.cmu.edu> wrote:
> >>>>>
> >>>>> > Hi Predrag,
> >>>>> >
> >>>>> > I hope that this (weird) summer is going well!
> >>>>> >
> >>>>> > I noticed a change in servers gpu1, gpu2, gpu13, gpu14.
> >>>>> > Specifically, I no longer can find
> >>>>>
> >>>>> I have not touched those servers in a very long time. I am CC-ing the
> >>>>> users mailing list. My brain is shutting down at this late hour. Maybe
> >>>>> somebody could be of more help tomorrow morning.
> >>>>>
> >>>>> >
> >>>>> > /usr/local/cuda/extras/CUPTI
> >>>>> >
> >>>>>
> >>>>> I believe you.
> >>>>>
> >>>>>
> >>>>> > which results in the error when I'm building my tensorflow models.
> >>>>> >
> >>>>> >  Not found: ./bin/ptxas not found. Relying on driver to perform ptx
> >>>>> > compilation. This message will be only logged once.
> >>>>> >
> >>>>> > Any ideas how I could solve this issue? Would it be possible to
> >>>>> > restore the cuda directory?
> >>>>> >
> >>>>> > Also, I currently do not have access to gpu21.
> >>>>>
> >>>>> It is fixed now. I just restarted the sssd daemon. Please don't use
> >>>>> gpu20 and gpu21 unless you are training 3D neural networks for which
> >>>>> you need a lot of GPU memory.
> >>>>>
> >>>>> Predrag
> >>>>>
> >>>>>
> >>>>> >
> >>>>> > Thanks a lot in advance!
> >>>>>
> >>>>

