<div dir="ltr"><div dir="ltr">At least for me, Friday evening (or later) is fine. All these servers are currently very underutilized <div>(or running very old processes with models probably compiled before the issue popped up). I am not sure if this is because</div><div>other people have faced similar problems (with me being the first to 'complain'). In the meantime, and for better job scheduling, </div><div>it may be better if anyone who doesn't encounter a similar problem uses one of those nodes (<span style="color:rgb(80,0,80)">gpu2,10,11,12,13,14,21</span>), though.</div><div><br></div><div>thanks again and have a good night!</div><div><br><div><br></div></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Aug 18, 2020 at 9:44 PM Predrag Punosevac <<a href="mailto:predragp@andrew.cmu.edu">predragp@andrew.cmu.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">Ifigeneia Apostolopoulou <<a href="mailto:iapostol@andrew.cmu.edu" target="_blank">iapostol@andrew.cmu.edu</a>> wrote:<br>
<br>
> Predrag, now it works fine. thanks a million! :-D<br>
> <br>
> gpu2,10,11,12,13,14,21 seem to have a similar issue.<br>
<br>
I am going to sit on this info at least until Friday evening. You are<br>
not supposed to use more than 2-3 nodes at the same time anyway. If<br>
those servers work for other people who might not even use TensorFlow I<br>
would prefer not to reboot them. It takes about 1.5h to rebuild each<br>
machine. You just listed 7 machines. That is 10.5h of work if everything<br>
goes without a hitch.<br>
<br>
Cheers,<br>
Predrag<br>
<br>
> <br>
> <br>
> <br>
> On Tue, Aug 18, 2020 at 5:23 PM Predrag Punosevac <<a href="mailto:predragp@andrew.cmu.edu" target="_blank">predragp@andrew.cmu.edu</a>><br>
> wrote:<br>
> <br>
> > Ifigeneia Apostolopoulou <<a href="mailto:iapostol@andrew.cmu.edu" target="_blank">iapostol@andrew.cmu.edu</a>> wrote:<br>
> ><br>
> > > yes, but there is still no bin/ptxas in cuda 10.2. actually there's no bin<br>
> > > directory. it seems that cuda-10.2 is corrupted?<br>
> > ><br>
> ><br>
> > I took a clue from your message and did a fresh installation of CUDA<br>
> > on GPU1 only. I upgraded the kernel and the driver to the latest one<br>
> > supporting branch 7.8 of RedHat. The driver works as expected in my<br>
> > limited testing. CUDA is upgraded to the newly released 11.0. I really<br>
> > hate that NVIDIA is intentionally breaking previous stable releases as<br>
> > soon as the new one is branched out.<br>
> ><br>
> > Could you please try building TensorFlow on GPU1 and report the<br>
> > progress? We will eventually have to upgrade all GPU nodes to CUDA 11<br>
> > even if they are fully working now.<br>
> ><br>
> > Best,<br>
> > Predrag<br>
> ><br>
> ><br>
> ><br>
> > > On Tue, Aug 18, 2020 at 11:41 AM Predrag Punosevac <<br>
> > <a href="mailto:predragp@andrew.cmu.edu" target="_blank">predragp@andrew.cmu.edu</a>><br>
> > > wrote:<br>
> > ><br>
> > > > Because the cuda folder is the cuda 10.2 folder. The cuda folder is<br>
> > > > typically just a symbolic link to the current version of cuda.<br>
> > > ><br>
> > > > On Tue, Aug 18, 2020, 11:31 AM Kyle Miller <<a href="mailto:mille856@andrew.cmu.edu" target="_blank">mille856@andrew.cmu.edu</a>><br>
> > > > wrote:<br>
> > > ><br>
> > > >> I see. I ran a few find commands on gpu13 but couldn't find a cuda folder<br>
> > > >> or CUPTI.<br>
> > > >><br>
> > > >> On Tue, Aug 18, 2020 at 10:00 AM Ifigeneia Apostolopoulou <<br>
> > > >> <a href="mailto:iapostol@andrew.cmu.edu" target="_blank">iapostol@andrew.cmu.edu</a>> wrote:<br>
> > > >><br>
> > > >>> Hi Kyle,<br>
> > > >>> Thanks a lot for your reply!<br>
> > > >>><br>
> > > >>> I also had this issue and I solved it as you did. However, this seems to<br>
> > > >>> be another issue:<br>
> > > >>> I currently can't see CUPTI in /usr/local/cuda/extras/CUPTI (or anywhere<br>
> > > >>> on gpu1 to add to my path), which causes the issue.<br>
> > > >>> I am also attaching the screenshot with the working (gpu3) and<br>
> > > >>> not-working (gpu1) cases. On gpu1, gpu2, and gpu13, it seems that the<br>
> > > >>> cuda directory (and all its contents) has been moved (and I can't find it in<br>
> > > >>> any other directory).<br>
> > > >>><br>
> > > >>><br>
> > > >>><br>
> > > >>><br>
> > > >>><br>
> > > >>> On Tue, Aug 18, 2020 at 9:32 AM Kyle Miller <<a href="mailto:mille856@andrew.cmu.edu" target="_blank">mille856@andrew.cmu.edu</a><br>
> > ><br>
> > > >>> wrote:<br>
> > > >>><br>
> > > >>>> Ifi,<br>
> > > >>>> I recently had difficulty on GPU13, having not used it in a long<br>
> > > >>>> while. For me, the issue was that miniconda had moved. I added<br>
> > > >>>> /opt/miniconda-py38/bin to my path and rebuilt my environment (not sure if<br>
> > > >>>> that was necessary). Then it worked.<br>
> > > >>>> -Kyle<br>
> > > >>>><br>
> > > >>>> On Tue, Aug 18, 2020 at 2:14 AM Predrag Punosevac <<br>
> > > >>>> <a href="mailto:predragp@andrew.cmu.edu" target="_blank">predragp@andrew.cmu.edu</a>> wrote:<br>
> > > >>>><br>
> > > >>>>> Ifigeneia Apostolopoulou <<a href="mailto:iapostol@andrew.cmu.edu" target="_blank">iapostol@andrew.cmu.edu</a>> wrote:<br>
> > > >>>>><br>
> > > >>>>> > Hi Predrag,<br>
> > > >>>>> ><br>
> > > >>>>> > I hope that this (weird) summer is going well!<br>
> > > >>>>> ><br>
> > > >>>>> > I noticed a change in servers gpu1, gpu2, gpu13, gpu14.<br>
> > > >>>>> > Specifically, I no longer can find<br>
> > > >>>>><br>
> > > >>>>> I have not touched those servers in a very long time. I am CC-ing the users<br>
> > > >>>>> mailing list. My brain is shutting down at this late hour. Maybe<br>
> > > >>>>> somebody could be of more help tomorrow morning.<br>
> > > >>>>><br>
> > > >>>>> ><br>
> > > >>>>> > /usr/local/cuda/extras/CUPTI<br>
> > > >>>>> ><br>
> > > >>>>><br>
> > > >>>>> I believe you.<br>
> > > >>>>><br>
> > > >>>>><br>
> > > >>>>> > which results in the error when I'm building my tensorflow models.<br>
> > > >>>>> ><br>
> > > >>>>> > Not found: ./bin/ptxas not found. Relying on driver to perform ptx<br>
> > > >>>>> > compilation. This message will be only logged once.<br>
> > > >>>>> ><br>
> > > >>>>> > Any ideas how I could solve this issue? Would it be possible to restore<br>
> > > >>>>> > the cuda directory?<br>
> > > >>>>> ><br>
> > > >>>>> > Also, I currently do not have access to gpu21.<br>
> > > >>>>><br>
> > > >>>>> It is fixed now. I just restarted the sssd daemon. Please don't use gpu20<br>
> > > >>>>> and gpu21 unless you are training 3D neural networks for which you<br>
> > > >>>>> need a lot of GPU memory.<br>
> > > >>>>><br>
> > > >>>>> Predrag<br>
> > > >>>>><br>
> > > >>>>><br>
> > > >>>>> ><br>
> > > >>>>> > Thanks a lot in advance!<br>
> > > >>>>><br>
> > > >>>><br>
> ><br>
</blockquote></div>
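<div><br></div>
<div>For anyone who wants to check a node before reporting it, here is a minimal Python sketch of the checks discussed in this thread: where the cuda symlink points, and whether bin/ptxas and extras/CUPTI exist under it. It is not taken from anyone's setup. The function name check_cuda_tree is made up for illustration, and /usr/local/cuda is assumed as the default location only because that is the path mentioned above. It uses nothing beyond the standard library.</div>
<pre>
```python
import os


def check_cuda_tree(cuda_root="/usr/local/cuda"):
    """Report what a CUDA install directory actually contains.

    cuda_root defaults to the path mentioned in the thread; pass another
    path if CUDA lives elsewhere on your node (an assumption to verify).
    """
    # realpath follows the symlink (if any) to the versioned directory.
    resolved = os.path.realpath(cuda_root)
    return {
        "root": cuda_root,
        "is_symlink": os.path.islink(cuda_root),
        "resolved": resolved,
        # ptxas is the binary TensorFlow reported as missing.
        "has_ptxas": os.path.isfile(os.path.join(resolved, "bin", "ptxas")),
        # CUPTI is expected under extras/CUPTI in the CUDA tree.
        "has_cupti": os.path.isdir(os.path.join(resolved, "extras", "CUPTI")),
    }


if __name__ == "__main__":
    for key, value in check_cuda_tree().items():
        print(f"{key}: {value}")
```
</pre>
<div>If is_symlink is true but has_ptxas is false, the link probably points at an incomplete or removed versioned directory, which would match the symptoms reported for gpu1.</div>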