No GPU drivers detected on any gpu machine?

Yusha Liu yushal at andrew.cmu.edu
Thu Aug 1 11:26:48 EDT 2019


Hi all,

Could anyone give me a guide on how to install TensorFlow (<2.0 beta)
compatible with CUDA 10.1 on the GPU machines? I haven't succeeded at that.
Thanks, and sorry for the overhead.

Yours,
Yusha





On Wed, Jul 24, 2019 at 10:16 PM Predrag Punosevac <predragp at andrew.cmu.edu>
wrote:

> Predrag Punosevac <predragp at andrew.cmu.edu> wrote:
>
> I apologize for top posting. Just a quick update. As of 5 minutes ago
> machines gpu[2-10] appear to have no issues. After all the upgrades and
> reboots it appears that we don't have any dead GPU cards on them and
> that drivers and CUDA 10.1 work as expected. I understand that this is
> little comfort to people who need to regenerate TensorFlow, PyTorch,
> and all that "deep-learning" stuff but I have no control over the
> upstream decisions.
>
> GPU1 appears to be broken at the moment. Without attaching a console to
> the machine it is difficult for me to assess the complexity of the problem.
>
> Once again, sorry for the downtime.
>
> Cheers,
> Predrag
>
>
>
>
>
>
> > A quick update on this issue and a resolution. I took a clue from the
> > fact that GPU10 was working as expected and narrowed down the issue to
> > the CUDA 9.1 installation. It appears that upstream has purposely broken
> > CUDA 9.1 via the dkms utility, which is used to recompile kernel modules
> > to fit a specific kernel release. They probably want people to move to
> > CUDA 10.1.
> >
> > Long story short: I upgraded the NVIDIA driver and CUDA to 10.1 on the
> > GPU2 and GPU3 servers. They appear to be working flawlessly on my end, as
> > tested with the nvidia-smi utility as well as MATLAB. I have recreated
> > the GPU3 scratch directory, which was 100% used for almost half a year. I
> > have also reinstalled the libcudnn library on both machines but I am
> > unable to test it.
> >
> > This is all good but it also means that people will have to regenerate
> > their tools from scratch to match the kernel, driver, and CUDA
> > versions. If you have things on GPU10 you probably could just migrate
> > them. This is very time-consuming but we have no choice.
> >
> > The major bad news is that one of the GPU servers I tried to work on,
> > GPU1 (commissioned almost five years ago), didn't survive the reboot. It
> > also uses older Tesla K80 cards. I will have to attach a screen and
> > troubleshoot this machine. That will not happen today or, for that
> > matter, this week.
> >
> > My plan is now to move on and fix machines GPU[4-9], which will take the
> > rest of the day. Note that GPU7 is designated for a special project and
> > is not generally accessible.
> >
> > Most Kind Regards,
> > Predrag Punosevac
> >
> >
> >
> >
> > On Wed, Jul 24, 2019 at 1:09 PM Predrag Punosevac
> > <predragp at andrew.cmu.edu> wrote:
> > >
> > > Thank you so much for bringing this to my attention. GPU10 is not
> > > broken but sure enough you are right about the other machines. It
> > > appears that one of the recent updates has broken the driver. I will
> > > reinstall the drivers shortly and reboot the machines. This is also a
> > > notice for everyone else that GPU1-9 will have to be rebooted.
> > >
> > > Predrag
> > >
> > > On Wed, Jul 24, 2019 at 10:52 AM Chufan Gao <chufang at andrew.cmu.edu>
> > > wrote:
> > > >
> > > > Hi Predrag,
> > > >
> > > >
> > > > I discovered today that when I run nvidia-smi, I get this error:
> > > >
> > > >
> > > > NVIDIA-SMI has failed because it couldn't communicate with the
> > > > NVIDIA driver. Make sure that the latest NVIDIA driver is installed and
> > > > running.
> > > >
> > > > The same happens for all of the gpu machines that I tried. I am
> > > > confused - was there an update that broke it?
> > > >
> > > > Sincerely,
> > > > Andy Gao
>


-- 
Yusha Liu, Master's Student
Machine Learning Department
Carnegie Mellon University

