No GPU drivers detected on any gpu machine?
Predrag Punosevac
predragp at andrew.cmu.edu
Wed Jul 24 16:15:22 EDT 2019
A quick update on this issue and a resolution. I took a clue from the
fact that GPU10 was working as expected and narrowed the issue down to
the CUDA 9.1 installation. It appears that upstream has deliberately
broken CUDA 9.1 via the dkms utility, which recompiles kernel modules
to fit a specific kernel release. They probably want people to move to
CUDA 10.1.
Long story short: I upgraded the NVIDIA driver and CUDA to 10.1 on the
GPU2 and GPU3 servers. They appear to be working flawlessly on my end,
as tested with the nvidia-smi utility as well as MATLAB. I have
recreated the GPU3 scratch directory, which had been 100% full for
almost half a year. I have also reinstalled the libcudnn library on
both machines, but I am unable to test it.
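Since the libcudnn reinstall could not be tested, here is a minimal
Python sketch of a smoke test. It is not what was run on GPU2 or GPU3.
It assumes nvidia-smi is on the PATH and that the shared library can be
loaded as libcudnn.so (the file name may carry a version suffix on a
given machine). The helper names classify_smi, check_driver, and
cudnn_version are hypothetical.

```python
import ctypes
import shutil
import subprocess

# Text nvidia-smi prints when it cannot reach the driver (quoted in this thread).
DRIVER_ERROR = "couldn't communicate with the NVIDIA driver"


def classify_smi(returncode, output):
    """Map an nvidia-smi exit code and its output to a short status string."""
    if DRIVER_ERROR in output:
        return "driver_unreachable"
    if returncode != 0:
        return "other_error"
    return "ok"


def check_driver():
    """Run nvidia-smi, if it is installed, and classify the result."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return "nvidia-smi_not_found"
    proc = subprocess.run([exe], stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, universal_newlines=True)
    return classify_smi(proc.returncode, proc.stdout)


def cudnn_version(libname="libcudnn.so"):
    """Return cudnnGetVersion() as an int, or None if the library cannot be loaded.

    The default library name is an assumption; adjust it to the installed file.
    """
    try:
        lib = ctypes.CDLL(libname)
    except OSError:
        return None
    lib.cudnnGetVersion.restype = ctypes.c_size_t
    return int(lib.cudnnGetVersion())


if __name__ == "__main__":
    print("driver:", check_driver())
    print("cuDNN version:", cudnn_version())
```

A non-None cuDNN version only shows that the library loads and links.
It does not exercise the GPU, so a real workload is still the proper
test.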
This is all good, but it also means that people will have to regenerate
their tools from scratch to match the kernel, driver, and CUDA
versions. If you have things on GPU10 you could probably just migrate
them. This is very time-consuming, but we have no choice.
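Before rebuilding, it may help to record which kernel and driver a
machine is actually running. The following is a rough Python sketch,
assuming nvidia-smi is on the PATH. The function names are hypothetical,
and the CUDA toolkit version is not covered here.

```python
import platform
import shutil
import subprocess


def parse_driver_versions(output):
    """Return the non-empty lines of nvidia-smi's CSV output, one per GPU."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def driver_versions():
    """Ask nvidia-smi for each GPU's driver version; empty list on failure."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []
    proc = subprocess.run(
        [exe, "--query-gpu=driver_version", "--format=csv,noheader"],
        stdout=subprocess.PIPE, stderr=subprocess.DEVNULL,
        universal_newlines=True)
    if proc.returncode != 0:
        return []
    return parse_driver_versions(proc.stdout)


if __name__ == "__main__":
    print("kernel:", platform.release())
    print("driver:", driver_versions())
```

An empty driver list means nvidia-smi was missing or failed, which is
the symptom reported at the start of this thread.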
The major bad news is that one of the GPU servers I tried to work on,
GPU1 (commissioned almost five years ago), didn't survive the reboot.
It also uses older Tesla K80 cards. I will have to attach a screen and
troubleshoot this machine. That will not happen today or, for that
matter, this week.
My plan now is to move on and fix machines GPU[4-9], which will take
the rest of the day. Note that GPU7 is designated for a special project
and is not generally accessible.
Most Kind Regards,
Predrag Punosevac
On Wed, Jul 24, 2019 at 1:09 PM Predrag Punosevac
<predragp at andrew.cmu.edu> wrote:
>
> Thank you so much for bringing this to my attention. GPU10 is not
> broken, but sure enough you are right about the other machines. It
> appears that one of the recent updates has broken the driver. I will
> reinstall the drivers shortly and reboot the machines. This is also a
> notice for everyone else that GPU1-9 will have to be rebooted.
>
> Predrag
>
> On Wed, Jul 24, 2019 at 10:52 AM Chufan Gao <chufang at andrew.cmu.edu> wrote:
> >
> > Hi Predrag,
> >
> >
> > I discovered today that when I run nvidia-smi, I get this error:
> >
> >
> > NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
> >
> > The same happens for all of the gpu machines that I tried. I am confused - was there an update that broke it?
> >
> > Sincerely,
> > Andy Gao