NVidia driver broke GPUs
Predrag Punosevac
predragp at andrew.cmu.edu
Wed Mar 28 18:58:49 EDT 2018
Barnabas Poczos <bapoczos at cs.cmu.edu> wrote:
> If this can't be fixed quickly, then would it be possible to do a roll
> back on these GPU machines (5,6,9) to the latest state when they
> worked fine?
> (If I recall correctly, they have been down since March 23.)
>
> Sorry for bugging you with this, I just want to find a quick solution
> to make these 12 GPU cards usable again with pytorch and tensorflow
> because several deadlines are coming.
>
> Many thanks! ... and sorry for annoying you with this!
>
OK, Yotam and I spent the last 3-4 hours debugging this. It is not a
PyTorch or TensorFlow issue. It is not even a CUDA issue. The NVidia
driver itself is broken. I have no idea why it happened on some machines
and not on others (all GPU machines with the exception of GPU-7 run the
same latest Red Hat 3.10.0-693.21.1.el7.x86 kernel). The clue should
have been the fact that MATLAB also broke on some machines.
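One way to separate a driver failure from a framework failure is to ask
the driver directly, bypassing PyTorch, TensorFlow and MATLAB. What
follows is only a minimal sketch, not a record of what we actually ran.
It assumes the nvidia-smi utility that ships with the driver is on the
PATH, and the helper name driver_responds is made up for illustration.

```python
import subprocess


def driver_responds(cmd=("nvidia-smi",), timeout=30):
    """Return True if the command runs and exits with status 0.

    The default command, nvidia-smi, queries the kernel driver without
    going through any ML framework, so a failure here points at the
    driver installation rather than at PyTorch or TensorFlow.
    """
    try:
        result = subprocess.run(
            list(cmd),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Tool missing, not executable, or hung: treat all of these
        # as "driver not usable" for the purpose of this check.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("driver OK" if driver_responds() else "driver NOT responding")
```

If this reports a failure on a machine where the frameworks also fail,
the problem most likely sits below CUDA.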
My hunch is that the NVidia driver gets recompiled during the kernel
update and apparently that process is not as robust as it should be.
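A rough way to test that hunch is to look for an nvidia kernel module
under the module tree of the kernel that is actually running. This is a
sketch under assumptions: it assumes the module is installed somewhere
below /lib/modules/<release>, which can vary with how the driver was
installed, and nvidia_modules_for is a hypothetical helper name. Finding
a file does not prove the module loads; finding nothing is a strong hint
that the rebuild for the new kernel did not happen.

```python
import os
from pathlib import Path


def nvidia_modules_for(kernel_release, modules_root="/lib/modules"):
    """List nvidia*.ko* files found under the given kernel's module tree."""
    kernel_dir = Path(modules_root) / kernel_release
    if not kernel_dir.is_dir():
        return []
    # Matches nvidia.ko, nvidia-uvm.ko and compressed variants such as .ko.xz
    return sorted(kernel_dir.rglob("nvidia*.ko*"))


if __name__ == "__main__":
    release = os.uname().release  # same string that `uname -r` prints
    found = nvidia_modules_for(release)
    if found:
        for path in found:
            print(path)
    else:
        print(f"no nvidia module found for kernel {release}")
```

Running it on a working machine and on a broken one should show whether
the two differ in what was built for the current kernel.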
The plan of action is that I will try to remove everything NVidia-related
from the GPU9 machine and reinstall the driver and CUDA from scratch.
Hopefully GPU9 will become functional just like GPU8. Once it works for
GPU9 I can go and fix the other machines. If that doesn't work I will
reinstall GPU9 from scratch.
Long story short, somebody at NVidia did a shoddy job with QA and we
became the victims. Oh, just for the record, we don't use ZFS on Linux.
If I were running root on a ZFS pool, as I am doing on the file server, I
could just use beadm to select the previous working system and go back.
I am not aware that Linux can do something like that, but that is what I
do on FreeBSD and that is what Solaris does.
Best,
Predrag
> Cheers,
> Barnabas
> ======================
> Barnabas Poczos, PhD
> Assistant Professor
> Machine Learning Department
> Carnegie Mellon University
>
More information about the Autonlab-users
mailing list