NVidia driver broke GPUs

Barnabas Poczos bapoczos at cs.cmu.edu
Wed Mar 28 19:16:21 EDT 2018


Thanks Predrag and Yotam for your help working on this!

Best,
Barnabas
======================
Barnabas Poczos, PhD
Assistant Professor
Machine Learning Department
Carnegie Mellon University


On Wed, Mar 28, 2018 at 6:58 PM, Predrag Punosevac
<predragp at andrew.cmu.edu> wrote:
> Barnabas Poczos <bapoczos at cs.cmu.edu> wrote:
>
>> If this can't be fixed quickly, would it be possible to roll these GPU
>> machines (5, 6, 9) back to the last state in which they worked fine?
>> (If I understand correctly, they have been down since March 23.)
>>
>> Sorry for bugging you with this. I just want to find a quick solution
>> to make these 12 GPU cards usable again with PyTorch and TensorFlow,
>> because several deadlines are coming.
>>
>> Many thanks! ... and sorry for annoying you with this!
>>
>
> OK, Yotam and I spent the last 3-4 hours debugging this. It is neither
> a PyTorch nor a TensorFlow issue. It is not even a CUDA issue. The
> NVidia driver itself is broken. I have no idea why it happened on some
> machines and not on others (all GPU machines, with the exception of
> GPU-7, run the same latest Red Hat 3.10.0-693.21.1.el7.x86 kernel). The
> clue should have been the fact that MATLAB also broke on some machines.
> My hunch is that the NVidia driver gets recompiled during the kernel
> update, and apparently that process is not as robust as it should be.
>
> The plan of action is to remove everything NVidia-related from the GPU9
> machine and reinstall the driver and CUDA from scratch. Hopefully GPU9
> will become functional, just like GPU8. Once it works for GPU9, I can
> go and fix the other machines. If that doesn't work, I will reinstall
> GPU9 from scratch.
>
> Long story short, somebody at NVidia did a shady job with QA and we
> became the victims. Oh, and just for the record, we don't use ZFS on
> Linux. If the root file system were on a ZFS pool, as it is on the file
> server, I could just use beadm to select the previous working system
> and go back. I am not aware that Linux can do something like that, but
> that is what I do on FreeBSD and that is what Solaris does.
>
>
> Best,
> Predrag
>
>
>
>
>> Cheers,
>> Barnabas
>> ======================
>> Barnabas Poczos, PhD
>> Assistant Professor
>> Machine Learning Department
>> Carnegie Mellon University
>>
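
For anyone who wants to check a node by hand, here is a minimal Python sketch of one way to see whether the NVIDIA kernel module is loaded and which version it reports. It is not the procedure used in the debugging session above. The function names are hypothetical. The sketch assumes the usual Linux locations /proc/modules and /proc/driver/nvidia/version, and it assumes the version line contains the words "Kernel Module" followed by a dotted version number, which may not hold for every driver release.

```python
import re
from pathlib import Path
from typing import Optional


def nvidia_module_loaded(proc_modules_text: str) -> bool:
    """Return True if a module named exactly 'nvidia' appears in /proc/modules text."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first whitespace-separated field of each line is the module name.
        if fields and fields[0] == "nvidia":
            return True
    return False


def kernel_module_version(version_text: str) -> Optional[str]:
    """Extract a dotted version number following 'Kernel Module', or None if absent."""
    # Assumed layout: "... Kernel Module  <version>  <build date>".
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", version_text)
    return match.group(1) if match else None


def read_text_or_empty(path: str) -> str:
    """Read a file, returning an empty string if it is missing or unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return ""


if __name__ == "__main__":
    loaded = nvidia_module_loaded(read_text_or_empty("/proc/modules"))
    version = kernel_module_version(
        read_text_or_empty("/proc/driver/nvidia/version")
    )
    print("nvidia kernel module loaded:", loaded)
    print("kernel module version:", version or "unknown")
```

If the module is not loaded on a machine that was just updated, that would be consistent with the hunch that the driver was not rebuilt cleanly for the new kernel, although it does not prove it.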


More information about the Autonlab-users mailing list