NVidia driver broke GPUs

Predrag Punosevac predragp at andrew.cmu.edu
Wed Mar 28 23:15:06 EDT 2018


Dear Autonians,

I have another update on the NVidia driver issue. I have actually reinstalled
the driver and CUDA 9.0 on GPU9, but the issue is still here. Please see
the detailed report below.

I have seen a few people reporting this same stupidity with NVidia
hardware. Their solution is a cold reboot. I have rebooted this machine
multiple times, but every time remotely with the reboot command. That is
a so-called soft reboot, where the power never actually gets completely
cut off.

Tomorrow I will go to the machine room, turn the machine off for 10
minutes, and bring it back online.

We will see if that helps. 


Predrag

root at gpu9$  cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  390.30  Wed Jan 31
22:08:49 PST 2018
GCC version:  gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) 


root at gpu9$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176


root at gpu9$ nvidia-smi
Wed Mar 28 23:13:13 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:02:00.0 Off |                  N/A |
| 23%   40C    P0    61W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            Off  | 00000000:03:00.0 Off |                  N/A |
| 24%   43C    P0    61W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            Off  | 00000000:82:00.0 Off |                  N/A |
| 23%   40C    P0    62W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN Xp            Off  | 00000000:83:00.0 Off |                  N/A |
| 23%   42C    P0    62W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+


root at gpu9$ ls
deviceQuery  deviceQuery.cpp  deviceQuery.o  Makefile  NsightEclipse.xml
readme.txt
root at gpu9$ ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

 cudaGetDeviceCount returned 30
 -> unknown error
 Result = FAIL
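
For what it is worth, error code 30 is cudaErrorUnknown in the CUDA 9.0
runtime, which is what you get when the runtime cannot talk to the kernel
driver at all. A minimal sketch of the same first check deviceQuery makes
(the file name check_cuda.cu and the build line below are mine, not part of
the CUDA samples) would be:

// check_cuda.cu -- minimal sketch reproducing the first call deviceQuery makes.
// Build (assumed): nvcc check_cuda.cu -o check_cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    // On a broken driver this returns 30 (cudaErrorUnknown), just like deviceQuery.
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceCount returned %d -> %s\n",
                    static_cast<int>(err), cudaGetErrorString(err));
        return 1;
    }
    std::printf("Found %d CUDA device(s)\n", count);
    return 0;
}

On a healthy machine such as GPU8 this should report 4 devices; on GPU9 it
should fail the same way, with no PyTorch or TensorFlow in the picture.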




> On Wed, Mar 28, 2018 at 6:58 PM, Predrag Punosevac
> <predragp at andrew.cmu.edu> wrote:
> > Barnabas Poczos <bapoczos at cs.cmu.edu> wrote:
> >
> >> If this can't be fixed quickly, then would it be possible to roll
> >> back these GPU machines (5, 6, 9) to the last state in which they
> >> worked fine?
> >> (If I remember correctly, they have been down since March 23.)
> >>
> >> Sorry for bugging you with this; I just want to find a quick solution
> >> to make these 12 GPU cards usable again with PyTorch and TensorFlow,
> >> because several deadlines are coming up.
> >>
> >> Many thanks! ... and sorry for annoying you with this!
> >>
> >
> > OK, Yotam and I spent the last 3-4 hours debugging this. It is not a
> > PyTorch or TensorFlow issue. It is not even a CUDA issue. The NVidia
> > driver itself is broken. I have no idea how it happened on some machines
> > and not on others (all GPU machines, with the exception of GPU-7, run the
> > same latest Red Hat 3.10.0-693.21.1.el7.x86 kernel). The clue should
> > have been the fact that MATLAB also broke on some machines.
> > My hunch is that the NVidia driver gets recompiled during the kernel
> > update, and apparently that is not as robust as it should be.
> >
> > The plan of action is that I will try to remove everything NVidia
> > related from the GPU9 machine and reinstall the driver and CUDA from
> > scratch. Hopefully GPU9 will become functional just like GPU8. Once it
> > works on GPU9, I can go and fix the other machines. If that doesn't
> > work, I will reinstall GPU9 from scratch.
> >
> > Long story short, somebody at NVidia did a shady job with QA and we
> > became the victims. Oh, just for the record, we don't use ZFS on Linux.
> > If I were running root off a ZFS pool, as I do on the file server, I
> > could just use beadm to select the previous working boot environment
> > and go back. I am not aware that Linux can do something like that, but
> > that is what I do on FreeBSD and what Solaris does.
> >
> >
> > Best,
> > Predrag
> >
> >
> >
> >
> >> Cheers,
> >> Barnabas
> >> ======================
> >> Barnabas Poczos, PhD
> >> Assistant Professor
> >> Machine Learning Department
> >> Carnegie Mellon University
> >>

