NVidia driver broke GPUs
Predrag Punosevac
predragp at andrew.cmu.edu
Wed Mar 28 23:15:06 EDT 2018
Dear Autonians,
I have another update on the NVidia driver issue. I have actually reinstalled
the driver and CUDA 9.0 on GPU9, but the issue is still here. Please see the
detailed report below.
I have seen a few people reporting this same stupidity with NVidia
hardware. Their solution is a cold reboot. I have rebooted this machine
multiple times, but every time remotely with the reboot command. That is a
so-called soft reboot, where the power actually never gets cut
completely.
Tomorrow I will go to the machine room, turn off the machine for 10 minutes,
and bring it back online.
We will see if that helps.
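For the record, if the box has a reachable BMC, the same cold power cycle can
in principle be done remotely with ipmitool. This is only a sketch; the BMC
hostname, user, and password below are placeholders, not our actual setup:

# cut power completely (unlike a soft reboot), wait, then power back on
ipmitool -I lanplus -H gpu9-bmc.example.edu -U ADMIN -P 'secret' chassis power off
sleep 600
ipmitool -I lanplus -H gpu9-bmc.example.edu -U ADMIN -P 'secret' chassis power on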
Predrag
root at gpu9$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  390.30  Wed Jan 31 22:08:49 PST 2018
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC)
root at gpu9$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
root at gpu9$ nvidia-smi
Wed Mar 28 23:13:13 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:02:00.0 Off |                  N/A |
| 23%   40C    P0    61W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            Off  | 00000000:03:00.0 Off |                  N/A |
| 24%   43C    P0    61W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            Off  | 00000000:82:00.0 Off |                  N/A |
| 23%   40C    P0    62W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN Xp            Off  | 00000000:83:00.0 Off |                  N/A |
| 23%   42C    P0    62W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root at gpu9$ ls
deviceQuery  deviceQuery.cpp  deviceQuery.o  Makefile  NsightEclipse.xml  readme.txt
root at gpu9$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 30
-> unknown error
Result = FAIL
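For what it's worth, error 30 from cudaGetDeviceCount is often reported in
setups where nvidia-smi works but the nvidia_uvm module or the /dev/nvidia*
device nodes are missing. A rough sanity check along those lines (general
diagnostics, not something that was run for this report):

# is the UVM module loaded, and do the device nodes exist?
lsmod | grep nvidia
ls -l /dev/nvidia*
# if nvidia_uvm is missing, try loading it and recreating the device files
modprobe nvidia_uvm
nvidia-modprobe -u -c=0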
> On Wed, Mar 28, 2018 at 6:58 PM, Predrag Punosevac
> <predragp at andrew.cmu.edu> wrote:
> > Barnabas Poczos <bapoczos at cs.cmu.edu> wrote:
> >
> >> If this can't be fixed quickly, then would it be possible to do a roll
> >> back on these GPU machines (5,6,9) to the latest state when they
> >> worked fine?
> >> (If I remember correctly, they have been down since March 23.)
> >>
> >> Sorry for bugging you with this; I just want to find a quick solution
> >> to make these 12 GPU cards usable again with PyTorch and TensorFlow
> >> because several deadlines are coming up.
> >>
> >> Many thanks! ... and sorry for annoying you with this!
> >>
> >
> > OK, Yotam and I spent the last 3-4 hours debugging this. It is not a PyTorch
> > or TensorFlow issue. It is not even a CUDA issue. The NVidia driver itself is
> > broken. I have no idea how it happened on some machines and didn't
> > happen on others (all GPU machines, with the exception of GPU-7, run the
> > same latest Red Hat 3.10.0-693.21.1.el7.x86 kernel). The clue should
> > have been the fact that MATLAB also got broken on some machines.
> > My hunch is that the NVidia driver gets recompiled during the kernel update,
> > and apparently that process is not as robust as it should be.
> >
> > The plan of action is that I will try to remove everything NVidia
> > related from the GPU9 machine and reinstall the driver and CUDA from
> > scratch. Hopefully GPU9 will become functional just like GPU8. Once it
> > works on GPU9 I can go and fix the other machines. If that doesn't work, I
> > will reinstall GPU9 from scratch.
> >
> > Long story short, somebody at NVidia did a shady job with QA and we
> > became victims. Oh, just for the record, we don't use ZFS on Linux. If I
> > were running root on a ZFS pool, as I am doing on the file server, I
> > could just use beadm to select the previous working boot environment and
> > go back. I am not aware that Linux can do something like that, but that is
> > what I do on FreeBSD and that is what Solaris does.
> >
> >
> > Best,
> > Predrag
> >
> >
> >
> >
> >> Cheers,
> >> Barnabas
> >> ======================
> >> Barnabas Poczos, PhD
> >> Assistant Professor
> >> Machine Learning Department
> >> Carnegie Mellon University
> >>
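P.S. If the hunch quoted above is right and the driver module is supposed to
be rebuilt on every kernel update, one quick thing to verify is whether that
rebuild actually happened for the running kernel. This sketch assumes the
driver was installed through DKMS (which I have not double-checked on these
machines), and that the DKMS module is registered as nvidia/390.30:

# list the kernels the nvidia module has been built for
dkms status nvidia
# if the running kernel is missing from that list, rebuild against it
dkms install nvidia/390.30 -k "$(uname -r)"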