GPU problem fixed!
Predrag Punosevac
predragp at andrew.cmu.edu
Thu Mar 29 15:25:40 EDT 2018
Dear Autonians,
This is now fixed! Apparently we hit a serious driver bug with 390.30.
Please try now to compile TensorFlow and PyTorch on GPU9.
Predrag Punosevac
> Peer access from TITAN Xp (GPU0) -> TITAN Xp (GPU1) : Yes
> Peer access from TITAN Xp (GPU0) -> TITAN Xp (GPU2) : No
> Peer access from TITAN Xp (GPU0) -> TITAN Xp (GPU3) : No
> Peer access from TITAN Xp (GPU1) -> TITAN Xp (GPU0) : Yes
> Peer access from TITAN Xp (GPU1) -> TITAN Xp (GPU2) : No
> Peer access from TITAN Xp (GPU1) -> TITAN Xp (GPU3) : No
> Peer access from TITAN Xp (GPU2) -> TITAN Xp (GPU0) : No
> Peer access from TITAN Xp (GPU2) -> TITAN Xp (GPU1) : No
> Peer access from TITAN Xp (GPU2) -> TITAN Xp (GPU3) : Yes
> Peer access from TITAN Xp (GPU3) -> TITAN Xp (GPU0) : No
> Peer access from TITAN Xp (GPU3) -> TITAN Xp (GPU1) : No
> Peer access from TITAN Xp (GPU3) -> TITAN Xp (GPU2) : Yes
> deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA
> Runtime Version = 9.1, NumDevs = 4
> Result = PASS

Predrag Punosevac <predragp at andrew.cmu.edu> wrote:
I will ping you with a plan of action as soon as Kyle and I stop dancing.
We are in kind of a celebratory mood right now. We will first have to fix
the higher-numbered servers, which have Titan Xp cards and newer
motherboards, before moving on to the lower-numbered servers with older
GPU cards.
Predrag
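
The peer-access lines in the deviceQuery output quoted above can be turned
into groups of GPUs that reach each other directly. The following is only a
minimal sketch: the helper names (parse_peer_access, peer_groups) are
hypothetical, and it assumes the exact line format shown in this thread.

```python
import re

# Peer-access lines copied from the deviceQuery output in this thread.
DEVICE_QUERY_OUTPUT = """\
Peer access from TITAN Xp (GPU0) -> TITAN Xp (GPU1) : Yes
Peer access from TITAN Xp (GPU0) -> TITAN Xp (GPU2) : No
Peer access from TITAN Xp (GPU0) -> TITAN Xp (GPU3) : No
Peer access from TITAN Xp (GPU1) -> TITAN Xp (GPU0) : Yes
Peer access from TITAN Xp (GPU1) -> TITAN Xp (GPU2) : No
Peer access from TITAN Xp (GPU1) -> TITAN Xp (GPU3) : No
Peer access from TITAN Xp (GPU2) -> TITAN Xp (GPU0) : No
Peer access from TITAN Xp (GPU2) -> TITAN Xp (GPU1) : No
Peer access from TITAN Xp (GPU2) -> TITAN Xp (GPU3) : Yes
Peer access from TITAN Xp (GPU3) -> TITAN Xp (GPU0) : No
Peer access from TITAN Xp (GPU3) -> TITAN Xp (GPU1) : No
Peer access from TITAN Xp (GPU3) -> TITAN Xp (GPU2) : Yes
"""

# Assumed line format: "Peer access from <name> (GPUi) -> <name> (GPUj) : Yes|No"
LINE_RE = re.compile(
    r"Peer access from .*?\(GPU(\d+)\) -> .*?\(GPU(\d+)\) : (Yes|No)"
)


def parse_peer_access(text):
    """Return {(src, dst): bool} for every peer-access line found in text."""
    access = {}
    for match in LINE_RE.finditer(text):
        src, dst, answer = match.groups()
        access[(int(src), int(dst))] = (answer == "Yes")
    return access


def peer_groups(access):
    """Group GPUs that have peer access to each other in both directions."""
    gpus = sorted({gpu for pair in access for gpu in pair})
    groups = []
    for gpu in gpus:
        for group in groups:
            # Join a group only if access is mutual with every member.
            if all(access.get((gpu, other)) and access.get((other, gpu))
                   for other in group):
                group.append(gpu)
                break
        else:
            groups.append([gpu])
    return groups


if __name__ == "__main__":
    print(peer_groups(parse_peer_access(DEVICE_QUERY_OUTPUT)))
```

For the output above this pairs GPU0 with GPU1 and GPU2 with GPU3, which is
what the Yes/No table shows.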
> Yotam Hechtlinger <yhechtli at andrew.cmu.edu> wrote:
>
> > Hello Predrag,
> >
> > There might be a bug with GPU8 also.
> > I didn't have time to test it yet, but python crashes when trying to call
> > keras.
>
> I did a cold reboot. It didn't help. I think what we are seeing is a bug
> in driver 390.30. The bug could be Titan Xp specific, which would explain
> why the older machines still work. Nvidia has a website where one can
> download the scripts used to recompile the latest driver. I think the
> latest driver is 390.48, which is quite a few versions ahead of 390.30.
> I am installing it right now on GPU9. If that doesn't work I will try
> downgrading the kernel, on the assumption that it is a kernel bug. The
> following kernels are available:
>
> kernel.x86_64 3.10.0-693.5.2.el7 @updates
> kernel.x86_64 3.10.0-693.11.6.el7 @updates
> kernel.x86_64 3.10.0-693.21.1.el7
>
> Right now I am running 3.10.0-693.21.1 but we can try to go one or even
> two kernels back.
>
> If all that fails I still have a few magic tricks in my hat, but they are
> related to motherboard firmware. GPU8 and GPU9 have the same
> motherboard, but the other servers do not.
>
> Best,
> Predrag
>
>
>
> > Unlike GPU 5, 6 & 9, you can actually get the GPU working, but when I run a
> > Keras prediction function it crashes and says:
> >
> > Loaded runtime CuDNN library: 7101 (compatibility version 7100) but source
> > was compiled with 7004 (compatibility version 7000). If using a binary
> > install, upgrade your CuDNN library to match. If building from sources,
> > make sure the library loaded at runtime matches a compatible version
> > specified during compile configuration.
> > 2018-03-29 09:57:49.807855: F tensorflow/core/kernels/conv_ops.cc:717]
> > Check failed: stream->parent()->GetConvolveAlgorithms(
> > conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms)
> >
> > Same code works on GPU4.
> > I know this is not informative, I'll look into it later, just wanted to
> > give you a heads up.
> > I think this might be why there aren't any users on GPU8 but there are on
> > GPU4.
> >
> > Thanks,
> > Yotam.
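
The cuDNN numbers in the error message quoted above (7101 loaded, 7004
compiled) can be decoded as follows. This is a sketch under two assumptions:
that cuDNN 7 encodes its version as major*1000 + minor*100 + patch, and that
the "compatibility version" is the same number with the patch digits zeroed,
as the values 7100 and 7000 in the message suggest. The function names are
hypothetical, and this is not TensorFlow's actual check.

```python
def decode_cudnn_version(version):
    """Split an integer such as 7101 into (major, minor, patch).

    Assumes the cuDNN 7 encoding: major*1000 + minor*100 + patch.
    """
    major, rest = divmod(version, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def compatibility_version(version):
    """Zero the patch digits, e.g. 7101 -> 7100 (assumed from the log)."""
    return version // 100 * 100


def versions_match(loaded, compiled):
    """True if both versions share the same major and minor numbers."""
    return compatibility_version(loaded) == compatibility_version(compiled)


if __name__ == "__main__":
    loaded, compiled = 7101, 7004  # values taken from the error message
    print("loaded  :", decode_cudnn_version(loaded))
    print("compiled:", decode_cudnn_version(compiled))
    print("match   :", versions_match(loaded, compiled))
```

Read this way, the runtime library on GPU8 is cuDNN 7.1.1 while the
TensorFlow build expects 7.0.x, which is consistent with the same code
working on GPU4 if that machine carries a different cuDNN.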