gpu10: pytorch and cuda
Yichong Xu
yichongx at cs.cmu.edu
Sun Mar 10 22:13:20 EDT 2019
Thanks for the suggestion, Predrag! However, it seems I cannot even run the CUDA 10.1 samples, as I mentioned previously:
yichongx at gpu10$ pwd
/home/scratch/yichongx/NVIDIA_CUDA-10.1_Samples/0_Simple/simplePrintf
yichongx at gpu10$ ls
Makefile readme.txt simplePrintf simplePrintf.cu simplePrintf.o
yichongx at gpu10$ ./simplePrintf
CUDA error at ../../common/inc/helper_cuda.h:744 code=999(cudaErrorUnknown) "cudaGetDeviceCount(&device_count)"
This seems to be a problem with CUDA itself. I downloaded the CUDA 10.1 samples from here:
https://docs.nvidia.com/cuda/cuda-samples/index.html
Thanks,
Yichong
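A possible way to narrow down the cudaGetDeviceCount failure above is to call the driver API directly, bypassing the CUDA runtime and the samples. The sketch below is only an illustration: it assumes the NVIDIA driver library is installed under the usual Linux name libcuda.so.1, and the function name driver_device_count is made up for this example. If cuInit itself returns a non-zero code, that would suggest the problem sits in the driver or kernel module rather than in the toolkit or in pytorch.

```python
import ctypes


def driver_device_count(lib_name="libcuda.so.1"):
    """Ask the CUDA driver API how many devices it sees.

    Returns (status, count). The library name is an assumption
    (the usual Linux name); adjust it for other systems.
    """
    try:
        cuda = ctypes.CDLL(lib_name)
    except OSError:
        # The driver library could not be loaded at all.
        return ("no-driver-library", 0)

    # cuInit must be called before any other driver API function.
    rc = cuda.cuInit(0)
    if rc != 0:
        return (f"cuInit failed: {rc}", 0)

    count = ctypes.c_int(0)
    rc = cuda.cuDeviceGetCount(ctypes.byref(count))
    if rc != 0:
        return (f"cuDeviceGetCount failed: {rc}", 0)
    return ("ok", count.value)


if __name__ == "__main__":
    status, count = driver_device_count()
    print(f"status: {status}, devices: {count}")
```

Running it on the affected machine and on a working one should show whether the two differ already at the driver level.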
On Mar 10, 2019, at 10:07 PM, Predrag Punosevac <predragp at andrew.cmu.edu> wrote:
Yichong Xu <yichongx at cs.cmu.edu> wrote:
I tried installing the nightly version and the same error appears. I
guess it is a recent problem - a few weeks ago I could run pytorch,
but now it breaks (at that time there were only 3 gpus available on
gpu10).
This is likely due to the CUDA upgrade. NVidia is aggressively pushing
the CUDA 10 branch, which we already used on this server. Both pytorch and
tensorflow were working fine up until I added another GPU card a week ago
and upgraded the kernel and CUDA to 10.1. I would suggest that we do a
bit of debugging in unison with upstream. In my experience upstream has
probably not caught up yet with the latest changes, and this is what we see.
Instead of me guessing, somebody needs to communicate with the pytorch and
tensorflow developers (via mailing lists).
Cheers,
Predrag
Thanks,
Yichong
On Mar 10, 2019, at 3:05 PM, Yotam Hechtlinger <yhechtli at andrew.cmu.edu> wrote:
Regarding tensorflow you don't need to compile from source.
pip install tf-nightly-gpu
Should get it done. I think that's what I did, but it was a few weeks ago, so try it out and if it doesn't work I'll try to debug it.
Notice that you'll have to uninstall it and install the regular version when you switch back to the other GPUs.
Not sure regarding pytorch, I haven't tried to install it yet.
Yotam.
On Sun, Mar 10, 2019 at 2:24 PM Yichong Xu <yichongx at cs.cmu.edu> wrote:
It seems like tensorflow does not support cuda10 right now - it has to be installed from source.
I'm mainly using pytorch, though, and the version with cuda10 does not run either.
Plus, I tried the original cuda example and it cannot find the gpu either:
(base) yichongx at gpu10$ ls
Makefile readme.txt simplePrintf simplePrintf.cu simplePrintf.o
(base) yichongx at gpu10$ ./simplePrintf
CUDA error at ../../common/inc/helper_cuda.h:744 code=999(cudaErrorUnknown) "cudaGetDeviceCount(&device_count)"
(base) yichongx at gpu10$
Thanks,
Yichong
On Mar 10, 2019, at 9:52 AM, Yotam Hechtlinger <yhechtli at andrew.cmu.edu> wrote:
The cuda version on gpu10 is not the same as on the rest, so I think a different version of tensorflow has to be installed.
Check your tensorflow version and if it supports the cuda version on gpu10.
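As a rough illustration of the check suggested above, the installed toolkit version can be read from `nvcc --version` and then compared by hand against the CUDA version the framework build targets. This is a minimal sketch, not a definitive tool: the function names are hypothetical, and it assumes nvcc reports its version with the words "release X.Y" somewhere in its output.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Return (major, minor) from `nvcc --version` output, or None.

    Assumes the output contains the words "release X.Y".
    """
    match = re.search(r"release (\d+)\.(\d+)", text)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2)))


def installed_toolkit_version():
    """Run nvcc if it is on PATH; return None when it is missing."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    print("CUDA toolkit:", installed_toolkit_version())
```

The toolkit version alone does not prove that a framework build will work, since prebuilt packages may ship their own CUDA libraries, but a mismatch is a useful first hint.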
On Saturday, March 9, 2019, Predrag Punosevac <predragp at andrew.cmu.edu> wrote:
Try CUDA 10.0 instead of 10.1
On Mar 9, 2019 5:28 PM, Yichong Xu <yichongx at cs.cmu.edu> wrote:
Same issue here.
From my iPhone
On Mar 9, 2019, at 4:01 PM, Emre Yolcu <eyolcu at andrew.cmu.edu> wrote:
Hi,
Right now on gpu10 `nvcc --version` and `nvidia-smi` seem to work, but `python -c 'import torch; print(torch.cuda.is_available())'` prints False. Is anybody running into the same issue?
Emre
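The one-liner above only reports True or False. A slightly longer sketch along the same lines can also show which CUDA version the installed pytorch build targets, which helps when comparing against what is installed on the machine. The function name torch_cuda_report is made up for this example, and the script assumes nothing beyond a Python environment in which pytorch may or may not be installed.

```python
import importlib


def torch_cuda_report():
    """Collect facts that may explain a False from torch.cuda.is_available()."""
    report = {"torch": None, "built_for_cuda": None, "available": False, "devices": 0}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # pytorch is not installed in this environment.
        return report

    report["torch"] = torch.__version__
    # torch.version.cuda is None for CPU-only builds.
    report["built_for_cuda"] = torch.version.cuda
    report["available"] = torch.cuda.is_available()
    if report["available"]:
        report["devices"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in torch_cuda_report().items():
        print(f"{key}: {value}")
```

If built_for_cuda is None, the build is CPU-only and is_available() will print False regardless of the driver.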