gpu10: pytorch and cuda

Sun Mar 10 22:19:03 EDT 2019

Yichong Xu <yichongx at cs.cmu.edu> wrote:

> Thanks for the suggestion, Predrag! However, it seems that I cannot even run the cuda10.1 examples, as I mentioned previously:
> yichongx at gpu10$ pwd
> /home/scratch/yichongx/NVIDIA_CUDA-10.1_Samples/0_Simple/simplePrintf
> yichongx at gpu10$ ls
> Makefile  readme.txt  simplePrintf  simplePrintf.cu  simplePrintf.o
> yichongx at gpu10$ ./simplePrintf
> CUDA error at ../../common/inc/helper_cuda.h:744 code=999(cudaErrorUnknown) "cudaGetDeviceCount(&device_count)"
> 
> This seems to be a problem with CUDA itself. I downloaded the cuda10.1 examples from here:
> https://docs.nvidia.com/cuda/cuda-samples/index.html

I can't do anything tonight. Later this week (perhaps Tuesday) I will
try to reinstall everything.
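
Before a reinstall, the system-level facts mentioned in this thread (driver, nvcc, kernel upgrade) can be collected in one place. The following is only a sketch under assumptions: the helper names `run`, `loaded_nvidia_modules` and `system_report` are made up for illustration, and the idea that a missing kernel module (for example nvidia_uvm) or missing /dev/nvidia* nodes could be behind code=999 is a guess, not something confirmed here.

```python
import glob
import shutil
import subprocess


def run(cmd):
    """Run a command; return stripped stdout, or None if it is missing or fails."""
    if shutil.which(cmd[0]) is None:
        return None
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return None
    return result.stdout.strip() if result.returncode == 0 else None


def loaded_nvidia_modules(path="/proc/modules"):
    """List loaded kernel modules whose name starts with 'nvidia' (Linux only)."""
    try:
        with open(path) as f:
            return sorted(line.split()[0] for line in f if line.startswith("nvidia"))
    except OSError:
        return []


def system_report():
    """Gather driver, toolkit, kernel-module and device-node information."""
    return {
        # Driver version and GPU name as reported by the driver itself.
        "driver": run(["nvidia-smi", "--query-gpu=driver_version,name",
                       "--format=csv,noheader"]),
        # Toolkit version as reported by the compiler.
        "nvcc": run(["nvcc", "--version"]),
        # Kernel modules currently loaded (empty list if none or not Linux).
        "modules": loaded_nvidia_modules(),
        # Device nodes the CUDA runtime needs to open.
        "device_nodes": sorted(glob.glob("/dev/nvidia*")),
    }


if __name__ == "__main__":
    for key, value in system_report().items():
        print(f"{key}: {value}")
```

Comparing this output between gpu10 and one of the working GPU machines may help narrow down what changed with the upgrade.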

> 
> 
> 
> 
> Thanks,
> Yichong
> 
> 
> 
> On Mar 10, 2019, at 10:07 PM, Predrag Punosevac <predragp at andrew.cmu.edu> wrote:
> 
> Yichong Xu <yichongx at cs.cmu.edu> wrote:
> 
> I tried installing the nightly version and the same error appears. I
> guess it is a recent problem: a few weeks ago I could still run pytorch,
> but now it breaks (at that time there were only 3 GPUs available on
> gpu10).
> 
> 
> This is likely due to the CUDA upgrade. NVIDIA is aggressively pushing
> the CUDA 10 branch, which we were already using on this server. Both
> pytorch and tensorflow were working fine until I added another GPU card
> a week ago and upgraded the kernel and CUDA to 10.1. I would suggest
> that we do a bit of debugging in unison with upstream. In my experience
> upstream has probably not yet caught up with the latest changes, and
> that is what we are seeing. Instead of me guessing, somebody needs to
> communicate with the pytorch and tensorflow developers (via mailing
> lists).
> 
> Cheers,
> Predrag
> 
> 
> 
> 
> Thanks,
> Yichong
> 
> 
> 
> On Mar 10, 2019, at 3:05 PM, Yotam Hechtlinger <yhechtli at andrew.cmu.edu> wrote:
> 
> Regarding tensorflow, you don't need to compile from source.
> 
> pip install tf-nightly-gpu
> 
> Should get it done. I think that's what I did, but it was a few weeks ago, so try it out and if it doesn't work I'll try to debug it.
> Note that you'll have to uninstall it and install the regular version when you switch back to the other GPUs.
> 
> Not sure regarding pytorch, I haven't tried to install it yet.
> 
> Yotam.
> 
> 
> On Sun, Mar 10, 2019 at 2:24 PM Yichong Xu <yichongx at cs.cmu.edu> wrote:
> It seems like tensorflow does not support cuda10 right now; it has to be installed from source.
> But I'm mainly using pytorch, and the cuda10 version does not run either.
> Plus, I tried the original CUDA example and it cannot find the GPU either:
> (base) yichongx at gpu10$ ls
> Makefile  readme.txt  simplePrintf  simplePrintf.cu  simplePrintf.o
> (base) yichongx at gpu10$ ./simplePrintf
> CUDA error at ../../common/inc/helper_cuda.h:744 code=999(cudaErrorUnknown) "cudaGetDeviceCount(&device_count)"
> (base) yichongx at gpu10$
> 
> 
> 
> Thanks,
> Yichong
> 
> 
> 
> On Mar 10, 2019, at 9:52 AM, Yotam Hechtlinger <yhechtli at andrew.cmu.edu> wrote:
> 
> The CUDA version on gpu10 is not the same as on the rest, so I think a different version of tensorflow has to be installed.
> 
> Check your tensorflow version and whether it supports the CUDA version on gpu10.
> 
> 
> 
> On Saturday, March 9, 2019, Predrag Punosevac <predragp at andrew.cmu.edu> wrote:
> Try CUDA 10.0 instead of 10.1
> 
> On Mar 9, 2019 5:28 PM, Yichong Xu <yichongx at cs.cmu.edu> wrote:
> Same issue here.
> 
> From my iPhone
> 
> On Mar 9, 2019, at 4:01 PM, Emre Yolcu <eyolcu at andrew.cmu.edu> wrote:
> 
> 
> Hi,
> 
> 
> 
> Right now on gpu10 `nvcc --version` and `nvidia-smi` seem to work, but `python -c "import torch; print(torch.cuda.is_available())"` prints False. Is anybody running into the same issue?
> 
> 
> 
> Emre
> 
> 
> 
>
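
The one-line check from the original report can be expanded slightly so that it also shows which CUDA version the installed pytorch build was compiled against. This is a sketch only; `torch_report` is a hypothetical helper name, and it assumes nothing beyond the standard `torch.cuda` and `torch.version` attributes.

```python
def torch_report():
    """Collect pytorch's view of CUDA; returns {'installed': False} if torch cannot be imported."""
    try:
        import torch
    except (ImportError, OSError):
        # torch is absent, or its shared libraries failed to load.
        return {"installed": False}
    available = torch.cuda.is_available()
    return {
        "installed": True,
        "torch_version": torch.__version__,
        # CUDA version the wheel was built for; None for CPU-only builds.
        "built_for_cuda": torch.version.cuda,
        "cuda_available": available,
        # Only ask for the device count when CUDA initialised successfully.
        "device_count": torch.cuda.device_count() if available else 0,
    }


if __name__ == "__main__":
    for key, value in torch_report().items():
        print(f"{key}: {value}")
```

If `built_for_cuda` differs from the toolkit installed on gpu10, that mismatch would be worth mentioning when reporting the problem upstream.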