CUDA hangs

Yichong Xu yichongx at cs.cmu.edu
Tue Nov 6 21:43:29 EST 2018


We have run into this issue before: for some reason the CUDA compute cache can no longer live on the NFS server. Pointing the cache at local scratch resolves the problem (works for me):
export CUDA_CACHE_PATH=/home/scratch/[your_id]/[some_folder]
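In case it helps, here is a minimal way to check that the new cache location is picked up before retrying the hanging commands (a sketch only; the directory name cuda_cache is just an example, and this assumes a bash shell with PyTorch installed):

# create a cache directory on local scratch (not NFS) and point CUDA at it
mkdir -p /home/scratch/[your_id]/cuda_cache
export CUDA_CACHE_PATH=/home/scratch/[your_id]/cuda_cache

# confirm the variable is set, then rerun the command that was hanging
echo $CUDA_CACHE_PATH
python -c "import torch; print(torch.zeros(4).cuda())"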

Thanks,
Yichong



On Nov 6, 2018, at 7:41 PM, Emre Yolcu <eyolcu at cs.cmu.edu> wrote:

Could you try setting everything up in the scratch directory and testing that way (if that's not what you're already doing)? The last time we had a CUDA problem I moved everything from /zfsauton/home to the /home/scratch directories, and I cannot reproduce the error on gpu{6,8,9}.

On Tue, Nov 6, 2018 at 6:41 PM, <qiong.zhang at stat.ubc.ca> wrote:

I have a similar issue. When I submit the job, it says "RuntimeError: CUDA error: unknown error". I tried the simple commands that you provided, and they don't work either.

Qiong

November 6, 2018 3:02 PM, "Matthew Barnes" <mbarnes1 at andrew.cmu.edu> wrote:
Is anyone else having issues with CUDA since this week? Even simple PyTorch commands hang:
(torch) bash-4.2$ python
Python 2.7.5 (default, Jul 3 2018, 19:30:05)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> x = torch.zeros(4)
>>> x.cuda()
nvidia-smi works, and torch.cuda.is_available() returns True.





