From eyolcu at cs.cmu.edu Sat Sep 1 10:30:41 2018
From: eyolcu at cs.cmu.edu (Emre Yolcu)
Date: Sat, 1 Sep 2018 10:30:41 -0400
Subject: Disk I/O error
Message-ID:

Hi,

Since yesterday I've been getting the error below (on the CPU nodes, GPU nodes, and lake) when I start ipython. Has anybody run into the same thing, or do you have ideas about how it can be fixed? I did try deleting the file.

[TerminalIPythonApp] ERROR | Failed to open SQLite history /zfsauton/home/eyolcu/.ipython/profile_default/history.sqlite (disk I/O error).
[TerminalIPythonApp] ERROR | History file was moved to /zfsauton/home/eyolcu/.ipython/profile_default/history-corrupt.sqlite and a new file created.

Emre

From jkoushik at andrew.cmu.edu Sat Sep 1 11:48:29 2018
From: jkoushik at andrew.cmu.edu (Jayanth Koushik)
Date: Sat, 1 Sep 2018 11:48:29 -0400
Subject: ImageNet Data
Message-ID:

Hi all,

Is the ImageNet dataset available on any of the nodes? I'd like to avoid re-downloading it if possible.

Thanks!
~Jayanth

From yichongx at cs.cmu.edu Sat Sep 1 12:58:13 2018
From: yichongx at cs.cmu.edu (Yichong Xu)
Date: Sat, 1 Sep 2018 16:58:13 +0000
Subject: CUDA Error
In-Reply-To:
References: <26992d64-ea80-c5fb-1fff-7319b674f4ee@andrew.cmu.edu> <20180831152342.ElF7KE0a6%predragp@andrew.cmu.edu>
Message-ID:

Hi,

I'm having the same problem here. @Vincent, have you figured out how to fix this?

>>> import torch
>>> a=torch.zeros(4,4)
>>> a.cuda()
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524590031827/work/aten/src/THC/THCTensorRandom.cu line=25 error=30 : unknown error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: cuda runtime error (30) : unknown error at /opt/conda/conda-bld/pytorch_1524590031827/work/aten/src/THC/THCTensorRandom.cu:25

Previously I could use PyTorch without error.

Thanks,
Yichong

From: Autonlab-users On Behalf Of Jayanth Koushik
Sent: August 31, 2018 11:34
To: Predrag Punosevac
Cc: users at autonlab.org
Subject: Re: CUDA Error

The last line of the error refers to a different conda. Can you make sure all paths are correct?

~Jayanth

On Aug 31, 2018, at 11:23 AM, Predrag Punosevac wrote:

Vincent Jeanselme wrote:

Good Morning,

Let's try users at autonlab.org

Predrag

Since the hard drive was changed, I get the following error when I run on the GPUs (I have reinstalled PyTorch, but that does not solve my problem). I think the problem comes from the CUDA library.

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524577177097/work/aten/src/THC/THCTensorRandom.cu line=25 error=30 : unknown error
Traceback (most recent call last):
  File "./train.py", line 519, in <module>
    main(args)
  File "./train.py", line 61, in main
    model = nn.DataParallel(model).cuda()
  File "/zfsauton/home/vjeanselme/anaconda3/envs/lstmpy27/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 102, in __init__
    _check_balance(self.device_ids)
  File "/zfsauton/home/vjeanselme/anaconda3/envs/lstmpy27/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 17, in _check_balance
    dev_props = [torch.cuda.get_device_properties(i) for i in device_ids]
  File "/zfsauton/home/vjeanselme/anaconda3/envs/lstmpy27/lib/python2.7/site-packages/torch/cuda/__init__.py", line 290, in get_device_properties
    init()  # will define _get_device_properties and _CudaDeviceProperties
  File "/zfsauton/home/vjeanselme/anaconda3/envs/lstmpy27/lib/python2.7/site-packages/torch/cuda/__init__.py", line 143, in init
    _lazy_init()
  File "/zfsauton/home/vjeanselme/anaconda3/envs/lstmpy27/lib/python2.7/site-packages/torch/cuda/__init__.py", line 161, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (30) : unknown error at /opt/conda/conda-bld/pytorch_1524577177097/work/aten/src/THC/THCTensorRandom.cu:25

I don't know how to fix it. Would you have any suggestions?

Thank you,

--
Vincent Jeanselme
-----------------
Analyst Researcher
Auton Lab - Robotics Institute
Carnegie Mellon University

From eyolcu at cs.cmu.edu Mon Sep 3 09:43:32 2018
From: eyolcu at cs.cmu.edu (Emre Yolcu)
Date: Mon, 3 Sep 2018 09:43:32 -0400
Subject: CUDA Error
In-Reply-To:
References: <26992d64-ea80-c5fb-1fff-7319b674f4ee@andrew.cmu.edu> <20180831152342.ElF7KE0a6%predragp@andrew.cmu.edu>
Message-ID:

I'm getting the same error.

From eyolcu at cs.cmu.edu Tue Sep 4 15:18:40 2018
From: eyolcu at cs.cmu.edu (Emre Yolcu)
Date: Tue, 4 Sep 2018 15:18:40 -0400
Subject: PyTorch problem
Message-ID:

Hi,

Predrag and I are troubleshooting the PyTorch issue and were wondering:

Is anybody able to run PyTorch GPU models on gpu1-9? If you can, we would appreciate a response.

Also, is it a problem for anyone if gpu8 is rebooted today?

Thanks,

Emre

From jaylee at cs.cmu.edu Tue Sep 4 15:40:34 2018
From: jaylee at cs.cmu.edu (Jay Yoon Lee)
Date: Tue, 4 Sep 2018 15:40:34 -0400
Subject: PyTorch problem
In-Reply-To:
References:
Message-ID:

Hi Emre,

For gpu8, my job has been running for a day and a half and I think it will finish by tomorrow. Would you be able to wait? And may I ask why you are trying to reboot?

Thanks,
Jay-Yoon

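The question of whether PyTorch GPU jobs run on a given node can be answered with a short per-GPU smoke test. The sketch below is an illustration only, not something posted to the list. It assumes a PyTorch version that accepts the device= keyword (0.4 or later), and the function name smoke_test_gpus is hypothetical. It creates a small tensor on each visible device, which forces CUDA initialization, then multiplies it, which typically dispatches to cuBLAS, and records any RuntimeError instead of stopping at the first one.

```python
def smoke_test_gpus():
    """Return {device: "ok" or error text} for every visible CUDA device.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    try:
        import torch
    except ImportError:
        return {"torch": "not installed"}

    if not torch.cuda.is_available():
        return {"cuda": "not available"}

    results = {}
    for index in range(torch.cuda.device_count()):
        name = "cuda:%d" % index
        try:
            a = torch.zeros(4, 4, device=name)  # forces CUDA context creation
            total = torch.mm(a, a).sum().item()  # matmul, then copy back to host
            results[name] = "ok" if total == 0.0 else "unexpected result"
        except RuntimeError as exc:
            # Errors such as "CUDA error: unknown error" surface as RuntimeError.
            results[name] = "FAILED: %s" % exc
    return results


if __name__ == "__main__":
    for device, status in sorted(smoke_test_gpus().items()):
        print(device, status)
```

Running this once per node, and once per user account, would show whether a failure follows the machine or the environment.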
From elenagiusarma at gmail.com Tue Sep 4 15:39:25 2018
From: elenagiusarma at gmail.com (Elena Giusarma)
Date: Tue, 4 Sep 2018 15:39:25 -0400
Subject: CUDA Error
In-Reply-To:
References: <26992d64-ea80-c5fb-1fff-7319b674f4ee@andrew.cmu.edu> <20180831152342.ElF7KE0a6%predragp@andrew.cmu.edu>
Message-ID:

Hi,

I am getting this error:

    net.cuda(3)
  File "/zfsauton/home/egiusarm/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 258, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/zfsauton/home/egiusarm/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 185, in _apply
    module._apply(fn)
  File "/zfsauton/home/egiusarm/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 191, in _apply
    param.data = fn(param.data)
  File "/zfsauton/home/egiusarm/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 258, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: unknown error

I never had that error before. I have always used PyTorch without problems.

Thanks,
Elena

From yichongx at cs.cmu.edu Tue Sep 4 21:57:16 2018
From: yichongx at cs.cmu.edu (Yichong Xu)
Date: Wed, 5 Sep 2018 01:57:16 +0000
Subject: PyTorch problem
In-Reply-To:
References:
Message-ID: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu>

Just wondering: can TensorFlow run well on these machines? I hope someone can confirm this so that we can isolate the problem.

Here's a further test: I tried running the CUDA examples from the CUDA installation (in /usr/local/cuda/sample) on gpu2, from my scratch directory. Simple jobs like deviceQuery succeed, but simpleCUBLAS fails:

yichongx@gpu2$ cd /home/scratch/yichongx/
yichongx@gpu2$ cd
0_Simple/       2_Graphics/     4_Finance/      6_Advanced/       bin/      conda/
1_Utilities/    3_Imaging/      5_Simulations/  7_CUDALibraries/  common/   miniconda3/
yichongx@gpu2$ cd 7_CUDALibraries/
yichongx@gpu2$ cd simpleCUBLAS
yichongx@gpu2$ CUDA_VISIBLE_DEVICES=3 ./simpleCUBLAS
GPU Device 0: "TITAN X (Pascal)" with compute capability 6.1

simpleCUBLAS test running..
!!!! CUBLAS initialization error
yichongx@gpu2$

This is also consistent with our previous errors from PyTorch, which say that the cuBLAS library is not initialized.

So at the very least there is some problem with cuBLAS on gpu2. This post suggests that using sudo can resolve the problem, probably because of permission problems on the cuBLAS libraries:
https://devtalk.nvidia.com/default/topic/1027602/cuda-setup-and-installation/cublas-libraries-with-incorrect-permissions/

@Predrag: Can you try running the simpleCUBLAS example from the CUDA library, with and without root privileges? I think that is something you are more familiar with. Thank you very much!

Thanks,
Yichong

From manzil at cmu.edu Tue Sep 4 22:01:50 2018
From: manzil at cmu.edu (Manzil Zaheer)
Date: Wed, 5 Sep 2018 02:01:50 +0000
Subject: PyTorch problem
In-Reply-To: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu>
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu>
Message-ID:

Hi Yichong,

Yes, I am able to run TF and PyTorch on these machines. Recently someone else had a similar issue, but it was fixed by reinstalling some local packages.

Thanks,
Manzil

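The permissions hypothesis raised in the simpleCUBLAS report can be checked without root by looking at the mode bits of the cuBLAS library files. The following is a minimal sketch, not a confirmed diagnosis. The directory /usr/local/cuda/lib64 and the libcublas* file pattern are assumptions based on a default Linux CUDA install (the thread itself only names /usr/local/cuda), and the helper name libs_missing_world_read is made up for illustration.

```python
import glob
import os
import stat


def libs_missing_world_read(lib_dir, pattern="libcublas*"):
    """Return sorted paths in lib_dir matching pattern that lack the world-read bit.

    Hypothetical helper: it inspects permission bits only and does not try to
    load the libraries.
    """
    flagged = []
    for path in sorted(glob.glob(os.path.join(lib_dir, pattern))):
        try:
            mode = os.stat(path).st_mode  # follows symlinks to the real file
        except OSError:
            continue  # skip dangling symlinks and files that vanish mid-scan
        if not mode & stat.S_IROTH:
            flagged.append(path)
    return flagged


if __name__ == "__main__":
    # Assumed location; adjust to wherever CUDA is installed on the node.
    for path in libs_missing_world_read("/usr/local/cuda/lib64"):
        print("not world-readable:", path)
```

An empty result would suggest looking elsewhere, for example at the per-user environment rather than the system libraries.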
From bparia at cs.cmu.edu Wed Sep 5 00:19:50 2018
From: bparia at cs.cmu.edu (Biswajit Paria)
Date: Wed, 5 Sep 2018 00:19:50 -0400
Subject: PyTorch problem
In-Reply-To:
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu>
Message-ID:

I am facing a similar error on all GPU machines. Has anyone found a solution yet?

2018-09-05 00:27:41.546064: E tensorflow/stream_executor/cuda/cuda_blas.cc:459] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED

--
Biswajit Paria
PhD in ML @ CMU

From mauorama at gmail.com Wed Sep 5 15:12:25 2018
From: mauorama at gmail.com (Mauricio)
Date: Wed, 5 Sep 2018 15:12:25 -0400
Subject: CUDA error: unknown error
Message-ID:

Hi,

I am having this problem with PyTorch. Is there any solution?

import torch
a = torch.rand(5, 3)
device = torch.device('cuda')
a.to(device)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: unknown error

Thank you.

From vjeansel at andrew.cmu.edu Wed Sep 5 15:55:14 2018
From: vjeansel at andrew.cmu.edu (Vincent Jeanselme)
Date: Wed, 5 Sep 2018 15:55:14 -0400
Subject: iPython Error
Message-ID:

Hello all,

If you get the following error when you use ipython on the server (or if your Jupyter notebooks are much slower than before):

[TerminalIPythonApp] ERROR | Failed to open SQLite history /home/scratch/$USER/.ipython/ipython_hist.sqlite (unable to open database file).

You first need to create an ipython config file:

    ipython profile create

Then add the following line to the created file (usually /zfsauton/home/$USER/.ipython/profile_default/ipython_kernel_config.py), where USER is your username:

    c.HistoryManager.hist_file="/home/scratch/USER/.ipython_hist.sqlite"

This way ipython will write its history to the local disk.

Vincent

--
Vincent Jeanselme
-----------------
Analyst Researcher
Auton Lab - Robotics Institute
Carnegie Mellon University

From predragp at andrew.cmu.edu Wed Sep 5 16:40:37 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Wed, 5 Sep 2018 16:40:37 -0400
Subject: PyTorch problem
In-Reply-To:
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu>
Message-ID:

I just rebooted GPU8. All packages are up to date. The NVIDIA driver appears to be working properly and I can do GPU computations from MATLAB. Let's now try to get PyTorch working on GPU8.

Predrag

From manzil at cmu.edu Wed Sep 5 16:42:36 2018
From: manzil at cmu.edu (Manzil Zaheer)
Date: Wed, 5 Sep 2018 20:42:36 +0000
Subject: PyTorch problem
In-Reply-To:
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu>
Message-ID:

It does work for me and my friends.

From predragp at andrew.cmu.edu Wed Sep 5 16:44:26 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Wed, 5 Sep 2018 16:44:26 -0400
Subject: PyTorch problem
In-Reply-To:
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu>
Message-ID:

Should I go ahead and reboot all GPU computing nodes? Can somebody else confirm that a reboot fixes the issue?

Predrag


From manzil at cmu.edu Wed Sep 5 16:46:15 2018
From: manzil at cmu.edu (Manzil Zaheer)
Date: Wed, 5 Sep 2018 20:46:15 +0000
Subject: PyTorch problem
In-Reply-To:
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu>

Message-ID: <7f7c51795a7f4a2398934b79bf5de592@cmu.edu>

It was working for me before the reboot as well. PyTorch does work on all nodes for me.

What I am trying to say is that I think it is not an issue at the system level but at the user account level. I might be wrong though.

From predragp at andrew.cmu.edu Wed Sep 5 16:56:14 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Wed, 05 Sep 2018 16:56:14 -0400
Subject: PyTorch problem
In-Reply-To: <7f7c51795a7f4a2398934b79bf5de592@cmu.edu>
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu>

<7f7c51795a7f4a2398934b79bf5de592@cmu.edu>
Message-ID: <20180905205614.IMQuJDzT_%predragp@andrew.cmu.edu>

Manzil Zaheer wrote:

> It was working for me before the reboot as well. PyTorch does work on all
> nodes for me.

Aha! Gotcha.

> What I am trying to say is that I think it is not an issue at the system
> level but at the user account level. I might be wrong though.

That was my hunch as well. They were trying to convince me in a 150-e-mail chain over the weekend that pytorch was broken when I replaced a failed HDD on the main file server. That didn't make any sense.

Could you please share your binaries and setup with other pytorch users?

Cheers,
Predrag

From eyolcu at cs.cmu.edu Wed Sep 5 17:07:56 2018
From: eyolcu at cs.cmu.edu (Emre Yolcu)
Date: Wed, 5 Sep 2018 17:07:56 -0400
Subject: PyTorch problem
In-Reply-To: <20180905205614.IMQuJDzT_%predragp@andrew.cmu.edu>
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu>

<7f7c51795a7f4a2398934b79bf5de592@cmu.edu> <20180905205614.IMQuJDzT_%predragp@andrew.cmu.edu>
Message-ID:

Manzil, could you share your `conda env export` (or equivalent) output for the environment you use for pytorch? It's still not working for me after the reboot; maybe I can replicate your exact setup and try with that.

Thanks,

Emre

From boecking at andrew.cmu.edu Wed Sep 5 17:12:49 2018
From: boecking at andrew.cmu.edu (Benedikt Boecking)
Date: Wed, 5 Sep 2018 17:12:49 -0400
Subject: PyTorch problem
In-Reply-To:
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu>

<7f7c51795a7f4a2398934b79bf5de592@cmu.edu> <20180905205614.IMQuJDzT_%predragp@andrew.cmu.edu>
Message-ID:

Not sure this will help, but I (very) recently had issues with software installed via conda linking to some of my local Python installations. Removing and reinstalling the packages did not help. Ultimately, I removed all my local installs in ~/.local/lib/python* and reinstalled conda from scratch. It has been working like a charm since then.

Best,
Ben
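
The conflict Ben describes, where packages under ~/.local/lib/python* take precedence over a conda environment, can be checked before deleting anything. The following is only a minimal sketch: the function name `user_site_report` and the module names passed to it are illustrative assumptions, not something used in this thread. It relies on the standard-library `site` and `importlib.util` modules to show where each module would be imported from.

```python
import importlib.util
import site
import sys


def user_site_report(module_names):
    """Report whether each module would be imported from the per-user site directory.

    Hypothetical helper for illustration; the module list is up to the caller.
    """
    # The per-user site directory, typically ~/.local/lib/pythonX.Y/site-packages.
    user_site = site.getusersitepackages()
    report = {}
    for name in module_names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            spec = None
        origin = getattr(spec, "origin", None) if spec is not None else None
        report[name] = {
            "origin": origin,
            "from_user_site": bool(origin) and origin.startswith(user_site),
        }
    return user_site, report


if __name__ == "__main__":
    user_site, report = user_site_report(["numpy", "torch"])
    print("interpreter:", sys.executable)
    print("user site:", user_site, "(enabled)" if site.ENABLE_USER_SITE else "(disabled)")
    for name, info in report.items():
        print(name, "->", info["origin"], "| from user site:", info["from_user_site"])
```

If a module is reported as coming from the user site, starting Python with the environment variable PYTHONNOUSERSITE=1 (or `python -s`) is a less destructive way to test whether the per-user packages are the cause.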

From bparia at cs.cmu.edu Wed Sep 5 17:14:10 2018
From: bparia at cs.cmu.edu (Biswajit Paria)
Date: Wed, 5 Sep 2018 17:14:10 -0400
Subject: PyTorch problem
In-Reply-To:
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu>

<7f7c51795a7f4a2398934b79bf5de592@cmu.edu> <20180905205614.IMQuJDzT_%predragp@andrew.cmu.edu>
Message-ID:

I just tried Yichong's way of testing cuBLAS and got the same error as earlier:

[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "TITAN Xp" with compute capability 6.1

MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
CUDA error at matrixMulCUBLAS.cpp:275 code=1(CUBLAS_STATUS_NOT_INITIALIZED) "cublasCreate(&handle)"

So I believe it is not a conda error. I also tried removing .nv, which doesn't help either. Maybe someone can share their PATH env variable?

--
Biswajit Paria
PhD in ML @ CMU
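
For comparing PATH and related settings between an account where PyTorch works and one where it does not, a small report that two users can run and diff may help. This is a sketch under assumptions: the helper name `gpu_env_report` is hypothetical, and the choice of variables (CUDA_HOME and CONDA_PREFIX in particular) is a guess at what commonly differs between accounts, not something established in this thread.

```python
import os
import shutil


def gpu_env_report(env=None):
    """Collect environment settings that commonly differ between user accounts.

    Hypothetical helper for illustration; pass a dict to inspect a saved environment.
    """
    env = os.environ if env is None else env
    report = {}
    for key in ("PATH", "LD_LIBRARY_PATH"):
        # One entry per directory, so two reports can be diffed line by line.
        report[key] = [p for p in env.get(key, "").split(os.pathsep) if p]
    for key in ("CUDA_VISIBLE_DEVICES", "CUDA_HOME", "CONDA_PREFIX"):
        report[key] = env.get(key)
    # Which executable would be picked up first on the search path.
    report["which"] = {
        tool: shutil.which(tool, path=env.get("PATH"))
        for tool in ("python", "nvcc", "nvidia-smi")
    }
    return report


if __name__ == "__main__":
    for key, value in gpu_env_report().items():
        print(key, "=", value)
```

Running it on the same node under both accounts and comparing the output line by line should show whether a different CUDA toolkit, conda prefix, or library directory is being picked up.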

From predragp at andrew.cmu.edu Wed Sep 5 17:22:49 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Wed, 5 Sep 2018 17:22:49 -0400
Subject: PyTorch problem
In-Reply-To:
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu>

<7f7c51795a7f4a2398934b79bf5de592@cmu.edu> <20180905205614.IMQuJDzT_%predragp@andrew.cmu.edu>

Message-ID:

People should use /opt/rh/rh-python36. I did install /opt/miniconda3, but I am not a big fan.

Predrag

From yichongx at cs.cmu.edu Wed Sep 5 17:27:27 2018
From: yichongx at cs.cmu.edu (Yichong Xu)
Date: Wed, 5 Sep 2018 21:27:27 +0000
Subject: PyTorch problem
In-Reply-To:
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu>

<7f7c51795a7f4a2398934b79bf5de592@cmu.edu> <20180905205614.IMQuJDzT_%predragp@andrew.cmu.edu> Message-ID: <750EDAE5-46EE-43A4-ADED-2EF54345F12F@cs.cmu.edu> I think with Biswajit?s and my problem with cuda, we should isolate the problem with just CUDA (and drivers) instead of wandering around python or pytorch. Predrag can you test the CUDA examples? I sort of agree with Manzil that this might be a user account problem. Thanks, Yichong On Sep 5, 2018, at 5:14 PM, Biswajit Paria > wrote: I just tried Yichong's way of testing cuBLAS, and get the same error as earlier: [Matrix Multiply CUBLAS] - Starting... GPU Device 0: "TITAN Xp" with compute capability 6.1 MatrixA(640,480), MatrixB(480,320), MatrixC(640,320) CUDA error at matrixMulCUBLAS.cpp:275 code=1(CUBLAS_STATUS_NOT_INITIALIZED) "cublasCreate(&handle)" So I believe it is not a conda error. I also tried removing .nv, doesn't help either. Maybe someone can share the PATH env variable? On Wed, Sep 5, 2018 at 5:08 PM Emre Yolcu > wrote: Manzil, could you share your `conda env export` (or equivalent) output for the environment you use for pytorch? It's still not working for me after reboot, maybe I can try replicating your exact setup and try with that. Thanks, Emre On Wed, Sep 5, 2018 at 4:56 PM, Predrag Punosevac > wrote: Manzil Zaheer > wrote: > It was working me before reboot as well. PyTorch does work on all > nodes for me. Aha! Gotcha. > > I am trying to say is that i think it is not issue at system level but > at user account level. I might be wrong though. That was my hunch as well. They were trying to convince me in a 150 e-mails chain over the weekend that pytorch was broken when I replaced a failed HDD on the main file server. That didn't make any sense. Could you please share your binaries and setup with outher pytorch users? 
Cheers,
Predrag

>
>
> -------- Original message --------
> From: Predrag Punosevac
> Date: 9/5/18 4:44 PM (GMT-05:00)
> To: Manzil Zaheer
> Cc: Biswajit Paria, Yichong Xu, Emre Yolcu, users at autonlab.org
> Subject: Re: PyTorch problem
>
> Should I go ahead and reboot all GPU computing nodes? Can somebody else confirm that a reboot fixes the issue?
>
> Predrag
>
> On Wed, Sep 5, 2018 at 4:42 PM, Manzil Zaheer wrote:
> It does work for me and my friends
>
>
>
>
> -------- Original message --------
> From: Predrag Punosevac
> Date: 9/5/18 4:40 PM (GMT-05:00)
> To: Biswajit Paria
> Cc: Manzil Zaheer, Yichong Xu, Emre Yolcu, users at autonlab.org
> Subject: Re: PyTorch problem
>
> I just rebooted GPU8. All packages are up to date. NVidia driver appears to be working properly and I can do GPU computations from MATLAB. Let's try now to get pytorch working on GPU8.
>
> Predrag
>
> On Wed, Sep 5, 2018 at 12:19 AM, Biswajit Paria wrote:
> I am facing a similar error on all GPU machines. Did someone find a solution yet?
>
>
> 2018-09-05 00:27:41.546064: E tensorflow/stream_executor/cuda/cuda_blas.cc:459] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
>
> On Tue, Sep 4, 2018 at 10:03 PM Manzil Zaheer wrote:
> Hi Yichong
>
> Yes I am able to run TF and PyTorch on these machines. Recently someone else also had a similar issue, but it got fixed by reinstalling some local packages.
>
> Thanks,
> Manzil
>
>
> -------- Original message --------
> From: Yichong Xu
> Date: 9/4/18 9:58 PM (GMT-05:00)
> To: Emre Yolcu, Predrag Punosevac
> Cc: users at autonlab.org
> Subject: Re: PyTorch problem
>
> Just wondering - can Tensorflow run well on these machines? I hope someone can confirm this so that we can isolate the problem.
> OK so here's a further test: I tried running the cuda examples from the cuda installation (in /usr/local/cuda/sample), on gpu2 in my scratch directory. Simple jobs like deviceQuery succeed, but simpleCUBLAS fails:
> yichongx at gpu2$ cd /home/scratch/yichongx/
> yichongx at gpu2$ cd
> 0_Simple/ 2_Graphics/ 4_Finance/ 6_Advanced/ bin/ conda/
> 1_Utilities/ 3_Imaging/ 5_Simulations/ 7_CUDALibraries/ common/ miniconda3/
> yichongx at gpu2$ cd 7_CUDALibraries/
> yichongx at gpu2$ cd simpleCUBLAS
> yichongx at gpu2$ CUDA_VISIBLE_DEVICES=3 ./simpleCUBLAS
> GPU Device 0: "TITAN X (Pascal)" with compute capability 6.1
>
> simpleCUBLAS test running..
> !!!! CUBLAS initialization error
> yichongx at gpu2$
>
>
> This is also consistent with our previous errors from pytorch, which say the cublas library is not initialized.
>
> So this means there is at least some problem with CUBLAS on gpu2. This post suggests that using sudo can resolve the problem, probably because of some permission problems on the CUBLAS libraries:
> https://devtalk.nvidia.com/default/topic/1027602/cuda-setup-and-installation/cublas-libraries-with-incorrect-permissions/
> @Predrag: Can you try running the simpleCUBLAS example from the CUDA library, with and without root privilege? I think that might be something that you are more familiar with. Thank you very much!
>
>
> Thanks,
> Yichong
>
> On Sep 4, 2018, at 3:18 PM, Emre Yolcu wrote:
>
> Hi,
>
> We are trying to troubleshoot the PyTorch issue with Predrag and were wondering:
>
> Is anybody able to run PyTorch GPU models on gpu1-9? If you can, we would appreciate it if you could respond.
>
> Also, is it a problem for anyone if gpu8 is rebooted today?
>
> Thanks,
>
> Emre
>
>
>
> --
> Biswajit Paria
> PhD in ML @ CMU
>
>
--
Biswajit Paria
PhD in ML @ CMU
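Editorial note: the thread asks users to compare their PATH and LD_LIBRARY_PATH settings and raises the possibility of a permission problem on the cuBLAS libraries. A minimal sketch of how that information could be collected in one place follows. It is an illustration only, not something posted in the thread: the helper names are hypothetical, `/usr/local/cuda/lib64` is an assumed default library location, and a file being readable does not prove that cuBLAS will initialize.

```python
import glob
import os


def split_path_var(value):
    """Split a path-list variable (e.g. LD_LIBRARY_PATH) into non-empty entries."""
    return [entry for entry in value.split(os.pathsep) if entry]


def find_cublas(directories):
    """Return (path, readable) pairs for libcublas files found in the directories."""
    found = []
    for directory in directories:
        pattern = os.path.join(directory, "libcublas*.so*")
        for path in sorted(glob.glob(pattern)):
            # os.access checks read permission for the current user only.
            found.append((path, os.access(path, os.R_OK)))
    return found


def report(environ=None, extra_dirs=("/usr/local/cuda/lib64",)):
    """Build a list of report lines; extra_dirs is an assumed default location."""
    environ = os.environ if environ is None else environ
    lines = []
    for name in ("PATH", "LD_LIBRARY_PATH", "CUDA_VISIBLE_DEVICES"):
        lines.append(f"{name}={environ.get(name, '<unset>')}")
    dirs = split_path_var(environ.get("LD_LIBRARY_PATH", "")) + list(extra_dirs)
    for path, readable in find_cublas(dirs):
        lines.append(f"{path}: {'readable' if readable else 'NOT readable'}")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

Running the script from a working account and from a failing account on the same node, then diffing the two outputs, would show whether the environments differ before anyone reinstalls packages.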
From bparia at cs.cmu.edu Wed Sep 5 17:28:56 2018
From: bparia at cs.cmu.edu (Biswajit Paria)
Date: Wed, 5 Sep 2018 17:28:56 -0400
Subject: PyTorch problem
In-Reply-To: <750EDAE5-46EE-43A4-ADED-2EF54345F12F@cs.cmu.edu>
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu>

<7f7c51795a7f4a2398934b79bf5de592@cmu.edu> <20180905205614.IMQuJDzT_%predragp@andrew.cmu.edu> <750EDAE5-46EE-43A4-ADED-2EF54345F12F@cs.cmu.edu>
Message-ID:

If the CUDA examples work for anyone, can they share their PATH and LD_LIBRARY_PATH variables?

Thanks

On Wed, Sep 5, 2018 at 5:27 PM Yichong Xu wrote:

> I think with Biswajit's and my problem with cuda, we should isolate the
> problem to just CUDA (and drivers) instead of wandering around python or
> pytorch.
> Predrag, can you test the CUDA examples? I sort of agree with Manzil that
> this might be a user account problem.
>
> *Thanks,*
> *Yichong*
>
>
>
> On Sep 5, 2018, at 5:14 PM, Biswajit Paria wrote:
>
> I just tried Yichong's way of testing cuBLAS, and get the same error as
> earlier:
>
> [Matrix Multiply CUBLAS] - Starting...
> GPU Device 0: "TITAN Xp" with compute capability 6.1
>
> MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
> CUDA error at matrixMulCUBLAS.cpp:275
> code=1(CUBLAS_STATUS_NOT_INITIALIZED) "cublasCreate(&handle)"
>
>
> So I believe it is not a conda error. I also tried removing .nv; that doesn't
> help either. Maybe someone can share the PATH env variable?
>
> On Wed, Sep 5, 2018 at 5:08 PM Emre Yolcu wrote:
>
>> Manzil, could you share your `conda env export` (or equivalent) output
>> for the environment you use for pytorch? It's still not working for me
>> after the reboot; maybe I can try replicating your exact setup.
>>
>> Thanks,
>>
>> Emre
>>
>> On Wed, Sep 5, 2018 at 4:56 PM, Predrag Punosevac <predragp at andrew.cmu.edu> wrote:
>>
>>> Manzil Zaheer wrote:
>>>
>>> > It was working for me before the reboot as well. PyTorch does work on all
>>> > nodes for me.
>>>
>>> Aha! Gotcha.
>>>
>>> >
>>> > What I am trying to say is that I think it is not an issue at the system level but
>>> > at the user account level. I might be wrong though.
>>>
>>> That was my hunch as well. They were trying to convince me in a 150-email
>>> chain over the weekend that pytorch was broken when I replaced a
>>> failed HDD on the main file server. That didn't make any sense.
>>>
>>> Could you please share your binaries and setup with other pytorch
>>> users?
>>>
>>> Cheers,
>>> Predrag
>>>
>>> >
>>> >
>>> > -------- Original message --------
>>> > From: Predrag Punosevac
>>> > Date: 9/5/18 4:44 PM (GMT-05:00)
>>> > To: Manzil Zaheer
>>> > Cc: Biswajit Paria, Yichong Xu <yichongx at cs.cmu.edu>, Emre Yolcu, users at autonlab.org
>>> > Subject: Re: PyTorch problem
>>> >
>>> > Should I go ahead and reboot all GPU computing nodes? Can somebody
>>> else confirm that a reboot fixes the issue?
>>> >
>>> > Predrag
>>> >
>>> > On Wed, Sep 5, 2018 at 4:42 PM, Manzil Zaheer <manzil at cmu.edu> wrote:
>>> > It does work for me and my friends
>>> >
>>> >
>>> >
>>> >
>>> > -------- Original message --------
>>> > From: Predrag Punosevac <predragp at andrew.cmu.edu>
>>> > Date: 9/5/18 4:40 PM (GMT-05:00)
>>> > To: Biswajit Paria
>>> > Cc: Manzil Zaheer, Yichong Xu, Emre Yolcu <eyolcu at cs.cmu.edu>, users at autonlab.org
>>> > Subject: Re: PyTorch problem
>>> >
>>> > I just rebooted GPU8. All packages are up to date. NVidia driver
>>> appears to be working properly and I can do GPU computations from MATLAB.
>>> Let's try now to get pytorch working on GPU8.
>>> >
>>> > Predrag
>>> >
>>> > On Wed, Sep 5, 2018 at 12:19 AM, Biswajit Paria wrote:
>>> > I am facing a similar error on all GPU machines. Did someone find a
>>> solution yet?
>>> >
>>> >
>>> > 2018-09-05 00:27:41.546064: E tensorflow/stream_executor/cuda/cuda_blas.cc:459] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
>>> >
>>> > On Tue, Sep 4, 2018 at 10:03 PM Manzil Zaheer