From eyolcu at cs.cmu.edu Sat Sep 1 10:30:41 2018
From: eyolcu at cs.cmu.edu (Emre Yolcu)
Date: Sat, 1 Sep 2018 10:30:41 -0400
Subject: Disk I/O error
Message-ID:

Hi,

Since yesterday I've been getting the error below (on CPU, GPU nodes and lake) when I start ipython. Has anybody run into the same thing, or do you have ideas how it can be fixed? I did try deleting the file.

[TerminalIPythonApp] ERROR | Failed to open SQLite history /zfsauton/home/eyolcu/.ipython/profile_default/history.sqlite (disk I/O error).
[TerminalIPythonApp] ERROR | History file was moved to /zfsauton/home/eyolcu/.ipython/profile_default/history-corrupt.sqlite and a new file created.

Emre
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jkoushik at andrew.cmu.edu Sat Sep 1 11:48:29 2018
From: jkoushik at andrew.cmu.edu (Jayanth Koushik)
Date: Sat, 1 Sep 2018 11:48:29 -0400
Subject: ImageNet Data
Message-ID:

Hi all,

Is the ImageNet dataset available on any of the nodes? I'd like to avoid re-downloading if possible.

Thanks!
~Jayanth

From yichongx at cs.cmu.edu Sat Sep 1 12:58:13 2018
From: yichongx at cs.cmu.edu (Yichong Xu)
Date: Sat, 1 Sep 2018 16:58:13 +0000
Subject: CUDA Error
In-Reply-To:
References: <26992d64-ea80-c5fb-1fff-7319b674f4ee@andrew.cmu.edu> <20180831152342.ElF7KE0a6%predragp@andrew.cmu.edu>
Message-ID:

Hi,

I'm having the same problem here - @ Vincent have you figured out how to fix this?

>>> import torch
>>> a=torch.zeros(4,4)
>>> a.cuda()
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524590031827/work/aten/src/THC/THCTensorRandom.cu line=25 error=30 : unknown error
Traceback (most recent call last):
  File "", line 1, in
RuntimeError: cuda runtime error (30) : unknown error at /opt/conda/conda-bld/pytorch_1524590031827/work/aten/src/THC/THCTensorRandom.cu:25

Previously I could use pytorch without error.

Thanks,
Yichong

From: Autonlab-users On Behalf Of Jayanth Koushik
Sent: August 31, 2018 11:34
To: Predrag Punosevac
Cc: users at autonlab.org
Subject: Re: CUDA Error

The last line of the error refers to a different conda. Can you make sure all paths are correct?

~Jayanth

On Aug 31, 2018, at 11:23 AM, Predrag Punosevac wrote:

Vincent Jeanselme wrote:

Good Morning,

Let's try users at autonlab.org

Predrag

Since the change of the hard drive, I have the following error when I run it on the GPUs (I have reinstalled pytorch but that does not solve my problem). I think that the problem comes from the Cuda library.

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524577177097/work/aten/src/THC/THCTensorRandom.cu line=25 error=30 : unknown error
Traceback (most recent call last):
  File "./train.py", line 519, in
    main(args)
  File "./train.py", line 61, in main
    model = nn.DataParallel(model).cuda()
  File "/zfsauton/home/vjeanselme/anaconda3/envs/lstmpy27/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 102, in __init__
    _check_balance(self.device_ids)
  File "/zfsauton/home/vjeanselme/anaconda3/envs/lstmpy27/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 17, in _check_balance
    dev_props = [torch.cuda.get_device_properties(i) for i in device_ids]
  File "/zfsauton/home/vjeanselme/anaconda3/envs/lstmpy27/lib/python2.7/site-packages/torch/cuda/__init__.py", line 290, in get_device_properties
    init()  # will define _get_device_properties and _CudaDeviceProperties
  File "/zfsauton/home/vjeanselme/anaconda3/envs/lstmpy27/lib/python2.7/site-packages/torch/cuda/__init__.py", line 143, in init
    _lazy_init()
  File "/zfsauton/home/vjeanselme/anaconda3/envs/lstmpy27/lib/python2.7/site-packages/torch/cuda/__init__.py", line 161, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (30) : unknown error at /opt/conda/conda-bld/pytorch_1524577177097/work/aten/src/THC/THCTensorRandom.cu:25

I don't know how to fix it, would you have any suggestions?
Thank you,

--
Vincent Jeanselme
-----------------
Analyst Researcher
Auton Lab - Robotics Institute
Carnegie Mellon University

From eyolcu at cs.cmu.edu Mon Sep 3 09:43:32 2018
From: eyolcu at cs.cmu.edu (Emre Yolcu)
Date: Mon, 3 Sep 2018 09:43:32 -0400
Subject: CUDA Error
In-Reply-To:
References: <26992d64-ea80-c5fb-1fff-7319b674f4ee@andrew.cmu.edu> <20180831152342.ElF7KE0a6%predragp@andrew.cmu.edu>
Message-ID:

I'm getting the same error.
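For anyone who wants to report their setup in a comparable way, the three-line reproduction quoted earlier in this thread could be wrapped as below. This is only a sketch: the function names `describe_environment` and `cuda_smoke_test` and the status strings are made up for illustration, and the script assumes nothing beyond the standard `torch.cuda.is_available()`, `torch.__file__` and `torch.version.cuda` attributes. Printing the interpreter and torch locations may also help with the "are all paths correct" question raised above.

```python
import sys


def describe_environment():
    """Collect the interpreter path and, if torch is importable, where it lives."""
    info = {"python": sys.executable}
    try:
        import torch
    except ImportError:
        return info
    info["torch_file"] = torch.__file__
    info["torch_version"] = torch.__version__
    info["torch_cuda"] = str(torch.version.cuda)
    return info


def cuda_smoke_test():
    """Try to move a small tensor to the GPU and return a short status string.

    The status strings are hypothetical labels chosen for this sketch.
    """
    try:
        import torch
    except ImportError:
        return "torch-not-installed"
    if not torch.cuda.is_available():
        return "cuda-not-available"
    try:
        torch.zeros(4, 4).cuda()  # same call as in the reproduction above
    except RuntimeError as exc:
        return "cuda-error: {}".format(exc)
    return "ok"


if __name__ == "__main__":
    for key, value in sorted(describe_environment().items()):
        print("{}: {}".format(key, value))
    print("status: {}".format(cuda_smoke_test()))
```

Running it on a node and pasting the output into a reply would show which Python and which torch build were actually picked up, together with the exact error text.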
From eyolcu at cs.cmu.edu Tue Sep 4 15:18:40 2018
From: eyolcu at cs.cmu.edu (Emre Yolcu)
Date: Tue, 4 Sep 2018 15:18:40 -0400
Subject: PyTorch problem
Message-ID:

Hi,

We are trying to troubleshoot the PyTorch issue with Predrag and were wondering:

Is anybody able to run PyTorch GPU models on gpu1-9? If you can, we would appreciate it if you could respond.

Also, is it a problem for anyone if gpu8 is rebooted today?

Thanks,

Emre

From jaylee at cs.cmu.edu Tue Sep 4 15:40:34 2018
From: jaylee at cs.cmu.edu (Jay Yoon Lee)
Date: Tue, 4 Sep 2018 15:40:34 -0400
Subject: PyTorch problem
In-Reply-To:
References:
Message-ID:

Hi Emre,

For gpu8, I think my job will finish by tomorrow, and it has been running for a day and a half. Would you be able to wait? And may I ask why you are trying to reboot?

Thanks,
Jay-Yoon
From elenagiusarma at gmail.com Tue Sep 4 15:39:25 2018
From: elenagiusarma at gmail.com (Elena Giusarma)
Date: Tue, 4 Sep 2018 15:39:25 -0400
Subject: CUDA Error
In-Reply-To:
References: <26992d64-ea80-c5fb-1fff-7319b674f4ee@andrew.cmu.edu> <20180831152342.ElF7KE0a6%predragp@andrew.cmu.edu>
Message-ID:

Hi,

I am having this error:

    net.cuda(3)
  File "/zfsauton/home/egiusarm/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 258, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/zfsauton/home/egiusarm/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 185, in _apply
    module._apply(fn)
  File "/zfsauton/home/egiusarm/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 191, in _apply
    param.data = fn(param.data)
  File "/zfsauton/home/egiusarm/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 258, in
    return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: unknown error

I never had that error before. I always used pytorch without problems.

thanks,
Elena

From yichongx at cs.cmu.edu Tue Sep 4 21:57:16 2018
From: yichongx at cs.cmu.edu (Yichong Xu)
Date: Wed, 5 Sep 2018 01:57:16 +0000
Subject: PyTorch problem
In-Reply-To:
References:
Message-ID: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu>

Just wondering - can Tensorflow run well on these machines? I hope someone can confirm this so that we can isolate the problem.

OK so here's a further test: I tried running the cuda examples from the cuda installation (in /usr/local/cuda/sample), on gpu2 in my scratch directory. Simple jobs like deviceQuery succeed, but simpleCUBLAS failed:

yichongx@gpu2$ cd /home/scratch/yichongx/
yichongx@gpu2$ cd
0_Simple/ 2_Graphics/ 4_Finance/ 6_Advanced/ bin/ conda/
1_Utilities/ 3_Imaging/ 5_Simulations/ 7_CUDALibraries/ common/ miniconda3/
yichongx@gpu2$ cd 7_CUDALibraries/
yichongx@gpu2$ cd simpleCUBLAS
yichongx@gpu2$ CUDA_VISIBLE_DEVICES=3 ./simpleCUBLAS
GPU Device 0: "TITAN X (Pascal)" with compute capability 6.1

simpleCUBLAS test running..
!!!! CUBLAS initialization error
yichongx@gpu2$

This is also consistent with our previous errors from pytorch, which say cublas library not initialized.

So this means there is at least some problem with CUBLAS on gpu2. This post suggests that using sudo can resolve the problem, probably because of permission problems on the CUBLAS libraries:
https://devtalk.nvidia.com/default/topic/1027602/cuda-setup-and-installation/cublas-libraries-with-incorrect-permissions/

@Predrag: Can you try running the simpleCUBLAS example from the CUDA library, with and without root privilege? I think that might be something that you are more familiar with. Thank you very much!

Thanks,
Yichong

From manzil at cmu.edu Tue Sep 4 22:01:50 2018
From: manzil at cmu.edu (Manzil Zaheer)
Date: Wed, 5 Sep 2018 02:01:50 +0000
Subject: PyTorch problem
In-Reply-To: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu>
References: , <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu>
Message-ID:

Hi Yichong

Yes I am able to run TF and PyTorch on these machines. Recently someone else also had a similar issue, but it got fixed by reinstalling some local packages.

Thanks,
Manzil
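The permissions hypothesis from Yichong's message can be checked without root. The sketch below lists the cuBLAS library files and whether the current user can read them. The directory `/usr/local/cuda/lib64` and the `libcublas*` pattern are assumptions (the thread only mentions `/usr/local/cuda`), and `cublas_library_access` is a hypothetical helper name.

```python
import glob
import os


def cublas_library_access(lib_dir="/usr/local/cuda/lib64", pattern="libcublas*"):
    """Return (path, readable) pairs for library files matching the pattern.

    lib_dir and pattern are assumed defaults; adjust them to the node's
    actual CUDA installation.
    """
    report = []
    for path in sorted(glob.glob(os.path.join(lib_dir, pattern))):
        # os.access checks the permissions of the user running the script.
        report.append((path, os.access(path, os.R_OK)))
    return report


if __name__ == "__main__":
    entries = cublas_library_access()
    if not entries:
        print("no matching libraries found")
    for path, readable in entries:
        print("{}  readable={}".format(path, readable))
```

If every file shows `readable=True` for an affected user, the library permissions are probably not the cause, which would fit Manzil's observation that the same nodes work for some accounts.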
From bparia at cs.cmu.edu Wed Sep 5 00:19:50 2018
From: bparia at cs.cmu.edu (Biswajit Paria)
Date: Wed, 5 Sep 2018 00:19:50 -0400
Subject: PyTorch problem
In-Reply-To:
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu>
Message-ID:

I am facing a similar error on all GPU machines. Did someone find a solution yet?

2018-09-05 00:27:41.546064: E tensorflow/stream_executor/cuda/cuda_blas.cc:459] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED

--
Biswajit Paria
PhD in ML @ CMU

From mauorama at gmail.com Wed Sep 5 15:12:25 2018
From: mauorama at gmail.com (Mauricio)
Date: Wed, 5 Sep 2018 15:12:25 -0400
Subject: CUDA error: unknown error
Message-ID:

Hi,

I am having this problem with pytorch... any solution?

import torch
a = torch.rand(5, 3)
device = torch.device('cuda')
a.to(device)
Traceback (most recent call last):
  File "", line 1, in
RuntimeError: CUDA error: unknown error

Thank you..

From vjeansel at andrew.cmu.edu Wed Sep 5 15:55:14 2018
From: vjeansel at andrew.cmu.edu (Vincent Jeanselme)
Date: Wed, 5 Sep 2018 15:55:14 -0400
Subject: iPython Error
Message-ID:

Hello all,

If you have the following error when you use ipython on the server (or if your jupyter notebooks are much slower than before):

[TerminalIPythonApp] ERROR | Failed to open SQLite history /home/scratch/$USER/.ipython/ipython_hist.sqlite (unable to open database file).

You first need to create an ipython config file:

ipython profile create

Then add the following line to the created file (usually /zfsauton/home/$USER/.ipython/profile_default/ipython_kernel_config.py):

c.HistoryManager.hist_file="/home/scratch/USER/.ipython_hist.sqlite"

This way ipython will write its history on the local disk,

Vincent

--
Vincent Jeanselme
-----------------
Analyst Researcher
Auton Lab - Robotics Institute
Carnegie Mellon University

From predragp at andrew.cmu.edu Wed Sep 5 16:40:37 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Wed, 5 Sep 2018 16:40:37 -0400
Subject: PyTorch problem
In-Reply-To:
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu>
Message-ID:

I just rebooted GPU8. All packages are up to date. The NVidia driver appears to be working properly and I can do GPU computations from MATLAB. Let's try now to get pytorch working on GPU8.

Predrag
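Regarding the history-file setting in Vincent's iPython message: a Python string in the config file is not shell-expanded, so the user name has to be filled in explicitly. A minimal sketch of how the config file could build the path itself is below. The helper `scratch_history_path` is hypothetical, and the `/home/scratch/<user>/.ipython_hist.sqlite` layout is taken from that message, not verified against the nodes.

```python
import getpass
import os


def scratch_history_path(user=None, root="/home/scratch"):
    """Build the history file path on local scratch for the given user.

    The directory layout is an assumption based on the message above.
    """
    user = user or getpass.getuser()
    return os.path.join(root, user, ".ipython_hist.sqlite")


try:
    # get_config() is provided by IPython when it loads a config file.
    c = get_config()
except NameError:
    c = None  # running outside IPython, e.g. when testing the helper

if c is not None:
    c.HistoryManager.hist_file = scratch_history_path()
```

The scratch directory itself has to exist on each node for the history database to be created there.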
From manzil at cmu.edu Wed Sep 5 16:42:36 2018
From: manzil at cmu.edu (Manzil Zaheer)
Date: Wed, 5 Sep 2018 20:42:36 +0000
Subject: PyTorch problem
In-Reply-To:
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu> ,
Message-ID:

It does work for me and my friends

From predragp at andrew.cmu.edu Wed Sep 5 16:44:26 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Wed, 5 Sep 2018 16:44:26 -0400
Subject: PyTorch problem
In-Reply-To:
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu>
Message-ID:

Should I go ahead and reboot all GPU computing nodes? Can somebody else confirm that a reboot fixes the issue?

Predrag
URL:

From manzil at cmu.edu Wed Sep 5 16:46:15 2018
From: manzil at cmu.edu (Manzil Zaheer)
Date: Wed, 5 Sep 2018 20:46:15 +0000
Subject: PyTorch problem
In-Reply-To:
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu> ,
Message-ID: <7f7c51795a7f4a2398934b79bf5de592@cmu.edu>

It was working for me before the reboot as well. PyTorch does work on all nodes for me.

What I am trying to say is that I think it is not an issue at the system level but at the user account level. I might be wrong though.

From predragp at andrew.cmu.edu Wed Sep 5 16:56:14 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Wed, 05 Sep 2018 16:56:14 -0400
Subject: PyTorch problem
In-Reply-To: <7f7c51795a7f4a2398934b79bf5de592@cmu.edu>
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu> <7f7c51795a7f4a2398934b79bf5de592@cmu.edu>
Message-ID: <20180905205614.IMQuJDzT_%predragp@andrew.cmu.edu>

Manzil Zaheer wrote:

> It was working for me before the reboot as well. PyTorch does work on all
> nodes for me.

Aha! Gotcha.

> What I am trying to say is that I think it is not an issue at the system
> level but at the user account level. I might be wrong though.

That was my hunch as well. They were trying to convince me in a
150-e-mail chain over the weekend that pytorch was broken when I replaced a
failed HDD on the main file server. That didn't make any sense.

Could you please share your binaries and setup with other pytorch
users?

Cheers,
Predrag

From eyolcu at cs.cmu.edu Wed Sep 5 17:07:56 2018
From: eyolcu at cs.cmu.edu (Emre Yolcu)
Date: Wed, 5 Sep 2018 17:07:56 -0400
Subject: PyTorch problem
In-Reply-To: <20180905205614.IMQuJDzT_%predragp@andrew.cmu.edu>
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu> <7f7c51795a7f4a2398934b79bf5de592@cmu.edu> <20180905205614.IMQuJDzT_%predragp@andrew.cmu.edu>
Message-ID:

Manzil, could you share your `conda env export` (or equivalent) output for the environment you use for pytorch? It's still not working for me after the reboot; maybe I can replicate your exact setup and try with that.

Thanks,

Emre

From boecking at andrew.cmu.edu Wed Sep 5 17:12:49 2018
From: boecking at andrew.cmu.edu (Benedikt Boecking)
Date: Wed, 5 Sep 2018 17:12:49 -0400
Subject: PyTorch problem
In-Reply-To:
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu> <7f7c51795a7f4a2398934b79bf5de592@cmu.edu> <20180905205614.IMQuJDzT_%predragp@andrew.cmu.edu>
Message-ID:

Not sure this will help, but I (very) recently had issues with software installed via conda linking to some of my local python installations. Removing and reinstalling the packages did not help. Ultimately, I removed all my local installs in ~/.local/lib/python* and installed conda again from scratch. It has been working like a charm since then.

Best,
Ben

From bparia at cs.cmu.edu Wed Sep 5 17:14:10 2018
From: bparia at cs.cmu.edu (Biswajit Paria)
Date: Wed, 5 Sep 2018 17:14:10 -0400
Subject: PyTorch problem
In-Reply-To:
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu> <7f7c51795a7f4a2398934b79bf5de592@cmu.edu> <20180905205614.IMQuJDzT_%predragp@andrew.cmu.edu>
Message-ID:

I just tried Yichong's way of testing cuBLAS, and I get the same error as earlier:

[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "TITAN Xp" with compute capability 6.1
MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
CUDA error at matrixMulCUBLAS.cpp:275 code=1(CUBLAS_STATUS_NOT_INITIALIZED) "cublasCreate(&handle)"

So I believe it is not a conda error. I also tried removing .nv; that doesn't help either.

Maybe someone can share the PATH env variable?

--
Biswajit Paria
PhD in ML @ CMU

From predragp at andrew.cmu.edu Wed Sep 5 17:22:49 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Wed, 5 Sep 2018 17:22:49 -0400
Subject: PyTorch problem
In-Reply-To:
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu> <7f7c51795a7f4a2398934b79bf5de592@cmu.edu> <20180905205614.IMQuJDzT_%predragp@andrew.cmu.edu>
Message-ID:

People should use /opt/rh/rh-python36. I did install /opt/miniconda3 but I am not a big fan.

Predrag

From yichongx at cs.cmu.edu Wed Sep 5 17:27:27 2018
From: yichongx at cs.cmu.edu (Yichong Xu)
Date: Wed, 5 Sep 2018 21:27:27 +0000
Subject: PyTorch problem
In-Reply-To:
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu> <7f7c51795a7f4a2398934b79bf5de592@cmu.edu> <20180905205614.IMQuJDzT_%predragp@andrew.cmu.edu>
Message-ID: <750EDAE5-46EE-43A4-ADED-2EF54345F12F@cs.cmu.edu>

I think with Biswajit's and my problem with cuda, we should isolate the problem with just CUDA (and drivers) instead of wandering around python or pytorch.
Predrag, can you test the CUDA examples? I sort of agree with Manzil that this might be a user account problem.

Thanks,
Yichong

From bparia at cs.cmu.edu Wed Sep 5 17:28:56 2018
From: bparia at cs.cmu.edu (Biswajit Paria)
Date: Wed, 5 Sep 2018 17:28:56 -0400
Subject: PyTorch problem
In-Reply-To: <750EDAE5-46EE-43A4-ADED-2EF54345F12F@cs.cmu.edu>
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu> <7f7c51795a7f4a2398934b79bf5de592@cmu.edu> <20180905205614.IMQuJDzT_%predragp@andrew.cmu.edu> <750EDAE5-46EE-43A4-ADED-2EF54345F12F@cs.cmu.edu>
Message-ID:

If the CUDA examples work for anyone, can they share their PATH and LD_LIBRARY_PATH variables?

Thanks

On Wed, Sep 5, 2018 at 5:27 PM Yichong Xu wrote:

> I think with Biswajit's and my problem with cuda, we should isolate the
> problem with just CUDA (and drivers) instead of wandering around python or
> pytorch.
> Predrag can you test the CUDA examples? I sort of agree with Manzil that
> this might be a user account problem.
>>> >
>>> > Thanks,
>>> >
>>> > Emre
>>> >
>>> > --
>>> > Biswajit Paria
>>> > PhD in ML @ CMU
>
> --
> Biswajit Paria
> PhD in ML @ CMU

--
Biswajit Paria
PhD in ML @ CMU

From manzil at cmu.edu Wed Sep 5 17:32:03 2018
From: manzil at cmu.edu (Manzil Zaheer)
Date: Wed, 5 Sep 2018 21:32:03 +0000
Subject: PyTorch problem
In-Reply-To:
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu> <7f7c51795a7f4a2398934b79bf5de592@cmu.edu> <20180905205614.IMQuJDzT_%predragp@andrew.cmu.edu> <750EDAE5-46EE-43A4-ADED-2EF54345F12F@cs.cmu.edu>
Message-ID: <5168285b40e7421d9f489e39fd834fab@cmu.edu>

Here are my related env variables:

CUDA_HOME=/zfsauton/home/manzilz/local/cuda-9.0/
LD_LIBRARY_PATH=/zfsauton/home/manzilz/local/lib64:/zfsauton/home/manzilz/local/lib:/zfsauton/home/manzilz/local/cuda-9.0/lib64:/usr/local/cuda/lib64:
PATH=/zfsauton/home/manzilz/local/bin:/zfsauton/home/manzilz/.local/bin:/zfsauton/home/manzilz/local/cuda-9.0/bin:/usr/local/cuda/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
C_INCLUDE_PATH=/zfsauton/home/manzilz/local/include:

From: Biswajit Paria
Sent: Wednesday, September 05, 2018 5:29 PM
To: Yichong Xu
Cc: Biswajit Paria; eyolcu at cs.cmu.edu; Predrag Punosevac; Manzil Zaheer; users at autonlab.org
Subject: Re: PyTorch problem

If the CUDA examples work for anyone, can they share their PATH and LD_LIBRARY_PATH variables?

Thanks

On Wed, Sep 5, 2018 at 5:27 PM Yichong Xu wrote:

I think with Biswajit's and my problem with cuda, we should isolate the problem with just CUDA (and drivers) instead of wandering around python or pytorch.
Predrag can you test the CUDA examples? I sort of agree with Manzil that this might be a user account problem.
From bapoczos at cs.cmu.edu Thu Sep 6 11:20:10 2018
From: bapoczos at cs.cmu.edu (Barnabas Poczos)
Date: Thu, 6 Sep 2018 11:20:10 -0400
Subject: PyTorch problem
In-Reply-To: <5168285b40e7421d9f489e39fd834fab@cmu.edu>
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu> <7f7c51795a7f4a2398934b79bf5de592@cmu.edu> <20180905205614.IMQuJDzT_%predragp@andrew.cmu.edu> <750EDAE5-46EE-43A4-ADED-2EF54345F12F@cs.cmu.edu> <5168285b40e7421d9f489e39fd834fab@cmu.edu>
Message-ID:

Hi All,

I'm somewhat confused:

* Do I understand correctly that Manzil actually is using the CUDA libraries installed by himself (/zfsauton/home/manzilz/local/cuda-9.0/) and not the system libraries (/usr/local/cuda/lib64)?
* Since he is using different CUDA libraries, is that the reason that pytorch is working for him and not for the other users? If so, should we double check the system libraries?
* Do we know anyone who can use pytorch now with the CUDA system libraries? If so, those users please let us know your system env variables.
* As a quick solution, should we ask Manzil to copy his cuda libraries to a public place where others could access them?
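
As a rough aid for the first two questions, the sketch below lists which LD_LIBRARY_PATH entries contain a cuBLAS shared library, in search order, so the first hit hints at which copy an account would tend to pick up. It is only an illustration: `find_cublas` is a hypothetical helper name, the `libcublas.so*` file pattern is an assumption about how the libraries are named on these nodes, and it looks at LD_LIBRARY_PATH alone. The dynamic loader also honors RPATH/RUNPATH and the ld.so cache, so a conda-installed PyTorch may load a different copy than the one reported here.

```python
import glob
import os


def find_cublas(ld_library_path):
    """Return (directory, [library file names]) pairs, in search order,
    for every LD_LIBRARY_PATH entry that contains a libcublas file.

    Hypothetical helper: the 'libcublas.so*' pattern is an assumption.
    """
    hits = []
    for entry in ld_library_path.split(":"):
        if not entry:
            # Empty entries (e.g. from a trailing ':') are skipped here.
            continue
        pattern = os.path.join(glob.escape(entry), "libcublas.so*")
        libs = sorted(glob.glob(pattern))
        if libs:
            hits.append((entry, [os.path.basename(p) for p in libs]))
    return hits


if __name__ == "__main__":
    for directory, libs in find_cublas(os.environ.get("LD_LIBRARY_PATH", "")):
        print(directory, libs)
```

With the LD_LIBRARY_PATH quoted earlier in this thread, the home-directory cuda-9.0/lib64 entry would be listed before /usr/local/cuda/lib64, provided both directories actually contain a libcublas file.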
Best, Barnabas ====================== Barnabas Poczos, PhD Associate Professor Co-Director of PhD Program Machine Learning Department Carnegie Mellon University On Wed, Sep 5, 2018 at 5:33 PM Manzil Zaheer wrote: > > Here is my related env variables: > > > > CUDA_HOME=/zfsauton/home/manzilz/local/cuda-9.0/ > > LD_LIBRARY_PATH=/zfsauton/home/manzilz/local/lib64:/zfsauton/home/manzilz/local/lib:/zfsauton/home/manzilz/local/cuda-9.0/lib64:/usr/local/cuda/lib64: > > PATH=/zfsauton/home/manzilz/local/bin:/zfsauton/home/manzilz/.local/bin:/zfsauton/home/manzilz/local/cuda-9.0/bin:/usr/local/cuda/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin > > C_INCLUDE_PATH=/zfsauton/home/manzilz/local/include: > > > > From: Biswajit Paria > Sent: Wednesday, September 05, 2018 5:29 PM > To: Yichong Xu > Cc: Biswajit Paria ; eyolcu at cs.cmu.edu; Predrag Punosevac ; Manzil Zaheer ; users at autonlab.org > Subject: Re: PyTorch problem > > > > If the CUDA examples work for anyone, can they share their PATH and LD_LIBRARY_PATH variables? > > > > Thanks > > > > On Wed, Sep 5, 2018 at 5:27 PM Yichong Xu wrote: > > I think with Biswajit?s and my problem with cuda, we should isolate the problem with just CUDA (and drivers) instead of wandering around python or pytorch. > > Predrag can you test the CUDA examples? I sort of agree with Manzil that this might be a user account problem. > > > > Thanks, > > Yichong > > > > > > > > On Sep 5, 2018, at 5:14 PM, Biswajit Paria wrote: > > > > I just tried Yichong's way of testing cuBLAS, and get the same error as earlier: > > > > [Matrix Multiply CUBLAS] - Starting... > > GPU Device 0: "TITAN Xp" with compute capability 6.1 > > > > MatrixA(640,480), MatrixB(480,320), MatrixC(640,320) > > CUDA error at matrixMulCUBLAS.cpp:275 code=1(CUBLAS_STATUS_NOT_INITIALIZED) "cublasCreate(&handle)" > > > > So I believe it is not a conda error. I also tried removing .nv, doesn't help either. Maybe someone can share the PATH env variable? 
This post suggests that using sudo can resolve this problem, and this is probably because of some permission problems on CUBLAS libraries:
> > https://devtalk.nvidia.com/default/topic/1027602/cuda-setup-and-installation/cublas-libraries-with-incorrect-permissions/
> > @Predrag: Can you try running the simpleCUBLAS example from the CUDA library, with and without root privilege? I think that might be something that you are more familiar with. Thank you very much!
> >
> > Thanks,
> > Yichong
> >
> > On Sep 4, 2018, at 3:18 PM, Emre Yolcu wrote:
> >
> > Hi,
> >
> > We are trying to troubleshoot the PyTorch issue with Predrag and were wondering:
> >
> > Is anybody able to run PyTorch GPU models on gpu1-9? If you can, we would appreciate if you can respond.
> >
> > Also, is it a problem for anyone if gpu8 is rebooted today?
> >
> > Thanks,
> >
> > Emre
> >
> > --
> > Biswajit Paria
> > PhD in ML @ CMU
> >
> > --
> > Biswajit Paria
> > PhD in ML @ CMU
> >
> > --
> > Biswajit Paria
> > PhD in ML @ CMU

From jaylee at cs.cmu.edu Thu Sep 6 12:31:51 2018
From: jaylee at cs.cmu.edu (Jay Yoon Lee)
Date: Thu, 6 Sep 2018 12:31:51 -0400
Subject: PyTorch problem
In-Reply-To:
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu> <7f7c51795a7f4a2398934b79bf5de592@cmu.edu> <20180905205614.IMQuJDzT_%predragp@andrew.cmu.edu> <750EDAE5-46EE-43A4-ADED-2EF54345F12F@cs.cmu.edu> <5168285b40e7421d9f489e39fd834fab@cmu.edu>
Message-ID:

Hi,

So I actually have been using the CUDA from Manzil. While it did help in resolving other issues, the recent issue after the file system shutdown seems to happen with pytorch only (not with tensorflow). Using Manzil's CUDA (the old copy, and a fresh copy I got after the problem, just in case) did not resolve the problem.

The problem was only resolved after I went ahead and installed python and pip locally. With this experience, I suspect that the currently provided conda has some problem.
(Although the error messages indicate only CUDA errors.) Or maybe it was just a workaround, but this did fix the issue.

Cheers!
Jay-Yoon

On Thu, Sep 6, 2018, 11:29 AM Barnabas Poczos wrote:

> Hi All,
>
> I'm somewhat confused:
>
> * Do I understand correctly that Manzil actually is using the CUDA
> libraries installed by himself
> (/zfsauton/home/manzilz/local/cuda-9.0/) and not the system libraries
> (/usr/local/cuda/lib64) ?
> * Since he is using different CUDA libraries is that the reason that
> pytorch is working for him and not for the other users? If so, should
> we double check the system libraries?
> * Do we know anyone who can use pytorch now with the CUDA system
> libraries ? If so, those users please let us know your system env
> variables.
> * As a quick solution, should we ask Manzil to copy his cuda libraries
> to a public place where others could access them?
>
> Best,
> Barnabas
>
> ======================
> Barnabas Poczos, PhD
> Associate Professor
> Co-Director of PhD Program
> Machine Learning Department
> Carnegie Mellon University
> On Wed, Sep 5, 2018 at 5:33 PM Manzil Zaheer wrote:
> >
> > Here is my related env variables:
> >
> > CUDA_HOME=/zfsauton/home/manzilz/local/cuda-9.0/
> > LD_LIBRARY_PATH=/zfsauton/home/manzilz/local/lib64:/zfsauton/home/manzilz/local/lib:/zfsauton/home/manzilz/local/cuda-9.0/lib64:/usr/local/cuda/lib64:
> > PATH=/zfsauton/home/manzilz/local/bin:/zfsauton/home/manzilz/.local/bin:/zfsauton/home/manzilz/local/cuda-9.0/bin:/usr/local/cuda/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
> > C_INCLUDE_PATH=/zfsauton/home/manzilz/local/include:
> >
> > From: Biswajit Paria
> > Sent: Wednesday, September 05, 2018 5:29 PM
> > To: Yichong Xu
> > Cc: Biswajit Paria; eyolcu at cs.cmu.edu; Predrag Punosevac; Manzil Zaheer; users at autonlab.org
> > Subject: Re: PyTorch problem
> >
> > If the CUDA examples work for anyone, can they share their PATH and >
LD_LIBRARY_PATH variables? > > > > > > > > Thanks > > > > > > > > On Wed, Sep 5, 2018 at 5:27 PM Yichong Xu wrote: > > > > I think with Biswajit?s and my problem with cuda, we should isolate the > problem with just CUDA (and drivers) instead of wandering around python or > pytorch. > > > > Predrag can you test the CUDA examples? I sort of agree with Manzil that > this might be a user account problem. > > > > > > > > Thanks, > > > > Yichong > > > > > > > > > > > > > > > > On Sep 5, 2018, at 5:14 PM, Biswajit Paria wrote: > > > > > > > > I just tried Yichong's way of testing cuBLAS, and get the same error as > earlier: > > > > > > > > [Matrix Multiply CUBLAS] - Starting... > > > > GPU Device 0: "TITAN Xp" with compute capability 6.1 > > > > > > > > MatrixA(640,480), MatrixB(480,320), MatrixC(640,320) > > > > CUDA error at matrixMulCUBLAS.cpp:275 > code=1(CUBLAS_STATUS_NOT_INITIALIZED) "cublasCreate(&handle)" > > > > > > > > So I believe it is not a conda error. I also tried removing .nv, doesn't > help either. Maybe someone can share the PATH env variable? > > > > > > > > On Wed, Sep 5, 2018 at 5:08 PM Emre Yolcu wrote: > > > > Manzil, could you share your `conda env export` (or equivalent) output > for the environment you use for pytorch? It's still not working for me > after reboot, maybe I can try replicating your exact setup and try with > that. > > > > > > > > Thanks, > > > > > > > > Emre > > > > > > > > On Wed, Sep 5, 2018 at 4:56 PM, Predrag Punosevac < > predragp at andrew.cmu.edu> wrote: > > > > Manzil Zaheer wrote: > > > > > It was working me before reboot as well. PyTorch does work on all > > > nodes for me. > > > > Aha! Gotcha. > > > > > > > > I am trying to say is that i think it is not issue at system level but > > > at user account level. I might be wrong though. > > > > That was my hunch as well. They were trying to convince me in a 150 > > e-mails chain over the weekend that pytorch was broken when I replaced a > > failed HDD on the main file server. 
From yichongx at cs.cmu.edu Thu Sep 6 15:14:12 2018
From: yichongx at cs.cmu.edu (Yichong Xu)
Date: Thu, 6 Sep 2018 19:14:12 +0000
Subject: PyTorch problem
In-Reply-To:
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu>
 <7f7c51795a7f4a2398934b79bf5de592@cmu.edu>
 <20180905205614.IMQuJDzT_%predragp@andrew.cmu.edu>
 <750EDAE5-46EE-43A4-ADED-2EF54345F12F@cs.cmu.edu>
 <5168285b40e7421d9f489e39fd834fab@cmu.edu>
Message-ID:

1. I think yes. Biswajit and I cannot use the system cuda libraries.
2. I think yes as well. Predrag said he can run matlab with cuda well (probably with root access), so I think there should be some problem with the privilege setting of system libraries. We do not have root access on our accounts.
3. Not yet so far.
4. That can be a solution. Maybe we have a public access library as Jay-Yoon did and that can work for us.

Also for gpu8 - I just reinstalled pytorch again on scratch of gpu8 and it still does not work. I'm making the cuda libraries right now and trying to see if it works.

Thanks,
Yichong

On Sep 6, 2018, at 11:20 AM, Barnabas Poczos wrote:

Hi All,

I'm somewhat confused:

* Do I understand correctly that Manzil actually is using the CUDA libraries installed by himself (/zfsauton/home/manzilz/local/cuda-9.0/) and not the system libraries (/usr/local/cuda/lib64)?
* Since he is using different CUDA libraries, is that the reason that pytorch is working for him and not for the other users? If so, should we double check the system libraries?
* Do we know anyone who can use pytorch now with the CUDA system libraries? If so, those users please let us know your system env variables.
* As a quick solution, should we ask Manzil to copy his cuda libraries to a public place where others could access them?

Best,
Barnabas

======================
Barnabas Poczos, PhD
Associate Professor
Co-Director of PhD Program
Machine Learning Department
Carnegie Mellon University

On Wed, Sep 5, 2018 at 5:33 PM Manzil Zaheer wrote:

Here is my related env variables:

CUDA_HOME=/zfsauton/home/manzilz/local/cuda-9.0/
LD_LIBRARY_PATH=/zfsauton/home/manzilz/local/lib64:/zfsauton/home/manzilz/local/lib:/zfsauton/home/manzilz/local/cuda-9.0/lib64:/usr/local/cuda/lib64:
PATH=/zfsauton/home/manzilz/local/bin:/zfsauton/home/manzilz/.local/bin:/zfsauton/home/manzilz/local/cuda-9.0/bin:/usr/local/cuda/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
C_INCLUDE_PATH=/zfsauton/home/manzilz/local/include:

From: Biswajit Paria
Sent: Wednesday, September 05, 2018 5:29 PM
To: Yichong Xu
Cc: Biswajit Paria; eyolcu at cs.cmu.edu; Predrag Punosevac; Manzil Zaheer; users at autonlab.org
Subject: Re: PyTorch problem

If the CUDA examples work for anyone, can they share their PATH and LD_LIBRARY_PATH variables?

Thanks

On Wed, Sep 5, 2018 at 5:27 PM Yichong Xu wrote:

I think with Biswajit's and my problem with cuda, we should isolate the problem with just CUDA (and drivers) instead of wandering around python or pytorch.
Predrag can you test the CUDA examples? I sort of agree with Manzil that this might be a user account problem.

Thanks,
Yichong

On Sep 5, 2018, at 5:14 PM, Biswajit Paria wrote:

I just tried Yichong's way of testing cuBLAS, and get the same error as earlier:

[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "TITAN Xp" with compute capability 6.1

MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
CUDA error at matrixMulCUBLAS.cpp:275 code=1(CUBLAS_STATUS_NOT_INITIALIZED) "cublasCreate(&handle)"

So I believe it is not a conda error. I also tried removing .nv, doesn't help either. Maybe someone can share the PATH env variable?

On Wed, Sep 5, 2018 at 5:08 PM Emre Yolcu wrote:

Manzil, could you share your `conda env export` (or equivalent) output for the environment you use for pytorch? It's still not working for me after reboot, maybe I can try replicating your exact setup and try with that.

Thanks,
Emre

On Wed, Sep 5, 2018 at 4:56 PM, Predrag Punosevac wrote:

Manzil Zaheer wrote:
> It was working for me before reboot as well. PyTorch does work on all nodes for me.

Aha! Gotcha.

> What I am trying to say is that I think it is not an issue at system level but at user account level. I might be wrong though.

That was my hunch as well. They were trying to convince me in a 150 e-mails chain over the weekend that pytorch was broken when I replaced a failed HDD on the main file server. That didn't make any sense.

Could you please share your binaries and setup with other pytorch users?

Cheers,
Predrag

-------- Original message --------
From: Predrag Punosevac
Date: 9/5/18 4:44 PM (GMT-05:00)
To: Manzil Zaheer
Cc: Biswajit Paria, Yichong Xu, Emre Yolcu, users at autonlab.org
Subject: Re: PyTorch problem

Should I go ahead and reboot all GPU computing nodes? Can somebody else confirm that a reboot fixes the issue?

Predrag

On Wed, Sep 5, 2018 at 4:42 PM, Manzil Zaheer wrote:

It does work for me and my friends

-------- Original message --------
From: Predrag Punosevac
Date: 9/5/18 4:40 PM (GMT-05:00)
To: Biswajit Paria
Cc: Manzil Zaheer, Yichong Xu, Emre Yolcu, users at autonlab.org
Subject: Re: PyTorch problem

I just rebooted GPU8. All packages are up to date. NVidia driver appears to be working properly and I can do GPU computations from MATLAB. Let's try now to get pytorch working on GPU8.

Predrag

On Wed, Sep 5, 2018 at 12:19 AM, Biswajit Paria wrote:

I am facing a similar error on all GPU machines. Did someone find a solution yet?

2018-09-05 00:27:41.546064: E tensorflow/stream_executor/cuda/cuda_blas.cc:459] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED

On Tue, Sep 4, 2018 at 10:03 PM Manzil Zaheer wrote:

Hi Yichong

Yes I am able to run TF and PyTorch on these machines. Recently someone else also had similar issue, but it got fixed by reinstalling some local packages.

Thanks,
Manzil

-------- Original message --------
From: Yichong Xu
Date: 9/4/18 9:58 PM (GMT-05:00)
To: Emre Yolcu, Predrag Punosevac
Cc: users at autonlab.org
Subject: Re: PyTorch problem

Just wondering - can Tensorflow run well on these machines? I hope someone to confirm about this so that we can isolate the problem.

OK so here's a further test: I tried running the cuda examples from the cuda installation (in /usr/local/cuda/sample), on gpu2 in my scratch directory. Simple jobs like deviceQuery succeeds, but simpleCUBLAS failed:

yichongx at gpu2$ cd /home/scratch/yichongx/
yichongx at gpu2$ cd
0_Simple/     2_Graphics/  4_Finance/      6_Advanced/       bin/     conda/
1_Utilities/  3_Imaging/   5_Simulations/  7_CUDALibraries/  common/  miniconda3/
yichongx at gpu2$ cd 7_CUDALibraries/
yichongx at gpu2$ cd simpleCUBLAS
yichongx at gpu2$ CUDA_VISIBLE_DEVICES=3 ./simpleCUBLAS
GPU Device 0: "TITAN X (Pascal)" with compute capability 6.1

simpleCUBLAS test running..
!!!! CUBLAS initialization error
yichongx at gpu2$

This is also consistent with our previous errors from pytorch, which say cublas library not initialized.

So this means at least there is some problem with CUBLAS on gpu2. This post suggests that using sudo can resolve this problem, and this is probably because of some permission problems on CUBLAS libraries:
https://devtalk.nvidia.com/default/topic/1027602/cuda-setup-and-installation/cublas-libraries-with-incorrect-permissions/

@Predrag: Can you try running the simpleCUBLAS example from the CUDA library, with and without root privilege? I think that might be something that you are more familiar with. Thank you very much!

Thanks,
Yichong

On Sep 4, 2018, at 3:18 PM, Emre Yolcu wrote:

Hi,

We are trying to troubleshoot the PyTorch issue with Predrag and were wondering:

Is anybody able to run PyTorch GPU models on gpu1-9? If you can, we would appreciate if you can respond.

Also, is it a problem for anyone if gpu8 is rebooted today?

Thanks,
Emre

--
Biswajit Paria
PhD in ML @ CMU

From eyolcu at cs.cmu.edu Thu Sep 6 20:46:57 2018
From: eyolcu at cs.cmu.edu (Emre Yolcu)
Date: Thu, 6 Sep 2018 20:46:57 -0400
Subject: PyTorch problem
In-Reply-To:
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu>
 <7f7c51795a7f4a2398934b79bf5de592@cmu.edu>
 <20180905205614.IMQuJDzT_%predragp@andrew.cmu.edu>
 <750EDAE5-46EE-43A4-ADED-2EF54345F12F@cs.cmu.edu>
 <5168285b40e7421d9f489e39fd834fab@cmu.edu>
Message-ID:

I think I got it. If I'm not mistaken NFS is the root of all our problems in this thread. Can anyone having problems try doing the equivalent of `export CUDA_CACHE_PATH=/home/scratch/eyolcu/computecache` (replacing eyolcu with your andrew id) and try everything again? This seems to fix it for me.
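
The same workaround can be applied from inside a Python session, provided it runs before CUDA is first initialized. What follows is only a minimal sketch, assuming Python 3 and the node-local /home/scratch/<user>/ layout mentioned in this thread. The helper name use_local_cuda_cache is hypothetical, and the "computecache" directory name is simply copied from the export command above. Whether it clears the cuBLAS error on a particular node has not been verified here.

```python
import getpass
import os


def use_local_cuda_cache(scratch_root="/home/scratch", user=None):
    """Point the CUDA compute cache at node-local scratch instead of the NFS home.

    Hypothetical helper: the default root and the "computecache" name follow
    the export command quoted in this thread, nothing more official than that.
    """
    user = user or getpass.getuser()
    cache_dir = os.path.join(scratch_root, user, "computecache")
    os.makedirs(cache_dir, exist_ok=True)
    # The variable has to be in the environment before the first CUDA call.
    os.environ["CUDA_CACHE_PATH"] = cache_dir
    return cache_dir
```

Calling use_local_cuda_cache() at the top of a script, before importing torch or tensorflow, would be the natural place for it. Putting the export line in a shell startup file achieves the same thing for every program.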

On Thu, Sep 6, 2018 at 3:28 PM, Yichong Xu wrote:

> Hi Predrag,
> I just tested the simpleCUBLAS sample in cuda library. It still does not work for me with the same error:
> GPU Device 0: "TITAN Xp" with compute capability 6.1
>
> simpleCUBLAS test running..
> !!!! CUBLAS initialization error
>
> I'm not sure where exactly the access problem is, but here is what I get from ls -all:
> yichongx at gpu8$ ls -all
> total 2136200
> drwxr-xr-x. 3 sheath sheath     8192 May 31 15:16 .
> drwxr-xr-x. 4 root   root         32 Sep  2  2017 ..
> lrwxrwxrwx. 1 root   root         18 Mar 13 13:05 libaccinj64.so -> libaccinj64.so.9.0
> lrwxrwxrwx. 1 root   root         22 Mar 13 13:05 libaccinj64.so.9.0 -> libaccinj64.so.9.0.176
> -rwxr-xr-x. 1 root   root    6858944 Sep  2  2017 libaccinj64.so.9.0.176
> -rw-r--r--. 1 root   root   71952010 Dec 19  2017 libcublas_device.a
> lrwxrwxrwx. 1 root   root         16 Mar 13 13:04 libcublas.so -> libcublas.so.9.0
> lrwxrwxrwx. 1 root   root         20 Mar 13 13:04 libcublas.so.9.0 -> libcublas.so.9.0.282
> -rwxr-xr-x. 1 root   root   52590576 Dec 19  2017 libcublas.so.9.0.176
> -rwxr-xr-x. 1 root   root   55781312 Dec 19  2017 libcublas.so.9.0.282
> -rw-r--r--. 1 root   root   62813620 Dec 19  2017 libcublas_static.a
>
> Thanks,
> Yichong
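
The questions in this thread about which CUDA libraries a user actually picks up, and whether they are readable, can be checked mechanically from LD_LIBRARY_PATH. The sketch below is illustrative only: find_cublas is a hypothetical helper, not part of CUDA or PyTorch, and it assumes a POSIX system with Python 3. It walks the directories in search order, resolves symlinks such as libcublas.so -> libcublas.so.9.0, and reports whether the current user can read each file. A conda-installed PyTorch may load CUDA libraries from its own environment instead, so this mainly applies to binaries like the CUDA samples.

```python
import glob
import os


def find_cublas(ld_library_path=None, pattern="libcublas.so*"):
    """List cuBLAS candidates per LD_LIBRARY_PATH directory, in search order.

    Returns (directory, resolved_path, readable) tuples. Illustrative helper,
    not part of any CUDA or PyTorch tooling.
    """
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")
    found = []
    for directory in ld_library_path.split(":"):
        if not directory:
            continue  # empty entries (e.g. from a trailing ':') are skipped here
        seen = set()
        for candidate in sorted(glob.glob(os.path.join(directory, pattern))):
            resolved = os.path.realpath(candidate)  # follow symlink chains
            if resolved in seen:
                continue  # several names can point at the same file
            seen.add(resolved)
            found.append((directory, resolved, os.access(resolved, os.R_OK)))
    return found
```

Run with no arguments, it reads the current environment. With Manzil's variables, a private cuda-9.0/lib64 entry listed before /usr/local/cuda/lib64 would show up first in the result, which is the distinction Barnabas asks about.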

From predragp at andrew.cmu.edu Fri Sep 7 13:52:42 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Fri, 7 Sep 2018 13:52:42 -0400
Subject: bash.autonlab.org
Message-ID:

Dear Autonians,

bash appears to be down. It is not; I just needed to use that machine to access our NREC infrastructure quickly. When a VPN tunnel is created between bash and NREC, ssh login is no longer possible. Please give me about 30 minutes.

Predrag

From bparia at cs.cmu.edu Fri Sep 7 14:19:07 2018
From: bparia at cs.cmu.edu (Biswajit Paria)
Date: Fri, 7 Sep 2018 14:19:07 -0400
Subject: PyTorch problem
In-Reply-To:
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu>
 <7f7c51795a7f4a2398934b79bf5de592@cmu.edu>
 <20180905205614.IMQuJDzT_%predragp@andrew.cmu.edu>
 <750EDAE5-46EE-43A4-ADED-2EF54345F12F@cs.cmu.edu>
 <5168285b40e7421d9f489e39fd834fab@cmu.edu>
Message-ID:

Thanks Emre! This does resolve it for me.

--
Biswajit Paria
PhD in ML @ CMU

From awd at cs.cmu.edu Fri Sep 7 16:53:28 2018
From: awd at cs.cmu.edu (Artur Dubrawski)
Date: Fri, 7 Sep 2018 16:53:28 -0400
Subject: Fwd: [AI Seminar] AI Seminar sponsored by Apple -- Matt Barnes -- Sep. 11th
In-Reply-To:
References:
Message-ID:

This is really worth attending especially if you have missed Matt's thesis proposal.

Cheers
Artur

---------- Forwarded message ---------
From: Han Zhao
Date: Fri, Sep 7, 2018 at 3:04 PM
Subject: [AI Seminar] AI Seminar sponsored by Apple -- Matt Barnes -- Sep. 11th
To:

Dear faculty and students:

We look forward to seeing you next Tuesday, Sep. 11th, at noon in *GHC 6115* for AI Seminar sponsored by Apple. To learn more about the seminar series, please visit the website.

On Tuesday, Matt Barnes will give the following talk:

Title: Learning with Clusters: A cardinal machine learning sin and how to correct for it

Abstract: As machine learning systems become increasingly complex, clustering has evolved from an exploratory data analysis tool into an integrated component of computer vision, robotics, medical and census data pipelines. Currently, as with many machine learning systems, the output of the clustering algorithm is taken as ground truth at the next pipeline step.
We show this false assumption causes subtle and dangerous behavior for even the simplest systems -- sometimes biasing results by upwards of 25%. We provide the first empirical and theoretical study of this phenomenon which we term dependency leakage. Further, we introduce fixes in the form of estimators and methods to both quantify and correct for clustering errors' impacts on downstream learners. Our work is agnostic to the downstream learners, and requires few assumptions on the clustering algorithm. Empirical results demonstrate our approach improves these machine learning systems compared to naive approaches, which do not account for clustering errors.

This talk is based on the following two papers:
http://auai.org/uai2017/proceedings/papers/87.pdf
https://arxiv.org/abs/1807.06713

--
Han Zhao
Machine Learning Department
School of Computer Science
Carnegie Mellon University
Mobile: +1-412-652-4404

From yichongx at cs.cmu.edu Fri Sep 7 17:12:48 2018
From: yichongx at cs.cmu.edu (Yichong Xu)
Date: Fri, 7 Sep 2018 21:12:48 +0000
Subject: PyTorch problem
In-Reply-To:
References: <442569FC-5DE7-4F67-BDCA-7A7A1902EAF0@andrew.cmu.edu>
 <7f7c51795a7f4a2398934b79bf5de592@cmu.edu>
 <20180905205614.IMQuJDzT_%predragp@andrew.cmu.edu>
 <750EDAE5-46EE-43A4-ADED-2EF54345F12F@cs.cmu.edu>
 <5168285b40e7421d9f489e39fd834fab@cmu.edu>
Message-ID: <5058F3D1-1A48-43EE-8A25-01876549427D@cs.cmu.edu>

Thank you very much Emre! That's so clever, and (almost) completely resolves my issue. I only have a small error when using h5py:

  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 78, in h5py.h5f.open
OSError: Unable to open file (unable to lock file, errno = 5, error message = 'Input/output error')

I resolved the problem by moving the file to scratch space.
I think the new disk possibly has some small problems (either performance or permissions), and that's causing the problem.

Thanks,
Yichong

On Sep 6, 2018, at 8:46 PM, Emre Yolcu wrote:

I think I got it. If I'm not mistaken NFS is the root of all our problems in this thread. Can anyone having problems try doing the equivalent of `export CUDA_CACHE_PATH=/home/scratch/eyolcu/computecache` (replacing eyolcu with your andrew id) and try everything again? This seems to fix it for me.

On Thu, Sep 6, 2018 at 3:28 PM, Yichong Xu wrote:

Hi Predrag,

I just tested the simpleCUBLAS sample in the CUDA library. It still does not work for me, with the same error:

GPU Device 0: "TITAN Xp" with compute capability 6.1
simpleCUBLAS test running..
!!!! CUBLAS initialization error

I'm not sure where exactly the access problem is, but here is what I get from ls -all:

yichongx at gpu8$ ls -all
total 2136200
drwxr-xr-x. 3 sheath sheath     8192 May 31 15:16 .
drwxr-xr-x. 4 root   root         32 Sep  2  2017 ..
lrwxrwxrwx. 1 root   root         18 Mar 13 13:05 libaccinj64.so -> libaccinj64.so.9.0
lrwxrwxrwx. 1 root   root         22 Mar 13 13:05 libaccinj64.so.9.0 -> libaccinj64.so.9.0.176
-rwxr-xr-x. 1 root   root    6858944 Sep  2  2017 libaccinj64.so.9.0.176
-rw-r--r--. 1 root   root   71952010 Dec 19  2017 libcublas_device.a
lrwxrwxrwx. 1 root   root         16 Mar 13 13:04 libcublas.so -> libcublas.so.9.0
lrwxrwxrwx. 1 root   root         20 Mar 13 13:04 libcublas.so.9.0 -> libcublas.so.9.0.282
-rwxr-xr-x. 1 root   root   52590576 Dec 19  2017 libcublas.so.9.0.176
-rwxr-xr-x. 1 root   root   55781312 Dec 19  2017 libcublas.so.9.0.282
-rw-r--r--. 1 root   root   62813620 Dec 19  2017 libcublas_static.a

Thanks,
Yichong

On Sep 6, 2018, at 3:14 PM, Yichong Xu wrote:

1. I think yes. Biswajit and I cannot use the system cuda libraries.
2. I think yes as well. Predrag said he can run matlab with cuda well (probably with root access), so I think there should be some problem with the privilege setting of system libraries. We do not have root access on our accounts.
3.
Not yet so far.
4. That can be a solution. Maybe we can have a public access library as Jay-Yoon did, and that can work for us.

Also for gpu8 - I just reinstalled pytorch again on scratch of gpu8 and it still does not work. I'm making the cuda libraries right now and trying to see if it works.

Thanks,
Yichong

On Sep 6, 2018, at 11:20 AM, Barnabas Poczos wrote:

Hi All,

I'm somewhat confused:

* Do I understand correctly that Manzil actually is using the CUDA libraries installed by himself (/zfsauton/home/manzilz/local/cuda-9.0/) and not the system libraries (/usr/local/cuda/lib64)?
* Since he is using different CUDA libraries, is that the reason that pytorch is working for him and not for the other users? If so, should we double check the system libraries?
* Do we know anyone who can use pytorch now with the CUDA system libraries? If so, those users please let us know your system env variables.
* As a quick solution, should we ask Manzil to copy his cuda libraries to a public place where others could access them?

Best,
Barnabas
======================
Barnabas Poczos, PhD
Associate Professor
Co-Director of PhD Program
Machine Learning Department
Carnegie Mellon University

On Wed, Sep 5, 2018 at 5:33 PM Manzil Zaheer wrote:

Here are my related env variables:

CUDA_HOME=/zfsauton/home/manzilz/local/cuda-9.0/
LD_LIBRARY_PATH=/zfsauton/home/manzilz/local/lib64:/zfsauton/home/manzilz/local/lib:/zfsauton/home/manzilz/local/cuda-9.0/lib64:/usr/local/cuda/lib64:
PATH=/zfsauton/home/manzilz/local/bin:/zfsauton/home/manzilz/.local/bin:/zfsauton/home/manzilz/local/cuda-9.0/bin:/usr/local/cuda/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
C_INCLUDE_PATH=/zfsauton/home/manzilz/local/include:

From: Biswajit Paria
Sent: Wednesday, September 05, 2018 5:29 PM
To: Yichong Xu
Cc: Biswajit Paria; eyolcu at cs.cmu.edu; Predrag Punosevac; Manzil Zaheer; users at autonlab.org
Subject: Re: PyTorch problem

If the CUDA examples work for anyone, can they share their PATH and LD_LIBRARY_PATH variables?

Thanks

On Wed, Sep 5, 2018 at 5:27 PM Yichong Xu wrote:

I think with Biswajit's and my problem with cuda, we should isolate the problem to just CUDA (and drivers) instead of wandering around python or pytorch. Predrag, can you test the CUDA examples? I sort of agree with Manzil that this might be a user account problem.

Thanks,
Yichong

On Sep 5, 2018, at 5:14 PM, Biswajit Paria wrote:

I just tried Yichong's way of testing cuBLAS, and get the same error as earlier:

[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "TITAN Xp" with compute capability 6.1
MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
CUDA error at matrixMulCUBLAS.cpp:275 code=1(CUBLAS_STATUS_NOT_INITIALIZED) "cublasCreate(&handle)"

So I believe it is not a conda error. I also tried removing .nv, which doesn't help either. Maybe someone can share the PATH env variable?
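
For anyone comparing setups as requested above, a small Python sketch along these lines could gather the relevant variables in one place. It also applies the CUDA_CACHE_PATH workaround quoted earlier in this message, pointing the cache at node-local scratch. The function names are hypothetical, and the `/home/scratch/<user>/computecache` layout is an assumption based on the path in Emre's message. The variable should be set before CUDA is initialised in the process (for example before the first `.cuda()` call).

```python
import getpass
import os

# Variables people in this thread were asked to share, plus the cache path.
CUDA_ENV_VARS = ("CUDA_HOME", "CUDA_CACHE_PATH", "LD_LIBRARY_PATH", "PATH")


def use_scratch_cuda_cache(scratch_root="/home/scratch", user=None,
                           environ=os.environ, create=False):
    """Set CUDA_CACHE_PATH to <scratch_root>/<user>/computecache and return it.

    The directory layout is an assumption taken from the path quoted in the thread.
    """
    user = user or getpass.getuser()
    cache_dir = os.path.join(scratch_root, user, "computecache")
    if create:
        os.makedirs(cache_dir, exist_ok=True)
    environ["CUDA_CACHE_PATH"] = cache_dir
    return cache_dir


def cuda_env_report(environ=os.environ):
    """Return one 'NAME=value' line per variable; unset ones show '<unset>'."""
    return "\n".join(
        f"{name}={environ.get(name, '<unset>')}" for name in CUDA_ENV_VARS
    )
```

Calling `use_scratch_cuda_cache()` at the very top of a training script and then printing `cuda_env_report()` gives output that can be pasted into a reply for comparison.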

On Wed, Sep 5, 2018 at 5:08 PM Emre Yolcu wrote:

Manzil, could you share your `conda env export` (or equivalent) output for the environment you use for pytorch? It's still not working for me after reboot, maybe I can try replicating your exact setup and try with that.

Thanks,
Emre

On Wed, Sep 5, 2018 at 4:56 PM, Predrag Punosevac wrote:

Manzil Zaheer wrote:

> It was working for me before reboot as well. PyTorch does work on all nodes for me.

Aha! Gotcha.

> What I am trying to say is that I think it is not an issue at the system level but at the user account level. I might be wrong though.

That was my hunch as well. They were trying to convince me in a 150-e-mail chain over the weekend that pytorch was broken when I replaced a failed HDD on the main file server. That didn't make any sense. Could you please share your binaries and setup with other pytorch users?

Cheers,
Predrag

-------- Original message --------
From: Predrag Punosevac
Date: 9/5/18 4:44 PM (GMT-05:00)
To: Manzil Zaheer
Cc: Biswajit Paria, Yichong Xu, Emre Yolcu, users at autonlab.org
Subject: Re: PyTorch problem

Should I go ahead and reboot all GPU computing nodes? Can somebody else confirm that a reboot fixes the issue?

Predrag

On Wed, Sep 5, 2018 at 4:42 PM, Manzil Zaheer wrote:

It does work for me and my friends

-------- Original message --------
From: Predrag Punosevac
Date: 9/5/18 4:40 PM (GMT-05:00)
To: Biswajit Paria
Cc: Manzil Zaheer, Yichong Xu, Emre Yolcu, users at autonlab.org
Subject: Re: PyTorch problem

I just rebooted GPU8. All packages are up to date. NVidia driver appears to be working properly and I can do GPU computations from MATLAB. Let's try now to get pytorch working on GPU8.

Predrag

On Wed, Sep 5, 2018 at 12:19 AM, Biswajit Paria wrote:

I am facing a similar error on all GPU machines. Did someone find a solution yet?

2018-09-05 00:27:41.546064: E tensorflow/stream_executor/cuda/cuda_blas.cc:459] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED

On Tue, Sep 4, 2018 at 10:03 PM Manzil Zaheer wrote:

Hi Yichong

Yes, I am able to run TF and PyTorch on these machines. Recently someone else also had a similar issue, but it got fixed by reinstalling some local packages.

Thanks,
Manzil

-------- Original message --------
From: Yichong Xu
Date: 9/4/18 9:58 PM (GMT-05:00)
To: Emre Yolcu, Predrag Punosevac
Cc: users at autonlab.org
Subject: Re: PyTorch problem

Just wondering - can Tensorflow run well on these machines? I hope someone can confirm this so that we can isolate the problem.

OK so here's a further test: I tried running the cuda examples from the cuda installation (in /usr/local/cuda/sample), on gpu2 in my scratch directory. Simple jobs like deviceQuery succeed, but simpleCUBLAS failed:

yichongx at gpu2$ cd /home/scratch/yichongx/
yichongx at gpu2$ cd
0_Simple/     2_Graphics/  4_Finance/      6_Advanced/       bin/     conda/
1_Utilities/  3_Imaging/   5_Simulations/  7_CUDALibraries/  common/  miniconda3/
yichongx at gpu2$ cd 7_CUDALibraries/
yichongx at gpu2$ cd simpleCUBLAS
yichongx at gpu2$ CUDA_VISIBLE_DEVICES=3 ./simpleCUBLAS
GPU Device 0: "TITAN X (Pascal)" with compute capability 6.1
simpleCUBLAS test running..
!!!! CUBLAS initialization error
yichongx at gpu2$

This is also consistent with our previous errors from pytorch, which say the cublas library is not initialized. So this means at least there is some problem with CUBLAS on gpu2. This post suggests that using sudo can resolve this problem, and this is probably because of some permission problems on CUBLAS libraries:
https://devtalk.nvidia.com/default/topic/1027602/cuda-setup-and-installation/cublas-libraries-with-incorrect-permissions/

@Predrag: Can you try running the simpleCUBLAS example from the CUDA library, with and without root privilege? I think that might be something that you are more familiar with.
Thank you very much!

Thanks,
Yichong

On Sep 4, 2018, at 3:18 PM, Emre Yolcu wrote:

Hi,

We are trying to troubleshoot the PyTorch issue with Predrag and were wondering:

Is anybody able to run PyTorch GPU models on gpu1-9? If you can, we would appreciate if you can respond.

Also, is it a problem for anyone if gpu8 is rebooted today?

Thanks,
Emre

--
Biswajit Paria
PhD in ML @ CMU

From predragp at andrew.cmu.edu Thu Sep 13 17:51:01 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Thu, 13 Sep 2018 17:51:01 -0400
Subject: git down
In-Reply-To:
References:
Message-ID: <20180913215101.dvIt7vkKv%predragp@andrew.cmu.edu>

Dan Howarth wrote:

> Hello,
>
> I'm having the same issue as I did before when trying to login to git.int
> ... it says wrong username / password
>
> -Dan

Works as expected. I just checked. Are you sure that you were using the LDAP option and not the local option? I did notice that ldapd was dying after the last round of security patches

https://www.openbsd.org/errata63.html

I do restart ldapd via cron every 15 minutes just to make sure the thing is running. You must have had really bad timing.

Cheers,
Predrag

P.S. Please use your CMU for work related communication.

From predragp at andrew.cmu.edu Tue Sep 18 13:56:35 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Tue, 18 Sep 2018 13:56:35 -0400
Subject: HDD replaced
Message-ID: <20180918175635.WnuUi_U99%predragp@andrew.cmu.edu>

Dear Autonians,

You probably noticed that your home directories were unavailable for a short period of time. I had to replace a dead HDD. Please see below.

root at uranus:~ # smartctl -l selftest /dev/da18
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-RELEASE-p3 amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                         Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: electrical failure     10%        28199         3403696
# 2  Short offline       Completed: electrical failure     10%        28175         3403696
# 3  Short offline       Completed: electrical failure     10%        28151         3403696
# 4  Extended offline    Completed: electrical failure     90%        28128         3403696
# 5  Short offline       Completed: electrical failure     10%        28127         3403696
# 6  Short offline       Completed: electrical failure     10%        28103         3403696
# 7  Short offline       Completed: electrical failure     30%        28079         3403696
# 8  Short offline       Completed: electrical failure     10%        28056         3403696
# 9  Short offline       Completed: electrical failure     30%        28031         3403696
#10  Short offline       Completed: electrical failure     10%        28008         3403696
#11  Short offline       Completed: electrical failure     30%        27983         3403696
#12  Extended offline    Completed: electrical failure     90%        27961         3403696
#13  Short offline       Completed: electrical failure     10%        27960         3403696
#14  Short offline       Completed: electrical failure     30%        27935         3403696
#15  Short offline       Completed: electrical failure     10%        27911         3403696
#16  Short offline       Completed: electrical failure     10%        27888         3403696
#17  Short offline       Completed without error           00%        27863         -
#18  Short offline       Completed without error           00%        27839         -
#19  Short offline       Completed without error           00%        27815         -
#20  Extended offline    Completed: read failure           40%        27796         -
#21  Short offline       Completed without error           00%        27791         -

Cheers,
Predrag

From predragp at andrew.cmu.edu Wed Sep 19 11:07:08 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Wed, 19 Sep 2018 11:07:08 -0400
Subject: Fwd: Download MATLAB & Simulink Release 2018b
Message-ID: <20180919150708.7JHzqb859%predragp@andrew.cmu.edu>

Dear Autonians,

It
is that time of the year when MATLAB has to be upgraded due to licensing issues. I will try to do it quietly over the next week or so. If you notice that your MATLAB installation suddenly stopped working with a nasty message about an expired license, the reason is the 2018b release.

Best,
Predrag

P.S. The priority in upgrading will be given to servers of course, and then to big MATLAB users like Robert, Dan, Kyle ...

-------- Original Message --------
From: "MathWorks"
To: predragp at andrew.cmu.edu
Date: 19 Sep 2018 01:15:46 -0400
Subject: Download MATLAB & Simulink Release 2018b

Dear Predrag Punosevac,

You're invited to download MathWorks Release 2018b.

R2018b delivers new features in MATLAB and Simulink, two new products, and updates to all other products. Highlights include:

Deep Learning
Edit networks using Deep Learning designer app, visualize using network analyzer, automated video labeling, export models to ONNX, and deploy to NVIDIA, Intel, and ARM processors.

Smart Editing in Simulink
Create new block ports with a click and edit block parameters on the icon.

5G Toolbox
A new product for simulating, analyzing, and testing the physical layer of 5G communications systems.

Sensor Fusion and Tracking Toolbox
A new product for designing and simulating multisensor tracking and navigation systems.

As part of your MathWorks Software Maintenance Service subscription, you are eligible for new product features and offerings delivered through general releases twice per year. Download your products now from the License Center.

For more information about R2018b, see the release highlights and videos and release notes.

Sincerely,
MathWorks Customer Service Team
mathworks.com/contact

© 2018 The MathWorks, Inc. MATLAB and Simulink are registered trademarks of The MathWorks, Inc. See a list of additional trademarks.
Other product or brand names may be trademarks or registered trademarks of their respective holders.

You are subscribed as predragp at andrew.cmu.edu
The MathWorks, Inc. - 3 Apple Hill Drive, Natick, MA 01760 - 508-647-7000

From jieshic at andrew.cmu.edu Thu Sep 20 10:32:37 2018
From: jieshic at andrew.cmu.edu (Chen Jieshi)
Date: Thu, 20 Sep 2018 10:32:37 -0400
Subject: *** Auton Lab's 25th Annual Picnic: Sunday, October 7th at Schenley Park ***
Message-ID: <70D7E5A2-54D9-4607-8C1D-EE868038CDDD@andrew.cmu.edu>

Dear Autonians,

We would like to invite you and your family to celebrate the 25th birthday of the Auton Lab. Pls save the date for our annual lab picnic at Vietnam Veterans Pavilion in Schenley Park on Sunday, October 7th.

Pls RSVP through the web form below so that we could plan resources properly.

https://goo.gl/forms/HysbH5sndcs4bZjr1

Looking forward to seeing all of you!

Best,
Jessie

Jieshi (Jessie) Chen
Senior Research Analyst
Auton Lab, Robotics Institute
Carnegie Mellon University
Newell-Simon Hall, Room 3123
5000 Forbes Ave, Pittsburgh, PA 15213

From kirthevasankandasamy at gmail.com Mon Sep 24 23:04:42 2018
From: kirthevasankandasamy at gmail.com (Kirthevasan Kandasamy)
Date: Mon, 24 Sep 2018 23:04:42 -0400
Subject: Fwd: Thesis Defense - Oct. 3, 2018 - Kirthevasan Kandasamy - Tuning Hyper-parameters without Grad-students: Scaling up Bandit Optimisation
In-Reply-To:
References:
Message-ID:

Hi everyone,

I am defending next Wednesday. You are welcome to drop by.

Thanks!
Samy

---------- Forwarded message ---------
From: Diane Stidle
Date: Mon, Sep 24, 2018 at 4:19 PM
Subject: Thesis Defense - Oct.
3, 2018 - Kirthevasan Kandasamy - Tuning Hyper-parameters without Grad-students: Scaling up Bandit Optimisation
To: ml-seminar at cs.cmu.edu, zoubin at eng.cam.ac.uk

*Thesis Defense*

Date: October 3, 2018
Time: 12:30pm (EDT)
Place: 8102 GHC
PhD Candidate: Kirthevasan Kandasamy

*Title: Tuning Hyper-parameters without Grad-students: Scaling up Bandit Optimisation*

Abstract:

This thesis explores scalable methods for adaptive decision making under uncertainty, where the goal of an agent is to design an experiment, observe the outcome, and plan subsequent experiments to achieve a desired goal. Typically, each experiment incurs a large computational or economic cost, and we need to keep the number of experiments to a minimum. Many such problems fall under the bandit framework, where each experiment evaluates a noisy function and the goal is to find the optimum of this function. A common use case for the bandit framework, pervasive in many industrial and scientific applications, is hyper-parameter tuning, where we need to find the optimal configuration of a black-box system by tuning the several knobs which affect the performance of the system. Some applications include statistical model selection, materials design, optimal policy selection in robotics, and maximum likelihood inference in simulation based scientific models. More generally, bandits are but one class of problems studied under the umbrella of adaptive decision-making under uncertainty. Problems such as active learning and design of experiments are other examples of adaptive decision-making, but unlike bandits, progress towards a desired goal is not made known to the agent via a reward signal.

With increasingly expensive function evaluations and demands to optimise over complex input spaces, bandit optimisation tasks face new challenges today. At the same time, there are new opportunities that have not been exploited previously.
We study the following questions in this thesis to enable the application of bandits and more broadly adaptive decision-making to modern applications.

- Conventional bandit methods work reliably in low dimensional settings, but scale poorly with input dimensionality. Scaling such methods to high dimensional inputs requires addressing several computational and statistical challenges.
- In many applications, an expensive function can be cheaply approximated. We study techniques that can use information from these cheap lower fidelity approximations to speed up the overall optimisation process.
- Conventional bandit methods are inherently sequential. We study parallelisation techniques so as to deploy several function evaluations at the same time.
- Typical methods assume that a design can be characterised by a Euclidean vector. We study bandit methods on graph-structured spaces. As a specific application, we study neural architecture search, which optimises for the structure of the neural network by viewing it as a directed graph with node labels and node weights.
- Many methods for adaptive decision-making are not competitive with human experts. Incorporating domain knowledge and human intuition about specific problems may significantly improve practical performance.

We first study the above topics in the bandit framework and then study how they can be extended to broader decision-making problems. We develop methods with theoretical guarantees which simultaneously enjoy good empirical performance. As part of this thesis, we also develop an open source platform for scalable and robust bandit optimisation.

Thesis Committee:
Barnabás Póczos (Co-chair)
Jeff Schneider (Co-chair)
Aarti Singh
Zoubin Ghahramani (University of Cambridge)

*Link to draft document:* http://www.cs.cmu.edu/~kkandasa/docs/thesis.pdf

--
Diane Stidle
Graduate Programs Manager
Machine Learning Department
Carnegie Mellon University
stidle at cmu.edu
412-268-1299

From mbarnes1 at andrew.cmu.edu Tue Sep 25 17:09:53 2018
From: mbarnes1 at andrew.cmu.edu (Matthew Barnes)
Date: Tue, 25 Sep 2018 17:09:53 -0400
Subject: MATLAB starts huge number of processes
Message-ID:

Starting a simple instance of matlab on the servers creates about 50 processes. The old tricks (OMP_NUM_THREADS=1, MKL_NUM_THREADS=1, NUMEXPR_NUM_THREADS=1) don't seem to work.

Anyone else have this issue and found a solution?

Thanks!
Matt

From boecking at andrew.cmu.edu Tue Sep 25 17:19:55 2018
From: boecking at andrew.cmu.edu (Benedikt Boecking)
Date: Tue, 25 Sep 2018 17:19:55 -0400
Subject: MATLAB starts huge number of processes
In-Reply-To:
References:
Message-ID: <055F7404-79C2-413E-957B-E2D29FCCF037@andrew.cmu.edu>

I don't think Matlab uses Open MP or Intel MKL. You can call maxNumCompThreads(N) in Matlab and should be able to set the number of threads like that. Or start Matlab with the -singleCompThread option.

Let us know if that changes the behavior. Maybe there are some global settings for maxNumCompThreads that Predrag can change.

> On Sep 25, 2018, at 5:09 PM, Matthew Barnes wrote:
>
> Starting a simple instance of matlab on the servers creates about 50 processes. The old tricks (OMP_NUM_THREADS=1,MKL_NUM_THREADS=1,NUMEXPR_NUM_THREADS=1) don't seem to work.
>
> Anyone else have this issue and found a solution?
>
> Thanks!
> Matt
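
The two suggestions in the reply above (`-singleCompThread` at startup, or `maxNumCompThreads(N)` inside the session) could be wrapped in a small Python launcher for batch jobs, sketched below. The helper names are hypothetical, `matlab` is assumed to be on PATH, and whether this actually reduces the process count Matt observed is not confirmed anywhere in the thread.

```python
import subprocess


def matlab_command(statement=None, single_thread=True, executable="matlab"):
    """Build an argv list for starting MATLAB without a display.

    With single_thread=True the -singleCompThread startup option is added.
    """
    cmd = [executable, "-nodisplay", "-nosplash"]
    if single_thread:
        cmd.append("-singleCompThread")
    if statement is not None:
        # -r runs the given MATLAB statement after startup.
        cmd += ["-r", statement]
    return cmd


def run_matlab(statement, **kwargs):
    """Run MATLAB with the command built above; needs matlab on PATH."""
    return subprocess.run(matlab_command(statement, **kwargs), check=True)
```

For example, `run_matlab("my_script; exit")` starts a single-threaded session, while `run_matlab("maxNumCompThreads(4); my_script; exit", single_thread=False)` caps the thread count from inside MATLAB instead; `my_script` is a placeholder.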

From awd at cs.cmu.edu Wed Sep 26 10:21:57 2018
From: awd at cs.cmu.edu (Artur Dubrawski)
Date: Wed, 26 Sep 2018 10:21:57 -0400
Subject: Fwd: *** Auton Lab's 25th Annual Picnic: Sunday, October 7th at Schenley Park ***
In-Reply-To: <70D7E5A2-54D9-4607-8C1D-EE868038CDDD@andrew.cmu.edu>
References: <70D7E5A2-54D9-4607-8C1D-EE868038CDDD@andrew.cmu.edu>
Message-ID:

A reminder to please RSVP if you have not done so yet. See you all at the picnic!

Artur

---------- Forwarded message ---------
From: Chen Jieshi
Date: Thu, Sep 20, 2018 at 10:33 AM
Subject: *** Auton Lab's 25th Annual Picnic: Sunday, October 7th at Schenley Park ***
To:

Dear Autonians,

We would like to invite you and your family to celebrate the *25th* birthday of the Auton Lab. Pls save the date for our annual lab picnic at *Vietnam Veterans Pavilion in Schenley Park* on *Sunday, October 7th*.

Pls RSVP through the web form below so that we could plan resources properly.

https://goo.gl/forms/HysbH5sndcs4bZjr1

Looking forward to seeing all of you!

Best,
Jessie

Jieshi (Jessie) Chen
Senior Research Analyst
Auton Lab, Robotics Institute
Carnegie Mellon University
Newell-Simon Hall, Room 3123
5000 Forbes Ave, Pittsburgh, PA 15213

From bparia at cs.cmu.edu Fri Sep 28 18:21:10 2018
From: bparia at cs.cmu.edu (Biswajit Paria)
Date: Fri, 28 Sep 2018 18:21:10 -0400
Subject: Horovod/openmp error on GPU nodes
Message-ID:

Hi,

I am facing the following error when trying to run horovod (which uses openmp) with tensorflow on the gpu nodes. What is interesting is that the error is not permanent. My code runs fine for some time, and then the errors start appearing, after which I have to shift to a new GPU node. I suspect this is again related to NFS and permissions, like the previous GPU issue.

Please let me know if you have a solution to this. Thanks.

--------------------------------------------------------------------------------------------------------------
/zfsauton/home/bparia/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
[gpu6.int.autonlab.org:17401] PMIX ERROR: OUT-OF-RESOURCE in file gds_dstore.c at line 1178
[gpu6.int.autonlab.org:17401] PMIX ERROR: OUT-OF-RESOURCE in file gds_dstore.c at line 1313
[gpu6.int.autonlab.org:17401] PMIX ERROR: OUT-OF-RESOURCE in file gds_dstore.c at line 2331
[gpu6.int.autonlab.org:17401] PMIX ERROR: OUT-OF-RESOURCE in file gds_dstore.c at line 3148
[gpu6.int.autonlab.org:17401] PMIX ERROR: OUT-OF-RESOURCE in file gds_dstore.c at line 3180
[gpu6.int.autonlab.org:17401] PMIX ERROR: OUT-OF-RESOURCE in file server/pmix_server.c at line 2151
[gpu6.int.autonlab.org:17406] PMIX ERROR: OUT-OF-RESOURCE in file client/pmix_client.c at line 228
[gpu6.int.autonlab.org:17406] OPAL ERROR: Error in file pmix2x_client.c at line 109
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[gpu6.int.autonlab.org:17406] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated.
The first process to do so was:

  Process name: [[30728,1],0]
  Exit code:    1
--------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------

Biswajit Paria
PhD student
MLD CMU
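
One thing that might be worth ruling out for the intermittent PMIX OUT-OF-RESOURCE errors above is a full temporary or shared-memory filesystem on the affected node. This is an assumption, not something established in the message, which itself suspects NFS and permissions. A minimal Python sketch that reports free space on the usual candidates (the choice of directories and the helper name are hypothetical):

```python
import os
import shutil
import tempfile


def free_space_report(paths=None):
    """Return {path: free_bytes} for each listed path that is an existing directory.

    By default checks the current temp directory (honours TMPDIR) and /dev/shm.
    """
    if paths is None:
        paths = [tempfile.gettempdir(), "/dev/shm"]
    report = {}
    for path in paths:
        if os.path.isdir(path):
            report[path] = shutil.disk_usage(path).free
    return report
```

Running this on a node just before and just after the errors begin would show whether free space drops to zero, which is the only question the sketch is meant to answer.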