<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
<meta content="text/html; charset=utf-8">
</head>
<body class="" style="word-wrap:break-word; line-break:after-white-space">
Hi Yichong 
<div><br>
</div>
<div>Yes, I am able to run TF and PyTorch on these machines. Someone else recently had a similar issue, but it was fixed by reinstalling some local packages.</div>
<div><br>
</div>
<div>Thanks,</div>
<div>Manzil </div>
<div><br>
</div>
<div><br>
</div>
<div>-------- Original message --------</div>
<div>From: Yichong Xu &lt;yichongx@cs.cmu.edu&gt; </div>
<div>Date: 9/4/18 9:58 PM (GMT-05:00) </div>
<div>To: Emre Yolcu &lt;eyolcu@cs.cmu.edu&gt;, Predrag Punosevac &lt;predragp@andrew.cmu.edu&gt;
</div>
<div>Cc: users@autonlab.org </div>
<div>Subject: Re: PyTorch problem </div>
<div><br>
</div>
<div>
<div class="">Just wondering: can TensorFlow run well on these machines? I hope someone can confirm this so that we can isolate the problem.</div>
OK, so here’s a further test: I tried running the CUDA examples from the CUDA installation (in /usr/local/cuda/sample) on gpu2, in my scratch directory. Simple jobs like deviceQuery succeed, but simpleCUBLAS fails:
<div class="">
<div class="" style="margin:0px; font-size:11px; line-height:normal; font-family:Menlo">
<span class="" style="">yichongx@gpu2$ cd /home/scratch/yichongx/</span></div>
<div class="" style="margin:0px; font-size:11px; line-height:normal; font-family:Menlo">
<span class="" style="">yichongx@gpu2$ cd </span></div>
<div class="" style="margin:0px; font-size:11px; line-height:normal; font-family:Menlo">
<span class="" style="">0_Simple/        2_Graphics/      4_Finance/       6_Advanced/      bin/             conda/           </span></div>
<div class="" style="margin:0px; font-size:11px; line-height:normal; font-family:Menlo">
<span class="" style="">1_Utilities/     3_Imaging/       5_Simulations/   7_CUDALibraries/ common/          miniconda3/      </span></div>
<div class="" style="margin:0px; font-size:11px; line-height:normal; font-family:Menlo">
<span class="" style="">yichongx@gpu2$ cd 7_CUDALibraries/</span></div>
<div class="" style="margin:0px; font-size:11px; line-height:normal; font-family:Menlo">
<span class="" style="">yichongx@gpu2$ cd simpleCUBLAS</span></div>
<div class="" style="margin:0px; font-size:11px; line-height:normal; font-family:Menlo">
<span class="" style="">yichongx@gpu2$ CUDA_VISIBLE_DEVICES=3 ./simpleCUBLAS</span></div>
<div class="" style="margin:0px; font-size:11px; line-height:normal; font-family:Menlo">
<span class="" style="">GPU Device 0: "TITAN X (Pascal)" with compute capability 6.1</span></div>
<div class="" style="margin:0px; font-size:11px; line-height:normal; font-family:Menlo; min-height:13px">
<span class="" style=""></span><br class="">
</div>
<div class="" style="margin:0px; font-size:11px; line-height:normal; font-family:Menlo">
<span class="" style="">simpleCUBLAS test running..</span></div>
<div class="" style="margin:0px; font-size:11px; line-height:normal; font-family:Menlo">
<span class="" style="">!!!! CUBLAS initialization error</span></div>
<div class="" style="margin:0px; font-size:11px; line-height:normal; font-family:Menlo">
<span class="" style="">yichongx@gpu2$ </span></div>
<div class="">
<div class=""><br class="">
<div class="" style="color:rgb(0,0,0); font-family:Helvetica; font-size:12px; font-style:normal; font-weight:normal; letter-spacing:normal; orphans:auto; text-align:start; text-indent:0px; text-transform:none; white-space:normal; widows:auto; word-spacing:0px">
<br class="">
</div>
</div>
<div>This is also consistent with our previous errors from PyTorch, which say that the CUBLAS library is not initialized.</div>
<div><br class="">
</div>
<div>So this means there is at least some problem with CUBLAS on gpu2. The post below suggests that running with sudo can resolve the problem, which would point to a permission problem on the CUBLAS libraries:</div>
<div><a href="https://devtalk.nvidia.com/default/topic/1027602/cuda-setup-and-installation/cublas-libraries-with-incorrect-permissions/" class="">https://devtalk.nvidia.com/default/topic/1027602/cuda-setup-and-installation/cublas-libraries-with-incorrect-permissions/</a></div>
<div>@Predrag: Can you try running the simpleCUBLAS example from the CUDA samples, both with and without root privileges? I think that is something you are more familiar with. Thank you very much!</div>
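<div>To check the permission hypothesis without root, a minimal sketch along these lines might help. It only lists CUBLAS library files that are not world-readable. The directory /usr/local/cuda/lib64, the pattern libcublas*, and the function name find_restricted_libs are assumptions for illustration and have not been verified on gpu2.</div>

```python
import glob
import os
import stat


def find_restricted_libs(lib_dir, pattern="libcublas*"):
    """Return (path, octal mode) pairs for matching files that lack world read permission."""
    restricted = []
    for path in sorted(glob.glob(os.path.join(lib_dir, pattern))):
        # isfile follows symlinks, so broken links and directories are skipped.
        if not os.path.isfile(path):
            continue
        mode = os.stat(path).st_mode  # mode of the real file behind any symlink
        if not mode & stat.S_IROTH:
            restricted.append((path, oct(stat.S_IMODE(mode))))
    return restricted


if __name__ == "__main__":
    # Assumed default location; adjust to the actual CUDA install on the machine.
    for path, mode in find_restricted_libs("/usr/local/cuda/lib64"):
        print(mode, path)
```

<div>If this prints any files, a non-root user probably cannot load them, which would fit the sudo workaround described in the post. An empty result does not rule out other causes.</div>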
<div><br class="">
</div>
<div><br class="">
</div>
<div>
<div><i class="">Thanks,</i></div>
<div><i class="">Yichong</i></div>
<div><i class=""><br class="">
</i></div>
<blockquote type="cite" class="">
<div class="">On Sep 4, 2018, at 3:18 PM, Emre Yolcu &lt;<a href="mailto:eyolcu@cs.cmu.edu" class="">eyolcu@cs.cmu.edu</a>&gt; wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div dir="ltr" class="">
<div class="">Hi,</div>
<div class=""><br class="">
</div>
<div class="">We are trying to troubleshoot the PyTorch issue with Predrag and were wondering:</div>
<div class=""><br class="">
</div>
<div class="">Is anybody able to run PyTorch GPU models on gpu1-9? If you can, we would appreciate it if you could respond.</div>
<div class=""><br class="">
</div>
<div class="">Also, is it a problem for anyone if gpu8 is rebooted today?</div>
<div class=""><br class="">
</div>
<div class="">Thanks,</div>
<div class=""><br class="">
</div>
<div class="">Emre<br class="">
</div>
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</div>
</div>
</body>
</html>