From chiragn at cs.cmu.edu Sun Mar 4 23:27:45 2018
From: chiragn at cs.cmu.edu (Chirag Nagpal)
Date: Sun, 4 Mar 2018 23:27:45 -0500
Subject: numpy configuration
Message-ID:

Hi all!

I need help with configuring numpy with OpenBLAS. Specifically, I'm using numpy on lov5, but matrix operations seem to use just one core. Explicitly linking numpy to OpenBLAS should alleviate this. I need pointers on where OpenBLAS is located in the OS and on the correct way to link it.

Thanks

Chirag

--
*Chirag Nagpal*
Graduate Student, Language Technologies Institute
School of Computer Science
Carnegie Mellon University
cs.cmu.edu/~chiragn
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
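For anyone hitting the same single-core behavior: a quick way to check which BLAS numpy was actually built against, and whether OpenBLAS threading kicks in, is sketched below (the interpreter path and the thread count are assumptions; adjust them to your own setup):

    # look for openblas_info in the output
    python -c 'import numpy; numpy.show_config()'

    # OpenBLAS takes its thread count from the environment
    OPENBLAS_NUM_THREADS=8 python -c 'import numpy as np; a = np.random.rand(4000, 4000); print((a @ a).sum())'

If show_config() only reports a generic blas_info, numpy was built against the reference BLAS and will stay single-threaded no matter what the environment says.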
From predragp at andrew.cmu.edu Tue Mar 6 17:12:18 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Tue, 06 Mar 2018 17:12:18 -0500
Subject: GPU2 hard-rebooted due to ...
Message-ID: <20180306221218.iPoMnuEmR%predragp@andrew.cmu.edu>

Dear Autonians,

Somebody ran GPU2 into the ground. Memory including swap was 100% loaded. I had to hard-reboot the machine. I am fixing it right now. It should not take more than 30 minutes. Please don't start anything until the machine is fully ready.

Predrag

From ngisolfi at cmu.edu Thu Mar 8 09:37:52 2018
From: ngisolfi at cmu.edu (Nick Gisolfi)
Date: Thu, 8 Mar 2018 09:37:52 -0500
Subject: [hackAuton] weekend scheduling and shirt sizes
Message-ID:

Hi Everyone,

We will need all hands on deck to help participants on the weekend of the hackAuton, April 6-8. Please plan on participating. I have two links I need everyone to fill out...

Link 1 (volunteer sign ups and choose your event t-shirt size): https://hackauton.com/volunteer

Link 2 (specify times you can be available...select as many as possible): https://doodle.com/poll/nkgerss2dsc9bxmn

I will create a formal schedule once everyone signs up and I get a better picture of the number of participants we will have. We have 30 participants registered at the moment!

- Nick
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From awd at cs.cmu.edu Sat Mar 10 13:48:36 2018
From: awd at cs.cmu.edu (Artur Dubrawski)
Date: Sat, 10 Mar 2018 13:48:36 -0500
Subject: as we've just celebrated the International Women's Day - check this out :)
Message-ID: <066a486a-2495-b8ec-e79d-0d125ccd9ce6@cs.cmu.edu>

https://www.girlboss.com/girlboss/2018/3/7/female-tech-founders?lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3B3cCJCEwsTPmW3EE4pD5gRA%3D%3D

Happy (belated) Women's Day to all our cherished Women Autonians!

Artur

From bpatra at andrew.cmu.edu Wed Mar 14 15:58:17 2018
From: bpatra at andrew.cmu.edu (Barun Patra)
Date: Wed, 14 Mar 2018 15:58:17 -0400
Subject: Failed to initialize NVML: Driver/library version mismatch
Message-ID:

Hi,

Are any of you facing the same issue?

Failed to initialize NVML: Driver/library version mismatch

The last time this issue occurred, I think rebooting fixed the issue.

Thanks for the help!
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
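The "Failed to initialize NVML: Driver/library version mismatch" error generally means the nvidia kernel module still loaded in memory is older than the user-space driver library on disk, which is typical right after a driver upgrade; a reboot (or a module reload) clears it. A quick check, as a sketch:

    # driver version of the kernel module currently loaded
    cat /proc/driver/nvidia/version

    # driver version of the module installed on disk
    modinfo nvidia | grep '^version'

    # if the two differ, the node needs a reboot to clear the NVML mismatch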
From sheath at andrew.cmu.edu Thu Mar 15 13:47:35 2018
From: sheath at andrew.cmu.edu (Simon Heath)
Date: Thu, 15 Mar 2018 13:47:35 -0400
Subject: Failed to initialize NVML: Driver/library version mismatch
In-Reply-To:
References:
Message-ID:

I'll happily help you with this but I need more information. Which GPU node are you on? What are you trying to do? When did this problem start? Is there an easy way I can reproduce it for troubleshooting?

Thanks,
Simon

On Wed, Mar 14, 2018 at 3:58 PM, Barun Patra wrote:

> Hi,
> Are any of you facing the same issue?
> Failed to initialize NVML: Driver/library version mismatch
>
> The last time this issue occurred, I think rebooting fixed the issue.
>
> Thanks for the help!
>

--
Simon Heath, Research Programmer and Analyst
Robotics Institute - Auton Lab
Carnegie Mellon University
sheath at andrew.cmu.edu
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ngisolfi at cs.cmu.edu Mon Mar 19 12:05:03 2018
From: ngisolfi at cs.cmu.edu (Nick Gisolfi)
Date: Mon, 19 Mar 2018 12:05:03 -0400
Subject: [hackAuton] Please sign up for PSC account
Message-ID:

Hi Everyone,

The Pittsburgh Supercomputing Center is supplying the computational power for our hackAuton! Please create an account (http://portal.xsede.org) to help test the environment, and reply to me (not the entire list) if you do decide to make an account so I can add you to the proper allocation (takes about 24-48 hours).

We have access to both CPU and GPU nodes. Our job now is to make sure that common software libraries are installed and operating smoothly before the hackAuton. PSC doesn't have a lot of collaborations with AI folks, so we may need to ask them to add a few modules to their servers. Try running a few small experiments and please help identify what needs to be added/fixed.

Thanks!

- Nick
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
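PSC systems manage installed software through environment modules, so checking what is already there is a one-liner. A sketch (the module names are assumptions; they vary from system to system):

    module avail 2>&1 | grep -i -e cuda -e python -e tensorflow
    module load cuda        # hypothetical module name; use whatever 'module avail' lists
    module list             # confirm what is loaded in the current shell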
From ngisolfi at cs.cmu.edu Mon Mar 19 12:10:46 2018
From: ngisolfi at cs.cmu.edu (Nick Gisolfi)
Date: Mon, 19 Mar 2018 12:10:46 -0400
Subject: [hackAuton] Need more volunteers April 7&8
Message-ID:

Hi Everyone,

https://doodle.com/poll/nkgerss2dsc9bxmn

We need a few more volunteers for the weekend of the hackAuton. Right now we have 11 people signed up (thank you!!) but we do not have enough volunteers for Saturday and Sunday. Right now there are 46 registered participants. I estimate this number will grow significantly (close to 100) once email announcements go out today. We want to have Autonians present in full force at the event to show the depth of our lab. Please sign up to help out if you have not already.

Thank you!

- Nick
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From awd at cs.cmu.edu Tue Mar 20 09:50:02 2018
From: awd at cs.cmu.edu (Artur Dubrawski)
Date: Tue, 20 Mar 2018 09:50:02 -0400
Subject: Auton Lab postdoc candidate job talk: Wednesday March 28, 11am, NSH 4119
Message-ID:

Team,

We will have a Skype presentation on Wednesday next week given by Bo Wu of the Chinese Academy of Sciences, who is seeking a post-doctoral position with the Auton Lab. Please see below the title/abstract and bio of the speaker, and please join us for the talk!

Cheers
Artur

---

Title: Temporal Learning and Prediction

Abstract: Time-aware scenarios are ubiquitous, and temporal learning and prediction are motivated by a wide range of applications that depend on dynamic platforms or systems (e.g., diffusion in marketing, pricing in ads). In domains as diverse as consumption, finance, entertainment, and transportation, we observe a fundamental shift away from discrete, infrequent data to nearly continuous monitoring and recording. Temporal modeling of dynamic signals, behaviors, and information is therefore a novel and prevalent research topic. Meanwhile, as an important platform where users share and spread information at any time, social media offers a good opportunity to study temporal social signals such as post popularity and user interests over time. We treat future popularity prediction as our research problem, and our work investigates temporal learning and prediction techniques for sequential and time-series data. Unlike previous prediction algorithms, our work studies multiple temporal-view prediction problems for social media popularity, comprising dynamic factorization prediction, specific-time prediction, and time-series prediction. From the inner to the sequential, and from the implicit to the explicit, these approaches progressively model the influence of previous user sharing behaviors on future popularity. Moreover, we show that temporal learning and prediction are effective, as evaluated by experiments on social media popularity prediction with a large dataset, and we are also trying to apply the proposed temporal modeling approaches to other problems.

Short Bio: Bo Wu received his Ph.D. from the Chinese Academy of Sciences (Institute of Computing Technology), Beijing, China. His current research interests are temporal machine learning, deep learning, computer vision, and social multimedia. He has over two years of research experience at Microsoft Research Asia and one year at Academia Sinica. He has authored several papers at top conferences and in journals (ACM MM, AAAI, IJCAI, TKDE, etc.), and has been invited as a reviewer or TPC member for IEEE TKDE, IEEE TMM, ACM Multimedia, SIGIR, ICIP, etc. He is a co-organizer of the ACM Multimedia Challenge 2017. He has received several awards, including the Turing 50th Student Scholarship, an Innovation Research Award, a Ph.D. Student Research Award, and Top 1% in the Global Recommendation Challenge.

From ngisolfi at cs.cmu.edu Thu Mar 22 13:13:37 2018
From: ngisolfi at cs.cmu.edu (Nick Gisolfi)
Date: Thu, 22 Mar 2018 13:13:37 -0400
Subject: [hackAuton] Cookies in NSH 3111
Message-ID:

Hi All,

There are chocolate chip cookies in NSH 3111!

- Nick
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From predragp at andrew.cmu.edu Thu Mar 22 13:51:58 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Thu, 22 Mar 2018 13:51:58 -0400
Subject: Driver/library version mismatch on gpu nodes
In-Reply-To:
References:
Message-ID: <20180322175158.IaQkucHzu%predragp@andrew.cmu.edu>

Michael Andrews wrote:

> Hi Predrag,
>
> There seems to be a driver/library mismatch on some of the gpu nodes (e.g.
> gpu3, gpu4):
>
> $ nvidia-smi
> Failed to initialize NVML: Driver/library version mismatch
>

Unfortunately the machines will have to be rebooted to clear that. I will do it today at 5:00 PM.

Predrag

> Could you have a look when you get a chance?
>
> Thanks,
> Michael

From mbandrews at cmu.edu Thu Mar 22 15:47:08 2018
From: mbandrews at cmu.edu (Michael Andrews)
Date: Thu, 22 Mar 2018 15:47:08 -0400
Subject: Driver/library version mismatch on gpu nodes
In-Reply-To: <20180322175158.IaQkucHzu%predragp@andrew.cmu.edu>
References: <20180322175158.IaQkucHzu%predragp@andrew.cmu.edu>
Message-ID:

Thanks!
Michael

On Thu, Mar 22, 2018 at 1:51 PM, Predrag Punosevac wrote:

> Michael Andrews wrote:
>
> > Hi Predrag,
> >
> > There seems to be a driver/library mismatch on some of the gpu nodes
> > (e.g. gpu3, gpu4):
> >
> > $ nvidia-smi
> > Failed to initialize NVML: Driver/library version mismatch
> >
>
> Unfortunately the machines will have to be rebooted to clear that. I
> will do it today at 5:00 PM.
>
> Predrag
>
> > Could you have a look when you get a chance?
> >
> > Thanks,
> > Michael
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From predragp at andrew.cmu.edu Thu Mar 22 19:22:21 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Thu, 22 Mar 2018 19:22:21 -0400
Subject: cudnn 7.0
In-Reply-To:
References:
Message-ID: <20180322232221.icsLz4INc%predragp@andrew.cmu.edu>

Matt Barnes wrote:

> Reminder to change the symlink on the GPUs
>
> /usr/local/cuda-9.0/lib64/libcudnn.so.7
>
> to point to libcudnn.so.7.0.5

Thank you for the reminder. I will look into this when I get back home tonight.

Predrag

From predragp at andrew.cmu.edu Thu Mar 22 22:13:34 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Thu, 22 Mar 2018 22:13:34 -0400
Subject: cudnn 7.0
In-Reply-To:
References:
Message-ID: <20180323021334.31W2vmrKf%predragp@andrew.cmu.edu>

Matt Barnes wrote:

> Reminder to change the symlink on the GPUs
>
> /usr/local/cuda-9.0/lib64/libcudnn.so.7
>
> to point to libcudnn.so.7.0.5

Simon,

Can you explain to me what is happening here? Why do these symbolic links have to be set manually to the older version of cudnn?

Predrag

root at gpu1$ cd /usr/local/cuda-9.0/lib64
root at gpu1$ ls -l libcudnn.so.7
lrwxrwxrwx. 1 root root 17 Mar 22 18:06 libcudnn.so.7 -> libcudnn.so.7.1.1

root at gpu2$ ls -l libcudnn.so.7
lrwxrwxrwx. 1 sheath sheath 17 Nov 16 23:41 libcudnn.so.7 -> libcudnn.so.7.0.5

root at gpu3$ ls -l libcudnn.so.7
lrwxrwxrwx 1 sheath sheath 17 Nov 16 23:41 libcudnn.so.7 -> libcudnn.so.7.0.5

root at gpu4$ ls -l libcudnn.so.7
lrwxrwxrwx. 1 sheath sheath 17 Nov 16 23:41 libcudnn.so.7 -> libcudnn.so.7.0.5

root at gpu5$ ls -l libcudnn.so.7
lrwxrwxrwx. 1 root root 17 Mar 22 18:01 libcudnn.so.7 -> libcudnn.so.7.1.1

root at gpu6$ ls -l libcudnn.so.7
lrwxrwxrwx. 1 root root 17 Mar 22 18:00 libcudnn.so.7 -> libcudnn.so.7.1.1

root at gpu8$ ls -l libcudnn.so.7
lrwxrwxrwx. 1 root root 17 Mar 22 17:57 libcudnn.so.7 -> libcudnn.so.7.1.1

root at gpu9$ ls -l libcudnn.so.7
lrwxrwxrwx. 1 root root 17 Mar 22 17:53 libcudnn.so.7 -> libcudnn.so.7.1.1
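For reference, the fix being requested above amounts to repointing one symlink. A sketch (run as root on each affected node):

    cd /usr/local/cuda-9.0/lib64
    ln -sfn libcudnn.so.7.0.5 libcudnn.so.7    # -f replaces the existing link, -n operates on the link itself
    ls -l libcudnn.so.7                        # verify it now points at 7.0.5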
From mbandrews at cmu.edu Fri Mar 23 08:54:23 2018
From: mbandrews at cmu.edu (Michael Andrews)
Date: Fri, 23 Mar 2018 08:54:23 -0400
Subject: Fwd: Driver/library version mismatch on gpu nodes
In-Reply-To: <20180322232130.YbbVZl8sw%predragp@andrew.cmu.edu>
References: <20180322232130.YbbVZl8sw%predragp@andrew.cmu.edu>
Message-ID:

Forwarding for Predrag:

---------- Forwarded message ----------
From: Predrag Punosevac
Date: Thu, Mar 22, 2018 at 7:21 PM
Subject: Re: Driver/library version mismatch on gpu nodes
To: mbandrews at cmu.edu

Dear Autonians,

This turned out to be a little bigger job than originally anticipated. This is a summary of what has been done:

gpu[1-9], with the exception of GPU7 which is used to serve a client, have been upgraded to the latest 3.10.0-693.21.1.el7 kernel, including all packages.

nvidia-smi works as expected on all of those machines now.

devtool-4 tools were replaced with devtool-6, which means that on all those machines you have gcc 6.

miniconda3 (Python 3.6.3) is in /opt/miniconda3.

All these machines now have /opt/rh-git29.

MATLAB had to be removed from GPU1 and GPU2 due to a space issue. I will be reinstalling it tomorrow in a different location. As a bonus you will get MATLAB R2018a, which I am not planning to install on other servers unless requested (I will be waiting for R2018b).

For some reason MATLAB no longer works with GPUs on servers GPU5 and GPU9. Those two servers will get the same treatment as GPU1 and GPU2, and I will install the latest version of MATLAB in order to fix the problem (IIRC GPU8 and GPU9 are still only available to designated users).

Finally, Tensorflow is possibly broken due to a symlink. Please see my next e-mail. I plan to fix this later tonight.

Predrag
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From predragp at andrew.cmu.edu Fri Mar 23 09:43:27 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Fri, 23 Mar 2018 09:43:27 -0400
Subject: Fwd: Driver/library version mismatch on gpu nodes
In-Reply-To:
References: <20180322232130.YbbVZl8sw%predragp@andrew.cmu.edu>
Message-ID: <20180323134327.OQFs8bwM3%predragp@andrew.cmu.edu>

Michael Andrews wrote:

> Forwarding for Predrag:

Why did you forward me my own message?

Predrag

> ---------- Forwarded message ----------
> From: Predrag Punosevac
> Date: Thu, Mar 22, 2018 at 7:21 PM
> Subject: Re: Driver/library version mismatch on gpu nodes
> To: mbandrews at cmu.edu
>
> Dear Autonians,
>
> This turned out to be a little bigger job than originally anticipated.
> This is a summary of what has been done:
>
> gpu[1-9], with the exception of GPU7 which is used to serve a client,
> have been upgraded to the latest 3.10.0-693.21.1.el7 kernel, including
> all packages.
>
> nvidia-smi works as expected on all of those machines now.
>
> devtool-4 tools were replaced with devtool-6, which means that on all
> those machines you have gcc 6.
>
> miniconda3 (Python 3.6.3) is in /opt/miniconda3.
>
> All these machines now have /opt/rh-git29.
>
> MATLAB had to be removed from GPU1 and GPU2 due to a space issue. I
> will be reinstalling it tomorrow in a different location. As a bonus
> you will get MATLAB R2018a, which I am not planning to install on other
> servers unless requested (I will be waiting for R2018b).
>
> For some reason MATLAB no longer works with GPUs on servers GPU5 and
> GPU9. Those two servers will get the same treatment as GPU1 and GPU2,
> and I will install the latest version of MATLAB in order to fix the
> problem (IIRC GPU8 and GPU9 are still only available to designated
> users).
>
> Finally, Tensorflow is possibly broken due to a symlink. Please see my
> next e-mail. I plan to fix this later tonight.
>
> Predrag

From mbarnes1 at andrew.cmu.edu Fri Mar 23 10:38:54 2018
From: mbarnes1 at andrew.cmu.edu (Matthew Barnes)
Date: Fri, 23 Mar 2018 14:38:54 +0000
Subject: cudnn 7.0
In-Reply-To: <20180323021334.31W2vmrKf%predragp@andrew.cmu.edu>
References: <20180323021334.31W2vmrKf%predragp@andrew.cmu.edu>
Message-ID:

I've installed my own versions of CUDA and cuDNN. So things are working for me, but this is still going to be an issue for everyone else in the lab.

On Thu, Mar 22, 2018 at 10:13 PM Predrag Punosevac wrote:

> Matt Barnes wrote:
>
> > Reminder to change the symlink on the GPUs
> >
> > /usr/local/cuda-9.0/lib64/libcudnn.so.7
> >
> > to point to libcudnn.so.7.0.5
>
> Simon,
>
> Can you explain to me what is happening here? Why do these symbolic
> links have to be set manually to the older version of cudnn?
>
> Predrag
>
> root at gpu1$ cd /usr/local/cuda-9.0/lib64
> root at gpu1$ ls -l libcudnn.so.7
> lrwxrwxrwx. 1 root root 17 Mar 22 18:06 libcudnn.so.7 -> libcudnn.so.7.1.1
>
> root at gpu2$ ls -l libcudnn.so.7
> lrwxrwxrwx. 1 sheath sheath 17 Nov 16 23:41 libcudnn.so.7 -> libcudnn.so.7.0.5
> root at gpu3$ ls -l libcudnn.so.7
> lrwxrwxrwx 1 sheath sheath 17 Nov 16 23:41 libcudnn.so.7 -> libcudnn.so.7.0.5
>
> root at gpu4$ ls -l libcudnn.so.7
> lrwxrwxrwx. 1 sheath sheath 17 Nov 16 23:41 libcudnn.so.7 -> libcudnn.so.7.0.5
>
> root at gpu5$ ls -l libcudnn.so.7
> lrwxrwxrwx. 1 root root 17 Mar 22 18:01 libcudnn.so.7 -> libcudnn.so.7.1.1
>
> root at gpu6$ ls -l libcudnn.so.7
> lrwxrwxrwx. 1 root root 17 Mar 22 18:00 libcudnn.so.7 -> libcudnn.so.7.1.1
>
> root at gpu8$ ls -l libcudnn.so.7
> lrwxrwxrwx. 1 root root 17 Mar 22 17:57 libcudnn.so.7 -> libcudnn.so.7.1.1
>
> root at gpu9$ ls -l libcudnn.so.7
> lrwxrwxrwx. 1 root root 17 Mar 22 17:53 libcudnn.so.7 -> libcudnn.so.7.1.1
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From predragp at andrew.cmu.edu Fri Mar 23 14:50:58 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Fri, 23 Mar 2018 14:50:58 -0400
Subject: Driver/library version mismatch on gpu nodes
In-Reply-To:
References: <20180322175158.IaQkucHzu%predragp@andrew.cmu.edu> <20180322191116.gT7vCEaxh%predragp@andrew.cmu.edu>
Message-ID: <20180323185058.D3-snyt7d%predragp@andrew.cmu.edu>

Jay Yoon Lee wrote:

> Hi Predrag,
>
> I am not sure if it's just me or everybody else.
> After the reboot, GPU 1, 4, 8 are working for me,
> but GPU 2, 3, 5, 6, 9 are not working for me.
>
> GPU 2, 3, 5, 6, 9 are complaining --> failed to connect to server
> Failed to initialize NVML: Driver/library version mismatch
>
> Is there anything I need to do on my end?
> (nvidia-smi does not work and I don't think I can do anything on my end.)

It works for me. I just logged into all GPU machines with the exception of GPU7 and nvidia-smi gave the correct report. I did test things yesterday but I didn't want to reply to your e-mail until I checked things one more time. It must be something about your environment variables. Also bear in mind that there are three different versions of CUDA on most of these GPUs (a sketch for selecting one follows after this message).

root at gpu8$ ls -1|grep cuda
cuda
cuda-8.0
cuda-9.0
cuda-9.1

Predrag

> Thanks,
> Jay-Yoon
>
> On Thu, Mar 22, 2018 at 3:11 PM, Predrag Punosevac wrote:
>
> > Jay Yoon Lee wrote:
> >
> > > Hi Predrag,
> > >
> > > Thanks for the email & I upvote for rebooting gpu3 & 4.
> > >
> > > As far as I know, before it was just gpu2 having the problem and now we have
> > > gpu3, 4 having the same symptoms.
> > >
> > > But, one question: I don't think gpu2 got fixed even after rebooting.
> > > Or is it just me? --> Do I have to reconfigure something?
> >
> > GPU2 has a problem with the full file system. I will move MATLAB to a
> > different location and resolve that. OK. GPU2 will also be down at 5 PM
> > for about an hour.
> >
> > Predrag
> >
> > > I am asking this question to see
> > > whether I have to do something once gpu3 & 4 are rebooted,
> > > since the gpu2 reboot didn't seem to work for me.
> > >
> > > Thanks!
> > > Jay-Yoon
> > >
> > > On Thu, Mar 22, 2018 at 1:51 PM, Predrag Punosevac <predragp at andrew.cmu.edu> wrote:
> > >
> > > > Michael Andrews wrote:
> > > >
> > > > > Hi Predrag,
> > > > >
> > > > > There seems to be a driver/library mismatch on some of the gpu nodes
> > > > > (e.g. gpu3, gpu4):
> > > > >
> > > > > $ nvidia-smi
> > > > > Failed to initialize NVML: Driver/library version mismatch
> > > > >
> > > >
> > > > Unfortunately the machines will have to be rebooted to clear that. I
> > > > will do it today at 5:00 PM.
> > > >
> > > > Predrag
> > > >
> > > > > Could you have a look when you get a chance?
> > > > >
> > > > > Thanks,
> > > > > Michael
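Since several CUDA toolkits live side by side under /usr/local, which one your job sees is mostly a matter of environment variables. A per-user sketch (pick the version your framework was built against; put it in ~/.bashrc if it works for you):

    export CUDA_HOME=/usr/local/cuda-9.0
    export PATH=$CUDA_HOME/bin:$PATH
    export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
    nvcc --version    # confirm which toolkit is now selected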
From predragp at andrew.cmu.edu Fri Mar 23 14:54:16 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Fri, 23 Mar 2018 14:54:16 -0400
Subject: GPU8 and GPU9
Message-ID: <20180323185416.YK3cK4Z31%predragp@andrew.cmu.edu>

GPU8 and GPU9 are no longer off limits and anybody can use them. The only GPU server out of GPU[1-9] which remains reserved for a specific project is GPU7.

Predrag

From predragp at andrew.cmu.edu Fri Mar 23 15:04:55 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Fri, 23 Mar 2018 15:04:55 -0400
Subject: Tensorflow
Message-ID: <20180323190455.PwnlJzgeb%predragp@andrew.cmu.edu>

Tensorflow is tested and works well for multiple people on GPU2, GPU3, and GPU4. On the other servers you have to make sure for now that you are using

/usr/local/cuda-9.0/lib64/libcudnn.so.7.0.5

This is due to the fact that Tensorflow is broken upstream with libcudnn.so.7.1.5. I am thinking about how best to work around this problem. All servers have three versions of cuda, but the default is the newest:

root at gpu1$ ls -l |grep cuda
lrwxrwxrwx. 1 root root 8 Mar 22 19:10 cuda -> cuda-9.1
drwxr-xr-x. 14 root root 4096 Apr 19 2017 cuda-8.0
drwxr-xr-x. 15 root root 4096 Nov 30 16:10 cuda-9.0
drwxr-xr-x. 15 root root 4096 Mar 22 18:59 cuda-9.1

Predrag

From predragp at andrew.cmu.edu Fri Mar 23 18:26:50 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Fri, 23 Mar 2018 18:26:50 -0400
Subject: Gogs/Git now uses LDAP for authentication
Message-ID: <20180323222650.I9CIJcHy-%predragp@andrew.cmu.edu>

Dear Autonians,

Our Gogs/Git repository is now using the LDAP database for authorization and authentication. All existing local accounts have been mapped to LDAP accounts. No local account will be created again for any reason!

If your Gogs password was the same as your LDAP password, no action is needed on your part. If you are using your e-mail address to log into the Gogs interface, you will have to use the same e-mail address I have in LDAP. Uploaded ssh keys are not affected (tested).

If your Gogs password was different from your LDAP password, you will have to use the LDAP password to log into the Gogs interface.

There was one user in Gogs whose username didn't match his LDAP username (Samy). His local account has been deleted as he had no repos. Next time he tries to log into the Gogs interface he will just use his LDAP credentials. Unfortunately he will have to upload his ssh-key again.

There are four remaining local accounts, three of which have Gogs/Git admin privileges: awertz, sheath, and predrag. Mr. Jenkins https://jenkins.io/ also has a local account.

Best,
Predrag

P.S. My understanding (please correct me, Anthony) is that all CVS repositories have been migrated to Git. Subversion code has not been migrated yet. CVS and Subversion remain available for historical reasons but nobody has write access. The easiest way to get on Anthony's, Simon's, and my bad side is to try to check in data or binary files into Gogs/Git. If you don't know much about version control, please stop by NSH 3119 for a short orientation.
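On the "no data or binary files in Gogs/Git" point, a client-side guard is easy to set up. A minimal pre-commit hook sketch (the 1 MB limit is an arbitrary assumption; save it as .git/hooks/pre-commit in your repo and make it executable):

    #!/bin/sh
    # reject commits that stage files larger than 1 MB
    # note: this simple loop assumes filenames without spaces
    limit=1048576
    for f in $(git diff --cached --name-only --diff-filter=AM); do
        size=$(wc -c < "$f")
        if [ "$size" -gt "$limit" ]; then
            echo "error: $f is $size bytes; keep data out of the repo" >&2
            exit 1
        fi
    done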
From predragp at andrew.cmu.edu Fri Mar 23 23:08:06 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Fri, 23 Mar 2018 23:08:06 -0400
Subject: Main File server full
Message-ID: <20180324030806.XCenX7ise%predragp@andrew.cmu.edu>

Dear Autonians,

Our main file server is full and I can no longer take snapshots of your home directories. This doesn't affect members of Neill's group, who have their own file server.

[root at gaia] /var/log# head -10 messages
Mar 23 13:00:00 gaia newsyslog[71545]: logfile turned over due to size>100K
Mar 23 13:00:08 gaia autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs snapshot -r -o freenas:state=NEW zfsauton/home at auto-20180323.1300-2w
Mar 23 13:00:09 gaia autosnap.py: [tools.autosnap:243] Failed to create snapshot 'zfsauton/home at auto-20180323.1300-2w': cannot create snapshot 'zfsauton/home at auto-20180323.1300-2w': out of space
no snapshots were created

The HDDs were purchased in November but I hesitated to take the plunge until I was 100% sure about the new design. Unfortunately this can't wait any longer. The plan of action is as follows.

1. Over the weekend I will verify the quality of the data and project pseudo-file-system (datasets in ZFS lingo) replications.

2. Assuming that those replications are OK, I will make them live on Monday. Some downtime is unavoidable. I hope to keep it within 2h.

3. Once we are happy with the live copies of the project and data pseudo-file systems, those will be destroyed on the main file server in order to make additional space for the snapshots of your home folders.

4. That might buy us time (at least 1-2 weeks).

5. Once home directories are properly replicated on the backup server they will be made live.

6. The main file server will be rebuilt with new HDDs. The old HDDs will not be erased and we will still be able to put that ZFS pool back online if things are not working the way we want.

7. In the new setup each home directory will be a separate dataset and for the first time we will have a 300GB quota. You will also be able to access your own snapshots without asking me for the files.

Best,
Predrag
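For context on what per-user datasets will allow, the ZFS side looks roughly like this (a sketch of admin-side commands; the dataset name is hypothetical):

    zfs list -t snapshot | head                  # snapshots and the space they hold
    zfs set quota=300G zfsauton/home/someuser    # per-user quota on a per-user dataset
    # users can then pull old files themselves from the dataset's
    # hidden .zfs/snapshot/<snapshot-name>/ directory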
From predragp at andrew.cmu.edu Mon Mar 26 21:00:21 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Mon, 26 Mar 2018 21:00:21 -0400
Subject: Lua Torch
In-Reply-To: <295ba51bedc34351966390695e2ad973@PGH-MSGMLT-02.andrew.ad.cmu.edu>
References: <295ba51bedc34351966390695e2ad973@PGH-MSGMLT-02.andrew.ad.cmu.edu>
Message-ID: <20180327010021.WzisjuLUh%predragp@andrew.cmu.edu>

Manzil Zaheer wrote:

> Hi Predrag,
>
> I am not able to use any GPUs on gpu5,6,7,9. I tried all 3 versions of cuda, but I get the following error:
>

I was able to build it after adding

export TORCH_NVCC_FLAGS="-D__CUDA_NO_HALF_OPERATORS__"

per

https://github.com/torch/torch7/issues/1086

When I try to run it I get errors that Lua packages are missing (probably due to my path variables). I have a vague recollection that Simon and I helped you once with this thing in the past. IIRC it was very picky about the versions of some Lua packages and required their versions, not the ones which come with yum.

Anyhow, I am forwarding this to users at autonlab in the hope that somebody is using it and might be of more help. Please stop by NSH 3119 and let us try to debug this.

Predrag

> THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 error=30 : unknown error
> Traceback (most recent call last):
> File "", line 1, in
> File "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 384, in _lazy_new
> _lazy_init()
> File "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 142, in _lazy_init
> torch._C._cuda_init()
> RuntimeError: cuda runtime error (30) : unknown error at /pytorch/torch/lib/THC/THCGeneral.c:70
>
> Can you kindly look into it?
>
> Thanks,
> Manzil

From manzil at cmu.edu Mon Mar 26 21:02:18 2018
From: manzil at cmu.edu (Manzil Zaheer)
Date: Tue, 27 Mar 2018 01:02:18 +0000
Subject: Lua Torch
In-Reply-To: <20180327010021.WzisjuLUh%predragp@andrew.cmu.edu>
References: <295ba51bedc34351966390695e2ad973@PGH-MSGMLT-02.andrew.ad.cmu.edu>, <20180327010021.WzisjuLUh%predragp@andrew.cmu.edu>
Message-ID: <9f3af81667764f5b9cdb9d10dd156914@PGH-MSGMLT-02.andrew.ad.cmu.edu>

Thanks for the detailed analysis. But I am using pytorch. I have not tried Lua torch. Can you please check? Thanks again!

Sent from my Samsung Galaxy smartphone.

-------- Original message --------
From: Predrag Punosevac
Date: 3/26/18 9:00 PM (GMT-05:00)
To: Manzil Zaheer
Cc: Barnabas Poczos , users at autonlab.org
Subject: Re: Lua Torch

Manzil Zaheer wrote:

> Hi Predrag,
>
> I am not able to use any GPUs on gpu5,6,7,9. I tried all 3 versions of cuda, but I get the following error:
>

I was able to build it after adding

export TORCH_NVCC_FLAGS="-D__CUDA_NO_HALF_OPERATORS__"

per

https://github.com/torch/torch7/issues/1086

When I try to run it I get errors that Lua packages are missing (probably due to my path variables). I have a vague recollection that Simon and I helped you once with this thing in the past. IIRC it was very picky about the versions of some Lua packages and required their versions, not the ones which come with yum.

Anyhow, I am forwarding this to users at autonlab in the hope that somebody is using it and might be of more help. Please stop by NSH 3119 and let us try to debug this.

Predrag

> THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 error=30 : unknown error
> Traceback (most recent call last):
> File "", line 1, in
> File "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 384, in _lazy_new
> _lazy_init()
> File "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 142, in _lazy_init
> torch._C._cuda_init()
> RuntimeError: cuda runtime error (30) : unknown error at /pytorch/torch/lib/THC/THCGeneral.c:70
>
> Can you kindly look into it?
>
> Thanks,
> Manzil
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From predragp at andrew.cmu.edu Mon Mar 26 22:50:12 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Mon, 26 Mar 2018 22:50:12 -0400
Subject: PyTorch
In-Reply-To:
References: <295ba51bedc34351966390695e2ad973@PGH-MSGMLT-02.andrew.ad.cmu.edu> <20180327010021.WzisjuLUh%predragp@andrew.cmu.edu>
Message-ID: <20180327025012.PucNB2br-%predragp@andrew.cmu.edu>

Manzil Zaheer wrote:

> Thanks for the detailed analysis. But I am using pytorch. I have not tried Lua torch. Can you please check? Thanks again!
>

I did. You have Python 3.6.4 in /opt/miniconda3/bin/python3.6

predrag at gpu3$ /opt/miniconda3/bin/python3.6
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Try reinstalling things in your scratch directory:

/opt/miniconda3/bin/conda install pytorch torchvision cuda91 -c pytorch

You should see something like

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    pillow-5.0.0               |   py36h3deb7b8_0         561 KB
    mkl-2018.0.2               |                1       205.2 MB
    cuda91-1.0                 |       h4c16780_0           3 KB  pytorch
    libpng-1.6.34              |       hb9fc6fc_0         334 KB
    freetype-2.8               |       hab7d2ae_1         804 KB
    libgfortran-ng-7.2.0       |       hdf63c60_3         1.2 MB
    intel-openmp-2018.0.0      |                8         620 KB
    libtiff-4.0.9              |       h28f6b97_0         586 KB
    pytorch-0.3.1              |py36_cuda9.1.85_cudnn7.0.5_2       475.0 MB  pytorch
    torchvision-0.2.0          |   py36h17b6947_1         102 KB  pytorch
    jpeg-9b                    |       h024ee3a_2         248 KB
    numpy-1.14.2               |   py36hdbf6ddf_0         4.0 MB
    olefile-0.45.1             |           py36_0          47 KB
    ------------------------------------------------------------
                                           Total:       688.7 MB

Make sure you use your scratch directory as the install path since the file server is full. I got a clean installation but I didn't play with it further. One thing that worries me is this line:

    pytorch-0.3.1              |py36_cuda9.1.85_cudnn7.0.5_2       475.0 MB  pytorch

We had problems with cudnn on 9.1, apparently because upstream was assuming 7.0.5 when in reality we have 7.1.1 on CUDA 9 and even 7.1.5 on CUDA 9.1.

GPU3 has the CUDNN 7.0.5 library in cuda-9.0, so try adjusting the conda command accordingly.

Best,
Predrag

> Sent from my Samsung Galaxy smartphone.
>
> -------- Original message --------
> From: Predrag Punosevac
> Date: 3/26/18 9:00 PM (GMT-05:00)
> To: Manzil Zaheer
> Cc: Barnabas Poczos , users at autonlab.org
> Subject: Re: Lua Torch
>
> Manzil Zaheer wrote:
>
> > Hi Predrag,
> >
> > I am not able to use any GPUs on gpu5,6,7,9. I tried all 3 versions of cuda, but I get the following error:
> >
>
> I was able to build it after adding
>
> export TORCH_NVCC_FLAGS="-D__CUDA_NO_HALF_OPERATORS__"
>
> per
>
> https://github.com/torch/torch7/issues/1086
>
> When I try to run it I get errors that Lua packages are missing (probably
> due to my path variables). I have a vague recollection that Simon and I
> helped you once with this thing in the past. IIRC it was very picky about
> the versions of some Lua packages and required their versions, not the ones
> which come with yum.
>
> Anyhow, I am forwarding this to users at autonlab in the hope that somebody
> is using it and might be of more help. Please stop by NSH 3119 and let us
> try to debug this.
>
> Predrag
>
> > THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 error=30 : unknown error
> > Traceback (most recent call last):
> > File "", line 1, in
> > File "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 384, in _lazy_new
> > _lazy_init()
> > File "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 142, in _lazy_init
> > torch._C._cuda_init()
> > RuntimeError: cuda runtime error (30) : unknown error at /pytorch/torch/lib/THC/THCGeneral.c:70
> >
> > Can you kindly look into it?
> >
> > Thanks,
> > Manzil
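To keep large package installs off the full file server, a whole conda environment can be placed under scratch with an explicit prefix. A sketch (the scratch path is an assumption; use whatever scratch directory exists on your node):

    # create the environment under scratch and install into it
    /opt/miniconda3/bin/conda create --prefix /home/scratch/$USER/envs/pt python=3.6
    /opt/miniconda3/bin/conda install --prefix /home/scratch/$USER/envs/pt pytorch torchvision cuda90 -c pytorch

    # then use it
    source /opt/miniconda3/bin/activate /home/scratch/$USER/envs/pt

The cuda90 package here is an assumption that follows the advice above about matching CUDNN 7.0.5 in cuda-9.0; swap in cuda91 if that turns out to work for you.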
From manzil at cmu.edu Tue Mar 27 01:00:39 2018
From: manzil at cmu.edu (Manzil Zaheer)
Date: Tue, 27 Mar 2018 05:00:39 +0000
Subject: PyTorch
In-Reply-To: <20180327025012.PucNB2br-%predragp@andrew.cmu.edu>
References: <295ba51bedc34351966390695e2ad973@PGH-MSGMLT-02.andrew.ad.cmu.edu> <20180327010021.WzisjuLUh%predragp@andrew.cmu.edu>, <20180327025012.PucNB2br-%predragp@andrew.cmu.edu>
Message-ID: <64e210675cba4c68a73e803cfcaca728@PGH-MSGMLT-03.andrew.ad.cmu.edu>

Hi Predrag,

Thanks again for your help. But I still cannot get anything running on GPU5,6,7,9. Also notice that GPU1,2,3,4,8 are almost all full, while no one is using GPU5,6,7,9. This might mean that no one else is able to run anything there either.

So I tried many things. Everything installs without issue. But when I try to run simple code like:

import torch
x = torch.cuda.FloatTensor(2,3,4)
print(x)

I get the following error:

THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 error=30 : unknown error
Traceback (most recent call last):
  File "", line 1, in
  File "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/_utils.py", line 69, in _cuda
    return new_type(self.size()).copy_(self, async)
  File "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 384, in _lazy_new
    _lazy_init()
  File "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 142, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (30) : unknown error at /pytorch/torch/lib/THC/THCGeneral.c:70

Thanks,
Manzil

________________________________________
From: Predrag Punosevac
Sent: 26 March 2018 22:50
To: Manzil Zaheer
Cc: Barnabas Poczos; users at autonlab.org
Subject: Re: PyTorch

Manzil Zaheer wrote:

> Thanks for the detailed analysis. But I am using pytorch. I have not tried Lua torch. Can you please check? Thanks again!
>

I did. You have Python 3.6.4 in /opt/miniconda3/bin/python3.6

predrag at gpu3$ /opt/miniconda3/bin/python3.6
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

Try reinstalling things in your scratch directory:

/opt/miniconda3/bin/conda install pytorch torchvision cuda91 -c pytorch

You should see something like

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    pillow-5.0.0               |   py36h3deb7b8_0         561 KB
    mkl-2018.0.2               |                1       205.2 MB
    cuda91-1.0                 |       h4c16780_0           3 KB  pytorch
    libpng-1.6.34              |       hb9fc6fc_0         334 KB
    freetype-2.8               |       hab7d2ae_1         804 KB
    libgfortran-ng-7.2.0       |       hdf63c60_3         1.2 MB
    intel-openmp-2018.0.0      |                8         620 KB
    libtiff-4.0.9              |       h28f6b97_0         586 KB
    pytorch-0.3.1              |py36_cuda9.1.85_cudnn7.0.5_2       475.0 MB  pytorch
    torchvision-0.2.0          |   py36h17b6947_1         102 KB  pytorch
    jpeg-9b                    |       h024ee3a_2         248 KB
    numpy-1.14.2               |   py36hdbf6ddf_0         4.0 MB
    olefile-0.45.1             |           py36_0          47 KB
    ------------------------------------------------------------
                                           Total:       688.7 MB

Make sure you use your scratch directory as the install path since the file server is full. I got a clean installation but I didn't play with it further. One thing that worries me is this line:

    pytorch-0.3.1              |py36_cuda9.1.85_cudnn7.0.5_2       475.0 MB  pytorch

We had problems with cudnn on 9.1, apparently because upstream was assuming 7.0.5 when in reality we have 7.1.1 on CUDA 9 and even 7.1.5 on CUDA 9.1.

GPU3 has the CUDNN 7.0.5 library in cuda-9.0, so try adjusting the conda command accordingly.

Best,
Predrag

> Sent from my Samsung Galaxy smartphone.
>
> -------- Original message --------
> From: Predrag Punosevac
> Date: 3/26/18 9:00 PM (GMT-05:00)
> To: Manzil Zaheer
> Cc: Barnabas Poczos , users at autonlab.org
> Subject: Re: Lua Torch
>
> Manzil Zaheer wrote:
>
> > Hi Predrag,
> >
> > I am not able to use any GPUs on gpu5,6,7,9. I tried all 3 versions of cuda, but I get the following error:
> >
>
> I was able to build it after adding
>
> export TORCH_NVCC_FLAGS="-D__CUDA_NO_HALF_OPERATORS__"
>
> per
>
> https://github.com/torch/torch7/issues/1086
>
> When I try to run it I get errors that Lua packages are missing (probably
> due to my path variables). I have a vague recollection that Simon and I
> helped you once with this thing in the past. IIRC it was very picky about
> the versions of some Lua packages and required their versions, not the ones
> which come with yum.
>
> Anyhow, I am forwarding this to users at autonlab in the hope that somebody
> is using it and might be of more help. Please stop by NSH 3119 and let us
> try to debug this.
>
> Predrag
>
> > THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 error=30 : unknown error
> > Traceback (most recent call last):
> > File "", line 1, in
> > File "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 384, in _lazy_new
> > _lazy_init()
> > File "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 142, in _lazy_init
> > torch._C._cuda_init()
> > RuntimeError: cuda runtime error (30) : unknown error at /pytorch/torch/lib/THC/THCGeneral.c:70
> >
> > Can you kindly look into it?
> >
> > Thanks,
> > Manzil
From predragp at andrew.cmu.edu Tue Mar 27 01:31:40 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Tue, 27 Mar 2018 01:31:40 -0400
Subject: PyTorch
In-Reply-To: <1522126842790.39313@cmu.edu>
References: <295ba51bedc34351966390695e2ad973@PGH-MSGMLT-02.andrew.ad.cmu.edu> <20180327010021.WzisjuLUh%predragp@andrew.cmu.edu> <20180327025012.PucNB2br-%predragp@andrew.cmu.edu> <1522126842790.39313@cmu.edu>
Message-ID: <20180327053140.xM3NWbFsK%predragp@andrew.cmu.edu>

Manzil Zaheer wrote:

> Hi Predrag,
>
> Thanks again for your help. But I still cannot get anything running on
> GPU5,6,7,9. Also notice that GPU1,2,3,4,8 are almost all full, while no
> one is using GPU5,6,7,9. This might mean that no one else is able to run
> anything there either.
>

GPU7 is off limits, used for a special project. How did you figure out that nobody is using it when you can't even log in there?

> So I tried many things. Everything installs without issue. But when I
> try to run simple code like:
>

PyTorch is research-grade software. They have a mailing list. 3 sec of Googling reveals

https://github.com/pytorch/pytorch/issues/2527

also

https://stackoverflow.com/questions/45861767/pytorch-giving-cuda-runtime-error

I will look at this more, but it would be helpful if you got on the PyTorch mailing list and asked the developers what they think. I see this once every 9 months; they look at these bugs every day.

Predrag

> import torch
> x = torch.cuda.FloatTensor(2,3,4)
> print(x)
>
>
> I get the following error:
> THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 error=30 : unknown error
> Traceback (most recent call last):
> File "", line 1, in
> File "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/_utils.py", line 69, in _cuda
> return new_type(self.size()).copy_(self, async)
> File "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 384, in _lazy_new
> _lazy_init()
> File "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 142, in _lazy_init
> torch._C._cuda_init()
> RuntimeError: cuda runtime error (30) : unknown error at /pytorch/torch/lib/THC/THCGeneral.c:70
>
> Thanks,
> Manzil
>
> ________________________________________
> From: Predrag Punosevac
> Sent: 26 March 2018 22:50
> To: Manzil Zaheer
> Cc: Barnabas Poczos; users at autonlab.org
> Subject: Re: PyTorch
>
> Manzil Zaheer wrote:
>
> > Thanks for the detailed analysis. But I am using pytorch. I have not tried Lua torch. Can you please check? Thanks again!
> >
>
> > I did.
You have Python 3.6.4 in /opt/miniconda3/bin/python3.6 > > predrag at gpu3$ /opt/miniconda3/bin/python3.6 > Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19) > [GCC 7.2.0] on linux > Type "help", "copyright", "credits" or "license" for more information. > > > Try reinstalling thing in your scratch directory as > > /opt/miniconda3/bin/conda install pytorch torchvision cuda91 -c pytorch > > You should see something like > > The following packages will be downloaded: > > package | build > ---------------------------|----------------- > pillow-5.0.0 | py36h3deb7b8_0 561 KB > mkl-2018.0.2 | 1 205.2 MB > cuda91-1.0 | h4c16780_0 3 KB > pytorch > libpng-1.6.34 | hb9fc6fc_0 334 KB > freetype-2.8 | hab7d2ae_1 804 KB > libgfortran-ng-7.2.0 | hdf63c60_3 1.2 MB > intel-openmp-2018.0.0 | 8 620 KB > libtiff-4.0.9 | h28f6b97_0 586 KB > pytorch-0.3.1 |py36_cuda9.1.85_cudnn7.0.5_2 475.0 > MB pytorch > torchvision-0.2.0 | py36h17b6947_1 102 KB > pytorch > jpeg-9b | h024ee3a_2 248 KB > numpy-1.14.2 | py36hdbf6ddf_0 4.0 MB > olefile-0.45.1 | py36_0 47 KB > ------------------------------------------------------------ > Total: 688.7 MB > > > Make sure you put your scratch as a path since file server is full. I > got clean installation but I didn't play further. One thing that worries > me is this line > > pytorch-0.3.1 |py36_cuda9.1.85_cudnn7.0.5_2 475.0 MB > pytorch > > We had problems with cudnn on 9.1 apparently because the upstream was > assuming 7.0.5 when in reality I have 7.1.1 CUDA 9 or even 7.1.5. CUDA > 9.1 > > GPU3 has CUDNN library 7.0.5 in cuda-9.0 so try adjusting conda command > accordingly. > > > Best, > Predrag > > > > > > > > > > > > Sent from my Samsung Galaxy smartphone. > > > > > > -------- Original message -------- > > From: Predrag Punosevac > > Date: 3/26/18 9:00 PM (GMT-05:00) > > To: Manzil Zaheer > > Cc: Barnabas Poczos , users at autonlab.org > > Subject: Re: Lua Torch > > > > Manzil Zaheer wrote: > > > > > Hi Predrag, > > > > > > I am not able to use any GPUSs on gpu5,6,7,9. I tried all 3 versions of cuda, but I get the following error: > > > > > > > > > I was able to build it after adding this > > > > export TORCH_NVCC_FLAGS="-D__CUDA_NO_HALF_OPERATORS__" > > > > per > > > > https://github.com/torch/torch7/issues/1086 > > > > When I try to run it I get errors that Lua packages are missing (probably > > due to my path variables). I have a vague recollection that Simon and I > > halped you once with this thing in the past. IIRC it was very picky about > > the version of some Lua package and required their version not the one > > which comes with yum . > > > > Anyhow I am forwarding this to users at autonlab in hope somebody is using > > it and might be of more help. Please stop by NSH 3119 and let us try to > > debug this. > > > > Predrag > > > > > > > > > > > THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 error=30 : unknown error > > > Traceback (most recent call last): > > > File "", line 1, in > > > File "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 384, in _lazy_new > > > _lazy_init() > > > File "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 142, in _lazy_init > > > torch._C._cuda_init() > > > RuntimeError: cuda runtime error (30) : unknown error at /pytorch/torch/lib/THC/THCGeneral.c:70 > > > > > > Can you kindly look into it? 
> > > > > > Thanks, > > > Manzil From manzil at cmu.edu Tue Mar 27 01:46:44 2018 From: manzil at cmu.edu (Manzil Zaheer) Date: Tue, 27 Mar 2018 05:46:44 +0000 Subject: PyTorch In-Reply-To: <20180327053140.xM3NWbFsK%predragp@andrew.cmu.edu> References: <295ba51bedc34351966390695e2ad973@PGH-MSGMLT-02.andrew.ad.cmu.edu> <20180327010021.WzisjuLUh%predragp@andrew.cmu.edu> <20180327025012.PucNB2br-%predragp@andrew.cmu.edu> <1522126842790.39313@cmu.edu>, <20180327053140.xM3NWbFsK%predragp@andrew.cmu.edu> Message-ID: <798cc89dfa1a47b691994bc96880c039@PGH-MSGMLT-03.andrew.ad.cmu.edu> Hi Predrag, Thanks for pointing out the links. From the link you provided, we can see that FB engineers mention that "error 30 is usually unrelated to pytorch issues (or your code change)". Thanks, Manzil ________________________________________ From: Predrag Punosevac Sent: 27 March 2018 01:31 To: Manzil Zaheer Cc: Barnabas Poczos; users at autonlab.org Subject: Re: PyTorch Manzil Zaheer wrote: > Hi Pregrad, > > Thanks again for your help. But I still can not get anything running on GPU5,6,7,9. Also notice that GPU1,2,3,4,8 almost all GPUs are full, while no one is using GPU5,6,7,9. This might mean no one else is also able to run anything as well. > 7 if off limit used for the special project. How did you figure out that nobody is using it when you can't even log there? > So I tried many things. Everything installs without issue. But when i try to run the simple code like: > PyTorch is a research grade software. They have a mailing list. 3 sec Googling reveals https://github.com/pytorch/pytorch/issues/2527 also https://stackoverflow.com/questions/45861767/pytorch-giving-cuda-runtime-error I will look at this more but it would be helpful if you get on PyTorch mailing list and ask developers what they think. I see this once every 9 months they are looking at this bugs every day. Predrag > import torch > x = torch.cuda.FloatTensor(2,3,4) > print(x) > > > I get the following error: > THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 error=30 : unknown error > Traceback (most recent call last): > File "", line 1, in > File "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/_utils.py", line 69, in _cuda > return new_type(self.size()).copy_(self, async) > File "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 384, in _lazy_new > _lazy_init() > File "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 142, in _lazy_init > torch._C._cuda_init() > RuntimeError: cuda runtime error (30) : unknown error at /pytorch/torch/lib/THC/THCGeneral.c:70 > > Thanks, > Manzil > > ________________________________________ > From: Predrag Punosevac > Sent: 26 March 2018 22:50 > To: Manzil Zaheer > Cc: Barnabas Poczos; users at autonlab.org > Subject: Re: PyTorch > > Manzil Zaheer wrote: > > > Thanks for the detailed analysis. But I am using pytorch. I have not tried Lua torch. Can you please check? Thanks again! > > > > I did. You have Python 3.6.4 in /opt/miniconda3/bin/python3.6 > > predrag at gpu3$ /opt/miniconda3/bin/python3.6 > Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19) > [GCC 7.2.0] on linux > Type "help", "copyright", "credits" or "license" for more information. 
> > > Try reinstalling thing in your scratch directory as > > /opt/miniconda3/bin/conda install pytorch torchvision cuda91 -c pytorch > > You should see something like > > The following packages will be downloaded: > > package | build > ---------------------------|----------------- > pillow-5.0.0 | py36h3deb7b8_0 561 KB > mkl-2018.0.2 | 1 205.2 MB > cuda91-1.0 | h4c16780_0 3 KB > pytorch > libpng-1.6.34 | hb9fc6fc_0 334 KB > freetype-2.8 | hab7d2ae_1 804 KB > libgfortran-ng-7.2.0 | hdf63c60_3 1.2 MB > intel-openmp-2018.0.0 | 8 620 KB > libtiff-4.0.9 | h28f6b97_0 586 KB > pytorch-0.3.1 |py36_cuda9.1.85_cudnn7.0.5_2 475.0 > MB pytorch > torchvision-0.2.0 | py36h17b6947_1 102 KB > pytorch > jpeg-9b | h024ee3a_2 248 KB > numpy-1.14.2 | py36hdbf6ddf_0 4.0 MB > olefile-0.45.1 | py36_0 47 KB > ------------------------------------------------------------ > Total: 688.7 MB > > > Make sure you put your scratch as a path since file server is full. I > got clean installation but I didn't play further. One thing that worries > me is this line > > pytorch-0.3.1 |py36_cuda9.1.85_cudnn7.0.5_2 475.0 MB > pytorch > > We had problems with cudnn on 9.1 apparently because the upstream was > assuming 7.0.5 when in reality I have 7.1.1 CUDA 9 or even 7.1.5. CUDA > 9.1 > > GPU3 has CUDNN library 7.0.5 in cuda-9.0 so try adjusting conda command > accordingly. > > > Best, > Predrag > > > > > > > > > > > > Sent from my Samsung Galaxy smartphone. > > > > > > -------- Original message -------- > > From: Predrag Punosevac > > Date: 3/26/18 9:00 PM (GMT-05:00) > > To: Manzil Zaheer > > Cc: Barnabas Poczos , users at autonlab.org > > Subject: Re: Lua Torch > > > > Manzil Zaheer wrote: > > > > > Hi Predrag, > > > > > > I am not able to use any GPUSs on gpu5,6,7,9. I tried all 3 versions of cuda, but I get the following error: > > > > > > > > > I was able to build it after adding this > > > > export TORCH_NVCC_FLAGS="-D__CUDA_NO_HALF_OPERATORS__" > > > > per > > > > https://github.com/torch/torch7/issues/1086 > > > > When I try to run it I get errors that Lua packages are missing (probably > > due to my path variables). I have a vague recollection that Simon and I > > halped you once with this thing in the past. IIRC it was very picky about > > the version of some Lua package and required their version not the one > > which comes with yum . > > > > Anyhow I am forwarding this to users at autonlab in hope somebody is using > > it and might be of more help. Please stop by NSH 3119 and let us try to > > debug this. > > > > Predrag > > > > > > > > > > > THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 error=30 : unknown error > > > Traceback (most recent call last): > > > File "", line 1, in > > > File "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 384, in _lazy_new > > > _lazy_init() > > > File "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 142, in _lazy_init > > > torch._C._cuda_init() > > > RuntimeError: cuda runtime error (30) : unknown error at /pytorch/torch/lib/THC/THCGeneral.c:70 > > > > > > Can you kindly look into it? 
> > > > > > Thanks, > > > Manzil From mbarnes1 at andrew.cmu.edu Tue Mar 27 08:30:13 2018 From: mbarnes1 at andrew.cmu.edu (Matthew Barnes) Date: Tue, 27 Mar 2018 12:30:13 +0000 Subject: PyTorch In-Reply-To: <798cc89dfa1a47b691994bc96880c039@PGH-MSGMLT-03.andrew.ad.cmu.edu> References: <295ba51bedc34351966390695e2ad973@PGH-MSGMLT-02.andrew.ad.cmu.edu> <20180327010021.WzisjuLUh%predragp@andrew.cmu.edu> <20180327025012.PucNB2br-%predragp@andrew.cmu.edu> <1522126842790.39313@cmu.edu> <20180327053140.xM3NWbFsK%predragp@andrew.cmu.edu> <798cc89dfa1a47b691994bc96880c039@PGH-MSGMLT-03.andrew.ad.cmu.edu> Message-ID: I think this is an issue with the CUDA install. I'm unable to run Tensorflow jobs on GPU9 as of last night (have not checked the others, but I suspect similar). 2018-03-26 14:54:49.214493: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_UNKNOWN 2018-03-26 14:54:49.214599: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: gpu9.int.autonlab.org 2018-03-26 14:54:49.214617: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: gpu9.int.autonlab.org 2018-03-26 14:54:49.214685: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 390.30.0 2018-03-26 14:54:49.214747: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 390.30.0 2018-03-26 14:54:49.214762: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 390.30.0 On Tue, Mar 27, 2018 at 1:47 AM Manzil Zaheer wrote: > Hi Predrag, > > Thanks for pointing out the links. From the link you provided, we can see > that FB engineers mention that "error 30 is usually unrelated to pytorch > issues (or your code change)". > > Thanks, > Manzil > ________________________________________ > From: Predrag Punosevac > Sent: 27 March 2018 01:31 > To: Manzil Zaheer > Cc: Barnabas Poczos; users at autonlab.org > Subject: Re: PyTorch > > Manzil Zaheer wrote: > > > Hi Pregrad, > > > > Thanks again for your help. But I still can not get anything running on > GPU5,6,7,9. Also notice that GPU1,2,3,4,8 almost all GPUs are full, while > no one is using GPU5,6,7,9. This might mean no one else is also able to run > anything as well. > > > > 7 if off limit used for the special project. How did you figure out that > nobody is using it when > you can't even log there? > > > So I tried many things. Everything installs without issue. But when i > try to run the simple code like: > > > > PyTorch is a research grade software. They have a mailing list. 3 sec > Googling reveals > > > https://github.com/pytorch/pytorch/issues/2527 > > also > > > https://stackoverflow.com/questions/45861767/pytorch-giving-cuda-runtime-error > > I will look at this more but it would be helpful if you get on PyTorch > mailing list and ask > developers what they think. I see this once every 9 months they are > looking at this bugs every > day. 
> > Predrag > > > import torch > > x = torch.cuda.FloatTensor(2,3,4) > > print(x) > > > > > > I get the following error: > > THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 > error=30 : unknown error > > Traceback (most recent call last): > > File "", line 1, in > > File > "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/_utils.py", > line 69, in _cuda > > return new_type(self.size()).copy_(self, async) > > File > "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", > line 384, in _lazy_new > > _lazy_init() > > File > "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", > line 142, in _lazy_init > > torch._C._cuda_init() > > RuntimeError: cuda runtime error (30) : unknown error at > /pytorch/torch/lib/THC/THCGeneral.c:70 > > > > Thanks, > > Manzil > > > > ________________________________________ > > From: Predrag Punosevac > > Sent: 26 March 2018 22:50 > > To: Manzil Zaheer > > Cc: Barnabas Poczos; users at autonlab.org > > Subject: Re: PyTorch > > > > Manzil Zaheer wrote: > > > > > Thanks for the detailed analysis. But I am using pytorch. I have not > tried Lua torch. Can you please check? Thanks again! > > > > > > > I did. You have Python 3.6.4 in /opt/miniconda3/bin/python3.6 > > > > predrag at gpu3$ /opt/miniconda3/bin/python3.6 > > Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19) > > [GCC 7.2.0] on linux > > Type "help", "copyright", "credits" or "license" for more information. > > > > > > Try reinstalling thing in your scratch directory as > > > > /opt/miniconda3/bin/conda install pytorch torchvision cuda91 -c pytorch > > > > You should see something like > > > > The following packages will be downloaded: > > > > package | build > > ---------------------------|----------------- > > pillow-5.0.0 | py36h3deb7b8_0 561 KB > > mkl-2018.0.2 | 1 205.2 MB > > cuda91-1.0 | h4c16780_0 3 KB > > pytorch > > libpng-1.6.34 | hb9fc6fc_0 334 KB > > freetype-2.8 | hab7d2ae_1 804 KB > > libgfortran-ng-7.2.0 | hdf63c60_3 1.2 MB > > intel-openmp-2018.0.0 | 8 620 KB > > libtiff-4.0.9 | h28f6b97_0 586 KB > > pytorch-0.3.1 |py36_cuda9.1.85_cudnn7.0.5_2 475.0 > > MB pytorch > > torchvision-0.2.0 | py36h17b6947_1 102 KB > > pytorch > > jpeg-9b | h024ee3a_2 248 KB > > numpy-1.14.2 | py36hdbf6ddf_0 4.0 MB > > olefile-0.45.1 | py36_0 47 KB > > ------------------------------------------------------------ > > Total: 688.7 MB > > > > > > Make sure you put your scratch as a path since file server is full. I > > got clean installation but I didn't play further. One thing that worries > > me is this line > > > > pytorch-0.3.1 |py36_cuda9.1.85_cudnn7.0.5_2 475.0 MB > > pytorch > > > > We had problems with cudnn on 9.1 apparently because the upstream was > > assuming 7.0.5 when in reality I have 7.1.1 CUDA 9 or even 7.1.5. CUDA > > 9.1 > > > > GPU3 has CUDNN library 7.0.5 in cuda-9.0 so try adjusting conda command > > accordingly. > > > > > > Best, > > Predrag > > > > > > > > > > > > > > > > > > > > > Sent from my Samsung Galaxy smartphone. > > > > > > > > > -------- Original message -------- > > > From: Predrag Punosevac > > > Date: 3/26/18 9:00 PM (GMT-05:00) > > > To: Manzil Zaheer > > > Cc: Barnabas Poczos , users at autonlab.org > > > Subject: Re: Lua Torch > > > > > > Manzil Zaheer wrote: > > > > > > > Hi Predrag, > > > > > > > > I am not able to use any GPUSs on gpu5,6,7,9. 
I tried all 3 versions > of cuda, but I get the following error: > > > > > > > > > I was able to build it after adding this > > > > > > export TORCH_NVCC_FLAGS="-D__CUDA_NO_HALF_OPERATORS__" > > > > > > per > > > > > > https://github.com/torch/torch7/issues/1086 > > > > > > When I try to run it I get errors that Lua packages are missing (probably > > > due to my path variables). I have a vague recollection that Simon and I > > > helped you once with this thing in the past. IIRC it was very picky about > > > the version of some Lua package and required their version not the one > > > which comes with yum . > > > > > > Anyhow I am forwarding this to users at autonlab in hope somebody is using > > > it and might be of more help. Please stop by NSH 3119 and let us try to > > > debug this. > > > > > > Predrag > > > > > > > > > > > > > > THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 > error=30 : unknown error > > > > Traceback (most recent call last): > > > > File "", line 1, in > > > > File > "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", > line 384, in _lazy_new > > > > _lazy_init() > > > > File > "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", > line 142, in _lazy_init > > > > torch._C._cuda_init() > > > > RuntimeError: cuda runtime error (30) : unknown error at > /pytorch/torch/lib/THC/THCGeneral.c:70 > > > > > > > > Can you kindly look into it? > > > > > > > > Thanks, > > > > Manzil > > -------------- next part -------------- An HTML attachment was scrubbed... URL:

From predragp at andrew.cmu.edu Tue Mar 27 17:35:56 2018 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Tue, 27 Mar 2018 17:35:56 -0400 Subject: PyTorch In-Reply-To: References: <295ba51bedc34351966390695e2ad973@PGH-MSGMLT-02.andrew.ad.cmu.edu> <20180327010021.WzisjuLUh%predragp@andrew.cmu.edu> <20180327025012.PucNB2br-%predragp@andrew.cmu.edu> <1522126842790.39313@cmu.edu> <20180327053140.xM3NWbFsK%predragp@andrew.cmu.edu> <798cc89dfa1a47b691994bc96880c039@PGH-MSGMLT-03.andrew.ad.cmu.edu> Message-ID: <20180327213556.fTho4DuWR%predragp@andrew.cmu.edu>

Matthew Barnes wrote: > I think this is an issue with the CUDA install. I'm unable to run > Tensorflow jobs on GPU9 as of last night (have not checked the others, but > I suspect similar).

Nothing has changed since last night. The error you are seeing is TensorFlow complaining about the 390.30 NVidia driver, but we upgraded the driver last week across all servers and IIRC you were able to use TensorFlow on GPU2, GPU3, and GPU4 after the upgrade.

The main problem seems to be the CUDNN library, as TensorFlow and PyTorch seem to expect older libraries. Look for them in the CUDA-9.0 directory (a quick way to check is sketched below).
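A quick sketch for checking what a given CUDA tree actually ships -- the /usr/local/cuda-9.0 prefix is my guess at the install location, so adjust it to wherever CUDA-9.0 lives on your node:

# print the cuDNN version this CUDA tree ships with
grep -A 2 'define CUDNN_MAJOR' /usr/local/cuda-9.0/include/cudnn.h

# list the cuDNN runtime libraries that would be linked against
ls -l /usr/local/cuda-9.0/lib64/libcudnn*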
Predrag > > 2018-03-26 14:54:49.214493: E > tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: > CUDA_ERROR_UNKNOWN > 2018-03-26 14:54:49.214599: I > tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA > diagnostic information for host: gpu9.int.autonlab.org > 2018-03-26 14:54:49.214617: I > tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: > gpu9.int.autonlab.org > 2018-03-26 14:54:49.214685: I > tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported > version is: 390.30.0 > 2018-03-26 14:54:49.214747: I > tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported > version is: 390.30.0 > 2018-03-26 14:54:49.214762: I > tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version > seems to match DSO: 390.30.0 > > > On Tue, Mar 27, 2018 at 1:47 AM Manzil Zaheer wrote: > > > Hi Predrag, > > > > Thanks for pointing out the links. From the link you provided, we can see > > that FB engineers mention that "error 30 is usually unrelated to pytorch > > issues (or your code change)". > > > > Thanks, > > Manzil > > ________________________________________ > > From: Predrag Punosevac > > Sent: 27 March 2018 01:31 > > To: Manzil Zaheer > > Cc: Barnabas Poczos; users at autonlab.org > > Subject: Re: PyTorch > > > > Manzil Zaheer wrote: > > > > > Hi Pregrad, > > > > > > Thanks again for your help. But I still can not get anything running on > > GPU5,6,7,9. Also notice that GPU1,2,3,4,8 almost all GPUs are full, while > > no one is using GPU5,6,7,9. This might mean no one else is also able to run > > anything as well. > > > > > > > 7 if off limit used for the special project. How did you figure out that > > nobody is using it when > > you can't even log there? > > > > > So I tried many things. Everything installs without issue. But when i > > try to run the simple code like: > > > > > > > PyTorch is a research grade software. They have a mailing list. 3 sec > > Googling reveals > > > > > > https://github.com/pytorch/pytorch/issues/2527 > > > > also > > > > > > https://stackoverflow.com/questions/45861767/pytorch-giving-cuda-runtime-error > > > > I will look at this more but it would be helpful if you get on PyTorch > > mailing list and ask > > developers what they think. I see this once every 9 months they are > > looking at this bugs every > > day. 
> > > > Predrag > > > > > import torch > > > x = torch.cuda.FloatTensor(2,3,4) > > > print(x) > > > > > > > > > I get the following error: > > > THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 > > error=30 : unknown error > > > Traceback (most recent call last): > > > File "", line 1, in > > > File > > "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/_utils.py", > > line 69, in _cuda > > > return new_type(self.size()).copy_(self, async) > > > File > > "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", > > line 384, in _lazy_new > > > _lazy_init() > > > File > > "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", > > line 142, in _lazy_init > > > torch._C._cuda_init() > > > RuntimeError: cuda runtime error (30) : unknown error at > > /pytorch/torch/lib/THC/THCGeneral.c:70 > > > > > > Thanks, > > > Manzil > > > > > > ________________________________________ > > > From: Predrag Punosevac > > > Sent: 26 March 2018 22:50 > > > To: Manzil Zaheer > > > Cc: Barnabas Poczos; users at autonlab.org > > > Subject: Re: PyTorch > > > > > > Manzil Zaheer wrote: > > > > > > > Thanks for the detailed analysis. But I am using pytorch. I have not > > tried Lua torch. Can you please check? Thanks again! > > > > > > > > > > I did. You have Python 3.6.4 in /opt/miniconda3/bin/python3.6 > > > > > > predrag at gpu3$ /opt/miniconda3/bin/python3.6 > > > Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19) > > > [GCC 7.2.0] on linux > > > Type "help", "copyright", "credits" or "license" for more information. > > > > > > > > > Try reinstalling thing in your scratch directory as > > > > > > /opt/miniconda3/bin/conda install pytorch torchvision cuda91 -c pytorch > > > > > > You should see something like > > > > > > The following packages will be downloaded: > > > > > > package | build > > > ---------------------------|----------------- > > > pillow-5.0.0 | py36h3deb7b8_0 561 KB > > > mkl-2018.0.2 | 1 205.2 MB > > > cuda91-1.0 | h4c16780_0 3 KB > > > pytorch > > > libpng-1.6.34 | hb9fc6fc_0 334 KB > > > freetype-2.8 | hab7d2ae_1 804 KB > > > libgfortran-ng-7.2.0 | hdf63c60_3 1.2 MB > > > intel-openmp-2018.0.0 | 8 620 KB > > > libtiff-4.0.9 | h28f6b97_0 586 KB > > > pytorch-0.3.1 |py36_cuda9.1.85_cudnn7.0.5_2 475.0 > > > MB pytorch > > > torchvision-0.2.0 | py36h17b6947_1 102 KB > > > pytorch > > > jpeg-9b | h024ee3a_2 248 KB > > > numpy-1.14.2 | py36hdbf6ddf_0 4.0 MB > > > olefile-0.45.1 | py36_0 47 KB > > > ------------------------------------------------------------ > > > Total: 688.7 MB > > > > > > > > > Make sure you put your scratch as a path since file server is full. I > > > got clean installation but I didn't play further. One thing that worries > > > me is this line > > > > > > pytorch-0.3.1 |py36_cuda9.1.85_cudnn7.0.5_2 475.0 MB > > > pytorch > > > > > > We had problems with cudnn on 9.1 apparently because the upstream was > > > assuming 7.0.5 when in reality I have 7.1.1 CUDA 9 or even 7.1.5. CUDA > > > 9.1 > > > > > > GPU3 has CUDNN library 7.0.5 in cuda-9.0 so try adjusting conda command > > > accordingly. > > > > > > > > > Best, > > > Predrag > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sent from my Samsung Galaxy smartphone. 
> > > > > > > > > > > > -------- Original message -------- > > > > From: Predrag Punosevac > > > > Date: 3/26/18 9:00 PM (GMT-05:00) > > > > To: Manzil Zaheer > > > > Cc: Barnabas Poczos , users at autonlab.org > > > > Subject: Re: Lua Torch > > > > > > > > Manzil Zaheer wrote: > > > > > > > > > Hi Predrag, > > > > > > > > > > I am not able to use any GPUSs on gpu5,6,7,9. I tried all 3 versions > of cuda, but I get the following error: > > > > > > > > > > > > > I was able to build it after adding this > > > > > > > > export TORCH_NVCC_FLAGS="-D__CUDA_NO_HALF_OPERATORS__" > > > > > > > > per > > > > > > > > https://github.com/torch/torch7/issues/1086 > > > > > > > > When I try to run it I get errors that Lua packages are missing > (probably > > > > due to my path variables). I have a vague recollection that Simon and I > > > > helped you once with this thing in the past. IIRC it was very picky > about > > > > the version of some Lua package and required their version not the one > > > > which comes with yum . > > > > > > > > Anyhow I am forwarding this to users at autonlab in hope somebody is > using > > > > it and might be of more help. Please stop by NSH 3119 and let us try to > > > > debug this. > > > > > > > > Predrag > > > > > > > > > > > > > > > > > THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 > error=30 : unknown error > > > > > Traceback (most recent call last): > > > > > File "", line 1, in > > > > > File > "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", > line 384, in _lazy_new > > > > > _lazy_init() > > > > > File > "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", > line 142, in _lazy_init > > > > > torch._C._cuda_init() > > > > > RuntimeError: cuda runtime error (30) : unknown error at > /pytorch/torch/lib/THC/THCGeneral.c:70 > > > > > > > > > > Can you kindly look into it? > > > > > > > > > > Thanks, > > > > > Manzil > > -------------- next part -------------- An HTML attachment was scrubbed... URL:

From barunpatra95 at gmail.com Wed Mar 28 01:34:56 2018 From: barunpatra95 at gmail.com (Barun Patra) Date: Wed, 28 Mar 2018 01:34:56 -0400 Subject: PyTorch In-Reply-To: <20180327213556.fTho4DuWR%predragp@andrew.cmu.edu> References: <295ba51bedc34351966390695e2ad973@PGH-MSGMLT-02.andrew.ad.cmu.edu> <20180327010021.WzisjuLUh%predragp@andrew.cmu.edu> <20180327025012.PucNB2br-%predragp@andrew.cmu.edu> <1522126842790.39313@cmu.edu> <20180327053140.xM3NWbFsK%predragp@andrew.cmu.edu> <798cc89dfa1a47b691994bc96880c039@PGH-MSGMLT-03.andrew.ad.cmu.edu> <20180327213556.fTho4DuWR%predragp@andrew.cmu.edu> Message-ID:

Has anyone been able to run either Tensorflow or pytorch on gpu machines 5, 6, 9? Both give CUDA_ERROR_UNKNOWN errors. I tried setting my LD_LIBRARY_PATH and PATH variables to the cuda-8.0 / cuda-9.0 / cuda-9.1 (and the LD_LIBRARY_PATH to the corresponding lib64), reinstalling pytorch for cuda-8.0 / cuda-9.0 / cuda-9.1 using both virtualenv and the system miniconda, as well as reinstalling tensorflow. Nothing seems to work, unfortunately. IIRC, these errors first appeared when the systems were rebooted after the spring break, and have persisted ever since. Any help in the matter would be appreciated!

On Tue, Mar 27, 2018 at 5:35 PM, Predrag Punosevac wrote: > Matthew Barnes wrote: > > > I think this is an issue with the CUDA install. I'm unable to run > > Tensorflow jobs on GPU9 as of last night (have not checked the others, > but > > I suspect similar). > > Nothing has changed since the last night.
The error you are seeing is > TensorFlow complaning about 390.30 NVidia driver but we upgraded driver > last week accross all servers and IIRC you were able to use TensorFlow > on GPU2, GPU3, and GPU4 after the upgrade. > > The main problem seems CUDNN library as TensorFlow and PyTorch seems to > expect older libraries. Look for them in CUDA-9.0 directory. > > Predrag > > > > > 2018-03-26 14:54:49.214493: E > > tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to > cuInit: > > CUDA_ERROR_UNKNOWN > > 2018-03-26 14:54:49.214599: I > > tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA > > diagnostic information for host: gpu9.int.autonlab.org > > 2018-03-26 14:54:49.214617: I > > tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: > > gpu9.int.autonlab.org > > 2018-03-26 14:54:49.214685: I > > tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda > reported > > version is: 390.30.0 > > 2018-03-26 14:54:49.214747: I > > tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported > > version is: 390.30.0 > > 2018-03-26 14:54:49.214762: I > > tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version > > seems to match DSO: 390.30.0 > > > > > > On Tue, Mar 27, 2018 at 1:47 AM Manzil Zaheer wrote: > > > > > Hi Predrag, > > > > > > Thanks for pointing out the links. From the link you provided, we can > see > > > that FB engineers mention that "error 30 is usually unrelated to > pytorch > > > issues (or your code change)". > > > > > > Thanks, > > > Manzil > > > ________________________________________ > > > From: Predrag Punosevac > > > Sent: 27 March 2018 01:31 > > > To: Manzil Zaheer > > > Cc: Barnabas Poczos; users at autonlab.org > > > Subject: Re: PyTorch > > > > > > Manzil Zaheer wrote: > > > > > > > Hi Pregrad, > > > > > > > > Thanks again for your help. But I still can not get anything running > on > > > GPU5,6,7,9. Also notice that GPU1,2,3,4,8 almost all GPUs are full, > while > > > no one is using GPU5,6,7,9. This might mean no one else is also able > to run > > > anything as well. > > > > > > > > > > 7 if off limit used for the special project. How did you figure out > that > > > nobody is using it when > > > you can't even log there? > > > > > > > So I tried many things. Everything installs without issue. But when i > > > try to run the simple code like: > > > > > > > > > > PyTorch is a research grade software. They have a mailing list. 3 sec > > > Googling reveals > > > > > > > > > https://github.com/pytorch/pytorch/issues/2527 > > > > > > also > > > > > > > > > https://stackoverflow.com/questions/45861767/pytorch- > giving-cuda-runtime-error > > > > > > I will look at this more but it would be helpful if you get on PyTorch > > > mailing list and ask > > > developers what they think. I see this once every 9 months they are > > > looking at this bugs every > > > day. 
> > > > > > Predrag > > > > > > > import torch > > > > x = torch.cuda.FloatTensor(2,3,4) > > > > print(x) > > > > > > > > > > > > I get the following error: > > > > THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 > > > error=30 : unknown error > > > > Traceback (most recent call last): > > > > File "", line 1, in > > > > File > > > "/zfsauton/home/manzilz/.local/lib/python3.6/site- > packages/torch/_utils.py", > > > line 69, in _cuda > > > > return new_type(self.size()).copy_(self, async) > > > > File > > > "/zfsauton/home/manzilz/.local/lib/python3.6/site- > packages/torch/cuda/__init__.py", > > > line 384, in _lazy_new > > > > _lazy_init() > > > > File > > > "/zfsauton/home/manzilz/.local/lib/python3.6/site- > packages/torch/cuda/__init__.py", > > > line 142, in _lazy_init > > > > torch._C._cuda_init() > > > > RuntimeError: cuda runtime error (30) : unknown error at > > > /pytorch/torch/lib/THC/THCGeneral.c:70 > > > > > > > > Thanks, > > > > Manzil > > > > > > > > ________________________________________ > > > > From: Predrag Punosevac > > > > Sent: 26 March 2018 22:50 > > > > To: Manzil Zaheer > > > > Cc: Barnabas Poczos; users at autonlab.org > > > > Subject: Re: PyTorch > > > > > > > > Manzil Zaheer wrote: > > > > > > > > > Thanks for the detailed analysis. But I am using pytorch. I have > not > > > tried Lua torch. Can you please check? Thanks again! > > > > > > > > > > > > > I did. You have Python 3.6.4 in /opt/miniconda3/bin/python3.6 > > > > > > > > predrag at gpu3$ /opt/miniconda3/bin/python3.6 > > > > Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19) > > > > [GCC 7.2.0] on linux > > > > Type "help", "copyright", "credits" or "license" for more > information. > > > > > > > > > > > > Try reinstalling thing in your scratch directory as > > > > > > > > /opt/miniconda3/bin/conda install pytorch torchvision cuda91 -c > pytorch > > > > > > > > You should see something like > > > > > > > > The following packages will be downloaded: > > > > > > > > package | build > > > > ---------------------------|----------------- > > > > pillow-5.0.0 | py36h3deb7b8_0 561 KB > > > > mkl-2018.0.2 | 1 205.2 MB > > > > cuda91-1.0 | h4c16780_0 3 KB > > > > pytorch > > > > libpng-1.6.34 | hb9fc6fc_0 334 KB > > > > freetype-2.8 | hab7d2ae_1 804 KB > > > > libgfortran-ng-7.2.0 | hdf63c60_3 1.2 MB > > > > intel-openmp-2018.0.0 | 8 620 KB > > > > libtiff-4.0.9 | h28f6b97_0 586 KB > > > > pytorch-0.3.1 |py36_cuda9.1.85_cudnn7.0.5_2 > 475.0 > > > > MB pytorch > > > > torchvision-0.2.0 | py36h17b6947_1 102 KB > > > > pytorch > > > > jpeg-9b | h024ee3a_2 248 KB > > > > numpy-1.14.2 | py36hdbf6ddf_0 4.0 MB > > > > olefile-0.45.1 | py36_0 47 KB > > > > ------------------------------------------------------------ > > > > Total: 688.7 MB > > > > > > > > > > > > Make sure you put your scratch as a path since file server is full. I > > > > got clean installation but I didn't play further. One thing that > worries > > > > me is this line > > > > > > > > pytorch-0.3.1 |py36_cuda9.1.85_cudnn7.0.5_2 475.0 > MB > > > > pytorch > > > > > > > > We had problems with cudnn on 9.1 apparently because the upstream was > > > > assuming 7.0.5 when in reality I have 7.1.1 CUDA 9 or even 7.1.5. > CUDA > > > > 9.1 > > > > > > > > GPU3 has CUDNN library 7.0.5 in cuda-9.0 so try adjusting conda > command > > > > accordingly. > > > > > > > > > > > > Best, > > > > Predrag > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sent from my Samsung Galaxy smartphone. 
> > > > > > > > > > -------- Original message -------- > > > > > From: Predrag Punosevac > > > > > Date: 3/26/18 9:00 PM (GMT-05:00) > > > > > To: Manzil Zaheer > > > > > Cc: Barnabas Poczos , users at autonlab.org > > > > > Subject: Re: Lua Torch > > > > > > > > > > Manzil Zaheer wrote: > > > > > > > > > > > Hi Predrag, > > > > > > > > > > > > I am not able to use any GPUSs on gpu5,6,7,9. I tried all 3 > versions > > > of cuda, but I get the following error: > > > > > > > > > > > > > > > > > > > > > I was able to build it after adding this > > > > > > > > > > export TORCH_NVCC_FLAGS="-D__CUDA_NO_HALF_OPERATORS__" > > > > > > > > > > per > > > > > > > > > > https://github.com/torch/torch7/issues/1086 > > > > > > > > > > When I try to run it I get errors that Lua packages are missing > > > (probably > > > > > due to my path variables). I have a vague recollection that Simon > and I > > > > > helped you once with this thing in the past. IIRC it was very picky > > > about > > > > > the version of some Lua package and required their version not the > one > > > > > which comes with yum . > > > > > > > > > > Anyhow I am forwarding this to users at autonlab in hope somebody is > > > using > > > > > it and might be of more help. Please stop by NSH 3119 and let us > try to > > > > > debug this. > > > > > > > > > > Predrag > > > > > > > > > > > > > > > > > > > > > > > > > THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c > line=70 > > > error=30 : unknown error > > > > > > Traceback (most recent call last): > > > > > > File "", line 1, in > > > > > > File > > > "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/ torch/cuda/__init__.py", > > > line 384, in _lazy_new > > > > > > _lazy_init() > > > > > > File > > > "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/ torch/cuda/__init__.py", > > > line 142, in _lazy_init > > > > > > torch._C._cuda_init() > > > > > > RuntimeError: cuda runtime error (30) : unknown error at > > > /pytorch/torch/lib/THC/THCGeneral.c:70 > > > > > > > > > > > > Can you kindly look into it? > > > > > > > > > > > > Thanks, > > > > > > Manzil > > > > > > > -- Barun Patra Master's Student Machine Learning Department Carnegie Mellon University -------------- next part -------------- An HTML attachment was scrubbed... URL:

From predragp at andrew.cmu.edu Wed Mar 28 18:58:49 2018 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Wed, 28 Mar 2018 18:58:49 -0400 Subject: NVidia driver broke GPUs In-Reply-To: References: <295ba51bedc34351966390695e2ad973@PGH-MSGMLT-02.andrew.ad.cmu.edu> <20180327010021.WzisjuLUh%predragp@andrew.cmu.edu> <20180327025012.PucNB2br-%predragp@andrew.cmu.edu> <1522126842790.39313@cmu.edu> <20180327053140.xM3NWbFsK%predragp@andrew.cmu.edu> <798cc89dfa1a47b691994bc96880c039@PGH-MSGMLT-03.andrew.ad.cmu.edu> Message-ID: <20180328225849.jVGyjyWSc%predragp@andrew.cmu.edu>

Barnabas Poczos wrote: > If this can't be fixed quickly, then would it be possible to do a roll > back on these GPU machines (5,6,9) to the latest state when they > worked fine? > (If I know correctly, they are down since March 23.) > > Sorry for bugging you with this, I just want to find a quick solution > to make these 12 GPU cards usable again with pytorch and tensorflow > because several deadlines are coming. > > Many thanks! ... and sorry for annoying you with this! >

Ok, Yotam and I spent the last 3-4h debugging this. It is not a PyTorch nor a TensorFlow issue. It is not even a CUDA issue. The NVidia driver itself is broken. I have no idea how it happened on some machines and didn't happen on others (all GPU machines with the exception of GPU-7 run the same latest Red Hat 3.10.0-693.21.1.el7.x86 kernel). The clue should have been the fact that MATLAB also got broken on some machines. My hunch is that the NVidia driver gets recompiled during the kernel update and apparently that is not as robust as it should be.

The plan of action is that I will try to remove everything NVidia related from the GPU9 machine and try to reinstall the driver and CUDA from scratch. Hopefully GPU9 will become functional just like GPU8. Once it works for GPU9 I can go and fix the other machines. If that doesn't work I will reinstall GPU9 from scratch.

Long story short, somebody at NVidia did a shady job with QA and we became victims. Oh, just for the record, we don't use ZFS on Linux. If I was running root on a ZFS pool, as I am doing on the file server, I could just do a beadm select of the previous working system and go back. I am not aware that Linux can do something like that, but that is what I do on FreeBSD and that is what Solaris does.

Best,
Predrag

> Cheers, > Barnabas > ====================== > Barnabas Poczos, PhD > Assistant Professor > Machine Learning Department > Carnegie Mellon University >

From bapoczos at cs.cmu.edu Wed Mar 28 19:16:21 2018 From: bapoczos at cs.cmu.edu (Barnabas Poczos) Date: Wed, 28 Mar 2018 19:16:21 -0400 Subject: NVidia driver broke GPUs In-Reply-To: <8588f843f78646378e50557740100683@PGH-MSGMLT-01.andrew.ad.cmu.edu> References: <295ba51bedc34351966390695e2ad973@PGH-MSGMLT-02.andrew.ad.cmu.edu> <20180327010021.WzisjuLUh%predragp@andrew.cmu.edu> <20180327025012.PucNB2br-%predragp@andrew.cmu.edu> <1522126842790.39313@cmu.edu> <20180327053140.xM3NWbFsK%predragp@andrew.cmu.edu> <798cc89dfa1a47b691994bc96880c039@PGH-MSGMLT-03.andrew.ad.cmu.edu> <8588f843f78646378e50557740100683@PGH-MSGMLT-01.andrew.ad.cmu.edu> Message-ID:

Thanks Predrag and Yotam for your help working on this!

Best, Barnabas ====================== Barnabas Poczos, PhD Assistant Professor Machine Learning Department Carnegie Mellon University

On Wed, Mar 28, 2018 at 6:58 PM, Predrag Punosevac wrote: > Barnabas Poczos wrote: > >> If this can't be fixed quickly, then would it be possible to do a roll >> back on these GPU machines (5,6,9) to the latest state when they >> worked fine? >> (If I know correctly, they are down since March 23.) >> >> Sorry for bugging you with this, I just want to find a quick solution >> to make these 12 GPU cards usable again with pytorch and tensorflow >> because several deadlines are coming. >> >> Many thanks! ... and sorry for annoying you with this! >> > > Ok, Yotam and I spent the last 3-4h debugging this. It is not a PyTorch nor > a TensorFlow issue. It is not even a CUDA issue. The NVidia driver itself is > broken. I have no idea how it happened on some machines and didn't > happen on others (all GPU machines with the exception of GPU-7 run the > same latest Red Hat 3.10.0-693.21.1.el7.x86 kernel). The clue should > have been the fact that MATLAB also got broken on some machines. > My hunch is that the NVidia driver gets recompiled during the kernel update > and apparently that is not as robust as it should be. > > The plan of action is that I will try to remove everything NVidia > related from the GPU9 machine and try to reinstall the driver and CUDA from > scratch. Hopefully GPU9 will become functional just like GPU8. Once it > works for GPU9 I can go and fix the other machines. If that doesn't work I > will reinstall GPU9 from scratch.
> > Long story short, somebody at NVidia did a shady job with QA and we > became victims. Oh, just for the record, we don't use ZFS on Linux. If I > was running root on a ZFS pool, as I am doing on the file server, I > could just do a beadm select of the previous working system and go back. I am > not aware that Linux can do something like that, but that is what I do on > FreeBSD and that is what Solaris does. > > > Best, > Predrag > > > > >> Cheers, >> Barnabas >> ====================== >> Barnabas Poczos, PhD >> Assistant Professor >> Machine Learning Department >> Carnegie Mellon University >>

From predragp at andrew.cmu.edu Wed Mar 28 23:15:06 2018 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Wed, 28 Mar 2018 23:15:06 -0400 Subject: NVidia driver broke GPUs In-Reply-To: References: <295ba51bedc34351966390695e2ad973@PGH-MSGMLT-02.andrew.ad.cmu.edu> <20180327010021.WzisjuLUh%predragp@andrew.cmu.edu> <20180327025012.PucNB2br-%predragp@andrew.cmu.edu> <1522126842790.39313@cmu.edu> <20180327053140.xM3NWbFsK%predragp@andrew.cmu.edu> <798cc89dfa1a47b691994bc96880c039@PGH-MSGMLT-03.andrew.ad.cmu.edu> <8588f843f78646378e50557740100683@PGH-MSGMLT-01.andrew.ad.cmu.edu> Message-ID: <20180329031506.IcZISPvTL%predragp@andrew.cmu.edu>

Dear Autonians,

I have another update on the NVidia driver issue. I have actually reinstalled the driver and CUDA-9.0 on GPU9 but the issue is still here. Please see the detailed report below. I have seen a few people reporting this very stupidity with NVidia hardware. Their solution is a cold reboot. I have rebooted this machine multiple times, but every time remotely with the reboot command. That is a so-called soft reboot, where the power actually never gets completely cut off. Tomorrow I will go to the machine room, turn the machine off for 10 minutes, and bring it back online. We will see if that helps.

Predrag

root at gpu9$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  390.30  Wed Jan 31 22:08:49 PST 2018
GCC version:  gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC)

root at gpu9$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

root at gpu9$ nvidia-smi
Wed Mar 28 23:13:13 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:02:00.0 Off |                  N/A |
| 23%   40C    P0    61W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            Off  | 00000000:03:00.0 Off |                  N/A |
| 24%   43C    P0    61W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            Off  | 00000000:82:00.0 Off |                  N/A |
| 23%   40C    P0    62W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN Xp            Off  | 00000000:83:00.0 Off |                  N/A |
| 23%   42C    P0    62W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

root at gpu9$ ls
deviceQuery  deviceQuery.cpp  deviceQuery.o  Makefile  NsightEclipse.xml  readme.txt

root at gpu9$ ./deviceQuery
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 30
-> unknown error
Result = FAIL

> On Wed, Mar 28, 2018 at 6:58 PM, Predrag Punosevac > wrote: > > Barnabas Poczos wrote: > > >> If this can't be fixed quickly, then would it be possible to do a roll > >> back on these GPU machines (5,6,9) to the latest state when they > >> worked fine? > >> (If I know correctly, they are down since March 23.) > >> > >> Sorry for bugging you with this, I just want to find a quick solution > >> to make these 12 GPU cards usable again with pytorch and tensorflow > >> because several deadlines are coming. > >> > >> Many thanks! ... and sorry for annoying you with this! > >> > > > > Ok, Yotam and I spent the last 3-4h debugging this. It is not a PyTorch nor > > a TensorFlow issue. It is not even a CUDA issue. The NVidia driver itself is > > broken. I have no idea how it happened on some machines and didn't > > happen on others (all GPU machines with the exception of GPU-7 run the > > same latest Red Hat 3.10.0-693.21.1.el7.x86 kernel). The clue should > > have been the fact that MATLAB also got broken on some machines. > > My hunch is that the NVidia driver gets recompiled during the kernel update > > and apparently that is not as robust as it should be. > > > > The plan of action is that I will try to remove everything NVidia > > related from the GPU9 machine and try to reinstall the driver and CUDA from > > scratch. Hopefully GPU9 will become functional just like GPU8. Once it > > works for GPU9 I can go and fix the other machines. If that doesn't work I > > will reinstall GPU9 from scratch.
> > > > > > Best, > > Predrag > > > > > > > >> Cheers, >> Barnabas >> ====================== >> Barnabas Poczos, PhD >> Assistant Professor >> Machine Learning Department >> Carnegie Mellon University >>

From awd at andrew.cmu.edu Thu Mar 29 03:33:55 2018 From: awd at andrew.cmu.edu (Artur Dubrawski) Date: Thu, 29 Mar 2018 03:33:55 -0400 Subject: Traffic Jam helps good people do good things Message-ID:

See our Artificial Intelligence Expert at work, making a difference: https://dms.licdn.com/playback/C4E05AQF3_HZSNDO8vw/05ccd005919445b4b6228a5c7c905b42/feedshare-mp4_500/1479932728445-v0ch3x?e=1522396800&v=alpha&t=rJnRwa84uB2jPFmPP8XoBUTViphAES0aNJiAGGMGKyU

It is a very cool video even though CMU Auton Lab or Traffic Jam software are not mentioned by name. Yet, Deliver Fund are important partners who do things our software and our analysts do not do: physically face the criminals and physically pull the sex trafficking victims off their nasty hands. They use Traffic Jam to prioritize and plan their field activity, and to train law enforcement officers to do the same in their respective jurisdictions.

Cheers, Artur -------------- next part -------------- An HTML attachment was scrubbed... URL:

From predragp at andrew.cmu.edu Thu Mar 29 14:44:25 2018 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Thu, 29 Mar 2018 14:44:25 -0400 Subject: GPU8 In-Reply-To: References: Message-ID: <20180329184425.atDL-wIha%predragp@andrew.cmu.edu>

Yotam Hechtlinger wrote: > Hello Predrag, > > There might be a bug with GPU8 also. > I didn't have time to test it yet, but python crashes when trying to call > keras.

I did a cold reboot. It didn't help. I think what we see is a bug in driver 390.30. The bug could be Titan Xp specific; that is why we see the older machines working. NVidia has a website where one can download the scripts to recompile the latest driver. I think the latest driver is 390.48, which is quite a few versions ahead of 390.30. I am installing it right now on GPU9. If that doesn't work I will try downgrading the kernel, on the assumption that it is a kernel bug. The following kernels are available:

kernel.x86_64  3.10.0-693.5.2.el7   @updates
kernel.x86_64  3.10.0-693.11.6.el7  @updates
kernel.x86_64  3.10.0-693.21.1.el7

Right now I am running 3.10.0-693.21.1 but we can try to go one or even two kernels back.

If all that fails I still have a few magic tricks in my hat but they are related to motherboard firmware. GPU8 and GPU9 have the same motherboards, but the other servers do not. (A quick smoke test is in the P.S. at the bottom of this message.)

Best,
Predrag

> Unlike GPU 5,6 & 9, you can actually get the GPU working, but when I run a > keras prediction function it crashes and says: > > Loaded runtime CuDNN library: 7101 (compatibility version 7100) but source > was compiled with 7004 (compatibility version 7000). If using a binary > install, upgrade your CuDNN library to match. If building from sources, > make sure the library loaded at runtime matches a compatible version > specified during compile configuration. > 2018-03-29 09:57:49.807855: F tensorflow/core/kernels/conv_ops.cc:717] > Check failed: stream->parent()->GetConvolveAlgorithms( > conv_parameters.ShouldIncludeWinogradNonfusedAlgo(), &algorithms) > > Same code works on GPU4. > I know this is not informative, I'll look into it later, just wanted to > give you a heads up. > I think this might be why there aren't any users on GPU8 but there are on > GPU4. > > Thanks, > Yotam.
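P.S. For anyone who wants to check whether a given node is affected before e-mailing me, a minimal smoke test along these lines should do. This is only a sketch: it assumes the system miniconda in /opt/miniconda3 and the stock TensorFlow/PyTorch installs, so adjust the interpreter path to whatever you actually use.

# PyTorch: should print a zeroed 2x3x4 cuda tensor, not a THCudaCheck FAIL
/opt/miniconda3/bin/python3.6 -c "import torch; print(torch.cuda.FloatTensor(2,3,4).zero_())"

# TensorFlow: should list the GPUs instead of failing with CUDA_ERROR_UNKNOWN
/opt/miniconda3/bin/python3.6 -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"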
From predragp at andrew.cmu.edu Thu Mar 29 15:25:40 2018 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Thu, 29 Mar 2018 15:25:40 -0400 Subject: GPU problem fixed! In-Reply-To: <20180329184425.atDL-wIha%predragp@andrew.cmu.edu> References: <20180329184425.atDL-wIha%predragp@andrew.cmu.edu> Message-ID: <20180329192540.j26RKf_wI%predragp@andrew.cmu.edu>

Dear Autonians,

This is now fixed! Apparently we hit a serious driver bug with 390.30. Please try now to compile TensorFlow and PyTorch on GPU9.

Predrag Punosevac

Peer access from TITAN Xp (GPU0) -> TITAN Xp (GPU1) : Yes
Peer access from TITAN Xp (GPU0) -> TITAN Xp (GPU2) : No
Peer access from TITAN Xp (GPU0) -> TITAN Xp (GPU3) : No
Peer access from TITAN Xp (GPU1) -> TITAN Xp (GPU0) : Yes
Peer access from TITAN Xp (GPU1) -> TITAN Xp (GPU2) : No
Peer access from TITAN Xp (GPU1) -> TITAN Xp (GPU3) : No
Peer access from TITAN Xp (GPU2) -> TITAN Xp (GPU0) : No
Peer access from TITAN Xp (GPU2) -> TITAN Xp (GPU1) : No
Peer access from TITAN Xp (GPU2) -> TITAN Xp (GPU3) : Yes
Peer access from TITAN Xp (GPU3) -> TITAN Xp (GPU0) : No
Peer access from TITAN Xp (GPU3) -> TITAN Xp (GPU1) : No
Peer access from TITAN Xp (GPU3) -> TITAN Xp (GPU2) : Yes
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 9.1, NumDevs = 4
Result = PASS

I will ping you with a plan of action as soon as Kyle and I stop dancing. We are kind of in a celebratory mood right now. We will have to first fix the servers with higher numbers, which have Titan Xp cards and newer motherboards, before moving to the lower-number servers with older GPU cards.

Predrag

> Yotam Hechtlinger wrote: > > Hello Predrag, > > > > There might be a bug with GPU8 also. > > I didn't have time to test it yet, but python crashes when trying to call > > keras. > > I did a cold reboot. It didn't help. I think what we see is a bug in > driver 390.30. The bug could be Titan Xp specific; that is why we see the > older machines working. NVidia has a website where one can download the > scripts to recompile the latest driver. I think the > latest driver is 390.48, which is quite a few versions ahead of 390.30. > I am installing it right now on GPU9. If that doesn't work I will try > downgrading the kernel, on the assumption that it is a kernel bug. The > following kernels are available: > > kernel.x86_64 3.10.0-693.5.2.el7 @updates > kernel.x86_64 3.10.0-693.11.6.el7 @updates > kernel.x86_64 3.10.0-693.21.1.el7 > > Right now I am running 3.10.0-693.21.1 but we can try to go one or even > two kernels back. > > If all that fails I still have a few magic tricks in my hat but they are > related to motherboard firmware. GPU8 and GPU9 have the same > motherboards, but the other servers do not. > > Best, > Predrag > > > > Unlike GPU 5,6 & 9, you can actually get the GPU working, but when I run a > > keras prediction function it crashes and says: > > > > Loaded runtime CuDNN library: 7101 (compatibility version 7100) but source > > was compiled with 7004 (compatibility version 7000). If using a binary > > install, upgrade your CuDNN library to match. If building from sources, > > make sure the library loaded at runtime matches a compatible version > > specified during compile configuration. > > 2018-03-29 09:57:49.807855: F tensorflow/core/kernels/conv_ops.cc:717] > > Check failed: stream->parent()->GetConvolveAlgorithms( > > conv_parameters.ShouldIncludeWinogradNonfusedAlgo(), &algorithms) > > > > Same code works on GPU4.
> > I know this is not informative, I'll look into it later, just wanted to > > give you a heads up. > > I think this might be why there aren't any users on GPU8 but there are on > > GPU4. > > > > Thanks, > > Yotam.

From predragp at andrew.cmu.edu Thu Mar 29 16:44:16 2018 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Thu, 29 Mar 2018 16:44:16 -0400 Subject: GPU status update Message-ID: <20180329204416.iK3izftKV%predragp@andrew.cmu.edu>

Dear Autonians,

The NVidia driver is now updated to 390.48 on GPU5, GPU6, GPU8, and GPU9. There are no other machines (GPU 7 is treated separately due to its current use) with Titan Xp cards. Titan X cards were unaffected by the driver bug in 390.30 according to initial reports.

I can use the GPU from MATLAB on GPU5, 6, 8, and 9. CUDA-8 is removed from all those servers. CUDA-9 and CUDA-9.1 are there. Servers should default to cuda-9.0 due to the fact that TensorFlow and PyTorch are not released for 9.1.

I really need people to test this now. Please make sure your local paths and library links are fixed before e-mailing me (an example is at the end of this message).

People who need that proprietary Intel Library or cuDNN will have to wait until we get this right so that all GPU servers have basic functionality. As you can see there are a lot of moving parts in these servers and they don't quite act like computers you can buy in Walmart.

MATLAB was removed previously from GPU1 and GPU2 due to the lack of space. I will be putting it back shortly. I will put the latest 2018a release.
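For reference, fixing the local paths usually amounts to something like this in your shell startup file. This is only a sketch -- the /usr/local/cuda-9.0 prefix is my assumption, so check where cuda-9.0 actually lives on the node you use:

# put the CUDA 9.0 toolchain (nvcc etc.) on the PATH
export PATH=/usr/local/cuda-9.0/bin:$PATH

# and its runtime libraries on the dynamic linker path
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64:$LD_LIBRARY_PATH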
Predrag From predragp at andrew.cmu.edu Thu Mar 29 17:30:48 2018 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Thu, 29 Mar 2018 17:30:48 -0400 Subject: Migrated SVN repos to Git Message-ID: <20180329213048.Sdps47gJB%predragp@andrew.cmu.edu> -------- Original Message -------- Date: Thu, 29 Mar 2018 17:29:12 -0400 From: Predrag Punosevac To: donghanw at cs.cmu.edu Subject: Re: Migrated SVN repos to Git Donghan Wang wrote: > Hi Predrag, > > I migrated all 74 SVN repos to Git. They are available on Gogs at > http://git.int.autonlab.org/SVN. > Good job! We will continue to make CVS and SVN visible through VIEWVC for historical reasons but nobody should really use that stuff. They are read only anyway. > There are two giant repos > > - 1.3GB SVN/prateekt > - 3.3GB SVN/radiation_hunter > > Do you see any problems with them? > > The second question is how to set up the Gogs permission correctly so that > people can access them? Maybe something similar to http://git.int.autonlab. > org/C? Gogs is plugged into the LDAP so anybody with a valid Auton Lab account can log into the Gogs interface from one of internal machines (X2Go needs to be used for external access) upload her/his ssh key and just use the thing with ssh or via http. http://git.int.autonlab.org/user/login If you want to hide some repositories from praying eyes make them private. Gogs support the same security paradigm like GitHub. Owner of the repo should decide if they want repo public. Predrag > > Thanks, > Jarod From bapoczos at cs.cmu.edu Thu Mar 29 18:48:52 2018 From: bapoczos at cs.cmu.edu (Barnabas Poczos) Date: Thu, 29 Mar 2018 18:48:52 -0400 Subject: GPU status update In-Reply-To: <20180329204416.iK3izftKV%predragp@andrew.cmu.edu> References: <20180329204416.iK3izftKV%predragp@andrew.cmu.edu> Message-ID: Awesome! Many thanks Predrag for fixing these machines! Best, Barnabas ====================== Barnabas Poczos, PhD Assistant Professor Machine Learning Department Carnegie Mellon University On Thu, Mar 29, 2018 at 4:44 PM, Predrag Punosevac wrote: > Dear Autonians, > > The NVidia driver is now updated to 390.48 on GPU5, GPU6, GPU8, GPU9 > There no other machines (GPU 7 is treated separately due to its current > use) with Titan Xp cards. Titan X crads were unaffected by a driver bug > in 930.30 according to intial reports. > > I can use GPU from the MATLAB on GPU5, 6, 8, 9. CUDA-8 is removed from > all those servers. CUDA-9 and CUDA-9.1 are there. Server should default > to cuda-9.0 due to the fact that TensorFlow and PyTorch are not released > for 9.1. > > I really need people to test this now. Please make sure you local paths > and library links are fixed before e-mailing me. > > People who need that proprietary Intel Library or cuDNN will have to > wait until we get this right so that all GPU servers have basic > functionality. As you can see there are lot of moving parts in these > servers and they don't quite act like computers you can buy in Wallmart. > > > MATLAB was removed previously from GPU1 and GPU2 due to the lack of > space. I will be putting it as shortly. I will put the latest 2018a > release. > > > > Predrag