From chiragn at cs.cmu.edu Sun Mar 4 23:27:45 2018
From: chiragn at cs.cmu.edu (Chirag Nagpal)
Date: Sun, 4 Mar 2018 23:27:45 -0500
Subject: numpy configuration
Message-ID:

Hi all!

I need help with configuring numpy with OpenBLAS. Specifically, I'm using numpy on lov5, but matrix operations seem to use just one core. Explicitly linking numpy to OpenBLAS should alleviate this. I need pointers on where OpenBLAS is located in the OS and on the correct way to link it.

Thanks

Chirag

--
*Chirag Nagpal*
Graduate Student, Language Technologies Institute
School of Computer Science
Carnegie Mellon University
cs.cmu.edu/~chiragn
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
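For anyone hitting the same single-core behavior: a quick way to check which BLAS numpy was actually built against, and whether OpenBLAS threading kicks in, is sketched below (the interpreter path and the thread count are assumptions; adjust them to your own setup):

    # look for openblas_info in the output
    python -c 'import numpy; numpy.show_config()'

    # OpenBLAS takes its thread count from the environment
    OPENBLAS_NUM_THREADS=8 python -c 'import numpy as np; a = np.random.rand(4000, 4000); print((a @ a).sum())'

If show_config() only reports a generic blas_info, numpy was built against the reference BLAS and will stay single-threaded no matter what the environment says.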
From predragp at andrew.cmu.edu Tue Mar 6 17:12:18 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Tue, 06 Mar 2018 17:12:18 -0500
Subject: GPU2 hard-rebooted due to ...
Message-ID: <20180306221218.iPoMnuEmR%predragp@andrew.cmu.edu>

Dear Autonians,

Somebody ran GPU2 into the ground. Memory including swap was 100% loaded. I had to hard-reboot the machine. I am fixing it right now. It should not take more than 30 minutes. Please don't start anything until the machine is fully ready.

Predrag

From ngisolfi at cmu.edu Thu Mar 8 09:37:52 2018
From: ngisolfi at cmu.edu (Nick Gisolfi)
Date: Thu, 8 Mar 2018 09:37:52 -0500
Subject: [hackAuton] weekend scheduling and shirt sizes
Message-ID:

Hi Everyone,

We will need all hands on deck to help participants on the weekend of the hackAuton, April 6-8. Please plan on participating. I have two links I need everyone to fill out...

Link 1 (volunteer sign ups and choose your event t-shirt size): https://hackauton.com/volunteer

Link 2 (specify times you can be available...select as many as possible): https://doodle.com/poll/nkgerss2dsc9bxmn

I will create a formal schedule once everyone signs up and I get a better picture of the number of participants we will have. We have 30 participants registered at the moment!

- Nick
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From awd at cs.cmu.edu Sat Mar 10 13:48:36 2018
From: awd at cs.cmu.edu (Artur Dubrawski)
Date: Sat, 10 Mar 2018 13:48:36 -0500
Subject: as we've just celebrated the International Women's Day - check this out :)
Message-ID: <066a486a-2495-b8ec-e79d-0d125ccd9ce6@cs.cmu.edu>

https://www.girlboss.com/girlboss/2018/3/7/female-tech-founders?lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3B3cCJCEwsTPmW3EE4pD5gRA%3D%3D

Happy (belated) Women's Day to all our cherished Women Autonians!

Artur

From bpatra at andrew.cmu.edu Wed Mar 14 15:58:17 2018
From: bpatra at andrew.cmu.edu (Barun Patra)
Date: Wed, 14 Mar 2018 15:58:17 -0400
Subject: Failed to initialize NVML: Driver/library version mismatch
Message-ID:

Hi,

Are any of you facing the same issue?

Failed to initialize NVML: Driver/library version mismatch

The last time this issue occurred, I think rebooting fixed the issue.

Thanks for the help!
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
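The "Failed to initialize NVML: Driver/library version mismatch" error generally means the nvidia kernel module still loaded in memory is older than the user-space driver library on disk, which is typical right after a driver upgrade; a reboot (or a module reload) clears it. A quick check, as a sketch:

    # driver version of the kernel module currently loaded
    cat /proc/driver/nvidia/version

    # driver version of the module installed on disk
    modinfo nvidia | grep '^version'

    # if the two differ, the node needs a reboot to clear the NVML mismatch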
From sheath at andrew.cmu.edu Thu Mar 15 13:47:35 2018
From: sheath at andrew.cmu.edu (Simon Heath)
Date: Thu, 15 Mar 2018 13:47:35 -0400
Subject: Failed to initialize NVML: Driver/library version mismatch
In-Reply-To:
References:
Message-ID:

I'll happily help you with this but I need more information. Which GPU node are you on? What are you trying to do? When did this problem start? Is there an easy way I can reproduce it for troubleshooting?

Thanks,
Simon

On Wed, Mar 14, 2018 at 3:58 PM, Barun Patra wrote:

> Hi,
> Are any of you facing the same issue?
> Failed to initialize NVML: Driver/library version mismatch
>
> The last time this issue occurred, I think rebooting fixed the issue.
>
> Thanks for the help!
>

--
Simon Heath, Research Programmer and Analyst
Robotics Institute - Auton Lab
Carnegie Mellon University
sheath at andrew.cmu.edu
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ngisolfi at cs.cmu.edu Mon Mar 19 12:05:03 2018
From: ngisolfi at cs.cmu.edu (Nick Gisolfi)
Date: Mon, 19 Mar 2018 12:05:03 -0400
Subject: [hackAuton] Please sign up for PSC account
Message-ID:

Hi Everyone,

The Pittsburgh Supercomputing Center is supplying the computational power for our hackAuton! Please create an account (http://portal.xsede.org) to help test the environment, and reply to me (not the entire list) if you do decide to make an account so I can add you to the proper allocation (takes about 24-48 hours).

We have access to both CPU and GPU nodes. Our job now is to make sure that common software libraries are installed and operating smoothly before the hackAuton. PSC doesn't have a lot of collaborations with AI folks, so we may need to ask them to add a few modules to their servers. Try running a few small experiments and please help identify what needs to be added/fixed.

Thanks!

- Nick
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
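PSC systems manage installed software through environment modules, so checking what is already there is a one-liner. A sketch (the module names are assumptions; they vary from system to system):

    module avail 2>&1 | grep -i -e cuda -e python -e tensorflow
    module load cuda        # hypothetical module name; use whatever 'module avail' lists
    module list             # confirm what is loaded in the current shell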
From ngisolfi at cs.cmu.edu Mon Mar 19 12:10:46 2018
From: ngisolfi at cs.cmu.edu (Nick Gisolfi)
Date: Mon, 19 Mar 2018 12:10:46 -0400
Subject: [hackAuton] Need more volunteers April 7&8
Message-ID:

Hi Everyone,

https://doodle.com/poll/nkgerss2dsc9bxmn

We need a few more volunteers for the weekend of the hackAuton. Right now we have 11 people signed up (thank you!!) but we do not have enough volunteers for Saturday and Sunday. Right now there are 46 registered participants. I estimate this number will grow significantly (close to 100) once email announcements go out today. We want to have Autonians present in full force at the event to show the depth of our lab. Please sign up to help out if you have not already.

Thank you!

- Nick
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From awd at cs.cmu.edu Tue Mar 20 09:50:02 2018
From: awd at cs.cmu.edu (Artur Dubrawski)
Date: Tue, 20 Mar 2018 09:50:02 -0400
Subject: Auton Lab postdoc candidate job talk: Wednesday March 28, 11am, NSH 4119
Message-ID:

Team,

We will have a Skype presentation on Wednesday next week given by Bo Wu of the Chinese Academy of Sciences, who is seeking a post-doctoral position with the Auton Lab. Please see below the title/abstract and bio of the speaker, and please join us for the talk!

Cheers
Artur

---

Title: Temporal Learning and Prediction

Abstract: Time-aware scenarios are ubiquitous, and temporal learning and prediction are motivated by a wide range of applications that depend on dynamic platforms or systems (e.g., diffusion in marketing, pricing in ads). In domains as diverse as consumption, finance, entertainment, and transportation, we observe a fundamental shift away from discrete, infrequent data to nearly continuous monitoring and recording. Temporal modeling of dynamic signals, behaviors, and information is therefore a novel and prevalent research topic. Meanwhile, as an important platform where users share and spread information at any time, social media offers a good opportunity to study temporal social signals such as post popularity and user interests over time. We treat future popularity prediction as our research problem, and our work investigates temporal learning and prediction techniques for sequential and time-series data. Unlike previous prediction algorithms, our work studies multiple temporal-view prediction problems for social media popularity, comprising dynamic factorization prediction, specific-time prediction, and time-series prediction. From the inner to the sequential, and from the implicit to the explicit, these approaches progressively model the influence of previous user sharing behaviors on future popularity. Moreover, we show that temporal learning and prediction are effective, as evaluated by experiments on social media popularity prediction with a large dataset, and we are also trying to apply the proposed temporal modeling approaches to other problems.

Short Bio: Bo Wu received his Ph.D. from the Chinese Academy of Sciences (Institute of Computing Technology), Beijing, China. His current research interests are temporal machine learning, deep learning, computer vision, and social multimedia. He has over two years of research experience at Microsoft Research Asia and one year at Academia Sinica. He has authored several papers at top conferences and in journals (ACM MM, AAAI, IJCAI, TKDE, etc.), and has been invited as a reviewer or TPC member for IEEE TKDE, IEEE TMM, ACM Multimedia, SIGIR, ICIP, etc. He is a co-organizer of the ACM Multimedia Challenge 2017. He has received several awards, including the Turing 50th Student Scholarship, an Innovation Research Award, a Ph.D. Student Research Award, and Top 1% in the Global Recommendation Challenge.

From ngisolfi at cs.cmu.edu Thu Mar 22 13:13:37 2018
From: ngisolfi at cs.cmu.edu (Nick Gisolfi)
Date: Thu, 22 Mar 2018 13:13:37 -0400
Subject: [hackAuton] Cookies in NSH 3111
Message-ID:

Hi All,

There are chocolate chip cookies in NSH 3111!

- Nick
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From predragp at andrew.cmu.edu Thu Mar 22 13:51:58 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Thu, 22 Mar 2018 13:51:58 -0400
Subject: Driver/library version mismatch on gpu nodes
In-Reply-To:
References:
Message-ID: <20180322175158.IaQkucHzu%predragp@andrew.cmu.edu>

Michael Andrews wrote:

> Hi Predrag,
>
> There seems to be a driver/library mismatch on some of the gpu nodes (e.g.
> gpu3, gpu4):
>
> $ nvidia-smi
> Failed to initialize NVML: Driver/library version mismatch
>

Unfortunately the machines will have to be rebooted to clear that. I will do it today at 5:00 PM.

Predrag

> Could you have a look when you get a chance?
>
> Thanks,
> Michael

From mbandrews at cmu.edu Thu Mar 22 15:47:08 2018
From: mbandrews at cmu.edu (Michael Andrews)
Date: Thu, 22 Mar 2018 15:47:08 -0400
Subject: Driver/library version mismatch on gpu nodes
In-Reply-To: <20180322175158.IaQkucHzu%predragp@andrew.cmu.edu>
References: <20180322175158.IaQkucHzu%predragp@andrew.cmu.edu>
Message-ID:

Thanks!
Michael

On Thu, Mar 22, 2018 at 1:51 PM, Predrag Punosevac wrote:

> Michael Andrews wrote:
>
> > Hi Predrag,
> >
> > There seems to be a driver/library mismatch on some of the gpu nodes
> > (e.g. gpu3, gpu4):
> >
> > $ nvidia-smi
> > Failed to initialize NVML: Driver/library version mismatch
> >
>
> Unfortunately the machines will have to be rebooted to clear that. I
> will do it today at 5:00 PM.
>
> Predrag
>
> > Could you have a look when you get a chance?
> >
> > Thanks,
> > Michael
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From predragp at andrew.cmu.edu Thu Mar 22 19:22:21 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Thu, 22 Mar 2018 19:22:21 -0400
Subject: cudnn 7.0
In-Reply-To:
References:
Message-ID: <20180322232221.icsLz4INc%predragp@andrew.cmu.edu>

Matt Barnes wrote:

> Reminder to change the symlink on the GPUs
>
> /usr/local/cuda-9.0/lib64/libcudnn.so.7
>
> to point to libcudnn.so.7.0.5

Thank you for the reminder. I will look into this when I get back home tonight.

Predrag

From predragp at andrew.cmu.edu Thu Mar 22 22:13:34 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Thu, 22 Mar 2018 22:13:34 -0400
Subject: cudnn 7.0
In-Reply-To:
References:
Message-ID: <20180323021334.31W2vmrKf%predragp@andrew.cmu.edu>

Matt Barnes wrote:

> Reminder to change the symlink on the GPUs
>
> /usr/local/cuda-9.0/lib64/libcudnn.so.7
>
> to point to libcudnn.so.7.0.5

Simon,

Can you explain to me what is happening here? Why do these symbolic links have to be set manually to the older version of cudnn?

Predrag

root at gpu1$ cd /usr/local/cuda-9.0/lib64
root at gpu1$ ls -l libcudnn.so.7
lrwxrwxrwx. 1 root root 17 Mar 22 18:06 libcudnn.so.7 -> libcudnn.so.7.1.1

root at gpu2$ ls -l libcudnn.so.7
lrwxrwxrwx. 1 sheath sheath 17 Nov 16 23:41 libcudnn.so.7 -> libcudnn.so.7.0.5

root at gpu3$ ls -l libcudnn.so.7
lrwxrwxrwx 1 sheath sheath 17 Nov 16 23:41 libcudnn.so.7 -> libcudnn.so.7.0.5

root at gpu4$ ls -l libcudnn.so.7
lrwxrwxrwx. 1 sheath sheath 17 Nov 16 23:41 libcudnn.so.7 -> libcudnn.so.7.0.5

root at gpu5$ ls -l libcudnn.so.7
lrwxrwxrwx. 1 root root 17 Mar 22 18:01 libcudnn.so.7 -> libcudnn.so.7.1.1

root at gpu6$ ls -l libcudnn.so.7
lrwxrwxrwx. 1 root root 17 Mar 22 18:00 libcudnn.so.7 -> libcudnn.so.7.1.1

root at gpu8$ ls -l libcudnn.so.7
lrwxrwxrwx. 1 root root 17 Mar 22 17:57 libcudnn.so.7 -> libcudnn.so.7.1.1

root at gpu9$ ls -l libcudnn.so.7
lrwxrwxrwx. 1 root root 17 Mar 22 17:53 libcudnn.so.7 -> libcudnn.so.7.1.1
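For reference, the fix being requested above amounts to repointing one symlink. A sketch (run as root on each affected node):

    cd /usr/local/cuda-9.0/lib64
    ln -sfn libcudnn.so.7.0.5 libcudnn.so.7    # -f replaces the existing link, -n operates on the link itself
    ls -l libcudnn.so.7                        # verify it now points at 7.0.5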
From mbandrews at cmu.edu Fri Mar 23 08:54:23 2018
From: mbandrews at cmu.edu (Michael Andrews)
Date: Fri, 23 Mar 2018 08:54:23 -0400
Subject: Fwd: Driver/library version mismatch on gpu nodes
In-Reply-To: <20180322232130.YbbVZl8sw%predragp@andrew.cmu.edu>
References: <20180322232130.YbbVZl8sw%predragp@andrew.cmu.edu>
Message-ID:

Forwarding for Predrag:

---------- Forwarded message ----------
From: Predrag Punosevac
Date: Thu, Mar 22, 2018 at 7:21 PM
Subject: Re: Driver/library version mismatch on gpu nodes
To: mbandrews at cmu.edu

Dear Autonians,

This turned out to be a little bigger job than originally anticipated. This is a summary of what has been done:

gpu[1-9], with the exception of GPU7 which is used to serve a client, have been upgraded to the latest 3.10.0-693.21.1.el7 kernel, including all packages.

nvidia-smi works as expected on all of those machines now.

devtool-4 tools were replaced with devtool-6, which means that on all those machines you have gcc 6.

miniconda3 (Python 3.6.3) is in /opt/miniconda3.

All these machines now have /opt/rh-git29.

MATLAB had to be removed from GPU1 and GPU2 due to a space issue. I will be reinstalling it tomorrow in a different location. As a bonus you will get MATLAB R2018a, which I am not planning to install on other servers unless requested (I will be waiting for R2018b).

For some reason MATLAB no longer works with GPUs on servers GPU5 and GPU9. Those two servers will get the same treatment as GPU1 and GPU2, and I will install the latest version of MATLAB in order to fix the problem (IIRC GPU8 and GPU9 are still only available to designated users).

Finally, Tensorflow is possibly broken due to a symlink. Please see my next e-mail. I plan to fix this later tonight.

Predrag
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From predragp at andrew.cmu.edu Fri Mar 23 09:43:27 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Fri, 23 Mar 2018 09:43:27 -0400
Subject: Fwd: Driver/library version mismatch on gpu nodes
In-Reply-To:
References: <20180322232130.YbbVZl8sw%predragp@andrew.cmu.edu>
Message-ID: <20180323134327.OQFs8bwM3%predragp@andrew.cmu.edu>

Michael Andrews wrote:

> Forwarding for Predrag:

Why did you forward me my own message?

Predrag

> ---------- Forwarded message ----------
> From: Predrag Punosevac
> Date: Thu, Mar 22, 2018 at 7:21 PM
> Subject: Re: Driver/library version mismatch on gpu nodes
> To: mbandrews at cmu.edu
>
> Dear Autonians,
>
> This turned out to be a little bigger job than originally anticipated.
> This is a summary of what has been done:
>
> gpu[1-9], with the exception of GPU7 which is used to serve a client,
> have been upgraded to the latest 3.10.0-693.21.1.el7 kernel, including
> all packages.
>
> nvidia-smi works as expected on all of those machines now.
>
> devtool-4 tools were replaced with devtool-6, which means that on all
> those machines you have gcc 6.
>
> miniconda3 (Python 3.6.3) is in /opt/miniconda3.
>
> All these machines now have /opt/rh-git29.
>
> MATLAB had to be removed from GPU1 and GPU2 due to a space issue. I
> will be reinstalling it tomorrow in a different location. As a bonus
> you will get MATLAB R2018a, which I am not planning to install on other
> servers unless requested (I will be waiting for R2018b).
>
> For some reason MATLAB no longer works with GPUs on servers GPU5 and
> GPU9. Those two servers will get the same treatment as GPU1 and GPU2,
> and I will install the latest version of MATLAB in order to fix the
> problem (IIRC GPU8 and GPU9 are still only available to designated
> users).
>
> Finally, Tensorflow is possibly broken due to a symlink. Please see my
> next e-mail. I plan to fix this later tonight.
>
> Predrag

From mbarnes1 at andrew.cmu.edu Fri Mar 23 10:38:54 2018
From: mbarnes1 at andrew.cmu.edu (Matthew Barnes)
Date: Fri, 23 Mar 2018 14:38:54 +0000
Subject: cudnn 7.0
In-Reply-To: <20180323021334.31W2vmrKf%predragp@andrew.cmu.edu>
References: <20180323021334.31W2vmrKf%predragp@andrew.cmu.edu>
Message-ID:

I've installed my own versions of CUDA and cuDNN. So things are working for me, but this is still going to be an issue for everyone else in the lab.

On Thu, Mar 22, 2018 at 10:13 PM Predrag Punosevac wrote:

> Matt Barnes wrote:
>
> > Reminder to change the symlink on the GPUs
> >
> > /usr/local/cuda-9.0/lib64/libcudnn.so.7
> >
> > to point to libcudnn.so.7.0.5
>
> Simon,
>
> Can you explain to me what is happening here? Why do these symbolic
> links have to be set manually to the older version of cudnn?
>
> Predrag
>
> root at gpu1$ cd /usr/local/cuda-9.0/lib64
> root at gpu1$ ls -l libcudnn.so.7
> lrwxrwxrwx. 1 root root 17 Mar 22 18:06 libcudnn.so.7 -> libcudnn.so.7.1.1
>
> root at gpu2$ ls -l libcudnn.so.7
> lrwxrwxrwx. 1 sheath sheath 17 Nov 16 23:41 libcudnn.so.7 -> libcudnn.so.7.0.5
> root at gpu3$ ls -l libcudnn.so.7
> lrwxrwxrwx 1 sheath sheath 17 Nov 16 23:41 libcudnn.so.7 -> libcudnn.so.7.0.5
>
> root at gpu4$ ls -l libcudnn.so.7
> lrwxrwxrwx. 1 sheath sheath 17 Nov 16 23:41 libcudnn.so.7 -> libcudnn.so.7.0.5
>
> root at gpu5$ ls -l libcudnn.so.7
> lrwxrwxrwx. 1 root root 17 Mar 22 18:01 libcudnn.so.7 -> libcudnn.so.7.1.1
>
> root at gpu6$ ls -l libcudnn.so.7
> lrwxrwxrwx. 1 root root 17 Mar 22 18:00 libcudnn.so.7 -> libcudnn.so.7.1.1
>
> root at gpu8$ ls -l libcudnn.so.7
> lrwxrwxrwx. 1 root root 17 Mar 22 17:57 libcudnn.so.7 -> libcudnn.so.7.1.1
>
> root at gpu9$ ls -l libcudnn.so.7
> lrwxrwxrwx. 1 root root 17 Mar 22 17:53 libcudnn.so.7 -> libcudnn.so.7.1.1
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From predragp at andrew.cmu.edu Fri Mar 23 14:50:58 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Fri, 23 Mar 2018 14:50:58 -0400
Subject: Driver/library version mismatch on gpu nodes
In-Reply-To:
References: <20180322175158.IaQkucHzu%predragp@andrew.cmu.edu> <20180322191116.gT7vCEaxh%predragp@andrew.cmu.edu>
Message-ID: <20180323185058.D3-snyt7d%predragp@andrew.cmu.edu>

Jay Yoon Lee wrote:

> Hi Predrag,
>
> I am not sure if it's just me or everybody else.
> After the reboot, GPU 1, 4, 8 are working for me,
> but GPU 2, 3, 5, 6, 9 are not working for me.
>
> GPU 2, 3, 5, 6, 9 are complaining --> failed to connect to server
> Failed to initialize NVML: Driver/library version mismatch
>
> Is there anything I need to do on my end?
> (nvidia-smi does not work and I don't think I can do anything on my end.)

It works for me. I just logged into all GPU machines with the exception of GPU7 and nvidia-smi gave the correct report. I did test things yesterday but I didn't want to reply to your e-mail until I checked things one more time. It must be something about your environment variables. Also bear in mind that there are three different versions of CUDA on most of these GPUs (a sketch for selecting one follows after this message).

root at gpu8$ ls -1|grep cuda
cuda
cuda-8.0
cuda-9.0
cuda-9.1

Predrag

> Thanks,
> Jay-Yoon
>
> On Thu, Mar 22, 2018 at 3:11 PM, Predrag Punosevac wrote:
>
> > Jay Yoon Lee wrote:
> >
> > > Hi Predrag,
> > >
> > > Thanks for the email & I upvote for rebooting gpu3 & 4.
> > >
> > > As far as I know, before it was just gpu2 having the problem and now we have
> > > gpu3, 4 having the same symptoms.
> > >
> > > But, one question: I don't think gpu2 got fixed even after rebooting.
> > > Or is it just me? --> Do I have to reconfigure something?
> >
> > GPU2 has a problem with the full file system. I will move MATLAB to a
> > different location and resolve that. OK. GPU2 will also be down at 5 PM
> > for about an hour.
> >
> > Predrag
> >
> > > I am asking this question to see
> > > whether I have to do something once gpu3 & 4 are rebooted,
> > > since the gpu2 reboot didn't seem to work for me.
> > >
> > > Thanks!
> > > Jay-Yoon
> > >
> > > On Thu, Mar 22, 2018 at 1:51 PM, Predrag Punosevac <predragp at andrew.cmu.edu> wrote:
> > >
> > > > Michael Andrews wrote:
> > > >
> > > > > Hi Predrag,
> > > > >
> > > > > There seems to be a driver/library mismatch on some of the gpu nodes
> > > > > (e.g. gpu3, gpu4):
> > > > >
> > > > > $ nvidia-smi
> > > > > Failed to initialize NVML: Driver/library version mismatch
> > > > >
> > > >
> > > > Unfortunately the machines will have to be rebooted to clear that. I
> > > > will do it today at 5:00 PM.
> > > >
> > > > Predrag
> > > >
> > > > > Could you have a look when you get a chance?
> > > > >
> > > > > Thanks,
> > > > > Michael
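Since several CUDA toolkits live side by side under /usr/local, which one your job sees is mostly a matter of environment variables. A per-user sketch (pick the version your framework was built against; put it in ~/.bashrc if it works for you):

    export CUDA_HOME=/usr/local/cuda-9.0
    export PATH=$CUDA_HOME/bin:$PATH
    export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
    nvcc --version    # confirm which toolkit is now selected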
From predragp at andrew.cmu.edu Fri Mar 23 14:54:16 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Fri, 23 Mar 2018 14:54:16 -0400
Subject: GPU8 and GPU9
Message-ID: <20180323185416.YK3cK4Z31%predragp@andrew.cmu.edu>

GPU8 and GPU9 are no longer off limits and anybody can use them. The only GPU server out of GPU[1-9] which remains reserved for a specific project is GPU7.

Predrag

From predragp at andrew.cmu.edu Fri Mar 23 15:04:55 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Fri, 23 Mar 2018 15:04:55 -0400
Subject: Tensorflow
Message-ID: <20180323190455.PwnlJzgeb%predragp@andrew.cmu.edu>

Tensorflow is tested and works well for multiple people on GPU2, GPU3, and GPU4. On the other servers you have to make sure for now that you are using

/usr/local/cuda-9.0/lib64/libcudnn.so.7.0.5

This is due to the fact that Tensorflow is broken upstream with libcudnn.so.7.1.5. I am thinking about how best to work around this problem. All servers have three versions of cuda, but the default is the newest:

root at gpu1$ ls -l |grep cuda
lrwxrwxrwx. 1 root root 8 Mar 22 19:10 cuda -> cuda-9.1
drwxr-xr-x. 14 root root 4096 Apr 19 2017 cuda-8.0
drwxr-xr-x. 15 root root 4096 Nov 30 16:10 cuda-9.0
drwxr-xr-x. 15 root root 4096 Mar 22 18:59 cuda-9.1

Predrag

From predragp at andrew.cmu.edu Fri Mar 23 18:26:50 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Fri, 23 Mar 2018 18:26:50 -0400
Subject: Gogs/Git now uses LDAP for authentication
Message-ID: <20180323222650.I9CIJcHy-%predragp@andrew.cmu.edu>

Dear Autonians,

Our Gogs/Git repository is now using the LDAP database for authorization and authentication. All existing local accounts have been mapped to LDAP accounts. No local account will be created again for any reason!

If your Gogs password was the same as your LDAP password, no action is needed on your part. If you are using your e-mail address to log into the Gogs interface, you will have to use the same e-mail address I have in LDAP. Uploaded ssh keys are not affected (tested).

If your Gogs password was different from your LDAP password, you will have to use the LDAP password to log into the Gogs interface.

There was one user in Gogs whose username didn't match his LDAP username (Samy). His local account has been deleted as he had no repos. Next time he tries to log into the Gogs interface he will just use his LDAP credentials. Unfortunately he will have to upload his ssh-key again.

There are four remaining local accounts, three of which have Gogs/Git admin privileges: awertz, sheath, and predrag. Mr. Jenkins https://jenkins.io/ also has a local account.

Best,
Predrag

P.S. My understanding (please correct me, Anthony) is that all CVS repositories have been migrated to Git. Subversion code has not been migrated yet. CVS and Subversion remain available for historical reasons but nobody has write access. The easiest way to get on Anthony's, Simon's, and my bad side is to try to check in data or binary files into Gogs/Git. If you don't know much about version control, please stop by NSH 3119 for a short orientation.
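On the "no data or binary files in Gogs/Git" point, a client-side guard is easy to set up. A minimal pre-commit hook sketch (the 1 MB limit is an arbitrary assumption; save it as .git/hooks/pre-commit in your repo and make it executable):

    #!/bin/sh
    # reject commits that stage files larger than 1 MB
    # note: this simple loop assumes filenames without spaces
    limit=1048576
    for f in $(git diff --cached --name-only --diff-filter=AM); do
        size=$(wc -c < "$f")
        if [ "$size" -gt "$limit" ]; then
            echo "error: $f is $size bytes; keep data out of the repo" >&2
            exit 1
        fi
    done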
From predragp at andrew.cmu.edu Fri Mar 23 23:08:06 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Fri, 23 Mar 2018 23:08:06 -0400
Subject: Main File server full
Message-ID: <20180324030806.XCenX7ise%predragp@andrew.cmu.edu>

Dear Autonians,

Our main file server is full and I can no longer take snapshots of your home directories. This doesn't affect members of Neill's group, who have their own file server.

[root at gaia] /var/log# head -10 messages
Mar 23 13:00:00 gaia newsyslog[71545]: logfile turned over due to size>100K
Mar 23 13:00:08 gaia autosnap.py: [tools.autosnap:58] Popen()ing: /sbin/zfs snapshot -r -o freenas:state=NEW zfsauton/home at auto-20180323.1300-2w
Mar 23 13:00:09 gaia autosnap.py: [tools.autosnap:243] Failed to create snapshot 'zfsauton/home at auto-20180323.1300-2w': cannot create snapshot 'zfsauton/home at auto-20180323.1300-2w': out of space
no snapshots were created

The HDDs were purchased in November but I hesitated to take the plunge until I was 100% sure about the new design. Unfortunately this can't wait any longer. The plan of action is as follows.

1. Over the weekend I will verify the quality of the data and project pseudo-file-system (datasets in ZFS lingo) replications.

2. Assuming that those replications are OK, I will make them live on Monday. Some downtime is unavoidable. I hope to keep it within 2h.

3. Once we are happy with the live copies of the project and data pseudo-file systems, those will be destroyed on the main file server in order to make additional space for the snapshots of your home folders.

4. That might buy us time (at least 1-2 weeks).

5. Once home directories are properly replicated on the backup server they will be made live.

6. The main file server will be rebuilt with new HDDs. The old HDDs will not be erased and we will still be able to put that ZFS pool back online if things are not working the way we want.

7. In the new setup each home directory will be a separate dataset and for the first time we will have a 300GB quota. You will also be able to access your own snapshots without asking me for the files.

Best,
Predrag
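For context on what per-user datasets will allow, the ZFS side looks roughly like this (a sketch of admin-side commands; the dataset name is hypothetical):

    zfs list -t snapshot | head                  # snapshots and the space they hold
    zfs set quota=300G zfsauton/home/someuser    # per-user quota on a per-user dataset
    # users can then pull old files themselves from the dataset's
    # hidden .zfs/snapshot/<snapshot-name>/ directory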
From predragp at andrew.cmu.edu Mon Mar 26 21:00:21 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Mon, 26 Mar 2018 21:00:21 -0400
Subject: Lua Torch
In-Reply-To: <295ba51bedc34351966390695e2ad973@PGH-MSGMLT-02.andrew.ad.cmu.edu>
References: <295ba51bedc34351966390695e2ad973@PGH-MSGMLT-02.andrew.ad.cmu.edu>
Message-ID: <20180327010021.WzisjuLUh%predragp@andrew.cmu.edu>

Manzil Zaheer wrote:

> Hi Predrag,
>
> I am not able to use any GPUs on gpu5,6,7,9. I tried all 3 versions of cuda, but I get the following error:
>

I was able to build it after adding

export TORCH_NVCC_FLAGS="-D__CUDA_NO_HALF_OPERATORS__"

per

https://github.com/torch/torch7/issues/1086

When I try to run it I get errors that Lua packages are missing (probably due to my path variables). I have a vague recollection that Simon and I helped you once with this thing in the past. IIRC it was very picky about the versions of some Lua packages and required their versions, not the ones which come with yum.

Anyhow, I am forwarding this to users at autonlab in the hope that somebody is using it and might be of more help. Please stop by NSH 3119 and let us try to debug this.

Predrag

> THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 error=30 : unknown error
> Traceback (most recent call last):
> File "", line 1, in
> File "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 384, in _lazy_new
> _lazy_init()
> File "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 142, in _lazy_init
> torch._C._cuda_init()
> RuntimeError: cuda runtime error (30) : unknown error at /pytorch/torch/lib/THC/THCGeneral.c:70
>
> Can you kindly look into it?
>
> Thanks,
> Manzil

From manzil at cmu.edu Mon Mar 26 21:02:18 2018
From: manzil at cmu.edu (Manzil Zaheer)
Date: Tue, 27 Mar 2018 01:02:18 +0000
Subject: Lua Torch
In-Reply-To: <20180327010021.WzisjuLUh%predragp@andrew.cmu.edu>
References: <295ba51bedc34351966390695e2ad973@PGH-MSGMLT-02.andrew.ad.cmu.edu>, <20180327010021.WzisjuLUh%predragp@andrew.cmu.edu>
Message-ID: <9f3af81667764f5b9cdb9d10dd156914@PGH-MSGMLT-02.andrew.ad.cmu.edu>

Thanks for the detailed analysis. But I am using pytorch. I have not tried Lua torch. Can you please check? Thanks again!

Sent from my Samsung Galaxy smartphone.

-------- Original message --------
From: Predrag Punosevac
Date: 3/26/18 9:00 PM (GMT-05:00)
To: Manzil Zaheer
Cc: Barnabas Poczos , users at autonlab.org
Subject: Re: Lua Torch

Manzil Zaheer wrote:

> Hi Predrag,
>
> I am not able to use any GPUs on gpu5,6,7,9. I tried all 3 versions of cuda, but I get the following error:
>

I was able to build it after adding

export TORCH_NVCC_FLAGS="-D__CUDA_NO_HALF_OPERATORS__"

per

https://github.com/torch/torch7/issues/1086

When I try to run it I get errors that Lua packages are missing (probably due to my path variables). I have a vague recollection that Simon and I helped you once with this thing in the past. IIRC it was very picky about the versions of some Lua packages and required their versions, not the ones which come with yum.

Anyhow, I am forwarding this to users at autonlab in the hope that somebody is using it and might be of more help. Please stop by NSH 3119 and let us try to debug this.

Predrag

> THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 error=30 : unknown error
> Traceback (most recent call last):
> File "", line 1, in
> File "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 384, in _lazy_new
> _lazy_init()
> File "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 142, in _lazy_init
> torch._C._cuda_init()
> RuntimeError: cuda runtime error (30) : unknown error at /pytorch/torch/lib/THC/THCGeneral.c:70
>
> Can you kindly look into it?
>
> Thanks,
> Manzil
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From predragp at andrew.cmu.edu Mon Mar 26 22:50:12 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Mon, 26 Mar 2018 22:50:12 -0400
Subject: PyTorch
In-Reply-To:
References: <295ba51bedc34351966390695e2ad973@PGH-MSGMLT-02.andrew.ad.cmu.edu> <20180327010021.WzisjuLUh%predragp@andrew.cmu.edu>
Message-ID: <20180327025012.PucNB2br-%predragp@andrew.cmu.edu>

Manzil Zaheer wrote:

> Thanks for the detailed analysis. But I am using pytorch. I have not tried Lua torch. Can you please check? Thanks again!
>

I did. You have Python 3.6.4 in /opt/miniconda3/bin/python3.6

predrag at gpu3$ /opt/miniconda3/bin/python3.6
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Try reinstalling things in your scratch directory:

/opt/miniconda3/bin/conda install pytorch torchvision cuda91 -c pytorch

You should see something like

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    pillow-5.0.0               |   py36h3deb7b8_0         561 KB
    mkl-2018.0.2               |                1       205.2 MB
    cuda91-1.0                 |       h4c16780_0           3 KB  pytorch
    libpng-1.6.34              |       hb9fc6fc_0         334 KB
    freetype-2.8               |       hab7d2ae_1         804 KB
    libgfortran-ng-7.2.0       |       hdf63c60_3         1.2 MB
    intel-openmp-2018.0.0      |                8         620 KB
    libtiff-4.0.9              |       h28f6b97_0         586 KB
    pytorch-0.3.1              |py36_cuda9.1.85_cudnn7.0.5_2       475.0 MB  pytorch
    torchvision-0.2.0          |   py36h17b6947_1         102 KB  pytorch
    jpeg-9b                    |       h024ee3a_2         248 KB
    numpy-1.14.2               |   py36hdbf6ddf_0         4.0 MB
    olefile-0.45.1             |           py36_0          47 KB
    ------------------------------------------------------------
                                           Total:       688.7 MB

Make sure you use your scratch directory as the install path since the file server is full. I got a clean installation but I didn't play with it further. One thing that worries me is this line:

    pytorch-0.3.1              |py36_cuda9.1.85_cudnn7.0.5_2       475.0 MB  pytorch

We had problems with cudnn on 9.1, apparently because upstream was assuming 7.0.5 when in reality we have 7.1.1 on CUDA 9 and even 7.1.5 on CUDA 9.1.

GPU3 has the CUDNN 7.0.5 library in cuda-9.0, so try adjusting the conda command accordingly.

Best,
Predrag

> Sent from my Samsung Galaxy smartphone.
>
> -------- Original message --------
> From: Predrag Punosevac
> Date: 3/26/18 9:00 PM (GMT-05:00)
> To: Manzil Zaheer
> Cc: Barnabas Poczos , users at autonlab.org
> Subject: Re: Lua Torch
>
> Manzil Zaheer wrote:
>
> > Hi Predrag,
> >
> > I am not able to use any GPUs on gpu5,6,7,9. I tried all 3 versions of cuda, but I get the following error:
> >
>
> I was able to build it after adding
>
> export TORCH_NVCC_FLAGS="-D__CUDA_NO_HALF_OPERATORS__"
>
> per
>
> https://github.com/torch/torch7/issues/1086
>
> When I try to run it I get errors that Lua packages are missing (probably
> due to my path variables). I have a vague recollection that Simon and I
> helped you once with this thing in the past. IIRC it was very picky about
> the versions of some Lua packages and required their versions, not the ones
> which come with yum.
>
> Anyhow, I am forwarding this to users at autonlab in the hope that somebody
> is using it and might be of more help. Please stop by NSH 3119 and let us
> try to debug this.
>
> Predrag
>
> > THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 error=30 : unknown error
> > Traceback (most recent call last):
> > File "", line 1, in
> > File "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 384, in _lazy_new
> > _lazy_init()
> > File "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 142, in _lazy_init
> > torch._C._cuda_init()
> > RuntimeError: cuda runtime error (30) : unknown error at /pytorch/torch/lib/THC/THCGeneral.c:70
> >
> > Can you kindly look into it?
> >
> > Thanks,
> > Manzil
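To keep large package installs off the full file server, a whole conda environment can be placed under scratch with an explicit prefix. A sketch (the scratch path is an assumption; use whatever scratch directory exists on your node):

    # create the environment under scratch and install into it
    /opt/miniconda3/bin/conda create --prefix /home/scratch/$USER/envs/pt python=3.6
    /opt/miniconda3/bin/conda install --prefix /home/scratch/$USER/envs/pt pytorch torchvision cuda90 -c pytorch

    # then use it
    source /opt/miniconda3/bin/activate /home/scratch/$USER/envs/pt

The cuda90 package here is an assumption that follows the advice above about matching CUDNN 7.0.5 in cuda-9.0; swap in cuda91 if that turns out to work for you.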
From manzil at cmu.edu Tue Mar 27 01:00:39 2018
From: manzil at cmu.edu (Manzil Zaheer)
Date: Tue, 27 Mar 2018 05:00:39 +0000
Subject: PyTorch
In-Reply-To: <20180327025012.PucNB2br-%predragp@andrew.cmu.edu>
References: <295ba51bedc34351966390695e2ad973@PGH-MSGMLT-02.andrew.ad.cmu.edu> <20180327010021.WzisjuLUh%predragp@andrew.cmu.edu>, <20180327025012.PucNB2br-%predragp@andrew.cmu.edu>
Message-ID: <64e210675cba4c68a73e803cfcaca728@PGH-MSGMLT-03.andrew.ad.cmu.edu>

Hi Predrag,

Thanks again for your help. But I still cannot get anything running on GPU5,6,7,9. Also notice that GPU1,2,3,4,8 are almost all full, while no one is using GPU5,6,7,9. This might mean that no one else is able to run anything there either.

So I tried many things. Everything installs without issue. But when I try to run simple code like:

import torch
x = torch.cuda.FloatTensor(2,3,4)
print(x)

I get the following error:

THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 error=30 : unknown error
Traceback (most recent call last):
  File "", line 1, in
  File "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/_utils.py", line 69, in _cuda
    return new_type(self.size()).copy_(self, async)
  File "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 384, in _lazy_new
    _lazy_init()
  File "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 142, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (30) : unknown error at /pytorch/torch/lib/THC/THCGeneral.c:70

Thanks,
Manzil

________________________________________
From: Predrag Punosevac
Sent: 26 March 2018 22:50
To: Manzil Zaheer
Cc: Barnabas Poczos; users at autonlab.org
Subject: Re: PyTorch

Manzil Zaheer wrote:

> Thanks for the detailed analysis. But I am using pytorch. I have not tried Lua torch. Can you please check? Thanks again!
>

I did. You have Python 3.6.4 in /opt/miniconda3/bin/python3.6

predrag at gpu3$ /opt/miniconda3/bin/python3.6
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

Try reinstalling things in your scratch directory:

/opt/miniconda3/bin/conda install pytorch torchvision cuda91 -c pytorch

You should see something like

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    pillow-5.0.0               |   py36h3deb7b8_0         561 KB
    mkl-2018.0.2               |                1       205.2 MB
    cuda91-1.0                 |       h4c16780_0           3 KB  pytorch
    libpng-1.6.34              |       hb9fc6fc_0         334 KB
    freetype-2.8               |       hab7d2ae_1         804 KB
    libgfortran-ng-7.2.0       |       hdf63c60_3         1.2 MB
    intel-openmp-2018.0.0      |                8         620 KB
    libtiff-4.0.9              |       h28f6b97_0         586 KB
    pytorch-0.3.1              |py36_cuda9.1.85_cudnn7.0.5_2       475.0 MB  pytorch
    torchvision-0.2.0          |   py36h17b6947_1         102 KB  pytorch
    jpeg-9b                    |       h024ee3a_2         248 KB
    numpy-1.14.2               |   py36hdbf6ddf_0         4.0 MB
    olefile-0.45.1             |           py36_0          47 KB
    ------------------------------------------------------------
                                           Total:       688.7 MB

Make sure you use your scratch directory as the install path since the file server is full. I got a clean installation but I didn't play with it further. One thing that worries me is this line:

    pytorch-0.3.1              |py36_cuda9.1.85_cudnn7.0.5_2       475.0 MB  pytorch

We had problems with cudnn on 9.1, apparently because upstream was assuming 7.0.5 when in reality we have 7.1.1 on CUDA 9 and even 7.1.5 on CUDA 9.1.

GPU3 has the CUDNN 7.0.5 library in cuda-9.0, so try adjusting the conda command accordingly.

Best,
Predrag

> Sent from my Samsung Galaxy smartphone.
>
> -------- Original message --------
> From: Predrag Punosevac
> Date: 3/26/18 9:00 PM (GMT-05:00)
> To: Manzil Zaheer
> Cc: Barnabas Poczos , users at autonlab.org
> Subject: Re: Lua Torch
>
> Manzil Zaheer wrote:
>
> > Hi Predrag,
> >
> > I am not able to use any GPUs on gpu5,6,7,9. I tried all 3 versions of cuda, but I get the following error:
> >
>
> I was able to build it after adding
>
> export TORCH_NVCC_FLAGS="-D__CUDA_NO_HALF_OPERATORS__"
>
> per
>
> https://github.com/torch/torch7/issues/1086
>
> When I try to run it I get errors that Lua packages are missing (probably
> due to my path variables). I have a vague recollection that Simon and I
> helped you once with this thing in the past. IIRC it was very picky about
> the versions of some Lua packages and required their versions, not the ones
> which come with yum.
>
> Anyhow, I am forwarding this to users at autonlab in the hope that somebody
> is using it and might be of more help. Please stop by NSH 3119 and let us
> try to debug this.
>
> Predrag
>
> > THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 error=30 : unknown error
> > Traceback (most recent call last):
> > File "", line 1, in
> > File "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 384, in _lazy_new
> > _lazy_init()
> > File "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 142, in _lazy_init
> > torch._C._cuda_init()
> > RuntimeError: cuda runtime error (30) : unknown error at /pytorch/torch/lib/THC/THCGeneral.c:70
> >
> > Can you kindly look into it?
> >
> > Thanks,
> > Manzil
From predragp at andrew.cmu.edu Tue Mar 27 01:31:40 2018
From: predragp at andrew.cmu.edu (Predrag Punosevac)
Date: Tue, 27 Mar 2018 01:31:40 -0400
Subject: PyTorch
In-Reply-To: <1522126842790.39313@cmu.edu>
References: <295ba51bedc34351966390695e2ad973@PGH-MSGMLT-02.andrew.ad.cmu.edu> <20180327010021.WzisjuLUh%predragp@andrew.cmu.edu> <20180327025012.PucNB2br-%predragp@andrew.cmu.edu> <1522126842790.39313@cmu.edu>
Message-ID: <20180327053140.xM3NWbFsK%predragp@andrew.cmu.edu>

Manzil Zaheer wrote:

> Hi Predrag,
>
> Thanks again for your help. But I still cannot get anything running on
> GPU5,6,7,9. Also notice that GPU1,2,3,4,8 are almost all full, while no
> one is using GPU5,6,7,9. This might mean that no one else is able to run
> anything there either.
>

GPU7 is off limits, used for a special project. How did you figure out that nobody is using it when you can't even log in there?

> So I tried many things. Everything installs without issue. But when I
> try to run simple code like:
>

PyTorch is research-grade software. They have a mailing list. 3 sec of Googling reveals

https://github.com/pytorch/pytorch/issues/2527

also

https://stackoverflow.com/questions/45861767/pytorch-giving-cuda-runtime-error

I will look at this more, but it would be helpful if you got on the PyTorch mailing list and asked the developers what they think. I see this once every 9 months; they look at these bugs every day.

Predrag

> import torch
> x = torch.cuda.FloatTensor(2,3,4)
> print(x)
>
>
> I get the following error:
> THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 error=30 : unknown error
> Traceback (most recent call last):
> File "", line 1, in
> File "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/_utils.py", line 69, in _cuda
> return new_type(self.size()).copy_(self, async)
> File "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 384, in _lazy_new
> _lazy_init()
> File "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 142, in _lazy_init
> torch._C._cuda_init()
> RuntimeError: cuda runtime error (30) : unknown error at /pytorch/torch/lib/THC/THCGeneral.c:70
>
> Thanks,
> Manzil
>
> ________________________________________
> From: Predrag Punosevac
> Sent: 26 March 2018 22:50
> To: Manzil Zaheer
> Cc: Barnabas Poczos; users at autonlab.org
> Subject: Re: PyTorch
>
> Manzil Zaheer wrote:
>
> > Thanks for the detailed analysis. But I am using pytorch. I have not tried Lua torch. Can you please check? Thanks again!
> >
>
> > I did.
You have Python 3.6.4 in /opt/miniconda3/bin/python3.6 > > predrag at gpu3$ /opt/miniconda3/bin/python3.6 > Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19) > [GCC 7.2.0] on linux > Type "help", "copyright", "credits" or "license" for more information. > > > Try reinstalling thing in your scratch directory as > > /opt/miniconda3/bin/conda install pytorch torchvision cuda91 -c pytorch > > You should see something like > > The following packages will be downloaded: > > package | build > ---------------------------|----------------- > pillow-5.0.0 | py36h3deb7b8_0 561 KB > mkl-2018.0.2 | 1 205.2 MB > cuda91-1.0 | h4c16780_0 3 KB > pytorch > libpng-1.6.34 | hb9fc6fc_0 334 KB > freetype-2.8 | hab7d2ae_1 804 KB > libgfortran-ng-7.2.0 | hdf63c60_3 1.2 MB > intel-openmp-2018.0.0 | 8 620 KB > libtiff-4.0.9 | h28f6b97_0 586 KB > pytorch-0.3.1 |py36_cuda9.1.85_cudnn7.0.5_2 475.0 > MB pytorch > torchvision-0.2.0 | py36h17b6947_1 102 KB > pytorch > jpeg-9b | h024ee3a_2 248 KB > numpy-1.14.2 | py36hdbf6ddf_0 4.0 MB > olefile-0.45.1 | py36_0 47 KB > ------------------------------------------------------------ > Total: 688.7 MB > > > Make sure you put your scratch as a path since file server is full. I > got clean installation but I didn't play further. One thing that worries > me is this line > > pytorch-0.3.1 |py36_cuda9.1.85_cudnn7.0.5_2 475.0 MB > pytorch > > We had problems with cudnn on 9.1 apparently because the upstream was > assuming 7.0.5 when in reality I have 7.1.1 CUDA 9 or even 7.1.5. CUDA > 9.1 > > GPU3 has CUDNN library 7.0.5 in cuda-9.0 so try adjusting conda command > accordingly. > > > Best, > Predrag > > > > > > > > > > > > Sent from my Samsung Galaxy smartphone. > > > > > > -------- Original message -------- > > From: Predrag Punosevac > > Date: 3/26/18 9:00 PM (GMT-05:00) > > To: Manzil Zaheer > > Cc: Barnabas Poczos , users at autonlab.org > > Subject: Re: Lua Torch > > > > Manzil Zaheer wrote: > > > > > Hi Predrag, > > > > > > I am not able to use any GPUSs on gpu5,6,7,9. I tried all 3 versions of cuda, but I get the following error: > > > > > > > > > I was able to build it after adding this > > > > export TORCH_NVCC_FLAGS="-D__CUDA_NO_HALF_OPERATORS__" > > > > per > > > > https://github.com/torch/torch7/issues/1086 > > > > When I try to run it I get errors that Lua packages are missing (probably > > due to my path variables). I have a vague recollection that Simon and I > > halped you once with this thing in the past. IIRC it was very picky about > > the version of some Lua package and required their version not the one > > which comes with yum . > > > > Anyhow I am forwarding this to users at autonlab in hope somebody is using > > it and might be of more help. Please stop by NSH 3119 and let us try to > > debug this. > > > > Predrag > > > > > > > > > > > THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 error=30 : unknown error > > > Traceback (most recent call last): > > > File "", line 1, in > > > File "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 384, in _lazy_new > > > _lazy_init() > > > File "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 142, in _lazy_init > > > torch._C._cuda_init() > > > RuntimeError: cuda runtime error (30) : unknown error at /pytorch/torch/lib/THC/THCGeneral.c:70 > > > > > > Can you kindly look into it? 
> > > > > > Thanks, > > > Manzil From manzil at cmu.edu Tue Mar 27 01:46:44 2018 From: manzil at cmu.edu (Manzil Zaheer) Date: Tue, 27 Mar 2018 05:46:44 +0000 Subject: PyTorch In-Reply-To: <20180327053140.xM3NWbFsK%predragp@andrew.cmu.edu> References: <295ba51bedc34351966390695e2ad973@PGH-MSGMLT-02.andrew.ad.cmu.edu> <20180327010021.WzisjuLUh%predragp@andrew.cmu.edu> <20180327025012.PucNB2br-%predragp@andrew.cmu.edu> <1522126842790.39313@cmu.edu>, <20180327053140.xM3NWbFsK%predragp@andrew.cmu.edu> Message-ID: <798cc89dfa1a47b691994bc96880c039@PGH-MSGMLT-03.andrew.ad.cmu.edu> Hi Predrag, Thanks for pointing out the links. From the link you provided, we can see that FB engineers mention that "error 30 is usually unrelated to pytorch issues (or your code change)". Thanks, Manzil ________________________________________ From: Predrag Punosevac Sent: 27 March 2018 01:31 To: Manzil Zaheer Cc: Barnabas Poczos; users at autonlab.org Subject: Re: PyTorch Manzil Zaheer wrote: > Hi Pregrad, > > Thanks again for your help. But I still can not get anything running on GPU5,6,7,9. Also notice that GPU1,2,3,4,8 almost all GPUs are full, while no one is using GPU5,6,7,9. This might mean no one else is also able to run anything as well. > 7 if off limit used for the special project. How did you figure out that nobody is using it when you can't even log there? > So I tried many things. Everything installs without issue. But when i try to run the simple code like: > PyTorch is a research grade software. They have a mailing list. 3 sec Googling reveals https://github.com/pytorch/pytorch/issues/2527 also https://stackoverflow.com/questions/45861767/pytorch-giving-cuda-runtime-error I will look at this more but it would be helpful if you get on PyTorch mailing list and ask developers what they think. I see this once every 9 months they are looking at this bugs every day. Predrag > import torch > x = torch.cuda.FloatTensor(2,3,4) > print(x) > > > I get the following error: > THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 error=30 : unknown error > Traceback (most recent call last): > File "", line 1, in > File "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/_utils.py", line 69, in _cuda > return new_type(self.size()).copy_(self, async) > File "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 384, in _lazy_new > _lazy_init() > File "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 142, in _lazy_init > torch._C._cuda_init() > RuntimeError: cuda runtime error (30) : unknown error at /pytorch/torch/lib/THC/THCGeneral.c:70 > > Thanks, > Manzil > > ________________________________________ > From: Predrag Punosevac > Sent: 26 March 2018 22:50 > To: Manzil Zaheer > Cc: Barnabas Poczos; users at autonlab.org > Subject: Re: PyTorch > > Manzil Zaheer wrote: > > > Thanks for the detailed analysis. But I am using pytorch. I have not tried Lua torch. Can you please check? Thanks again! > > > > I did. You have Python 3.6.4 in /opt/miniconda3/bin/python3.6 > > predrag at gpu3$ /opt/miniconda3/bin/python3.6 > Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19) > [GCC 7.2.0] on linux > Type "help", "copyright", "credits" or "license" for more information. 
> > > Try reinstalling thing in your scratch directory as > > /opt/miniconda3/bin/conda install pytorch torchvision cuda91 -c pytorch > > You should see something like > > The following packages will be downloaded: > > package | build > ---------------------------|----------------- > pillow-5.0.0 | py36h3deb7b8_0 561 KB > mkl-2018.0.2 | 1 205.2 MB > cuda91-1.0 | h4c16780_0 3 KB > pytorch > libpng-1.6.34 | hb9fc6fc_0 334 KB > freetype-2.8 | hab7d2ae_1 804 KB > libgfortran-ng-7.2.0 | hdf63c60_3 1.2 MB > intel-openmp-2018.0.0 | 8 620 KB > libtiff-4.0.9 | h28f6b97_0 586 KB > pytorch-0.3.1 |py36_cuda9.1.85_cudnn7.0.5_2 475.0 > MB pytorch > torchvision-0.2.0 | py36h17b6947_1 102 KB > pytorch > jpeg-9b | h024ee3a_2 248 KB > numpy-1.14.2 | py36hdbf6ddf_0 4.0 MB > olefile-0.45.1 | py36_0 47 KB > ------------------------------------------------------------ > Total: 688.7 MB > > > Make sure you put your scratch as a path since file server is full. I > got clean installation but I didn't play further. One thing that worries > me is this line > > pytorch-0.3.1 |py36_cuda9.1.85_cudnn7.0.5_2 475.0 MB > pytorch > > We had problems with cudnn on 9.1 apparently because the upstream was > assuming 7.0.5 when in reality I have 7.1.1 CUDA 9 or even 7.1.5. CUDA > 9.1 > > GPU3 has CUDNN library 7.0.5 in cuda-9.0 so try adjusting conda command > accordingly. > > > Best, > Predrag > > > > > > > > > > > > Sent from my Samsung Galaxy smartphone. > > > > > > -------- Original message -------- > > From: Predrag Punosevac > > Date: 3/26/18 9:00 PM (GMT-05:00) > > To: Manzil Zaheer > > Cc: Barnabas Poczos , users at autonlab.org > > Subject: Re: Lua Torch > > > > Manzil Zaheer wrote: > > > > > Hi Predrag, > > > > > > I am not able to use any GPUSs on gpu5,6,7,9. I tried all 3 versions of cuda, but I get the following error: > > > > > > > > > I was able to build it after adding this > > > > export TORCH_NVCC_FLAGS="-D__CUDA_NO_HALF_OPERATORS__" > > > > per > > > > https://github.com/torch/torch7/issues/1086 > > > > When I try to run it I get errors that Lua packages are missing (probably > > due to my path variables). I have a vague recollection that Simon and I > > halped you once with this thing in the past. IIRC it was very picky about > > the version of some Lua package and required their version not the one > > which comes with yum . > > > > Anyhow I am forwarding this to users at autonlab in hope somebody is using > > it and might be of more help. Please stop by NSH 3119 and let us try to > > debug this. > > > > Predrag > > > > > > > > > > > THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 error=30 : unknown error > > > Traceback (most recent call last): > > > File "", line 1, in > > > File "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 384, in _lazy_new > > > _lazy_init() > > > File "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 142, in _lazy_init > > > torch._C._cuda_init() > > > RuntimeError: cuda runtime error (30) : unknown error at /pytorch/torch/lib/THC/THCGeneral.c:70 > > > > > > Can you kindly look into it? 
> > > > > > Thanks, > > > Manzil From mbarnes1 at andrew.cmu.edu Tue Mar 27 08:30:13 2018 From: mbarnes1 at andrew.cmu.edu (Matthew Barnes) Date: Tue, 27 Mar 2018 12:30:13 +0000 Subject: PyTorch In-Reply-To: <798cc89dfa1a47b691994bc96880c039@PGH-MSGMLT-03.andrew.ad.cmu.edu> References: <295ba51bedc34351966390695e2ad973@PGH-MSGMLT-02.andrew.ad.cmu.edu> <20180327010021.WzisjuLUh%predragp@andrew.cmu.edu> <20180327025012.PucNB2br-%predragp@andrew.cmu.edu> <1522126842790.39313@cmu.edu> <20180327053140.xM3NWbFsK%predragp@andrew.cmu.edu> <798cc89dfa1a47b691994bc96880c039@PGH-MSGMLT-03.andrew.ad.cmu.edu> Message-ID: I think this is an issue with the CUDA install. I'm unable to run Tensorflow jobs on GPU9 as of last night (have not checked the others, but I suspect similar). 2018-03-26 14:54:49.214493: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_UNKNOWN 2018-03-26 14:54:49.214599: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: gpu9.int.autonlab.org 2018-03-26 14:54:49.214617: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: gpu9.int.autonlab.org 2018-03-26 14:54:49.214685: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 390.30.0 2018-03-26 14:54:49.214747: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 390.30.0 2018-03-26 14:54:49.214762: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 390.30.0 On Tue, Mar 27, 2018 at 1:47 AM Manzil Zaheer wrote: > Hi Predrag, > > Thanks for pointing out the links. From the link you provided, we can see > that FB engineers mention that "error 30 is usually unrelated to pytorch > issues (or your code change)". > > Thanks, > Manzil > ________________________________________ > From: Predrag Punosevac > Sent: 27 March 2018 01:31 > To: Manzil Zaheer > Cc: Barnabas Poczos; users at autonlab.org > Subject: Re: PyTorch > > Manzil Zaheer wrote: > > > Hi Pregrad, > > > > Thanks again for your help. But I still can not get anything running on > GPU5,6,7,9. Also notice that GPU1,2,3,4,8 almost all GPUs are full, while > no one is using GPU5,6,7,9. This might mean no one else is also able to run > anything as well. > > > > 7 if off limit used for the special project. How did you figure out that > nobody is using it when > you can't even log there? > > > So I tried many things. Everything installs without issue. But when i > try to run the simple code like: > > > > PyTorch is a research grade software. They have a mailing list. 3 sec > Googling reveals > > > https://github.com/pytorch/pytorch/issues/2527 > > also > > > https://stackoverflow.com/questions/45861767/pytorch-giving-cuda-runtime-error > > I will look at this more but it would be helpful if you get on PyTorch > mailing list and ask > developers what they think. I see this once every 9 months they are > looking at this bugs every > day. 
> > Predrag > > > import torch > > x = torch.cuda.FloatTensor(2,3,4) > > print(x) > > > > > > I get the following error: > > THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 > error=30 : unknown error > > Traceback (most recent call last): > > File "", line 1, in > > File > "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/_utils.py", > line 69, in _cuda > > return new_type(self.size()).copy_(self, async) > > File > "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", > line 384, in _lazy_new > > _lazy_init() > > File > "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", > line 142, in _lazy_init > > torch._C._cuda_init() > > RuntimeError: cuda runtime error (30) : unknown error at > /pytorch/torch/lib/THC/THCGeneral.c:70 > > > > Thanks, > > Manzil > > > > ________________________________________ > > From: Predrag Punosevac > > Sent: 26 March 2018 22:50 > > To: Manzil Zaheer > > Cc: Barnabas Poczos; users at autonlab.org > > Subject: Re: PyTorch > > > > Manzil Zaheer wrote: > > > > > Thanks for the detailed analysis. But I am using pytorch. I have not > tried Lua torch. Can you please check? Thanks again! > > > > > > > I did. You have Python 3.6.4 in /opt/miniconda3/bin/python3.6 > > > > predrag at gpu3$ /opt/miniconda3/bin/python3.6 > > Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19) > > [GCC 7.2.0] on linux > > Type "help", "copyright", "credits" or "license" for more information. > > > > > > Try reinstalling thing in your scratch directory as > > > > /opt/miniconda3/bin/conda install pytorch torchvision cuda91 -c pytorch > > > > You should see something like > > > > The following packages will be downloaded: > > > > package | build > > ---------------------------|----------------- > > pillow-5.0.0 | py36h3deb7b8_0 561 KB > > mkl-2018.0.2 | 1 205.2 MB > > cuda91-1.0 | h4c16780_0 3 KB > > pytorch > > libpng-1.6.34 | hb9fc6fc_0 334 KB > > freetype-2.8 | hab7d2ae_1 804 KB > > libgfortran-ng-7.2.0 | hdf63c60_3 1.2 MB > > intel-openmp-2018.0.0 | 8 620 KB > > libtiff-4.0.9 | h28f6b97_0 586 KB > > pytorch-0.3.1 |py36_cuda9.1.85_cudnn7.0.5_2 475.0 > > MB pytorch > > torchvision-0.2.0 | py36h17b6947_1 102 KB > > pytorch > > jpeg-9b | h024ee3a_2 248 KB > > numpy-1.14.2 | py36hdbf6ddf_0 4.0 MB > > olefile-0.45.1 | py36_0 47 KB > > ------------------------------------------------------------ > > Total: 688.7 MB > > > > > > Make sure you put your scratch as a path since file server is full. I > > got clean installation but I didn't play further. One thing that worries > > me is this line > > > > pytorch-0.3.1 |py36_cuda9.1.85_cudnn7.0.5_2 475.0 MB > > pytorch > > > > We had problems with cudnn on 9.1 apparently because the upstream was > > assuming 7.0.5 when in reality I have 7.1.1 CUDA 9 or even 7.1.5. CUDA > > 9.1 > > > > GPU3 has CUDNN library 7.0.5 in cuda-9.0 so try adjusting conda command > > accordingly. > > > > > > Best, > > Predrag > > > > > > > > > > > > > > > > > > > > > Sent from my Samsung Galaxy smartphone. > > > > > > > > > -------- Original message -------- > > > From: Predrag Punosevac > > > Date: 3/26/18 9:00 PM (GMT-05:00) > > > To: Manzil Zaheer > > > Cc: Barnabas Poczos , users at autonlab.org > > > Subject: Re: Lua Torch > > > > > > Manzil Zaheer wrote: > > > > > > > Hi Predrag, > > > > > > > > I am not able to use any GPUSs on gpu5,6,7,9. 
I tried all 3 versions > of cuda, but I get the following error: > > > > > > > > > I was able to build it after adding this > > > > > > export TORCH_NVCC_FLAGS="-D__CUDA_NO_HALF_OPERATORS__" > > > > > > per > > > > > > https://github.com/torch/torch7/issues/1086 > > > > > > When I try to run it I get errors that Lua packages are missing (probably > > > due to my path variables). I have a vague recollection that Simon and I > > > helped you once with this thing in the past. IIRC it was very picky about > > > the version of some Lua package and required their version not the one > > > which comes with yum . > > > > > > Anyhow I am forwarding this to users at autonlab in hope somebody is using > > > it and might be of more help. Please stop by NSH 3119 and let us try to > > > debug this. > > > > > > Predrag > > > > > > > > > > > > > > THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 > error=30 : unknown error > > > > Traceback (most recent call last): > > > > File "", line 1, in > > > > File > "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", > line 384, in _lazy_new > > > > _lazy_init() > > > > File > "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", > line 142, in _lazy_init > > > > torch._C._cuda_init() > > > > RuntimeError: cuda runtime error (30) : unknown error at > /pytorch/torch/lib/THC/THCGeneral.c:70 > > > > > > > > Can you kindly look into it? > > > > > > > > Thanks, > > > > Manzil > > -------------- next part -------------- An HTML attachment was scrubbed... URL:

From predragp at andrew.cmu.edu Tue Mar 27 17:35:56 2018 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Tue, 27 Mar 2018 17:35:56 -0400 Subject: PyTorch In-Reply-To: References: <295ba51bedc34351966390695e2ad973@PGH-MSGMLT-02.andrew.ad.cmu.edu> <20180327010021.WzisjuLUh%predragp@andrew.cmu.edu> <20180327025012.PucNB2br-%predragp@andrew.cmu.edu> <1522126842790.39313@cmu.edu> <20180327053140.xM3NWbFsK%predragp@andrew.cmu.edu> <798cc89dfa1a47b691994bc96880c039@PGH-MSGMLT-03.andrew.ad.cmu.edu> Message-ID: <20180327213556.fTho4DuWR%predragp@andrew.cmu.edu>

Matthew Barnes wrote: > I think this is an issue with the CUDA install. I'm unable to run > Tensorflow jobs on GPU9 as of last night (have not checked the others, but > I suspect similar).

Nothing has changed since last night. The error you are seeing is TensorFlow complaining about the 390.30 NVidia driver, but we upgraded the driver last week across all servers and IIRC you were able to use TensorFlow on GPU2, GPU3, and GPU4 after the upgrade.

The main problem seems to be the CUDNN library, as TensorFlow and PyTorch seem to expect older libraries. Look for them in the CUDA-9.0 directory (a quick way to check is sketched below).
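A quick sketch for checking what a given CUDA tree actually ships -- the /usr/local/cuda-9.0 prefix is my guess at the install location, so adjust it to wherever CUDA-9.0 lives on your node:

# print the cuDNN version this CUDA tree ships with
grep -A 2 'define CUDNN_MAJOR' /usr/local/cuda-9.0/include/cudnn.h

# list the cuDNN runtime libraries that would be linked against
ls -l /usr/local/cuda-9.0/lib64/libcudnn*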
Predrag > > 2018-03-26 14:54:49.214493: E > tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: > CUDA_ERROR_UNKNOWN > 2018-03-26 14:54:49.214599: I > tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA > diagnostic information for host: gpu9.int.autonlab.org > 2018-03-26 14:54:49.214617: I > tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: > gpu9.int.autonlab.org > 2018-03-26 14:54:49.214685: I > tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported > version is: 390.30.0 > 2018-03-26 14:54:49.214747: I > tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported > version is: 390.30.0 > 2018-03-26 14:54:49.214762: I > tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version > seems to match DSO: 390.30.0 > > > On Tue, Mar 27, 2018 at 1:47 AM Manzil Zaheer wrote: > > > Hi Predrag, > > > > Thanks for pointing out the links. From the link you provided, we can see > > that FB engineers mention that "error 30 is usually unrelated to pytorch > > issues (or your code change)". > > > > Thanks, > > Manzil > > ________________________________________ > > From: Predrag Punosevac > > Sent: 27 March 2018 01:31 > > To: Manzil Zaheer > > Cc: Barnabas Poczos; users at autonlab.org > > Subject: Re: PyTorch > > > > Manzil Zaheer wrote: > > > > > Hi Pregrad, > > > > > > Thanks again for your help. But I still can not get anything running on > > GPU5,6,7,9. Also notice that GPU1,2,3,4,8 almost all GPUs are full, while > > no one is using GPU5,6,7,9. This might mean no one else is also able to run > > anything as well. > > > > > > > 7 if off limit used for the special project. How did you figure out that > > nobody is using it when > > you can't even log there? > > > > > So I tried many things. Everything installs without issue. But when i > > try to run the simple code like: > > > > > > > PyTorch is a research grade software. They have a mailing list. 3 sec > > Googling reveals > > > > > > https://github.com/pytorch/pytorch/issues/2527 > > > > also > > > > > > https://stackoverflow.com/questions/45861767/pytorch-giving-cuda-runtime-error > > > > I will look at this more but it would be helpful if you get on PyTorch > > mailing list and ask > > developers what they think. I see this once every 9 months they are > > looking at this bugs every > > day. 
> > > > Predrag > > > > > import torch > > > x = torch.cuda.FloatTensor(2,3,4) > > > print(x) > > > > > > > > > I get the following error: > > > THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 > > error=30 : unknown error > > > Traceback (most recent call last): > > > File "", line 1, in > > > File > > "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/_utils.py", > > line 69, in _cuda > > > return new_type(self.size()).copy_(self, async) > > > File > > "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", > > line 384, in _lazy_new > > > _lazy_init() > > > File > > "/zfsauton/home/manzilz/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", > > line 142, in _lazy_init > > > torch._C._cuda_init() > > > RuntimeError: cuda runtime error (30) : unknown error at > > /pytorch/torch/lib/THC/THCGeneral.c:70 > > > > > > Thanks, > > > Manzil > > > > > > ________________________________________ > > > From: Predrag Punosevac > > > Sent: 26 March 2018 22:50 > > > To: Manzil Zaheer > > > Cc: Barnabas Poczos; users at autonlab.org > > > Subject: Re: PyTorch > > > > > > Manzil Zaheer wrote: > > > > > > > Thanks for the detailed analysis. But I am using pytorch. I have not > > tried Lua torch. Can you please check? Thanks again! > > > > > > > > > > I did. You have Python 3.6.4 in /opt/miniconda3/bin/python3.6 > > > > > > predrag at gpu3$ /opt/miniconda3/bin/python3.6 > > > Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19) > > > [GCC 7.2.0] on linux > > > Type "help", "copyright", "credits" or "license" for more information. > > > > > > > > > Try reinstalling thing in your scratch directory as > > > > > > /opt/miniconda3/bin/conda install pytorch torchvision cuda91 -c pytorch > > > > > > You should see something like > > > > > > The following packages will be downloaded: > > > > > > package | build > > > ---------------------------|----------------- > > > pillow-5.0.0 | py36h3deb7b8_0 561 KB > > > mkl-2018.0.2 | 1 205.2 MB > > > cuda91-1.0 | h4c16780_0 3 KB > > > pytorch > > > libpng-1.6.34 | hb9fc6fc_0 334 KB > > > freetype-2.8 | hab7d2ae_1 804 KB > > > libgfortran-ng-7.2.0 | hdf63c60_3 1.2 MB > > > intel-openmp-2018.0.0 | 8 620 KB > > > libtiff-4.0.9 | h28f6b97_0 586 KB > > > pytorch-0.3.1 |py36_cuda9.1.85_cudnn7.0.5_2 475.0 > > > MB pytorch > > > torchvision-0.2.0 | py36h17b6947_1 102 KB > > > pytorch > > > jpeg-9b | h024ee3a_2 248 KB > > > numpy-1.14.2 | py36hdbf6ddf_0 4.0 MB > > > olefile-0.45.1 | py36_0 47 KB > > > ------------------------------------------------------------ > > > Total: 688.7 MB > > > > > > > > > Make sure you put your scratch as a path since file server is full. I > > > got clean installation but I didn't play further. One thing that worries > > > me is this line > > > > > > pytorch-0.3.1 |py36_cuda9.1.85_cudnn7.0.5_2 475.0 MB > > > pytorch > > > > > > We had problems with cudnn on 9.1 apparently because the upstream was > > > assuming 7.0.5 when in reality I have 7.1.1 CUDA 9 or even 7.1.5. CUDA > > > 9.1 > > > > > > GPU3 has CUDNN library 7.0.5 in cuda-9.0 so try adjusting conda command > > > accordingly. > > > > > > > > > Best, > > > Predrag > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sent from my Samsung Galaxy smartphone. 
> > > > > > > > > > > > -------- Original message -------- > > > > From: Predrag Punosevac > > > > Date: 3/26/18 9:00 PM (GMT-05:00) > > > > To: Manzil Zaheer > > > > Cc: Barnabas Poczos , users at autonlab.org > > > > Subject: Re: Lua Torch > > > > > > > > Manzil Zaheer wrote: > > > > > > > > > Hi Predrag, > > > > > > > > > > I am not able to use any GPUSs on gpu5,6,7,9. I tried all 3 versions > of cuda, but I get the following error: > > > > > > > > > > > > > I was able to build it after adding this > > > > > > > > export TORCH_NVCC_FLAGS="-D__CUDA_NO_HALF_OPERATORS__" > > > > > > > > per > > > > > > > > https://github.com/torch/torch7/issues/1086 > > > > > > > > When I try to run it I get errors that Lua packages are missing > (probably > > > > due to my path variables). I have a vague recollection that Simon and I > > > > helped you once with this thing in the past. IIRC it was very picky > about > > > > the version of some Lua package and required their version not the one > > > > which comes with yum . > > > > > > > > Anyhow I am forwarding this to users at autonlab in hope somebody is > using > > > > it and might be of more help. Please stop by NSH 3119 and let us try to > > > > debug this. > > > > > > > > Predrag > > > > > > > > > > > > > > > > > THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 > error=30 : unknown error > > > > > Traceback (most recent call last): > > > > > File "", line 1, in > > > > > File > "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", > line 384, in _lazy_new > > > > > _lazy_init() > > > > > File > "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/torch/cuda/__init__.py", > line 142, in _lazy_init > > > > > torch._C._cuda_init() > > > > > RuntimeError: cuda runtime error (30) : unknown error at > /pytorch/torch/lib/THC/THCGeneral.c:70 > > > > > > > > > > Can you kindly look into it? > > > > > > > > > > Thanks, > > > > > Manzil > > -------------- next part -------------- An HTML attachment was scrubbed... URL:

From barunpatra95 at gmail.com Wed Mar 28 01:34:56 2018 From: barunpatra95 at gmail.com (Barun Patra) Date: Wed, 28 Mar 2018 01:34:56 -0400 Subject: PyTorch In-Reply-To: <20180327213556.fTho4DuWR%predragp@andrew.cmu.edu> References: <295ba51bedc34351966390695e2ad973@PGH-MSGMLT-02.andrew.ad.cmu.edu> <20180327010021.WzisjuLUh%predragp@andrew.cmu.edu> <20180327025012.PucNB2br-%predragp@andrew.cmu.edu> <1522126842790.39313@cmu.edu> <20180327053140.xM3NWbFsK%predragp@andrew.cmu.edu> <798cc89dfa1a47b691994bc96880c039@PGH-MSGMLT-03.andrew.ad.cmu.edu> <20180327213556.fTho4DuWR%predragp@andrew.cmu.edu> Message-ID:

Has anyone been able to run either Tensorflow or pytorch on gpu machines 5, 6, 9? Both give CUDA_ERROR_UNKNOWN errors. I tried setting my LD_LIBRARY_PATH and PATH variables to the cuda-8.0 / cuda-9.0 / cuda-9.1 (and the LD_LIBRARY_PATH to the corresponding lib64), reinstalling pytorch for cuda-8.0 / cuda-9.0 / cuda-9.1 using both virtualenv and the system miniconda, as well as reinstalling tensorflow. Nothing seems to work, unfortunately. IIRC, these errors first appeared when the systems were rebooted after the spring break, and have persisted ever since. Any help in the matter would be appreciated!

On Tue, Mar 27, 2018 at 5:35 PM, Predrag Punosevac wrote: > Matthew Barnes wrote: > > > I think this is an issue with the CUDA install. I'm unable to run > > Tensorflow jobs on GPU9 as of last night (have not checked the others, > but > > I suspect similar). > > Nothing has changed since the last night.
The error you are seeing is > TensorFlow complaning about 390.30 NVidia driver but we upgraded driver > last week accross all servers and IIRC you were able to use TensorFlow > on GPU2, GPU3, and GPU4 after the upgrade. > > The main problem seems CUDNN library as TensorFlow and PyTorch seems to > expect older libraries. Look for them in CUDA-9.0 directory. > > Predrag > > > > > 2018-03-26 14:54:49.214493: E > > tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to > cuInit: > > CUDA_ERROR_UNKNOWN > > 2018-03-26 14:54:49.214599: I > > tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA > > diagnostic information for host: gpu9.int.autonlab.org > > 2018-03-26 14:54:49.214617: I > > tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: > > gpu9.int.autonlab.org > > 2018-03-26 14:54:49.214685: I > > tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda > reported > > version is: 390.30.0 > > 2018-03-26 14:54:49.214747: I > > tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported > > version is: 390.30.0 > > 2018-03-26 14:54:49.214762: I > > tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version > > seems to match DSO: 390.30.0 > > > > > > On Tue, Mar 27, 2018 at 1:47 AM Manzil Zaheer wrote: > > > > > Hi Predrag, > > > > > > Thanks for pointing out the links. From the link you provided, we can > see > > > that FB engineers mention that "error 30 is usually unrelated to > pytorch > > > issues (or your code change)". > > > > > > Thanks, > > > Manzil > > > ________________________________________ > > > From: Predrag Punosevac > > > Sent: 27 March 2018 01:31 > > > To: Manzil Zaheer > > > Cc: Barnabas Poczos; users at autonlab.org > > > Subject: Re: PyTorch > > > > > > Manzil Zaheer wrote: > > > > > > > Hi Pregrad, > > > > > > > > Thanks again for your help. But I still can not get anything running > on > > > GPU5,6,7,9. Also notice that GPU1,2,3,4,8 almost all GPUs are full, > while > > > no one is using GPU5,6,7,9. This might mean no one else is also able > to run > > > anything as well. > > > > > > > > > > 7 if off limit used for the special project. How did you figure out > that > > > nobody is using it when > > > you can't even log there? > > > > > > > So I tried many things. Everything installs without issue. But when i > > > try to run the simple code like: > > > > > > > > > > PyTorch is a research grade software. They have a mailing list. 3 sec > > > Googling reveals > > > > > > > > > https://github.com/pytorch/pytorch/issues/2527 > > > > > > also > > > > > > > > > https://stackoverflow.com/questions/45861767/pytorch- > giving-cuda-runtime-error > > > > > > I will look at this more but it would be helpful if you get on PyTorch > > > mailing list and ask > > > developers what they think. I see this once every 9 months they are > > > looking at this bugs every > > > day. 
> > > > > > Predrag > > > > > > > import torch > > > > x = torch.cuda.FloatTensor(2,3,4) > > > > print(x) > > > > > > > > > > > > I get the following error: > > > > THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 > > > error=30 : unknown error > > > > Traceback (most recent call last): > > > > File "", line 1, in > > > > File > > > "/zfsauton/home/manzilz/.local/lib/python3.6/site- > packages/torch/_utils.py", > > > line 69, in _cuda > > > > return new_type(self.size()).copy_(self, async) > > > > File > > > "/zfsauton/home/manzilz/.local/lib/python3.6/site- > packages/torch/cuda/__init__.py", > > > line 384, in _lazy_new > > > > _lazy_init() > > > > File > > > "/zfsauton/home/manzilz/.local/lib/python3.6/site- > packages/torch/cuda/__init__.py", > > > line 142, in _lazy_init > > > > torch._C._cuda_init() > > > > RuntimeError: cuda runtime error (30) : unknown error at > > > /pytorch/torch/lib/THC/THCGeneral.c:70 > > > > > > > > Thanks, > > > > Manzil > > > > > > > > ________________________________________ > > > > From: Predrag Punosevac > > > > Sent: 26 March 2018 22:50 > > > > To: Manzil Zaheer > > > > Cc: Barnabas Poczos; users at autonlab.org > > > > Subject: Re: PyTorch > > > > > > > > Manzil Zaheer wrote: > > > > > > > > > Thanks for the detailed analysis. But I am using pytorch. I have > not > > > tried Lua torch. Can you please check? Thanks again! > > > > > > > > > > > > > I did. You have Python 3.6.4 in /opt/miniconda3/bin/python3.6 > > > > > > > > predrag at gpu3$ /opt/miniconda3/bin/python3.6 > > > > Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19) > > > > [GCC 7.2.0] on linux > > > > Type "help", "copyright", "credits" or "license" for more > information. > > > > > > > > > > > > Try reinstalling thing in your scratch directory as > > > > > > > > /opt/miniconda3/bin/conda install pytorch torchvision cuda91 -c > pytorch > > > > > > > > You should see something like > > > > > > > > The following packages will be downloaded: > > > > > > > > package | build > > > > ---------------------------|----------------- > > > > pillow-5.0.0 | py36h3deb7b8_0 561 KB > > > > mkl-2018.0.2 | 1 205.2 MB > > > > cuda91-1.0 | h4c16780_0 3 KB > > > > pytorch > > > > libpng-1.6.34 | hb9fc6fc_0 334 KB > > > > freetype-2.8 | hab7d2ae_1 804 KB > > > > libgfortran-ng-7.2.0 | hdf63c60_3 1.2 MB > > > > intel-openmp-2018.0.0 | 8 620 KB > > > > libtiff-4.0.9 | h28f6b97_0 586 KB > > > > pytorch-0.3.1 |py36_cuda9.1.85_cudnn7.0.5_2 > 475.0 > > > > MB pytorch > > > > torchvision-0.2.0 | py36h17b6947_1 102 KB > > > > pytorch > > > > jpeg-9b | h024ee3a_2 248 KB > > > > numpy-1.14.2 | py36hdbf6ddf_0 4.0 MB > > > > olefile-0.45.1 | py36_0 47 KB > > > > ------------------------------------------------------------ > > > > Total: 688.7 MB > > > > > > > > > > > > Make sure you put your scratch as a path since file server is full. I > > > > got clean installation but I didn't play further. One thing that > worries > > > > me is this line > > > > > > > > pytorch-0.3.1 |py36_cuda9.1.85_cudnn7.0.5_2 475.0 > MB > > > > pytorch > > > > > > > > We had problems with cudnn on 9.1 apparently because the upstream was > > > > assuming 7.0.5 when in reality I have 7.1.1 CUDA 9 or even 7.1.5. > CUDA > > > > 9.1 > > > > > > > > GPU3 has CUDNN library 7.0.5 in cuda-9.0 so try adjusting conda > command > > > > accordingly. > > > > > > > > > > > > Best, > > > > Predrag > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sent from my Samsung Galaxy smartphone. 
> > > > > > > > > > -------- Original message -------- > > > > > From: Predrag Punosevac > > > > > Date: 3/26/18 9:00 PM (GMT-05:00) > > > > > To: Manzil Zaheer > > > > > Cc: Barnabas Poczos , users at autonlab.org > > > > > Subject: Re: Lua Torch > > > > > > > > > > Manzil Zaheer wrote: > > > > > > > > > > > Hi Predrag, > > > > > > > > > > > > I am not able to use any GPUSs on gpu5,6,7,9. I tried all 3 > versions > > > of cuda, but I get the following error: > > > > > > > > > > > > > > > > > > > > > I was able to build it after adding this > > > > > > > > > > export TORCH_NVCC_FLAGS="-D__CUDA_NO_HALF_OPERATORS__" > > > > > > > > > > per > > > > > > > > > > https://github.com/torch/torch7/issues/1086 > > > > > > > > > > When I try to run it I get errors that Lua packages are missing > > > (probably > > > > > due to my path variables). I have a vague recollection that Simon > and I > > > > > helped you once with this thing in the past. IIRC it was very picky > > > about > > > > > the version of some Lua package and required their version not the > one > > > > > which comes with yum . > > > > > > > > > > Anyhow I am forwarding this to users at autonlab in hope somebody is > > > using > > > > > it and might be of more help. Please stop by NSH 3119 and let us > try to > > > > > debug this. > > > > > > > > > > Predrag > > > > > > > > > > > > > > > > > > > > > > > > > THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c > line=70 > > > error=30 : unknown error > > > > > > Traceback (most recent call last): > > > > > > File "", line 1, in > > > > > > File > > > "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/ torch/cuda/__init__.py", > > > line 384, in _lazy_new > > > > > > _lazy_init() > > > > > > File > > > "/zfsauton/home/manzilz/local/lib/python3.6/site-packages/ torch/cuda/__init__.py", > > > line 142, in _lazy_init > > > > > > torch._C._cuda_init() > > > > > > RuntimeError: cuda runtime error (30) : unknown error at > > > /pytorch/torch/lib/THC/THCGeneral.c:70 > > > > > > > > > > > > Can you kindly look into it? > > > > > > > > > > > > Thanks, > > > > > > Manzil > > > > > > > -- Barun Patra Master's Student Machine Learning Department Carnegie Mellon University -------------- next part -------------- An HTML attachment was scrubbed... URL:

From predragp at andrew.cmu.edu Wed Mar 28 18:58:49 2018 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Wed, 28 Mar 2018 18:58:49 -0400 Subject: NVidia driver broke GPUs In-Reply-To: References: <295ba51bedc34351966390695e2ad973@PGH-MSGMLT-02.andrew.ad.cmu.edu> <20180327010021.WzisjuLUh%predragp@andrew.cmu.edu> <20180327025012.PucNB2br-%predragp@andrew.cmu.edu> <1522126842790.39313@cmu.edu> <20180327053140.xM3NWbFsK%predragp@andrew.cmu.edu> <798cc89dfa1a47b691994bc96880c039@PGH-MSGMLT-03.andrew.ad.cmu.edu> Message-ID: <20180328225849.jVGyjyWSc%predragp@andrew.cmu.edu>

Barnabas Poczos wrote: > If this can't be fixed quickly, then would it be possible to do a roll > back on these GPU machines (5,6,9) to the latest state when they > worked fine? > (If I know correctly, they are down since March 23.) > > Sorry for bugging you with this, I just want to find a quick solution > to make these 12 GPU cards usable again with pytorch and tensorflow > because several deadlines are coming. > > Many thanks! ... and sorry for annoying you with this! >

Ok, Yotam and I spent the last 3-4h debugging this. It is not a PyTorch nor a TensorFlow issue. It is not even a CUDA issue. The NVidia driver itself is broken. I have no idea how it happened on some machines and didn't happen on others (all GPU machines with the exception of GPU-7 run the same latest Red Hat 3.10.0-693.21.1.el7.x86 kernel). The clue should have been the fact that MATLAB also got broken on some machines. My hunch is that the NVidia driver gets recompiled during the kernel update and apparently that is not as robust as it should be.

The plan of action is that I will try to remove everything NVidia related from the GPU9 machine and try to reinstall the driver and CUDA from scratch. Hopefully GPU9 will become functional just like GPU8. Once it works for GPU9 I can go and fix the other machines. If that doesn't work I will reinstall GPU9 from scratch.

Long story short, somebody at NVidia did a shady job with QA and we became victims. Oh, just for the record, we don't use ZFS on Linux. If I was running root on a ZFS pool, as I am doing on the file server, I could just do a beadm select of the previous working system and go back. I am not aware that Linux can do something like that, but that is what I do on FreeBSD and that is what Solaris does.

Best,
Predrag

> Cheers, > Barnabas > ====================== > Barnabas Poczos, PhD > Assistant Professor > Machine Learning Department > Carnegie Mellon University >

From bapoczos at cs.cmu.edu Wed Mar 28 19:16:21 2018 From: bapoczos at cs.cmu.edu (Barnabas Poczos) Date: Wed, 28 Mar 2018 19:16:21 -0400 Subject: NVidia driver broke GPUs In-Reply-To: <8588f843f78646378e50557740100683@PGH-MSGMLT-01.andrew.ad.cmu.edu> References: <295ba51bedc34351966390695e2ad973@PGH-MSGMLT-02.andrew.ad.cmu.edu> <20180327010021.WzisjuLUh%predragp@andrew.cmu.edu> <20180327025012.PucNB2br-%predragp@andrew.cmu.edu> <1522126842790.39313@cmu.edu> <20180327053140.xM3NWbFsK%predragp@andrew.cmu.edu> <798cc89dfa1a47b691994bc96880c039@PGH-MSGMLT-03.andrew.ad.cmu.edu> <8588f843f78646378e50557740100683@PGH-MSGMLT-01.andrew.ad.cmu.edu> Message-ID:

Thanks Predrag and Yotam for your help working on this!

Best, Barnabas ====================== Barnabas Poczos, PhD Assistant Professor Machine Learning Department Carnegie Mellon University

On Wed, Mar 28, 2018 at 6:58 PM, Predrag Punosevac wrote: > Barnabas Poczos wrote: > >> If this can't be fixed quickly, then would it be possible to do a roll >> back on these GPU machines (5,6,9) to the latest state when they >> worked fine? >> (If I know correctly, they are down since March 23.) >> >> Sorry for bugging you with this, I just want to find a quick solution >> to make these 12 GPU cards usable again with pytorch and tensorflow >> because several deadlines are coming. >> >> Many thanks! ... and sorry for annoying you with this! >> > > Ok, Yotam and I spent the last 3-4h debugging this. It is not a PyTorch nor > a TensorFlow issue. It is not even a CUDA issue. The NVidia driver itself is > broken. I have no idea how it happened on some machines and didn't > happen on others (all GPU machines with the exception of GPU-7 run the > same latest Red Hat 3.10.0-693.21.1.el7.x86 kernel). The clue should > have been the fact that MATLAB also got broken on some machines. > My hunch is that the NVidia driver gets recompiled during the kernel update > and apparently that is not as robust as it should be. > > The plan of action is that I will try to remove everything NVidia > related from the GPU9 machine and try to reinstall the driver and CUDA from > scratch. Hopefully GPU9 will become functional just like GPU8. Once it > works for GPU9 I can go and fix the other machines. If that doesn't work I > will reinstall GPU9 from scratch.
> > Long story short, somebody at NVidia did a shady job with QA and we > became victims. Oh, just for the record, we don't use ZFS on Linux. If I > was running root on a ZFS pool, as I am doing on the file server, I > could just do a beadm select of the previous working system and go back. I am > not aware that Linux can do something like that, but that is what I do on > FreeBSD and that is what Solaris does. > > > Best, > Predrag > > > > >> Cheers, >> Barnabas >> ====================== >> Barnabas Poczos, PhD >> Assistant Professor >> Machine Learning Department >> Carnegie Mellon University >>

From predragp at andrew.cmu.edu Wed Mar 28 23:15:06 2018 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Wed, 28 Mar 2018 23:15:06 -0400 Subject: NVidia driver broke GPUs In-Reply-To: References: <295ba51bedc34351966390695e2ad973@PGH-MSGMLT-02.andrew.ad.cmu.edu> <20180327010021.WzisjuLUh%predragp@andrew.cmu.edu> <20180327025012.PucNB2br-%predragp@andrew.cmu.edu> <1522126842790.39313@cmu.edu> <20180327053140.xM3NWbFsK%predragp@andrew.cmu.edu> <798cc89dfa1a47b691994bc96880c039@PGH-MSGMLT-03.andrew.ad.cmu.edu> <8588f843f78646378e50557740100683@PGH-MSGMLT-01.andrew.ad.cmu.edu> Message-ID: <20180329031506.IcZISPvTL%predragp@andrew.cmu.edu>

Dear Autonians,

I have another update on the NVidia driver issue. I have actually reinstalled the driver and CUDA-9.0 on GPU9 but the issue is still here. Please see the detailed report below. I have seen a few people reporting this very stupidity with NVidia hardware. Their solution is a cold reboot. I have rebooted this machine multiple times, but every time remotely with the reboot command. That is a so-called soft reboot, where the power actually never gets completely cut off. Tomorrow I will go to the machine room, turn the machine off for 10 minutes, and bring it back online. We will see if that helps.

Predrag

root at gpu9$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  390.30  Wed Jan 31 22:08:49 PST 2018
GCC version:  gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC)

root at gpu9$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

root at gpu9$ nvidia-smi
Wed Mar 28 23:13:13 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:02:00.0 Off |                  N/A |
| 23%   40C    P0    61W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            Off  | 00000000:03:00.0 Off |                  N/A |
| 24%   43C    P0    61W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            Off  | 00000000:82:00.0 Off |                  N/A |
| 23%   40C    P0    62W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN Xp            Off  | 00000000:83:00.0 Off |                  N/A |
| 23%   42C    P0    62W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

root at gpu9$ ls
deviceQuery  deviceQuery.cpp  deviceQuery.o  Makefile  NsightEclipse.xml  readme.txt

root at gpu9$ ./deviceQuery
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 30
-> unknown error
Result = FAIL

> On Wed, Mar 28, 2018 at 6:58 PM, Predrag Punosevac > wrote: > > Barnabas Poczos wrote: > > >> If this can't be fixed quickly, then would it be possible to do a roll > >> back on these GPU machines (5,6,9) to the latest state when they > >> worked fine? > >> (If I know correctly, they are down since March 23.) > >> > >> Sorry for bugging you with this, I just want to find a quick solution > >> to make these 12 GPU cards usable again with pytorch and tensorflow > >> because several deadlines are coming. > >> > >> Many thanks! ... and sorry for annoying you with this! > >> > > > > Ok, Yotam and I spent the last 3-4h debugging this. It is not a PyTorch nor > > a TensorFlow issue. It is not even a CUDA issue. The NVidia driver itself is > > broken. I have no idea how it happened on some machines and didn't > > happen on others (all GPU machines with the exception of GPU-7 run the > > same latest Red Hat 3.10.0-693.21.1.el7.x86 kernel). The clue should > > have been the fact that MATLAB also got broken on some machines. > > My hunch is that the NVidia driver gets recompiled during the kernel update > > and apparently that is not as robust as it should be. > > > > The plan of action is that I will try to remove everything NVidia > > related from the GPU9 machine and try to reinstall the driver and CUDA from > > scratch. Hopefully GPU9 will become functional just like GPU8. Once it > > works for GPU9 I can go and fix the other machines. If that doesn't work I > > will reinstall GPU9 from scratch.
> > > > > > Best, > > Predrag > > > > > > > >> Cheers, >> Barnabas >> ====================== >> Barnabas Poczos, PhD >> Assistant Professor >> Machine Learning Department >> Carnegie Mellon University >>

From awd at andrew.cmu.edu Thu Mar 29 03:33:55 2018 From: awd at andrew.cmu.edu (Artur Dubrawski) Date: Thu, 29 Mar 2018 03:33:55 -0400 Subject: Traffic Jam helps good people do good things Message-ID:

See our Artificial Intelligence Expert at work, making a difference: https://dms.licdn.com/playback/C4E05AQF3_HZSNDO8vw/05ccd005919445b4b6228a5c7c905b42/feedshare-mp4_500/1479932728445-v0ch3x?e=1522396800&v=alpha&t=rJnRwa84uB2jPFmPP8XoBUTViphAES0aNJiAGGMGKyU

It is a very cool video even though CMU Auton Lab or Traffic Jam software are not mentioned by name. Yet, Deliver Fund are important partners who do things our software and our analysts do not do: physically face the criminals and physically pull the sex trafficking victims off their nasty hands. They use Traffic Jam to prioritize and plan their field activity, and to train law enforcement officers to do the same in their respective jurisdictions.

Cheers, Artur -------------- next part -------------- An HTML attachment was scrubbed... URL:

From predragp at andrew.cmu.edu Thu Mar 29 14:44:25 2018 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Thu, 29 Mar 2018 14:44:25 -0400 Subject: GPU8 In-Reply-To: References: Message-ID: <20180329184425.atDL-wIha%predragp@andrew.cmu.edu>

Yotam Hechtlinger wrote: > Hello Predrag, > > There might be a bug with GPU8 also. > I didn't have time to test it yet, but python crashes when trying to call > keras.

I did a cold reboot. It didn't help. I think what we see is a bug in driver 390.30. The bug could be Titan Xp specific; that is why we see the older machines working. NVidia has a website where one can download the scripts to recompile the latest driver. I think the latest driver is 390.48, which is quite a few versions ahead of 390.30. I am installing it right now on GPU9. If that doesn't work I will try downgrading the kernel, on the assumption that it is a kernel bug. The following kernels are available:

kernel.x86_64  3.10.0-693.5.2.el7   @updates
kernel.x86_64  3.10.0-693.11.6.el7  @updates
kernel.x86_64  3.10.0-693.21.1.el7

Right now I am running 3.10.0-693.21.1 but we can try to go one or even two kernels back.

If all that fails I still have a few magic tricks in my hat but they are related to motherboard firmware. GPU8 and GPU9 have the same motherboards, but the other servers do not. (A quick smoke test is in the P.S. at the bottom of this message.)

Best,
Predrag

> Unlike GPU 5,6 & 9, you can actually get the GPU working, but when I run a > keras prediction function it crashes and says: > > Loaded runtime CuDNN library: 7101 (compatibility version 7100) but source > was compiled with 7004 (compatibility version 7000). If using a binary > install, upgrade your CuDNN library to match. If building from sources, > make sure the library loaded at runtime matches a compatible version > specified during compile configuration. > 2018-03-29 09:57:49.807855: F tensorflow/core/kernels/conv_ops.cc:717] > Check failed: stream->parent()->GetConvolveAlgorithms( > conv_parameters.ShouldIncludeWinogradNonfusedAlgo(), &algorithms) > > Same code works on GPU4. > I know this is not informative, I'll look into it later, just wanted to > give you a heads up. > I think this might be why there aren't any users on GPU8 but there are on > GPU4. > > Thanks, > Yotam.
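P.S. For anyone who wants to check whether a given node is affected before e-mailing me, a minimal smoke test along these lines should do. This is only a sketch: it assumes the system miniconda in /opt/miniconda3 and the stock TensorFlow/PyTorch installs, so adjust the interpreter path to whatever you actually use.

# PyTorch: should print a zeroed 2x3x4 cuda tensor, not a THCudaCheck FAIL
/opt/miniconda3/bin/python3.6 -c "import torch; print(torch.cuda.FloatTensor(2,3,4).zero_())"

# TensorFlow: should list the GPUs instead of failing with CUDA_ERROR_UNKNOWN
/opt/miniconda3/bin/python3.6 -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"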
From predragp at andrew.cmu.edu Thu Mar 29 15:25:40 2018 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Thu, 29 Mar 2018 15:25:40 -0400 Subject: GPU problem fixed! In-Reply-To: <20180329184425.atDL-wIha%predragp@andrew.cmu.edu> References: <20180329184425.atDL-wIha%predragp@andrew.cmu.edu> Message-ID: <20180329192540.j26RKf_wI%predragp@andrew.cmu.edu>

Dear Autonians,

This is now fixed! Apparently we hit a serious driver bug with 390.30. Please try now to compile TensorFlow and PyTorch on GPU9.

Predrag Punosevac

Peer access from TITAN Xp (GPU0) -> TITAN Xp (GPU1) : Yes
Peer access from TITAN Xp (GPU0) -> TITAN Xp (GPU2) : No
Peer access from TITAN Xp (GPU0) -> TITAN Xp (GPU3) : No
Peer access from TITAN Xp (GPU1) -> TITAN Xp (GPU0) : Yes
Peer access from TITAN Xp (GPU1) -> TITAN Xp (GPU2) : No
Peer access from TITAN Xp (GPU1) -> TITAN Xp (GPU3) : No
Peer access from TITAN Xp (GPU2) -> TITAN Xp (GPU0) : No
Peer access from TITAN Xp (GPU2) -> TITAN Xp (GPU1) : No
Peer access from TITAN Xp (GPU2) -> TITAN Xp (GPU3) : Yes
Peer access from TITAN Xp (GPU3) -> TITAN Xp (GPU0) : No
Peer access from TITAN Xp (GPU3) -> TITAN Xp (GPU1) : No
Peer access from TITAN Xp (GPU3) -> TITAN Xp (GPU2) : Yes
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 9.1, NumDevs = 4
Result = PASS

I will ping you with a plan of action as soon as Kyle and I stop dancing. We are kind of in a celebratory mood right now. We will have to first fix the servers with higher numbers, which have Titan Xp cards and newer motherboards, before moving to the lower-number servers with older GPU cards.

Predrag

> Yotam Hechtlinger wrote: > > Hello Predrag, > > > > There might be a bug with GPU8 also. > > I didn't have time to test it yet, but python crashes when trying to call > > keras. > > I did a cold reboot. It didn't help. I think what we see is a bug in > driver 390.30. The bug could be Titan Xp specific; that is why we see the > older machines working. NVidia has a website where one can download the > scripts to recompile the latest driver. I think the > latest driver is 390.48, which is quite a few versions ahead of 390.30. > I am installing it right now on GPU9. If that doesn't work I will try > downgrading the kernel, on the assumption that it is a kernel bug. The > following kernels are available: > > kernel.x86_64 3.10.0-693.5.2.el7 @updates > kernel.x86_64 3.10.0-693.11.6.el7 @updates > kernel.x86_64 3.10.0-693.21.1.el7 > > Right now I am running 3.10.0-693.21.1 but we can try to go one or even > two kernels back. > > If all that fails I still have a few magic tricks in my hat but they are > related to motherboard firmware. GPU8 and GPU9 have the same > motherboards, but the other servers do not. > > Best, > Predrag > > > > Unlike GPU 5,6 & 9, you can actually get the GPU working, but when I run a > > keras prediction function it crashes and says: > > > > Loaded runtime CuDNN library: 7101 (compatibility version 7100) but source > > was compiled with 7004 (compatibility version 7000). If using a binary > > install, upgrade your CuDNN library to match. If building from sources, > > make sure the library loaded at runtime matches a compatible version > > specified during compile configuration. > > 2018-03-29 09:57:49.807855: F tensorflow/core/kernels/conv_ops.cc:717] > > Check failed: stream->parent()->GetConvolveAlgorithms( > > conv_parameters.ShouldIncludeWinogradNonfusedAlgo(), &algorithms) > > > > Same code works on GPU4.
> > I know this is not informative, I'll look into it later, just wanted to > > give you a heads up. > > I think this might be why there aren't any users on GPU8 but there are on > > GPU4. > > > > Thanks, > > Yotam.

From predragp at andrew.cmu.edu Thu Mar 29 16:44:16 2018 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Thu, 29 Mar 2018 16:44:16 -0400 Subject: GPU status update Message-ID: <20180329204416.iK3izftKV%predragp@andrew.cmu.edu>

Dear Autonians,

The NVidia driver is now updated to 390.48 on GPU5, GPU6, GPU8, and GPU9. There are no other machines (GPU 7 is treated separately due to its current use) with Titan Xp cards. Titan X cards were unaffected by the driver bug in 390.30 according to initial reports.

I can use the GPU from MATLAB on GPU5, 6, 8, and 9. CUDA-8 is removed from all those servers. CUDA-9 and CUDA-9.1 are there. Servers should default to cuda-9.0 due to the fact that TensorFlow and PyTorch are not released for 9.1.

I really need people to test this now. Please make sure your local paths and library links are fixed before e-mailing me (an example is at the end of this message).

People who need that proprietary Intel Library or cuDNN will have to wait until we get this right so that all GPU servers have basic functionality. As you can see there are a lot of moving parts in these servers and they don't quite act like computers you can buy in Walmart.

MATLAB was removed previously from GPU1 and GPU2 due to the lack of space. I will be putting it back shortly. I will put the latest 2018a release.
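For reference, fixing the local paths usually amounts to something like this in your shell startup file. This is only a sketch -- the /usr/local/cuda-9.0 prefix is my assumption, so check where cuda-9.0 actually lives on the node you use:

# put the CUDA 9.0 toolchain (nvcc etc.) on the PATH
export PATH=/usr/local/cuda-9.0/bin:$PATH

# and its runtime libraries on the dynamic linker path
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64:$LD_LIBRARY_PATH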
Predrag From predragp at andrew.cmu.edu Thu Mar 29 17:30:48 2018 From: predragp at andrew.cmu.edu (Predrag Punosevac) Date: Thu, 29 Mar 2018 17:30:48 -0400 Subject: Migrated SVN repos to Git Message-ID: <20180329213048.Sdps47gJB%predragp@andrew.cmu.edu> -------- Original Message -------- Date: Thu, 29 Mar 2018 17:29:12 -0400 From: Predrag Punosevac To: donghanw at cs.cmu.edu Subject: Re: Migrated SVN repos to Git Donghan Wang wrote: > Hi Predrag, > > I migrated all 74 SVN repos to Git. They are available on Gogs at > http://git.int.autonlab.org/SVN. > Good job! We will continue to make CVS and SVN visible through VIEWVC for historical reasons but nobody should really use that stuff. They are read only anyway. > There are two giant repos > > - 1.3GB SVN/prateekt > - 3.3GB SVN/radiation_hunter > > Do you see any problems with them? > > The second question is how to set up the Gogs permission correctly so that > people can access them? Maybe something similar to http://git.int.autonlab. > org/C? Gogs is plugged into the LDAP so anybody with a valid Auton Lab account can log into the Gogs interface from one of internal machines (X2Go needs to be used for external access) upload her/his ssh key and just use the thing with ssh or via http. http://git.int.autonlab.org/user/login If you want to hide some repositories from praying eyes make them private. Gogs support the same security paradigm like GitHub. Owner of the repo should decide if they want repo public. Predrag > > Thanks, > Jarod From bapoczos at cs.cmu.edu Thu Mar 29 18:48:52 2018 From: bapoczos at cs.cmu.edu (Barnabas Poczos) Date: Thu, 29 Mar 2018 18:48:52 -0400 Subject: GPU status update In-Reply-To: <20180329204416.iK3izftKV%predragp@andrew.cmu.edu> References: <20180329204416.iK3izftKV%predragp@andrew.cmu.edu> Message-ID: Awesome! Many thanks Predrag for fixing these machines! Best, Barnabas ====================== Barnabas Poczos, PhD Assistant Professor Machine Learning Department Carnegie Mellon University On Thu, Mar 29, 2018 at 4:44 PM, Predrag Punosevac wrote: > Dear Autonians, > > The NVidia driver is now updated to 390.48 on GPU5, GPU6, GPU8, GPU9 > There no other machines (GPU 7 is treated separately due to its current > use) with Titan Xp cards. Titan X crads were unaffected by a driver bug > in 930.30 according to intial reports. > > I can use GPU from the MATLAB on GPU5, 6, 8, 9. CUDA-8 is removed from > all those servers. CUDA-9 and CUDA-9.1 are there. Server should default > to cuda-9.0 due to the fact that TensorFlow and PyTorch are not released > for 9.1. > > I really need people to test this now. Please make sure you local paths > and library links are fixed before e-mailing me. > > People who need that proprietary Intel Library or cuDNN will have to > wait until we get this right so that all GPU servers have basic > functionality. As you can see there are lot of moving parts in these > servers and they don't quite act like computers you can buy in Wallmart. > > > MATLAB was removed previously from GPU1 and GPU2 due to the lack of > space. I will be putting it as shortly. I will put the latest 2018a > release. > > > > Predrag