<div dir="ltr">Ok, that should be enough.<div><br></div><div>samy</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Oct 21, 2016 at 4:44 PM, Barnabas Poczos <span dir="ltr"><<a href="mailto:bapoczos@cs.cmu.edu" target="_blank">bapoczos@cs.cmu.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Samy,<br>

<br>

Gpu1 will still have Matlab and 4 K80 GPUs (which is technically 8<br>

GPUs). Won't that be enough for now?<br>

<span class="im HOEnZb"><br>

Best,<br>

B<br>

======================<br>

Barnabas Poczos, PhD<br>

Assistant Professor<br>

Machine Learning Department<br>

Carnegie Mellon University<br>

<br>

<br>

</span><div class="HOEnZb"><div class="h5">On Fri, Oct 21, 2016 at 4:21 PM, Kirthevasan Kandasamy<br>

<<a href="mailto:kandasamy@cmu.edu">kandasamy@cmu.edu</a>> wrote:<br>

> Hi all,<br>

><br>

> I was planning on using Matlab with GPUs for one of my projects.<br>

> Can we please keep gpu2 as it is for now?<br>

><br>

> samy<br>

><br>

> On Fri, Oct 21, 2016 at 3:54 PM, Barnabas Poczos <<a href="mailto:bapoczos@cs.cmu.edu">bapoczos@cs.cmu.edu</a>><br>

> wrote:<br>

>><br>

>> Sounds good. Let us have tensorflow system wide on all GPU nodes. We<br>

>> can worry about Matlab later.<br>

>><br>

>> Best,<br>

>> B<br>

>> ======================<br>

>> Barnabas Poczos, PhD<br>

>> Assistant Professor<br>

>> Machine Learning Department<br>

>> Carnegie Mellon University<br>

>><br>

>><br>

>> On Fri, Oct 21, 2016 at 3:50 PM, Predrag Punosevac <<a href="mailto:predragp@cs.cmu.edu">predragp@cs.cmu.edu</a>><br>

>> wrote:<br>

>> > Barnabas Poczos <<a href="mailto:bapoczos@cs.cmu.edu">bapoczos@cs.cmu.edu</a>> wrote:<br>

>> ><br>

>> >> Hi Predrag,<br>

>> >><br>

>> >> If there is no other solution, then I think it is OK not to have<br>

>> >> Matlab on GPU2 and GPU3.<br>

>> >> Tensorflow has higher priority on these nodes.<br>

>> ><br>

>> > We could possibly have multiple CUDA libraries for different versions<br>

>> > but that is going to bite us in the rear end quickly. People who want<br>

>> > to use MATLAB with GPUs will have to live with GPU1 probably until<br>

>> > Spring release of MATLAB.<br>

>> ><br>

>> > Predrag<br>

>> ><br>

>> >><br>

>> >> Best,<br>

>> >> Barnabas<br>

>> >><br>

>> >><br>

>> >><br>

>> >><br>

>> >> ======================<br>

>> >> Barnabas Poczos, PhD<br>

>> >> Assistant Professor<br>

>> >> Machine Learning Department<br>

>> >> Carnegie Mellon University<br>

>> >><br>

>> >><br>

>> >> On Fri, Oct 21, 2016 at 3:37 PM, Predrag Punosevac<br>

>> >> <<a href="mailto:predragp@cs.cmu.edu">predragp@cs.cmu.edu</a>> wrote:<br>

>> >> > Dougal Sutherland <<a href="mailto:dougal@gmail.com">dougal@gmail.com</a>> wrote:<br>

>> >> ><br>

>> >> ><br>

>> >> > Sorry that I am late for the party. This is my interpretation of what<br>

>> >> > we<br>

>> >> > should do.<br>

>> >> ><br>

>> >> > 1. I will go back to CUDA 8.0 which will break MATLAB. We have to<br>

>> >> > live<br>

>> >> > with it. Barnabas please OK this. I will work with MathWorks for this<br>

>> >> > to<br>

>> >> > be fixed for 2017a release.<br>

>> >> ><br>

>> >> > 2. Then I could install TensorFlow compiled by Dougal system wide.<br>

>> >> > Dougal, after I upgrade back to 8.0, please recompile it using<br>

>> >> > CUDA<br>

>> >> > 8.0. I could give you the root password so that you can compile and<br>

>> >> > install directly.<br>

>> >> ><br>

>> >> > 3. If everyone is OK with above I will pull the trigger on GPU3 at<br>

>> >> > 4:30PM and upgrade to 8.0<br>

>> >> ><br>

>> >> > 4. MATLAB will be broken on GPU2 as well after I put Titan cards<br>

>> >> > during<br>

>> >> > the October 25 power outage.<br>

>> >> ><br>

>> >> > Predrag<br>

>> >> ><br>

>> >> ><br>

>> >> ><br>

>> >> ><br>

>> >> ><br>

>> >> ><br>

>> >> >> Heh. :)<br>

>> >> >><br>

>> >> >> An explanation:<br>

>> >> >><br>

>> >> >>    - Different nvidia gpu architectures are called "compute<br>

>> >> >> capabilities".<br>

>> >> >>    This is a number that describes the behavior of the card: the<br>

>> >> >> maximum size<br>

>> >> >>    of various things, which API functions it supports, etc. There's<br>

>> >> >> a<br>

>> >> >>    reference here<br>

>> >> >><br>

>> >> >> <<a href="https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications" rel="noreferrer" target="_blank">https://en.wikipedia.org/<wbr>wiki/CUDA#Version_features_<wbr>and_specifications</a>>,<br>

>> >> >>    but it shouldn't really matter.<br>

>> >> >>    - When CUDA compiles code, it targets a certain architecture,<br>

>> >> >> since it<br>

>> >> >>    needs to know what features to use and whatnot. I *think* that if<br>

>> >> >> you<br>

>> >> >>    compile for compute capability x, it will work on a card with<br>

>> >> >> compute<br>

>> >> >>    capability y approximately iff x <= y.<br>

>> >> >>    - Pascal Titan Xs, like gpu3 has, have compute capability 6.1.<br>

>> >> >>    - CUDA 7.5 doesn't know about compute capability 6.1, so if you<br>

>> >> >> ask to<br>

>> >> >>    compile for 6.1 it crashes.<br>

>> >> >>    - Theano by default tries to compile for the capability of the<br>

>> >> >> card, but<br>

>> >> >>    can be configured to compile for a different capability.<br>

>> >> >>    - Tensorflow asks for a list of capabilities to compile for when<br>

>> >> >> you<br>

>> >> >>    build it in the first place.<br>
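The rule of thumb in the list above (code compiled for compute capability x runs on a card with capability y approximately iff x <= y) can be sketched in a few lines of Python. This is only an illustration, not anything CUDA, Theano, or TensorFlow ships: the helper names `parse_capability`, `can_run`, and `pick_target` are hypothetical, and the list of capabilities passed in as "supported by the toolkit" is an assumed example rather than an authoritative table.

```python
# Hypothetical helpers illustrating the approximate rule quoted above:
# code built for capability x runs on a card of capability y when y >= x.
# Real CUDA compatibility has more nuance; treat this as a rough sketch.


def parse_capability(text):
    """Turn a string such as "6.1" into a comparable tuple such as (6, 1)."""
    major, minor = text.split(".")
    return int(major), int(minor)


def can_run(built_for, card):
    """Return True if code built for `built_for` should run on `card`."""
    return parse_capability(card) >= parse_capability(built_for)


def pick_target(card, supported_by_toolkit):
    """Pick the highest capability the toolkit can target that the card runs.

    `supported_by_toolkit` is an assumed, illustrative list; consult the
    toolkit documentation for the real one. Returns None if nothing fits.
    """
    usable = [cap for cap in supported_by_toolkit if can_run(cap, card)]
    if not usable:
        return None
    return max(usable, key=parse_capability)


# A toolkit that does not know about 6.1 can still target an older
# capability that a 6.1 card is able to run.
target = pick_target("6.1", ["3.5", "5.0", "5.2"])
print(target)
```

With the assumed list above, the sketch falls back to 5.2 for a 6.1 card, which mirrors the workaround described in this thread of passing an older capability to nvcc when the toolkit does not recognize the card's own.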

>> >> >><br>

>> >> >><br>

>> >> >> On Fri, Oct 21, 2016 at 8:17 PM Dougal Sutherland <<a href="mailto:dougal@gmail.com">dougal@gmail.com</a>><br>

>> >> >> wrote:<br>

>> >> >><br>

>> >> >> > They do work with 7.5 if you specify an older compute<br>

>> >> >> > architecture; it's<br>

>> >> >> > just that their actual compute capability of 6.1 isn't supported<br>

>> >> >> > by cuda<br>

>> >> >> > 7.5. Theano is thrown off by this, for example, but it can be fixed<br>

>> >> >> > by<br>

>> >> >> > telling it to pass compute capability 5.2 (for example) to nvcc. I<br>

>> >> >> > don't<br>

>> >> >> > think that this was my problem with building tensorflow on 7.5;<br>

>> >> >> > I'm not<br>

>> >> >> > sure what that was.<br>

>> >> >> ><br>

>> >> >> > On Fri, Oct 21, 2016, 8:11 PM Kirthevasan Kandasamy<br>

>> >> >> > <<a href="mailto:kandasamy@cmu.edu">kandasamy@cmu.edu</a>><br>

>> >> >> > wrote:<br>

>> >> >> ><br>

>> >> >> > Thanks Dougal. I'll take a look at this and get back to you.<br>

>> >> >> > So are you suggesting that this is an issue with TitanX's not<br>

>> >> >> > being<br>

>> >> >> > compatible with 7.5?<br>

>> >> >> ><br>

>> >> >> > On Fri, Oct 21, 2016 at 3:08 PM, Dougal Sutherland<br>

>> >> >> > <<a href="mailto:dougal@gmail.com">dougal@gmail.com</a>><br>

>> >> >> > wrote:<br>

>> >> >> ><br>

>> >> >> > I installed it in my scratch directory (not sure if there's a<br>

>> >> >> > global<br>

>> >> >> > install?). The main thing was to put its cache on scratch; it got<br>

>> >> >> > really<br>

>> >> >> > upset when the cache directory was on NFS. (Instructions at the<br>

>> >> >> > bottom of<br>

>> >> >> > my previous email.)<br>

>> >> >> ><br>

>> >> >> > On Fri, Oct 21, 2016, 8:04 PM Barnabas Poczos<br>

>> >> >> > <<a href="mailto:bapoczos@cs.cmu.edu">bapoczos@cs.cmu.edu</a>> wrote:<br>

>> >> >> ><br>

>> >> >> > That's great! Thanks Dougal.<br>

>> >> >> ><br>

>> >> >> > As I remember bazel was not installed correctly previously on<br>

>> >> >> > GPU3. Do<br>

>> >> >> > you know what went wrong with it before and why it is good now?<br>

>> >> >> ><br>

>> >> >> > Thanks,<br>

>> >> >> > Barnabas<br>

>> >> >> > ======================<br>

>> >> >> > Barnabas Poczos, PhD<br>

>> >> >> > Assistant Professor<br>

>> >> >> > Machine Learning Department<br>

>> >> >> > Carnegie Mellon University<br>

>> >> >> ><br>

>> >> >> ><br>

>> >> >> > On Fri, Oct 21, 2016 at 2:03 PM, Dougal Sutherland<br>

>> >> >> > <<a href="mailto:dougal@gmail.com">dougal@gmail.com</a>><br>

>> >> >> > wrote:<br>

>> >> >> > > I was just able to build tensorflow 0.11.0rc0 on gpu3! I used<br>

>> >> >> > > the cuda<br>

>> >> >> > 8.0<br>

>> >> >> > > install, and it built fine. So additionally installing 7.5 was<br>

>> >> >> > > probably<br>

>> >> >> > not<br>

>> >> >> > > necessary; in fact, cuda 7.5 doesn't know about the 6.1 compute<br>

>> >> >> > architecture<br>

>> >> >> > > that the Titan Xs use, so Theano at least needs to be manually<br>

>> >> >> > > told to<br>

>> >> >> > use<br>

>> >> >> > > an older architecture.<br>

>> >> >> > ><br>

>> >> >> > > A pip package is in<br>

>> >> >> > > ~dsutherl/tensorflow-0.11.<wbr>0rc0-py2-none-any.whl. I<br>

>> >> >> > think<br>

>> >> >> > > it should work fine with the cudnn in my scratch directory.<br>

>> >> >> > ><br>

>> >> >> > > You should probably install it to scratch, either running this<br>

>> >> >> > > first to<br>

>> >> >> > put<br>

>> >> >> > > libraries in your scratch directory or using a virtualenv or<br>

>> >> >> > > something:<br>

>> >> >> > > export PYTHONUSERBASE=/home/scratch/$<wbr>USER/.local<br>

>> >> >> > ><br>

>> >> >> > > You'll need this to use the library and probably to install it:<br>

>> >> >> > > export<br>

>> >> >> > ><br>

>> >> >> ><br>

>> >> >> > LD_LIBRARY_PATH=/home/scratch/<wbr>dsutherl/cudnn-8.0-5.1/cuda/<wbr>lib64:"$LD_LIBRARY_PATH"<br>

>> >> >> > ><br>

>> >> >> > > To install:<br>

>> >> >> > > pip install --user<br>

>> >> >> > > ~dsutherl/tensorflow-0.11.<wbr>0rc0-py2-none-any.whl<br>

>> >> >> > > (remove --user if you're using a virtualenv)<br>

>> >> >> > ><br>

>> >> >> > > (A request: I'm submitting to ICLR in two weeks, and for some of<br>

>> >> >> > > the<br>

>> >> >> > models<br>

>> >> >> > > I'm running gpu3's cards are 4x the speed of gpu1 or 2's. So<br>

>> >> >> > > please don't<br>

>> >> >> > > run a ton of stuff on gpu3 unless you're working on a deadline<br>

>> >> >> > > too.)<br>

>> >> >> > ><br>

>> >> >> > ><br>

>> >> >> > ><br>

>> >> >> > > Steps to install it, for the future:<br>

>> >> >> > ><br>

>> >> >> > > Install bazel in your home directory:<br>

>> >> >> > ><br>

>> >> >> > > wget<br>

>> >> >> > ><br>

>> >> >> ><br>

>> >> >> > <a href="https://github.com/bazelbuild/bazel/releases/download/0.3.2/bazel-0.3.2-installer-linux-x86_64.sh" rel="noreferrer" target="_blank">https://github.com/bazelbuild/<wbr>bazel/releases/download/0.3.2/<wbr>bazel-0.3.2-installer-linux-<wbr>x86_64.sh</a><br>

>> >> >> > > bash bazel-0.3.2-installer-linux-x86_64.sh<br>

>> >> >> > > --prefix=/home/scratch/$USER<br>

>> >> >> > > --base=/home/scratch/$USER/.<wbr>bazel<br>

>> >> >> > ><br>

>> >> >> > > Configure bazel to build in scratch. There's probably a better<br>

>> >> >> > > way to do<br>

>> >> >> > > this, but this works:<br>

>> >> >> > ><br>

>> >> >> > > mkdir /home/scratch/$USER/.cache<br>

>> >> >> > > ln -s /home/scratch/$USER/.cache/<wbr>bazel ~/.cache/bazel<br>

>> >> >> > ><br>

>> >> >> > > Build tensorflow. Note that builds from git checkouts don't<br>

>> >> >> > > work, because<br>

>> >> >> > > they assume a newer version of git than is on gpu3:<br>

>> >> >> > ><br>

>> >> >> > > cd /home/scratch/$USER<br>

>> >> >> > > wget<br>

>> >> >> > > tar xf<br>

>> >> >> > > cd tensorflow-0.11.0rc0<br>

>> >> >> > > ./configure<br>

>> >> >> > ><br>

>> >> >> > > This is an interactive script that doesn't seem to let you pass<br>

>> >> >> > arguments or<br>

>> >> >> > > anything. It's obnoxious.<br>

>> >> >> > > Use the default python<br>

>> >> >> > > don't use cloud platform or hadoop file system<br>

>> >> >> > > use the default site-packages path if it asks<br>

>> >> >> > > build with GPU support<br>

>> >> >> > > default gcc<br>

>> >> >> > > default Cuda SDK version<br>

>> >> >> > > specify /usr/local/cuda-8.0<br>

>> >> >> > > default cudnn version<br>

>> >> >> > > specify $CUDNN_DIR from use-cudnn.sh, e.g.<br>

>> >> >> > > /home/scratch/dsutherl/cudnn-<wbr>8.0-5.1/cuda<br>

>> >> >> > > Pascal Titan Xs have compute capability 6.1<br>

>> >> >> > ><br>

>> >> >> > > bazel build -c opt --config=cuda<br>

>> >> >> > > //tensorflow/tools/pip_<wbr>package:build_pip_package<br>

>> >> >> > > bazel-bin/tensorflow/tools/<wbr>pip_package/build_pip_package ./<br>

>> >> >> > > A .whl file, e.g. tensorflow-0.11.0rc0-py2-none-<wbr>any.whl, is put<br>

>> >> >> > > in the<br>

>> >> >> > > directory you specified above.<br>

>> >> >> > ><br>

>> >> >> > ><br>

>> >> >> > > - Dougal<br>

>> >> >> > ><br>

>> >> >> > ><br>

>> >> >> > > On Fri, Oct 21, 2016 at 6:14 PM Kirthevasan Kandasamy<br>

>> >> >> > > <<a href="mailto:kandasamy@cmu.edu">kandasamy@cmu.edu</a><br>

>> >> >> > ><br>

>> >> >> > > wrote:<br>

>> >> >> > >><br>

>> >> >> > >> Predrag,<br>

>> >> >> > >><br>

>> >> >> > >> Any updates on gpu3?<br>

>> >> >> > >> I have tried both tensorflow and chainer and in both cases the<br>

>> >> >> > >> problem<br>

>> >> >> > >> seems to be with cuda<br>

>> >> >> > >><br>

>> >> >> > >> On Wed, Oct 19, 2016 at 4:10 PM, Predrag Punosevac<br>

>> >> >> > >> <<a href="mailto:predragp@cs.cmu.edu">predragp@cs.cmu.edu</a><br>

>> >> >> > ><br>

>> >> >> > >> wrote:<br>

>> >> >> > >>><br>

>> >> >> > >>> Dougal Sutherland <<a href="mailto:dougal@gmail.com">dougal@gmail.com</a>> wrote:<br>

>> >> >> > >>><br>

>> >> >> > >>> > I tried for a while. I failed.<br>

>> >> >> > >>> ><br>

>> >> >> > >>><br>

>> >> >> > >>> Damn this doesn't look good. I guess back to the drawing<br>

>> >> >> > >>> board. Thanks<br>

>> >> >> > >>> for the quick feedback.<br>

>> >> >> > >>><br>

>> >> >> > >>> Predrag<br>

>> >> >> > >>><br>

>> >> >> > >>> > Version 0.10.0 fails immediately on build: "The specified<br>

>> >> >> > >>> > --crosstool_top<br>

>> >> >> > >>> > '@local_config_cuda//<wbr>crosstool:crosstool' is not a valid<br>

>> >> >> > >>> > cc_toolchain_suite<br>

>> >> >> > >>> > rule." Apparently this is because 0.10 required an older<br>

>> >> >> > >>> > version of<br>

>> >> >> > >>> > bazel (<br>

>> >> >> > >>> > <a href="https://github.com/tensorflow/tensorflow/issues/4368" rel="noreferrer" target="_blank">https://github.com/tensorflow/<wbr>tensorflow/issues/4368</a>), and I<br>

>> >> >> > >>> > don't<br>

>> >> >> > have<br>

>> >> >> > >>> > the<br>

>> >> >> > >>> > energy to install an old version of bazel.<br>

>> >> >> > >>> ><br>

>> >> >> > >>> > Version 0.11.0rc0 gets almost done and then complains about<br>

>> >> >> > >>> > no such<br>

>> >> >> > >>> > file or<br>

>> >> >> > >>> > directory for libcudart.so.7.5 (which is there, where I told<br>

>> >> >> > tensorflow<br>

>> >> >> > >>> > it<br>

>> >> >> > >>> > was...).<br>

>> >> >> > >>> ><br>

>> >> >> > >>> > Non-release versions from git fail immediately because they<br>

>> >> >> > >>> > call git<br>

>> >> >> > -C<br>

>> >> >> > >>> > to<br>

>> >> >> > >>> > get version info, which is only in git 1.9 (we have 1.8).<br>

>> >> >> > >>> ><br>

>> >> >> > >>> ><br>

>> >> >> > >>> > Some other notes:<br>

>> >> >> > >>> > - I made a symlink from ~/.cache/bazel to<br>

>> >> >> > >>> > /home/scratch/$USER/.cache/<wbr>bazel,<br>

>> >> >> > >>> > because bazel is the worst. (It complains about doing things<br>

>> >> >> > >>> > on NFS,<br>

>> >> >> > >>> > and<br>

>> >> >> > >>> > hung for me [clock-related?], and I can't find a global<br>

>> >> >> > >>> > config file<br>

>> >> >> > or<br>

>> >> >> > >>> > anything to change that in; it seems like there might be<br>

>> >> >> > >>> > one, but<br>

>> >> >> > their<br>

>> >> >> > >>> > documentation is terrible.)<br>

>> >> >> > >>> ><br>

>> >> >> > >>> > - I wasn't able to use the actual Titan X compute capability<br>

>> >> >> > >>> > of 6.1,<br>

>> >> >> > >>> > because that requires cuda 8; I used 5.2 instead. Probably<br>

>> >> >> > >>> > not a huge<br>

>> >> >> > >>> > deal,<br>

>> >> >> > >>> > but I don't know.<br>

>> >> >> > >>> ><br>

>> >> >> > >>> > - I tried explicitly including /usr/local/cuda/lib64 in<br>

>> >> >> > LD_LIBRARY_PATH<br>

>> >> >> > >>> > and<br>

>> >> >> > >>> > set CUDA_HOME to /usr/local/cuda before building, hoping<br>

>> >> >> > >>> > that would<br>

>> >> >> > >>> > help<br>

>> >> >> > >>> > with the 0.11.0rc0 problem, but it didn't.<br>

>> >> >> > >><br>

>> >> >> > >><br>

>> >> >> > ><br>

>> >> >> ><br>

>> >> >> ><br>

>> >> >> ><br>

><br>

><br>

</div></div></blockquote></div><br></div>