GPU3 back in business

Fri Oct 21 15:37:27 EDT 2016

Dougal Sutherland <dougal at gmail.com> wrote:

Sorry that I am late for the party. This is my interpretation of what we
should do.

1. I will go back to CUDA 8.0, which will break MATLAB. We have to live
with it. Barnabas, please OK this. I will work with MathWorks to get this
fixed for the 2017a release.

2. Then I could install the TensorFlow compiled by Dougal system wide.
Dougal, after I upgrade back to 8.0, please recompile it using CUDA
8.0. I could give you the root password so that you can compile and
install directly.

3. If everyone is OK with the above, I will pull the trigger on GPU3 at
4:30 PM and upgrade to 8.0.

4. MATLAB will be broken on GPU2 as well after I put in Titan cards during
the October 25 power outage.

Predrag 

> Heh. :)
> 
> An explanation:
> 
>    - Different nvidia gpu architectures are called "compute capabilities".
>    This is a number that describes the behavior of the card: the maximum size
>    of various things, which API functions it supports, etc. There's a
>    reference here
>    <https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications>,
>    but it shouldn't really matter.
>    - When CUDA compiles code, it targets a certain architecture, since it
>    needs to know what features to use and whatnot. I *think* that if you
>    compile for compute capability x, it will work on a card with compute
>    capability y approximately iff x <= y.
>    - Pascal Titan Xs, like gpu3 has, have compute capability 6.1.
>    - CUDA 7.5 doesn't know about compute capability 6.1, so if you ask to
>    compile for 6.1 it crashes.
>    - Theano by default tries to compile for the capability of the card, but
>    can be configured to compile for a different capability.
>    - Tensorflow asks for a list of capabilities to compile for when you
>    build it in the first place.
> 
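The rule of thumb in the list above (code built for capability x runs on a
card of capability y roughly when x <= y) can be sketched as a small shell
check. This only illustrates that approximation; it is not how CUDA itself
decides compatibility. The function names cc_to_int and runs_on are made up
for this sketch, and capabilities are assumed to be written as major.minor
with a single-digit minor.

```shell
# Convert a capability such as "6.1" to the integer 61 so two
# capabilities can be compared numerically.
cc_to_int() {
    echo $(( ${1%%.*} * 10 + ${1#*.} ))
}

# Exit status 0 if code built for capability $1 is expected to run on a
# card of capability $2, following the "x <= y" approximation.
runs_on() {
    [ "$(cc_to_int "$1")" -le "$(cc_to_int "$2")" ]
}

runs_on 5.2 6.1 && echo "a 5.2 build is expected to run on a 6.1 card"
runs_on 6.1 5.2 || echo "a 6.1 build is not expected to run on a 5.2 card"
```

This matches the workaround described in the thread: building for 5.2 with
CUDA 7.5 and running the result on the 6.1 cards in gpu3.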
> 
> On Fri, Oct 21, 2016 at 8:17 PM Dougal Sutherland <dougal at gmail.com> wrote:
> 
> > They do work with 7.5 if you specify an older compute architecture; it's
> > just that their actual compute capability of 6.1 isn't supported by cuda
> > 7.5. Theano is thrown off by this, for example, but it can be fixed by
> > telling it to pass compute capability 5.2 (for example) to nvcc. I don't
> > think that this was my problem with building tensorflow on 7.5; I'm not
> > sure what that was.
> >
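A minimal sketch of the Theano workaround described above, assuming the old
Theano CUDA backend, which takes extra nvcc options through its nvcc.flags
setting. The helper name arch_flag is made up for this sketch; 5.2 is the
example capability from the message.

```shell
# Turn a compute capability such as "5.2" into an nvcc architecture flag
# ("-arch=sm_52").
arch_flag() {
    echo "-arch=sm_$(printf '%s' "$1" | tr -d '.')"
}

# Assumption: Theano reads extra nvcc options from nvcc.flags.
# Note that this overwrites any THEANO_FLAGS value already set.
THEANO_FLAGS="nvcc.flags=$(arch_flag 5.2)"
export THEANO_FLAGS
echo "$THEANO_FLAGS"
```

The same setting could go in ~/.theanorc instead of the environment if it
should persist across sessions.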
> > On Fri, Oct 21, 2016, 8:11 PM Kirthevasan Kandasamy <kandasamy at cmu.edu>
> > wrote:
> >
> > Thanks Dougal. I'll take a look at this and get back to you.
> > So are you suggesting that this is an issue with TitanX's not being
> > compatible with 7.5?
> >
> > On Fri, Oct 21, 2016 at 3:08 PM, Dougal Sutherland <dougal at gmail.com>
> > wrote:
> >
> > I installed it in my scratch directory (not sure if there's a global
> > install?). The main thing was to put its cache on scratch; it got really
> > upset when the cache directory was on NFS. (Instructions at the bottom of
> > my previous email.)
> >
> > On Fri, Oct 21, 2016, 8:04 PM Barnabas Poczos <bapoczos at cs.cmu.edu> wrote:
> >
> > That's great! Thanks Dougal.
> >
> > As I remember bazel was not installed correctly previously on GPU3. Do
> > you know what went wrong with it before and why it is good now?
> >
> > Thanks,
> > Barnabas
> > ======================
> > Barnabas Poczos, PhD
> > Assistant Professor
> > Machine Learning Department
> > Carnegie Mellon University
> >
> >
> > On Fri, Oct 21, 2016 at 2:03 PM, Dougal Sutherland <dougal at gmail.com>
> > wrote:
> > > I was just able to build tensorflow 0.11.0rc0 on gpu3! I used the cuda 8.0
> > > install, and it built fine. So additionally installing 7.5 was probably not
> > > necessary; in fact, cuda 7.5 doesn't know about the 6.1 compute architecture
> > > that the Titan Xs use, so Theano at least needs to be manually told to use
> > > an older architecture.
> > >
> > > A pip package is in ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl. I think
> > > it should work fine with the cudnn in my scratch directory.
> > >
> > > You should probably install it to scratch, either running this first to put
> > > libraries in your scratch directory or using a virtualenv or something:
> > > export PYTHONUSERBASE=/home/scratch/$USER/.local
> > >
> > > You'll need this to use the library and probably to install it:
> > > export LD_LIBRARY_PATH=/home/scratch/dsutherl/cudnn-8.0-5.1/cuda/lib64:"$LD_LIBRARY_PATH"
> > >
> > > To install:
> > > pip install --user ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl
> > > (remove --user if you're using a virtualenv)
> > >
> > > (A request: I'm submitting to ICLR in two weeks, and for some of the models
> > > I'm running, gpu3's cards are 4x the speed of gpu1 or 2's. So please don't
> > > run a ton of stuff on gpu3 unless you're working on a deadline too.)
> > >
> > >
> > >
> > > Steps to install it, for the future:
> > >
> > > Install bazel in your scratch directory:
> > >
> > > wget https://github.com/bazelbuild/bazel/releases/download/0.3.2/bazel-0.3.2-installer-linux-x86_64.sh
> > > bash bazel-0.3.2-installer-linux-x86_64.sh --prefix=/home/scratch/$USER --base=/home/scratch/$USER/.bazel
> > >
> > > Configure bazel to build in scratch. There's probably a better way to do
> > > this, but this works:
> > >
> > > mkdir -p /home/scratch/$USER/.cache/bazel ~/.cache
> > > ln -s /home/scratch/$USER/.cache/bazel ~/.cache/bazel
> > >
> > > Build tensorflow. Note that builds from git checkouts don't work, because
> > > they assume a newer version of git than is on gpu3:
> > >
> > > cd /home/scratch/$USER
> > > wget
> > > tar xf
> > > cd tensorflow-0.11.0rc0
> > > ./configure
> > >
> > > This is an interactive script that doesn't seem to let you pass arguments or
> > > anything. It's obnoxious.
> > > Use the default python
> > > don't use cloud platform or hadoop file system
> > > use the default site-packages path if it asks
> > > build with GPU support
> > > default gcc
> > > default Cuda SDK version
> > > specify /usr/local/cuda-8.0
> > > default cudnn version
> > > specify $CUDNN_DIR from use-cudnn.sh, e.g.
> > > /home/scratch/dsutherl/cudnn-8.0-5.1/cuda
> > > Pascal Titan Xs have compute capability 6.1
> > >
> > > bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
> > > bazel-bin/tensorflow/tools/pip_package/build_pip_package ./
> > > A .whl file, e.g. tensorflow-0.11.0rc0-py2-none-any.whl, is put in the
> > > directory you specified above.
> > >
> > >
> > > - Dougal
> > >
> > >
> > > On Fri, Oct 21, 2016 at 6:14 PM Kirthevasan Kandasamy <kandasamy at cmu.edu>
> > > wrote:
> > >>
> > >> Predrag,
> > >>
> > >> Any updates on gpu3?
> > >> I have tried both tensorflow and chainer and in both cases the problem
> > >> seems to be with cuda
> > >>
> > >> On Wed, Oct 19, 2016 at 4:10 PM, Predrag Punosevac <predragp at cs.cmu.edu>
> > >> wrote:
> > >>>
> > >>> Dougal Sutherland <dougal at gmail.com> wrote:
> > >>>
> > >>> > I tried for a while. I failed.
> > >>> >
> > >>>
> > >>> Damn, this doesn't look good. I guess it's back to the drawing board.
> > >>> Thanks for the quick feedback.
> > >>>
> > >>> Predrag
> > >>>
> > >>> > Version 0.10.0 fails immediately on build: "The specified
> > >>> > --crosstool_top '@local_config_cuda//crosstool:crosstool' is not a
> > >>> > valid cc_toolchain_suite rule." Apparently this is because 0.10
> > >>> > required an older version of bazel
> > >>> > (https://github.com/tensorflow/tensorflow/issues/4368), and I don't
> > >>> > have the energy to install an old version of bazel.
> > >>> >
> > >>> > Version 0.11.0rc0 gets almost done and then complains about no such
> > >>> > file or directory for libcudart.so.7.5 (which is there, where I told
> > >>> > tensorflow it was...).
> > >>> >
> > >>> > Non-release versions from git fail immediately because they call
> > >>> > git -C to get version info, which is only in git 1.9 (we have 1.8).
> > >>> >
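A quick way to check for the git problem described above before attempting a
build from a checkout. This is a sketch only: the version_ge helper is made
up for illustration, the 1.9 threshold is simply the figure quoted in the
message, and the comparison relies on the -V (version sort) option of GNU
sort.

```shell
# Exit status 0 if version $1 is greater than or equal to version $2.
# Works by version-sorting the two and checking that $2 sorts first.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Threshold taken from the message above; adjust if it turns out to differ.
required="1.9"

# Only run the check when git is actually installed.
if command -v git >/dev/null 2>&1; then
    installed="$(git --version | awk '{print $3}')"
    if version_ge "$installed" "$required"; then
        echo "git $installed: builds from a git checkout may work"
    else
        echo "git $installed is older than $required: use a release tarball"
    fi
fi
```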
> > >>> >
> > >>> > Some other notes:
> > >>> > - I made a symlink from ~/.cache/bazel to
> > >>> > /home/scratch/$USER/.cache/bazel, because bazel is the worst. (It
> > >>> > complains about doing things on NFS, and hung for me [clock-related?],
> > >>> > and I can't find a global config file or anything to change that in;
> > >>> > it seems like there might be one, but their documentation is
> > >>> > terrible.)
> > >>> >
> > >>> > - I wasn't able to use the actual Titan X compute capability of 6.1,
> > >>> > because that requires cuda 8; I used 5.2 instead. Probably not a huge
> > >>> > deal, but I don't know.
> > >>> >
> > >>> > - I tried explicitly including /usr/local/cuda/lib64 in
> > >>> > LD_LIBRARY_PATH and set CUDA_HOME to /usr/local/cuda before building,
> > >>> > hoping that would help with the 0.11.0rc0 problem, but it didn't.
> > >>
> > >>
> > >
> >
> >
> >