GPU3 back in business

Dougal Sutherland dougal at gmail.com
Fri Oct 21 15:27:32 EDT 2016


Heh. :)

An explanation:

   - Each nvidia gpu architecture has a "compute capability": a version
   number that describes what the card supports (the maximum size of various
   things, which API functions it has, etc.). There's a
   reference here
   <https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications>,
   but it shouldn't really matter.
   - When CUDA compiles code, it targets a certain architecture, since it
   needs to know what features to use and whatnot. I *think* that if you
   compile for compute capability x, it will work on a card with compute
   capability y approximately iff x <= y.
   - Pascal Titan Xs, like gpu3 has, have compute capability 6.1.
   - CUDA 7.5 doesn't know about compute capability 6.1, so if you ask to
   compile for 6.1 it crashes.
   - Theano by default tries to compile for the capability of the card, but
   can be configured to compile for a different capability.
   - Tensorflow asks for a list of capabilities to compile for when you
   build it in the first place. (Examples of both are sketched below.)
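
For example, here is roughly how you'd pin the architecture in each (a
sketch from memory, so double-check the exact flag spelling; "train.py" is
just a stand-in for whatever script you run):

  # Theano: tell nvcc to target an older architecture (5.2 here), which
  # cuda 7.5 understands and which the Pascal cards can still run:
  THEANO_FLAGS='nvcc.flags=-arch=sm_52' python train.py

For tensorflow, the capability list is one of the ./configure prompts at
build time; on gpu3 with cuda 8.0 you'd answer 6.1 there.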


On Fri, Oct 21, 2016 at 8:17 PM Dougal Sutherland <dougal at gmail.com> wrote:

> They do work with 7.5 if you specify an older compute architecture; it's
> just that their actual compute capability of 6.1 isn't supported by cuda
> 7.5. Theano is thrown off by this, for example, but it can be fixed by
> telling it to pass compute capability 5.2 (for example) to nvcc. I don't
> think that this was my problem with building tensorflow on 7.5; I'm not
> sure what that was.
>
> On Fri, Oct 21, 2016, 8:11 PM Kirthevasan Kandasamy <kandasamy at cmu.edu>
> wrote:
>
> Thanks Dougal. I'll take a look at this and get back to you.
> So are you suggesting that this is an issue with the Titan Xs not being
> compatible with 7.5?
>
> On Fri, Oct 21, 2016 at 3:08 PM, Dougal Sutherland <dougal at gmail.com>
> wrote:
>
> I installed it in my scratch directory (not sure if there's a global
> install?). The main thing was to put its cache on scratch; it got really
> upset when the cache directory was on NFS. (Instructions at the bottom of
> my previous email.)
>
> On Fri, Oct 21, 2016, 8:04 PM Barnabas Poczos <bapoczos at cs.cmu.edu> wrote:
>
> That's great! Thanks Dougal.
>
> As I remember, bazel was not installed correctly on GPU3 previously. Do
> you know what went wrong with it before and why it works now?
>
> Thanks,
> Barnabas
> ======================
> Barnabas Poczos, PhD
> Assistant Professor
> Machine Learning Department
> Carnegie Mellon University
>
>
> On Fri, Oct 21, 2016 at 2:03 PM, Dougal Sutherland <dougal at gmail.com>
> wrote:
> > I was just able to build tensorflow 0.11.0rc0 on gpu3! I used the cuda 8.0
> > install, and it built fine. So additionally installing 7.5 was probably not
> > necessary; in fact, cuda 7.5 doesn't know about the 6.1 compute architecture
> > that the Titan Xs use, so Theano at least needs to be manually told to use
> > an older architecture.
> >
> > A pip package is in ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl. I think
> > it should work fine with the cudnn in my scratch directory.
> >
> > You should probably install it to scratch, either running this first to put
> > libraries in your scratch directory, or using a virtualenv or something:
> > export PYTHONUSERBASE=/home/scratch/$USER/.local
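> > (Or, a sketch of the virtualenv route; the path here is just an example:
> > virtualenv /home/scratch/$USER/tf-env
> > source /home/scratch/$USER/tf-env/bin/activate
> > and then install without --user.)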
> >
> > You'll need this to use the library and probably to install it:
> > export LD_LIBRARY_PATH=/home/scratch/dsutherl/cudnn-8.0-5.1/cuda/lib64:"$LD_LIBRARY_PATH"
> >
> > To install:
> > pip install --user ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl
> > (remove --user if you're using a virtualenv)
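> >
> > With LD_LIBRARY_PATH set as above, something like this should work as a
> > quick sanity check; creating the session should log the GPUs it finds:
> > python -c "import tensorflow as tf; print(tf.__version__); tf.Session()"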
> >
> > (A request: I'm submitting to ICLR in two weeks, and for some of the models
> > I'm running, gpu3's cards are 4x the speed of gpu1's or gpu2's. So please
> > don't run a ton of stuff on gpu3 unless you're working on a deadline too.)
> >
> >
> >
> > Steps to install it, for the future:
> >
> > Install bazel in your home directory:
> >
> > wget https://github.com/bazelbuild/bazel/releases/download/0.3.2/bazel-0.3.2-installer-linux-x86_64.sh
> > bash bazel-0.3.2-installer-linux-x86_64.sh --prefix=/home/scratch/$USER --base=/home/scratch/$USER/.bazel
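> >
> > I think the installer drops the bazel binary under the prefix's bin/
> > directory, so you'll probably also need that on your PATH:
> > export PATH=/home/scratch/$USER/bin:"$PATH"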
> >
> > Configure bazel to build in scratch. There's probably a better way to do
> > this, but this works:
> >
> > mkdir -p /home/scratch/$USER/.cache/bazel
> > ln -s /home/scratch/$USER/.cache/bazel ~/.cache/bazel
> >
> > Build tensorflow. Note that builds from git checkouts don't work, because
> > they assume a newer version of git than is on gpu3:
> >
> > cd /home/scratch/$USER
> > wget
> > tar xf
> > cd tensorflow-0.11.0rc0
> > ./configure
> >
> > This is an interactive script that doesn't seem to let you pass arguments
> > or anything. It's obnoxious.
> > - Use the default python
> > - don't use cloud platform or hadoop file system
> > - use the default site-packages path if it asks
> > - build with GPU support
> > - default gcc
> > - default Cuda SDK version
> > - specify /usr/local/cuda-8.0
> > - default cudnn version
> > - specify $CUDNN_DIR from use-cudnn.sh, e.g.
> >   /home/scratch/dsutherl/cudnn-8.0-5.1/cuda
> > - Pascal Titan Xs have compute capability 6.1
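> >
> > (If I remember right, the configure script will also read most of these
> > from environment variables if they're set, e.g.
> > TF_CUDA_COMPUTE_CAPABILITIES=6.1 and CUDA_TOOLKIT_PATH=/usr/local/cuda-8.0,
> > which would make it scriptable; I haven't verified that on 0.11.0rc0 though.)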
> >
> > bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
> > bazel-bin/tensorflow/tools/pip_package/build_pip_package ./
> > A .whl file, e.g. tensorflow-0.11.0rc0-py2-none-any.whl, is put in the
> > directory you specified above.
> >
> >
> > - Dougal
> >
> >
> > On Fri, Oct 21, 2016 at 6:14 PM Kirthevasan Kandasamy <kandasamy at cmu.edu>
> > wrote:
> >>
> >> Predrag,
> >>
> >> Any updates on gpu3?
> >> I have tried both tensorflow and chainer and in both cases the problem
> >> seems to be with cuda
> >>
> >> On Wed, Oct 19, 2016 at 4:10 PM, Predrag Punosevac <predragp at cs.cmu.edu>
> >> wrote:
> >>>
> >>> Dougal Sutherland <dougal at gmail.com> wrote:
> >>>
> >>> > I tried for a while. I failed.
> >>> >
> >>>
> >>> Damn, this doesn't look good. I guess back to the drawing board. Thanks
> >>> for the quick feedback.
> >>>
> >>> Predrag
> >>>
> >>> > Version 0.10.0 fails immediately on build: "The specified
> >>> > --crosstool_top '@local_config_cuda//crosstool:crosstool' is not a
> >>> > valid cc_toolchain_suite rule." Apparently this is because 0.10
> >>> > required an older version of bazel
> >>> > (https://github.com/tensorflow/tensorflow/issues/4368), and I don't
> >>> > have the energy to install an old version of bazel.
> >>> >
> >>> > Version 0.11.0rc0 gets almost done and then complains about no such
> >>> > file or directory for libcudart.so.7.5 (which is there, where I told
> >>> > tensorflow it was...).
> >>> >
> >>> > Non-release versions from git fail immediately because they call
> >>> > git -C to get version info, which is only in git 1.9 (we have 1.8).
> >>> >
> >>> >
> >>> > Some other notes:
> >>> > - I made a symlink from ~/.cache/bazel to
> >>> > /home/scratch/$USER/.cache/bazel, because bazel is the worst. (It
> >>> > complains about doing things on NFS, and hung for me [clock-related?],
> >>> > and I can't find a global config file or anything to change that in;
> >>> > it seems like there might be one, but their documentation is terrible.)
> >>> >
> >>> > - I wasn't able to use the actual Titan X compute capability of 6.1,
> >>> > because that requires cuda 8; I used 5.2 instead. Probably not a huge
> >>> > deal,
> >>> > but I don't know.
> >>> >
> >>> > - I tried explicitly including /usr/local/cuda/lib64 in LD_LIBRARY_PATH
> >>> > and setting CUDA_HOME to /usr/local/cuda before building, hoping that
> >>> > would help with the 0.11.0rc0 problem, but it didn't.
> >>
> >>
> >
>
>
>