GPU3 back in business
Dougal Sutherland
dougal at gmail.com
Fri Oct 21 15:17:13 EDT 2016
They do work with 7.5 if you specify an older compute architecture; it's
just that their actual compute capability of 6.1 isn't supported by cuda
7.5. Theano is thrown off by this, for example, but it can be fixed by
telling it to pass compute capability 5.2 (for example) to nvcc. I don't
think that this was my problem with building tensorflow on 7.5; I'm not
sure what that was.
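For example, something like this should work with Theano's old cuda backend
(untested on this exact setup; train.py stands in for whatever script you're
running):

THEANO_FLAGS='device=gpu,nvcc.flags=-arch=sm_52' python train.py

You can also put flags = -arch=sm_52 under the [nvcc] section of ~/.theanorc
to make it stick.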
On Fri, Oct 21, 2016, 8:11 PM Kirthevasan Kandasamy <kandasamy at cmu.edu>
wrote:
> Thanks Dougal. I'll take a look at this and get back to you.
> So are you suggesting that this is an issue with the Titan Xs not being
> compatible with 7.5?
>
> On Fri, Oct 21, 2016 at 3:08 PM, Dougal Sutherland <dougal at gmail.com>
> wrote:
>
> I installed it in my scratch directory (not sure if there's a global
> install?). The main thing was to put its cache on scratch; it got really
> upset when the cache directory was on NFS. (Instructions at the bottom of
> my previous email.)
>
> On Fri, Oct 21, 2016, 8:04 PM Barnabas Poczos <bapoczos at cs.cmu.edu> wrote:
>
> That's great! Thanks Dougal.
>
> As I remember, bazel was not installed correctly on GPU3 previously. Do
> you know what went wrong with it before and why it works now?
>
> Thanks,
> Barnabas
> ======================
> Barnabas Poczos, PhD
> Assistant Professor
> Machine Learning Department
> Carnegie Mellon University
>
>
> On Fri, Oct 21, 2016 at 2:03 PM, Dougal Sutherland <dougal at gmail.com>
> wrote:
> > I was just able to build tensorflow 0.11.0rc0 on gpu3! I used the cuda 8.0
> > install, and it built fine. So additionally installing 7.5 was probably not
> > necessary; in fact, cuda 7.5 doesn't know about the 6.1 compute architecture
> > that the Titan Xs use, so Theano at least needs to be manually told to use
> > an older architecture.
> >
> > A pip package is in ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl. I think
> > it should work fine with the cudnn in my scratch directory.
> >
> > You should probably install it to scratch, either running this first to put
> > libraries in your scratch directory or using a virtualenv or something:
> > export PYTHONUSERBASE=/home/scratch/$USER/.local
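> >
> > (Or, if you'd rather go the virtualenv route, something like this should
> > work, assuming virtualenv is available; the env path is just an example:
> > virtualenv /home/scratch/$USER/tf-env
> > source /home/scratch/$USER/tf-env/bin/activate
> > and then drop --user from the pip install below.)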
> >
> > You'll need this to use the library and probably to install it:
> > export LD_LIBRARY_PATH=/home/scratch/dsutherl/cudnn-8.0-5.1/cuda/lib64:"$LD_LIBRARY_PATH"
> >
> > To install:
> > pip install --user ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl
> > (remove --user if you're using a virtualenv)
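> >
> > As a quick smoke test after installing (assuming python 2, which the wheel
> > targets), this should log the cards being picked up, something like "Found
> > device 0 with properties: ... name: TITAN X (Pascal)":
> > python -c "import tensorflow as tf; tf.Session()"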
> >
> > (A request: I'm submitting to ICLR in two weeks, and for some of the models
> > I'm running gpu3's cards are 4x the speed of gpu1 or 2's. So please don't
> > run a ton of stuff on gpu3 unless you're working on a deadline too.)
> >
> >
> >
> > Steps to install it, for the future:
> >
> > Install bazel in your home directory:
> >
> > wget https://github.com/bazelbuild/bazel/releases/download/0.3.2/bazel-0.3.2-installer-linux-x86_64.sh
> > bash bazel-0.3.2-installer-linux-x86_64.sh --prefix=/home/scratch/$USER --base=/home/scratch/$USER/.bazel
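> >
> > (I believe that --prefix puts the bazel binary in /home/scratch/$USER/bin,
> > so you'll probably also want:
> > export PATH=/home/scratch/$USER/bin:"$PATH")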
> >
> > Configure bazel to build in scratch. There's probably a better way to do
> > this, but this works:
> >
> > mkdir /home/scratch/$USER/.cache
> > ln -s /home/scratch/$USER/.cache/bazel ~/.cache/bazel
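> >
> > (If ~/.cache/bazel already exists, the ln -s will fail; move the old one
> > out of the way first, e.g. mv ~/.cache/bazel ~/.cache/bazel.bak)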
> >
> > Build tensorflow. Note that builds from git checkouts don't work, because
> > they assume a newer version of git than is on gpu3:
> >
> > cd /home/scratch/$USER
> > wget
> > tar xf
> > cd tensorflow-0.11.0rc0
> > ./configure
> >
> > This is an interactive script that doesn't seem to let you pass arguments
> > or anything, which is obnoxious. Answer the prompts as follows (a possible
> > env-var shortcut is sketched after this list):
> > - use the default python
> > - don't use cloud platform or hadoop file system support
> > - use the default site-packages path if it asks
> > - build with GPU support
> > - default gcc
> > - default Cuda SDK version, and specify /usr/local/cuda-8.0 for its path
> > - default cudnn version, and specify $CUDNN_DIR from use-cudnn.sh, e.g.
> >   /home/scratch/dsutherl/cudnn-8.0-5.1/cuda
> > - for compute capability: the Pascal Titan Xs have 6.1
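> >
> > (Untested, but I believe configure also reads its answers from environment
> > variables, so something like this might skip most of the prompts; it may
> > still ask about anything left unset:
> > TF_NEED_CUDA=1 TF_CUDA_VERSION=8.0 CUDA_TOOLKIT_PATH=/usr/local/cuda-8.0 \
> > TF_CUDNN_VERSION=5 CUDNN_INSTALL_PATH=/home/scratch/dsutherl/cudnn-8.0-5.1/cuda \
> > TF_CUDA_COMPUTE_CAPABILITIES=6.1 ./configure )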
> >
> > bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
> > bazel-bin/tensorflow/tools/pip_package/build_pip_package ./
> > A .whl file, e.g. tensorflow-0.11.0rc0-py2-none-any.whl, is put in the
> > directory you specified above.
> >
> >
> > - Dougal
> >
> >
> > On Fri, Oct 21, 2016 at 6:14 PM Kirthevasan Kandasamy <kandasamy at cmu.edu>
> > wrote:
> >>
> >> Predrag,
> >>
> >> Any updates on gpu3?
> >> I have tried both tensorflow and chainer, and in both cases the problem
> >> seems to be with cuda.
> >>
> >> On Wed, Oct 19, 2016 at 4:10 PM, Predrag Punosevac <predragp at cs.cmu.edu>
> >> wrote:
> >>>
> >>> Dougal Sutherland <dougal at gmail.com> wrote:
> >>>
> >>> > I tried for a while. I failed.
> >>> >
> >>>
> >>> Damn, this doesn't look good. I guess it's back to the drawing board.
> >>> Thanks for the quick feedback.
> >>>
> >>> Predrag
> >>>
> >>> > Version 0.10.0 fails immediately on build: "The specified --crosstool_top
> >>> > '@local_config_cuda//crosstool:crosstool' is not a valid cc_toolchain_suite
> >>> > rule." Apparently this is because 0.10 required an older version of bazel
> >>> > (https://github.com/tensorflow/tensorflow/issues/4368), and I don't have
> >>> > the energy to install an old version of bazel.
> >>> >
> >>> > Version 0.11.0rc0 gets almost done and then complains about no such file
> >>> > or directory for libcudart.so.7.5 (which is there, where I told tensorflow
> >>> > it was...).
> >>> >
> >>> > Non-release versions from git fail immediately because they call git -C
> >>> > to get version info, which is only in git 1.9 (we have 1.8).
> >>> >
> >>> >
> >>> > Some other notes:
> >>> > - I made a symlink from ~/.cache/bazel to /home/scratch/$USER/.cache/bazel,
> >>> > because bazel is the worst. (It complains about doing things on NFS, and
> >>> > hung for me [clock-related?], and I can't find a global config file or
> >>> > anything to change that in; it seems like there might be one, but their
> >>> > documentation is terrible.)
> >>> >
> >>> > - I wasn't able to use the actual Titan X compute capability of 6.1,
> >>> > because that requires cuda 8; I used 5.2 instead. Probably not a huge
> >>> > deal, but I don't know.
> >>> >
> >>> > - I tried explicitly including /usr/local/cuda/lib64 in LD_LIBRARY_PATH
> >>> > and setting CUDA_HOME to /usr/local/cuda before building, hoping that
> >>> > would help with the 0.11.0rc0 problem, but it didn't.
> >>
> >>
> >
>
>
>