GPU3 back in business

Dougal Sutherland dougal at gmail.com
Fri Oct 21 14:03:21 EDT 2016


I was just able to build tensorflow 0.11.0rc0 on gpu3! I used the cuda 8.0
install, and it built fine. So additionally installing 7.5 was probably not
necessary; in fact, cuda 7.5 doesn't know about the 6.1 compute
architecture that the Titan Xs use, so Theano at least needs to be manually
told to use an older architecture.
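For Theano, a minimal sketch of what "manually told to use an older architecture" could look like, assuming the old-style cuda backend and its nvcc.flags option (sm_52 matches the 5.2 capability I fell back to earlier; treat the exact flag string as an assumption, not a tested recipe):

```shell
# Hypothetical: force Theano's cuda 7.5 toolchain to compile for an
# architecture it knows about (sm_52) instead of the Titan X's 6.1.
export THEANO_FLAGS="device=gpu,nvcc.flags=-arch=sm_52"
```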

A pip package is in ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl. I
think it should work fine with the cudnn in my scratch directory.

You should probably install it to scratch, either by running this first to
put libraries in your scratch directory, or by using a virtualenv or something:
export PYTHONUSERBASE=/home/scratch/$USER/.local
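For the virtualenv route, a sketch (the env path under scratch is my suggestion, not something anyone has set up):

```shell
# Option A: redirect pip --user installs into scratch (the path from above)
export PYTHONUSERBASE=/home/scratch/$USER/.local

# Option B: a virtualenv living in scratch instead (hypothetical path)
# virtualenv /home/scratch/$USER/tf-env
# . /home/scratch/$USER/tf-env/bin/activate
```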

You'll need this to use the library and probably to install it:
export LD_LIBRARY_PATH=/home/scratch/dsutherl/cudnn-8.0-5.1/cuda/lib64:"$LD_LIBRARY_PATH"

To install:
pip install --user ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl
(remove --user if you're using a virtualenv)
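Before importing tensorflow, it may be worth a quick check that the cudnn directory (the path from my scratch directory, as above) actually made it onto LD_LIBRARY_PATH; a minimal shell check:

```shell
# Export the cudnn path (same line as above), then verify it is present.
export LD_LIBRARY_PATH=/home/scratch/dsutherl/cudnn-8.0-5.1/cuda/lib64:"$LD_LIBRARY_PATH"
case ":$LD_LIBRARY_PATH:" in
  *":/home/scratch/dsutherl/cudnn-8.0-5.1/cuda/lib64:"*)
    echo "cudnn dir on LD_LIBRARY_PATH" ;;
  *)
    echo "cudnn dir missing from LD_LIBRARY_PATH" ;;
esac
```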

(A request: I'm submitting to ICLR in two weeks, and for some of the models
I'm running, gpu3's cards are 4x the speed of gpu1's or gpu2's. So please
don't run a ton of stuff on gpu3 unless you're working on a deadline too.)



Steps to install it, for the future:

   - Install bazel into your scratch directory:
      - wget https://github.com/bazelbuild/bazel/releases/download/0.3.2/bazel-0.3.2-installer-linux-x86_64.sh
      - bash bazel-0.3.2-installer-linux-x86_64.sh --prefix=/home/scratch/$USER --base=/home/scratch/$USER/.bazel
      - Configure bazel to build in scratch. There's probably a better way
   to do this, but this works:
      - mkdir /home/scratch/$USER/.cache
      - ln -s /home/scratch/$USER/.cache/bazel ~/.cache/bazel
   - Build tensorflow. Note that builds from git checkouts don't work,
   because they assume a newer version of git than is on gpu3:
      - cd /home/scratch/$USER
      - wget
      - tar xf
      - cd tensorflow-0.11.0rc0
      - ./configure
         - This is an interactive script that doesn't seem to let you pass
         arguments or anything. It's obnoxious.
         - Use the default python
         - don't use cloud platform or hadoop file system
         - use the default site-packages path if it asks
         - build with GPU support
         - default gcc
         - default Cuda SDK version
         - specify /usr/local/cuda-8.0
         - default cudnn version
         - specify $CUDNN_DIR from use-cudnn.sh, e.g.
         /home/scratch/dsutherl/cudnn-8.0-5.1/cuda
         - Pascal Titan Xs have compute capability 6.1
      - bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
      - bazel-bin/tensorflow/tools/pip_package/build_pip_package ./
      - A .whl file, e.g. tensorflow-0.11.0rc0-py2-none-any.whl, is put in
      the directory you specified above.
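The steps above, collected into one shell sketch (bash assumed; nothing runs at definition time). The bazel installer URL is the one given above; the tensorflow tarball URL was elided in the steps, so it is left as a parameter here, and ./configure remains interactive:

```shell
#!/bin/bash
# Sketch of the install steps above, as functions. Paths are the ones
# from this email; the tensorflow tarball URL is a parameter because it
# was not spelled out above.

install_bazel_to_scratch() {
  local scratch=/home/scratch/$USER
  wget https://github.com/bazelbuild/bazel/releases/download/0.3.2/bazel-0.3.2-installer-linux-x86_64.sh
  bash bazel-0.3.2-installer-linux-x86_64.sh --prefix="$scratch" --base="$scratch/.bazel"
  # keep bazel's cache off NFS-backed home by pointing it into scratch
  mkdir -p "$scratch/.cache"
  ln -s "$scratch/.cache/bazel" ~/.cache/bazel
}

build_tensorflow() {
  local tarball_url=$1            # the 0.11.0rc0 source tarball
  cd /home/scratch/$USER || return 1
  wget "$tarball_url"
  tar xf "${tarball_url##*/}"
  cd tensorflow-0.11.0rc0 || return 1
  ./configure                     # interactive; answer as listed above
  bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
  bazel-bin/tensorflow/tools/pip_package/build_pip_package ./
}
```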


- Dougal


On Fri, Oct 21, 2016 at 6:14 PM Kirthevasan Kandasamy <kandasamy at cmu.edu>
wrote:

Predrag,

Any updates on gpu3?
I have tried both tensorflow and chainer, and in both cases the problem
seems to be with cuda.

On Wed, Oct 19, 2016 at 4:10 PM, Predrag Punosevac <predragp at cs.cmu.edu>
wrote:

Dougal Sutherland <dougal at gmail.com> wrote:

> I tried for a while. I failed.
>

Damn, this doesn't look good. I guess back to the drawing board. Thanks
for the quick feedback.

Predrag

> Version 0.10.0 fails immediately on build: "The specified --crosstool_top
> '@local_config_cuda//crosstool:crosstool' is not a valid cc_toolchain_suite
> rule." Apparently this is because 0.10 required an older version of bazel
> (https://github.com/tensorflow/tensorflow/issues/4368), and I don't have
> the energy to install an old version of bazel.
>
> Version 0.11.0rc0 gets almost done and then complains about no such file
> or directory for libcudart.so.7.5 (which is there, where I told tensorflow
> it was...).
>
> Non-release versions from git fail immediately because they call git -C to
> get version info, which is only in git 1.9 (we have 1.8).
>
>
> Some other notes:
> - I made a symlink from ~/.cache/bazel to /home/scratch/$USER/.cache/bazel,
> because bazel is the worst. (It complains about doing things on NFS, and
> hung for me [clock-related?], and I can't find a global config file or
> anything to change that in; it seems like there might be one, but their
> documentation is terrible.)
>
> - I wasn't able to use the actual Titan X compute capability of 6.1,
> because that requires cuda 8; I used 5.2 instead. Probably not a huge
> deal, but I don't know.
>
> - I tried explicitly including /usr/local/cuda/lib64 in LD_LIBRARY_PATH
> and set CUDA_HOME to /usr/local/cuda before building, hoping that would
> help with the 0.11.0rc0 problem, but it didn't.

