GPU3 back in business

Barnabas Poczos bapoczos at cs.cmu.edu
Fri Oct 21 15:04:08 EDT 2016


That's great! Thanks Dougal.

As I remember, bazel was not installed correctly on GPU3 before. Do
you know what went wrong with it then and why it works now?

Thanks,
Barnabas
======================
Barnabas Poczos, PhD
Assistant Professor
Machine Learning Department
Carnegie Mellon University


On Fri, Oct 21, 2016 at 2:03 PM, Dougal Sutherland <dougal at gmail.com> wrote:
> I was just able to build tensorflow 0.11.0rc0 on gpu3! I used the cuda 8.0
> install, and it built fine. So additionally installing 7.5 was probably not
> necessary; in fact, cuda 7.5 doesn't know about the 6.1 compute architecture
> that the Titan Xs use, so Theano at least needs to be manually told to use
> an older architecture.
>
> A pip package is in ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl. I think
> it should work fine with the cudnn in my scratch directory.
>
> You should probably install it to scratch, either by running this first to put
> libraries in your scratch directory or by using a virtualenv or something:
> export PYTHONUSERBASE=/home/scratch/$USER/.local
>
> You'll need this to use the library and probably to install it:
> export LD_LIBRARY_PATH=/home/scratch/dsutherl/cudnn-8.0-5.1/cuda/lib64:"$LD_LIBRARY_PATH"
>
> To install:
> pip install --user ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl
> (remove --user if you're using a virtualenv)
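The setup and install steps above can be collected into one snippet. This is a minimal sketch, not part of the original message: the variable names SCRATCH, CUDNN_LIB and WHEEL are made up for illustration, and the paths are the ones quoted above. It assumes a POSIX-style shell and that pip is on your PATH.

```shell
# Sketch only: SCRATCH, CUDNN_LIB and WHEEL are illustrative names.
SCRATCH="/home/scratch/${USER:-$(id -un)}"
CUDNN_LIB="/home/scratch/dsutherl/cudnn-8.0-5.1/cuda/lib64"

# Make "pip install --user" put packages under scratch instead of $HOME.
export PYTHONUSERBASE="$SCRATCH/.local"

# Prepend cudnn; the ${VAR:+...} form avoids a trailing ':' when
# LD_LIBRARY_PATH was previously empty or unset.
export LD_LIBRARY_PATH="$CUDNN_LIB${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Install the prebuilt wheel only if it is actually readable from here.
WHEEL=~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl
if [ -f "$WHEEL" ]; then
    pip install --user "$WHEEL"
else
    echo "wheel not found: $WHEEL" >&2
fi
```

The only deliberate difference from the plain export above is the trailing-colon guard: an empty entry in LD_LIBRARY_PATH is treated as the current directory, which is usually not what you want.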
>
> (A request: I'm submitting to ICLR in two weeks, and for some of the models
> I'm running, gpu3's cards are 4x the speed of gpu1's or gpu2's. So please don't
> run a ton of stuff on gpu3 unless you're working on a deadline too.)
>
>
>
> Steps to install it, for the future:
>
> Install bazel yourself, under your scratch directory:
>
> wget https://github.com/bazelbuild/bazel/releases/download/0.3.2/bazel-0.3.2-installer-linux-x86_64.sh
> bash bazel-0.3.2-installer-linux-x86_64.sh --prefix=/home/scratch/$USER --base=/home/scratch/$USER/.bazel
>
> Configure bazel to build in scratch. There's probably a better way to do
> this, but this works:
>
> mkdir -p /home/scratch/$USER/.cache/bazel ~/.cache
> ln -s /home/scratch/$USER/.cache/bazel ~/.cache/bazel
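A slightly more defensive version of the symlink step might look like the following. It is a sketch under assumptions: link_bazel_cache is a hypothetical helper name, not something from bazel or from the original message, and it only covers the case where ~/.cache/bazel is missing or already a symlink.

```shell
# link_bazel_cache SCRATCH_DIR HOME_DIR   (hypothetical helper)
# Creates SCRATCH_DIR/.cache/bazel and points HOME_DIR/.cache/bazel at it.
link_bazel_cache() {
    scratch_cache="$1/.cache/bazel"
    home_cache="$2/.cache/bazel"

    # The link target and the link's parent directory must both exist.
    mkdir -p "$scratch_cache" "$2/.cache" || return 1

    # Already a symlink: leave it alone so reruns are harmless.
    if [ -L "$home_cache" ]; then
        return 0
    fi

    # A real file or directory is in the way: do not clobber it.
    if [ -e "$home_cache" ]; then
        echo "refusing to replace existing $home_cache" >&2
        return 1
    fi

    ln -s "$scratch_cache" "$home_cache"
}

# Intended use on gpu3 (left commented out in this sketch):
# link_bazel_cache "/home/scratch/$USER" "$HOME"
```

Taking both directories as arguments keeps the helper easy to try out on throwaway directories before pointing it at your real home.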
>
> Build tensorflow. Note that builds from git checkouts don't work, because
> they assume a newer version of git than is on gpu3:
>
> cd /home/scratch/$USER
> wget
> tar xf
> cd tensorflow-0.11.0rc0
> ./configure
>
> This is an interactive script that doesn't seem to let you pass arguments or
> anything. It's obnoxious.
> - Use the default python
> - Don't use cloud platform or hadoop file system
> - Use the default site-packages path if it asks
> - Build with GPU support
> - Default gcc
> - Default Cuda SDK version
> - Specify /usr/local/cuda-8.0
> - Default cudnn version
> - Specify $CUDNN_DIR from use-cudnn.sh, e.g. /home/scratch/dsutherl/cudnn-8.0-5.1/cuda
> - Pascal Titan Xs have compute capability 6.1
>
> bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
> bazel-bin/tensorflow/tools/pip_package/build_pip_package ./
> A .whl file, e.g. tensorflow-0.11.0rc0-py2-none-any.whl, is put in the
> directory you specified above.
>
>
> - Dougal
>
>
> On Fri, Oct 21, 2016 at 6:14 PM Kirthevasan Kandasamy <kandasamy at cmu.edu>
> wrote:
>>
>> Predrag,
>>
>> Any updates on gpu3?
>> I have tried both tensorflow and chainer and in both cases the problem
>> seems to be with cuda
>>
>> On Wed, Oct 19, 2016 at 4:10 PM, Predrag Punosevac <predragp at cs.cmu.edu>
>> wrote:
>>>
>>> Dougal Sutherland <dougal at gmail.com> wrote:
>>>
>>> > I tried for a while. I failed.
>>> >
>>>
>>> Damn, this doesn't look good. I guess it's back to the drawing board. Thanks
>>> for the quick feedback.
>>>
>>> Predrag
>>>
>>> > Version 0.10.0 fails immediately on build: "The specified --crosstool_top
>>> > '@local_config_cuda//crosstool:crosstool' is not a valid cc_toolchain_suite
>>> > rule." Apparently this is because 0.10 required an older version of bazel
>>> > (https://github.com/tensorflow/tensorflow/issues/4368), and I don't have the
>>> > energy to install an old version of bazel.
>>> >
>>> > Version 0.11.0rc0 gets almost done and then complains about no such file or
>>> > directory for libcudart.so.7.5 (which is there, where I told tensorflow it
>>> > was...).
>>> >
>>> > Non-release versions from git fail immediately because they call git -C to
>>> > get version info, which is only in git 1.9 (we have 1.8).
>>> >
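Before trying a build from a git checkout, the installed git can be compared against the version the message says is needed. The sketch below is illustrative only: version_ge is a hypothetical helper, the 1.9 threshold is simply the figure quoted above, and the comparison looks at the major.minor fields only (it assumes both arguments have at least those two fields).

```shell
# version_ge HAVE WANT   (hypothetical helper)
# Succeeds when HAVE >= WANT, comparing major.minor numerically.
version_ge() {
    have_major=$(printf '%s' "$1" | cut -d. -f1)
    have_minor=$(printf '%s' "$1" | cut -d. -f2)
    want_major=$(printf '%s' "$2" | cut -d. -f1)
    want_minor=$(printf '%s' "$2" | cut -d. -f2)
    [ "$have_major" -gt "$want_major" ] ||
        { [ "$have_major" -eq "$want_major" ] &&
          [ "$have_minor" -ge "$want_minor" ]; }
}

required="1.9"   # threshold as quoted in the message above

if command -v git >/dev/null 2>&1; then
    # "git version 1.8.3.1" -> third field is the version string
    installed=$(git --version | awk '{print $3}')
    if version_ge "$installed" "$required"; then
        echo "git $installed should be new enough for a checkout build"
    else
        echo "git $installed is older than $required; use a release tarball"
    fi
fi
```

Comparing the fields numerically rather than as strings matters once minor versions reach two digits, since "1.10" sorts before "1.9" as text.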
>>> >
>>> > Some other notes:
>>> > - I made a symlink from ~/.cache/bazel to /home/scratch/$USER/.cache/bazel,
>>> > because bazel is the worst. (It complains about doing things on NFS, and
>>> > hung for me [clock-related?], and I can't find a global config file or
>>> > anything to change that in; it seems like there might be one, but their
>>> > documentation is terrible.)
>>> >
>>> > - I wasn't able to use the actual Titan X compute capability of 6.1,
>>> > because that requires cuda 8; I used 5.2 instead. Probably not a huge deal,
>>> > but I don't know.
>>> >
>>> > - I tried explicitly including /usr/local/cuda/lib64 in LD_LIBRARY_PATH and
>>> > setting CUDA_HOME to /usr/local/cuda before building, hoping that would help
>>> > with the 0.11.0rc0 problem, but it didn't.
>>
>>
>

