GPU3 back in business

Fri Oct 21 15:20:13 EDT 2016

I didn't understand half of what you said :P
But I'll give this a shot and get back to you if I run into any issues.

On Fri, Oct 21, 2016 at 3:17 PM, Dougal Sutherland <dougal at gmail.com> wrote:

> They do work with 7.5 if you specify an older compute architecture; it's
> just that their actual compute capability of 6.1 isn't supported by cuda
> 7.5. Theano is thrown off by this, for example, but it can be fixed by
> telling it to pass compute capability 5.2 (for example) to nvcc. I don't
> think that this was my problem with building tensorflow on 7.5; I'm not
> sure what that was.
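
A rough sketch of the Theano workaround described above, in case it helps. It assumes Theano's old CUDA backend, which reads extra nvcc options from the nvcc.flags setting; the helper names cc_to_arch and theano_nvcc_flags are made up for illustration, and I have not run this on gpu3.

```shell
#!/bin/sh
# cc_to_arch CAPABILITY: print the nvcc architecture name for a compute
# capability, e.g. "5.2" becomes "sm_52".
cc_to_arch() {
    printf 'sm_%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

# theano_nvcc_flags CAPABILITY: print a THEANO_FLAGS fragment that forwards
# -arch to nvcc. Assumes Theano's old CUDA backend and its nvcc.flags option.
theano_nvcc_flags() {
    printf 'nvcc.flags=-arch=%s\n' "$(cc_to_arch "$1")"
}

# Example: target compute capability 5.2, which cuda 7.5 knows about.
THEANO_FLAGS="$(theano_nvcc_flags 5.2)"
export THEANO_FLAGS
echo "$THEANO_FLAGS"
```

Note that this overwrites any THEANO_FLAGS already set in the shell; append to the existing value with a comma if you rely on other flags.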
>
> On Fri, Oct 21, 2016, 8:11 PM Kirthevasan Kandasamy <kandasamy at cmu.edu>
> wrote:
>
>> Thanks Dougal. I'll take a look at this and get back to you.
>> So are you suggesting that this is an issue with Titan Xs not being
>> compatible with 7.5?
>>
>> On Fri, Oct 21, 2016 at 3:08 PM, Dougal Sutherland <dougal at gmail.com>
>> wrote:
>>
>> I installed it in my scratch directory (not sure if there's a global
>> install?). The main thing was to put its cache on scratch; it got really
>> upset when the cache directory was on NFS. (Instructions at the bottom of
>> my previous email.)
>>
>> On Fri, Oct 21, 2016, 8:04 PM Barnabas Poczos <bapoczos at cs.cmu.edu>
>> wrote:
>>
>> That's great! Thanks Dougal.
>>
>> As I remember bazel was not installed correctly previously on GPU3. Do
>> you know what went wrong with it before and why it is good now?
>>
>> Thanks,
>> Barnabas
>> ======================
>> Barnabas Poczos, PhD
>> Assistant Professor
>> Machine Learning Department
>> Carnegie Mellon University
>>
>>
>> On Fri, Oct 21, 2016 at 2:03 PM, Dougal Sutherland <dougal at gmail.com>
>> wrote:
>> > I was just able to build tensorflow 0.11.0rc0 on gpu3! I used the cuda 8.0
>> > install, and it built fine. So additionally installing 7.5 was probably not
>> > necessary; in fact, cuda 7.5 doesn't know about the 6.1 compute architecture
>> > that the Titan Xs use, so Theano at least needs to be manually told to use
>> > an older architecture.
>> >
>> > A pip package is in ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl. I think
>> > it should work fine with the cudnn in my scratch directory.
>> >
>> > You should probably install it to scratch, either by running this first to
>> > put libraries in your scratch directory or by using a virtualenv or something:
>> > export PYTHONUSERBASE=/home/scratch/$USER/.local
>> >
>> > You'll need this to use the library and probably to install it:
>> > export LD_LIBRARY_PATH=/home/scratch/dsutherl/cudnn-8.0-5.1/cuda/lib64:"$LD_LIBRARY_PATH"
>> >
>> > To install:
>> > pip install --user ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl
>> > (remove --user if you're using a virtualenv)
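
One small caveat on the LD_LIBRARY_PATH export above: if the variable starts out empty, the result ends in a bare colon, and the dynamic loader treats an empty entry as the current directory. Below is a minimal sketch of a safer prepend. The function name prepend_path is made up for illustration, and the cudnn path is the one quoted in the instructions.

```shell
#!/bin/sh
# prepend_path DIR CURRENT: print CURRENT with DIR added at the front.
# Avoids a trailing colon when CURRENT is empty, and leaves CURRENT
# unchanged when DIR is already one of its entries.
prepend_path() {
    if [ -z "$2" ]; then
        printf '%s\n' "$1"
        return
    fi
    case ":$2:" in
        *":$1:"*) printf '%s\n' "$2" ;;        # already present
        *)        printf '%s:%s\n' "$1" "$2" ;;
    esac
}

# Example, using the cudnn directory from the instructions:
# LD_LIBRARY_PATH="$(prepend_path /home/scratch/dsutherl/cudnn-8.0-5.1/cuda/lib64 "$LD_LIBRARY_PATH")"
# export LD_LIBRARY_PATH
```

The example lines are left commented out so that sourcing the snippet only defines the function.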
>> >
>> > (A request: I'm submitting to ICLR in two weeks, and for some of the models
>> > I'm running, gpu3's cards are 4x the speed of gpu1 or 2's. So please don't
>> > run a ton of stuff on gpu3 unless you're working on a deadline too.)
>> >
>> >
>> >
>> > Steps to install it, for the future:
>> >
>> > Install bazel in your scratch directory:
>> >
>> > wget https://github.com/bazelbuild/bazel/releases/download/0.3.2/bazel-0.3.2-installer-linux-x86_64.sh
>> > bash bazel-0.3.2-installer-linux-x86_64.sh --prefix=/home/scratch/$USER --base=/home/scratch/$USER/.bazel
>> >
>> > Configure bazel to build in scratch. There's probably a better way to do
>> > this, but this works:
>> >
>> > mkdir -p /home/scratch/$USER/.cache/bazel
>> > ln -s /home/scratch/$USER/.cache/bazel ~/.cache/bazel
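
The symlink step above can be sketched a bit more defensively as below. This is only an illustration: link_cache is a hypothetical helper name, and it assumes GNU ln (for -n), which is what a Linux box like gpu3 should have.

```shell
#!/bin/sh
# link_cache TARGET LINK: make sure TARGET exists, then point LINK at it.
# Refuses to touch LINK if it is a real file or directory rather than a
# symlink, so an existing cache is never silently replaced.
link_cache() {
    mkdir -p "$1" "$(dirname "$2")" || return 1
    if [ -e "$2" ] && [ ! -L "$2" ]; then
        echo "refusing to replace existing $2" >&2
        return 1
    fi
    # -n keeps ln from descending into an existing symlinked directory.
    ln -sfn "$1" "$2"
}

# Example for the bazel cache described above:
# link_cache "/home/scratch/$USER/.cache/bazel" "$HOME/.cache/bazel"
```

Running it twice is harmless, since the second call just re-points the same symlink.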
>> >
>> > Build tensorflow. Note that builds from git checkouts don't work, because
>> > they assume a newer version of git than is on gpu3:
>> >
>> > cd /home/scratch/$USER
>> > wget
>> > tar xf
>> > cd tensorflow-0.11.0rc0
>> > ./configure
>> >
>> > This is an interactive script that doesn't seem to let you pass arguments or
>> > anything. It's obnoxious.
>> > Use the default python
>> > don't use cloud platform or hadoop file system
>> > use the default site-packages path if it asks
>> > build with GPU support
>> > default gcc
>> > default Cuda SDK version
>> > specify /usr/local/cuda-8.0
>> > default cudnn version
>> > specify $CUDNN_DIR from use-cudnn.sh, e.g.
>> > /home/scratch/dsutherl/cudnn-8.0-5.1/cuda
>> > Pascal Titan Xs have compute capability 6.1
>> >
>> > bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
>> > bazel-bin/tensorflow/tools/pip_package/build_pip_package ./
>> > A .whl file, e.g. tensorflow-0.11.0rc0-py2-none-any.whl, is put in the
>> > directory you specified above.
>> >
>> >
>> > - Dougal
>> >
>> >
>> > On Fri, Oct 21, 2016 at 6:14 PM Kirthevasan Kandasamy <kandasamy at cmu.edu>
>> > wrote:
>> >>
>> >> Predrag,
>> >>
>> >> Any updates on gpu3?
>> >> I have tried both tensorflow and chainer and in both cases the problem
>> >> seems to be with cuda
>> >>
>> >> On Wed, Oct 19, 2016 at 4:10 PM, Predrag Punosevac <predragp at cs.cmu.edu>
>> >> wrote:
>> >>>
>> >>> Dougal Sutherland <dougal at gmail.com> wrote:
>> >>>
>> >>> > I tried for a while. I failed.
>> >>> >
>> >>>
>> >>> Damn, this doesn't look good. I guess it's back to the drawing board. Thanks
>> >>> for the quick feedback.
>> >>>
>> >>> Predrag
>> >>>
>> >>> > Version 0.10.0 fails immediately on build: "The specified --crosstool_top
>> >>> > '@local_config_cuda//crosstool:crosstool' is not a valid cc_toolchain_suite
>> >>> > rule." Apparently this is because 0.10 required an older version of bazel
>> >>> > (https://github.com/tensorflow/tensorflow/issues/4368), and I don't have the
>> >>> > energy to install an old version of bazel.
>> >>> >
>> >>> > Version 0.11.0rc0 gets almost done and then complains about no such file or
>> >>> > directory for libcudart.so.7.5 (which is there, where I told tensorflow it
>> >>> > was...).
>> >>> >
>> >>> > Non-release versions from git fail immediately because they call git -C to
>> >>> > get version info, which is only in git 1.9 (we have 1.8).
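
For anyone hitting the same git problem on another machine, here is a minimal sketch of a version check to run before attempting a build from a git checkout. It takes the 1.9 threshold from the message above as given, relies on GNU sort -V, and version_ge is a hypothetical helper name.

```shell
#!/bin/sh
# version_ge HAVE NEED: succeed when version HAVE is at least NEED.
# Uses GNU sort -V, so the smaller of the two versions sorts first.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Compare the installed git (if any) against the threshold from the thread.
if command -v git >/dev/null 2>&1; then
    installed="$(git --version | awk '{print $3}')"
    if version_ge "$installed" 1.9; then
        echo "git $installed should support git -C"
    else
        echo "git $installed is older than 1.9; build from a release tarball instead"
    fi
fi
```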
>> >>> >
>> >>> >
>> >>> > Some other notes:
>> >>> > - I made a symlink from ~/.cache/bazel to /home/scratch/$USER/.cache/bazel,
>> >>> > because bazel is the worst. (It complains about doing things on NFS, and
>> >>> > hung for me [clock-related?], and I can't find a global config file or
>> >>> > anything to change that in; it seems like there might be one, but their
>> >>> > documentation is terrible.)
>> >>> >
>> >>> > - I wasn't able to use the actual Titan X compute capability of 6.1,
>> >>> > because that requires cuda 8; I used 5.2 instead. Probably not a huge deal,
>> >>> > but I don't know.
>> >>> >
>> >>> > - I tried explicitly including /usr/local/cuda/lib64 in LD_LIBRARY_PATH and
>> >>> > set CUDA_HOME to /usr/local/cuda before building, hoping that would help
>> >>> > with the 0.11.0rc0 problem, but it didn't.
>> >>
>> >>
>> >
>>
>>
>>