GPU3 back in business
Barnabas Poczos
bapoczos at cs.cmu.edu
Fri Oct 21 15:44:02 EDT 2016
Hi Predrag,
If there is no other solution, then I think it is OK not to have
Matlab on GPU2 and GPU3.
Tensorflow has higher priority on these nodes.
Best,
Barnabas
======================
Barnabas Poczos, PhD
Assistant Professor
Machine Learning Department
Carnegie Mellon University
On Fri, Oct 21, 2016 at 3:37 PM, Predrag Punosevac <predragp at cs.cmu.edu> wrote:
> Dougal Sutherland <dougal at gmail.com> wrote:
>
>
> Sorry that I am late for the party. This is my interpretation of what we
> should do.
>
> 1. I will go back to CUDA 8.0, which will break MATLAB. We have to live
> with it. Barnabas, please OK this. I will work with MathWorks to get this
> fixed for the 2017a release.
>
> 2. Then I could install TensorFlow compiled by Dougal system-wide.
> Dougal, please recompile it again using CUDA 8.0 after I upgrade back to
> 8.0. I could give you the root password so that you can compile and
> install directly.
>
> 3. If everyone is OK with the above, I will pull the trigger on GPU3 at
> 4:30 PM and upgrade to 8.0.
>
> 4. MATLAB will be broken on GPU2 as well after I put in the Titan cards
> during the October 25 power outage.
>
> Predrag
>
>> Heh. :)
>>
>> An explanation:
>>
>> - Different NVIDIA GPU architectures are called "compute capabilities".
>> This is a number that describes the behavior of the card: the maximum size
>> of various things, which API functions it supports, etc. There's a
>> reference here
>> <https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications>,
>> but it shouldn't really matter.
>> - When CUDA compiles code, it targets a certain architecture, since it
>> needs to know what features to use and whatnot. I *think* that if you
>> compile for compute capability x, it will work on a card with compute
>> capability y approximately iff x <= y.
>> - Pascal Titan Xs, like gpu3 has, have compute capability 6.1.
>> - CUDA 7.5 doesn't know about compute capability 6.1, so if you ask to
>> compile for 6.1 it crashes.
>> - Theano by default tries to compile for the capability of the card, but
>> can be configured to compile for a different capability.
>> - Tensorflow asks for a list of capabilities to compile for when you
>> build it in the first place.
>>
>>
>> On Fri, Oct 21, 2016 at 8:17 PM Dougal Sutherland <dougal at gmail.com> wrote:
>>
>> > They do work with 7.5 if you specify an older compute architecture; it's
>> > just that their actual compute capability of 6.1 isn't supported by cuda
>> > 7.5. Theano is thrown off by this, for example, but it can be fixed by
>> > telling it to pass compute capability 5.2 (for example) to nvcc. I don't
>> > think that this was my problem with building tensorflow on 7.5; I'm not
>> > sure what that was.
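The Theano workaround Dougal describes (telling it to pass an older compute capability to nvcc) might look something like the following. This is a hypothetical sketch: the `nvcc.flags` option name and the `device=gpu` setting are assumptions based on Theano's configuration system at the time and should be checked against the installed Theano version:

```python
# Hypothetical illustration (not from the thread): set THEANO_FLAGS before
# importing theano, so its nvcc invocation targets compute capability 5.2
# (sm_52), which CUDA 7.5 understands, instead of the Pascal card's native
# 6.1. Flag names are assumptions; verify against your Theano version.
import os

os.environ["THEANO_FLAGS"] = "device=gpu,nvcc.flags=-arch=sm_52"
print(os.environ["THEANO_FLAGS"])
```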
>> >
>> > On Fri, Oct 21, 2016, 8:11 PM Kirthevasan Kandasamy <kandasamy at cmu.edu>
>> > wrote:
>> >
>> > Thanks, Dougal. I'll take a look at this and get back to you.
>> > So are you suggesting that this is an issue with the Titan Xs not being
>> > compatible with 7.5?
>> >
>> > On Fri, Oct 21, 2016 at 3:08 PM, Dougal Sutherland <dougal at gmail.com>
>> > wrote:
>> >
>> > I installed it in my scratch directory (not sure if there's a global
>> > install?). The main thing was to put its cache on scratch; it got really
>> > upset when the cache directory was on NFS. (Instructions at the bottom of
>> > my previous email.)
>> >
>> > On Fri, Oct 21, 2016, 8:04 PM Barnabas Poczos <bapoczos at cs.cmu.edu> wrote:
>> >
>> > That's great! Thanks Dougal.
>> >
>> > As I remember, bazel was not installed correctly on GPU3 before. Do
>> > you know what went wrong with it and why it works now?
>> >
>> > Thanks,
>> > Barnabas
>> > ======================
>> > Barnabas Poczos, PhD
>> > Assistant Professor
>> > Machine Learning Department
>> > Carnegie Mellon University
>> >
>> >
>> > On Fri, Oct 21, 2016 at 2:03 PM, Dougal Sutherland <dougal at gmail.com>
>> > wrote:
>> > > I was just able to build tensorflow 0.11.0rc0 on gpu3! I used the cuda 8.0
>> > > install, and it built fine. So additionally installing 7.5 was probably not
>> > > necessary; in fact, cuda 7.5 doesn't know about the 6.1 compute architecture
>> > > that the Titan Xs use, so Theano at least needs to be manually told to use
>> > > an older architecture.
>> > >
>> > > A pip package is in ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl. I think
>> > > it should work fine with the cudnn in my scratch directory.
>> > >
>> > > You should probably install it to scratch, either running this first to put
>> > > libraries in your scratch directory or using a virtualenv or something:
>> > > export PYTHONUSERBASE=/home/scratch/$USER/.local
>> > >
>> > > You'll need this to use the library and probably to install it:
>> > > export LD_LIBRARY_PATH=/home/scratch/dsutherl/cudnn-8.0-5.1/cuda/lib64:"$LD_LIBRARY_PATH"
>> > >
>> > > To install:
>> > > pip install --user ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl
>> > > (remove --user if you're using a virtualenv)
>> > >
>> > > (A request: I'm submitting to ICLR in two weeks, and for some of the models
>> > > I'm running gpu3's cards are 4x the speed of gpu1 or 2's. So please don't
>> > > run a ton of stuff on gpu3 unless you're working on a deadline too.)
>> > >
>> > >
>> > >
>> > > Steps to install it, for the future:
>> > >
>> > > Install bazel in your home directory:
>> > >
>> > > wget https://github.com/bazelbuild/bazel/releases/download/0.3.2/bazel-0.3.2-installer-linux-x86_64.sh
>> > > bash bazel-0.3.2-installer-linux-x86_64.sh --prefix=/home/scratch/$USER --base=/home/scratch/$USER/.bazel
>> > >
>> > > Configure bazel to build in scratch. There's probably a better way to do
>> > > this, but this works:
>> > >
>> > > mkdir /home/scratch/$USER/.cache
>> > > ln -s /home/scratch/$USER/.cache/bazel ~/.cache/bazel
>> > >
>> > > Build tensorflow. Note that builds from git checkouts don't work, because
>> > > they assume a newer version of git than is on gpu3:
>> > >
>> > > cd /home/scratch/$USER
>> > > wget
>> > > tar xf
>> > > cd tensorflow-0.11.0rc0
>> > > ./configure
>> > >
>> > > This is an interactive script that doesn't seem to let you pass arguments
>> > > or anything. It's obnoxious.
>> > > Use the default python
>> > > don't use cloud platform or hadoop file system
>> > > use the default site-packages path if it asks
>> > > build with GPU support
>> > > default gcc
>> > > default Cuda SDK version
>> > > specify /usr/local/cuda-8.0
>> > > default cudnn version
>> > > specify $CUDNN_DIR from use-cudnn.sh, e.g.
>> > > /home/scratch/dsutherl/cudnn-8.0-5.1/cuda
>> > > Pascal Titan Xs have compute capability 6.1
>> > >
>> > > bazel build -c opt --config=cuda
>> > > //tensorflow/tools/pip_package:build_pip_package
>> > > bazel-bin/tensorflow/tools/pip_package/build_pip_package ./
>> > > A .whl file, e.g. tensorflow-0.11.0rc0-py2-none-any.whl, is put in the
>> > > directory you specified above.
>> > >
>> > >
>> > > - Dougal
>> > >
>> > >
>> > > On Fri, Oct 21, 2016 at 6:14 PM Kirthevasan Kandasamy <kandasamy at cmu.edu
>> > >
>> > > wrote:
>> > >>
>> > >> Predrag,
>> > >>
>> > >> Any updates on gpu3?
>> > >> I have tried both tensorflow and chainer, and in both cases the problem
>> > >> seems to be with cuda.
>> > >>
>> > >> On Wed, Oct 19, 2016 at 4:10 PM, Predrag Punosevac <predragp at cs.cmu.edu
>> > >
>> > >> wrote:
>> > >>>
>> > >>> Dougal Sutherland <dougal at gmail.com> wrote:
>> > >>>
>> > >>> > I tried for a while. I failed.
>> > >>> >
>> > >>>
>> > >>> Damn, this doesn't look good. I guess it's back to the drawing board. Thanks
>> > >>> for the quick feedback.
>> > >>>
>> > >>> Predrag
>> > >>>
>> > >>> > Version 0.10.0 fails immediately on build: "The specified --crosstool_top
>> > >>> > '@local_config_cuda//crosstool:crosstool' is not a valid cc_toolchain_suite
>> > >>> > rule." Apparently this is because 0.10 required an older version of bazel
>> > >>> > (https://github.com/tensorflow/tensorflow/issues/4368), and I don't have
>> > >>> > the energy to install an old version of bazel.
>> > >>> >
>> > >>> > Version 0.11.0rc0 gets almost done and then complains about no such
>> > >>> > file or directory for libcudart.so.7.5 (which is there, where I told
>> > >>> > tensorflow it was...).
>> > >>> >
>> > >>> > Non-release versions from git fail immediately because they call git -C
>> > >>> > to get version info, which is only in git 1.9 (we have 1.8).
>> > >>> >
>> > >>> >
>> > >>> > Some other notes:
>> > >>> > - I made a symlink from ~/.cache/bazel to /home/scratch/$USER/.cache/bazel,
>> > >>> > because bazel is the worst. (It complains about doing things on NFS, and
>> > >>> > hung for me [clock-related?], and I can't find a global config file or
>> > >>> > anything to change that in; it seems like there might be one, but their
>> > >>> > documentation is terrible.)
>> > >>> >
>> > >>> > - I wasn't able to use the actual Titan X compute capability of 6.1,
>> > >>> > because that requires cuda 8; I used 5.2 instead. Probably not a huge
>> > >>> > deal, but I don't know.
>> > >>> >
>> > >>> > - I tried explicitly including /usr/local/cuda/lib64 in LD_LIBRARY_PATH
>> > >>> > and set CUDA_HOME to /usr/local/cuda before building, hoping that would
>> > >>> > help with the 0.11.0rc0 problem, but it didn't.
>> > >>
>> > >>
>> > >
>> >
>> >
>> >
More information about the Autonlab-users
mailing list