GPU3 back in business

Predrag Punosevac predragp at cs.cmu.edu
Fri Oct 21 15:50:32 EDT 2016


Barnabas Poczos <bapoczos at cs.cmu.edu> wrote:

> Hi Predrag,
> 
> If there is no other solution, then I think it is OK not to have
> Matlab on GPU2 and GPU3.
> Tensorflow has higher priority on these nodes.

We could possibly have multiple CUDA libraries for different versions,
but that is going to bite us in the rear end quickly. People who want
to use MATLAB with GPUs will have to live with GPU1, probably until the
Spring release of MATLAB.
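
If we ever do go the side-by-side route, the idea would be for each user
to pick a toolkit per shell, along the lines of this untested sketch
(assuming the 7.5 toolkit were kept at /usr/local/cuda-7.5):

export PATH=/usr/local/cuda-7.5/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-7.5/lib64:$LD_LIBRARY_PATH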

Predrag

> 
> Best,
> Barnabas
> 
> 
> 
> 
> ======================
> Barnabas Poczos, PhD
> Assistant Professor
> Machine Learning Department
> Carnegie Mellon University
> 
> 
> On Fri, Oct 21, 2016 at 3:37 PM, Predrag Punosevac <predragp at cs.cmu.edu> wrote:
> > Dougal Sutherland <dougal at gmail.com> wrote:
> >
> >
> > Sorry that I am late to the party. This is my interpretation of what we
> > should do.
> >
> > 1. I will go back to CUDA 8.0, which will break MATLAB. We have to live
> > with it. Barnabas, please OK this. I will work with MathWorks to get this
> > fixed for the 2017a release.
> >
> > 2. Then I could install the TensorFlow that Dougal compiled system-wide.
> > Dougal, after I upgrade back to 8.0, please recompile it again using CUDA
> > 8.0. I could give you the root password so that you can compile and
> > install directly.
> >
> > 3. If everyone is OK with the above, I will pull the trigger on GPU3 at
> > 4:30 PM and upgrade to 8.0.
> >
> > 4. MATLAB will be broken on GPU2 as well after I put in the Titan cards
> > during the October 25 power outage.
> >
> > Predrag
> >
> >
> >
> >
> >
> >
> >> Heh. :)
> >>
> >> An explanation:
> >>
> >>    - Different nvidia gpu architectures are called "compute capabilities".
> >>    This is a number that describes the behavior of the card: the maximum size
> >>    of various things, which API functions it supports, etc. There's a
> >>    reference here
> >>    <https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications>,
> >>    but it shouldn't really matter.
> >>    - When CUDA compiles code, it targets a certain architecture, since it
> >>    needs to know what features to use and whatnot. I *think* that if you
> >>    compile for compute capability x, it will work on a card with compute
> >>    capability y approximately iff x <= y.
> >>    - Pascal Titan Xs, like gpu3 has, have compute capability 6.1.
> >>    - CUDA 7.5 doesn't know about compute capability 6.1, so if you ask to
> >>    compile for 6.1 it crashes.
> >>    - Theano by default tries to compile for the capability of the card, but
> >>    can be configured to compile for a different capability (see the sketch
> >>    after this list).
> >>    - Tensorflow asks for a list of capabilities to compile for when you
> >>    build it in the first place.
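> >>
> >>    For instance, to make Theano target an older architecture on a cuda
> >>    7.5 install, something like this should work (flag names from memory,
> >>    so double-check against the Theano docs):
> >>
> >>    THEANO_FLAGS='device=gpu,nvcc.flags=-arch=sm_52' python your_script.py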
> >>
> >>
> >> On Fri, Oct 21, 2016 at 8:17 PM Dougal Sutherland <dougal at gmail.com> wrote:
> >>
> >> > They do work with 7.5 if you specify an older compute architecture; it's
> >> > just that their actual compute capability of 6.1 isn't supported by cuda
> >> > 7.5. Theano is thrown off by this, for example, but it can be fixed by
> >> > telling it to pass compute capability 5.2 (for example) to nvcc. I don't
> >> > think that this was my problem with building tensorflow on 7.5; I'm not
> >> > sure what that was.
> >> >
> >> > On Fri, Oct 21, 2016, 8:11 PM Kirthevasan Kandasamy <kandasamy at cmu.edu>
> >> > wrote:
> >> >
> >> > Thanks Dougal. I'll take a look at this and get back to you.
> >> > So are you suggesting that this is an issue with TitanX's not being
> >> > compatible with 7.5?
> >> >
> >> > On Fri, Oct 21, 2016 at 3:08 PM, Dougal Sutherland <dougal at gmail.com>
> >> > wrote:
> >> >
> >> > I installed it in my scratch directory (not sure if there's a global
> >> > install?). The main thing was to put its cache on scratch; it got really
> >> > upset when the cache directory was on NFS. (Instructions at the bottom of
> >> > my previous email.)
> >> >
> >> > On Fri, Oct 21, 2016, 8:04 PM Barnabas Poczos <bapoczos at cs.cmu.edu> wrote:
> >> >
> >> > That's great! Thanks Dougal.
> >> >
> >> > As I remember, bazel was not installed correctly on GPU3 previously. Do
> >> > you know what went wrong with it before, and why it works now?
> >> >
> >> > Thanks,
> >> > Barnabas
> >> > ======================
> >> > Barnabas Poczos, PhD
> >> > Assistant Professor
> >> > Machine Learning Department
> >> > Carnegie Mellon University
> >> >
> >> >
> >> > On Fri, Oct 21, 2016 at 2:03 PM, Dougal Sutherland <dougal at gmail.com>
> >> > wrote:
> >> > > I was just able to build tensorflow 0.11.0rc0 on gpu3! I used the
> >> > > cuda 8.0 install, and it built fine. So additionally installing 7.5
> >> > > was probably not necessary; in fact, cuda 7.5 doesn't know about the
> >> > > 6.1 compute architecture that the Titan Xs use, so Theano at least
> >> > > needs to be manually told to use an older architecture.
> >> > >
> >> > > A pip package is in ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl.
> >> > > I think it should work fine with the cudnn in my scratch directory.
> >> > >
> >> > > You should probably install it to scratch, either by running this
> >> > > first to put libraries in your scratch directory, or by using a
> >> > > virtualenv or something:
> >> > > export PYTHONUSERBASE=/home/scratch/$USER/.local
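> >> > >
> >> > > (Or, roughly equivalently, with a virtualenv on scratch; an untested
> >> > > sketch, adjust the path to taste:)
> >> > > virtualenv /home/scratch/$USER/tf-env
> >> > > source /home/scratch/$USER/tf-env/bin/activate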
> >> > >
> >> > > You'll need this to use the library and probably to install it:
> >> > > export LD_LIBRARY_PATH=/home/scratch/dsutherl/cudnn-8.0-5.1/cuda/lib64:"$LD_LIBRARY_PATH"
> >> > >
> >> > > To install:
> >> > > pip install --user ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl
> >> > > (remove --user if you're using a virtualenv)
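> >> > >
> >> > > As a quick sanity check after installing (just a sketch; opening a
> >> > > session should log the GPUs it found):
> >> > > python -c "import tensorflow as tf; tf.Session()"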
> >> > >
> >> > > (A request: I'm submitting to ICLR in two weeks, and for some of the
> >> > > models I'm running, gpu3's cards are 4x the speed of gpu1's or gpu2's.
> >> > > So please don't run a ton of stuff on gpu3 unless you're working on a
> >> > > deadline too.)
> >> > >
> >> > >
> >> > >
> >> > > Steps to install it, for the future:
> >> > >
> >> > > Install bazel in your home directory:
> >> > >
> >> > > wget https://github.com/bazelbuild/bazel/releases/download/0.3.2/bazel-0.3.2-installer-linux-x86_64.sh
> >> > > bash bazel-0.3.2-installer-linux-x86_64.sh --prefix=/home/scratch/$USER --base=/home/scratch/$USER/.bazel
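> >> > >
> >> > > (If I remember the installer right, --prefix puts the binary in
> >> > > /home/scratch/$USER/bin, so make sure that is on your PATH:)
> >> > > export PATH=/home/scratch/$USER/bin:$PATH
> >> > > bazel version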
> >> > >
> >> > > Configure bazel to build in scratch. There's probably a better way to do
> >> > > this, but this works:
> >> > >
> >> > > mkdir -p /home/scratch/$USER/.cache/bazel
> >> > > ln -s /home/scratch/$USER/.cache/bazel ~/.cache/bazel
> >> > >
> >> > > Build tensorflow. Note that builds from git checkouts don't work, because
> >> > > they assume a newer version of git than is on gpu3:
> >> > >
> >> > > cd /home/scratch/$USER
> >> > > wget
> >> > > tar xf
> >> > > cd tensorflow-0.11.0rc0
> >> > > ./configure
> >> > >
> >> > > This is an interactive script that doesn't seem to let you pass
> >> > > arguments or anything. It's obnoxious. Answers to give (a possible
> >> > > env-var shortcut is sketched after this list):
> >> > > - use the default python
> >> > > - don't use cloud platform or hadoop file system support
> >> > > - use the default site-packages path if it asks
> >> > > - build with GPU support
> >> > > - default gcc
> >> > > - default Cuda SDK version
> >> > > - specify /usr/local/cuda-8.0
> >> > > - default cudnn version
> >> > > - specify $CUDNN_DIR from use-cudnn.sh, e.g.
> >> > >   /home/scratch/dsutherl/cudnn-8.0-5.1/cuda
> >> > > - Pascal Titan Xs have compute capability 6.1
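> >> > >
> >> > > (A possible shortcut I haven't tested: the configure script appears
> >> > > to read defaults from environment variables, so you may be able to
> >> > > pre-seed the answers; variable names here are from memory and may
> >> > > differ between versions:)
> >> > > export TF_NEED_GCP=0 TF_NEED_HDFS=0 TF_NEED_CUDA=1
> >> > > export CUDA_TOOLKIT_PATH=/usr/local/cuda-8.0
> >> > > export CUDNN_INSTALL_PATH=/home/scratch/dsutherl/cudnn-8.0-5.1/cuda
> >> > > export TF_CUDA_COMPUTE_CAPABILITIES=6.1
> >> > > ./configure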
> >> > >
> >> > > bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
> >> > > bazel-bin/tensorflow/tools/pip_package/build_pip_package ./
> >> > > A .whl file, e.g. tensorflow-0.11.0rc0-py2-none-any.whl, is put in the
> >> > > directory you specified above.
> >> > >
> >> > >
> >> > > - Dougal
> >> > >
> >> > >
> >> > > On Fri, Oct 21, 2016 at 6:14 PM Kirthevasan Kandasamy
> >> > > <kandasamy at cmu.edu> wrote:
> >> > >>
> >> > >> Predrag,
> >> > >>
> >> > >> Any updates on gpu3?
> >> > >> I have tried both tensorflow and chainer and in both cases the problem
> >> > >> seems to be with cuda
> >> > >>
> >> > >> On Wed, Oct 19, 2016 at 4:10 PM, Predrag Punosevac
> >> > >> <predragp at cs.cmu.edu> wrote:
> >> > >>>
> >> > >>> Dougal Sutherland <dougal at gmail.com> wrote:
> >> > >>>
> >> > >>> > I tried for a while. I failed.
> >> > >>> >
> >> > >>>
> >> > >>> Damn, this doesn't look good. I guess it's back to the drawing board.
> >> > >>> Thanks for the quick feedback.
> >> > >>>
> >> > >>> Predrag
> >> > >>>
> >> > >>> > Version 0.10.0 fails immediately on build: "The specified
> >> > >>> > --crosstool_top '@local_config_cuda//crosstool:crosstool' is not a
> >> > >>> > valid cc_toolchain_suite rule." Apparently this is because 0.10
> >> > >>> > required an older version of bazel
> >> > >>> > (https://github.com/tensorflow/tensorflow/issues/4368), and I don't
> >> > >>> > have the energy to install an old version of bazel.
> >> > >>> >
> >> > >>> > Version 0.11.0rc0 gets almost done and then complains about no such
> >> > >>> > file or directory for libcudart.so.7.5 (which is there, where I
> >> > >>> > told tensorflow it was...).
> >> > >>> >
> >> > >>> > Non-release versions from git fail immediately because they call
> >> > >>> > git -C to get version info, which is only in git 1.9 (we have 1.8).
> >> > >>> >
> >> > >>> >
> >> > >>> > Some other notes:
> >> > >>> > - I made a symlink from ~/.cache/bazel to
> >> > >>> > /home/scratch/$USER/.cache/bazel, because bazel is the worst. (It
> >> > >>> > complains about doing things on NFS, and hung for me
> >> > >>> > [clock-related?], and I can't find a global config file or anything
> >> > >>> > to change that in; it seems like there might be one, but their
> >> > >>> > documentation is terrible.)
> >> > >>> >
> >> > >>> > - I wasn't able to use the actual Titan X compute capability of
> >> > >>> > 6.1, because that requires cuda 8; I used 5.2 instead. Probably not
> >> > >>> > a huge deal, but I don't know.
> >> > >>> >
> >> > >>> > - I tried explicitly including /usr/local/cuda/lib64 in
> >> > >>> > LD_LIBRARY_PATH and set CUDA_HOME to /usr/local/cuda before
> >> > >>> > building, hoping that would help with the 0.11.0rc0 problem, but it
> >> > >>> > didn't.
> >> > >>
> >> > >>
> >> > >
> >> >
> >> >
> >> >


More information about the Autonlab-users mailing list