GPU3 back in business
Kirthevasan Kandasamy
kandasamy at cmu.edu
Fri Oct 21 16:46:59 EDT 2016
Ok, that should be enough.
samy
On Fri, Oct 21, 2016 at 4:44 PM, Barnabas Poczos <bapoczos at cs.cmu.edu>
wrote:
> Hi Samy,
>
> GPU1 will still have Matlab and 4 K80 GPUs (which is technically 8
> GPUs). Won't that be enough for now?
>
> Best,
> B
> ======================
> Barnabas Poczos, PhD
> Assistant Professor
> Machine Learning Department
> Carnegie Mellon University
>
>
> On Fri, Oct 21, 2016 at 4:21 PM, Kirthevasan Kandasamy
> <kandasamy at cmu.edu> wrote:
> > Hi all,
> >
> > I was planning on using Matlab with GPUs for one of my projects.
> > Can we please keep gpu2 as it is for now?
> >
> > samy
> >
> > On Fri, Oct 21, 2016 at 3:54 PM, Barnabas Poczos <bapoczos at cs.cmu.edu>
> > wrote:
> >>
> >> Sounds good. Let us have tensorflow system wide on all GPU nodes. We
> >> can worry about Matlab later.
> >>
> >> Best,
> >> B
> >> ======================
> >> Barnabas Poczos, PhD
> >> Assistant Professor
> >> Machine Learning Department
> >> Carnegie Mellon University
> >>
> >>
> >> On Fri, Oct 21, 2016 at 3:50 PM, Predrag Punosevac <predragp at cs.cmu.edu>
> >> wrote:
> >> > Barnabas Poczos <bapoczos at cs.cmu.edu> wrote:
> >> >
> >> >> Hi Predrag,
> >> >>
> >> >> If there is no other solution, then I think it is OK not to have
> >> >> Matlab on GPU2 and GPU3.
> >> >> Tensorflow has higher priority on these nodes.
> >> >
> >> > We could possibly have multiple CUDA libraries for different versions,
> >> > but that is going to bite us in the rear end quickly. People who want
> >> > to use MATLAB with GPUs will have to live with GPU1, probably until
> >> > the spring release of MATLAB.
> >> >
> >> > Predrag
> >> >
> >> >>
> >> >> Best,
> >> >> Barnabas
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> ======================
> >> >> Barnabas Poczos, PhD
> >> >> Assistant Professor
> >> >> Machine Learning Department
> >> >> Carnegie Mellon University
> >> >>
> >> >>
> >> >> On Fri, Oct 21, 2016 at 3:37 PM, Predrag Punosevac
> >> >> <predragp at cs.cmu.edu> wrote:
> >> >> > Dougal Sutherland <dougal at gmail.com> wrote:
> >> >> >
> >> >> >
> >> >> > Sorry that I am late for the party. This is my interpretation of
> >> >> > what we should do.
> >> >> >
> >> >> > 1. I will go back to CUDA 8.0, which will break MATLAB. We have to
> >> >> > live with it. Barnabas, please OK this. I will work with MathWorks
> >> >> > for this to be fixed for the 2017a release.
> >> >> >
> >> >> > 2. Then I could install the TensorFlow compiled by Dougal system-wide.
> >> >> > Dougal, please recompile it again using CUDA 8.0 after I upgrade back
> >> >> > to 8.0. I could give you the root password so that you can compile and
> >> >> > install directly.
> >> >> >
> >> >> > 3. If everyone is OK with the above, I will pull the trigger on GPU3
> >> >> > at 4:30 PM and upgrade to 8.0.
> >> >> >
> >> >> > 4. MATLAB will be broken on GPU2 as well after I put in the Titan
> >> >> > cards during the October 25 power outage.
> >> >> >
> >> >> > Predrag
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >> Heh. :)
> >> >> >>
> >> >> >> An explanation:
> >> >> >>
> >> >> >>    - Different nvidia gpu architectures are called "compute
> >> >> >>    capabilities". This is a number that describes the behavior of the
> >> >> >>    card: the maximum size of various things, which API functions it
> >> >> >>    supports, etc. There's a reference here
> >> >> >>    <https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications>,
> >> >> >>    but it shouldn't really matter.
> >> >> >>    - When CUDA compiles code, it targets a certain architecture, since
> >> >> >>    it needs to know what features to use and whatnot. I *think* that
> >> >> >>    if you compile for compute capability x, it will work on a card
> >> >> >>    with compute capability y approximately iff x <= y.
> >> >> >>    - Pascal Titan Xs, like gpu3 has, have compute capability 6.1.
> >> >> >>    - CUDA 7.5 doesn't know about compute capability 6.1, so if you ask
> >> >> >>    to compile for 6.1 it crashes.
> >> >> >>    - Theano by default tries to compile for the capability of the
> >> >> >>    card, but can be configured to compile for a different capability.
> >> >> >>    - Tensorflow asks for a list of capabilities to compile for when
> >> >> >>    you build it in the first place.
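> >> >> >>
> >> >> >>    For concreteness, this is roughly what "targeting an architecture"
> >> >> >>    looks like at the nvcc level (saxpy.cu is just a made-up example
> >> >> >>    file, and the Theano flag is from memory, so double-check it):
> >> >> >>
> >> >> >>    # compile a CUDA source file for compute capability 5.2 explicitly
> >> >> >>    nvcc -gencode arch=compute_52,code=sm_52 saxpy.cu -o saxpy
> >> >> >>    # (from memory) pin Theano to an older architecture on CUDA 7.5
> >> >> >>    THEANO_FLAGS='nvcc.flags=-arch=sm_52' python train.py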
> >> >> >>
> >> >> >>
> >> >> >> On Fri, Oct 21, 2016 at 8:17 PM Dougal Sutherland <dougal at gmail.com>
> >> >> >> wrote:
> >> >> >>
> >> >> >> > They do work with 7.5 if you specify an older compute architecture;
> >> >> >> > it's just that their actual compute capability of 6.1 isn't
> >> >> >> > supported by cuda 7.5. Theano is thrown off by this, for example,
> >> >> >> > but it can be fixed by telling it to pass compute capability 5.2
> >> >> >> > (for example) to nvcc. I don't think that this was my problem with
> >> >> >> > building tensorflow on 7.5; I'm not sure what that was.
> >> >> >> >
> >> >> >> > On Fri, Oct 21, 2016, 8:11 PM Kirthevasan Kandasamy
> >> >> >> > <kandasamy at cmu.edu>
> >> >> >> > wrote:
> >> >> >> >
> >> >> >> > Thanks Dougal. I'll take a look at this and get back to you.
> >> >> >> > So are you suggesting that this is an issue with the Titan Xs not
> >> >> >> > being compatible with 7.5?
> >> >> >> >
> >> >> >> > On Fri, Oct 21, 2016 at 3:08 PM, Dougal Sutherland
> >> >> >> > <dougal at gmail.com>
> >> >> >> > wrote:
> >> >> >> >
> >> >> >> > I installed it in my scratch directory (not sure if there's a
> >> >> >> > global install?). The main thing was to put its cache on scratch;
> >> >> >> > it got really upset when the cache directory was on NFS.
> >> >> >> > (Instructions at the bottom of my previous email.)
> >> >> >> >
> >> >> >> > On Fri, Oct 21, 2016, 8:04 PM Barnabas Poczos
> >> >> >> > <bapoczos at cs.cmu.edu> wrote:
> >> >> >> >
> >> >> >> > That's great! Thanks Dougal.
> >> >> >> >
> >> >> >> > As I remember, bazel was not installed correctly on GPU3
> >> >> >> > previously. Do you know what went wrong with it before and why it
> >> >> >> > works now?
> >> >> >> >
> >> >> >> > Thanks,
> >> >> >> > Barnabas
> >> >> >> > ======================
> >> >> >> > Barnabas Poczos, PhD
> >> >> >> > Assistant Professor
> >> >> >> > Machine Learning Department
> >> >> >> > Carnegie Mellon University
> >> >> >> >
> >> >> >> >
> >> >> >> > On Fri, Oct 21, 2016 at 2:03 PM, Dougal Sutherland
> >> >> >> > <dougal at gmail.com>
> >> >> >> > wrote:
> >> >> >> > > I was just able to build tensorflow 0.11.0rc0 on gpu3! I used the
> >> >> >> > > cuda 8.0 install, and it built fine. So additionally installing
> >> >> >> > > 7.5 was probably not necessary; in fact, cuda 7.5 doesn't know
> >> >> >> > > about the 6.1 compute architecture that the Titan Xs use, so
> >> >> >> > > Theano at least needs to be manually told to use an older
> >> >> >> > > architecture.
> >> >> >> > >
> >> >> >> > > A pip package is in ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl.
> >> >> >> > > I think it should work fine with the cudnn in my scratch directory.
> >> >> >> > >
> >> >> >> > > You should probably install it to scratch, either running this
> >> >> >> > > first to put libraries in your scratch directory or using a
> >> >> >> > > virtualenv or something:
> >> >> >> > > export PYTHONUSERBASE=/home/scratch/$USER/.local
> >> >> >> > >
> >> >> >> > > You'll need this to use the library and probably to install it:
> >> >> >> > > export LD_LIBRARY_PATH=/home/scratch/dsutherl/cudnn-8.0-5.1/cuda/lib64:"$LD_LIBRARY_PATH"
> >> >> >> > >
> >> >> >> > > To install:
> >> >> >> > > pip install --user ~dsutherl/tensorflow-0.11.0rc0-py2-none-any.whl
> >> >> >> > > (remove --user if you're using a virtualenv)
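> >> >> >> > >
> >> >> >> > > As a quick sanity check after installing (a sketch, assuming the
> >> >> >> > > wheel and the LD_LIBRARY_PATH export above; creating a session
> >> >> >> > > should log the GPUs it finds):
> >> >> >> > > python -c "import tensorflow as tf; print(tf.__version__); tf.Session()"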
> >> >> >> > >
> >> >> >> > > (A request: I'm submitting to ICLR in two weeks, and for some of
> >> >> >> > > the models I'm running, gpu3's cards are 4x the speed of gpu1's or
> >> >> >> > > gpu2's. So please don't run a ton of stuff on gpu3 unless you're
> >> >> >> > > working on a deadline too.)
> >> >> >> > >
> >> >> >> > >
> >> >> >> > >
> >> >> >> > > Steps to install it, for the future:
> >> >> >> > >
> >> >> >> > > Install bazel under your scratch directory:
> >> >> >> > >
> >> >> >> > > wget https://github.com/bazelbuild/bazel/releases/download/0.3.2/bazel-0.3.2-installer-linux-x86_64.sh
> >> >> >> > > bash bazel-0.3.2-installer-linux-x86_64.sh --prefix=/home/scratch/$USER --base=/home/scratch/$USER/.bazel
> >> >> >> > >
> >> >> >> > > Configure bazel to build in scratch. There's probably a better
> >> >> >> > > way to do this, but this works:
> >> >> >> > >
> >> >> >> > > mkdir /home/scratch/$USER/.cache
> >> >> >> > > ln -s /home/scratch/$USER/.cache/bazel ~/.cache/bazel
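> >> >> >> > >
> >> >> >> > > (One assumption here: with --prefix=/home/scratch/$USER the bazel
> >> >> >> > > binary should land in /home/scratch/$USER/bin, so it may also need
> >> >> >> > > to be on your PATH before the next step:)
> >> >> >> > > export PATH=/home/scratch/$USER/bin:"$PATH"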
> >> >> >> > >
> >> >> >> > > Build tensorflow. Note that builds from git checkouts don't work,
> >> >> >> > > because they assume a newer version of git than is on gpu3:
> >> >> >> > >
> >> >> >> > > cd /home/scratch/$USER
> >> >> >> > > wget
> >> >> >> > > tar xf
> >> >> >> > > cd tensorflow-0.11.0rc0
> >> >> >> > > ./configure
> >> >> >> > >
> >> >> >> > > This is an interactive script that doesn't seem to let you pass
> >> >> >> > > arguments or anything. It's obnoxious.
> >> >> >> > > - Use the default python
> >> >> >> > > - don't use cloud platform or hadoop file system
> >> >> >> > > - use the default site-packages path if it asks
> >> >> >> > > - build with GPU support
> >> >> >> > > - default gcc
> >> >> >> > > - default Cuda SDK version
> >> >> >> > > - specify /usr/local/cuda-8.0
> >> >> >> > > - default cudnn version
> >> >> >> > > - specify $CUDNN_DIR from use-cudnn.sh, e.g.
> >> >> >> > >   /home/scratch/dsutherl/cudnn-8.0-5.1/cuda
> >> >> >> > > - Pascal Titan Xs have compute capability 6.1
> >> >> >> > >
> >> >> >> > > bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
> >> >> >> > > bazel-bin/tensorflow/tools/pip_package/build_pip_package ./
> >> >> >> > > A .whl file, e.g. tensorflow-0.11.0rc0-py2-none-any.whl, is put
> >> >> >> > > in the directory you specified above.
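> >> >> >> > >
> >> >> >> > > (If you'd rather keep it isolated, a rough sketch of the virtualenv
> >> >> >> > > route mentioned above, assuming virtualenv is available on the node:)
> >> >> >> > > virtualenv /home/scratch/$USER/tf-0.11
> >> >> >> > > source /home/scratch/$USER/tf-0.11/bin/activate
> >> >> >> > > pip install ./tensorflow-0.11.0rc0-py2-none-any.whl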
> >> >> >> > >
> >> >> >> > >
> >> >> >> > > - Dougal
> >> >> >> > >
> >> >> >> > >
> >> >> >> > > On Fri, Oct 21, 2016 at 6:14 PM Kirthevasan Kandasamy
> >> >> >> > > <kandasamy at cmu.edu> wrote:
> >> >> >> > >>
> >> >> >> > >> Predrag,
> >> >> >> > >>
> >> >> >> > >> Any updates on gpu3?
> >> >> >> > >> I have tried both tensorflow and chainer, and in both cases the
> >> >> >> > >> problem seems to be with cuda.
> >> >> >> > >>
> >> >> >> > >> On Wed, Oct 19, 2016 at 4:10 PM, Predrag Punosevac
> >> >> >> > >> <predragp at cs.cmu.edu> wrote:
> >> >> >> > >>>
> >> >> >> > >>> Dougal Sutherland <dougal at gmail.com> wrote:
> >> >> >> > >>>
> >> >> >> > >>> > I tried for a while. I failed.
> >> >> >> > >>> >
> >> >> >> > >>>
> >> >> >> > >>> Damn, this doesn't look good. I guess it's back to the drawing
> >> >> >> > >>> board. Thanks for the quick feedback.
> >> >> >> > >>>
> >> >> >> > >>> Predrag
> >> >> >> > >>>
> >> >> >> > >>> > Version 0.10.0 fails immediately on build: "The specified
> >> >> >> > >>> > --crosstool_top '@local_config_cuda//crosstool:crosstool' is
> >> >> >> > >>> > not a valid cc_toolchain_suite rule." Apparently this is
> >> >> >> > >>> > because 0.10 required an older version of bazel
> >> >> >> > >>> > (https://github.com/tensorflow/tensorflow/issues/4368), and I
> >> >> >> > >>> > don't have the energy to install an old version of bazel.
> >> >> >> > >>> >
> >> >> >> > >>> > Version 0.11.0rc0 gets almost done and then complains about
> >> >> >> > >>> > no such file or directory for libcudart.so.7.5 (which is
> >> >> >> > >>> > there, where I told tensorflow it was...).
> >> >> >> > >>> >
> >> >> >> > >>> > Non-release versions from git fail immediately because they
> >> >> >> > >>> > call git -C to get version info, which is only in git 1.9 (we
> >> >> >> > >>> > have 1.8).
> >> >> >> > >>> >
> >> >> >> > >>> >
> >> >> >> > >>> > Some other notes:
> >> >> >> > >>> > - I made a symlink from ~/.cache/bazel to
> >> >> >> > >>> > /home/scratch/$USER/.cache/bazel, because bazel is the worst.
> >> >> >> > >>> > (It complains about doing things on NFS, and hung for me
> >> >> >> > >>> > [clock-related?], and I can't find a global config file or
> >> >> >> > >>> > anything to change that in; it seems like there might be one,
> >> >> >> > >>> > but their documentation is terrible.)
> >> >> >> > >>> >
> >> >> >> > >>> > - I wasn't able to use the actual Titan X compute capability
> >> >> >> > >>> > of 6.1, because that requires cuda 8; I used 5.2 instead.
> >> >> >> > >>> > Probably not a huge deal, but I don't know.
> >> >> >> > >>> >
> >> >> >> > >>> > - I tried explicitly including /usr/local/cuda/lib64 in
> >> >> >> > >>> > LD_LIBRARY_PATH and setting CUDA_HOME to /usr/local/cuda
> >> >> >> > >>> > before building, hoping that would help with the 0.11.0rc0
> >> >> >> > >>> > problem, but it didn't.
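> >> >> >> > >>> >
> >> >> >> > >>> > (A quick way to see which cudart libraries the loader can
> >> >> >> > >>> > actually find, if someone wants to dig into this later; just a
> >> >> >> > >>> > suggestion:)
> >> >> >> > >>> > ldconfig -p | grep libcudart
> >> >> >> > >>> > echo "$LD_LIBRARY_PATH"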
> >> >> >> > >>
> >> >> >> > >>
> >> >> >> > >
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >
> >
>