Moral of the story

Wed Nov 7 12:22:05 EST 2018

Vincent Jeanselme <vjeansel at andrew.cmu.edu> wrote:

> Problem solved after restart of tmux
> 

This is a good opportunity for all of us to reflect on what we have
learnt from this long public e-mail exchange.

1. Caching, be it PyTorch, ccache, or something else, speeds things up
but creates a lot of problems when done on a volatile file system such
as NFS, backed by the most expensive file system, ZFS. It causes
unexpected, hard-to-trace errors whenever the file server is
unavailable. Moreover, from a system admin's point of view, it creates
enormous garbage on the file server in the form of the metadata needed
to store hourly snapshots. I would wager $100 that we have about 500GB
in cache files and their snapshots alone on the main file server.

I would really appreciate it if everyone voluntarily used only their
scratch directories (not /tmp, not NFS) for caching, and cleaned up
their home directories during this time while ZFS snapshots are disabled.
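
For those who prefer to do this from inside Python rather than in a
shell profile, here is a minimal sketch. CUDA_CACHE_PATH is the variable
Yichong mentions below and CCACHE_DIR is the one ccache reads; the
helper name use_scratch_caches, the "cache" subdirectory, and the
default scratch root are only assumptions for illustration, so adjust
them to your own layout.

```python
import getpass
import os
from pathlib import Path


def use_scratch_caches(scratch_root="/home/scratch", user=None, env=None):
    """Point cache-related environment variables at a per-user scratch directory.

    Hypothetical helper: the directory layout below is an assumption,
    not a lab convention.
    """
    env = os.environ if env is None else env
    user = user or getpass.getuser()
    base = Path(scratch_root) / user / "cache"
    targets = {
        "CUDA_CACHE_PATH": base / "cuda",   # CUDA JIT compilation cache
        "CCACHE_DIR": base / "ccache",      # ccache object cache
    }
    for name, path in targets.items():
        # Create the directory up front so the tools do not fall back
        # to a default location under the NFS home directory.
        path.mkdir(parents=True, exist_ok=True)
        env[name] = str(path)
    return targets
```

Call use_scratch_caches() before "import torch" (or before launching a
build), since a process that has already started using its cache may
not pick up a later change to the variable.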

2. Storing databases on NFS, even unknowingly (the SQLite databases used
by Jupyter notebook), will sooner or later leave them in an inconsistent
state and lead to user frustration that is very hard and time-consuming
to trace and address. It is even worse to do it intentionally with
PostgreSQL or MySQL. Please store your Jupyter notebook SQLite databases
in the scratch directory. For anything more serious, we have a database
host that can be used on an as-needed basis.
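
If you are not sure whether a given database file lives on NFS, the
sketch below is one way to check on Linux. It reads /proc/mounts and
reports the filesystem type of the longest mount point containing the
path. The function names are hypothetical, and the parsing is kept
deliberately simple (it does not decode escaped spaces in mount points).

```python
import os


def filesystem_type(path, mounts_file="/proc/mounts"):
    """Return the filesystem type of the mount that contains `path` (Linux only)."""
    target = os.path.realpath(path)
    best_mount, best_type = "", None
    with open(mounts_file) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 3:
                continue
            mount_point, fs_type = fields[1], fields[2]
            prefix = mount_point.rstrip("/") + "/"
            if target == mount_point or target.startswith(prefix):
                # Prefer the longest matching mount point; on a tie the
                # later entry wins because it shadows the earlier one.
                if len(mount_point) >= len(best_mount):
                    best_mount, best_type = mount_point, fs_type
    return best_type


def is_on_nfs(path, mounts_file="/proc/mounts"):
    """True if `path` sits on a mount whose type starts with "nfs" (nfs, nfs4)."""
    fs_type = filesystem_type(path, mounts_file)
    return fs_type is not None and fs_type.startswith("nfs")
```

For example, is_on_nfs(os.path.expanduser("~/.ipython")) checks the
directory where IPython keeps its history.sqlite by default. If it
returns True, that database is a candidate for moving to scratch.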

3. Finally, we all need to familiarize ourselves better with the tools
we are using (Git/Gogs/tmux/screen etc). The decision to adopt Git as
the version control system for the Auton Lab was long and carefully
thought out.  For the record, my opinion and my preference (Fossil)
carried almost no weight.  We had two other version control systems,
CVS and Subversion, in the past, which are still available read-only
through ViewVC

http://svnhub.int.autonlab.org/viewvc

and I can assure you that we learnt our lessons by using them.  The
same goes for the Gogs self-hosted Git service, which provides us with
a web interface but also with a bug tracking mechanism with code
tagging, a Wiki, and solid integration with Jenkins. Is it perfect? No,
it is not. Does one need to understand how ssh keys and environment
variables are read? Yes, you have to get your feet wet, and it is far
easier to do it at the Auton Lab, which is a very forgiving academic
computing environment, than at your next place of employment. If you
think that the Gogs alternative GitLab is any better, think again and
just talk to people who have used it or, God forbid, try to set it up.

Best,
Predrag

> On 11/6/18 10:01 PM, Vincent Jeanselme wrote:
> >
> > Unfortunately not for me, I already had this path ...
> >
> > On 06/11/2018 at 21:51, Matthew Barnes wrote:
> >> The CUDA_CACHE_PATH works! Thanks for the quick fix.
> >>
> >> On Tue, Nov 6, 2018 at 9:44 PM Yichong Xu <yichongx at cs.cmu.edu 
> >> <mailto:yichongx at cs.cmu.edu>> wrote:
> >>
> >>     Previously we have encountered this issue: Basically somehow you
> >>     cannot put your cuda cache on nfs server now. Doing this will
> >>     resolve the problem (works for me):
> >>     export CUDA_CACHE_PATH=/home/scratch/[your_id]/[some_folder]
> >>
> >>     /Thanks,/
> >>     /Yichong/
> >>
> >>
> >>
> >>>     On Nov 6, 2018, at 7:41 PM, Emre Yolcu <eyolcu at cs.cmu.edu
> >>>     <mailto:eyolcu at cs.cmu.edu>> wrote:
> >>>
> >>>     Could you try setting up everything in the scratch directory and
> >>>     test that way (if that's not what you're already doing)? The
> >>>     last time we had a CUDA problem I moved everything from
> >>>     /zfsauton/home to /home/scratch directories and I cannot
> >>>     reproduce the error on gpu{6,8,9}.
> >>>
> >>>     On Tue, Nov 6, 2018 at 6:41 PM, <qiong.zhang at stat.ubc.ca
> >>>     <mailto:qiong.zhang at stat.ubc.ca>> wrote:
> >>>
> >>>         I have a similar issue. When I submit the job, it says
> >>>         Runtime error: CUDA error: unknown error. I tried the simple
> >>>         commands that you provided, doesn't work as well.
> >>>
> >>>         Qiong
> >>>
> >>>
> >>>         November 6, 2018 3:02 PM, "Matthew Barnes"
> >>>         <mbarnes1 at andrew.cmu.edu> wrote:
> >>>
> >>>             Is anyone else having issues with CUDA since this week?
> >>>             Even simple pytorch commands hang:
> >>>             (torch) bash-4.2$ python
> >>>             Python 2.7.5 (default, Jul 3 2018, 19:30:05)
> >>>             [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux2
> >>>             Type "help", "copyright", "credits" or "license" for
> >>>             more information.
> >>>             >>> import torch
> >>>             >>> x = torch.zeros(4)
> >>>             >>> x.cuda()
> >>>             nvidia-smi works, and torch.cuda.is_available() returns
> >>>             True.
> >>>
> >>>
> >>>
> >>>
> >>
> > -- 
> > Vincent Jeanselme
> > -----------------
> > Analyst Researcher
> > Auton Lab - Robotics Institute
> > Carnegie Mellon University
> 
> -- 
> Vincent Jeanselme
> -----------------
> Analyst Researcher
> Auton Lab - Robotics Institute
> Carnegie Mellon University
>