GPU 1 error

Predrag Punosevac predragp at andrew.cmu.edu
Sat Nov 3 22:57:17 EDT 2018


Biswajit Paria <bparia at cs.cmu.edu> wrote:

> I see. I was using it yesterday. It is possible that CUDA is broken
> and it is somehow not using the CUDA in my home directory. I will try to
> get it to use my local CUDA; otherwise, I will wait till Monday.
> 
> Thanks!
> 

Ok, I just spent almost 2h playing with GPU1. This is what I have done: I
cleaned the system, upgraded the NVidia driver to 396.44, and upgraded CUDA
to 9.2. I then cleaned and upgraded all the packages. Note that I didn't
want to install the recently released CUDA 10, which is probably still
poorly supported by applications.
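If you want to double-check what the node reports now, something along
these lines works (a rough Python sketch; it just shells out to nvidia-smi
and nvcc and assumes both are on your PATH):

    import subprocess

    def run(cmd):
        # Run a command and return its stdout, or the error if it fails.
        try:
            return subprocess.check_output(cmd, universal_newlines=True).strip()
        except (OSError, subprocess.CalledProcessError) as err:
            return "failed: %s" % err

    # Driver version reported by the NVIDIA kernel module (should be 396.44).
    print("driver:", run(["nvidia-smi", "--query-gpu=driver_version",
                          "--format=csv,noheader"]))
    # Toolkit version from nvcc (should mention release 9.2).
    print("nvcc  :", run(["nvcc", "--version"]))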

The system works like a Swiss watch now, but it is likely that all
deep-learning tools are in a broken state. You will have to rebuild
TensorFlow or whatever else you were using.
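Once you rebuild, a quick smoke test like the one below tells you whether
the library can actually reach the card. This is just a sketch assuming
MXNet, since that is what the stack trace quoted below came from; for
TensorFlow 1.x the equivalent check would be tf.test.is_gpu_available().

    import mxnet as mx

    # Allocate a tiny array on the first GPU; asnumpy() forces the copy
    # back to the host, so this fails right away if CUDA is mismatched.
    x = mx.nd.ones((2, 2), ctx=mx.gpu(0))
    print(x.asnumpy())
    print("GPU 0 is usable")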

The following two users 

678.5 GiB joliva
513.8 GiB chunlial 

should try to clean their scratch directories, or at least e-mail me with
an explanation for such excessive use. I have now half-way scripted this
cleanup for Ansible, so I could push it to all GPU nodes, but that would
likely inflict a lot of pain on people who are running jobs.
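If you want to see what is eating your space before I get to the Ansible
push, a small script like this adds up file sizes under a directory (just
a sketch; the path below is a placeholder, point it at your actual scratch
directory):

    import os

    def usage_gib(root):
        # Walk the tree and add up regular file sizes; symlinks are
        # skipped so we do not follow them out of the scratch area.
        total = 0
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    if not os.path.islink(path):
                        total += os.path.getsize(path)
                except OSError:
                    continue  # file vanished or unreadable; skip it
        return total / 2**30

    # Placeholder path -- point this at your own scratch directory.
    print("%.1f GiB" % usage_gib("/path/to/your/scratch"))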

We still have a problem with the servers GPU3 and GPU4, which appear to
have dead GPU cards.

Best,
Predrag





> 
> On Sat, Nov 3, 2018, 7:11 PM Predrag Punosevac <predragp at andrew.cmu.edu> wrote:
> 
> > Biswajit Paria <bparia at cs.cmu.edu> wrote:
> >
> > > Hi Predrag,
> > >
> > > I am trying to use GPU 1, and getting an unusual segmentation fault. The
> > > same code that I was running for two days is now throwing a segmentation
> > > fault. Is it possible to restart GPU1? Doesn't look like anyone else is
> > > using it other than me.
> >
> >
> > Sure, if nobody is using it. Are you sure you were using this
> > machine after I rebooted last week? Those library exception errors are
> > typically due to the NVidia 3rd party binary blob drivers, which need to
> > be reinstalled occasionally. I will give it two hours and reboot at the
> > same time I reboot GPU2. If the driver gets broken, it will have to
> > wait until Monday.
> >
> >
> >
> > >
> > > Here is stack trace in case you want to have a look:
> > >
> > > Stack trace returned 10 entries:
> > > [bt] (0)
> > > /zfsauton/home/bparia/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x31f81a) [0x7feebb24f81a]
> > > [bt] (1)
> > > /zfsauton/home/bparia/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x29f33b6) [0x7feebd9233b6]
> > > [bt] (2) /lib64/libpthread.so.0(+0xf680) [0x7fef78319680]
> > > [bt] (3) /lib64/libpthread.so.0(raise+0x2b) [0x7fef7831954b]
> > > [bt] (4) /lib64/libpthread.so.0(+0xf680) [0x7fef78319680]
> > > [bt] (5) /usr/lib64/nvidia/libcuda.so.1(+0xf88d5) [0x7fef304548d5]
> > > [bt] (6) /usr/lib64/nvidia/libcuda.so.1(+0x248914) [0x7fef305a4914]
> > > [bt] (7) /usr/lib64/nvidia/libcuda.so.1(+0x1e4e80) [0x7fef30540e80]
> > > [bt] (8) /lib64/libpthread.so.0(+0x7dd5) [0x7fef78311dd5]
> > > [bt] (9) /lib64/libc.so.6(clone+0x6d) [0x7fef7803bb3d]
> > >
> > >
> > > Thanks in advance!
> > > --
> > > Biswajit Paria
> > > PhD student
> > > Machine Learning Department
> > > Carnegie Mellon University
> >

