Driver/library version mismatch on gpu nodes

Predrag Punosevac predragp at andrew.cmu.edu
Fri Mar 23 14:50:58 EDT 2018


Jay Yoon Lee <jaylee at andrew.cmu.edu> wrote:

> Hi Predrag,
> 
> I am not sure if it's just me or everybody else.
> After the reboot, GPU 1, 4, 8 are working for me,
> but GPU 2, 3, 5, 6, 9 are not working for me.
> 
>  GPU 2, 3, 5, 6, 9 are complaining --> failed to connect to server
> Failed to initialize NVML: Driver/library version mismatch
> 
> Is there anything I need to do on my end?
> (nvidia-smi does not work and I don't think I can do anything on my end.)

It works for me. I just logged into all GPU machines with the exception
of GPU7, and nvidia-smi gave the correct report. I did test things
yesterday, but I didn't want to reply to your e-mail until I had checked
things one more time.
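
One common cause of the "Driver/library version mismatch" message is that
the NVIDIA kernel module still loaded in memory is older than the driver
files installed on disk, for example after a package update without a
reboot. The following is a minimal sketch for checking that on a single
node, not a definitive diagnostic. It assumes a Linux host where the
driver exposes /proc/driver/nvidia/version and where the kernel module is
named "nvidia"; the helper name loaded_nvidia_version is hypothetical.

```shell
#!/bin/sh
# Sketch: compare the loaded NVIDIA kernel module version with the one on disk.
# Assumptions: Linux, module named "nvidia", helper name is hypothetical.

# Print the driver version found on the NVRM line of a version file.
# $1: path to the file (defaults to /proc/driver/nvidia/version).
loaded_nvidia_version() {
    version_file="${1:-/proc/driver/nvidia/version}"
    grep NVRM "$version_file" 2>/dev/null \
        | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)*' \
        | head -n 1
}

# Version of the module currently loaded in the kernel (empty if unknown).
loaded="$(loaded_nvidia_version || true)"
# Version of the module file installed on disk (empty if unknown).
installed="$(modinfo -F version nvidia 2>/dev/null || true)"

if [ -z "$loaded" ] || [ -z "$installed" ]; then
    echo "Could not determine both versions (loaded='$loaded', installed='$installed')"
elif [ "$loaded" = "$installed" ]; then
    echo "Loaded and installed driver versions agree: $loaded"
else
    echo "Mismatch: loaded module $loaded, installed module $installed"
fi
```

If the two versions differ, that is consistent with the node needing a
reboot (or a module reload) rather than a change in the user's account.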

It must be something about your environment variables. Also bear in mind
that there are three different versions of CUDA on most of these GPU
machines.

root@gpu8$ ls -1 | grep cuda
cuda
cuda-8.0
cuda-9.0
cuda-9.1
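
One way to check the environment-variable suspicion is to list which CUDA
directories a login shell actually refers to. This is only a sketch: the
helper name cuda_entries is hypothetical, and /usr/local is an assumed
install location, since the listing above does not say which directory it
was run in.

```shell
#!/bin/sh
# cuda_entries is a hypothetical helper: print the entries of a
# colon-separated list (first argument) that mention "cuda".
cuda_entries() {
    printf '%s\n' "$1" | tr ':' '\n' | grep -i cuda || true
}

echo "PATH entries mentioning cuda:"
cuda_entries "${PATH:-}"

echo "LD_LIBRARY_PATH entries mentioning cuda:"
cuda_entries "${LD_LIBRARY_PATH:-}"

# Assumption: the toolkits live under /usr/local. If the unversioned
# "cuda" entry is a symlink, show which versioned directory it points to.
if [ -e /usr/local/cuda ]; then
    echo "/usr/local/cuda resolves to: $(readlink -f /usr/local/cuda)"
fi
```

Entries that point at a CUDA version other than the one a program was
built against are worth a closer look.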

Predrag


> 
> Thanks,
> Jay-Yoon
> 
> 
> 
> On Thu, Mar 22, 2018 at 3:11 PM, Predrag Punosevac <predragp at andrew.cmu.edu>
> wrote:
> 
> > Jay Yoon Lee <jaylee at andrew.cmu.edu> wrote:
> >
> > > Hi Predrag,
> > >
> > > Thanks for the email & I vote for rebooting gpu3 & 4.
> > >
> > > As far as I know, before it was just gpu2 having a problem and now we have
> > > gpu3, 4 having the same symptoms.
> > >
> > > But, one question: I don't think gpu2 got fixed even after rebooting.
> > > Or is it just me? --> Do I have to reconfigure something?
> >
> > GPU2 has a full file system. I will move MATLAB to a different
> > location to resolve that. GPU2 will also be down at 5 PM
> > for about an hour.
> >
> > Predrag
> >
> > >
> > > I am asking this question to see,
> > > whether I have to do something once gpu3 & 4 are rebooted
> > > since gpu2 reboot didn't seem to work for me.
> > >
> > > Thanks!
> > > Jay-Yoon
> > >
> > > On Thu, Mar 22, 2018 at 1:51 PM, Predrag Punosevac <predragp at andrew.cmu.edu>
> > > wrote:
> > >
> > > > Michael Andrews <mbandrews at cmu.edu> wrote:
> > > >
> > > > > Hi Predrag,
> > > > >
> > > > > There seems to be a driver/library mismatch on some of the gpu nodes
> > > > > (e.g. gpu3, gpu4):
> > > > >
> > > > > $ nvidia-smi
> > > > > Failed to initialize NVML: Driver/library version mismatch
> > > > >
> > > >
> > > > Unfortunately, the machines will have to be rebooted to clear that. I
> > > > will do it today at 5:00 PM.
> > > >
> > > > Predrag
> > > >
> > > > > Could you have a look when you get a chance?
> > > > >
> > > > > Thanks,
> > > > > Michael
> > > >
> >

