Issue with gpu17

Predrag Punosevac predragp at andrew.cmu.edu
Fri Jul 23 14:11:33 EDT 2021


Hi Brian,

This is the second time this particular problem has been reported on
gpu17. The first time around, it took a cold reboot to fix the issue. I
will be happy to cold reboot the server one more time, but I have a bad
feeling that one of the GPU cards on that server is dying and that this is
how it manifests.

The server is still under warranty, IIRC, so I will have to contact Silicon
Mechanics technical support. For now, I can reboot the server one more
time. If this happens again, please don't report it; just use the server
as a CPU node.
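
For anyone unsure what "use the server as a CPU node" means in practice,
here is a minimal sketch for a Python job. It assumes the job uses a
CUDA-based framework that honors the CUDA_VISIBLE_DEVICES environment
variable; the function names are hypothetical, and the check for the
"GPU is lost" text is only a heuristic based on the message quoted below.
The variable has to be set before the framework initializes CUDA.

```python
import os
import subprocess
from typing import Tuple


def run_nvidia_smi() -> Tuple[str, int]:
    """Run nvidia-smi and return (combined output, return code)."""
    try:
        proc = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=30
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # Treat a missing or hung nvidia-smi as "no usable GPU".
        return "", 1
    return proc.stdout + proc.stderr, proc.returncode


def gpu_looks_healthy(smi_output: str, return_code: int) -> bool:
    """Heuristic: clean exit and no 'GPU is lost' message in the output."""
    return return_code == 0 and "GPU is lost" not in smi_output


def force_cpu_only() -> None:
    """Hide all GPUs from CUDA programs started from this process."""
    os.environ["CUDA_VISIBLE_DEVICES"] = ""


if __name__ == "__main__":
    output, code = run_nvidia_smi()
    if not gpu_looks_healthy(output, code):
        force_cpu_only()
        print("GPU unavailable; running on CPU only.")
    # Import the CUDA framework only after this point.
```

The same effect can be had without any code by exporting an empty
CUDA_VISIBLE_DEVICES in the shell before launching the job.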

Best,
Predrag

On Fri, Jul 23, 2021 at 12:55 PM Brian Yang <brianyan at andrew.cmu.edu> wrote:

> Hi Predrag,
>
> When I run nvidia-smi on gpu17, I get the following:
> "Unable to determine the device handle for GPU 0000:3B:00.0: GPU is lost.
> Reboot the system to recover this GPU"
> Wasn't sure if this had already been brought to your attention, but if not
> could you take a look?
>
> Thanks,
> Brian
>