<div dir="ltr">Hi Brian,<div><br></div><div>This is the second time this particular problem has been reported on GPU17. The first time around, it took a cold reboot to fix the issue. I will be happy to cold reboot the server one more time, but I have a bad feeling that one of the GPU cards on that server is dying and that this is how it manifests.</div><div><br></div><div>The server is still under warranty IIRC, so I will have to reach out to Silicon Mechanics technical support. For now, I can reboot the server one more time. If this happens again, please don't report it; just use the server as a CPU node.</div><div><br></div><div>Best,</div><div>Predrag</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jul 23, 2021 at 12:55 PM Brian Yang &lt;<a href="mailto:brianyan@andrew.cmu.edu">brianyan@andrew.cmu.edu</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hi Predrag,<div><br></div><div>When I run nvidia-smi on gpu17, I get the following:<br>"Unable to determine the device handle for GPU 0000:3B:00.0: GPU is lost. Reboot the system to recover this GPU"<br></div><div>Wasn't sure if this had already been brought to your attention, but if not could you take a look?</div><div><br></div><div>Thanks,</div><div>Brian</div></div>
</blockquote></div>