gpu16 and gpu21 on their knees

Predrag Punosevac predragp at andrew.cmu.edu
Tue Sep 1 18:35:42 EDT 2020


GPU16 and GPU21 are on their knees. The available resources are
oversubscribed, and the machines don't even have 128 MB of memory left for
me to ssh in as root.

I could hard reboot GPU16 using IPMI from home. GPU21 is not connected
to the IPMI server due to its current temporary location, so that is not an
option. The machine room is only crewed 8h a day during normal business
hours, so my hands are tied with respect to gpu21.

Yesterday I rebooted gpu20, which had been unresponsive since Friday night
with the very same symptoms.

This is not the way to use 40K servers. This is a serious drain on our
productivity. Please be reasonable in your expectations of what you can get
out of the machines.

Predrag