GPU24 killed, GPU25 /zfsauton/datasets issues
Predrag Punosevac
predragp at andrew.cmu.edu
Wed Nov 9 00:55:16 EST 2022
I used IPMI to power GPU24 off and back on. I am now logged into that node
as well, monitoring usage.
Predrag
On Wed, Nov 9, 2022 at 12:36 AM Predrag Punosevac <predragp at andrew.cmu.edu>
wrote:
> Dear Autonians,
>
> I am noticing a pattern here. A few users (five or fewer) are fighting
> over the four most potent computing nodes in our cluster GPU[24-27]. Those
> few users have managed to chase away everyone else and got into the
> vicious cycle of running jobs too big even for those machines and killing
> all daemons and NFS mounts in the process. I don't know a thing about ML
> but this is not the way to conduct "scientific research".
>
>
> This will have to stop. I am currently logged into GPU[25-27]. GPU24 is
> not reachable even with my root ssh access. The ssh daemon is usually one
> of the very last daemons to be killed by overuse of resources. I will
> remain logged in for a few days and monitor activity. Repeat offenders
> will be reported.
>
> Cheers,
> Predrag
>