GPU24 killed, GPU25 /zfsauton/datasets issues
Predrag Punosevac
predragp at andrew.cmu.edu
Wed Nov 9 00:36:59 EST 2022
Dear Autonians,
I am noticing a pattern here. A few users (five or fewer) are fighting
over the four most potent computing nodes in our cluster, GPU[24-27]. Those
few users have managed to chase away everyone else and have gotten into a
vicious cycle of running jobs too big even for those machines, killing
all daemons and NFS mounts in the process. I don't know a thing about ML,
but this is not the way to conduct "scientific research".
This will have to stop. I am currently logged into GPU[25-27]. GPU24 is
not reachable even with my root ssh access. The ssh daemon is usually one of
the very last daemons to be killed by overuse of resources. I will remain
logged in for a few days and monitor activity. Repeat offenders will be
reported.
Cheers,
Predrag