Somewhat Urgent: lots of GPU nodes not responding

Predrag Punosevac predragp at andrew.cmu.edu
Thu Jan 19 14:52:03 EST 2023


Hi Viraj,

Sorry for the delay. I was attending some NSF calls for proposals. I
did a bit of poking. It looks like a tangled file system to me. I can't get

df -h

to produce any output. That is never a good sign. Autofs works as
expected. I am surprised that more people didn't report this. I am not sure
what to do about it, as a reboot is probably unwarranted. Somebody is
really messing with the file server.
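
For anyone who wants to check a node without getting their shell stuck the
way df -h does, here is a minimal sketch. It assumes a Unix-like node with
Python 3; the function name mount_responds and the 5-second timeout are
made up for illustration and are not part of any lab tooling. It probes each
path in a child process and gives up after the timeout instead of blocking.

```python
import subprocess
import sys


def mount_responds(path, timeout_s=5.0):
    """Return True if statvfs(path) completes within timeout_s seconds."""
    # Run the probe in a child process so a stalled mount cannot block us.
    probe = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.statvfs(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        # Exit code 0 means statvfs returned normally.
        return probe.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort: request a kill but do not wait for the child, since a
        # process blocked on an unreachable server may not exit promptly.
        probe.kill()
        return False


if __name__ == "__main__":
    # Paths come from the command line; "/" is only a placeholder default.
    for mount in sys.argv[1:] or ["/"]:
        status = "ok" if mount_responds(mount) else "NOT RESPONDING"
        print(f"{mount}: {status}")
```

Pass it the mount points you care about as arguments. A path that does not
exist is also reported as not responding, so double-check the spelling
before concluding that a mount is hung.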

Predrag

On Thu, Jan 19, 2023 at 9:42 AM Viraj Mehta <virajm at andrew.cmu.edu> wrote:

> Hi Predrag,
>
> Hope you are well this morning. I was kinda shocked to notice that I can’t
> access GPU nodes 2, 3, 4, 11, 12, 17, and 20. I am not sure what caused this, but I
> was running jobs on some of these machines that all stopped producing
> output around 8:23 last night. These are super critical for the ICML
> deadline next Thursday and I would like to restart them ASAP. I am not
> entirely sure what happened here as I don’t think they are terribly
> write-heavy or anything like that. Please let me know if they can be
> restored to normal function, as I urgently need them.
>
> If I did anything that was responsible for them crashing, please let me
> know as well so I don’t repeat it. I am under some time pressure so am
> running a fairly large number of jobs right now.
>
> Thanks,
> Viraj