Somewhat Urgent: lots of GPU nodes not responding

Predrag Punosevac predragp at andrew.cmu.edu
Thu Jan 19 14:59:36 EST 2023


The good news is that only the machines you listed are involved. The
other machines do not seem to be affected. How sure are you that the
scripts you and Ian were running don't involve heavy reads and writes?

Predrag

On Thu, Jan 19, 2023 at 2:52 PM Predrag Punosevac <predragp at andrew.cmu.edu>
wrote:

> Hi Viraj,
>
> Sorry for the delay. I was attending to some NSF calls for proposals. I
> did a bit of poking. It looks like a tangled file system to me. I can't get
>
> df -h
>
> to produce any output. That is never a good sign. Autofs works as
> expected. I am surprised that more people didn't report this. I am not
> sure what to do about it, as a reboot is probably unwarranted. Somebody
> is really messing with the file server.
>
> Predrag
>
> On Thu, Jan 19, 2023 at 9:42 AM Viraj Mehta <virajm at andrew.cmu.edu> wrote:
>
>> Hi Predrag,
>>
>> Hope you are well this morning. I was kinda shocked to notice that I
>> can’t access GPU nodes 2,3,4,11,12,17,20.  I am not sure what caused this
>> but I was running jobs on some of these machines that all stopped producing
>> output around 8:23 last night. These are super critical for the ICML
>> deadline next Thursday and I would like to restart them ASAP. I am not
>> entirely sure what happened here as I don’t think they are terribly
>> write-heavy or anything like that. Please let me know if they are able to
>> be restored to normal function as I urgently need them.
>>
>> If I did anything that was responsible for them crashing, please let me
>> know as well so I don’t repeat it. I am under some time pressure so am
>> running a fairly large number of jobs right now.
>>
>> Thanks,
>> Viraj
>
>
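
For anyone who wants to check a node for the symptom described above (df -h
never returning), here is a minimal sketch, not something that was run in this
thread. It asks a child process for the filesystem statistics of each path,
which is the same kind of query df makes, and gives up after a timeout. The
function names, the 5-second default, and the example paths are all
assumptions made for illustration.

```python
import subprocess
import sys

# Child command: query filesystem statistics for one path, as df would.
_PROBE = "import os, sys; os.statvfs(sys.argv[1])"


def mount_responds(path, timeout_s=5.0):
    """Return True if a filesystem query on `path` succeeds within timeout_s.

    Returns False if the query times out (likely a hung mount) or fails
    for any other reason, such as the path not existing.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", _PROBE, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # A process blocked on a hung mount may not react to signals,
        # so request the kill but do not wait for it to exit.
        proc.kill()
        return False


def find_unresponsive(paths, timeout_s=5.0):
    """Return the subset of `paths` whose filesystem query did not succeed."""
    return [p for p in paths if not mount_responds(p, timeout_s)]


if __name__ == "__main__":
    # Hypothetical mount points; replace with the ones used on your node.
    for p in find_unresponsive(["/", "/home"]):
        print("no response from", p)
```

The probe runs in a separate process so that a stuck mount blocks only the
child and the calling script can still report a result.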