Network disruption
Predrag Punosevac
predragp at andrew.cmu.edu
Thu Oct 26 19:55:33 EDT 2023
GPU[1-9] and Denver are not available. I could not reach them over IPMI,
which meant either that rack A1-2A had no power or that the switch had died. I
just called the OPS guys. The machines are actually up. The switch was powered
on but not flashing. The OPS guys rebooted the switch, but to no avail. Piotr
will have to replace the switch tomorrow morning. We have one in the
storage room ready for a situation like this.
On a related note, I updated the Xen host and restarted all virtual machines.
LOP2 should now be available, as well as a bunch of other stuff including
Observium. Everything else looks OK to me, but please ping Piotr and me with
any issues. I need to get back to much bigger problems in my current lab :-)
Best,
Predrag
On Thu, Oct 26, 2023 at 6:01 PM Predrag Punosevac <predragp at andrew.cmu.edu>
wrote:
> Kudos to Piotr! Everything is up and running now. I managed to patch all
> perimeter firewalls and network service machines during the outage. If you
> use OpenVPN on your desktop you will need to restart the daemon. If you
> don't know how, just reboot the machine.
>
> Predrag
>
> On Thu, Oct 26, 2023 at 4:20 PM Predrag Punosevac <predragp at andrew.cmu.edu>
> wrote:
>
>> Hi Piotr,
>>
>> Two of our newer perimeter firewalls (Phobos/Deimos) have dead CMOS
>> batteries. The power outage was too long. UPS batteries ran out of juice.
>> We have the same problem we had in May of this year. You need to go to the
>> server room, physically attach monitors to these machines, and reset the boot
>> order in UEFI. There is nothing I can do from New Mexico. All traffic
>> goes through these machines. They are only a few years old but they were
>> apparently shipped with bad CMOS batteries.
>>
>> Best,
>> Predrag
>>
>>
>>
>> On Thu, Oct 26, 2023 at 3:28 PM Piotr Bartosiewicz <
>> pbartosi at andrew.cmu.edu> wrote:
>>
>>> Update:
>>>
>>> https://computing.cs.cmu.edu/dashboard/2023/scs-wean-machine-room-alert-10-26-2023
>>>
>>> There was a power outage.
>>>
>>> Piotr.
>>>
>>>
>>> On Thu, Oct 26, 2023 at 3:25 PM Piotr Bartosiewicz <
>>> pbartosi at andrew.cmu.edu> wrote:
>>>
>>>> Looks like there is a network problem at SCS level.
>>>> We're looking into it.
>>>>
>>>> Piotr.
>>>>
>>>>