Late this afternoon our core router in Soda Hall had a hard failure on one of the management switch modules. When the router tried to fail over to the backup module, it also exhibited a failure. I have rebooted the first module and network service has been restored. However, there may still be a problem and we are addressing this now with the vendor. Further outages may occur without warning until we get this resolved. I regret the inconvenience this may cause.
UPDATE
[2006-04-12 10:15:58 | Fred Archibald, IDSG Network Architect]
Last night we replaced the failing management module on our core network router in Soda Hall. The network has been stable since about 18:45 last evening. As a precaution, we will be replacing the other management module early Friday morning. There will be a brief network outage from 07:00-07:30 when this is done.
Additionally, we saw a software failure on the core router in Cory Hall about 16:15 yesterday. That box was rebooted and has been stable since as well. However, we are continuing to investigate this incident with the vendor.
Let’s all think positive thoughts for today.
Fred
UPDATE
[2006-04-13 09:33:23 | Fred Archibald, IDSG Network Architect]
It seems that the Soda core router’s suspect management module did indeed have a problem and failed again last evening (4/12/06). Network staff replaced that module, so two modules have now been replaced in the Soda core. I will continue some tests tomorrow, 4/14/06, during the scheduled network downtime at 07:00–07:30 to verify that the core will recover properly from a management module failure.
Additionally, the crash we saw in Cory on Tuesday, 4/11/06, was apparently caused by a known software bug in the router code. We will be enabling some features on the router as a preventive measure until the patch to the code is available at the end of April. When the patch is available, we will be upgrading the network code and I will schedule a network downtime.