Late this afternoon our core router in Soda Hall had a hard failure on one of the management switch modules. When the router tried to fail over to the backup module, it also exhibited a failure. I have rebooted the first module and network service has been restored. However, there may still be a problem and we are addressing this now with the vendor. Further outages may occur without warning until we get this resolved. I regret the inconvenience this may cause.
[2006-04-12 10:15:58 | Fred Archibald, IDSG Network Architect]
Last night we replaced the failing management module on our core network router in Soda Hall. The network has been stable since about 18:45 last evening. As a precaution, we will be replacing the other management module early Friday morning. There will be a brief network outage from 07:00-07:30 when this is done.
Additionally, we saw a software failure on the core router in Cory Hall about 16:15 yesterday. That box was rebooted and has been stable since as well. However, we are continuing to investigate this incident with the vendor.
Let’s all think positive thoughts for today.
[2006-04-13 09:33:23 | Fred Archibald, IDSG Network Architect]
It seems that the Soda core router’s suspect management module did indeed have a problem and failed again last evening (04/12/06). Network staff replaced that module, so we now have two modules that have been replaced in the Soda core. I will continue some tests tomorrow, 4/14/06, during the scheduled network downtime at 07:00–07:30 to verify that the core will recover properly from a management module failure.
Additionally, the crash we saw in Cory on Tuesday, 4/11/06, was apparently caused by a known software bug in the router code. We will be enabling some features on the router as a preventive measure until the patch to the code is available at the end of April. When the patch is available, we will be upgrading the network code and I will schedule a network downtime.