While network staff were following up on error messages generated by the primary 10Gb line card in the Cory Hall router, some wired subnets in Cory were temporarily isolated from the rest of the network and the Internet. The outage, from 4:20 to 4:30pm, occurred when the suspect line card failed to resume forwarding traffic after running diagnostics. Network staff were immediately aware of the disruption and quickly reestablished basic connectivity on a different port.
While the symptoms have been mitigated, the root cause of the outage has not been fully diagnosed and the network is running without full redundancy. Network personnel are planning follow-up work and are actively monitoring the problem in case it recurs.
UPDATE
[2009-04-24 12:39:06 | Mike Howard]
Connectivity was restored after resetting both of the redundant links between the switch that hosts the DNS servers and the Sutardja Dai Hall router. The root cause of the problem is unknown; we will investigate further.
UPDATE
[2009-04-27 13:20:12 | Mike Howard]
One of the errors we perceived as network problems on Friday was the result of a latent misconfiguration introduced in March on cronus.cs, one of EECS’ DNS servers.
An errant netmask prevented connectivity to cronus from any subnet of 169.229.0.0/16 to which cronus wasn’t directly connected. Because the EECS DNS servers have interfaces on all major production subnets and aren’t used from outside EECS, the problem affected very few clients and went mostly unnoticed.
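One plausible way a netmask error produces this symptom is a mask that is too broad: the server then treats remote campus addresses as directly attached and never sends replies to its gateway. The sketch below illustrates that reading only. The notice does not state the actual mask, so the /16 and /24 prefixes, the host addresses, and the helper name reply_path are all hypothetical.

```python
import ipaddress


def reply_path(interface_cidr: str, client_ip: str) -> str:
    """Return how a host with this interface config would route a reply.

    "on-link" means the host believes the client is on its own segment and
    tries to reach it directly; "via-gateway" means it hands the packet to
    its router.
    """
    iface = ipaddress.ip_interface(interface_cidr)
    client = ipaddress.ip_address(client_ip)
    return "on-link" if client in iface.network else "via-gateway"


# Hypothetical addresses, chosen only to fall inside 169.229.0.0/16.
intended = "169.229.60.10/24"     # assumed correct configuration
errant = "169.229.60.10/16"       # assumed misconfiguration (mask too broad)
remote_client = "169.229.63.5"    # campus subnet the server is not attached to
local_client = "169.229.60.77"    # same segment as the server

for label, cfg in (("intended", intended), ("errant", errant)):
    print(label, "remote:", reply_path(cfg, remote_client))
    print(label, "local: ", reply_path(cfg, local_client))
```

Under these assumptions, only the errant configuration classifies the remote client as on-link, so replies to it would never reach a router. Clients on the server's own segment are unaffected, which fits the observation that few clients noticed.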
The problem was noticed during diagnosis of connectivity problems reported in Sutardja Dai on Friday. This bug is unrelated to Friday's outage, which network staff are still investigating.
Resolved as of 2009-04-27 12:00:00