EECS Network Border Malfunctioning, 2/15/18

We are currently experiencing an issue at the EECS network border which is causing intermittent outages. The connection affected is between EECS and the outside world.

This means that communication within the EECS network should be unaffected, but connectivity to campus or the Internet is intermittently unavailable.

Impact to the research computing cluster in Soda Hall is currently undetermined.

Network staff are coming onsite to better diagnose the issue, and an update will be posted here when we have more detail or an ETA.

Update 11:54 AM:

We are troubleshooting an issue where OSPF control traffic is intermittently dropped between our core and edge routers, causing all data traffic to stop.

Traffic is affected between:
* EECS network and campus/Internet
* EECS network and Soda research compute network

Traffic is not affected:
* within the EECS network
* between Soda research compute and campus/Internet (except during certain troubleshooting procedures)

UPDATE

[2018-02-15 12:31:18 | Derek Calderon]

We are temporarily disabling network uplinks to the outside world in order to troubleshoot without impacting network security.

We need to determine whether the network firewall is implicated in this issue before engaging vendor support. In order to do this without making the network vulnerable from a security perspective, we must disable the links to campus and the Internet while this work is performed.

During this time no traffic will be able to enter or leave the department. We expect this will last for up to 30 minutes.

UPDATE

[2018-02-15 21:01:13 | Derek Calderon]

We have engaged vendor support and are working with level 3 engineers to pin down the issue. There is no estimated time to repair, and I am not hopeful that this will be resolved before tomorrow morning.

Update 2/16 1:13 AM:

After several escalations with our vendor, they have been able to identify the behavior we are seeing and have collected copious logs and debug output to look over. Troubleshooting will pause now for the night and resume tomorrow morning.

UPDATE

[2018-02-16 09:56:45 | Derek Calderon]

This issue is ongoing and we are currently working with the router manufacturer’s engineers to determine the root cause.

UPDATE

[2018-02-16 17:03:06 | Derek Calderon]

We have narrowed the root cause to a specific network and are working with local sysadmins to positively identify the source of the disruption.

Most networks are stable at this time; however, further interruptions are expected as we continue to troubleshoot.

UPDATE

[2018-02-16 17:43:53 | Derek Calderon]

We believe the network has been fully restored at this time, and will continue to monitor throughout the weekend.

We identified a number of end-user devices that appear to have caused the instability. The mechanism by which this occurred is uncertain, but we are reasonably confident these devices are the culprit. They have been removed from the network and the routing issues have subsided.

Please contact help@eecs.berkeley.edu if you are still noticing network issues in your area.

Resolved as of 2018-02-16 17:30:00