The department’s web server (www.eecs & www.cs), mailing list server (lists.eecs), and Jabber server (jabber.eecs) all run on a Xen virtual cluster. After last night’s network failure was resolved around 2:30am, the Xen cluster failed to recover gracefully, and these services had to be manually restarted this morning. Web and mailing list services were restored around 9:20am, and Jabber was restored around 9:45am.
Staff are investigating why this occurred and how to prevent it from happening again.
The EECS Department website, FTP server, Windows Terminal Server (winterm), the IRIS website, Jabber, and Sympa (mailing list) services will all experience brief periods of unavailability beginning at 7:00am on Wednesday, February 25, 2009, possibly lasting until 8:00am.
The servers hosting these services will be changing IP addresses during this time as their current subnet (188.8.131.52/24) is being repurposed. DNS will be updated at 7:00am as well, so you shouldn’t need to make any changes to client applications unless you’ve been accessing any of these services directly by their IP rather than by their DNS name.
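If you are unsure whether a client is configured with a raw IP address rather than a DNS name, a quick check along the following lines may help. This is only a minimal sketch: the `configured_hosts` list and the `192.0.2.10` address are hypothetical placeholders (the latter is from a range reserved for documentation), not actual IRIS settings or addresses.

```python
import ipaddress


def is_literal_ip(host: str) -> bool:
    """Return True if host is a literal IPv4/IPv6 address, not a DNS name."""
    try:
        # Strip brackets so that "[2001:db8::1]"-style IPv6 literals also parse.
        ipaddress.ip_address(host.strip("[]"))
        return True
    except ValueError:
        return False


# Hypothetical client settings; replace with the hosts your applications use.
configured_hosts = ["lists.eecs.berkeley.edu", "192.0.2.10"]

# Entries listed here bypass DNS and would need to be changed by hand.
needs_update = [h for h in configured_hosts if is_literal_ip(h)]
print(needs_update)
```

Any entry the check flags should be replaced with the service's DNS name so that it follows the address change automatically.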
The Department’s mailing list server, lists.eecs.berkeley.edu, will be shut down, upgraded, and restarted starting at 6pm tonight. Mail sent to department mailing lists will be delayed during the outage, which is not expected to last longer than 15 minutes. The lists.eecs web interface will also be unavailable during this time.
A minor upgrade to the mailing list software was recently released that fixes a bug introduced in the previous version. Some of our mailing lists were affected by this bug, so we are upgrading on short notice.
The upgrade has been applied to a test server with no issues, and we expect the total downtime to be less than 5 minutes.
The Department’s mailing list server, lists.eecs.berkeley.edu, will be offline beginning at 10pm for maintenance. The mailing list software will be upgraded to Sympa 5.4.4. This is a minor upgrade which consists of numerous bug fixes.
Service should resume no later than 11pm, probably much sooner.
During the outage, any mail sent to a department mailing list will be held for delivery. All mail will be delivered when the maintenance is completed.
The EECS website, FTP server, Jabber server and Sympa mailing-list server were offline for about eight hours on Sunday, Dec 28, 2008 beginning at 2:48 p.m.
The virtualization cluster that hosts these services experienced a failure when the network switch that supplies its management network was rebooted during the [scheduled network maintenance](https://iris.eecs.berkeley.edu/news/2246-eecs-network-maintenance-sunday-dec). We didn’t notice the failure until this evening, but were able to restore service quickly once we realized what was wrong.
I apologize for the length of the downtime. I plan to make the cluster more robust and to improve our notification and monitoring so that any future issues are noticed more quickly.
The department website, Jabber and Sympa services were briefly interrupted this morning due to an issue with the cluster that provides them.
Each service was unavailable for a few minutes between 10:00 a.m. and 10:30 a.m. as the server cluster they run on was restarted. Restarting the cluster was necessary to alleviate a deadlock in the clustered file system used by the services.
My apologies for any inconvenience this has caused. I’m looking into how to avoid this situation in the future and bring more stability to the cluster.
Due to necessary maintenance on SAN hardware, the department’s website, FTP server, and mailing lists experienced brief outages this evening beginning around 6:24 p.m.
The website and FTP server were only offline for about six minutes, but mail sent to lists handled by lists.eecs.berkeley.edu (Sympa) during the outage may have been delayed by up to 45 minutes. I apologize for any inconvenience this may have caused.
The server hosting the department web and FTP services, along with the one hosting the database used by Sympa, went offline when they unintentionally lost their connection to the SAN during maintenance. The maintenance was last-minute and shouldn’t have caused any disruption, but it didn’t quite go as planned.
Hardware has arrived that will allow us to make redundant connections to the SAN for these services, which will make them more resilient in future situations like this one. We plan to have the hardware installed and in use soon.