This post is meant less for the general reader than for those with some IT expertise and for anyone who wishes to know why the site was slow or unreachable on Friday and Saturday. I apologize for the slowdown. Readers in European and Australian time zones were especially affected.
The cause of the slowdown was that the owners of the server, Mediatemple.net, had a latency issue at their data center starting around 7:30 p.m. on Friday, Nov. 5, when a module in a router failed. It took them some time to track down the issue.
The module was fixed by 10 p.m. on Saturday the 6th. Once it was up and running again, latency returned to normal.
Technical support called it a “rare issue.” They even went so far as to issue a comment, appended below. That was nice of them.
Mr. Beckow and viewers,
We are extremely sorry for the events that took place this weekend. As you said, it was definitely a “rare event” and took us some time to find and fix. We are constantly trying to improve our services and products on a daily basis for our customers. If you are currently still having problems, feel free to contact us at anytime so that we may assist you. Once again, we are very sorry for what occurred this weekend and hope to bring you and your customers no further problems.
I hope it is rare because a server going down for more than 24 hours is definitely undesirable from our point of view.
However, if the incident is repeated, there is nothing we can do to help you reach the site. I suggest that you stop trying to access it that day and come back the next.
If you depend on getting reports from the site each day, I’d suggest subscribing to the Daily Digest through “Email Updates,” the button you’ll find in the right-hand corner of the homepage. That way you are not dependent on a site visit.
As a result of this incident, other issues have opened up for us and we may need to relocate the site to another server.
Now more information on the incident itself. There is no need for the general reader to continue past here. I mean this report more for the IT-oriented reader.
The incident is reported on the status page of the host company, Mediatemple. The general URL of the status page is http://status.mediatemple.net/weblog/category/system-incidents/
Here are the specific reports on the issue in reverse chronological order.
INC #1650 – Network Latency *Resolved*
Saturday, November 6th, 2010 at 10:38 pm
At this time, Mediatemple Engineers have determined that network latency has returned to normal.
Saturday, November 6th, 2010 at 8:59 am
As of roughly 8:00 AM PDT, our network provider was able to reduce the latency on the network. We are going to be keeping an eye on this for a few more hours to ensure that it does not recur. A full summary will also be posted once we feel that it is safe to close this incident.
Status Update – Affected Customers
Saturday, November 6th, 2010 at 7:23 am
Thank you for your continued patience with this matter, as well as assisting us by providing us with the traceroutes previously requested. We would like to take a brief moment to let you know of the people that might be affected by this latency. If you are on the following services, then you may notice some additional latency when viewing your sites:
(gs) Grid-Service Cluster.03 through Cluster.07,
(dv) Dedicated-Virtual Servers on hostservers vz1 through vz299,
(dpv) Nitros on hostservers nitro1 through nitro200.
Saturday, November 6th, 2010 at 5:49 am
At this time Mediatemple Engineers have determined that sufficient information has been gathered to continue investigating this issue. If anything should change we will update you here on our System Status page. We would like to give thanks to everyone who submitted a traceroute in this effort to find a solution to this issue.
Saturday, November 6th, 2010 at 3:34 am
We are currently gathering additional information to help resolve this issue. If you are experiencing high latency on your service please provide us with a traceroute via Support Request.
Core Router Module Needs Replacing
Friday, November 5th, 2010 at 11:30 pm
At this time, Mediatemple Engineers have found the core router causing the issue with network latency at the Data Center. Currently, the issue is partially resolved. The uplink has been moved to the backup core router; however, the module on the primary core router will need to be replaced. Upon replacing the module, the uplink will be returned to the primary core router. The time of unavailability during the change in uplink is estimated to be 5-10 seconds. We value the continued patience from our customers affected by this incident.
Friday, November 5th, 2010 at 9:52 pm
(mt) Engineers continue working on the stability of the Data Center network latency. The issue has been isolated to a router directly connected or close to the servers which are affected. Further troubleshooting is underway to pinpoint the root cause of the issue and replace any faulty network hardware.
INC #1650 – Investigating Reports of Network Latency
Friday, November 5th, 2010 at 7:41 pm
At 7:30 PM we began receiving reports of network latency over several protocols. Mediatemple Engineers and Data Center staff are currently investigating the cause and working towards a resolution.
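For the IT-oriented reader, the length of the incident can be worked out from the timestamps in the status entries above. What follows is a minimal Python sketch, not anything supplied by Mediatemple. The constant and function names are hypothetical, and it assumes that all timestamps are in the same time zone and that each entry's posting time is close to the event it describes.

```python
from datetime import datetime

# Timestamps copied from the status entries above (November 2010).
# Assumption: all times share one time zone, so naive datetimes are enough.
FIRST_REPORTS = datetime(2010, 11, 5, 19, 30)    # "At 7:30PM we began receiving reports"
BACKUP_ROUTER = datetime(2010, 11, 5, 23, 30)    # entry on moving the uplink to the backup router
LATENCY_REDUCED = datetime(2010, 11, 6, 8, 0)    # "As of roughly 8:00 AM PDT"
RESOLVED = datetime(2010, 11, 6, 22, 38)         # entry saying latency returned to normal


def hours_minutes(start, end):
    """Return the elapsed time between two datetimes as (hours, minutes)."""
    total_minutes = int((end - start).total_seconds() // 60)
    return divmod(total_minutes, 60)


def describe(label, start, end):
    """Format one milestone as a line of text."""
    hours, minutes = hours_minutes(start, end)
    return f"{label}: {hours} h {minutes:02d} min"


if __name__ == "__main__":
    print(describe("Until uplink moved to backup router", FIRST_REPORTS, BACKUP_ROUTER))
    print(describe("Until latency was reduced", FIRST_REPORTS, LATENCY_REDUCED))
    print(describe("Until incident was resolved", FIRST_REPORTS, RESOLVED))
```

Measured from the first reports at 7:30 p.m. on Friday to the final notice at 10:38 p.m. on Saturday, the incident lasted 27 hours and 8 minutes, which is consistent with the "more than 24 hours" mentioned earlier.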