Last night’s outage: an explanation

Last night, starting just after 1800 PT, we experienced an outage on our small object platform, the platform responsible for the acceleration and delivery of many of the Internet’s busiest sites. The outage lasted about 45 minutes, and, depending on customer server configurations, end users may have seen its effects for slightly less or slightly more time than that.

The outage occurred as a result of a change we made to a configuration in our core caching platform (Sailfish). It’s important to note that this was not a security issue, or even a bug, but a simple human error that was not flagged by our standard staging and testing procedures.

We made the mistake when we were pushing out a change in response to a request made by one of our customers. To give a bit of background, our customer portal supports a wide range of configurations, but we can support an even wider range of customer requests by making changes directly to the configuration language of our core caching platform. In this case, the customer request for a wildcard CNAME setup required that type of change, which was handled by one of our engineers via our standard internal change process.
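
For readers curious what such a change looks like conceptually, here is a simplified sketch. Sailfish’s configuration language is internal to our platform, so the Python model below is only a stand-in: the rule table, hostnames, and the `route_for_host` helper are all invented for illustration. It treats a wildcard rule as a hostname pattern that sends any subdomain of one customer’s domain to that customer’s origin.

```python
import fnmatch

# Hypothetical, simplified model of an edge routing table: each entry maps a
# hostname pattern to the customer origin that should serve matching requests.
# (Names and syntax are invented for this post, not our actual configuration.)
ROUTING_RULES = [
    ("www.customer-a.example", "origin-a.internal"),
    ("cdn.customer-b.example", "origin-b.internal"),
    ("*.customer-c.example",   "origin-c.internal"),   # the new wildcard rule
]

def route_for_host(host):
    """Return the origin for the first pattern that matches the hostname."""
    for pattern, origin in ROUTING_RULES:
        if fnmatch.fnmatch(host, pattern):
            return origin
    return None

# Any subdomain of customer-c.example now resolves to customer C's origin,
# while the other customers' hostnames are unaffected.
print(route_for_host("img.customer-c.example"))   # origin-c.internal
print(route_for_host("www.customer-a.example"))   # origin-a.internal
```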

The customer made the request Monday afternoon, and we completed the configuration change and pushed it to our staging and testing network late Monday evening. During the day Tuesday, our testing network reported no flaws and confirmed that the change fulfilled the customer’s request, so we pushed it out to our production network at about 1750 PT that evening. In the few minutes after that, our edge servers began syncing with the new configuration and incorporating it into their operation.

Unfortunately, the error in that configuration change created a cascading logic problem. While the change was syntactically valid, a misplaced character caused one of its logic statements to apply far too broadly. When released into the broader network, it caused CNAME requests for many of our customers to be misdirected to the customer we had made the configuration change for. That customer’s servers were overwhelmed by the misdirected traffic (and, since the requests weren’t intended for them, they couldn’t have fulfilled them anyway), and the result was what many web users experienced last night: gateway timeouts and other connection errors.
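
To make the failure mode concrete, here is a hypothetical illustration of how a single misplaced character can widen a match. The real error was in Sailfish’s own configuration syntax and differed in its details; the regular expressions and hostnames below are invented stand-ins.

```python
import re

# Intended (hypothetical) rule: route img.customer-c.example and
# static.customer-c.example to customer C's origin.
intended = re.compile(r"^(img|static)\.customer-c\.example$")

# The same rule with one character out of place: a closing parenthesis that
# slid to the end. The first alternative is now just "img", with no domain
# suffix required, so hostnames belonging to other customers match it too.
too_broad = re.compile(r"^(img|static\.customer-c\.example$)")

host = "img.customer-a.example"        # another customer's hostname
print(bool(intended.search(host)))     # False: correctly scoped
print(bool(too_broad.search(host)))    # True:  misdirected to customer C
```

In our case the effect was the same in kind: requests that should have been routed to many different customers matched the changed rule and were sent to a single customer’s origin instead.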

We noticed the problem almost instantly. Here is how the response unfolded (all times PT):

- 1806: The first alert fired in our Network Operations Center, indicating trouble with some of our European nodes.
- Within 90 seconds: With the help of monitoring tools we use from our friends at Catchpoint, we confirmed the issue and saw that it was also beginning to affect some of our nodes in New York. In the moments that followed, we saw it cascading to other POPs.
- 1815: We escalated the issue to our engineering team and opened an emergency phone bridge involving our chief architect and several of our core and dev engineers.
- 1820: We isolated the issue to our small object platform and began a series of tests based on recent network changes.
- 1829: We narrowed the issue down to three changes that were potentially causing the problem.
- 1832: We began re-testing and re-reviewing each of those changes.
- 1835: We found the offending change, and by 1838 we had written a fix to the configuration.
- 1839: We began pushing the corrected configuration to our edge servers, a process that took about two minutes.
- Roughly 1848: All of our edge servers had finished syncing with the new configuration, and over the next one to two minutes our monitoring tools showed traffic beginning to return to normal levels.
- 1854: Everything looked normal again, and we issued a (guarded) resolution alert to our customers while we continued monitoring throughout the night (as we always do).

While this was a logic error and not a bug, our friends at Twitter explained the “cascading” effect very well in a recent post and we’ll borrow from their excellent explanation:

A ‘cascading bug’ is a bug with an effect that isn’t confined to a particular software element, but rather its effect ‘cascades’ into other elements as well. One of the characteristics of such a bug is that it can have a significant impact on all users worldwide…

Over the past year, in the face of our explosive growth, we’ve been implementing new QA and testing procedures, but we need to do more. One example is increasing the aggressiveness of our “negative testing.” Every change we make to our platform is first rolled out to a test network and exposed to both “positive testing” (i.e. does this do what we wanted it to do?) and “negative testing” (i.e. does this do anything bad that we don’t want or didn’t expect?). Last night’s outage was a failure of our “negative testing” process, and it exposed a flaw in our testing environment that we’ve already begun fixing.
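
As a sketch of what those two kinds of checks mean in practice (again using the invented routing model from the earlier example, not our actual test suite), a positive test verifies that the new wildcard rule does what the customer asked for, while a negative test replays other customers’ hostnames and verifies that none of them are suddenly routed to the changed customer’s origin.

```python
import fnmatch

# Hypothetical routing table after the change (illustration only).
ROUTING_RULES = [
    ("www.customer-a.example", "origin-a.internal"),
    ("*.customer-c.example",   "origin-c.internal"),   # the newly added wildcard rule
]

def route_for_host(host):
    """Return the origin for the first pattern that matches the hostname."""
    for pattern, origin in ROUTING_RULES:
        if fnmatch.fnmatch(host, pattern):
            return origin
    return None

def test_positive():
    # "Does this do what we wanted it to do?"
    assert route_for_host("img.customer-c.example") == "origin-c.internal"

def test_negative():
    # "Does this do anything bad that we don't want or didn't expect?"
    # Hostnames that belong to other customers must keep routing to their
    # own origins, not to the customer whose configuration just changed.
    assert route_for_host("www.customer-a.example") == "origin-a.internal"

test_positive()
test_negative()
print("positive and negative checks passed")
```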

We now carry more than 4% of the world’s Internet traffic, and we take that responsibility extremely seriously. Not only do thousands of customers depend on us, but literally hundreds of millions of web users do, too. When we fall down, it affects everybody.

While every service provider experiences outages, each one should be treated as an opportunity for improvement. We’ve already learned from this one and have begun putting those lessons into practice.

We’re sorry we let this get past us, and we are determined not to let it happen again.

Alex Kazerani
Chairman & CEO 
