Over a hundred thousand websites were out of commission for about four hours this week due to an Amazon outage. Quora, IFTTT, Autodesk, and Nest were just a few of the higher-profile sites affected. Ironically, the popular availability checking service IsItDownRightNow.com was also affected. It’s frustrating for many (to say the least), and it no doubt had a significant negative financial impact on many of the businesses that rely upon those servers. However, hearing the reason for the outage should have a lot of the AppDetails community nodding their heads in understanding and sharing feelings of compassion and empathy for those who caused the problem: administrators were intentionally taking a few servers offline and accidentally took down more than intended.
As the Verge reports: On Tuesday morning, members of the S3 team were debugging the billing system. As part of that, the team needed to take a small number of servers offline. “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended,” Amazon said. “The servers that were inadvertently removed supported two other S3 subsystems.” The subsystems were important. One of them “manages the metadata and location information of all S3 objects in the region,” Amazon said. Without it, services that depend on it couldn’t perform basic data retrieval and storage tasks.
If critical servers really are being managed from the command line as suggested above, where a typo could be an excuse, I’d say these guys are in serious need of a real systems management solution (but I suspect the explanation is just being intentionally oversimplified). I’ve heard horror stories so many times of IT administrators accidentally deploying an application task to “all computers” and inadvertently rebooting critical systems. People lose their jobs over this kind of thing. I’ve also been on the product side of these situations, where the product is blamed. After all, should a management solution make it easy for users to hurt themselves? The problem, of course, is that no matter how strong a warning is displayed, it is in the nature of many to quickly dismiss such messages (often without reading them), especially if such messages are seen frequently, as with each deployment.
As with any problem, there are certainly solutions. In this case, I would say that systems management solutions can implement ways to minimize inadvertent damage. For one, don’t warn about everything; be smart about it. Even without the smarts to look at a task and see if it is potentially damaging, simply targeting more than a certain threshold of systems would make for a good trigger. Trigger what? How about a message that requires a typed response? When deleting something important as a consumer, I’m sometimes asked to type the word DELETE into a text box, which forces me to think for an extra moment about what I’m doing. Such prompts should be used sparingly and, even better, be optional. But it is clearly a problem for which product and user may share some blame.
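To make the threshold idea concrete, here is a minimal Python sketch of how such a guard might look. It is purely illustrative and not taken from any real product: the function names, the default threshold of 25 systems, and the "DEPLOY <count>" phrase are all assumptions made up for this example.

```python
# Hypothetical sketch: every name and default here is an illustrative assumption.

def needs_typed_confirmation(target_count, threshold=25):
    """Return True when a task targets more systems than the threshold."""
    return target_count > threshold


def confirm_deployment(targets, threshold=25, read_input=input):
    """Return True if the deployment should proceed.

    Small deployments pass without any prompt, so the warning stays rare.
    Large ones require the administrator to type a phrase that includes
    the number of targeted systems.
    """
    count = len(targets)
    if not needs_typed_confirmation(count, threshold):
        return True  # below the threshold: no prompt at all

    expected = f"DEPLOY {count}"
    typed = read_input(
        f"This task targets {count} systems. Type '{expected}' to continue: "
    )
    # Only an exact match counts; a reflexive "y" or Enter cancels the task.
    return typed.strip() == expected
```

Putting the system count into the required phrase means the administrator has to actually read how many machines are targeted, and because the prompt only appears above the threshold, it is less likely to become one more message that gets dismissed out of habit. The `read_input` parameter exists only so the prompt can be swapped out in tests or in a graphical front end.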
A secondary lesson: test your recovery procedures. It turns out that spinning up all those servers, with the ordering and checking of steps along the way, took much longer than anticipated. Amazon has stated that “S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected.”