At the end of February, an error affecting S3 availability in Amazon's US-EAST-1 region kicked off a chain of events that disrupted over 150,000 sites and affected countless businesses. So what happened, and why should we be concerned?
What’s in a Typo? Amazon’s Bad Day at the Office.
On the morning of February 28th, as part of a debugging exercise to alleviate performance issues in Amazon's S3 billing system, an Amazon engineer was tasked with executing a command to remove a small number of servers from an S3 subsystem.
Unfortunately for over 150,000 web services and millions of users who depend on these services, a small error in the execution of this command resulted in a cascading set of events that brought S3 services in its Virginia Data Centre (or US-EAST-1 region) to a not-so-graceful halt.
The final result of the chain reaction was that a variety of services, including EC2, became unavailable. For those affected, web services, websites and client applications were inaccessible for periods of up to 5 hours. Embarrassingly, AWS's own Service Health Dashboard showed all services as green because it couldn't 'see' the problem – a direct consequence of the outage. And of course, the irony of the resultant inaccessibility of outage-tracking sites such as DownDetector and isitdownrightnow.com wasn't lost on anyone!
The thing is, although Amazon have promised us that this particular issue has been dealt with and can never happen again (and we have to believe them), it is a certainty that something else will. Despite all their inbuilt redundancies and security measures, to err is human, and another blunder will lead to another outage. And of course it's not just Amazon – fans were quick to point out that Microsoft suffered a similar outage in March, but they seem to have got through their latest blip with relatively little commotion.
Amazon headline an S3 durability figure of 99.999999999% and an availability SLA of up to 99.99%. The reality is that 'up to' means very little. Digging into the SLA gives the real figures: 99.99% availability permits only about 52 minutes of downtime in a given year, and this outage alone lasted around 5 hours. I think we've blown that budget for a little while.
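The downtime budget behind those figures is simple arithmetic. A quick sketch in Python (the function name is just for illustration) shows how little downtime each extra nine allows per year:

```python
# Convert an availability percentage into the maximum downtime
# it permits over one (non-leap) year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_budget_minutes(availability_pct: float) -> float:
    """Minutes of downtime allowed per year at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for sla in (99.9, 99.99, 99.999):
    print(f"{sla}% availability -> {downtime_budget_minutes(sla):.1f} minutes/year")
```

At 99.99% the budget works out to roughly 52 minutes a year, which a single 5-hour outage exhausts several times over.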
While Public Cloud services have unquestionably revolutionised compute and storage over the past 10 years, the reputational and commercial damage caused by unplanned downtime can be detrimental, or even terminal, to dependent organisations.
In contrast to public cloud users, organisations with private cloud or hybrid environments can control when and where potentially disruptive maintenance operations take place. Additionally, if something does go wrong, they can quickly assess the problem, estimate the time to fix, and inform stakeholders appropriately. Things can and will always go wrong, but it is far more comforting to be able to take direct action to resolve a problem and manage the communication with our clients than to sit in the dark, waiting for the lights to come back on.