Whether you call last week’s AWS incident an “outage” or, as Amazon did, “Increased Error Levels,” the event caused widespread problems for high-profile sites and services across the Web. In the wake of an event of this magnitude, questions about trust inevitably start flying. Over dinner, a customer remarked on this very incident and questioned the validity of claims that cloud is “more secure” and “safer.” When an outage (which is really what it was, let’s be honest) like this occurs, such questions are appropriate, and even helpful for those of us in the technology business, because they force us to clarify exactly where we want to take these platforms and businesses of ours. The fundamental question at hand is, “Can we trust cloud computing?” Spoiler alert, because I don’t like dragging things out: the answer is a resounding YES. But even though you now know the end of my story, stick with me.
What’s important to me is that the terminology we’re working with is accurate. “Cloud” is an unfortunate term, in my opinion, because it gives the impression of something magical, with no defined boundaries. In a sense that’s true, but it’s also somewhat misleading. It’s true because of the cloud’s inherent elasticity. We’re very familiar with the notion of a data center built on physical servers with fixed costs. To expand the capacity of such a data center, the simple view is to scale “horizontally,” which essentially means adding more servers. In an elastic environment such as AWS, adding machines is handled as part of the service, and if your applications are architected to support that deployment model, the service can effectively grow and shrink dynamically to meet specific load needs. In that sense, the service does take on an almost magical quality.

What’s important to understand is that the magic is the service, not the actual machines. Behind Amazon’s services there are still physical machines, living in a physical data center, in a specific region, just like our traditional brick-and-mortar centers. This matters because the service is still fundamentally susceptible to the same types of failures a conventional data center might encounter, such as the cause of the actual outage in this case: an Amazon engineer performing routine maintenance accidentally removed servers from a pool of critical subsystems relied upon by the S3 service. These kinds of accidents happen every day at many different places. Amazon’s response was similar to what many of ours would have been: new processes are being put in place to further protect critical subsystems from these types of maintenance changes, along with a broad analysis of similar services and their potential exposure.
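To make the “grow and shrink dynamically” idea concrete, here is a minimal sketch of the kind of target-tracking rule an elastic service applies on your behalf. Every name and number below is illustrative; this is not any provider’s actual API, just the arithmetic behind the scaling decision.

```python
import math

def desired_capacity(current_load, capacity_per_server, min_servers=1, max_servers=20):
    """Return how many servers are needed to absorb the current load.

    A toy target-tracking rule: provision just enough servers to keep each
    one at or below its rated capacity, clamped to a fixed floor and
    ceiling. All parameters here are hypothetical, for illustration only.
    """
    needed = math.ceil(current_load / capacity_per_server)
    return max(min_servers, min(needed, max_servers))
```

An elastic platform re-evaluates a rule like this continuously, so a spike in `current_load` translates into more servers without anyone racking hardware, and the fleet shrinks back down when the spike passes.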
An important lesson for us in this is that the infrastructure behind a cloud service requires the same level of disaster recovery and overall business continuity planning that more traditional data centers require. In this case, only Amazon’s N. Virginia facility, which covers their East region, was affected. Providers with a traditional BCP plan spanning multiple AWS regions would have been far less affected by this incident.
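The multi-region idea can be sketched in a few lines. This is a hedged illustration, not production code: the region names, the `fetchers` callables, and the exception type are all hypothetical stand-ins; a real deployment would use the provider’s SDK against storage that is actually replicated across regions.

```python
class AllRegionsFailed(Exception):
    """Raised when no configured region can serve the request."""

def read_with_failover(fetchers):
    """Try each region's fetch callable in order; return the first success.

    `fetchers` maps a region name to a zero-argument callable that either
    returns the object or raises on failure. Failures are recorded and
    the next region is tried, so a single-region outage degrades latency
    rather than availability.
    """
    errors = {}
    for region, fetch in fetchers.items():
        try:
            return fetch()
        except Exception as exc:
            errors[region] = exc  # note the failure, fall through to next region
    raise AllRegionsFailed(f"every region failed: {errors}")
```

The point is architectural, not syntactic: if the read path only knows about one region, no amount of provider-side reliability saves you from that region’s bad day.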
There is no question that the benefits of services such as AWS far outweigh the risks in most cases, and our industry is rightfully moving toward this type of arrangement for the services we provide. It would be foolish to overreact to this particular incident. It would be equally foolish to assume that, because something is in the cloud, basic principles of business continuity planning somehow don’t apply. The cloud is not the future; it’s the present. With the right planning, we’ll be ready.