Hundreds of sites go down because of a typo made by an Amazon hosting employee

More and more processes are being computerized in the modern world, but the human factor still plays a major role. That became evident on Tuesday, February 28, when a large number of websites went down or slowed to a crawl.

It didn’t happen because of a hacker attack or anything like that – problems with Amazon’s S3 cloud storage service were to blame. So many sites and apps rely on S3 that a single mistake, flaw, or bug can take them all offline at once. And that’s exactly what happened here.

An Amazon employee was debugging a billing system that wasn’t working as it should and mistyped one of the inputs to a command that was supposed to take a small number of servers offline for a while. The result was disastrous: far more servers went down than intended, taking hundreds of sites with them.

It’s estimated that S3 is used by at least 40% of cloud services, including Venmo and Slack, so calling this situation a disaster is no exaggeration. It raises serious questions about the wisdom of relying on a single platform to host so many different sites, and it shows how vulnerable the Internet is in general. Amazon, however, promises to learn from its mistakes and prevent such outages in the future. The company will “do everything it can to learn from this event and use it to improve availability even further.” It remains to be seen whether this effort will succeed, but one thing is clear – humans still need to stay in control and not simply rely on computers to solve every problem.

Of course, there’s no need to panic – for now, it’s highly unlikely that machines will rise up and go on a crusade to destroy all humans, as they did in “The Terminator”. Still, certain conclusions should be drawn from what happened, and more resources should be spent on employee training. The S3 outage also showed that more attention should be paid to server resilience – the system needs to be improved so that a single error can no longer bring down a large number of servers. Implementing additional safety measures will be a step in the right direction and will make it possible to restore everything much faster in an emergency. As these events show, such scenarios are still possible, no matter how reliable modern computers have become.
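
To make the idea of such a safety measure concrete, here is a minimal sketch of a guard that refuses to take servers offline when one command asks for too many at once. It is only an illustration and not Amazon's actual tooling: the function `plan_removal`, the exception `CapacityError`, the limits `min_active` and `max_fraction`, and the server names are all hypothetical.

```python
class CapacityError(Exception):
    """Raised when a removal request would take too many servers offline."""


def plan_removal(active, requested, min_active, max_fraction=0.1):
    """Return the servers that may be taken offline, or raise CapacityError.

    active:       names of servers currently serving traffic
    requested:    names the operator asked to take offline
    min_active:   hypothetical floor of servers that must stay up
    max_fraction: hypothetical cap on the share removed by one command
    """
    active_set = set(active)

    # Reject names that are not active servers (likely a typo).
    unknown = sorted(set(requested) - active_set)
    if unknown:
        raise CapacityError(f"unknown servers: {unknown}")

    to_remove = sorted(set(requested))

    # Cap how much capacity a single command may remove.
    limit = int(len(active_set) * max_fraction)
    if len(to_remove) > limit:
        raise CapacityError(
            f"asked to remove {len(to_remove)} servers, limit is {limit}"
        )

    # Never drop below the minimum number of running servers.
    remaining = len(active_set) - len(to_remove)
    if remaining < min_active:
        raise CapacityError(
            f"only {remaining} servers would remain, minimum is {min_active}"
        )

    return to_remove


if __name__ == "__main__":
    servers = [f"billing-{i}" for i in range(20)]
    print(plan_removal(servers, ["billing-0", "billing-1"], min_active=15))
    try:
        # A mistyped input that selects far too many servers is refused.
        plan_removal(servers, servers[:12], min_active=15)
    except CapacityError as err:
        print("refused:", err)
```

With a check like this in place, a mistyped input produces an error message instead of an outage, and the operator gets a chance to correct the command before anything goes offline.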


