Friday, April 29, 2011

Post Mortem: What Happened During the Amazon Outage

Thursday morning, 4/21/2011, at 5:26AM CDT, we intended to very briefly take our site down for a routine maintenance task. Had all gone as planned, the site would have been down for only a few seconds while we applied a patch to the front page of the site. Unfortunately, not everything went as planned.

We use the Amazon AWS cloud hosting service because it is very stable and keeps things running much better than any other hosting company we've worked with. Amazon also makes multiple copies of every bit of data we save, which is normally a very good thing (it would take many, many hard drives failing simultaneously for any student data to be lost). See more about our choice in this post.

That morning, at 2:47AM CDT, Amazon performed their own routine maintenance on their AWS cloud hosting service. They've posted their own post mortem, but it all boils down to a couple of key points:

  • During the maintenance, someone made a typo, pointing some of their servers to the wrong backup location. This caused the copying of the data to fail, so each time something changed on a website, that website and the drives it was on locked up. It also caused the system that lets us restart our server to fail, which left our site stuck offline during what should have been a few-second switch.
  • Since the servers weren't able to reach the drives they thought they should reach while copying, they reported that something was wrong with those drives. If something is wrong with a drive, Amazon automatically takes that drive offline until they can inspect it and see what was wrong. Many, many drives were reported as broken, so all of those drives went offline. Amazon didn't have enough reserve capacity to handle this drive outage, causing their servers to run out of space.
We were able to get to our data, copy it to another server, and relaunch, but it took a long time for us to do so. Since we keep track of what students enter every time they answer, we have a very large amount of data, and just downloading it and uploading it takes several hours in each direction. We're working to make that less of an issue (see below).

Amazon is still working to make things more bulletproof, but they've already done a lot to prevent these problems from happening in the future. 
  • They're implementing more automation and other safeguards to stop the typo from occurring in the first place.
  • They've already added a lot of reserve capacity, and are adding more. My guess is hard drive salespeople in Virginia, where the servers are located, made a lot of money that day selling high-capacity drives to Amazon.
We are also taking our own steps to avoid similar issues in the future. By the Fall semester, we hope to mirror our data to multiple servers in both the US and Canada, so that, if another Amazon outage occurs, we can quickly move our site to one of these other locations with very minimal downtime.
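
For the technically curious, the failover idea can be sketched roughly as follows. This is only an illustration under assumptions, not a description of our actual setup: the location names and the health check are made up for the example.

```python
from typing import Callable, Optional, Sequence


def pick_healthy_mirror(
    mirrors: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first mirror that passes the health check, or None."""
    for mirror in mirrors:
        if is_healthy(mirror):
            return mirror
    return None


# Hypothetical location names, listed in order of preference.
MIRRORS = ["us-primary", "us-secondary", "canada"]

# Pretend the primary location is offline, as it was during the outage.
offline = {"us-primary"}

active = pick_healthy_mirror(MIRRORS, lambda name: name not in offline)
print(active)
```

In a real deployment the health check would be an actual request to each location, and the mirrors would need to be kept up to date ahead of time so that switching does not require copying all of the data again.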

We have heard of other homework systems being down for multiple days at a time, and we consider that absolutely unacceptable. It took Amazon going down to take us out for those 13 hours, and we don't plan to let that happen again.
