Friday, April 29, 2011

Post Mortem: What Happened During the Amazon Outage

Thursday morning, 4/21/2011, at 5:26AM CDT, we intended to take our site down very briefly for a routine maintenance task. Had all gone as planned, the site would have been down for only a few seconds while we applied a patch to the front page of the site. Unfortunately, not everything went as planned.

We use the Amazon AWS cloud hosting service because it is very stable and keeps things running much better than any other hosting company we've worked with before. Amazon also makes multiple copies of every bit of data we save, which is normally a very good thing (it would take many, many hard drives failing simultaneously for any student data to be lost). See more about our choice in this post.

That morning, at 2:47AM CDT, Amazon performed their own routine maintenance on their AWS cloud hosting service. They've posted their own post mortem, but it all boils down to a couple of key points:

  • During the maintenance, someone made a typo, pointing some of their servers to the wrong backup location. This caused the copying of the data to fail, so each time something changed on a website, that website and the drives it was on locked up. It also caused the system that lets us restart our server to fail, which left our site stuck offline during our few-second switch.
  • Since the servers weren't able to reach the drives they expected to reach while copying, they reported that something was wrong with those drives. If something is wrong with a drive, Amazon automatically takes that drive offline until they can inspect it and see what is wrong. Many, many drives were reported as broken, so all of those drives went offline. Amazon didn't have enough reserve capacity to handle this drive outage, causing their servers to run out of space.

We were able to get to our data, copy it to another server, and relaunch, but it took a long time for us to do so. Since we keep track of what students enter every time they answer, we have a very large amount of data, and just downloading it and uploading it takes several hours in each direction. We're working to make that less of an issue (see below).

Amazon is still working to make things more bulletproof, but they've already done a lot to prevent these problems from happening in the future:
  • They're implementing more automation and other safeguards to stop the typo from occurring in the first place.
  • They've already added a lot of reserve capacity, and are adding more. My guess is hard drive salespeople in Virginia, where the servers are located, made a lot of money that day selling high-capacity drives to Amazon.

We are also taking our own steps to avoid similar issues in the future. By the Fall semester, we hope to mirror our data to multiple servers in both the US and Canada, so that, if another Amazon outage occurs, we can quickly move our site to one of these other locations with minimal downtime.
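
To make the mirroring idea a bit more concrete, here is a minimal sketch of how a failover choice could work. It is only an illustration, not our actual setup: the mirror names, the `pick_active_mirror` function, and the health check passed into it are all hypothetical.

```python
from typing import Callable, Optional, Sequence

# Hypothetical mirror locations, listed in order of preference.
MIRRORS = ["us-east", "us-west", "canada"]


def pick_active_mirror(
    mirrors: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first mirror, in priority order, that passes the health check.

    Returns None if every mirror is reported as unhealthy.
    """
    for mirror in mirrors:
        if is_healthy(mirror):
            return mirror
    return None


if __name__ == "__main__":
    # Simulate an outage in the preferred location (an assumption for the demo).
    def outage_check(mirror: str) -> bool:
        return mirror != "us-east"

    print(pick_active_mirror(MIRRORS, outage_check))
```

In this sketch, the site would be served from the first location that responds as healthy, so an outage in one location only costs the time it takes to notice the problem and switch, as long as the data has already been copied to the other locations.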

We have heard of other homework systems being down for multiple days at a time, and we consider that absolutely unacceptable. It took Amazon going down to take us out for those 13 hours, and we don't plan to let that happen again.
