We use the Amazon Web Services (AWS) cloud hosting service because it is very stable and keeps things running much better than any other hosting company we've worked with. Amazon also makes multiple copies of every bit of data we save, which is normally a very good thing (many hard drives would have to fail simultaneously for any student data to be lost). See more about our choice in this post.
That morning, at 2:47 AM CDT, Amazon performed routine maintenance on its AWS cloud hosting service. They've posted their own post-mortem, but it all boils down to a couple of key points:
- During the maintenance, someone made a typo that pointed some of their servers to the wrong backup location. This caused the copying of data to fail, so each time something changed on a website, that website and the drives it was on locked up. It also broke the system that lets us restart our server, which left our site stuck offline during our few-second switch.
- Since the servers weren't able to reach the drives they expected to reach while copying, they reported that something was wrong with those drives. If something is wrong with a drive, Amazon automatically takes that drive offline until they can inspect it and see what is wrong. Many, many drives were reported as broken, so all of those drives went offline. Amazon didn't have enough reserve capacity to handle this drive outage, so their servers ran out of space.
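To make the second point concrete, here is a toy Python sketch of why a flood of "broken" drives exhausts reserve capacity. The function name and all of the numbers are made up for illustration; this is a simplified model, not how Amazon's storage system is actually implemented.

```python
def remirror(volumes_needing_copy, spare_slots):
    """Try to give each volume that lost its copy a fresh spare slot.

    Returns (re_mirrored, stuck): how many volumes found space, and how
    many are left waiting (and therefore locked up) with no spare available.
    """
    re_mirrored = min(volumes_needing_copy, spare_slots)
    stuck = volumes_needing_copy - re_mirrored
    return re_mirrored, stuck


# Hypothetical numbers, chosen only to show the effect.
normal_day = remirror(volumes_needing_copy=5, spare_slots=100)
outage_day = remirror(volumes_needing_copy=5000, spare_slots=100)

print(normal_day)  # (5, 0)
print(outage_day)  # (100, 4900)
```

On a normal day a handful of failures is easily absorbed by the spares. When thousands of drives are flagged at once, the spares run out almost immediately and everything else waits.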
We were able to get to our data, copy it to another server, and relaunch, but it took a long time for us to do so. Since we keep track of what students enter every time they answer, we have a very large amount of data, and just downloading it and uploading it takes several hours in each direction. We're working to make that less of an issue (see below).
Amazon is still working to make things more bulletproof, but they've already done a lot to prevent these problems from happening in the future.
- They're implementing more automation and other safeguards to keep this kind of typo from happening in the first place.
- They've already added a lot of reserve capacity, and are adding more. My guess is that hard drive salespeople in Virginia, where the servers are located, made a lot of money that day selling high-capacity drives to Amazon.
We are also taking our own steps to avoid similar issues in the future. By the Fall semester, we hope to mirror our data to multiple servers in both the US and Canada, so that if another Amazon outage occurs, we can quickly move our site to one of these other locations with minimal downtime.
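As a rough illustration of the failover idea, the sketch below picks the first healthy location from a priority list. The site names and the health check are hypothetical placeholders, and this is only a minimal sketch of the general approach, not a description of our actual setup.

```python
def pick_site(sites, is_healthy):
    """Return the first site, in priority order, that passes the health check.

    Returns None if every site is down.
    """
    for site in sites:
        if is_healthy(site):
            return site
    return None


# Hypothetical location names, listed in order of preference.
SITES = ["us-primary", "us-mirror", "canada-mirror"]

# Pretend the primary location is caught in an outage.
down = {"us-primary"}
active = pick_site(SITES, lambda site: site not in down)

print(active)  # us-mirror
```

Because every location already holds a copy of the data, switching is a matter of choosing a new location rather than spending hours copying data after the outage has started.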
We have heard of other homework systems being down for multiple days at a time, and we consider that absolutely unacceptable. It took Amazon going down to take us out for those 13 hours, and we don't plan to let that happen again.