Friday, April 29, 2011

Post Mortem: What Happened During the Amazon Outage

Thursday morning, 4/21/2011, at 5:26AM CDT, we intended to very briefly take our site down for a routine maintenance task. Had all gone as planned, the site would have been down for only a few seconds while we applied a patch to the front page of the site. Unfortunately, not everything went as planned.

We use the Amazon AWS cloud hosting service because they are very stable, and keep things running much better than any other hosting company we've worked with before. They also make multiple copies of every bit of data we save, which is normally a very good thing (it would take many, many hard drives failing simultaneously for any student data to be lost). See more about our choice in this post.

That morning, at 2:47AM CDT, Amazon performed their own routine maintenance on their AWS cloud hosting service. They've posted their own post mortem, but it all boils down to a couple of key points:

  • During the maintenance, someone made a typo, pointing some of their servers to the wrong backup location. This caused the copying of the data to fail, so each time something changed on a website, that website and the drives it was on locked up. It also caused the system that lets us restart our server to fail, leaving our site stuck offline during our few-second switch.
  • Since the servers weren't able to reach the drives they thought they should reach while copying, they reported that something was wrong with those drives. If something is wrong with a drive, Amazon automatically takes that drive offline until they can inspect it and see what was wrong. Many, many drives were reported as broken, so all of those drives went offline. Amazon didn't have enough reserve capacity to handle this drive outage, causing their servers to run out of space.
We were able to get to our data, copy it to another server, and relaunch, but it took a long time for us to do so. Since we keep track of what students enter every time they answer, we have a very large amount of data, and just downloading it and uploading it takes several hours in each direction. We're working to make that less of an issue (see below).

Amazon is still working to make things more bulletproof, but they've already done a lot to prevent these problems from happening in the future. 
  • They're implementing more automation and other safeguards to stop the typo from occurring in the first place.
  • They've already added a lot of reserve capacity, and are adding more. My guess is hard drive salespeople in Virginia, where the servers are located, made a lot of money that day selling high-capacity drives to Amazon.
We are also taking our own steps to avoid similar issues in the future. By the Fall semester, we hope to mirror our data to multiple servers in both the US and Canada, so that, if another Amazon outage occurs, we can quickly move our site to one of these other locations with very minimal downtime.
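As a rough illustration of the idea, here is a minimal sketch in Python of choosing which location should serve the site. The location names, the priority order, and the health check are all hypothetical; this is not our actual setup, only one simple way such a switch could be decided.

```python
def pick_active_site(sites, is_healthy):
    """Return the first site, in priority order, that passes the health check.

    `sites` is an ordered list of location names (hypothetical here), and
    `is_healthy` is any function that reports whether a location is reachable.
    Returns None if no location is healthy.
    """
    for site in sites:
        if is_healthy(site):
            return site
    return None


# Hypothetical locations, listed in order of preference.
MIRRORS = ["us-east", "us-west", "canada"]

# Simulated health status standing in for a real connectivity check.
simulated_status = {"us-east": False, "us-west": True, "canada": True}

active = pick_active_site(MIRRORS, lambda site: simulated_status[site])
print("Serving from:", active)
```

A sketch like this only helps if each mirror already holds a recent copy of the data, which is why the mirroring itself is the important part of the plan.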

We have heard of other homework systems being down for multiple days at a time, and we consider that absolutely unacceptable. It took Amazon going down to take us out for those 13 hours, and we don't plan to let that happen again.

Thursday, April 21, 2011

Sapling's Cloud Computing Choice

Sapling Learning, and indeed much of the Internet, has been greatly affected by the current failures within the Amazon Elastic Compute Cloud (EC2). We are very troubled by the outage and the issues that it is creating for professors and students that use our site. We at Sapling wanted to share with you what the Amazon "cloud" is and why Sapling chose to host our site within Amazon's EC2 services.

Amazon EC2 is a cloud web service that provides resizable computing capacity, and allows companies to easily configure their web services in response to load. This easy scalability, combined with Amazon's proven uptime record, is why many prominent businesses, including Netflix, Eli Lilly, Autodesk, Ericsson, Yelp, PBS, and ShareThis, have migrated their Web sites and software to EC2. NASA uses Amazon's servers to process telemetry data from the Mars rovers. The Amazon EC2 Service Level Agreement guarantees an Annual Uptime Percentage of 99.95%, and Amazon has performed well beyond this measure.

In 2010, at the same time as Netflix, Sapling began moving our Web hosting to Amazon EC2. Stability and scalability were our main reasons for doing so. Until this month, Sapling was very proud of our 99.99% uptime. To put this in perspective, in 2009 Gmail had only a 99.90% reliability rate, with its stated goal to reach 99.99%. In March 2010, Twitter had 99.74% uptime. Up until today, we have been very happy with our decision, especially considering that our previous hosting company has been down at least two times since we moved to Amazon EC2, and we had not had any issues.
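To make these percentages more concrete, here is a small sketch in Python that converts an uptime percentage into the hours of downtime it allows per year. It assumes a 365-day year of continuous service, and the function name is ours, chosen only for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years (an assumption)


def allowed_downtime_hours(uptime_percent, period_hours=HOURS_PER_YEAR):
    """Hours of downtime permitted in a period at a given uptime percentage."""
    return period_hours * (100.0 - uptime_percent) / 100.0


# The two annual figures quoted above: our 99.99% and the EC2 SLA's 99.95%.
for pct in (99.99, 99.95):
    hours = allowed_downtime_hours(pct)
    print(f"{pct}% uptime allows roughly {hours:.2f} hours of downtime per year")
```

Under these assumptions, 99.95% works out to a little over four hours a year, and 99.99% to a little under one hour.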

Unfortunately, Sapling and many other technology companies are affected by the current EC2 failure. These include leading Internet companies with millions of users such as Foursquare, Reddit, Hootsuite, and Quora, and because of the pervasive nature of this outage across the Internet, the Amazon failure has been prominently featured in the national news.

We are currently working on parallel paths to resolve the issue. One of these will ensure that Sapling's site is restored for tomorrow's classes in case Amazon is not able to fix its cloud services before then. Once we are back up and running, we will begin to implement a strategy that will guard against even this unprecedented cloud downtime.

Update on Site Availability

Sapling Learning has been unavailable for much of this morning, from about 6 AM Eastern to the present. Sapling is hosted on the Amazon Cloud, and the entire US-EAST-1 Region is experiencing connectivity issues. Sapling Learning, among thousands of other companies, has chosen to host our services on the Amazon Cloud for its excellent reliability and scalability. We are confident that Amazon will have their issues resolved quickly.

When the site is back up, your Technology TAs will grant an extension of 24 hours to any assignment that was due today. Please email your TechTA if you do not want the extension, or if you would prefer a longer extension for your classes.

We will post updates about Amazon's availability here hourly until this issue is resolved. Please visit this site for more details.

We apologize for the inconvenience that this has caused you and your students.

UPDATE (10:55 AM): We are still working to get the site back up. We apologize for the continued confusion. If your students have any questions, please forward them along to support@saplinglearning.com.

UPDATE (10:59 AM): Amazon has posted an explanation of the issue:
8:54 AM PDT We'd like to provide additional color on what we're working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it's difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We're starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.
UPDATE (11:39 AM): We're working on setting up the site on the USWEST half of the Amazon cloud, but transferring the very large amount of data we store will take a while. Either through Amazon's fix or our own, we will have the site up as soon as possible. We will update you as soon as we have a time estimate.

UPDATE (1:01 PM): Amazon reports that they have made "significant progress" in stabilizing the cloud servers. We hope to have the issues resolved soon.

UPDATE (1:22 PM): Amazon estimates that, at worst, it will take a "few hours" for the servers to be back up. More information is available through Amazon's RSS feed, but we'll continue to also update here.

UPDATE (2:45 PM): Amazon is reporting some success in launching instances on the cloud servers. We hope to be back up soon.

UPDATE (3:46 PM): We're making progress, but the site is still down.

UPDATE (4:19 PM): We think we're getting very close.

UPDATE (5:01 PM): Amazon has brought up 3 of 4 zones in the US-EAST-1 region. We, unfortunately, are on the fourth zone. We are continuing to try to work around this outage, and to put safeties in place to avoid extensive outages like this in the future.

UPDATE (6:22 PM): There is a lot of activity from the server guys right now, but we are not disturbing them to get details. It looks like it's getting very close.

UPDATE (6:35 PM): We're up! We pride ourselves on our 99.95% uptime, and apologize for today's frustrations.

Tuesday, April 12, 2011

Sapling Learning for High School

Sapling Learning is pleased to announce online homework and problem-solving practice for high-school Biology, Chemistry, Physics, and IPC aligned to TEKS and STAAR EOC. Learn more at hs.saplinglearning.com. Additionally, we will know in a matter of months whether we've been approved by the TEA, which means that Texas high schools will be able to use TEA allocated funds toward the purchase of Sapling Learning for their classes.