Firm averts Amazon cloud crash by ‘spreading out the risk’

A two-day outage at Amazon.com Inc.’s Northern Virginia data centre last week crippled several Web sites, including those of Foursquare, HootSuite, Quora and Reddit, but thanks to redundant cloud services, one Canadian company was able to avoid any major disruption.

By employing a combination of cloud and quasi-cloud backup services, the London, Ont.-based voice talent firm suffered only about 90 minutes of minor signal latency before recovering full online capabilities, while other Amazon clients did not fare as well.

“We had some lag time according to our lead programmer, but that was just for about 90 minutes. We did not receive any customer complaints,” said David Ciccarelli, the company’s founder. The firm stores audio files from more than 25,000 voice actors in an online database hosted by Amazon, and it works with producers such as DreamWorks and networks that include NBC, ABC and the History Channel. Its client-created audio files (as much as 20 terabytes) reside on Amazon’s servers, but the company’s other applications are spread over services that include RackSpace Inc., a Texas-based Web hosting and cloud management firm; Google Docs; Google Apps; Gmail; and a Web-based customer relations and communication service.

Disaster in the cloud

Trouble at Amazon’s data centre started a little after 5 a.m. Eastern time last Thursday when the company’s Service Health Dashboard reported connectivity problems that were affecting its Relational Database Service, which is used to manage a relational database in the cloud, across multiple zones in the eastern U.S.

Because of server problems at Amazon’s data centre, which handles the company’s EC2 Web hosting services, a number of Web sites, including popular Web 2.0 sites, were left staggering or disabled.

As of noon Eastern time last Friday, those sites had been affected for about 30 hours.

Earlier that day, at 5:41 a.m., Amazon reported that its engineers were making progress. At 9:18 a.m. it noted, “We’re starting to see more meaningful progress in restoring volumes (many have been restored in the last few hours) and expect this progress to continue over the next few hours.”

That was about 19 hours after Amazon reported Thursday afternoon that it was only a few hours away from having the problem solved.

Amazon updated users again at 11:49 a.m., saying that “many” customers had confirmed that their sites were recovering. “Our current estimate is that the majority of volumes will be recovered over the next five to six hours,” the company reported.

Reddit reported at 10:30 a.m. that it was still running in emergency mode. Foursquare appeared to be up and running, while Quora alternated between read-only mode and failing to load at all, showing an “internal server error” message.

Vancouver-based Twitter monitoring service HootSuite was also having problems, reporting at one point that it was “back up” and then changing to “again offline.”

Ezra Gottheil, an analyst at Technology Business Research, said the outage is a big problem for the disabled Web sites, but it’s an even bigger problem for Amazon.

“It’s a pretty big hit. It’s big and it’s public,” Gottheil added. “When you’re doing business on the Web, you don’t want to have your doors closed — ever. It’s tough for the sites. Most users will check again later, but [Amazon will] lose a few

Cloud services under a cloud

In the wake of Amazon’s most recent outage, supporters of cloud services are going to have a tough time arguing that the uptime delivered by cloud services is superior to anything corporate IT can deliver.

The Amazon outage “is going to be devastating,” according to Tref Laplante, the CEO of WorkXpress.

WorkXpress is a platform-as-a-service provider. It has created an entirely visual drag-and-drop development environment, built on Linux, Apache, MySQL and PHP, that allows apps to be developed without writing code. Its users, many of them businesses, have built apps for the medical, real estate, manufacturing and other industries.

Laplante says he has one customer — a small manufacturer whose core business application was built on WorkXpress and running on Amazon — who has been knocked offline. “They are fired up and they are very angry,” he said. The customer now wants the app hosted on a server in their shop.

Laplante said the Amazon outage, which began Thursday morning, is going to make it difficult to sell cloud approaches. “I’m going to have to sell against this outage.”

Paul Haugan, CTO of Lynnwood, Wash., said his city has been looking at Amazon’s cloud offerings, but “the recent outage confirmed, for us, that cloud services are not yet ready for prime time.”

Haugan’s view, which stems not from Amazon’s outage alone, is that “cloud services need some more maturing and a much more hardened infrastructure and security model prior to our adoption.”

How the voice talent firm avoided a cloud disaster

Ciccarelli remembers the 2008 outage suffered by Amazon. “That lasted about 12 hours. We received numerous calls from customers seeking customer support.” The company’s clients were unable to access their audio files, which they used to audition for voice-acting assignments, for the duration of that outage. Ciccarelli said the company suffered a hit to its reputation. “It wasn’t just that our IT department had to wade through a ton of calls,” he said. “Our reliability was put in question because our clients don’t really care that Amazon is providing us the cloud service, what they see is our company handling their audio files.”

Fortunately, despite the complaints, the company did not lose any clients.

Today, the company spreads the risk around.

RackSpace handles the voice talent firm’s critical online applications needed to run the Web site. The firm also uses Google Docs, Google Apps and Gmail for its office and email applications, and relies on a separate provider for its Web-based customer relations and customer communication services.

When last week’s Amazon outage struck, the Web site remained open to visitors, and customers and clients were able to carry out most transactions because these services were powered by RackSpace’s servers. “This was our quasi-cloud services because the site apps are actually run through RackSpace servers in Texas,” said Ciccarelli. The company was also able to keep communicating with its customers through its Web-based customer communication service.

The only hitch was that for 90 minutes, users could not access their audio files stored on Amazon servers.

Ciccarelli said the company decided to keep the audio files with Amazon because the 20 TB of voice files was simply “unfeasible and too expensive to house in RackSpace’s servers.”

Office operations continued because office applications ran on Google Docs and Google Apps, while company email was handled by Gmail.
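To make the risk-spreading argument concrete, here is a minimal Python sketch that models the setup described above as a map from business functions to the provider hosting each one, and reports which functions would go offline if a single provider failed. The function labels and helper names are hypothetical illustrations, not the company’s actual systems, and the model assumes each function depends on exactly one provider (the unnamed customer-communication provider is left out).

```python
# Hypothetical model: each business function depends on exactly one provider.
# Labels are illustrative, based on the services named in the article.
DEPENDENCIES: dict[str, str] = {
    "web site and transactions": "RackSpace",
    "audio file storage": "Amazon",
    "office documents": "Google",  # Google Docs / Google Apps
    "email": "Google",             # Gmail
}


def affected_functions(deps: dict[str, str], failed_provider: str) -> list[str]:
    """Return the functions that go offline when one provider fails."""
    return sorted(name for name, provider in deps.items() if provider == failed_provider)


def surviving_functions(deps: dict[str, str], failed_provider: str) -> list[str]:
    """Return the functions that keep running when one provider fails."""
    return sorted(name for name, provider in deps.items() if provider != failed_provider)


if __name__ == "__main__":
    print("Down:", affected_functions(DEPENDENCIES, "Amazon"))
    print("Up:  ", surviving_functions(DEPENDENCIES, "Amazon"))
```

Under these assumptions, an Amazon failure affects only audio file storage, which is the one hitch the company reported. The same model also shows the limit of the approach: a Google failure would take down both office documents and email, since both run on the same provider.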

“Not having all our eggs in one basket adds extra layers of redundancy in case disaster strikes,” said Ciccarelli.
(With notes from Sharon Gaudin and Patrick Thibodeau)

Nestor Arellano is a Senior Writer.