Firm averts Amazon cloud crash by ‘spreading out the risk’

A two-day power outage at Amazon.com Inc.’s Northern Virginia data centre last week crippled several Web sites including those of Foursquare, HootSuite, Quora and Reddit, but thanks to redundant cloud services a Canadian company was able to avoid any major disruption.

By employing a combination of cloud and quasi-cloud back-up services, Voices.com, a London, Ont.–based voice talent firm, suffered only about 90 minutes of minor signal latency before recovering full online capabilities, while other Amazon clients did not fare as well.

“We had some lag time according to our lead programmer but that was just for about 90 minutes. We did not receive any customer complaints,” David Ciccarelli, founder of Voices.com, told ITBusiness.ca. Voices.com stores audio files of more than 25,000 voice actors in its online database hosted by Amazon. The London, Ont.-based company works with producers such as DreamWorks and networks that include NBC, ABC, and the History Channel.

Voices.com stores its client-created audio files (as much as 20 terabytes) on Amazon’s servers. But the company’s other applications are spread over services that include: RackSpace Inc., a Texas-based Web hosting and cloud management firm; Google Docs; Google Apps; Gmail; and Salesforce.com.

Disaster in the cloud

Trouble at Amazon’s data centre started a little after 5 a.m. Eastern time last Thursday when the company’s Service Health Dashboard reported connectivity problems that were affecting its Relational Database Service, which is used to manage a relational database in the cloud, across multiple zones in the eastern U.S.

Because of server problems at Amazon’s data centre, which handles the company’s EC2 Web hosting services, Web sites, including popular Web 2.0 sites, were left staggering or disabled.

As of noon Eastern time last Friday, those sites had been affected for about 30 hours.

Earlier that day, at 5:41 a.m., Amazon reported that its engineers were making progress. At 9:18 a.m. it noted, “We’re starting to see more meaningful progress in restoring volumes (many have been restored in the last few hours) and expect this progress to continue over the next few hours.”

That was about 19 hours after Amazon reported Thursday afternoon that it was only a few hours away from having the problem solved.

Amazon updated users again at 11:49 a.m., saying that “many” customers had confirmed that their sites were recovering. “Our current estimate is that the majority of volumes will be recovered over the next five to six hours,” the company reported.

Reddit reported at 10:30 a.m. that it was still running in emergency mode. Foursquare appeared to be up and running, while Quora was bouncing between read-only mode and failing to launch at all, showing an “internal server error” message.

Vancouver-based Twitter monitoring service HootSuite was also having problems, reporting at one point that it was “back up” and then changing to “again offline.”

Ezra Gottheil, an analyst at Technology Business Research, said the outage is a big problem for the disabled Web sites, but it’s an even bigger problem for Amazon.

“It’s a pretty big hit. It’s big and it’s public,” Gottheil added. “When you’re doing business on the Web, you don’t want to have your doors closed — ever. It’s tough for the sites. Most users will check again later, but [Amazon will] lose a few

Cloud services under a cloud

Thanks to Amazon’s most recent outage, supporters of cloud services are going to have a tough time arguing that the uptime delivered by cloud services is superior to anything corporate IT can deliver.

The Amazon outage “is going to be devastating,” according to Tref Laplante, the CEO of WorkXpress.

WorkXpress is a platform as a service. It has created an entirely visual drag-and-drop development environment using Linux, Apache, MySQL and PHP to allow app development without writing code. Its users, which include many businesses, have built apps used in medical, real estate, manufacturing and other industries.

Laplante says he has one customer — a small manufacturer whose core business application was built on WorkXpress and running on Amazon — who has been knocked offline. “They are fired up and they are very angry,” he said. The customer now wants the app hosted on a server in their shop.

Laplante said the Amazon outage, which began Thursday morning, is going to make it difficult to sell cloud approaches. “I’m going to have to sell against this outage.”

Paul Haugan, CTO of Lynnwood, Wash., said his city has been looking at Amazon’s cloud offerings, but “the recent outage confirmed, for us, that cloud services are not yet ready for prime time.”

Haugan’s view, which does not stem from Amazon’s outage alone, is that “cloud services need some more maturing and a much more hardened infrastructure and security model prior to our adoption.”

How Voices.com avoided a cloud disaster

Ciccarelli of Voices.com remembers the 2008 power outage suffered by Amazon. “That lasted about 12 hours. We received numerous calls from customers seeking customer support.”

Voices.com clients were unable to access their audio files for the duration of the outage. These were files the customers used to audition for assignments requiring voice actors.

Voices.com, said Ciccarelli, suffered a hit to its reputation. “It wasn’t just that our IT department had to wade through a ton of calls,” he said. “Our reliability was put in question because our clients don’t really care that Amazon is providing us the cloud service, what they see is our company handling their audio files.”

Thankfully, despite the complaints, Voices.com did not lose any clients.

Today, Voices.com spreads the risk around.

RackSpace handles the voice talent firm’s critical online applications needed to run the Web site.

Voices.com also uses Google Docs, Google Apps and Gmail for its office and email apps and employs Salesforce.com for its Web-based customer relations and customer communication services.

When last week’s Amazon outage struck, the Voices.com Web site remained open to visitors, and customers and clients were still able to carry out most transactions because these services were powered by RackSpace’s servers. “This was our quasi-cloud services because the site apps are actually run through RackSpace servers in Texas,” said Ciccarelli.

Voices.com was still able to communicate with its customers through Salesforce.com.

The only hitch was that for 90 minutes, users could not access their audio files stored on Amazon servers.

Ciccarelli said the company decided to keep the audio files with Amazon because that 20 TB of voice files was just too “unfeasible and too expensive to house in RackSpace’s servers.”

Office operations at Voices.com went on because office applications were being run through Google Docs and Google Apps while company email was handled by Gmail.
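The arrangement described above can be pictured as a simple map of business functions to providers. The following is a minimal illustrative sketch only, not anything Voices.com is known to run: the function labels and the `functions_by_status` helper are hypothetical names, and the mapping is an assumption drawn from the description in this story.

```python
# Hypothetical mapping of business functions to providers, based on the
# arrangement described in the article. All names are illustrative only.
SERVICE_PROVIDERS = {
    "audio_file_storage": "Amazon",
    "web_site_applications": "RackSpace",
    "office_documents": "Google",
    "email": "Google",
    "customer_communications": "Salesforce.com",
}


def functions_by_status(down_providers):
    """Return (still_available, affected) functions for a set of failed providers."""
    down = set(down_providers)
    available = sorted(f for f, p in SERVICE_PROVIDERS.items() if p not in down)
    affected = sorted(f for f, p in SERVICE_PROVIDERS.items() if p in down)
    return available, affected


if __name__ == "__main__":
    # Simulate an outage at a single provider.
    available, affected = functions_by_status({"Amazon"})
    print("Still available:", ", ".join(available))
    print("Affected:", ", ".join(affected))
```

With only Amazon marked as down, the sketch lists audio file storage as the sole affected function, which matches the single hitch the company reported.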

“Not having all our eggs in one basket adds extra layers of redundancy in case disaster strikes,” said Ciccarelli.
 
(With notes from Sharon Gaudin and Patrick Thibodeau)

Nestor Arellano is a Senior Writer at ITBusiness.ca.
