We often hear about the promise of big data for organizations to improve everything from inventory fulfillment to marketing strategies, but rarely do we get the nitty-gritty details of how it might be done and with what tools. That’s why this article from Ben Lorica, the chief data scientist at O’Reilly Media Inc., is so refreshing: it details the growth of a promising new open source big data engine and how some companies are using it to their advantage.
Apache Spark is an open source cluster computing system that purports to scour through big data at up to 100 times the speed of alternative engines such as Hadoop MapReduce. It achieves that speed by computing in memory rather than on disk, and it also aims to make developers’ jobs easier by letting them code in Scala, Java, or Python. The project is built upon the Berkeley Data Analytics Stack. The first ever Spark Summit takes place in San Francisco on Dec. 2 and will feature real-world examples of Spark in action from the likes of Yahoo, Amazon.com, and Cloudera.
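To give a feel for the developer-friendly programming model described above, here is a rough sketch in plain Python (deliberately not the Spark API itself) of the classic word-count pattern that Spark expresses as a chain of flatMap, map, and reduceByKey transformations over a distributed dataset. The input lines and helper names are illustrative, not taken from the article.

```python
# Sketch of Spark's transformation style in plain Python.
# In PySpark the equivalent pipeline would look roughly like:
#   sc.textFile("logs.txt") \
#     .flatMap(lambda line: line.split()) \
#     .map(lambda word: (word, 1)) \
#     .reduceByKey(lambda a, b: a + b)
# Here each step is mimicked locally so the shape of the computation is clear.

def flat_map(f, data):
    # Apply f to each element and flatten the resulting lists into one list.
    return [y for x in data for y in f(x)]

def reduce_by_key(f, pairs):
    # Merge all values sharing a key with the combining function f.
    out = {}
    for key, value in pairs:
        out[key] = f(out[key], value) if key in out else value
    return out

lines = ["spark makes big data fast", "big data needs big memory"]  # toy input
words = flat_map(lambda line: line.split(), lines)   # flatMap step
pairs = [(word, 1) for word in words]                # map step
counts = reduce_by_key(lambda a, b: a + b, pairs)    # reduceByKey step
print(counts)
```

The key point is that each stage produces an intermediate collection that Spark would keep in memory across the cluster, which is where its speed advantage over disk-based MapReduce comes from.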
Perhaps most important for those in line-of-business roles is the trend of developers finding ways to render the terabytes of data involved in an analysis as visual representations. That means complex analysis could soon be more accessible to business users through the tools they’re already accustomed to working with for these purposes. Take the example of ClearStory, a startup that built a platform allowing users to bring together several data sources and produce a visual graphic of the analysis.
While Spark is a young platform that hasn’t yet hit version 1.0, it counts 67 contributors in its community and is being deployed at companies where real-time data analysis is crucial. With a summit coming up that will feature case studies from major Web brands like Yahoo, you can expect this is one platform you’ll be hearing a lot about in 2014.