Curious Insight

Technology, software, data science, machine learning, entrepreneurship, investing, and various other topics


Why Spark May Be Even Bigger Than The Hype

15th August 2015

Spark is currently one of the hottest open source projects in the big data space, even eclipsing Hadoop in terms of excitement. Originally a Berkeley AMPLab project, Spark became a top-level Apache project early last year and has been on a tear ever since. The company now backing Spark – Databricks – has been churning out major releases every few months with no signs of slowing down. But to claim that Spark is solely a Databricks-driven endeavor is a bit misleading. Spark is getting contributions from all over the place, including many of its biggest users, which is one of the characteristics of the project that make it so interesting. As of this writing there have been over 600 individual contributors to the project according to GitHub’s stats. That goes way beyond any single organization.

To give a better sense of just how quickly Spark has taken off, consider that three years ago it was an obscure academic experiment based on distributed systems research taking place at AMPLab. By the end of 2013, the first Spark Summit was held with over 400 developers in attendance. And just last month, the 2015 event in San Francisco sold out with over 2,000 in attendance. Spark is currently the most active project under the stewardship of the Apache Software Foundation and already boasts more than 500 production deployments, according to Patrick Wendell, one of the project’s founders and a co-founder of Databricks.

Early adopters have clearly taken notice of Spark’s rapid rise, and with good reason. Spark brings a lot of innovation to big data processing. Hadoop pushed the boundary of what is possible in terms of handling scale and variety of data, but MapReduce is fundamentally a batch-oriented approach. You write a MapReduce job, set it off, go get some coffee (or maybe lunch, depending on the size of the job) and hopefully get some results when you return. SQL abstractions such as Hive and Impala have reduced this friction somewhat by both alleviating the need to write MapReduce code and optimizing the execution plan of the code to improve performance. However, one would rarely hear the experience described as “real-time” except for relatively small jobs. Spark brings a different approach to the table. By intelligently using in-memory storage along with more sophisticated distributed processing algorithms, Spark is able to bring a more interactive feel to working with big data. Along with much more developer-friendly APIs in languages like Scala and Python, Spark opens the doors to a whole new audience that would never have even considered touching MapReduce if they didn’t have to.
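To make the in-memory point a little more concrete, here is a toy, single-process sketch in Python. It is not Spark's API and it is not how Spark is implemented: ToyDataset, read_log, and count_where are hypothetical names made up for illustration, and the "expensive read" is just a function call. The only idea it tries to show is that once a dataset is held in memory, follow-up questions no longer pay the cost of re-reading the source, which is roughly what makes a session feel interactive rather than batch-like.

```python
class ToyDataset:
    """Toy stand-in for a distributed dataset (hypothetical, not Spark's API)."""

    def __init__(self, load_fn):
        self._load_fn = load_fn   # stands in for an expensive scan of the source
        self._cached = None       # in-memory copy, filled in by cache()
        self.loads = 0            # how many times the source has been read

    def _records_from_source(self):
        # Every call here represents a full re-read of the underlying data.
        self.loads += 1
        return list(self._load_fn())

    def cache(self):
        # Read the source once and keep the records in memory.
        self._cached = self._records_from_source()
        return self

    def records(self):
        # Serve from memory when cached; otherwise go back to the source.
        if self._cached is not None:
            return self._cached
        return self._records_from_source()

    def count_where(self, predicate):
        # One "query": count the records that match the predicate.
        return sum(1 for record in self.records() if predicate(record))


def read_log():
    # Hypothetical sample data standing in for files on a cluster.
    return ["INFO start", "ERROR disk", "WARN slow", "ERROR net"]


# Batch style: every query re-reads the source.
batch = ToyDataset(read_log)
batch_errors = batch.count_where(lambda line: line.startswith("ERROR"))
batch_warnings = batch.count_where(lambda line: line.startswith("WARN"))

# Interactive style: read once, then answer queries from memory.
interactive = ToyDataset(read_log).cache()
cached_errors = interactive.count_where(lambda line: line.startswith("ERROR"))
cached_warnings = interactive.count_where(lambda line: line.startswith("WARN"))
```

Both versions give the same answers, but batch.loads ends up at 2 while interactive.loads stays at 1. On four log lines that difference is meaningless; on a scan that takes minutes, it is the difference between going for coffee and getting an answer right away.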

Yet while Spark appears on the surface to be competing with Hadoop, it also runs on YARN and supports HDFS file storage. Spark is frequently installed on existing Hadoop clusters alongside Hive, HBase and the like, giving users another option for interacting with the data already in the cluster. In that sense it is more complementary to Hadoop than adversarial, and may even help to increase adoption of Hadoop. There’s a reason that all of the major Hadoop vendors have embraced Spark, and companies like IBM (a traditional enterprise stalwart) are purportedly going all in with initiatives to expand Spark’s footprint. The tidal wave is already occurring – why not embrace it?

But despite all of the optimism, there are definitely reasons for pause. One major red flag is probably Databricks itself. The history of companies whose business model is based entirely on open source products is mixed to say the least. Although their model isn’t the traditional “license support from the experts” approach, they’re arguably taking an even more tenuous position. They’re essentially betting that they can entice enterprises to use their cloud platform (Spark-as-a-service?) over base-level Spark by providing some nice “extras” such as managed deployment/scheduling and an interactive workspace tool (which looks an awful lot like the Jupyter project, formerly IPython notebook). But what happens if the open source community or a major Hadoop vendor catches up to Databricks’ proprietary value-add code, thus negating any advantage to using their platform? Perhaps even more concerning – how does Databricks balance the need to keep Spark moving forward (which benefits everyone) vs. improving their own platform (which makes their business viable)? This is new territory and it may be years before we see how this plays out.

I think concerns over whether or not Spark can keep this momentum going are probably justified, but I still feel pretty optimistic about its future. One of the reasons I feel this way is that the vision the project’s founders have for Spark going forward sounds amazing. If you watch some of the keynote speeches from the last summit, there’s a lot of excitement about where things are today but even more excitement about where things are going. Essentially the goal is to turn Spark into an “operating system” for data processing. A variety of lightweight front-ends in varying languages will compile code down to a single logical model based on a common data frames API. From there, initiatives like Project Tungsten will take that logical plan and push the performance envelope with innovations like cache-aware computation and advanced code generation. Longer term, Spark may even be able to compile instructions to frameworks like LLVM or OpenCL and leverage alternative computation engines like GPUs (currently the de facto standard for training “deep learning” neural nets). If the front-end APIs and library of distributed algorithms evolve to a point where it’s as easy to run analytics and machine learning on Spark as it is on a single computer today (or perhaps even easier), this could be a game-changer.

So although there’s a great deal of hype around Spark today, and it’s entirely possible that much of it ends up being overblown, I think it’s also possible that it ends up getting much, much bigger. IBM has called Spark “potentially the most significant open source project of the next decade”. Andrew Brust wrote earlier this year that “…when platforms get beyond a certain critical mass of support, they eventually become what the hype has made them out to be. In other words, belief in the quality of a platform tends to self-fulfill.” Will great vision, top-notch engineering talent, and a widespread belief that it will be successful end up being enough to overcome the challenges that lie ahead? Only time will tell, but one thing is certain – it’s going to be fun to watch.
