Curious Insight

Technology, software, data science, machine learning, entrepreneurship, investing, and various other topics



Comparing The "Big Three" Hadoop Distros

11th April 2015

Anyone who’s paying attention to the “big data” buzz these days is probably aware of just how hot Hadoop is right now. In fact, the hype is so great, and the market so untapped, that adoption is expected to increase 25-fold over the next 5 years. Given that, it’s no surprise that large enterprises are looking at Hadoop and asking, “Should we be getting in on this? What’s the deal here?” Equally unsurprising is the observation - admittedly my own, likely biased, but still probably reasonably accurate - that most large enterprises have virtually no clue what Hadoop is or how to use it. Still, let’s say that you work for a company that has decided to “do Hadoop”, regardless of whether said company actually knows anything about Hadoop or has any actual need for it. Let’s also suppose that said company is not likely to roll its own distro by compiling the various Apache projects from source on GitHub. You’re now at a point where you need to pick a vendor and use their distro (and likely their support model, training, and so on). How do you choose?

Although there is no easy and completely obvious answer (if there were, there wouldn’t be any market competition in the first place), there are certainly some good heuristics to draw on. Fortunately, I recently had the opportunity to sit down with teams from each of the major vendors and discuss their offerings, and I think some of the general observations that I took away from these meetings may be useful to others in a similar position.

There’s essentially a “big three” in the world of enterprise Hadoop distros – Hortonworks, Cloudera, and MapR. There are technically other distros in the market, but nothing really worth mentioning. If you’re starting an evaluation today, these are the main players. What’s really fascinating though is how different they all are, and how much their business strategies vary, despite all of them literally pushing the exact same core code base.

First, let’s get some of the similarities out of the way. All three companies are more or less entirely focused on Hadoop. It’s not just a piece of their business, it’s their entire revenue stream. All three are mid-size companies (500-1000 employees) with paying customers numbering in the hundreds and an array of partnerships across various industries. They all provide free, downloadable versions of their Hadoop distributions (Cloudera and MapR also ship “premium” distributions for paying customers). They all rely on support contracts for at least part of their customer revenue. They also all employ engineers with committer status on at least some of the projects in the very broad Hadoop “ecosystem”. But that’s where the similarities end.

Hortonworks is both the youngest of the three and the only one that has already gone public. Right out of the gate their strategy is pretty clear – they’re all about open source. Hortonworks pretty much wants to be for Hadoop what Red Hat is for Linux – successful based largely on a robust support model for otherwise entirely free software. Of the three enterprise distros, theirs is the only one that doesn’t include any proprietary software at all. Rather, their approach is to focus on improving the open source projects (or creating new projects if necessary) to fill gaps and improve the overall story for Hadoop. Their business is then based around being the go-to experts to provide support for the platform, because they have the engineers that are writing the code in the first place (of course the other vendors are committing on most of the same projects as well). They’ve also formed close partnerships with major enterprise software vendors like Microsoft and SAP, which helps them get their foot in the door.

If Hortonworks is the pure open-source advocate of the bunch, Cloudera is something of a hybrid. They’re definitely heavily invested in open source and they’ve got quite a few Apache committers themselves, so it’s not like they’re riding on Hortonworks’ coattails. But they’re also championing non-Apache projects like the Impala SQL engine which, while still open source, is owned and controlled by Cloudera. In addition, Cloudera packages their own proprietary tools into their distribution for things like security and cluster management. They consider these tools part of their competitive advantage. Cloudera touts its size and first-mover status (they were the first to create an enterprise distribution by quite a wide margin) along with its support model as differentiators in the marketplace.

MapR seems to be taking an entirely different approach. Their strategy is 100% focused on the product and using it to solve problems. While they do use (most of) core Hadoop, they’re not as concerned with open source as the other two. MapR has by far the most proprietary software in their stack, most notably a custom file system (they do NOT use HDFS, although it sounded like you could if you really wanted to). Another key differentiator is that MapR doesn’t use a “NameNode” architecture. Instead, they’ve developed their own way to track metadata about the cluster from within the cluster nodes themselves. Their claim is that all of this results in significantly better performance and flexibility than the other distros. They’re also the only distro that supports Apache Drill, which is yet another SQL query engine for Hadoop.

All three companies seem to have a lot going for them, and one would probably do well using any of their distros. But given that they’re all pushing the same core product, I found it surprising how different they are. Probably the most interesting (and entertaining) part of the presentations was how each company would throw in subtle (and often not-so-subtle) digs at their competitors. Hortonworks’ message was basically “we’re the only pure open-source option, and in the long run, open source wins”. Cloudera’s message was “we’ve been first to market on every front, we’re the biggest, we have the most resources, and we’ve got enterprise-grade IP that sets us apart”. MapR’s message, on the other hand, was essentially “HDFS sucks, NameNodes suck, our competitors suck, our distro is way better at everything” (the “Googleyness” is strong with those guys). MapR even went so far as to call out Doug Cutting, the original creator of Hadoop, for building a half-baked, reverse-engineered file system in HDFS that was a poor man’s version of the “real thing” that Google…I mean MapR built (I swear I’m not making this stuff up).

In the end, after six hours of marketing slides and competitor-shaming, we were left with lots of new information to digest but no clear winner in sight. That, I suppose, is to be expected – we were never going to decide based on a PowerPoint deck. There was, however, a clear distinction in philosophy and strategy among the three. That element alone may be enough for some organizations to reach a conclusion. If open source is your thing, then Hortonworks is for you. If it’s all about product, then MapR may be the way to go. If you fit somewhere in the middle, then Cloudera may be a good choice. On the other hand, if you just want to have some fun, then invite them all in, grab some popcorn, and ask them what they think of each other.


Big Data, Strategy

Data scientist, engineer, author, investor, entrepreneur