Archive for the ‘HBase’ Category
My team is currently working on a brand new product – the forthcoming MMO www.FightMyMonster.com. This has given us the luxury of building against a NOSQL database, which means we can put the horrors of MySQL sharding and expensive scalability behind us. Recently a few people have been asking why we seem to have changed our preference from HBase to Cassandra. I can confirm the change is true and that we have in fact almost completed porting our code to Cassandra, and here I will seek to provide an explanation.
For those that are new to NOSQL, in a following post I will write about why I think we will see a seismic shift from SQL to NOSQL over the coming years, which will be just as important as the move to cloud computing. That post will also seek to explain why I think NOSQL might be the right choice for your company. But for now I will simply relay the reasons why we have chosen Cassandra as our NOSQL solution.
Caveat Emptor – if you’re looking for a shortcut to engaging your neurons be aware this isn’t an exhaustive critical comparison, it just summarizes the logic of just another startup in a hurry with limited time and resources!!
Did Cassandra’s bloodline foretell the future?
One of my favourite tuppences for engineers struggling to find a bug is “breadth first not depth first”. This can be annoying for someone working through complex technical details, because it implies that the solution is actually much simpler if they only looked (advice: only use this saying with established colleagues who will forgive you). I coined this saying because in software matters I find that if we force ourselves to examine the top level considerations first, before tunnelling down into the detail of a particular line of enquiry, we can save enormous time.
So before getting technical, I’ll mention I might have heeded my motto better when we were making our initial choice between HBase and Cassandra. The technical conclusions behind our eventual switch might have been predicted: HBase and Cassandra have dramatically different bloodlines and genes, and I think this influenced their applicability within our business.
Loosely speaking, HBase and its required supporting systems are derived from what is known of the original Google BigTable and Google File System designs (as known from the Google File System paper Google published in 2003, and the BigTable paper published in 2006). Cassandra on the other hand is a recent open source fork of a standalone database system initially coded by Facebook, which while implementing the BigTable data model, uses a system inspired by Amazon’s Dynamo for storing data (in fact much of the initial development work on Cassandra was performed by two Dynamo engineers recruited to Facebook from Amazon).
In my opinion, these differing histories have resulted in HBase being more suitable for data warehousing, and large scale data processing and analysis (for example, such as that involved when indexing the Web) and Cassandra being more suitable for real time transaction processing and the serving of interactive data. Writing a proper study of that hypothesis is well beyond this post, but I believe you will be able to detect this theme recurring when considering the databases.
NOTE: if you are looking for lightweight validation you’ll find the current makeup of the key committers interesting: the primary committers to HBase work for Bing (M$ bought their search company last year, and gave them permission to continue submitting open source code after a couple of months). By contrast the primary committers on Cassandra work for Rackspace, which supports the idea of an advanced general purpose NOSQL solution being freely available to counter the threat of companies becoming locked in to the proprietary NOSQL solutions offered by the likes of Google, Yahoo and Amazon EC2.
Malcolm Gladwell would say my unconscious brain would have known immediately that my business would eventually prefer Cassandra based upon these differing backgrounds. It is horses for courses. But of course, justifying a business decision made in the blink of an eye is difficult…
Which NOSQL database has the most momentum?
Another consideration that has persuaded us to move to Cassandra is a belief that it is now has the most general momentum in our community. As you know, in the business of software platforms the bigger you get the bigger you get – where platforms are perceived as similar, people tend to aggregate around the platform that is going to offer the best supporting ecosystem in the long term (i.e. where the most supporting software is available from the community, and where the most developers are available for hire). This effect is self-reinforcing.
When starting with HBase, my impression then was that it had the greatest community momentum behind it, but I now believe that Cassandra is coming through much stronger. The original impression was partly created by two very persuasive and excellently delivered presentations given by the CTOs of StumpleUpon and Streamy, two big players in the Web industry who committed to HBase some time before Cassandra was really an option, and also from a quick reading of an article entitled “HBase vs Cassandra: NoSQL Battle!” (much of which has now been widely debunked).
Proving momentum comprehensively is difficult to do, and you will have to poke about for yourself, but one simple pointer I offer you is the developer activity on IRC. If you connect to freenode.org and compare the #hbase and #cassandra developer channels, you will find Cassandra typically has twice the number of developers online at any time.
If you consider Cassandra has been around for half as long as HBase, you can see why this is quite a clear indication of the accelerating momentum behind Cassandra. You might also take note of the big names coming on board, such as Twitter, where they plan broad usage (see here).
Note: Cassandra’s supporting website looks much lovelier than HBase’s, but seriously, this could be a trend driven by more than the marketing. Read on!
Deep down and technical: CAP and the myth of CA vs AP
There is a very powerful theorem that applies to the development of distributed systems (and here we are talking about distributed databases, as I’m sure you’ve noticed). This is known as the CAP Theorem, and was developed by Professor Eric Brewer, Co-founder and Chief Scientist of Inktomi.
The theorem states, that a distributed (or “shared data”) system design, can offer at most two out of three desirable properties – Consistency, Availability and tolerance to network Partitions. Very basically, “consistency” means that if someone writes a value to a database, thereafter other users will immediately be able to read the same value back, “availability” means that if some number of nodes fail in your cluster the distributed system can remain operational, and “tolerance to partitions” means that if the nodes in your cluster are divided into two groups that can no longer communicate by a network failure, again the system remains operational.
Professor Brewer is an eminent man and many developers, including many in the HBase community, have taken it to heart that their systems can only support two of these properties and have accordingly worked to this design principle. Indeed, if you search online posts related to HBase and Cassandra comparisons, you will regularly find the HBase community explaining that they have chosen CP, while Cassandra has chosen AP – no doubt mindful of the fact that most developers need consistency (the C) at some level.
However I need to draw to your attention to the fact that these claims are based on a complete non sequitur. The CAP theorem only applies to a single distributed algorithm (and here I hope Professor Brewer would agree). But there is no reason why you cannot design a single system where for any given operation, the underlying algorithm and thus the trade-off achieved is selectable. Thus while it is true that a system may only offer two of these properties per operation, what has been widely missed is that a system can be designed that allows a caller to choose which properties they want when any given operation is performed. Not only that, reality is not nearly so black and white, and it is possible to offer differing degrees of balance between consistency, availability and tolerance to partition. This is Cassandra.
This is such an important point I will reiterate: the beauty of Cassandra is that you can choose the trade-offs you want on a case by case basis such that they best match the requirements of the particular operation you are performing. Cassandra proves you can go beyond the popular interpretation of the CAP Theorem and the world keeps on spinning!
For example, let’s look at two different extremes. Let us say that I must read a value from the database with very high consistency – that is, where I will be 100% sure to receive the last copy of that data which was previously written. In this case, I can read the value from Cassandra specifying consistency level “ALL”, which requires that all the nodes that hold replicated copies of that data agree on its value. In this case, I have zero tolerance to either node failure, or network partition. At the other extreme, if I do not care about consistency particularly, and simply want the maximum possible performance, I can read the value from Cassandra using consistency level “ONE”. In this case, a copy is simply taken from a random node amongst those holding the replicas – and in this case, if the data is replicated three times, it does not matter if either of the two other nodes holding copies have failed or been partitioned from us, although now of course it is also possible that such conditions may mean the data I read is stale.
And better still, you are not forced to live in a black and white world. For example, in our particular application important read/write operations typically use consistency level “QUORUM”, which basically means – and I simplify so please research before writing your Cassandra app – that a majority of nodes in the replication factor agree. From our perspective, this provides both a reasonable degree of resilience to node failure and network partition, while still delivering an extremely high level of consistency. In the general case, we typically use the aforementioned consistency level of “ONE”, which provides maximum performance. Nice!
For us this is a very big plus for Cassandra. Not only can we now easily tune our system, we can also design it so that, for example, when a certain number of nodes fail, or the network connecting those nodes falters, our service continues operating in many respects, and only those aspects that require data consistency fail. HBase is not nearly so flexible, and the pursuit of a single approach within the system (CP) reminds me of the wall that exists between SQL developers and the query optimizer – something it is good to get beyond!
In our project then, Cassandra has proven by far the most flexible system, although you may find your brain at first loses consistency when considering your QUORUMs.
When is monolithic better than modular?
An important distinction between Cassandra and HBase, is that while Cassandra comes as a single Java process to be run per node, a complete HBase solution is really comprised of several parts: you have the database process itself, which may run in several modes, a properly configured and operational hadoop HDFS distributed file system setup, and a Zookeeper system to coordinate the different HBase processes. Does this mean then that this is a modularity win for HBase?
Although it is true that such a setup might promise to leverage the collective benefits of different development teams, in terms of systems administration the modularity of HBase cannot be considered a plus. In fact, especially for a smaller startup company, the modularity of HBase might be a big negative. Let me explain…
The underpinnings of HBase are pretty complex, and anyone in doubt of this should read the original Google File System and BigTable papers. Even setting up HBase in pseudo distributed mode on a single server is difficult – so difficult in fact that I did my best to write a guide that takes you past all the various gotchas in the minimum time (see https://ria101.wordpress.com/2010/01/28/setup-hbase-in-pseudo-distributed-mode-and-connect-java-client/ if you wish to try it). As you will see from that guide, getting HBase up and running in this mode actually involves setting up two different system systems manually: first hadoop HDFS, then HBase itself.
Now to the point: the HBase configuration files are monsters, and your setup is vulnerable to the quirks in default network configurations (in which I include both the default networking setups on Ubuntu boxes, and the subtleties of Elastic IPs and internally assigned domain names on EC2). When things go wrong, you will be presented with reams of output in the log file. All the information you need to fix things is in there, and if you are a skilled admin you are going to get through it.
But what happens if it does wrong in production and you need to fix it in a hurry? And what happens if like us, you have a small team of developers with big ambitions and can’t afford a team of crack admins to be on standby 247?
Look seriously, if you’re an advanced db admin wanting to learn a NOSQL system, choose HBase. It’s so damn complex that safe pairs of hands are going to get paid well.
But if you’re a small team just trying to get to the end of the tunnel like us, wait ’til you hear the Gossip…
It’s Gossip talk dude, Gossip!
Cassandra is a completely symmetric system. That is to say, there are no master nodes or region servers like in HBase – every node plays a completely equal role in the system. Rather than any particular node or entity taking on a coordination role, the nodes in your cluster coordinate their activities using a pure P2P communication protocol called “Gossip”.
A description of Gossip and the model using it is beyond this post, but the application of P2P communication within Cassandra has been mathematically modelled to show that, for example, the time taken for the detection of node failure to be propagated across the system, or for a client request to be routed to the node(s) holding the data, occur deterministically within well bounded timeframes that are surprisingly small. Personally I believe that Cassandra represents one of the most exciting uses of P2P technology to date, but of course this idea is not relevant to choosing your NOSQL database!
What is relevant are the real benefits that the Gossip-based architecture gives to Cassandra’s users. Firstly, continuing with the theme of systems administration, life becomes much simpler. For example, adding a new node to the system becomes as simple as bootstrapping its Cassandra process and pointing it at a seed node (an existing node within your cluster). When you think of the underlying complexity of a distributed database running across, potentially, hundreds of nodes, the ability to add new nodes to scale up with such ease is incredible. Furthermore, when things go wrong you no longer have to consider what kind of nodes you are dealing with – everything is the same, which can make debugging a more progressive and repeatable process.
Secondly I have come to the conclusion that Cassandra’s P2P architecture provides it with performance and availability advantages. Load can be very evenly balanced across system nodes thus maximizing the potential for parallelism, the ability to continue seamlessly in the face of network partitions or node failures is greatly increased, and the symmetry between nodes prevents the temporary instabilities in performance that have been reported with HBase when nodes are added and removed (Cassandra boots quickly, and its performance scales smoothly as new nodes are added).
If you are looking for more evidence, you will be interested to read a report from a team with a vested interest in hadoop (i.e. which should favor HBase)…
A report is worth a thousand words. I mean graph right?
The first comprehensive benchmarking of NOSQL systems performed by Yahoo! Research now seems to bear out the general performance advantage that Cassandra enjoys, and on the face of it the figures do currently look very good for Cassandra.
NOTE: in this report HBase performs better than Cassandra only respect of range scans over records. Although the Cassandra team believes they will quickly approach the HBase times, it is also worth pointing out that in a common configuration of Cassandra range scans aren’t even possible. I recommend this to you as being of no matter, because actually in practice you should implement your indexes on top of Cassandra, rather than seek to use range scans. If you are interested in issues relating to range scans and storing indexes in Cassandra, see my post here https://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/).
FINAL POINT OF INTEREST: the Yahoo! Research team behind this paper are trying to get their benchmarking application past their legal department and make it available to the community. If they succeed, and I hope they do, we will be treated to an ongoing speed competition galore, and both HBase and Cassandra will doubtless be improving their times further.
A word on locking, and useful modularity
You may no doubt hear from the HBase camp that their more complex architecture is able to give you things that Cassandra’s P2P architecture can’t. An example that may be raised is the fact that HBase provides the developer with row locking facilities whereas Cassandra cannot (in HBase row locking can be controlled by a region server since data replication occurs within the hadoop layer below, whereas in Cassandra’s P2P architecture all nodes are equal, and therefore none can act as a gateway that takes responsibility for locking replicated data).
However, I would reflect this back as an argument about modularity, which actually favours Cassandra. Cassandra implements the BigTable data model but uses a design where data storage is distributed over symmetric nodes. It does that, and that’s all, but in the most flexible and performant manner possible. But if you need locking, transactions or any other functionality then that can be added to your system in a modular manner – for example we have found scalable locking quite simple to add to our application using Zookeeper and its associated recipes (and other systems such as Hazelcast might also exist for these purposes, although we have not explored them).
By minimizing its function to a narrower purpose, it seems to me that Cassandra manages to implement a design that executes that purpose better – as indicated for example by its selectable CAP tradeoffs. This modularity means you can build a system as you need it – want locking, grab yourself Zookeeper, want to store a full text index, grab yourself Lucandra, and so on. For developers like us, this means we don’t have to take on board more complexity than we actually need, and ultimately provides us with a more flexible route to building the application we want.
MapReduce, don’t mention MapReduce!
One thing Cassandra can’t do well yet is MapReduce! For those not versed in this technology, it is a system for the parallel processing of vast amounts of data, such as the extraction of statistics from millions of pages that have been downloaded from the Web. MapReduce and related systems such as Pig and Hive work well with HBase because it uses hadoop HDFS to store its data, which is the platform these systems were primarily designed to work with. If you need to do that kind of data crunching and analysis, HBase may currently be your best option.
Remember, it’s horses for courses!
Therefore as I finish off my impassioned extolation of Cassandra’s relative virtues, I should point out HBase and Cassandra should not necessarily be viewed as out and out competitors. While it is true that they may often be used for the same purpose, in much the same way as MySQL and Postgres, what I believe will likely emerge is that they will become preferred solutions for different applications. For example, as I understand StumbleUpon has been using HBase with the associated hadoop MapReduce technologies to crunch the vast amounts of data added to its service. Twitter is now using Cassandra for real time interactive community posts. Our needs fit better with the interactive serving and processing of data and so we are using Cassandra, and probably to some degree there you have it.
As a controversial parting shot though the gloves are off for the next point!
NOTE: before I continue I should point out Cassandra has hadoop support in 0.6, so its MapReduce integration may be about to get a whole load better.
O boy, I can’t afford to lose that data…
Perhaps as a result of the early CAP Theorem debates, an impression has grown that data is somehow safer in HBase than Cassandra. This is a final myth that I wish to debunk: in Cassandra, when you write new data it is actually immediately written to the commit log on one of the nodes in the quorum that will hold the replicas, as well as being replicated across the memory of the nodes. This means that if you have a complete power failure across your cluster, you will likely lose little data. Furthermore once in the system, data entropy is prevented using Merkle trees, which further add to the security of your data 🙂
In truth I am not clear exactly what the situation with HBase is – and I will endeavour to update this post as soon as possible with details – but my current understanding is that because hadoop does not yet support append, HBase cannot efficiently regularly flush its modified blocks of data to HDFS (whereupon the new mutations to data will be replicated and persisted). This means that there is a much larger window where your latest changes are vulnerable (if I am wrong, as I may be, please tell me and I will update the post).
So while the Cassandra of Greek mythology had a rather terrible time, the data inside your Cassandra shouldn’t.
NOTE: Wade Arnold points out below that (at the time of writing this) hadoop .21 is about to be released, which will solve this problem with HBase.
If like me, you have moved your enterprise to the NOSQL database camp, you will trying to figure out which NOSQL database will be the best long term option. This is no easy task, because we are still at the beginning of the cycle.
Our choice came down to HBase vs Cassandra. We chose HBase for its vibrant community, faithful Google BigTable-like design and because it can use hadoop for storage (which is already quite mature). Right now we’re quite far down the road with HBase but trying to keep our minds open.
Platforms used: Ubuntu karmic, hadoop-0.20, hbase-0.20.3, Java client
This post will be useful for those wishing to setup HBase on a single server machine in pseudo distributed mode. The advantage of running the database in this mode is that it can then be accessed over the network, for example to allow a bunch of developers to start crafting Java code against it. It is also a nice stepping stone to fully distributed mode, either on your own servers, or somewhere like EC2.
On first reading of the HBase documentation, setting up pseudo distributed mode sounds very simple. The problem is that there a lot of gotchas, which can make life very difficult indeed. Consequently, many people follow a twisted journey to their final destination, and when they finally get there, aren’t sure which of the measures they took were needed, and which were not. This is reflected by a degree of misinformation on the Web, and I will try and present here a reasonably minimal way of getting up and running (that is not even to say that every step I take is absolutely necessary even, but I’ll mention where I’m not sure).
Step 1: Check your IP setup
I believe this is one of the main causes of the weirdness that can happen. So, if you’re on Ubuntu check your hosts file. If you see something like:
127.0.1.1 <server fqn> <server name, as in /etc/hostname>
get rid of the second line, and change to
<server ip> <server fqn> <server name, as in /etc/hostname>
220.127.116.11 hbase.mycompany.com hbase
If you don’t do this, the region servers will resolve their addresses to 127.0.1.1. This information will be stored inside the ZooKeeper instance that HBase runs (the directory and lock manager used by the system to configure and synchronize a running HBase cluster). When manipulating remote HBase data, client code libraries actually connect to ZooKeeper to find the address of the region server maintaining the data. In this case, they will be given 127.0.1.1 which resolves to the client machine. duh!
Step 2: Install Hadoop Packages
Hadoop is quite a big subject – hell the book has over 500 pages. That’s why it is great that there is a company making pre-packaged distributions called cloudera. So my recommendation here is to go with those packages. Perform the following steps, but check the important notes before proceeding:
a/ If you are on Debian, you need to modify your Apt Repository so you can pickup the packages. In the instructions following, if you are running a recent Ubuntu distro like karmic, then configure your cloudera.list to pickup the packages for “jaunty-testing”. Make sure you choose hadoop-0.20 or better.
b/ Install the packages setting up hadoop in standalone mode
c/ Install the package the sets up the pseudo distributed configuration
You should begin the above process with your system in a completely hadoop-free state to be sure the steps will work correctly. For example, if you have an entry for a hadoop user in your /etc/passwds file that is different to the one the config package wants to install, installation of the config package can fail. Furthermore, old items may have the wrong permissions which may cause later steps to fail. To find everything on your system you need to remove, do:
find -name "*hadoop*"
grep -R hadoop
Step 3: Prepare user “hadoop”
We are going to make it possible to login as user hadoop (or rather, do a sudo -i -u hadoop). This will make it possible to easily edit for example configuration files while keeping their owner as hadoop. We are also going to run HBase as user hadoop.
Change the following entry in /etc/passwd
There is a lot of talk on the Web about setting up ssh for the hadoop user, so that hadoop can ssh to different nodes without specifying a password. I’m not sure that this is necessary any more, but the weight of recommendation (including here http://hadoop.apache.org/common/docs/current/quickstart.html) persuadesd me to do this anyway. So next:
# sudo -i -u hadoop
hadoop$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
hadoop$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Step 4: Configure hadoop
Open /etc/hadoop/conf/hadoop-env.sh and make sure your Java home is correctly set e.g.
Step 6: Install the HBase package
You need to download and install the latest version from http://www.apache.org/dyn/closer.cgi/hadoop/hbase/. Proceed as root as follows:
# cd ~
# wget http://apache.mirror.anlx.net/hadoop/hbase/hbase-0.20.3/hbase-0.20.3.tar.gz
# tar -xzf hbase-0.20.3.tar.gz
# mv hbase-0.20.3 /usr/lib
# cd /etc/alternatives
# ln -s /usr/lib/hbase-0.20.3 hbase-lib
# cd /usr/lib
# ln -s /etc/alternatives/hbase-lib hbase
# chown -R hadoop:hadoop hbase
Step 7: Configure HBase
Now, login as hadoop and go to its conf directory
# sudo -i -u hadoop
hadoop$ cd /usr/lib/hbase/conf
a/ Then update the Java classpath in hbase-env.sh, just like you did for hadoop-env.sh
b/ Inside hbase-site.xml, configure hbase.roodir and hbase.master. The result should look something like below, notes following:
<description>The directory shared by region servers.
Should be fully-qualified to include the filesystem to use.
<description>The host and port that the HBase master runs at.
1/ hbase.rootdir must specify a host and port number exactly the same as specified by fs.default.name inside /etc/hadoop/conf/core-site.xml. Basically it tells HBase where to find the distributed file system.
2/ hbase.master specifies the interface that the HBase master, or rather the Zookeeper instance HBase will start (more later) will listen on. It must be externally addressable for clients to connect. This is a good point to double-check the IP setup step at the beginning of this post.
Step 8: Start up HBase
If you have not already started hadoop, then start it e.g. as described by cloudera:
for service in /etc/init.d/hadoop-0.20-*
sudo $service start
Next, start HBase as the hadoop user:
Step 9: Check HBase is up and running
Open up the HBase Web UI e.g. http://hbase.mycompany.com:60010
Step 10: Open HBase shell, create a table and column family
You need to login to hbase, and create a table and column family that will be used by the Java client example:
me$ /usr/lib/hbase/bin/hbase shell
hbase(main):001:0> create "myLittleHBaseTable", "myLittleFamily"
Step 11: Create Java client project
Create your sample Java client project with code, as described for example at http://hadoop.apache.org/hbase/docs/r0.20.3/api/index.html.
Next, you need to add a special hbase-site.xml file to its classpath. It should specify the ZooKeeper quorum (the minimum running ZooKeeper instances, in this case, your hbase server). The client contacts ZooKeeper to find out where the master, region servers etc are (ZooKeeper acts as a directory for clients, but it also acts a definitive description of the HBase cluster and synchronization system for its own nodes). Based upon the foregoing, the contents will look something like:
<description>The host and port that the HBase master runs at.
A value of 'local' runs the master and a regionserver in
a single process.
Step 12: Yeeeha. Build your Java client, and debug
Watch your e.g. NetBeans Ouput window for a trace of what is hopefully happening… welcome to the world of HBase