Cassandra: RandomPartitioner vs OrderPreservingPartitioner

When building a Cassandra cluster, the “key” question (sorry, that’s weak) is whether to use the RandomPartitioner (RP) or the OrderPreservingPartitioner (OPP). These control how your data is distributed over your nodes. Once you have chosen your partitioner, you cannot change it without wiping your data, so think carefully!

For Cassandra newbies like me and my team of HBasers, wanting to try a quick port of our project (more on why in another post), nailing the exact issues is quite daunting. So here is a quick summary.

What OPP gives you

Using OPP provides you with three obvious advantages over RP:
1. You can perform range slices. That is, you can scan over ranges of your rows as though you were moving a cursor through a traditional index. For example, if you are using user ids as your keys, you could scan over the rows for users whose names begin with J, e.g. jake, james, jamie etc.
2. You can store real time full text indexes inside Cassandra, which are built using the aforementioned feature, e.g. see Lucandra
3. If you screw up, you can scan over your data to recover/delete orphaned keys

***UPDATE*** Since v0.6 you *can* now scan your keys when using RP, although obviously not in any particular order. Typically you request a page of rows starting with the empty/”” key, and then use the apparently random end key from the page as the start key when you request another page (see the sketch below). At the time of writing, this method only seems to work with KeyRange, not TokenRange. If you are using Java to access Cassandra, read the change log for v0.804 of Pelops.
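For illustration, here is roughly what that paging idiom looks like in Java. The fetchRowKeysPage helper is hypothetical: it stands in for whatever key-range slice call your client library exposes (e.g. Pelops’ range query support) and is assumed to return up to pageSize row keys starting at the given key.

import java.util.List;

public class RowKeyScan {

    // Hypothetical helper: wraps your Cassandra client's key-range slice call and
    // returns up to pageSize row keys starting at (and including) startKey.
    static List<String> fetchRowKeysPage(String startKey, int pageSize) {
        throw new UnsupportedOperationException("wire this up to your Cassandra client");
    }

    public static void main(String[] args) {
        String startKey = "";        // begin the scan from the empty key
        final int pageSize = 100;
        while (true) {
            List<String> page = fetchRowKeysPage(startKey, pageSize);
            for (String key : page) {
                if (key.equals(startKey)) continue;   // the start key is returned again; skip it
                // process the (apparently random) key here
            }
            if (page.size() < pageSize) break;        // a short page means no more rows
            startKey = page.get(page.size() - 1);     // last key of this page starts the next one
        }
    }
}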

Given that Web applications typically need/benefit from the above, the question is why would you *not* use OPP. The answer is a nuanced one about load balancing.

The problem with OPP

With both RP and OPP, by default Cassandra will tend to evenly distribute individual keys and their corresponding rows over the nodes in the cluster. The default algorithm is nice and simple: every time you add a new node, it will assign a range of keys to that node such that it takes responsibility for half the keys stored on the node that currently stores the most keys (more on options for overriding the default behaviour later).

The nuance is, that this simple default algorithm will tend to lead to good load balancing when RP is used, but not necessarily when OPP is used.

The reason is that although the algorithm succeeds in assigning key ranges such that, as your cluster scales, nodes receive roughly similar numbers of keys, with OPP the keys on any given node are unlikely to be drawn equally from the different column families present within your database…

If the distribution of keys used by individual column families is different, their sets of keys will not fall evenly across the ranges assigned to nodes. Thus nodes will end up storing preponderances of keys (and the associated data) corresponding to one column family or another. If, as is likely, column families store differing quantities of data with their keys, or store data accessed according to differing usage patterns, then some nodes will end up with disproportionately more data than others, or serving more “hot” data than others. <yikes!>

By contrast, when using RP the distribution of the keys occurring within individual column families does not matter. This is because an MD5 hash of keys is used as the “real” key by the system for the purposes of locating the key and data on nodes (the MD5 hashes randomly map any input key to a point in the 0..2**127 range). The result is that the keys from each individual column family are spread evenly across the ranges/nodes, meaning that data and access corresponding to those column families is evenly distributed across the cluster.
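To make that concrete, here is a rough illustration (not Cassandra’s actual code) of how a key is mapped onto the ring: the MD5 digest is simply interpreted as a large integer.

import java.math.BigInteger;
import java.security.MessageDigest;

public class TokenDemo {

    // Roughly how RP places a key: hash it with MD5 and treat the digest as a big integer.
    static BigInteger token(String rowKey) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        return new BigInteger(md5.digest(rowKey.getBytes("UTF-8"))).abs();
    }

    public static void main(String[] args) throws Exception {
        // Lexicographically adjacent keys land at wildly different points on the ring
        System.out.println("jake  -> " + token("jake"));
        System.out.println("james -> " + token("james"));
    }
}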

If you must have OPP

You may quite reasonably feel that you must have the range scan features that come with OPP, for example because you want to use Lucandra. The question then becomes how you can ameliorate the aforementioned problems with load balancing.

The best you can do is to identify the data upon which you do not need to perform range scans. This data can then be randomly distributed across your cluster using a simple idiom where the key is actually written as <MD5(ROWKEY)>.<ROWKEY>
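A minimal sketch of that decoration idiom (the class and method names here are mine, not part of any client library):

import java.math.BigInteger;
import java.security.MessageDigest;

public class KeyDecoration {

    // Prefix the row key with its MD5 hex digest so that OPP spreads it pseudo-randomly,
    // while the original key remains recoverable after the first '.'
    static String decorate(String rowKey) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        String hex = String.format("%032x",
                new BigInteger(1, md5.digest(rowKey.getBytes("UTF-8"))));
        return hex + "." + rowKey;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(decorate("jake"));   // prints "<md5 hex>.jake"
    }
}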

But be clear, the items whose keys must be undecorated (because you wish to perform range scans over them) may still not map evenly onto the key ranges held by the nodes. The only recourse you have then is to consider manually specifying the key ranges assigned to nodes. This is typically done when you bootstrap a new node, but you can also rebalance an existing cluster by simply decommissioning nodes, deleting their data, and then bootstrapping them back in. To do this safely, you obviously have to do this one at a time, but then I’m sure I didn’t have to tell you that…

You can see where this is going now, right? You’ve just made a whole load of work for yourself, and anyway, even if you have the time, if you have lots of different column families with widely differing key distributions then getting load balancing right is going to be a nightmare.

This is the basic reason that fully seasoned Cassandra heads, in my experience, seem to prefer RP *unless* a mono-use setup is proposed, for example where a cluster is used simply to store a full-text index with Lucandra.

If you have a database with a seriously heterogeneous set of column families, and need range scans, you might now be thinking you should actually be using HBase, which is designed for this. That would not be a bad choice (!), but there are good reasons for hanging with Cassandra if you can, which I will cover in a future post. Read on…

If you must use RP (very likely)

So having delved a little more deeply into the implications of OPP, you decide you really should go with RP. But, what to do with those indexes you need?

Well, first of all there is a really simple if brutal solution: simply store your index inside a single column family row as a series of columns. Since Cassandra can in principle cope with millions of columns, this is perfectly possible. Although it is true each index won’t be distributed across your whole cluster, the load will at the least be distributed across the nodes holding the replicas. If you use a typical replication factor (RF) of 3 the load associated with each index will be shared by 3 nodes etc.

In the vast majority of cases, this will be enough, and it will be sufficient that the rest of your data is properly balanced across your cluster.

But, I hear you saying, this is too brutal. Your index is too massive to fit on 3 nodes, is extremely hot and this just won’t work. You moved to Cassandra because you want your load distributed across your entire cluster. Period.

This is a perfectly reasonable point of view.

The only solution in this case is to build an index system over the top of the simple hashmap provided. We are taking this approach, and it will be elaborated with some sample code in a later post.

Basic indexing strategy for RP

For those that need to know the basic strategy now, here it is: you need to start off with the simple approach where you store your entire index using columns under a single key. As the number of columns grows past some threshold you define, the columns should be split such that half the entries are migrated to a new key/row. Thus the index is split across the cluster evenly.

Each range can be stored under a key named in a predictable way, for example <INDEX>.<SPLIT NO.> The start and end index entries stored in each split should themselves be stored in a dedicated column family that is used to record index meta information using the same key name, ensuring that the meta information is also distributed.

You can then progressively test the existence of splits simply by attempting to open the key for the meta that would be used to describe the split. If you can retrieve the meta information, you know that the split also exists. It won’t be necessary to cache this information to make the process reasonably performant – Cassandra already caches data in memory, and also uses Bloom filters to determine whether or not a requested row exists (Bloom filters enable a Cassandra node to rapidly determine whether it holds a key without traversing its list of keys).
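To make the lookup side of this concrete, here is a toy sketch. The map below stands in for the meta column family described above (in Cassandra each entry would be a row keyed <INDEX>.<SPLIT NO.>), and the class and method names are mine, not from any library.

import java.util.HashMap;
import java.util.Map;

public class SplitIndexLookup {

    // split key -> {first entry, last entry} held by that split (the "meta" column family)
    static final Map<String, String[]> META = new HashMap<>();

    // Walk split numbers until the meta row is missing; return the key of the split whose
    // [first, last] range covers the wanted entry, or null if no split covers it.
    static String findSplitFor(String indexName, String entry) {
        for (int splitNo = 0; ; splitNo++) {
            String splitKey = indexName + "." + splitNo;
            String[] range = META.get(splitKey);            // in Cassandra: try to read the meta row
            if (range == null) return null;                 // no such split, so we are done
            if (entry.compareTo(range[0]) >= 0 && entry.compareTo(range[1]) <= 0) {
                return splitKey;                            // the index columns live under this row key
            }
        }
    }

    public static void main(String[] args) {
        // Two splits of a hypothetical "username" index
        META.put("username.0", new String[]{"aaron", "karen"});
        META.put("username.1", new String[]{"kate", "zoe"});
        System.out.println(findSplitFor("username", "james"));   // username.0
        System.out.println(findSplitFor("username", "mary"));    // username.1
    }
}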

There you have it, an index offering range scans fully distributed over your cluster!

Full text search sanity check

Implementing a full text index will of course involve more work than a simple left-side/ISAM style index, although the principles are the same. Given the existence of Lucandra though, I would suggest that before proceeding to create your full text index using the described approach, you first examine another possibility: running your full text searches off a dedicated cluster.

If you are running in the cloud, for example on EC2 or Rackspace Cloud, you can start your dedicated full text search cluster at low cost on small instances that can be scaled up if necessary later. Otherwise, consider virtualization or configuring Cassandra to run two clusters in parallel on the same nodes (more on this possibility in a later post).

The beauty of open source is that many problems have already been solved for you, and Lucandra is too good an opportunity to miss if you need full text search on Cassandra.

Written by dominicwilliams

February 22, 2010 at 10:56 pm

Quick install HBase in “pseudo distributed” mode and connect from Java

Platforms used: Ubuntu karmic, hadoop-0.20, hbase-0.20.3, Java client

This post will be useful for those wishing to setup HBase on a single server machine in pseudo distributed mode. The advantage of running the database in this mode is that it can then be accessed over the network, for example to allow a bunch of developers to start crafting Java code against it. It is also a nice stepping stone to fully distributed mode, either on your own servers, or somewhere like EC2.

On first reading of the HBase documentation, setting up pseudo distributed mode sounds very simple. The problem is that there are a lot of gotchas, which can make life very difficult indeed. Consequently, many people follow a twisted journey to their final destination, and when they finally get there, aren’t sure which of the measures they took were needed, and which were not. This is reflected by a degree of misinformation on the Web, and I will try to present here a reasonably minimal way of getting up and running (that is not to say that every step I take is absolutely necessary, but I’ll mention where I’m not sure).

Step 1: Check your IP setup
I believe this is one of the main causes of the weirdness that can happen. So, if you’re on Ubuntu check your hosts file. If you see something like:
127.0.0.1 localhost
127.0.1.1 <server fqn> <server name, as in /etc/hostname>

get rid of the second line, and change to
127.0.0.1 localhost
<server ip> <server fqn> <server name, as in /etc/hostname>

e.g.
127.0.0.1 localhost
23.201.99.100 hbase.mycompany.com hbase

If you don’t do this, the region servers will resolve their addresses to 127.0.1.1. This information will be stored inside the ZooKeeper instance that HBase runs (the directory and lock manager used by the system to configure and synchronize a running HBase cluster). When manipulating remote HBase data, client code libraries actually connect to ZooKeeper to find the address of the region server maintaining the data. In this case, they will be given 127.0.1.1 which resolves to the client machine. duh!

Step 2: Install Hadoop Packages
Hadoop is quite a big subject – hell, the book has over 500 pages. That’s why it is great that there is a company making pre-packaged distributions called Cloudera. So my recommendation here is to go with their packages. Perform the following steps, but check the important notes before proceeding:
a/ If you are on Debian or Ubuntu, you need to add the Cloudera apt repository so you can pick up the packages. In the instructions that follow, if you are running a recent Ubuntu distro like karmic, then configure your cloudera.list to pick up the packages for “jaunty-testing”. Make sure you choose hadoop-0.20 or better.
Instructions http://archive.cloudera.com/docs/_apt.html
b/ Install the packages setting up hadoop in standalone mode
Instructions http://archive.cloudera.com/docs/_installing_hadoop_standalone_mode.html
c/ Install the package that sets up the pseudo distributed configuration
Instructions http://archive.cloudera.com/docs/cdh2-pseudo-distributed.html
IMPORTANT NOTES
You should begin the above process with your system in a completely hadoop-free state to be sure the steps will work correctly. For example, if you have an entry for a hadoop user in your /etc/passwd file that is different to the one the config package wants to install, installation of the config package can fail. Furthermore, old items may have the wrong permissions, which may cause later steps to fail. To find everything on your system you need to remove, do:

cd /
find -name "*hadoop*"

and

cd /etc
grep -R hadoop .

Step 3: Prepare user “hadoop”
We are going to make it possible to login as user hadoop (or rather, do a sudo -i -u hadoop). This will make it possible to easily edit for example configuration files while keeping their owner as hadoop. We are also going to run HBase as user hadoop.
Change the following entry in /etc/passwd

hadoop:x:104:112:Hadoop User,,,:/var/run/hadoop-0.20:/bin/false

to

hadoop:x:104:112:Hadoop User,,,:/var/run/hadoop-0.20:/bin/bash

There is a lot of talk on the Web about setting up ssh for the hadoop user, so that hadoop can ssh to different nodes without specifying a password. I’m not sure that this is necessary any more, but the weight of recommendation (including here http://hadoop.apache.org/common/docs/current/quickstart.html) persuaded me to do this anyway. So next:

# sudo -i -u hadoop
hadoop$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
hadoop$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Step 4: Configure hadoop
Open /etc/hadoop/conf/hadoop-env.sh and make sure your Java home is correctly set e.g.

export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.15

Step 5: Test hadoop
Check out the Web-based admin interfaces (the HDFS NameNode UI on port 50070 and the MapReduce JobTracker UI on port 50030), e.g.
http://hbase.mycompany.com:50070/
and
http://hbase.mycompany.com:50030/

Step 6: Install the HBase package
You need to download and install the latest version from http://www.apache.org/dyn/closer.cgi/hadoop/hbase/. Proceed as root as follows:

# cd ~
# wget http://apache.mirror.anlx.net/hadoop/hbase/hbase-0.20.3/hbase-0.20.3.tar.gz
# tar -xzf hbase-0.20.3.tar.gz
# mv hbase-0.20.3 /usr/lib
# cd /etc/alternatives
# ln -s /usr/lib/hbase-0.20.3 hbase-lib
# cd /usr/lib
# ln -s /etc/alternatives/hbase-lib hbase
# chown -R hadoop:hadoop hbase

Step 7: Configure HBase
Now, login as hadoop and go to its conf directory

# sudo -i -u hadoop
hadoop$ cd /usr/lib/hbase/conf

a/ Then update the Java classpath in hbase-env.sh, just like you did for hadoop-env.sh
b/ Inside hbase-site.xml, configure hbase.rootdir and hbase.master. The result should look something like below, notes following:

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:8020/hbase</value>
    <description>The directory shared by region servers.
      Should be fully-qualified to include the filesystem to use.
      E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR
    </description>
  </property>

  <property>
    <name>hbase.master</name>
    <value>23.201.99.100:60000</value>
    <description>The host and port that the HBase master runs at.
    </description>
  </property>
</configuration>

NOTES
1/ hbase.rootdir must specify a host and port number exactly the same as specified by fs.default.name inside /etc/hadoop/conf/core-site.xml. Basically it tells HBase where to find the distributed file system.
2/ hbase.master specifies the interface that the HBase master, or rather the ZooKeeper instance that HBase will start (more later), will listen on. It must be externally addressable for clients to connect. This is a good point to double-check the IP setup step at the beginning of this post.
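For comparison, the matching entry in /etc/hadoop/conf/core-site.xml (from the Cloudera pseudo-distributed package) will look something like the following; this is the host and port that hbase.rootdir must agree with:

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:8020</value>
</property>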

Step 8: Start up HBase
If you have not already started hadoop, then start it e.g. as described by cloudera:

for service in /etc/init.d/hadoop-0.20-*
do
sudo $service start
done

Next, start HBase as the hadoop user:

hadoop$ /usr/lib/hbase/bin/start-hbase.sh

Step 9: Check HBase is up and running
Open up the HBase Web UI e.g. http://hbase.mycompany.com:60010

Step 10: Open HBase shell, create a table and column family
You need to login to hbase, and create a table and column family that will be used by the Java client example:

me$ /usr/lib/hbase/bin/hbase shell
hbase(main):001:0> create "myLittleHBaseTable", "myLittleFamily"

Step 11: Create Java client project
Create your sample Java client project with code, as described for example at http://hadoop.apache.org/hbase/docs/r0.20.3/api/index.html.
Next, you need to add a special hbase-site.xml file to its classpath. It should specify the ZooKeeper quorum (the list of ZooKeeper servers; in this case, just your HBase server). The client contacts ZooKeeper to find out where the master, region servers etc are (ZooKeeper acts as a directory for clients, but it also acts as the definitive description of the HBase cluster and the synchronization system for its own nodes). Based upon the foregoing, the contents will look something like:

<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>hbase.mycompany.com</value>
    <description>Comma separated list of servers in the ZooKeeper quorum.
      Clients contact these servers to locate the rest of the cluster.
    </description>
  </property>
</configuration>
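With that file on the classpath, a minimal client in the spirit of the 0.20 API example linked above (using the table and column family created in Step 10) looks roughly like this:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml (and so the ZooKeeper quorum) from the classpath
        HBaseConfiguration config = new HBaseConfiguration();
        HTable table = new HTable(config, "myLittleHBaseTable");

        // Write a value into the column family created earlier
        Put put = new Put(Bytes.toBytes("myLittleRow"));
        put.add(Bytes.toBytes("myLittleFamily"), Bytes.toBytes("someQualifier"),
                Bytes.toBytes("Some Value"));
        table.put(put);

        // Read it back
        Get get = new Get(Bytes.toBytes("myLittleRow"));
        Result result = table.get(get);
        byte[] value = result.getValue(Bytes.toBytes("myLittleFamily"),
                Bytes.toBytes("someQualifier"));
        System.out.println("GET: " + Bytes.toString(value));
    }
}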

Step 12: Yeeeha. Build your Java client, and debug
Watch your IDE’s output window (e.g. in NetBeans) for a trace of what is hopefully happening… welcome to the world of HBase

Written by dominicwilliams

January 28, 2010 at 12:21 pm