Archive for the ‘Indexing’ Category
When building a Cassandra cluster, the “key” question (sorry, that’s weak) is whether to use the RandomPartitioner (RP), or the OrderPreservingPartitioner (OPP). These control how your data is distributed over your nodes. Once you have chosen your partitioner, you cannot change without wiping your data, so think carefully!
For Cassandra newbies, like me and my team of HBasers wanting to try a quick port of our project (more on why in another post) nailing the exact issues is quite daunting. So here is a quick summary.
What OPP gives you
Using OPP provides you with two obvious advantages over RP:
1. You can perform range slices. That is you can scan over ranges of your rows as though you were moving a cursor through a traditional index. For example, if you are using user ids as your keys, you could scan over the rows for users whose names begin with J e.g. jake, james, jamie etc
2. You can store real time full text indexes inside Cassandra, which are built using the aforementioned feature e.g. see Lucandra
3. If you screw up, you can scan over your data to recover/delete orphaned keys
***UPDATE*** Since v6 you *can* now scan your keys when using RP, although obviously not in any particular order. Typically you request a page of rows starting with the empty/”" key, and then use the apparently random end key from the page as the start key when you request another page. At the time of writing, this method only seems to work with KeyRange not TokenRing. If you are using Java to access Cassandra read the change log for v0.804 of Pelops.
Given that Web applications typically need/benefit from the above, the question is why would you *not* use OPP. The answer is a nuanced one about load balancing.
The problem with OPP
With both RP and OPP, by default Cassandra will tend to evenly distribute individual keys and their corresponding rows over the nodes in the cluster. The default algorithm is nice and simple: every time you add a new node, it will assign a range of keys to that node such that it takes responsibility for half the keys stored on the node that currently stores the most keys (more on options for overriding the default behaviour later).
The nuance is, that this simple default algorithm will tend to lead to good load balancing when RP is used, but not necessarily when OPP is used.
The reason is that although the algorithm succeeds in assigning key ranges such that as your cluster scales nodes receive roughly similar numbers of keys, with OPP on any given node those keys are unlikely to be drawn equally from the different column families present within your database…
If the distribution of keys used by individual column families is different, their sets of keys will not fall evenly across the ranges assigned to nodes. Thus nodes will end up storing preponderances of keys (and the associated data) corresponding to one column family or another. If as is likely column families store differing quantities of data with their keys, or store data accessed according to differing usage patterns, then some nodes will end up with disproportionately more data than others, or serving more “hot” data than others. <yikes!>
By contrast, when using RP the distribution of the keys occuring within individual column families does not matter. This is because an MD5 hash of keys is used as the “real” key by the system for the purposes of locating the key and data on nodes (the MD5 hashes randomly map any input key to a point in the 0..2**127 range). The result is that the keys from each individual column family are spread evenly across the ranges/nodes, meaning that data and access corresponding to those column families is evenly distributed across the cluster.
If you must have OPP
You may quite reasonably feel that you must have the range scan features that come with OPP, for example because you want to use Lucandra. The question then becomes how you can you ameliorate the aforementioned problems with load balancing.
The best you can do, is to identify the data upon which you do not need to perform range scans. This data can then be randomly distributed across your cluster using a simple idiom where the key is actually written as <MD5(ROWKEY)>.<ROWKEY>
But be clear, the items whose keys must be undecorated (because you wish to perform range scans over them), may still not map evenly onto the key ranges held by the nodes. The only recourse you have then, is to consider manually specifying the key ranges assigned to nodes. This is typically done when you bootstrap a new node, but you can also rebalance an existing cluster by simply decomissioning nodes, deleting their data, and then bootstrapping them back in. To do this safely, you obviously have to do this one at a time, but then I’m sure I didn’t have to tell you that…
You can see where this is going now right? You’ve just made a whole load of work for yourself, and anyway, even if you have the time, if you have lots of different column families with widely differing key distributions then getting load balancing right is going to be a nightmare.
This is the basic reason that fully seasoned Cassandra heads, in my experience, seem to prefer RD *unless* a mono use setup is proposed, for example where a cluster is used simply to store a full-text index with Lucandra.
If you have a database with a seriously heterogeneous set of column families, and need range scans, you might now be thinking you should actually be using HBase, which is designed for this. That would not be a bad choice (!), but there are good reasons for hanging with Cassandra if you can, which I will cover in a future post. Read on…
If you must use RP (very likely)
So having delved a little more deeply into the implications of OPP, you decide you really should go with RP. But, what to do with those indexes you need?
Well, first of all there is a really simple if brutal solution: simply store your index inside a single column family row as a series of columns. Since Cassandra can in principle cope with millions of columns, this is perfectly possible. Although it is true each index won’t be distributed across your whole cluster, the load will at the least be distributed across the nodes holding the replicas. If you use a typical replication factor (RF) of 3 the load associated with each index will be shared by 3 nodes etc.
In the vast majority of cases, this will be enough, and it will be sufficient that the rest of your data is properly balanced across your cluster.
But, I hear you saying, this is too brutal. Your index is too massive to fit on 3 nodes, is extremely hot and this just won’t work. You moved to Cassandra because you want your load distributed across your entire cluster. Period.
This is a perfectly reasonably point of view.
The only solution in this case is to build an index system over the top of the simple hashmap provided. We are taking this approach, and it will be elaborated with some sample code in a later post.
Basic indexing strategy for RP
For those that need to know the basic strategy now, here it is: you need to start off with the simple approach where you store your entire index using columns under a single key. As the number of columns grows past some threshold you define, the columns should be split such that half the entries are migrated to a new key/row. Thus the index is split across the cluster evenly.
Each range can be stored under a key named in a predictable way, for example <INDEX>.<SPLIT NO.> The start and end index entries stored in each split should themselves be stored in a dedicated column family that is used to record index meta information using the same key name, ensuring that the meta information is also distributed.
You can then progressively test the existence of splits simply by attempting to open the key for the meta that would be used to describe the split. If you can retrieve the meta information, you know that the split also exists. It won’t be necessary to cache this information to make the process reasonably performant – Cassandra already caches data in memory, and also uses Bloom filters to determine whether or not a requested row exists (Bloom filters enable a Cassandra node to rapidly determine whether it holds a key without traversing its list of keys).
There you have it, an index offering range scans fully distributed over your cluster!
Full text search sanity check
Implementing a full text index will of course involve more work than a simple left-side/ISAM style index, although the principles are the same. Given the existence of Lucandra though, I would suggest that before proceeding to create your full text index using the described approach, you first examine another possibility: running your full text searches off a dedicated cluster.
If you are running in the cloud, for example on EC2 or Rackspace Cloud, you can start your dedicated full text search cluster at low cost on small instances that can be scaled up if necessary later. Otherwise, consider virtualization or configuring Cassandra to run two clusters in parallel on the same nodes (more on this possibility in a later post).
The beauty of open source is that many problems have already been solved for you, and Lucandra is too good an opportunity to miss is you need full text search on Cassandra.