Cassandra: RandomPartitioner vs OrderPreservingPartitioner

When building a Cassandra cluster, the “key” question (sorry, that’s weak) is whether to use the RandomPartitioner (RP) or the OrderPreservingPartitioner (OPP). These control how your data is distributed over your nodes. Once you have chosen your partitioner, you cannot change it without wiping your data, so think carefully!

For Cassandra newbies like me and my team of HBasers, wanting to try a quick port of our project (more on why in another post), nailing the exact issues is quite daunting. So here is a quick summary.

What OPP gives you

Using OPP provides you with three obvious advantages over RP:
1. You can perform range slices. That is, you can scan over ranges of your rows as though you were moving a cursor through a traditional index. For example, if you are using user ids as your keys, you could scan over the rows for users whose names begin with J, e.g. jake, james, jamie, etc.
2. You can store real time full text indexes inside Cassandra, which are built using the aforementioned feature e.g. see Lucandra
3. If you screw up, you can scan over your data to recover/delete orphaned keys

***UPDATE*** Since v0.6 you *can* now scan your keys when using RP, although obviously not in any particular order. Typically you request a page of rows starting with the empty (“”) key, and then use the apparently random end key of that page as the start key when you request the next page. At the time of writing, this method only seems to work when you specify the range by start/end keys (KeyRange), not by start/end tokens. If you are using Java to access Cassandra, read the change log for v0.804 of Pelops.
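
To make the idiom concrete, here is a minimal paging sketch against the raw 0.6 Thrift API. It assumes an already-connected Cassandra.Client (transport setup omitted), and the keyspace/column family names are hypothetical. Note that the start key is inclusive, so every page after the first repeats its start row and the client must skip it:

```java
import java.util.List;
import org.apache.cassandra.thrift.*;

// Minimal paging sketch against the 0.6 Thrift API. Assumes an already
// connected Cassandra.Client; "MyKeyspace"/"Users" are hypothetical names.
public class KeyPager {
    static final int PAGE_SIZE = 100;

    static void pageAllKeys(Cassandra.Client client) throws Exception {
        // We only want keys, so ask for at most one column per row.
        SlicePredicate predicate = new SlicePredicate();
        predicate.setSlice_range(new SliceRange(new byte[0], new byte[0], false, 1));
        ColumnParent parent = new ColumnParent("Users");

        String startKey = "";          // first page starts at the empty key
        boolean firstPage = true;
        while (true) {
            KeyRange range = new KeyRange(PAGE_SIZE);
            range.setStart_key(startKey);
            range.setEnd_key("");
            List<KeySlice> page = client.get_range_slices(
                    "MyKeyspace", parent, predicate, range, ConsistencyLevel.ONE);

            // The start key is inclusive, so pages after the first repeat
            // their start row; skip it.
            int from = firstPage ? 0 : 1;
            for (int i = from; i < page.size(); i++)
                System.out.println(page.get(i).getKey());

            if (page.size() < PAGE_SIZE) break;   // short page: we are done
            startKey = page.get(page.size() - 1).getKey();
            firstPage = false;
        }
    }
}
```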

Given that Web applications typically need/benefit from the above, the question is why you would *not* use OPP. The answer is a nuanced one about load balancing.

The problem with OPP

With both RP and OPP, by default Cassandra will tend to evenly distribute individual keys and their corresponding rows over the nodes in the cluster. The default algorithm is nice and simple: every time you add a new node, it will assign a range of keys to that node such that it takes responsibility for half the keys stored on the node that currently stores the most keys (more on options for overriding the default behaviour later).

The nuance is that this simple default algorithm will tend to lead to good load balancing when RP is used, but not necessarily when OPP is used.

The reason is that although the algorithm succeeds in assigning key ranges such that, as your cluster scales, nodes receive roughly similar numbers of keys, with OPP the keys held by any given node are unlikely to be drawn equally from the different column families present within your database…

If the distribution of keys used by individual column families is different, their sets of keys will not fall evenly across the ranges assigned to nodes. Thus nodes will end up storing preponderances of keys (and the associated data) corresponding to one column family or another. If, as is likely, column families store differing quantities of data with their keys, or store data accessed according to differing usage patterns, then some nodes will end up with disproportionately more data than others, or serving more “hot” data than others. <yikes!>

By contrast, when using RP the distribution of the keys occurring within individual column families does not matter. This is because an MD5 hash of each key is used as the “real” key by the system for the purposes of locating the key and data on nodes (the MD5 hashes randomly map any input key to a point in the 0..2**127 range). The result is that the keys from each individual column family are spread evenly across the ranges/nodes, meaning that data and access corresponding to those column families is evenly distributed across the cluster.
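
For the curious, this is roughly what RP does under the hood. Here is a minimal sketch that mirrors (rather than reuses) Cassandra’s own token calculation:

```java
import java.math.BigInteger;
import java.security.MessageDigest;

// Sketch of how RP places a key on the ring: MD5 the raw key bytes and read
// the digest as a non-negative integer in the 0..2**127 range (this mirrors
// Cassandra's BigIntegerToken; written from scratch here for illustration).
public class TokenSketch {
    static BigInteger token(String rowKey) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                                     .digest(rowKey.getBytes("UTF-8"));
        return new BigInteger(digest).abs();
    }

    public static void main(String[] args) throws Exception {
        // Lexically adjacent keys land at unrelated points on the ring,
        // which is why per-column-family key distributions stop mattering.
        System.out.println(token("jake"));
        System.out.println(token("james"));
    }
}
```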

If you must have OPP

You may quite reasonably feel that you must have the range scan features that come with OPP, for example because you want to use Lucandra. The question then becomes how you can ameliorate the aforementioned problems with load balancing.

The best you can do is to identify the data upon which you do not need to perform range scans. This data can then be randomly distributed across your cluster using a simple idiom where the key is actually written as <MD5(ROWKEY)>.<ROWKEY>.
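
Here is a minimal sketch of that decoration idiom; the helper names are mine for illustration, not from any client library:

```java
import java.math.BigInteger;
import java.security.MessageDigest;

// Sketch of the key-decoration idiom: prefix each row key with its own MD5
// so that keys scatter evenly across an OPP ring, while the original key
// remains recoverable after the first ".".
public class KeyDecorator {
    static String decorate(String rowKey) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        String hex = String.format("%032x",
                new BigInteger(1, md5.digest(rowKey.getBytes("UTF-8"))));
        return hex + "." + rowKey; // <MD5(ROWKEY)>.<ROWKEY>
    }

    public static void main(String[] args) throws Exception {
        // Prints the 32-char hex digest, a dot, then the original key.
        System.out.println(decorate("james"));
    }
}
```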

But be clear: the items whose keys must be undecorated (because you wish to perform range scans over them) may still not map evenly onto the key ranges held by the nodes. The only recourse you have then is to consider manually specifying the key ranges assigned to nodes. This is typically done when you bootstrap a new node, but you can also rebalance an existing cluster by simply decommissioning nodes, deleting their data, and then bootstrapping them back in. To do this safely, you obviously have to do it one node at a time, but then I’m sure I didn’t have to tell you that…

You can see where this is going now, right? You’ve just made a whole load of work for yourself, and anyway, even if you have the time, if you have lots of different column families with widely differing key distributions then getting load balancing right is going to be a nightmare.

This is the basic reason that fully seasoned Cassandra heads, in my experience, seem to prefer RP *unless* a mono-use setup is proposed, for example where a cluster is used simply to store a full-text index with Lucandra.

If you have a database with a seriously heterogeneous set of column families, and need range scans, you might now be thinking you should actually be using HBase, which is designed for this. That would not be a bad choice (!), but there are good reasons for hanging with Cassandra if you can, which I will cover in a future post. Read on…

If you must use RP (very likely)

So having delved a little more deeply into the implications of OPP, you decide you really should go with RP. But, what to do with those indexes you need?

Well, first of all there is a really simple, if brutal, solution: simply store your index inside a single column family row as a series of columns. Since Cassandra can in principle cope with millions of columns, this is perfectly possible. Although it is true that each index won’t be distributed across your whole cluster, the load will at least be distributed across the nodes holding the replicas. If you use a typical replication factor (RF) of 3, the load associated with each index will be shared by 3 nodes etc.
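
For illustration, here is a hedged fragment against the 0.6 Thrift API showing one index entry being written as one column under the single index row. It assumes an open Cassandra.Client, and the keyspace, column family, and row key names are all hypothetical:

```java
import org.apache.cassandra.thrift.*;

// Fragment against the 0.6 Thrift API, assuming an open Cassandra.Client.
// One index entry = one column: the column name is the indexed value and
// the column value is the key of the row being pointed at.
public class SingleRowIndex {
    static void addEntry(Cassandra.Client client, String indexedValue,
                         String targetRowKey) throws Exception {
        ColumnPath path = new ColumnPath("UserNameIndex");
        path.setColumn(indexedValue.getBytes("UTF-8"));
        client.insert("MyKeyspace",
                      "userNameIndex",            // the single index row
                      path,
                      targetRowKey.getBytes("UTF-8"),
                      System.currentTimeMillis(), // timestamp scheme is up to the client
                      ConsistencyLevel.QUORUM);
    }
}
```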

In the vast majority of cases, this will be enough, and it will be sufficient that the rest of your data is properly balanced across your cluster.

But, I hear you saying, this is too brutal. Your index is too massive to fit on 3 nodes, it is extremely hot, and this just won’t work. You moved to Cassandra because you want your load distributed across your entire cluster. Period.

This is a perfectly reasonable point of view.

The only solution in this case is to build an index system over the top of the simple hashmap provided. We are taking this approach, and it will be elaborated with some sample code in a later post.

Basic indexing strategy for RP

For those who need to know the basic strategy now, here it is: you start off with the simple approach, storing your entire index as columns under a single key. As the number of columns grows past some threshold you define, the columns should be split such that half the entries are migrated to a new key/row. Thus the index is split evenly across the cluster.

Each range can be stored under a key named in a predictable way, for example <INDEX>.<SPLIT NO.>. The start and end index entries stored in each split should themselves be stored in a dedicated column family that is used to record index meta information, using the same key name, ensuring that the meta information is also distributed.

You can then progressively test the existence of splits simply by attempting to open the key for the meta information that would describe the split. If you can retrieve the meta information, you know that the split also exists. It won’t be necessary to cache this information to make the process reasonably performant: Cassandra already caches data in memory, and also uses Bloom filters to determine whether or not a requested row exists (Bloom filters enable a Cassandra node to rapidly determine whether it holds a key without traversing its list of keys).
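
To make the strategy concrete, here is a minimal, self-contained sketch of the split-location logic. A plain HashMap stands in for the meta column family (in a real system each lookup would be a read against Cassandra), and all names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of locating the index split that covers a given entry. The map
// stands in for the meta CF: row key "<INDEX>.<SPLIT NO.>" -> {start, end}.
public class SplitLocator {
    final Map<String, String[]> metaCf = new HashMap<>();

    static String splitKey(String index, int splitNo) {
        return index + "." + splitNo;   // e.g. "userNameIndex.0"
    }

    // Probe splits in order; a missing meta row means we ran off the end.
    String locate(String index, String entry) {
        for (int i = 0; ; i++) {
            String key = splitKey(index, i);
            String[] range = metaCf.get(key);   // in Cassandra: read the meta row
            if (range == null) return null;     // no such split: entry not indexed
            if (entry.compareTo(range[0]) >= 0 && entry.compareTo(range[1]) <= 0)
                return key;                     // this split's row holds the entry
        }
    }

    public static void main(String[] args) {
        SplitLocator loc = new SplitLocator();
        loc.metaCf.put("userNameIndex.0", new String[] {"aaron", "lydia"});
        loc.metaCf.put("userNameIndex.1", new String[] {"marcus", "zoe"});
        System.out.println(loc.locate("userNameIndex", "james")); // userNameIndex.0
    }
}
```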

There you have it, an index offering range scans fully distributed over your cluster!

Full text search sanity check

Implementing a full text index will of course involve more work than a simple left-side/ISAM style index, although the principles are the same. Given the existence of Lucandra though, I would suggest that before proceeding to create your full text index using the described approach, you first examine another possibility: running your full text searches off a dedicated cluster.

If you are running in the cloud, for example on EC2 or Rackspace Cloud, you can start your dedicated full text search cluster at low cost on small instances that can be scaled up if necessary later. Otherwise, consider virtualization or configuring Cassandra to run two clusters in parallel on the same nodes (more on this possibility in a later post).

The beauty of open source is that many problems have already been solved for you, and Lucandra is too good an opportunity to miss if you need full text search on Cassandra.

Written by dominicwilliams

February 22, 2010 at 10:56 pm

Responses

  1. so if OPP is used, and every node is load balanced (decommission + bootstrap) regularly, then you would expect to see similar distribution compared to RP?

    keith

    February 23, 2010 at 3:33 am

    • Hi, it is important to realize when considering manual load balancing with OPP that there are actually several types of load to balance across the nodes, e.g.
      – total storage consumption
      – storage I/O bandwidth
      – CPU load

      The thing is, although you will be able to manually set the key ranges assigned to your nodes, you cannot control which column families those keys will be drawn from, and the makeup of keys will always vary from node to node.

      Because different column families will create different types of load, it is likely that by assigning key ranges to nodes you might only be able to balance one type of load e.g. by manual balancing you might be able to balance CPU load, but then find your nodes require widely differing storage capacities.

      So not only can manual balancing create an administrative overhead, but you might eventually find yourself in a situation where you cannot balance all the types of load satisfactorily, and therefore actually have to start beefing up the hardware on specific nodes.

      By contrast, RP offers a much easier solution. The keys stored on each node share the same column family makeup, and therefore the different types of load are spread evenly across the cluster.

      dominicwilliams

      February 23, 2010 at 11:37 am

    • FYI, the “loadbalance” operation provides exactly this behavior, in a more convenient package. Although, as Dominic says, you can find conditions where balancing purely by number of keys is not optimal, for most applications this should work reasonably well.

      Jonathan Ellis

      February 23, 2010 at 2:03 pm

  2. Excellent post, thank you for the info.

    Virgilio Sosinsky

    March 5, 2010 at 11:33 pm

  3. The descriptions are mostly right, apart from the following two points:
    1. The reference to Yahoo’s benchmark report is not a good one.
    2. For a technical writer, pithiness is better, and saves the writer’s and readers’ time.

    Thanks for your good post.

    schubertzhang

    March 31, 2010 at 11:10 am

  4. You listed the advantages of the OPP as:

    “3. If you screw up, you can scan over your data to recover/delete orphaned keys”

    But can you please help me understand what exactly that means and how it is a unique feature of the OPP? What do you mean by “orphaned key”? Keys don’t have parents, in general.

    Paul Prescod

    April 7, 2010 at 10:11 pm

    • That particular point has been made obsolete by the latest version of Cassandra, but I’ll tell you what I meant…

      Previously, if you had RP, you were not able to do a scan of your records. The only way you could retrieve a record was by requesting it using its key. This means (or rather *meant*) that if for some reason you lost track of a record, it would become garbage that you could not get back to. It also meant you had to maintain your own indexes if you wanted to do aggregation analysis. Now it is possible to scan your records, albeit not in any particular order, and this is a big improvement. I will research this feature shortly and update this article because I don’t want people to be put off by that point. This point is now incorrect 🙂

      Of course, with OPP you have always been able to perform scans of your records, and you can do this between two specified keys.

      dominicwilliams

      April 8, 2010 at 10:38 pm
