Dominic Williams

Occasionally useful posts about RIAs, Web scale computing & miscellanea

Cassandra: the importance of system clocks, avoiding OOM and how to escape (unnecessary) OOM meltdown!

with 2 comments

First of all, big thanks to jbellis and driftx (Jonathan Ellis and Brandon Williams of DataStax). Without them, I’d still be up trying to fix a problem node I’ll talk about later. They were no doubt horrified at an issue created through wanton misconfiguration, but they helped out amazingly anyway. For the benefit of other people as reckless as us, and who are equally unskilled in the black arts, here are some cautionary tales and a hard core tip for recovering from a serious OOM situation.

Before going any further, I’ll just give you some background to Fight My Monster, which went live late last year. We are a self-funded London startup. We have had enough funds to develop a great product and promote it using TV commercials, but everything else has to be done on a shoestring. Cassandra is the primary workhorse for processing data, and we ask a LOT of it.

The game itself is an MMO, social networking and minigames hybrid. Since launching it has grown quite fast, with most traffic being concentrated 3pm -> 10pm GMT weekdays and weekends (basically when kids in the UK are free – since we haven’t started with international markets yet). The site is characterized by a huge number of transactions, many of which require consistency. Furthermore, we write and rewrite huge amounts of data. We log instant chat messages for the security of kids, we copy kids’ room (aka “Wall”) posts to all their friends’ feeds, and we let kids grow, fight and trade unlimited numbers of monsters for nuggets, with every transaction being recorded, generating a storm of writes. And that’s just the start of it.

My only experience of highly loaded SQL systems is from running large user forums through another company. What I can tell you is that SQL systems just wouldn’t give Fight My Monster anywhere near comparable performance and reliability at the same cost. To some extent, Cassandra makes FMM possible. That said, there are a couple of pitfalls you really need to avoid…

Beware Virtualization: Your Clocks Need To Work!

Before discussing column family configuration and OOM, I will outline an equally serious problem that can be caused by virtualization. I decided to give up managing server racks a long time ago. Fight My Monster combines a CDN with a cloud service where you can actually provision *dedicated* machines on demand. I recommend doing the same if you can, because it simplifies resource management.

However, fantastic as the cloud service we use is, in the early days aspects of our cloud computing infrastructure caused problems with our data. The problem was this: although we are able to provision dedicated machines, our cluster nodes still run Ubuntu on top of a VMware hypervisor that enables features like snapshotting. Unfortunately, what we did not realize is that system clock drift is a very common issue with virtualized systems, and this is very bad in Cassandra setups because timestamps affect the serialization of writes.

Unbeknownst to us, when our virtual servers were under load, their system clocks were drifting in a dramatic way. Ntpd would try to keep the clocks in sync with the Internet NTP servers through periodic tiny adjustments as normal, but as we found out, when drift is too rapid ntpd temporarily gives up. Total drift would thus continue growing until system load abated, at which point ntpd, upon deciding it could now keep everything in sync again, would suddenly bring the clock into line with the network time servers, sometimes setting the clock back even 20 minutes or so into the past!!! Bizarrely, this was also happening to 1000s of LAMP servers using the service, but nobody had noticed.

In our case, running a distributed database, we certainly noticed the effects. A lot of our processing is transactional. For example, each time a monster fight happens, the monsters involved change and have their histories updated, and the monster owners have their monster lists and nugget bank balances adjusted. Often transactions stack up against each other, with users engaging in multiple fights and trades in a short space of time. Where necessary, we write with consistency level QUORUM and use the Cages + ZooKeeper distributed locking system to prevent lost updates and ensure everything is serialized properly.

But we were constantly suffering inexplicable data states as a consequence of broken serialization. We were convinced the problem lay inside Cages or Cassandra (an upside of the situation was that Cages got many extra hours of code review). Eventually though, I noticed a fight record stored in Cassandra where a monster went into a fight with attributes that it was to gain from a fight in the future. This led to the realization that the problem had something to do with clocks!

The key point: a Cassandra cluster serializes write operations using the timestamp you send with them (the timestamp is often created for you if you use a library like Pelops). In fact, timestamps enable Cassandra to use the clock as a serialization mechanism among its loosely connected nodes. If the clocks on your application servers jump about, so that the timestamps you send with write operations are wrong, you’re going to have problems.
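To make the failure mode concrete, here is a minimal sketch of last-write-wins reconciliation (an illustration only: the class and method names are invented, and real Cassandra timestamps are microseconds since the epoch):

```java
// Illustration: per column, Cassandra keeps the value carrying the highest
// client-supplied timestamp. If a client clock is stepped backwards, a
// logically newer write can carry a LOWER timestamp and silently lose.
public class LastWriteWins {
    // returns the timestamp of the write that "wins" reconciliation
    static long winnerTimestamp(long tsA, long tsB) {
        return Math.max(tsA, tsB);
    }

    public static void main(String[] args) {
        long t1 = 1_600_000_000_000_000L; // first write, clock correct (microseconds)
        long t2 = t1 - 1_200_000_000L;    // later write, after ntpd stepped the clock back 20 minutes
        // the stale first write "wins", even though the second write is logically newer
        System.out.println(winnerTimestamp(t1, t2) == t1); // prints "true"
    }
}
```

This is why a backwards clock step silently reorders transactions: the database is behaving correctly given the timestamps it was handed.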

Moral: if you are using virtual servers for your Cassandra setup, pay very close attention to how the system clocks are working and configure ntpd carefully.

Each case will be different, but for those interested this is what we did:

  1. Added “noapictimer irqpoll” to end of kernel line in /boot/menu.lst
  2. Added following line to /etc/rc.local
    echo "jiffies" > /sys/devices/system/clocksource/clocksource0/current_clocksource
  3. Configured ntpd better, e.g. in ntp.conf:

# don't give up so easily, even though this should never happen
tinker panic 0
# log your drift and stats and review them periodically
driftfile /var/lib/ntp/ntp.drift
statsdir /var/log/ntpstats/
statistics loopstats peerstats clockstats
filegen loopstats file loopstats type day enable
filegen peerstats file peerstats type day enable
filegen clockstats file clockstats type day enable
# use lots of NTP servers to get closest synchronization to actual time

Our drift is now max 1-2ms under load and Cages/ZooKeeper/Cassandra is now serializing millions of transactions just fine.

Avoiding OOM: Pay Close Attention to Column Family Configuration

When you create new column families in a keyspace, Cassandra automatically configures them with sensible defaults for you. These defaults control things like the size of associated caches, which are actually very important to operation. Because the defaults are sensible for new installations, or for projects with relatively small numbers of column families, it is easy to forget these parameters and concentrate purely on data design as you evolve. That’s what we did, so we had to learn the following the hard way!

Several Cassandra aficionados have expressed shock and horror at the number of column families we use in Fight My Monster: we now have more than 50. But this isn’t down to bad design. Really! Because Fight My Monster is a destination site, it provides wide-ranging functionality, and while some of the column families belong to legacy code and are being deleted or consolidated, the simple fact is that we need a lot of them. Unless you are a Web 2.0 site addressing a single very specific need, I would say the number of column families you use will inevitably increase. And anyway, you should still make sure you know about this…

So what’s the problem? The problem is that Cassandra is bootstrapped with a fixed maximum heap size. When it is running, there are therefore limits on how much data it can cache in memory at once, and if you misconfigure it sufficiently that it reaches a situation where it needs more memory than is available, it will crash and dump core with an Out of Memory (OOM) error. Furthermore, if you have not made reasonable attempts at setting appropriate column family parameters, you can find yourself in a situation where a Cassandra node dies with OOM every time you try to restart it, which is what happened to us.

First the basics: a Cassandra node sends new writes to a column family first to its commit log, and then on to an in-memory data structure called a memtable, where they are aggregated before being written to disk (memtables absorb overwrites and also act as read caches). The memtable associated with a column family has three key parameters which affect memory consumption. There are some others too, which you can see by examining a column family configuration.

These three key parameters are 1. memtable_operations_in_millions, 2. memtable_throughput_in_mb and 3. memtable_flush_after_mins. The first parameter controls the maximum number of writes that the memtable will hold in memory before flushing them to disk. For example, if this number is 0.75, it will hold 750,000 writes in memory before flushing them. The second parameter is the maximum total size of the column values that can be written before the memtable is flushed from memory. Note that it only counts the size of the column values, not their names, which can also be significant. The third parameter is the maximum time in minutes a memtable may reside in memory before it is flushed.
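The interplay of the three thresholds can be sketched as follows (a simplified illustrative model, not Cassandra’s actual implementation; the class name is invented):

```java
// Simplified model of when a memtable is flushed: whichever of the three
// thresholds (operation count, value bytes, age) is crossed first wins.
public class MemtableModel {
    final double maxOperationsInMillions; // memtable_operations_in_millions
    final int maxThroughputInMb;          // memtable_throughput_in_mb
    final int maxAgeInMins;               // memtable_flush_after_mins

    MemtableModel(double ops, int mb, int mins) {
        this.maxOperationsInMillions = ops;
        this.maxThroughputInMb = mb;
        this.maxAgeInMins = mins;
    }

    boolean shouldFlush(long operations, long bytesOfColumnValues, long ageInMins) {
        return operations >= maxOperationsInMillions * 1_000_000
            || bytesOfColumnValues >= (long) maxThroughputInMb * 1024 * 1024
            || ageInMins >= maxAgeInMins;
    }

    public static void main(String[] args) {
        MemtableModel m = new MemtableModel(0.75, 160, 60); // the old defaults discussed below
        // 750,000 small writes trigger a flush long before 160MB of values is reached
        System.out.println(m.shouldFlush(750_000, 10 * 1024 * 1024, 5)); // prints "true"
    }
}
```

Note that the model counts only column value bytes, mirroring the caveat above: the per-write Java object overhead is invisible to the throughput threshold.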

Configuring these parameters requires care and thought, and you should use nodetool to examine statistics for your column families to work out the best values. Each column family must be treated on a case by case basis. For example, if a column family stores lots of small columns, then the total calculated size of the values will be small in memory. But the simple Java objects representing operations/column writes in memory have an overhead, and if you set memtable_operations_in_millions too high, the actual maximum memory consumed by the memtable can greatly exceed the memtable_throughput_in_mb value. It must be possible for Cassandra to allocate the maximum memory that all of your memtables might consume from its heap, while still leaving memory available for other functions. A suggested calculation has been:

heap size = memtable_throughput_in_mb * 3 * number of hot CFs + 1G + internal caches
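Plugging illustrative numbers into this rule of thumb shows how quickly the defaults add up (the figures below are examples, not measurements from our cluster):

```java
// Back-of-the-envelope heap estimate following the suggested rule of thumb:
// memtable_throughput_in_mb * 3 * number of hot CFs + 1GB + internal caches.
public class HeapEstimate {
    static long estimateHeapMb(int throughputMb, int hotColumnFamilies, int cachesMb) {
        return (long) throughputMb * 3 * hotColumnFamilies + 1024 + cachesMb;
    }

    public static void main(String[] args) {
        // e.g. 10 hot CFs at the old 160MB default demand ~5.8GB before caches
        System.out.println(estimateHeapMb(160, 10, 0)); // prints 5824
    }
}
```

At the old 160MB default, even ten hot column families demand close to 6GB of heap before internal caches are counted, which is exactly why a large number of column families cannot be left on the defaults.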

Although in a perfect world database systems would be clever enough to configure these thresholds themselves, in this world you cannot leave a system with a large number of column families running on the defaults.

In our case, we had legacy defaults which meant memtable_operations_in_millions=0.75, memtable_throughput_in_mb=160 and memtable_flush_after_mins=60 across all our column families. Since many of our hot column families store small pieces of data, the total memory overhead of many memtables would have been 400MB or more. Memtables fill up when you are under load, exactly when you don’t want OOM to raise its head!!! Under load, our memtables wanted more memory than the heap could provide.

I found a special way of triggering the problem. I forgot to restart a node after a rolling upgrade. By the time I realized it was down, there was a storm of hinted writes waiting for it from the other nodes (writes that the other nodes saved on its behalf for later application because it was not available to receive them at the time, a bit like your neighbor holding a package from DHL that arrived while you were out). When the node came up, these writes stormed into the commit logs, and then into the memtables, which quickly filled up. Total memory was exhausted before they could be flushed, causing the node to crash with OOM!

The Black Art of Recovering from OOM Meltdown

If you have not done so, you need to check your column family configurations now, because the situation can be worse than a node simply dying with an OOM exception. You can find, as we did with our badly misconfigured system, that a node cannot be restarted! Very possibly, many of the writes that caused your node to crash will remain in your commit logs. When you bring the node back up, Cassandra will try to replay these logs, filling up the memtables again and, if you’re unlucky, causing another OOM core dump!

If anyone reaches this situation, I hope for their sake they find this post!!! Caveat emptor: This is not advice, it merely reflects what we did to solve our problems. I am not in any way responsible for the consequences if this doesn’t work 😉 That said, this is what you do…

Firstly you need to bring your service *down* (yes, your *site* or *game* needs to be turned off). In our case we use RF 3 so only the performance of our service had suffered up until this point. But when you start messing with the commit logs, you need to make sure no new writes are occurring…

Once no new writes can take place, proceed as follows. Firstly increase the heap allocated to Cassandra to the maximum value your server will support (i.e. using the -Xmx JVM parameter). Try restarting Cassandra again. If you are successful, it should replay all the logs successfully and flush the memtables to disk. In this case, your original commit logs will be deleted and new ones created. You can now stop Cassandra, reset the heap size, and restart. You can then reconfigure your column family parameters, and hopefully be in the clear.

If on the other hand this does not work you need to do the following:

  1. Move your commit logs to a backup directory e.g. mv /var/opt/cassandra/commitlogs/* /var/opt/cassandra/logsbackup
  2. Connect to a working Cassandra node using cassandra-cli and reconfigure your column family parameters appropriately. Make sure the heap on your down node can supply your maximum memtable needs.
  3. Restart the Cassandra node without its commit logs. It will now receive the schema updates from the other nodes. Because its memtables will be empty, it won’t crash and will be able to reconfigure its own copy of the schema.
  4. Shutdown the Cassandra node, copy the commit logs back and restart it again. Unfortunately, if you have a sufficiently large number of column families, have not been conservative enough with their parameters, and your commit log backlog is large enough, you will bomb out with OOM again!
  5. Copy out all your commit logs. Then copy back a single commit log and restart. Once this log is successfully replayed and deleted, shut down the node and repeat the process with the next commit log (for speed, Cassandra replays commit logs in parallel, so replaying them one by one creates the minimum memory load).

Hopefully you will now be feeling better, you will have fixed your broken node and you can restart your service. You will now be a wiser person and never forget to configure your column families properly 🙂

Written by dominicwilliams

February 8, 2011 at 8:13 pm

Posted in Uncategorized

Our redirector

with 2 comments

Fight My Monster is a fantastic brand name, but it is quite a handful for kids to type as a domain name. What we don’t want is kids googling “Fight My Monster” and then clicking on our own search ads (that could get expensive!!). The best solution, especially for usage with TV commercials, is to get yourself a redirection domain, as we did!

Next step: start blogging about the domain so Google picks it up! If anyone wants to talk about our domain strategy, please feel free. And link to this post 🙂

We’ve already registered all of the likely misspellings. Any other ideas?

Written by dominicwilliams

August 3, 2010 at 11:12 am

Posted in Fight My Monster


Cassandra: Up and running quickly in Java using Pelops

with 52 comments


In Greek mythology Cassandra is captured by the triumphant king Agamemnon after the fall of Troy, and bears him two sons, Pelops and Teledamus. This Java client library is Pelops’s namesake, nicknamed “Cassandra’s beautiful son”, because it offers a beautiful way to code against the Cassandra database. This is a quick introduction to the library.

You can find the open source code here


Pelops was born to improve the quality of Cassandra code across a complex commercial project that makes extensive use of the database. The main objectives of the library are:

  • To faithfully expose Cassandra’s API in a manner that is immediately understandable to anyone:
    simple, but beautiful
  • To completely separate low-level concerns such as connection pooling from data processing code
  • To eliminate “dressing code”, so that the semantics of data processing stand clear and obvious
  • To accelerate development through intellisense, function overloading and powerful high-level methods
  • To implement strategies like load balancing based upon the per node running operation count
  • To include robust error handling and recovery that does not mask application-level logic problems
  • To track the latest Cassandra releases and features without causing breaking changes
  • To define a long-lasting paradigm for those writing client code

Up and running in 5 minutes

To start working with Pelops and Cassandra, you need to know three things:

  1. How to create a connection pool, typically once at startup
  2. How to write data using the Mutator class
  3. How to read data using the Selector class.

It’s that easy!

Creating a connection pool

To work with a Cassandra cluster, you need to start off by defining a connection pool. This is typically done once in the startup code of your application. Sometimes you will define more than one connection pool. For example, in our project, we use two Cassandra database clusters, one which uses random partitioning for data storage, and one which uses order preserving partitioning for indexes. You can create as many connection pools as you need.

To create a pool, you need to specify a name, a list of known contact nodes (the library can automatically detect further nodes in the cluster, but see notes at the end), the network port that the nodes are listening on, and a policy which controls things like the number of connections in your pool.

Here a pool is created with default policies:

    Pelops.addPool(
        "Main", // pool name, used later when creating mutators and selectors
        new String[] { "", "", "" }, // contact node hostnames (elided in the original)
        9160, // the Thrift port the nodes listen on (9160 is the default)
        new Policy());

Using a Mutator

The Mutator class is used to make mutations to a keyspace (which in SQL speak translates as making changes to a database). You ask Pelops for a new mutator, and then specify the mutations you wish to make. These are sent to Cassandra in a single batch when you call its execute method.

To create a mutator, you must specify the name of the connection pool you will use and the name of the keyspace you wish to mutate. Note that the pool determines what database cluster you are talking to.

Mutator mutator = Pelops.createMutator("Main", "SupportTickets");

Once you have the mutator, you start specifying changes.

/**
 * Write multiple sub-column values to a super column...
 * @param rowKey                    The key of the row to modify
 * @param colFamily                 The name of the super column family to operate on
 * @param colName                   The name of the super column
 * @param subColumns                A list of the sub-columns to write
 */
mutator.writeSubColumns(
    userId, // row key (placeholder; elided in the original)
    "Tickets", // super column family (placeholder)
    UuidHelper.newTimeUuidBytes(), // using a UUID value that sorts by time
    mutator.newColumnList( // assumed helper to build the sub-column list
        mutator.newColumn("category", "videoPhone"),
        mutator.newColumn("reportType", "POOR_PICTURE"),
        mutator.newColumn("createdDate", NumberHelper.toBytes(System.currentTimeMillis())),
        mutator.newColumn("capture", jpegBytes),
        mutator.newColumn("comment") ));

/**
 * Delete a list of columns or super columns...
 * @param rowKey                    The key of the row to modify
 * @param colFamily                 The name of the column family to operate on
 * @param colNames                  The column and/or super column names to delete
 */
mutator.deleteColumns(userId, "Tickets", "category", "comment"); // names are placeholders

After specifying the changes, you send them to Cassandra in a single batch by calling execute. This takes the Cassandra consistency level as a parameter.


Note that if you need to know that a particular mutation operation has completed successfully before initiating some subsequent operation, then you should not batch your mutations together. Since you cannot re-use a mutator after it has been executed, you should create two or more mutators and execute them with at least a QUORUM consistency level.

Browse the Mutator class to see the methods and overloads that are available.

Using a Selector

The Selector class is used to read data from a keyspace. You ask Pelops for a new selector, and then read data by calling its methods.

Selector selector = Pelops.createSelector("Main", "SupportTickets");

Once you have a selector instance, you can start reading data using its many overloads.

/**
 * Retrieve a super column from a row...
 * @param rowKey                        The key of the row
 * @param columnFamily                  The name of the column family containing the super column
 * @param superColName                  The name of the super column to retrieve
 * @param cLevel                        The Cassandra consistency level with which to perform the operation
 * @return                              The requested SuperColumn
 */
SuperColumn ticket = selector.getSuperColumnFromRow(
    userId, // row key (placeholder; elided in the original)
    "Tickets", // super column family (placeholder)
    ticketId, // super column name, e.g. a time UUID
    ConsistencyLevel.ONE); // consistency level (placeholder)

assert ticketId.equals(ticket.name);

// enumerate sub-columns
for (Column data : ticket.columns) {
    String name = new String(data.name);
    byte[] value = data.value;
}

/**
 * Retrieve super columns from a row
 * @param rowKey                        The key of the row
 * @param columnFamily                  The name of the column family containing the super columns
 * @param colPredicate                  The super column selector predicate
 * @param cLevel                        The Cassandra consistency level with which to perform the operation
 * @return                              A list of matching columns
 */
List<SuperColumn> allTickets = selector.getSuperColumnsFromRow(
    userId, // row key (placeholder)
    "Tickets", // super column family (placeholder)
    Selector.newColumnsPredicateAll(true, 10000),
    ConsistencyLevel.ONE); // consistency level (placeholder)

/**
 * Retrieve super columns from a set of rows.
 * @param rowKeys                        The keys of the rows
 * @param columnFamily                   The name of the column family containing the super columns
 * @param colPredicate                   The super column selector predicate
 * @param cLevel                         The Cassandra consistency level with which to perform the operation
 * @return                               A map from row keys to the matching lists of super columns
 */
Map<String, List<SuperColumn>> allTicketsForFriends = selector.getSuperColumnsFromRows(
    Arrays.asList(new String[] { "matt", "james", "dom" }), // the friends
    "Tickets", // super column family (placeholder)
    Selector.newColumnsPredicateAll(true, 10000),
    ConsistencyLevel.ONE); // consistency level (placeholder)

/**
 * Retrieve a page of super columns composed from a segment of the sequence of super columns in a row.
 * @param rowKey                        The key of the row
 * @param columnFamily                  The name of the column family containing the super columns
 * @param startBeyondName               The sequence of super columns must begin with the smallest super column name greater than this value. Pass null to start at the beginning of the sequence.
 * @param orderType                     The scheme used to determine how the column names are ordered
 * @param reversed                      Whether the scan should proceed in descending super column name order
 * @param count                         The maximum number of super columns that can be retrieved by the scan
 * @param cLevel                        The Cassandra consistency level with which to perform the operation
 * @return                              A page of super columns
 */
List<SuperColumn> pageTickets = selector.getPageOfSuperColumnsFromRow(
    userId, // row key (placeholder)
    "Tickets", // super column family (placeholder)
    lastIdOfPrevPage, // null for first page
    Selector.OrderType.TimeUUIDType, // ordering defined in this super column family
    true, // reversed, i.e. blog order
    10, // count shown per page
    ConsistencyLevel.ONE); // consistency level (placeholder)
There are a huge number of selector methods and overloads which expose the full power of Cassandra, and others like the paginator methods that make otherwise complex tasks simple. Browse the Selector class to see what is available here

Other stuff

All the main things you need to start using Pelops have been covered, and with your current knowledge you can easily feel your way around Pelops inside your IDE using intellisense. Some final points it will be useful to keep in mind if you want to work with Pelops:

  • If you need to perform deletions at the row key level, use an instance of the KeyDeletor class (call Pelops.createKeyDeletor).
  • If you need metrics from a Cassandra cluster, use an instance of the Metrics class (call Pelops.createMetrics).
  • To work with Time UUIDs, which are globally unique identifiers that can be sorted by time – which you will find to be very useful throughout your Cassandra code – use the UuidHelper class.
  • To work with numbers stored as binary values, use the NumberHelper class.
  • To work with strings stored as binary values, use the StringHelper class.
  • Methods in the Pelops library that cause interaction with Cassandra throw the standard
    Cassandra exceptions defined here.

The Pelops design secret

One of the key design decisions that, at the time of writing, distinguishes Pelops is that the data processing code written by developers does not involve connection pooling or management. Instead, classes like Mutator and Selector borrow connections to Cassandra from a Pelops pool for just the periods that they need to read from and write to the underlying Thrift API. This has two advantages.

Firstly, and obviously, code becomes cleaner and developers are freed from connection management concerns. More subtly, this also enables the Pelops library to manage connection pooling completely by itself, for example keeping track of how many outstanding operations are currently running against each cluster node.

This, for example, enables Pelops to perform more effective client-side load balancing by directing new operations to the node with the fewest outstanding operations. Because of this architectural choice, it will even be possible to offer strategies in the future where, for example, nodes are actually queried to determine their load.
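The strategy can be sketched like this (an illustration of the idea only; the names are invented and this is not the actual Pelops code):

```java
import java.util.Map;

// Least-outstanding-operations balancing: pick the node with the fewest
// in-flight operations as the target for the next operation.
public class LeastLoadedNode {
    static String pickNode(Map<String, Integer> outstandingOpsByNode) {
        String best = null;
        for (Map.Entry<String, Integer> e : outstandingOpsByNode.entrySet()) {
            if (best == null || e.getValue() < outstandingOpsByNode.get(best)) {
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> load = Map.of("node1", 12, "node2", 3, "node3", 7);
        System.out.println(pickNode(load)); // prints "node2"
    }
}
```

Because the pool sees every borrow and return, keeping these counters accurate is cheap, which is what makes this balancing possible at all.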

To see how the library abstracts connection pooling away from the semantics of data processing, take a look at the execute method of Mutator and the tryOperation method of Operand. This is the foundation upon which Pelops greatly improves over existing libraries that have modelled connection management on pre-existing SQL database client libraries.


That’s all. I hope you get the same benefits from Pelops that we did.

Written by dominicwilliams

June 11, 2010 at 12:31 pm

As3Crypto RSA “padding function returned null” death bug fix :)))

with 5 comments

UPDATE: Unfortunately it transpired that the fix I had only worked for small amounts of data being encrypted/decrypted. It was masking a problem I could see in the code, but did not work with larger amounts of data. This post has therefore changed into a report of the problem. Sorry for any disappointment.

WHAT I FOUND: Firstly, this really is a horrible bug, and I’ve had to move on after spending much more time on it than I intended!

For those that want to grab the baton, here are the main points. When you encrypt/decrypt some data, the results will actually appear non-deterministic, as illustrated by the code immediately following.

This is the first thing you need to eliminate when debugging, and it is achieved by disabling the Random class so that it returns 0xff in the first line of nextByte(). Thereafter, the encryption operations that fail will be deterministic, unlike in the following code:

	protected function onRunTest(event:MouseEvent):void
	{
		var privKey: String = "-----BEGIN RSA PRIVATE KEY-----\n" +
			"MIIBOwIBAAJBAM7U/mEVLiXKC4vGjpDcQ3qiAi/paF2Qp03Y6JXvcuYZG5/vHRgD\n" +
			"qWpVX+oFtXq7uhjcl+m4fx9zSfklcwdmknUCAwEAAQJAFN7PMFKnzm5dzePiPOHM\n" +
			"+VHhsJ33xwEysJtDlOWNjYQqhrpYRNgkx/fRKzEbY9ymwihRhO1IfBHpOYL5WXgQ\n" +
			"3QIhAPZKg4j+osWdRmv33Ql8bJdls5u5OqgX5k7clvbawwzrAiEA1vxFt/o9BMds\n" +
			"DJDxfDl1eYmUsYFUP2vMavjHgz3Phh8CIQCK9ClX7koJch1cJuCnTHK7zB5UWmH0\n" +
			"ml9O2Pe3WF85dwIgNhEn74cNhYAp2lcxhE5nDvPc429lIrYXqOd8NbN7130CIQCo\n" +
			"JUINRE5bjANnPIIviuVUqJDAmc1ZS29dyJiz/Qr+3w==\n" +
			"-----END RSA PRIVATE KEY-----";
		var pubKey: String = "-----BEGIN PUBLIC KEY-----\n" +
			"p03Y6JXvcuYZG5/vHRgDqWpVX+oFtXq7uhjcl+m4fx9zSfklcwdmknUCAwEAAQ==\n" +
			"-----END PUBLIC KEY-----";
		var dummyTxt:String = "..."; // long sample text elided in the original
		var rsaPublic: RSAKey = PEM.readRSAPublicKey(pubKey);
		var rsaPrivate: RSAKey = PEM.readRSAPrivateKey(privKey);

		for (var i: int=0; i<500; i++)
		{
			var message: String = dummyTxt.substr(0, Math.floor(Math.random() * 400));
			if (!doTest(rsaPublic, rsaPrivate, message))
			{
				if (doTest(rsaPublic, rsaPrivate, message))
				{
					trace("Arrghh. Nondeterministic! "+i); // original reporting call elided
				}
			}
		}
	}

	protected function doTest(rsaPublic: RSAKey, rsaPrivate: RSAKey, message: String): Boolean
	{
		try {
			var src:ByteArray = Hex.toArray(Hex.fromString(message));
			var encrypted:ByteArray = new ByteArray();
			rsaPublic.encrypt(src, encrypted, src.length);
			var decrypted:ByteArray = new ByteArray();
			rsaPrivate.decrypt(encrypted, decrypted, encrypted.length);
			var original:String = Hex.toString(Hex.fromArray(decrypted));

			if (original != message)
			{
				trace("Corruption: shouldn't happen!"); // original reporting call elided
			}
		} catch (e: Error) {
			return false;
		}
		return true;
	}

Of course, in many ways it would have been nice if the result had remained nondeterministic even with Random disabled. That would have meant some global state was accidentally being shared, which would be relatively easy to find.

Instead, I’m afraid this looks like a bit of a demon bug buried in the complexities of bit shifting and maths with big integer values. If you are reading this, and you are an encryption specialist, realize that it is your responsibility to sort this out 😉 since it will take a non-specialist a very long time.

For those feeling brave, you’ll notice that in the following function, when chunk is calculated in such a way as to cause only 63 bytes to be written to dst, you’ve got a problem. You can partially mask the problem by having BigInteger construct from 63 bytes, which works, but this does not solve the problem when a concatenation of big integers is written out and some entry other than the last is shortened.

	private function _encrypt(op:Function, src:ByteArray, dst:ByteArray, length:uint, pad:Function, padType:int):void {
		// adjust pad if needed
		if (pad==null) pad = pkcs1pad;
		// convert src to BigInteger
		if (src.position >= src.length) {
			src.position = 0;
		}
		var bl:uint = getBlockSize();
		var end:int = src.position + length;
		while (src.position<end) {
			var block:BigInteger = new BigInteger(pad(src, end, bl, padType), bl, true);
			var chunk:BigInteger = op(block);
			// ... remainder of the function elided ...
		}
	}

Good karma in advance for anyone who is brave enough to take this on!

Written by dominicwilliams

June 4, 2010 at 4:02 pm

Flex font embedding with Spark and Halo made easy!

with 7 comments

I’m going to keep this one really brief. There is so much stuff written about fonts and Flex on the Web that the subject seems incredibly complex. It is not(!) and I’m going to quickly outline what I think is the best real-world production approach to take with fonts when using Flash Builder 4.

Firstly, understand that you often *should* use embedded fonts in your complex Flex projects. The reason is simple: you probably don’t want your text to look slightly different depending on which computer it’s running on. For example, a Windows machine and a Mac will have different versions of the Arial font. Unless you embed the font you are going to use, you do not know exactly what the font will look like to the user, and the differences can be very significant from the design perspective. Better to embed fonts, so it always looks the same.

So, you need to embed fonts. How do you do this? The most convenient way is to use a style declaration in the root of your project e.g.

        @font-face {
            font-family: "MyriadPro";
            src: url("assets/MyriadPro-Regular.otf");
            embedAsCFF: true;
        }

Notice that we embed as CFF, which provides benefits such as advanced anti-aliasing. But, you howl, Halo (the stuff starting with mx) doesn’t support this! Don’t worry, this is covered in a moment.

Before moving on, we’ll optimize this some by only embedding the characters from the font we are likely to use, which keeps file size down.

        @font-face {
            font-family: "MyriadPro";
            src: url("assets/MyriadPro-Regular.otf");
            embedAsCFF: true;
            unicode-range: U+0020-U+007D;
        }

So now everything looks the same on every system, perhaps we have been able to use an exotic font, and we haven’t blown up our SWF file size too much. But there’s a problem: Halo’s mx controls show blank spaces where text should be.

No worries, this is simple too. You just need to override the class of the text renderer object used by the Halo controls.

        @namespace mx "library://";
        @font-face {
            font-family: "MyriadPro";
            src: url("assets/MyriadPro-Regular.otf");
            embedAsCFF: true;
        }
        mx|Alert, mx|Button, mx|Form, mx|FormHeading, mx|FormItem, mx|FormItemLabel {
            font-family: MyriadPro;
            textFieldClass: ClassReference("mx.core.UIFTETextField");
        }

Nice! There you have your production font strategy. Of course, the more of those sorry old mx controls you are using, the longer your list of mx|,… will be 🙂

Written by dominicwilliams

May 27, 2010 at 12:23 pm

Locking and transactions over Cassandra using Cages

with 43 comments


Anyone following my occasional posts will know that my team and I are working on a new kids game / social network called Fight My Monster. We are trying to break new ground with this project in many ways, and to support the data intensive nature of what we are trying to do we eventually selected the Cassandra database after working with several others.

This post is about a library we are using with Cassandra called Cages. Using Cages, you can perform much more advanced data manipulation and storage over a Cassandra database. This post explains why and gives some more information.

You can find Cages here.

Brief Background

For those that aren’t already familiar with Cassandra (skip this if you are), it can be described as the best representative of a new breed of fast, easily scalable databases. Write operations are evenly spread across a cluster of machines, removing the bottleneck found in traditional SQL database clusters, and it can continue operating even when some nodes are lost or partitioned. The cluster is symmetric in the sense there is no master node, nodes communicate with each other using a P2P protocol and can be easily added and removed by an administrator.

In order to deliver these characteristics, which are particularly valuable to Web 2.0 enterprises but will likely prove useful in other industries too, Cassandra offers what is known as a NoSQL model. This model is significantly different from a traditional SQL model, and many coming from more traditional database backgrounds will find it easier to understand Cassandra as a highly scalable, highly resilient distributed structured storage engine. While NoSQL offers some unique advantages to developers compared with SQL, it is also the case that whereas in SQL a complex operation can be specified in a single statement that is either executed or not (i.e. that has ACID properties), in Cassandra a complex operation must usually be composed of several individual operations, which can only be made reliable individually.

What is Cages for?

In many cases, websites and systems can be built against Cassandra without regard to ACID issues. Data storage and manipulation can be limited to operations against single rows (and for those that don’t know, rows in NoSQL models are really like multi-level hash tables which can contain hierarchical “ready-joined” data, and generally offer many more possibilities than SQL rows). Where a mutation of these rows must be reliable, or immediately seen by other clients of the database, Cassandra allows the developer to choose from a range of consistency levels that specify the tradeoff between performance, safety of storage and the timeliness with which data becomes visible to all.

This system is undeniably very effective, but when the systems you are building involve complex data structures and manipulation, you can still quickly reach situations where your logical operations necessarily involve several individual Cassandra read and write operations across multiple rows. Cassandra does not get involved in managing the safety and reliability of operations at the higher logical level, which means guaranteeing the logical consistency of your data can require some extra work. Some people, particularly those wedded to SQL databases, advocate storing some parts of your data in traditional SQL databases. For us though, it is most definitely preferable to develop and use Cages!

What is Cages?

Cages is a new Java library that provides distributed synchronization functionality, and will soon provide additional functionality for things like transactions, using the services of a ZooKeeper server or cluster. ZooKeeper is a very active project and is currently widely used. It started life as a Yahoo! Research project and is now an important Apache project. Cages has wide application, but its development will be very much driven by needs arising in relation to Cassandra.

Using Cages for locking

Cages offers three locking classes, ZkReadLock, ZkWriteLock and ZkMultiLock.

Single path locking

The simplest application of Cages is to enforce correct updates on data values inside Cassandra (or some other NoSQL database). For example, you may run into that old chestnut, the Lost Update Problem. This happens when you read data with one operation, modify it, and then write it back with a second operation. Problems occur when another client performs the same sequence simultaneously, such that the last client to write back its modified value overwrites the modifications made by the other.

In its simplest form: two clients wish to donate some money to a bank balance. Both simultaneously read the same bank balance value B1. The first client adds donation D1 and writes back (B1 + D1). The second client adds donation D2 and writes back (B1 + D2). Unfortunately the bank balance is now B2 = B1 + D2, and donation D1 has been lost.
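The race is easy to replay deterministically. Here is a small self-contained Java sketch (no Cassandra or Cages involved; the class and values are invented) that forces exactly the bad interleaving described above:

```java
public class LostUpdateDemo {
    static int balance;

    // Replays the lost-update interleaving deterministically:
    // both clients read before either writes.
    static int run(int start, int d1, int d2) {
        balance = start;
        int readByClient1 = balance; // both clients read the same B1
        int readByClient2 = balance;
        balance = readByClient1 + d1; // client 1 writes back B1 + D1
        balance = readByClient2 + d2; // client 2 writes B1 + D2, clobbering D1
        return balance;
    }

    public static void main(String[] args) {
        // Expected 135 if deposits were serialized; D1 (10) is lost.
        System.out.println(run(100, 10, 25)); // 125
    }
}
```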

Cages provides an easy fix:

    void depositMoney(int amount) throws Exception {
        ZkWriteLock lock = new ZkWriteLock("/accounts/" + accountId + "/balance");
        lock.acquire();
        try {
            // 1. Read the balance
            // 2. Update the balance
            // 3. Write the balance back
        } finally {
            lock.release();
        }
    }
Note that the paths passed to the lock classes can represent actual data paths within a NoSQL model, or can simply represent logical control over parts of a wider data model (so long as your application faithfully adheres to the rules you set).

Multi path locking

The Lost Update Problem is the most simple locking scenario where Cages can be applied. In our case, while many parts of our system even use Cassandra without locking, often with low consistency levels for maximum performance, there are several areas where we necessarily perform complex operations over contended data that involve numerous individual read and write operations. To begin with, we decided to treat the cases by nesting the ZkReadLock and ZkWriteLock single path locking primitives. However, there is a problem doing this in a distributed environment.

It is a simple fact that in a distributed environment, many situations where you acquire single path locks in a nested manner can result in deadlock. For example, if one operation sequentially tries to acquire R(P1) then W(P2), and a second operation simultaneously tries to acquire R(P2) then W(P1), deadlock will likely result: the first operation will acquire R(P1) and the second operation will acquire R(P2), but then the first operation will block waiting to acquire W(P2) and the second operation will block waiting to acquire W(P1).

Avoiding these problems with single path locks is no simple matter. For a start, the logical detection of closed wait graphs (deadlock) in the distributed environment is difficult and expensive to perform. The simplest approach to solving the problem is to try to acquire locks with a timeout, such that if you get into a deadlock situation, your acquire() calls throw an exception and you abandon your attempt. The problem here though is that your code has to handle the exception, and possibly rollback parts of the operation performed earlier under the protection of the outer lock.
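To make "closed wait graphs" concrete, here is a tiny self-contained Java sketch of cycle detection over a wait-for graph. This is purely illustrative (Cages does not do this, and the class is invented): in a distributed system, merely assembling an accurate, up-to-date graph like this is the hard and expensive part.

```java
import java.util.*;

public class WaitForGraph {
    // blocked operation -> set of operations holding the locks it waits on
    private final Map<String, Set<String>> waitsFor = new HashMap<>();

    void addEdge(String blocked, String holder) {
        waitsFor.computeIfAbsent(blocked, k -> new HashSet<>()).add(holder);
    }

    // Deadlock exists iff some operation can reach itself through wait edges.
    boolean hasDeadlock() {
        for (String node : waitsFor.keySet()) {
            if (reaches(node, node, new HashSet<>())) return true;
        }
        return false;
    }

    private boolean reaches(String from, String target, Set<String> seen) {
        for (String next : waitsFor.getOrDefault(from, Collections.emptySet())) {
            if (next.equals(target)) return true;
            if (seen.add(next) && reaches(next, target, seen)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        WaitForGraph g = new WaitForGraph();
        g.addEdge("op1", "op2"); // op1 blocks on W(P2), read-held by op2
        g.addEdge("op2", "op1"); // op2 blocks on W(P1), read-held by op1
        System.out.println(g.hasDeadlock()); // true
    }
}
```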

For all these reasons, when an operation needs to acquire locks over multiple paths in the distributed environment, ZkMultiLock is the class to use.

ZkMultiLock allows you to specify any number of read and write locks over paths, which may then all be acquired “simultaneously”. If your operation can acquire all the locks it needs together at the outset using ZkMultiLock, this avoids any possibility of deadlock. The tradeoff is slightly worse performance where multiple paths are specified and locks on those paths are highly contended. But in practice locks are rarely that contended, and you just need to guard against the disaster of simultaneously running operations interfering with each other and corrupting data. Because of the dangers of deadlock, on the Fight My Monster project we have mandated that only ZkMultiLock may be used unless there are very special reasons, a situation we have not yet encountered.

    void executeTrade(long lotId, String sellerId, String buyerId) throws Exception {
        // In the following we need to hold write locks over both the seller's and buyer's account
        // balances so they can be checked and updated correctly. We also want a lock over the lot,
        // since the value of lots owned might be used in conjunction with the bank balance by code
        // considering the total worth of the owner. Acquiring the required locks simultaneously
        // using ZkMultiLock avoids the possibility of accidental deadlock occurring between this
        // client code and other client code contending access to the same data / lock paths.
        ZkMultiLock mlock = new ZkMultiLock();
        mlock.addWriteLock("/Bank/accounts/" + sellerId);
        mlock.addWriteLock("/Bank/accounts/" + buyerId);
        mlock.addWriteLock("/Warehouse/" + lotId);
        mlock.acquire();
        try {
            // 1. check buyer has sufficient funds
            // 2. debit buyer's account
            // 3. credit seller's account
            // 4. change ownership of goods
        } finally {
            mlock.release();
        }
    }
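One classic way a multi-lock can get the "all at once" property is to acquire its whole batch of locks in a single globally agreed order, so no two clients can ever hold pieces of each other's sets. The sketch below illustrates that idea with in-process java.util.concurrent locks; it is not Cages' actual implementation, and the class name and paths are invented.

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class MultiLockSketch {
    // One process-wide lock per path, standing in for a ZooKeeper lock node.
    static final ConcurrentMap<String, ReentrantReadWriteLock> REGISTRY =
            new ConcurrentHashMap<>();

    // TreeMap keeps requested paths in a single global (lexicographic) order,
    // so every client acquires contended locks in the same sequence:
    // no cycles in the wait-for graph, hence no deadlock.
    private final TreeMap<String, Boolean> requests = new TreeMap<>(); // path -> isWrite
    private final Deque<Lock> held = new ArrayDeque<>();

    void addWriteLock(String path) { requests.put(path, true); }
    void addReadLock(String path)  { requests.putIfAbsent(path, false); }

    void acquire() {
        for (Map.Entry<String, Boolean> e : requests.entrySet()) {
            ReentrantReadWriteLock rw =
                    REGISTRY.computeIfAbsent(e.getKey(), k -> new ReentrantReadWriteLock());
            Lock l = e.getValue() ? rw.writeLock() : rw.readLock();
            l.lock();
            held.push(l);
        }
    }

    void release() {
        while (!held.isEmpty()) held.pop().unlock(); // release in reverse order
    }

    public static void main(String[] args) {
        MultiLockSketch mlock = new MultiLockSketch();
        mlock.addWriteLock("/Bank/accounts/alice");
        mlock.addWriteLock("/Bank/accounts/bob");
        mlock.acquire();
        try {
            // ...check and update both balances...
        } finally {
            mlock.release();
        }
    }
}
```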

Transactions for Cassandra

Transactions are a planned feature at the time of writing (12/5/2010). It shouldn’t be too long before they make it into the library, so I will explain a bit about them here.

Locking allows you to synchronize sequences of read and write (mutation) operations across rows stored on your Cassandra cluster. However, the locking classes do not solve the problem that occurs when part way through a complex operation your client machine expires, leaving the data inside the distributed database in a logically inconsistent state. For many applications the likelihood of this occurring is low enough for the locking classes alone to be sufficient. But there may be a small number of operations within applications for which data simply must be logically consistent, and even a very rare failure is unacceptable. This is where transactions come in.

For those that are interested, the following explains how they will work.

A new ZkTransaction class will provide the functionality, and it will need to be used in conjunction with the ZkMultiLock class. ZkTransaction will provide a simplified version of the Cassandra Thrift API that allows a series of data mutation operations to be specified. Client operations will proceed by first specifying the necessary locks that must be held, and then specifying the set of data mutations the transaction must perform. Once the transaction has been specified, its commit() method must be called, passing the ZkMultiLock instance as a parameter.

At this point, internally Cages will add a reference to a transaction node created on ZooKeeper from each single path lock node held. The ZkTransaction instance reads from Cassandra the current values of the data it is required to modify, and writes it into the transaction node as a “before” state. Once this is done, it sets about applying the data mutations specified in the necessary sequence of individual Cassandra read and write (mutate) operations. Once all operations are performed, the references to the transaction node from within the locks are removed, and then finally the transaction node itself is deleted – the transaction has now been committed, and the developer can release() the ZkMultiLock.

ZkTransaction can provide a guarantee of consistency for Cages clients because, if the client machine suddenly dies part way through the sequence of individual Cassandra mutation operations, Cages will immediately revoke the locks the client holds. From that point, any instance of ZkReadLock, ZkWriteLock or ZkMultiLock wishing to acquire the released paths must first roll back the transaction node by returning the relevant data to its recorded “before” state. The key point is that any process that needs to see the data in a logically consistent state, and therefore always acquires locks referencing the data in question before accessing it, will always see it as such. This provides a form of ACID for complex operations against a Cassandra database.

    void executeTrade(long lotId, String sellerId, String buyerId) throws Exception {
        ZkMultiLock mlock = new ZkMultiLock();
        mlock.addWriteLock("/Bank/accounts/" + sellerId);
        mlock.addWriteLock("/Bank/accounts/" + buyerId);
        mlock.addWriteLock("/Warehouse/" + lotId);
        mlock.acquire();
        try {
            // 1. check that buyer has sufficient funds
            // ....

            // perform the mutations using a transaction object
            ZkTransaction transaction = new ZkTransaction(NoSQL.Cassandra);
            // 2. debit buyer's account
            transaction.insert(buyerId, "accounts", bytes("balance"), bytes(newBalance));
            // 3. credit seller's account
            // ...
            // 4. change ownership of goods
            // ...
            transaction.commit(mlock);
        } finally {
            mlock.release();
        }
    }

Scalability and hashing ZooKeeper clusters

It is worth saying first off that a three node ZooKeeper cluster using powerful machines should be able to handle a considerable workload, and that where usage of locking and transactions is limited on an as-needed basis, such a setup will be able to provide for the needs of many Internet scale applications. However, it is easy to conceive of Cassandra being applied more widely outside of typical Web 2.0 norms where usage of locking and transactions is much heavier, and therefore the scalability of ZooKeeper must be examined.

The main issue is that, for the purposes described, it is not desirable to scale ZooKeeper clusters beyond three nodes. The reason is that while adding nodes scales up read performance, write performance actually starts degrading because of the need to synchronize write operations across all members, so clustering really offers availability rather than performance. A good overview of the actual performance parameters can be found here. The question, then, is what to do when ZooKeeper becomes a bottleneck.

The solution we suggest is simply to run more than one ZooKeeper cluster for the purposes of locking and transactions, and simply to hash locks and transactions onto particular clusters. This will be the final feature added to Cages.
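A minimal sketch of the idea (the class name and connect strings are invented, and this is not the eventual Cages API): hash each lock path onto one of the available ZooKeeper clusters. The essential correctness property is determinism: a given path must always map to the same cluster, otherwise two clients could lock the same path on different ensembles and mutual exclusion would be lost.

```java
public class LockClusterRouter {
    private final String[] clusters;

    LockClusterRouter(String... clusters) { this.clusters = clusters; }

    // Deterministic: the same path always lands on the same cluster.
    String clusterFor(String lockPath) {
        // floorMod guards against negative hashCode() values
        int index = Math.floorMod(lockPath.hashCode(), clusters.length);
        return clusters[index];
    }

    public static void main(String[] args) {
        LockClusterRouter router =
                new LockClusterRouter("zk-a:2181", "zk-b:2181", "zk-c:2181");
        // Both lookups for the same path must agree.
        System.out.println(router.clusterFor("/Bank/accounts/12345")
                .equals(router.clusterFor("/Bank/accounts/12345"))); // true
    }
}
```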

Note: since I wrote the above, Eric Hauser kindly drew my attention to the new “Observers” feature in ZooKeeper 3.3. This may greatly raise the limit at which hashing to separate three-node clusters becomes necessary. I am hoping to collate performance information and tests in the near future so people have more of an idea what to expect.

That’s it. Hope it was interesting. Please bear with me as Cages develops further over the coming weeks and feel free to test and report.

Final note

Check out the comments too, because there are already several useful clarifications and expansions there.

Written by dominicwilliams

May 12, 2010 at 10:10 pm

Build Google Guava on Mac OS X

leave a comment »

Yo if you are a Java programmer and you haven’t heard of Google Guava, then read up!!!

Guava provides numerous classes that make up for various deficiencies in the Java standard library encountered by people in the RIA space. Generally I’m not a great fan of using third party libraries because of the support and dependency issues, but when it’s the main Java utility library that Google maintains to support its own app developers, I make an exception! So if you regularly implement things like event listener sinks, don’t reinvent the wheel, just use Guava’s HashMultimap (although a ConcurrentWeakHashMultimap would be nice, but let’s not look a gift horse in the mouth). On a lesser level, the library also comes with things like character processing utilities that make tasks like username validation much simpler.
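For comparison, here is the Map-of-collections boilerplate that a multimap saves you from writing for an event listener sink. This is a self-contained sketch with invented class and event names, not Guava code; with Guava the whole listeners structure collapses to a single HashMultimap.

```java
import java.util.*;

public class ListenerSink {
    // The hand-rolled "multimap": every event key owns its own list bucket.
    private final Map<String, List<Runnable>> listeners =
            new HashMap<String, List<Runnable>>();

    void addListener(String event, Runnable listener) {
        List<Runnable> list = listeners.get(event);
        if (list == null) {                   // first listener for this event:
            list = new ArrayList<Runnable>(); // create the bucket...
            listeners.put(event, list);       // ...and register it
        }
        list.add(listener);
    }

    int listenerCount(String event) {
        List<Runnable> list = listeners.get(event);
        return list == null ? 0 : list.size();
    }

    public static void main(String[] args) {
        ListenerSink sink = new ListenerSink();
        sink.addListener("monsterTraded", new Runnable() { public void run() {} });
        sink.addListener("monsterTraded", new Runnable() { public void run() {} });
        System.out.println(sink.listenerCount("monsterTraded")); // 2
    }
}
```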

However on Mac OS X you might have problems building the source – at least with Guava 1.5 – and a jar file isn’t included.

Solution, once inside the guava folder:

1. At your command prompt do:

you$ export JAVA5_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.5

2. Using your favorite editor, add the following target to build.xml

<target name="jar" depends="compile">
    <jar destfile="google-guava.jar" basedir="build/classes" />
</target>

3. Using your favorite editor, update the java5bootclasspath property

<property name="java5bootclasspath" value="${env.JAVA5_HOME}/Classes/classes.jar"/>

Now run ant jar and you should get yourself a nice google-guava.jar file.

Written by dominicwilliams

March 17, 2010 at 2:07 pm

Posted in Java, Libraries
