Dominic Williams

Occasionally useful posts about RIAs, Web scale computing & miscellanea

Archive for the ‘Java’ Category

Fight My Monster Architecture in 5 mins!

with 2 comments

We computing engineers live in exciting times. The Internet and its supporting infrastructures really have come of age, and all of a sudden things can, and most importantly should, be done in new ways. Small teams are successfully growing enterprises to massive scale on tiny budgets, and the new generation of applications and games can be delivered across multiple platform formats. With these opportunities come responsibilities: we need to choose our technical strategies and platforms wisely, as these decisions have a deep impact on a developing tech business.

Fight My Monster has grown rapidly and will soon pass 1 million user accounts. But although we recently completed a funding round with some fantastic investors, the business was originally bootstrapped on a very limited budget. For those following the project, and for those who are considering working for Fight My Monster — as you may have heard, Fight My Monster is expanding and has some great opportunities for talented engineers right now — I wanted to provide a quick overview of our architecture, and the reasons we chose it. As you will see, it’s not that conventional.

We’ve gone for a simple but scalable three level architecture, which we host in the cloud. I hope this overview of how our system works proves interesting.

RIA Client

We have a Flash client with a difference. It is full screen, uses a flowed layout, and looks and behaves like a normal website, except in those areas where highly interactive game sparkle is required. The client implements an underused technique called deep linking which allows individual virtual pages within the “site” to be addressed by URL, and browser integration via Javascript allows the simulated site to respond to clicks on the browser’s forward and back buttons and change page. The client has a complex state, but if the browser window is refreshed, the client is able to restore its previous state from the server. We developed the client mainly in Flex using a framework we developed in house. It’s one of the more sophisticated Flex applications out there at the moment.

Application Layer (Starburst)

Our client communicates with the application layer using a proprietary remote method calling framework built on top of the RTMP (Real Time Messaging Protocol) protocol supported by the Flash player. RTMP allows a remote method call stream, video and other data to be multiplexed over a single TCP/IP connection, with remote calls packing their parameters using AMF (Action Message Format). The protocol also allows server code to call into functions that the client has registered. The framework we created allows call results to be returned asynchronously, in a different order from that in which the calls were made, so that server methods taking a little longer to generate their result do not block the return of results for subsequent calls, thereby keeping client applications responsive. The framework also handles things like automatic reconnection, the queuing of events while a connection is broken, and the resetting of client state where data has been lost after cluster events or prolonged disconnection.
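
Our actual framework is proprietary, so the sketch below only illustrates the general technique with hypothetical names: a table of pending calls keyed by call id is enough to let results complete out of order, because a slow server method simply leaves its entry pending while faster calls complete theirs.

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, not our Starburst framework: correlate out-of-order
// call results with their pending calls using a call id.
public class AsyncCallDispatcher {
    private final AtomicLong nextCallId = new AtomicLong();
    private final Map<Long, CompletableFuture<Object>> pending = new ConcurrentHashMap<>();

    // Application code calls this; the result arrives later via the future.
    public CompletableFuture<Object> call(String method, Object... args) {
        long callId = nextCallId.incrementAndGet();
        CompletableFuture<Object> future = new CompletableFuture<>();
        pending.put(callId, future);
        // ... serialize (callId, method, args) over the RTMP connection here ...
        return future;
    }

    // The connection reader thread calls this for each result, in whatever
    // order the server produced them.
    public void onResult(long callId, Object result) {
        CompletableFuture<Object> future = pending.remove(callId);
        if (future != null)
            future.complete(result);
    }
}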

The application layer itself is composed of a horizontally scalable cluster of servers running a proprietary Java software framework called Starburst, which we may eventually open source. Starburst uses the Red5 server as a container, which provides the base RTMP functionality and also embeds Tomcat, enabling us to serve JSP pages (we only serve a few HTML pages, but it is a requirement nonetheless). The current setup may change: we are looking at using Netty, and even at replacing RTMP with a bespoke protocol, but that is another discussion.

I am limited in what I can say about how Starburst works right now, but basically a client connects by making an initial HTTP “directory” request to obtain the IP address of the cluster node it should connect to. The HTTP request is made to a load balancer, and can be forwarded to any running Starburst node in the cluster, which is completely symmetric in the sense that all nodes are the same and none play special roles. Once the targeted cluster node’s IP address is returned, the client makes a direct RTMP connection to that node. The node the client is directed to is chosen using consistent hashing (a really useful technique that you should Google if you have not come across it before). The consistent hashing algorithm evenly distributes clients over the cluster nodes and, if a cluster node is lost, automatically redistributes the clients that were connected to it evenly among the remaining nodes, thereby preventing load spikes and hot spots.
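
If you have not met consistent hashing before, the following minimal sketch shows the idea (illustrative only, not Starburst code; the 100 virtual points per node are an arbitrary choice): nodes and clients hash onto the same ring, each client is served by the first node clockwise from its hash, and removing a node only remaps that node's clients.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent hash ring: map clients to nodes so that losing a node
// only remaps the clients it was serving.
public class ConsistentHashRing {
    private final SortedMap<Long, String> ring = new TreeMap<>();

    public void addNode(String nodeAddress) {
        // Several virtual points per node smooth out the distribution.
        for (int replica = 0; replica < 100; replica++)
            ring.put(hash(nodeAddress + "#" + replica), nodeAddress);
    }

    public void removeNode(String nodeAddress) {
        for (int replica = 0; replica < 100; replica++)
            ring.remove(hash(nodeAddress + "#" + replica));
    }

    public String nodeFor(String clientId) {
        // First node at or clockwise from the client's hash, wrapping round.
        SortedMap<Long, String> tail = ring.tailMap(hash(clientId));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private static long hash(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) // first 8 digest bytes as a long
                h = (h << 8) | (digest[i] & 0xFF);
            return h;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}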

The cool bit about Starburst is that it completely abstracts away the multiplicity of client and server nodes for application code writers. Clients and server side connection contexts may register to receive events, and server code written to Starburst APIs may generate events on paths or for specific users that may or may not be online, but it is only ever necessary to write business/game logic, never to consider how different parts of the system are connected. I can’t go into the means by which this is achieved at scale now, but needless to say it makes a real difference to what we do.

Database (Cassandra)

There used to be only one way to make complex apps that scale: shard your database. Sharding is the process of dividing up responsibility for different ranges of your data amongst your servers. A simple sharding strategy might be, for example, to have 27 servers and use the first letter of usernames to map responsibility for users to the servers. In practice, many organizations also shard functionality, so that different servers take on different types of function too. For example, Facebook works this way with MySQL databases. The problem with this approach is that your systems become very, very complex, and system complexity has a tendency to scale up with app complexity.
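
As a toy illustration of that first-letter scheme (assuming, plausibly, that the 27th server is a catch-all for usernames that do not start with a letter):

// Map a username to one of 27 shards: one per letter a-z, plus a catch-all.
static int shardForUser(String username) {
    char first = Character.toLowerCase(username.charAt(0));
    return (first >= 'a' && first <= 'z') ? first - 'a' : 26;
}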

When we first started with Fight My Monster, we also began by thinking about a sharded architecture using MySQL for storage. However, Fight My Monster is a pretty complex app, and it quickly became apparent that sharding was going to require too much work given the resources we had at that time. We then experimented with different ideas, and it was about this time that the NoSQL movement first came to our attention. For all the described reasons, swapping the relational database model for a less complex one that could scale without sharding immediately stood out as a bargain. We began using HBase, which is a very solid system that offers a single logical database that is horizontally scalable, but which is actually quite complex to maintain and administer. Because of that complexity, and for other reasons including throughput, we moved over to Cassandra, which has proven to be a good decision for us.

Without wishing to stir up NoSQL flame wars, Cassandra has amazing qualities. Some of the highlights are as follows:

  • Clusters provide a single logical database that is almost infinitely scalable
  • The cluster is symmetric, i.e. there are no special types of node to manage
  • Nodes organize themselves using a P2P system, and adding new nodes is simple
  • Performance is best in category, with write performance better than read performance
  • Data is safely replicated across nodes (we use a Replication Factor of 3)
  • A number of nodes can go down without affecting cluster operation

We access Cassandra using an open source Java library we originally created called Pelops. Because our data processing is highly transactional, and Cassandra does not provide locking mechanisms itself, we need to handle the distributed serialization of database operations initiated from the various Starburst nodes. For this we use another open source library that we also created called Cages in conjunction with Hadoop’s ZooKeeper system.

Using Cages and ZooKeeper has worked very well for us, but if there is a single theme running through Fight My Monster’s architecture it’s simplification! For this reason, I recently developed the Wait Chain Algorithm, which provides a way for Cassandra itself to be used as a distributed lock server, for example using the Pelops or Hector clients, thus providing a way for us to eventually do without ZooKeeper. Watch this blog for news of the first system using this algorithm.

Written by dominicwilliams

February 28, 2012 at 8:32 pm

Cassandra: Up and running quickly in Java using Pelops

with 52 comments

Pelops

In Greek mythology, after the fall of Troy Cassandra is captured by the triumphant king Agamemnon, with whom she has two sons, Pelops and Teledamus. This Java client library is Pelops’ namesake, nicknamed “Cassandra’s beautiful son” because it offers a beautiful way to code against the Cassandra database. This is a quick introduction to the library.

You can find the open source code here http://pelops.googlecode.com/

Objectives

Pelops was born to improve the quality of Cassandra code across a complex commercial project that makes extensive use of the database. The main objectives of the library are:

  • To faithfully expose Cassandra’s API in a manner that is immediately understandable to anyone:
    simple, but beautiful
  • To completely separate low-level concerns such as connection pooling from data processing code
  • To eliminate “dressing code”, so that the semantics of data processing stand clear and obvious
  • To accelerate development through intellisense, function overloading and powerful high-level methods
  • To implement strategies like load balancing based upon the per node running operation count
  • To include robust error handling and recovery that does not mask application-level logic problems
  • To track the latest Cassandra releases and features without causing breaking changes
  • To define a long-lasting paradigm for those writing client code

Up and running in 5 minutes

To start working with Pelops and Cassandra, you need to know three things:

  1. How to create a connection pool, typically once at startup
  2. How to write data using the Mutator class
  3. How to read data using the Selector class.

It’s that easy!

Creating a connection pool

To work with a Cassandra cluster, you need to start off by defining a connection pool. This is typically done once in the startup code of your application. Sometimes you will define more than one connection pool. For example, in our project, we use two Cassandra database clusters, one which uses random partitioning for data storage, and one which uses order preserving partitioning for indexes. You can create as many connection pools as you need.

To create a pool, you need to specify a name, a list of known contact nodes (the library can automatically detect further nodes in the cluster, but see notes at the end), the network port that the nodes are listening on, and a policy which controls things like the number of connections in your pool.

Here a pool is created with default policies:

Pelops.addPool(
    "Main",
    new String[] { "cass1.database.com", "cass2.database.com", "cass3.database.com"},
    9160,
    new Policy());

Using a Mutator

The Mutator class is used to make mutations to a keyspace (which in SQL speak translates as making changes to a database). You ask Pelops for a new mutator, and then specify the mutations you wish to make. These are sent to Cassandra in a single batch when you call its execute method.

To create a mutator, you must specify the name of the connection pool you will use and the name of the keyspace you wish to mutate. Note that the pool determines what database cluster you are talking to.

Mutator mutator = Pelops.createMutator("Main", "SupportTickets");

Once you have the mutator, you start specifying changes.

/**
 * Write multiple sub-column values to a super column...
 * @param rowKey                    The key of the row to modify
 * @param colFamily                 The name of the super column family to operate on
 * @param colName                   The name of the super column
 * @param subColumns                A list of the sub-columns to write
 */
mutator.writeSubColumns(
    userId,
    "L1Tickets",
    UuidHelper.newTimeUuidBytes(), // using a UUID value that sorts by time
    mutator.newColumnList(
        mutator.newColumn("category", "videoPhone"),
        mutator.newColumn("reportType", "POOR_PICTURE"),
        mutator.newColumn("createdDate", NumberHelper.toBytes(System.currentTimeMillis())),
        mutator.newColumn("capture", jpegBytes),
        mutator.newColumn("comment") ));

/**
 * Delete a list of columns or super columns...
 * @param rowKey                    The key of the row to modify
 * @param colFamily                 The name of the column family to operate on
 * @param colNames                  The column and/or super column names to delete
 */
mutator.deleteColumns(
    userId,
    "L1Tickets",
    resolvedList);

After specifying the changes, you send them to Cassandra in a single batch by calling execute. This takes the Cassandra consistency level as a parameter.

mutator.execute(ConsistencyLevel.ONE);

Note that if you need to know that a particular mutation operation has completed successfully before initiating some subsequent operation, then you should not batch those mutations together. Since a mutator cannot be re-used after it has been executed, you should create two or more mutators and execute them in sequence, with at least QUORUM consistency level, as sketched below.
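
As a sketch of that pattern (assuming a writeColumn overload along the lines of the writeSubColumns call shown earlier, and with made-up column names):

// The first mutation must be known to have succeeded before the second begins.
Mutator first = Pelops.createMutator("Main", "SupportTickets");
first.writeColumn(userId, "L1Tickets", first.newColumn("status", "ESCALATED"));
first.execute(ConsistencyLevel.QUORUM); // throws on failure, so we only continue on success

Mutator second = Pelops.createMutator("Main", "SupportTickets");
second.writeColumn(userId, "L2Tickets", second.newColumn("status", "OPEN"));
second.execute(ConsistencyLevel.QUORUM);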

Browse the Mutator class to see the methods and overloads that are available here.

Using a Selector

The Selector class is used to read data from a keyspace. You ask Pelops for a new selector, and then read data by calling its methods.

Selector selector = Pelops.createSelector("Main", "SupportTickets");

Once you have a selector instance, you can start reading data using its many overloads.

/**
 * Retrieve a super column from a row...
 * @param rowKey                        The key of the row
 * @param columnFamily                  The name of the column family containing the super column
 * @param superColName                  The name of the super column to retrieve
 * @param cLevel                        The Cassandra consistency level with which to perform the operation
 * @return                              The requested SuperColumn
 */
SuperColumn ticket = selector.getSuperColumnFromRow(
    userId,
    "L1Tickets",
    ticketId,
    ConsistencyLevel.ONE);

assert ticketId.equals(ticket.name);

// enumerate sub-columns
for (Column data : ticket.columns) {
    String name = data.name;
    byte[] value = data.value;
}

/**
 * Retrieve super columns from a row
 * @param rowKey                        The key of the row
 * @param columnFamily                  The name of the column family containing the super columns
 * @param colPredicate                  The super column selector predicate
 * @param cLevel                        The Cassandra consistency level with which to perform the operation
 * @return                              A list of matching columns
 */
List<SuperColumn> allTickets = selector.getSuperColumnsFromRow(
    userId,
    "L1Tickets",
    Selector.newColumnsPredicateAll(true, 10000),
    ConsistencyLevel.ONE);

/**
 * Retrieve super columns from a set of rows.
 * @param rowKeys                        The keys of the rows
 * @param columnFamily                   The name of the column family containing the super columns
 * @param colPredicate                   The super column selector predicate
 * @param cLevel                         The Cassandra consistency level with which to perform the operation
 * @return                               A map from row keys to the matching lists of super columns
 */
Map<String, List<SuperColumn>> allTicketsForFriends = selector.getSuperColumnsFromRows(
    Arrays.asList(new String[] { "matt", "james", "dom" }), // the friends
    "L1Tickets",
    Selector.newColumnsPredicateAll(true, 10000),
    ConsistencyLevel.ONE);

/**
 * Retrieve a page of super columns composed from a segment of the sequence of super columns in a row.
 * @param rowKey                        The key of the row
 * @param columnFamily                  The name of the column family containing the super columns
 * @param startBeyondName               The sequence of super columns must begin with the smallest super column name greater than this value. Pass null to start at the beginning of the sequence.
 * @param orderType                     The scheme used to determine how the column names are ordered
 * @param reversed                      Whether the scan should proceed in descending super column name order
 * @param count                         The maximum number of super columns that can be retrieved by the scan
 * @param cLevel                        The Cassandra consistency level with which to perform the operation
 * @return                              A page of super columns
 */
List<SuperColumn> pageTickets = selector.getPageOfSuperColumnsFromRow(
    userId,
    "L1Tickets",
    lastIdOfPrevPage, // null for first page
    Selector.OrderType.TimeUUIDType, // ordering defined in this super column family
    true, // reversed, i.e. newest first ("blog order")
    10, // count shown per page
    ConsistencyLevel.ONE);

There are a huge number of selector methods and overloads that expose the full power of Cassandra, and others, like the paginator methods, that make otherwise complex tasks simple. Browse the Selector class to see what is available here.
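
For example, the paging method above can be driven in a loop to walk an entire row. This sketch assumes, as in the getSuperColumnFromRow example earlier, that a super column's name field is the value to feed back as startBeyondName:

byte[] startBeyond = null; // null requests the first page
while (true) {
    List<SuperColumn> page = selector.getPageOfSuperColumnsFromRow(
        userId,
        "L1Tickets",
        startBeyond,
        Selector.OrderType.TimeUUIDType,
        true,  // newest first
        10,
        ConsistencyLevel.ONE);
    for (SuperColumn ticket : page) {
        // ... render the ticket ...
    }
    if (page.size() < 10)
        break; // fewer results than requested means this was the last page
    startBeyond = page.get(page.size() - 1).name; // resume after the last ticket
}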

Other stuff

All the main things you need to start using Pelops have been covered, and with your current knowledge you can easily feel your way around Pelops inside your IDE using intellisense. Some final points it will be useful to keep in mind if you want to work with Pelops:

  • If you need to perform deletions at the row key level, use an instance of the KeyDeletor class (call Pelops.createKeyDeletor).
  • If you need metrics from a Cassandra cluster, use an instance of the Metrics class (call Pelops.createMetrics).
  • To work with Time UUIDs – globally unique identifiers that can be sorted by time, and which you will find very useful throughout your Cassandra code – use the UuidHelper class.
  • To work with numbers stored as binary values, use the NumberHelper class.
  • To work with strings stored as binary values, use the StringHelper class (a short combined sketch of these helper classes follows this list).
  • Methods in the Pelops library that cause interaction with Cassandra throw the standard
    Cassandra exceptions defined here.
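
Here is a quick combined sketch of those helpers; the UuidHelper and NumberHelper calls appear in the Mutator example above, while the StringHelper method name is my assumption:

byte[] ticketId = UuidHelper.newTimeUuidBytes();                   // time-sortable unique id
byte[] created = NumberHelper.toBytes(System.currentTimeMillis()); // long as binary value
byte[] category = StringHelper.toBytes("videoPhone");              // string as binary value (assumed method name)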

The Pelops design secret

One of the key design decisions, and one that at the time of writing distinguishes Pelops, is that data processing code written by developers does not involve itself with connection pooling or management. Instead, classes like Mutator and Selector borrow connections to Cassandra from a Pelops pool for just the periods that they need to read and write to the underlying Thrift API. This has two advantages.

Firstly, and most obviously, code becomes cleaner and developers are freed from connection management concerns. More subtly, this enables the Pelops library to manage connection pooling completely by itself, and, for example, to keep track of how many outstanding operations are currently running against each cluster node.

This in turn enables Pelops to perform more effective client-side load balancing, by directing each new operation to the node with the fewest outstanding operations currently running. Because of this architectural choice, it will even be possible to offer strategies in the future where, for example, nodes are actually queried to determine their load.
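
The idea can be sketched as follows (illustrative only, not Pelops' internal code):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Least-loaded balancing: each node carries a running count of in-flight
// operations, and new operations go to the node with the fewest. Assumes at
// least one node has been added; small races are tolerable for balancing.
public class LeastLoadedBalancer {
    private final Map<String, AtomicInteger> inFlight = new ConcurrentHashMap<>();

    public void addNode(String node) {
        inFlight.putIfAbsent(node, new AtomicInteger());
    }

    public String acquireNode() {
        String best = null;
        int bestCount = Integer.MAX_VALUE;
        for (Map.Entry<String, AtomicInteger> e : inFlight.entrySet()) {
            int count = e.getValue().get();
            if (count < bestCount) {
                best = e.getKey();
                bestCount = count;
            }
        }
        inFlight.get(best).incrementAndGet(); // an operation is now running there
        return best;
    }

    public void releaseNode(String node) {
        inFlight.get(node).decrementAndGet(); // the operation finished
    }
}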

To see how the library abstracts connection pooling away from the semantics of data processing, take a look at the execute method of Mutator and the tryOperation method of Operand. This is the foundation upon which Pelops greatly improves over existing libraries that have modelled connection management on pre-existing SQL database client libraries.


That’s all. I hope you get the same benefits from Pelops that we did.

Written by dominicwilliams

June 11, 2010 at 12:31 pm

Locking and transactions over Cassandra using Cages

with 43 comments

Introduction

Anyone following my occasional posts will know that my team and I are working on a new kids game / social network called http://www.FightMyMonster.com. We are trying to break new ground with this project in many ways, and to support the data-intensive nature of what we are trying to do, we eventually selected the Cassandra database after working with several others.

This post is about a library we are using with Cassandra called Cages. Using Cages, you can perform much more advanced data manipulation and storage over a Cassandra database. This post explains why and gives some more information.

You can find Cages here http://cages.googlecode.com.

Brief Background

For those that aren’t already familiar with Cassandra (skip this if you are), it can be described as the best representative of a new breed of fast, easily scalable databases. Write operations are evenly spread across a cluster of machines, removing the bottleneck found in traditional SQL database clusters, and it can continue operating even when some nodes are lost or partitioned. The cluster is symmetric in the sense there is no master node, nodes communicate with each other using a P2P protocol and can be easily added and removed by an administrator.

In order to deliver these characteristics, which are particularly valuable to Web 2.0 enterprises but will likely prove useful in other industries too, Cassandra offers what is known as a NoSQL model. This model is significantly different from a traditional SQL model, and many coming from more traditional database backgrounds will more easily understand Cassandra as a highly scalable, highly resilient distributed structured storage engine. While NoSQL offers some unique advantages to developers when compared to SQL, it is also the case that whereas in SQL complex operations can be specified in a single statement that is either executed or not (i.e. that has ACID properties), in Cassandra complex operations on data must usually be composed from several different operations, which can only be made reliable individually.

What is Cages for?

In many cases, websites and systems can be built against Cassandra without regard to ACID issues. Data storage and manipulation can be limited to operations against single rows (and for those that don’t know, rows in NoSQL models are really like multi-level hash tables which can contain hierarchical “ready-joined” data, and generally offer many more possibilities than SQL rows). Where a mutation of these rows must be reliable, or immediately seen by other clients of the database, Cassandra allows the developer to choose from a range of consistency levels that specify the tradeoff between performance, safety of storage and the timeliness with which data becomes visible to all.

This system is undeniably very effective, but when the systems you are building involve complex data structures and manipulation, you can still quickly reach situations where your logical operations necessarily involve several individual Cassandra read and write operations across multiple rows. Cassandra does not get involved in managing the safety and reliability of operations at the higher logical level, which means guaranteeing the logical consistency of your data can require some extra work. Some people, particularly those wedded to SQL databases, advocate storing some parts of your data in traditional SQL databases. For us though, it is most definitely preferable to develop and use Cages!

What is Cages?

Cages is a new Java library that provides distributed synchronization functionality, and soon additional functionality for things like transactions, by using the services of a ZooKeeper server or cluster. ZooKeeper is a very active project and the system is currently widely used. It started life as a Yahoo Research project (see http://research.yahoo.com/project/1849) and is now an important Apache project (see http://hadoop.apache.org/zookeeper/). Cages has wide application, but its development will be very much driven by needs in relation to Cassandra.

Using Cages for locking

Cages offers three locking classes, ZkReadLock, ZkWriteLock and ZkMultiLock.

Single path locking

The simplest application of Cages can be to enforce correct updates on data values inside Cassandra (or some other NoSQL database). For example, you may have issues with that old chestnut, the Lost Update Problem. This happens where you read the data with one operation, modify the data and then write it back with a second operation. Problems occur when another client performs the same operation simultaneously, such that the last client to write back the modified value will overwrite the modifications made by the other.

Thus in its most simple form, two clients wish to donate some money to a bank balance. Both simultaneously read the same bank balance value B1. The first client adds donation D1, and writes back (B1 + D1). The second client adds donation D2, and writes back (B1 + D2). Unfortunately bank balance B2 = B1 + D2, and donation D1 has been lost.

Cages provides an easy fix:

    void depositMoney(int amount) {
        ZkWriteLock lock = new ZkWriteLock("/accounts/" + accountId + "/balance");
        lock.acquire();
        try {
            // 1. Read the balance
            // 2. Update the balance
            // 3. Write the balance back
        } finally {
            lock.release();
        }
    }

Note that the paths passed to the lock classes can represent actual data paths within a NoSQL model, or can simply represent logical control over parts of a wider data model (so long as your application faithfully adheres to the rules you set).

Multi path locking

The Lost Update Problem is the simplest locking scenario where Cages can be applied. In our case, while many parts of our system use Cassandra without locking, often with low consistency levels for maximum performance, there are several areas where we necessarily perform complex operations over contended data that involve numerous individual read and write operations. To begin with, we decided to handle these cases by nesting the ZkReadLock and ZkWriteLock single path locking primitives. However, there is a problem doing this in a distributed environment.

It is a simple fact that in a distributed environment, many situations where you acquire single path locks in a nested manner can result in deadlock. For example, if one operation sequentially tries to acquire R(P1) then W(P2), and a second operation simultaneously tries to acquire R(P2) then W(P1), deadlock will likely result: the first operation will acquire R(P1) and the second operation will acquire R(P2), but then the first operation will block waiting to acquire W(P2) and the second operation will block waiting to acquire W(P1).
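
Sketched with the Cages lock classes (the paths are hypothetical), the hazard looks like this:

// Operation A: acquires R(/P1), then tries for W(/P2).
ZkReadLock readP1 = new ZkReadLock("/P1");
readP1.acquire();
try {
    ZkWriteLock writeP2 = new ZkWriteLock("/P2");
    writeP2.acquire(); // blocks while operation B holds R(/P2)
    try {
        // ... read P1, write P2 ...
    } finally {
        writeP2.release();
    }
} finally {
    readP1.release();
}

// Operation B simultaneously does the mirror image: acquires R(/P2), then
// blocks trying for W(/P1). Each now waits on the other: deadlock.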

Avoiding these problems with single path locks is no simple matter. For a start, the logical detection of closed wait graphs (deadlock) in the distributed environment is difficult and expensive to perform. The simplest approach to solving the problem is to try to acquire locks with a timeout, such that if you get into a deadlock situation, your acquire() calls throw an exception and you abandon your attempt. The problem here though is that your code has to handle the exception, and possibly rollback parts of the operation performed earlier under the protection of the outer lock.

For all these reasons, when an operation needs to acquire locks over multiple paths in the distributed environment, ZkMultiLock is the class to use.

ZkMultiLock allows you to specify any number of read and write locks over paths, which may then all be acquired “simultaneously”. If your operation can acquire all the locks it needs together at the outset using ZkMultiLock, this avoids any possibility of deadlock. This gives slightly worse performance where multiple paths are specified and locks on the paths are highly contended. But in practice locks are rarely that highly contended, and you just need to guard against the disaster of simultaneously running operations interfering with each other and corrupting data. Because of the dangers of deadlock, in the Fight My Monster project we have mandated that only ZkMultiLock may be used unless there are very special reasons otherwise, a situation we have not yet encountered.

    void executeTrade(long lotId, String sellerId, String buyerId) {
        // In the following we need to hold write locks over both the seller and buyer's account balances
        // so they can be checked and updated correctly. We also want a lock over the lot, since the value
        // of lots owned might be used in conjunction with the bank balance by code considering the
        // total worth of the owner. Acquiring the required locks simultaneously using ZkMultiLock avoids
        // the possibility of accidental deadlock occurring between this client code and other client code
        // contending access to the same data / lock paths.
        ZkMultiLock mlock = new ZkMultiLock();
        mlock.addWriteLock("/Bank/accounts/" + sellerId);
        mlock.addWriteLock("/Bank/accounts/" + buyerId);
        mlock.addWriteLock("/Warehouse/" + lotId);
        mlock.acquire();
        try {
            // 1. check buyer has sufficient funds
            // 2. debit buyer's account
            // 3. credit seller's account
            // 4. change ownership of goods
        } finally {
             mlock.release();
        }
    }

Transactions for Cassandra

Transactions are a planned feature at the time of writing, 12/5/2010. It shouldn’t be too long before they make it into the library, so I will explain a bit about them here.

Locking allows you to synchronize sequences of read and write (mutation) operations across rows stored on your Cassandra cluster. However, the locking classes do not solve the problem that occurs when part way through a complex operation your client machine expires, leaving the data inside the distributed database in a logically inconsistent state. For many applications the likelihood of this occurring is low enough for the locking classes alone to be sufficient. But there may be a small number of operations within applications for which data simply must be logically consistent, and even a very rare failure is unacceptable. This is where transactions come in.

For those that are interested, the following explains how they will work.

A new ZkTransaction class will provide the functionality, and it will need to be used in conjunction with the ZkMultiLock class. ZkTransaction will provide a simplified version of the Cassandra Thrift API that allows a series of data mutation operations to be specified. Client operations will proceed by first specifying the necessary locks that must be held, and then specifying the set of data mutations that must be performed by the transaction. When the transaction has been specified, its commit() method must be called, passing the ZkMultiLock instance as a parameter.

At this point, internally Cages will add a reference to a transaction node created on ZooKeeper from each single path lock node held. The ZkTransaction instance reads from Cassandra the current values of the data it is required to modify, and writes it into the transaction node as a “before” state. Once this is done, it sets about applying the data mutations specified in the necessary sequence of individual Cassandra read and write (mutate) operations. Once all operations are performed, the references to the transaction node from within the locks are removed, and then finally the transaction node itself is deleted – the transaction has now been committed, and the developer can release() the ZkMultiLock.

ZkTransaction can provide a guarantee of consistency for Cages clients because if during the execution of the sequence of individual Cassandra mutation operations the client machine suddenly dies, Cages will immediately revoke the locks the client holds. From this point any instances of ZkReadLock, ZkWriteLock or ZkMultiLock wishing to acquire the released paths must first rollback the transaction node by returning the relevant data to its original “before” state specified. The key point is that any processes that need to see the data in a logically consistent state, and therefore always acquire locks referencing the data in question before accessing it, will always see it as such. This provides a form of ACID for complex operations against a Cassandra database.

    void executeTrade(long lotId, String sellerId, String buyerId) {
        ZkMultiLock mlock = new ZkMultiLock();
        mlock.addWriteLock("/Bank/accounts/" + sellerId);
        mlock.addWriteLock("/Bank/accounts/" + buyerId);
        mlock.addWriteLock("/Warehouse/" + lotId);
        mlock.acquire();
        try {
            // 1. check that buyer has sufficient funds
            // ....

            // 2. perform mutations using transaction object
            ZkTransaction transaction = new ZkTransaction(NoSQL.Cassandra);
            transaction.begin(mlock);
            try {
                // 2. debit buyer's account
                transaction.insert(buyerId, "accounts", bytes("balance"), bytes(newBalance));
                // 3. credit seller's account
                // ...
                // 4. change ownership of goods
                // ...
            } finally {
                transaction.commit();
            }
        } finally {
             mlock.release();
        }
    }

Scalability and hashing ZooKeeper clusters

It is worth saying first off that a three node ZooKeeper cluster using powerful machines should be able to handle a considerable workload, and that where usage of locking and transactions is limited on an as-needed basis, such a setup will be able to provide for the needs of many Internet scale applications. However, it is easy to conceive of Cassandra being applied more widely outside of typical Web 2.0 norms where usage of locking and transactions is much heavier, and therefore the scalability of ZooKeeper must be examined.

The main issue is that for the purposes described it is not desirable to scale ZooKeeper clusters beyond three nodes. The reason is that while adding nodes scales up read performance, write performance actually starts degrading because of the need to synchronize write operations across all members, and therefore clustering really offers availability rather than performance. A good overview of the actual performance parameters can be found here http://hadoop.apache.org/zookeeper/docs/r3.3.0/zookeeperOver.html. The question, then, is what to do when ZooKeeper becomes a bottleneck.

The solution we suggest is to run more than one ZooKeeper cluster for the purposes of locking and transactions, and simply hash locks and transactions onto particular clusters. This will be the final feature added to Cages.
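
That hashing might look something like the sketch below (hypothetical connect strings and class name; the feature is not yet in Cages):

// Deterministically map each lock path onto one of several small ZooKeeper
// clusters, so write load is spread while every client still agrees on
// which cluster manages a given path.
public class ZkClusterChooser {
    private final String[] clusters = {
        "zk-a1:2181,zk-a2:2181,zk-a3:2181", // three node cluster A
        "zk-b1:2181,zk-b2:2181,zk-b3:2181"  // three node cluster B
    };

    public String clusterForPath(String lockPath) {
        // Mask the sign bit so the bucket index is always non-negative.
        int bucket = (lockPath.hashCode() & 0x7fffffff) % clusters.length;
        return clusters[bucket];
    }
}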

Note: since I wrote the above Eric Hauser kindly drew my attention to the new “Observers” feature in ZooKeeper 3.3. This may greatly raise the limit at which hashing to separate 3 node clusters becomes necessary. I am hoping to collate performance information and tests in the near future so people have more of an idea what to expect. See http://hadoop.apache.org/zookeeper/docs/r3.3.0/zookeeperObservers.html

That’s it. Hope it was interesting. Please bear with me as Cages develops further over the coming weeks and feel free to test and report.

Final note

Check out the comments too, because there are already several useful clarifications and expansions there.

Written by dominicwilliams

May 12, 2010 at 10:10 pm

Build Google Guava on Mac OS X

leave a comment »

Yo if you are a Java programmer and you haven’t heard of Google Guava, then read up!!!
http://code.google.com/p/guava-libraries/

Guava provides numerous classes that make up for various Java standard library deficiencies encountered by people in the RIA space. Generally I’m not a great fan of using third party libraries because of the support and dependency issues, but when it’s the main Java utility library that Google maintains to support its own app developers, I make an exception! So if you regularly implement things like event listener sinks, don’t reinvent the wheel, just use Guava’s HashMultimap (although a ConcurrentWeakHashMultiMap would be nice, but let’s not look a gift horse in the mouth). On a lesser level, this library also comes with things like character processing utilities that make tasks like username validation much simpler.
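
For instance, an event listener sink needs just a few lines with HashMultimap (the event names here are made up):

import com.google.common.collect.HashMultimap;
import com.google.common.collect.Multimap;

// One event name can map to many listeners; the multimap handles that for free.
Multimap<String, Runnable> listeners = HashMultimap.create();
listeners.put("monsterDefeated", () -> System.out.println("award trophy"));
listeners.put("monsterDefeated", () -> System.out.println("update leaderboard"));

// Fire an event by running everything registered under its name.
for (Runnable listener : listeners.get("monsterDefeated"))
    listener.run();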

However on Mac OS X you might have problems building the source – at least with Guava 1.5 – and a jar file isn’t included.

Solution, once inside the guava folder:

1. At your command prompt do:

you$ export JAVA5_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.5

2. Using your favorite editor, add the following target to build.xml

<target name="jar" depends="compile">
    <jar destfile="google-guava.jar" basedir="build/classes" />
</target>

3. Using your favorite editor, update the java5bootclasspath property

<property name="java5bootclasspath" value="${env.JAVA5_HOME}/Classes/classes.jar"/>

You should now have a nice google-guava.jar file.

Written by dominicwilliams

March 17, 2010 at 2:07 pm

Posted in Java, Libraries
