Dominic Williams

Occasionally useful posts about RIAs, Web scale computing & miscellanea

Posts Tagged ‘client library

Cassandra: Up and running quickly in Java using Pelops

with 52 comments

Pelops

In Greek mythology Cassandra is captured by the triumphant king Agamemnon after the fall of Troy, with whom she has two sons, Pelops and Teledamus. This Java client library is Pelop’s namesake nicknamed “Cassandra’s beautiful son” because it offers a beautiful way to code against the Cassandra database. This is a quick introduction to the library.

You can find the open source code here http://pelops.googlecode.com/

Objectives

Pelops was born to improve the quality of Cassandra code across a complex commercial project that makes extensive use of the database. The main objectives the library are:

  • To faithfully expose Cassandra’s API in a manner that is immediately understandable to anyone:
    simple, but beautiful
  • To completely separate low-level concerns such as connection pooling from data processing code
  • To eliminate “dressing code”, so that the semantics of data processing stand clear and obvious
  • To accelerate development through intellisense, function overloading and powerful high-level methods
  • To implement strategies like load balancing based upon the per node running operation count
  • To include robust error handling and recovery that does not mask application-level logic problems
  • To track the latest Cassandra releases and features without causing breaking changes
  • To define a long-lasting paradigm for those writing client code

Up and running in 5 minutes

To start working with Pelops and Cassandra, you need to know three things:

  1. How to create a connection pool, typically once at startup
  2. How to write data using the Mutator class
  3. How to read data using the Selector class.

It’s that easy!

Creating a connection pool

To work with a Cassandra cluster, you need to start off by defining a connection pool. This is typically done once in the startup code of your application. Sometimes you will define more than one connection pool. For example, in our project, we use two Cassandra database clusters, one which uses random partitioning for data storage, and one which uses order preserving partitioning for indexes. You can create as many connection pools as you need.

To create a pool, you need to specify a name, a list of known contact nodes (the library can automatically detect further nodes in the cluster, but see notes at the end), the network port that the nodes are listening on, and a policy which controls things like the number of connections in your pool.

Here a pool is created with default policies:

Pelops.addPool(
    "Main",
    new String[] { "cass1.database.com", "cass2.database.com", "cass3.database.com"},
    9160,
    new Policy());

Using a Mutator

The Mutator class is used to make mutations to a keyspace (which in SQL speak translates as making changes to a database). You ask Pelops for a new mutator, and then specify the mutations you wish to make. These are sent to Cassandra in a single batch when you call its execute method.

To create a mutator, you must specify the name of the connection pool you will use and the name of the keyspace you wish to mutate. Note that the pool determines what database cluster you are talking to.

Mutator mutator = Pelops.createMutator("Main", "SupportTickets");

Once you have the mutator, you start specifying changes.

/**
 * Write multiple sub-column values to a super column...
 * @param rowKey                    The key of the row to modify
 * @param colFamily                 The name of the super column family to operate on
 * @param colName                   The name of the super column
 * @param subColumns                A list of the sub-columns to write
 */
mutator. writeSubColumns(
    userId,
    "L1Tickets",
    UuidHelper.newTimeUuidBytes(), // using a UUID value that sorts by time
    mutator.newColumnList(
        mutator.newColumn("category", "videoPhone"),
        mutator.newColumn("reportType", "POOR_PICTURE"),
        mutator.newColumn("createdDate", NumberHelper.toBytes(System.currentTimeMillis())),
        mutator.newColumn("capture", jpegBytes),
        mutator.newColumn("comment") ));

/**
 * Delete a list of columns or super columns...
 * @param rowKey                    The key of the row to modify
 * @param colFamily                 The name of the column family to operate on
 * @param colNames                  The column and/or super column names to delete
 */
mutator.deleteColumns(
    userId,
    "L1Tickets",
    resolvedList);

After specifying the changes, you send them to Cassandra in a single batch by calling execute. This takes the Cassandra consistency level as a parameter.

mutator.execute(ConsistencyLevel.ONE);

Note that if you need to know a particular mutation operation has completed successfully before initiating some subsequent operation, then you should not batch your mutations together. Since you cannot re-use a mutator after it has been executed, you should create two or more mutators, and execute them with at least a QUORUM consistency level.

Browse the Mutator class to see the methods and overloads that are available
here

Using a Selector

The Selector class is used to read data from a keyspace. You ask Pelops for a new selector, and then read data by calling its methods.

Selector selector = Pelops.createSelector("Main", "SupportTickets");

Once you have a selector instance, you can start reading data using its many overloads.

/**
 * Retrieve a super column from a row...
 * @param rowKey                        The key of the row
 * @param columnFamily                  The name of the column family containing the super column
 * @param superColName                  The name of the super column to retrieve
 * @param cLevel                        The Cassandra consistency level with which to perform the operation
 * @return                              The requested SuperColumn
 */
SuperColumn ticket = selector.getSuperColumnFromRow(
    userId,
    "L1Tickets",
    ticketId,
    ConsistencyLevel.ONE);

assert ticketId.equals(ticket.name)

// enumerate sub-columns
for (Column data : ticket.columns) {
    String name = data.name;
    byte[] value = data.value;
}

/**
 * Retrieve super columns from a row
 * @param rowKey                        The key of the row
 * @param columnFamily                  The name of the column family containing the super columns
 * @param colPredicate                  The super column selector predicate
 * @param cLevel                        The Cassandra consistency level with which to perform the operation
 * @return                              A list of matching columns
 */
List<SuperColumn> allTickets = selector.getSuperColumnsFromRow(
    userId,
    "L1Tickets",
    Selector.newColumnsPredicateAll(true, 10000),
    ConsistencyLevel.ONE);

/**
 * Retrieve super columns from a set of rows.
 * @param rowKeys                        The keys of the rows
 * @param columnFamily                   The name of the column family containing the super columns
 * @param colPredicate                   The super column selector predicate
 * @param cLevel                         The Cassandra consistency level with which to perform the operation
 * @return                               A map from row keys to the matching lists of super columns
 */
Map<String, List<SuperColumn>> allTicketsForFriends = selector.getSuperColumnsFromRows(
    Arrays.asList(new String[] { "matt", "james", "dom" }, // the friends
    "L1Tickets",
    Selector.newColumnsPredicateAll(true, 10000),
    ConsistencyLevel.ONE);

/**
 * Retrieve a page of super columns composed from a segment of the sequence of super columns in a row.
 * @param rowKey                        The key of the row
 * @param columnFamily                  The name of the column family containing the super columns
 * @param startBeyondName               The sequence of super columns must begin with the smallest super column name greater than this value. Pass null to start at the beginning of the sequence.
 * @param orderType                     The scheme used to determine how the column names are ordered
 * @param reversed                      Whether the scan should proceed in descending super column name order
 * @param count                         The maximum number of super columns that can be retrieved by the scan
 * @param cLevel                        The Cassandra consistency level with which to perform the operation
 * @return                              A page of super columns
 */
List<SuperColumn> pageTickets = getPageOfSuperColumnsFromRow(
    userId,
    "L1Tickets",
    lastIdOfPrevPage, // null for first page
    Selector.OrderType.TimeUUIDType, // ordering defined in this super column family
    true, // blog order
    10, // count shown per page
    ConsistencyLevel.ONE);

There are a huge number of selector methods and overloads which expose the full power of Cassandra, and others like the paginator methods that make otherwise complex tasks simple. Browse the Selector class to see what is available here

Other stuff

All the main things you need to start using Pelops have been covered, and with your current knowledge you can easily feel your way around Pelops inside your IDE using intellisense. Some final points it will be useful to keep in mind if you want to work with Pelops:

  • If you need to perform deletions at the row key level, use an instance of the KeyDeletor class (call Pelops.createKeyDeletor).
  • If you need metrics from a Cassandra cluster, use an instance of the Metrics class (call Pelops.createMetrics).
  • To work with Time UUIDs, which are globally unique identifiers that can be sorted by time – which you will find to be very useful throughout your Cassandra code – use the UuidHelper class.
  • To work with numbers stored as binary values, use the NumberHelper class.
  • To work with strings stored as binary values, use the StringHelper class.
  • Methods in the Pelops library that cause interaction with Cassandra throw the standard
    Cassandra exceptions defined here.

The Pelops design secret

One of the key design decisions that at the time of writing distinguishes Pelops, is that the data processing code written by developers does not involve connection pooling or management. Instead, classes like Mutator and Selector borrow connections to Cassandra from a Pelops pool for just the periods that they need to read and write to the underlying Thrift API. This has two advantages.

Firstly, obviously, code becomes cleaner and developers are freed from connection management concerns. But also more subtly this enables the Pelops library to completely manage connection pooling itself, and for example keep track of how many outstanding operations are currently running against each cluster node.

This for example, enables Pelops to perform more effective client load balancing by ensuring that new operations are performed against the node to which it currently has the least outstanding operations running. Because of this architectural choice, it will even be possible to offer strategies in the future where for example nodes are actually queried to determine their load.

To see how the library abstracts connection pooling away from the semantics of data processing, take a look at the execute method of Mutator and the tryOperation method of Operand. This is the foundation upon which Pelops greatly improves over existing libraries that have modelled connection management on pre-existing SQL database client libraries.

*–

That’s all. I hope you get the same benefits from Pelops that we did.

Written by dominicwilliams

June 11, 2010 at 12:31 pm