Cassandra: the importance of system clocks, avoiding OOM and how to escape (unnecessary) OOM meltdown!
First of all, big thanks to jbellis and driftx (Jonathan Ellis and Brandon Williams of DataStax). Without them, I’d still be up trying to fix a problem node I’ll talk about later. They were no doubt horrified at an issue created through wanton misconfiguration, but they helped out amazingly anyway. For the benefit of other people as reckless as us, and who are equally unskilled in the black arts, here are some cautionary tales and a hard core tip for recovering from a serious OOM situation.
Before going any further, I’ll just give you some background to Fight My Monster, which went live late last year. We are a self-funded London startup. We have had enough funds to develop a great product, promote it using TV commercials, but everything else has to be done on a shoestring. Cassandra is the primary workhorse for processing data, and we ask a LOT of it.
The game itself is an MMO, social networking and minigames hybrid. Since launching it has grown quite fast, with most traffic being concentrated 3pm -> 10pm GMT weekdays and weekends (basically when kids in the UK are free – since we haven’t started with international markets yet). The site is characterized by a huge number of transactions, many of which require consistency. Furthermore, we write and rewrite huge amounts of data. We log instant chat messages for the security of kids, we copy kid’s room (aka “Wall”) posts to all their friend’s feeds, and we let kids grow, fight and trade unlimited numbers of monsters for nuggets, with every transaction being recorded, generating a storm of writes. And that’s just the start of it.
My only experience of highly loaded SQL systems is from running large user forums through another company. What I can tell you is that SQL systems just wouldn’t give Fight My Monster anywhere near comparable performance and reliability at the same cost. To some extent, Cassandra makes FMM possible. That said, there are a couple of pitfalls you really need to avoid…
Beware Virtualization: Your Clocks Need To Work!
Before discussing column family configuration and OOM, I will outline an equally serious problem that can be caused by virtualization. I decided to give up managing server racks a long time ago. Fight My Monster combines a CDN with a cloud service where you can actually provision *dedicated* machines on demand. I recommend anyone to do the same if you can, because it simplifies resource management.
However, fantastic as the cloud service we use is, in the early days aspects of our cloud computing infrastructure caused problems with our data. The problem was this: although we are able to provision dedicated machines, our cluster nodes still run Ubuntu on top of a VMware hypervisor that enables features like snapshotting. Unfortunately, what we did not realize is that system clock drift is a very common issue with virtualized systems (see kb.vmware.com/kb/1006427 to get a feel), and this is very bad in Cassandra setups because timestamps affect the serialization of writes.
Unbeknownst to us, when our virtual servers were under load, their system clocks were drifting in a dramatic way. Ntpd would try to keep the clocks in sync with the Internet NTP servers though periodic tiny adjustments as normal, but as we found out, when drift is too rapid ntpd temporarily gives up. Total drift would thus continue growing until system load abated at which point ntpd, upon deciding it could now keep everything in sync again, would suddenly bring the clock into line with the network time servers, sometimes setting the clock back even 20 minutes or so into the past!!! Bizarrely, this was also happening to 1000s of LAMP servers using the service, but nobody had noticed.
In our case, running a distributed database, we certainly noticed the effects. At lot of our processing is transactional. For example, each time a monster fight happens, the monsters involved change and have their histories updated, and the monster owners have their monster lists and nugget bank balances adjusted. Often transactions stack up against each other with users engaging in multiple fights and trades in a short space of time. Where necessary, we write with consistency level QUORUM and use the Cages + ZooKeeper distributed locking system to prevent lost updates and ensure everything is serialized properly (see github.com/s7).
But we were constantly suffering inexplicable data states as consequence of broken serialization. We were convinced the problem lay inside Cages or Cassandra (an upside of the situation was that Cages got many extra hours of code review). Eventually though, I noticed a fight record stored in Cassandra where a monster went into a fight with attributes that it was to gain from a fight in the future. This led to the realization the problem had something to do with clocks!
The key point: Cassandra cluster serializes write operations using the timestamp you send with them (the timestamp is often created for you if you use a library like Pelops, see github.com/s7) In fact, timestamps enable Cassandra to use the clock as a serialization mechanism among its loosely connected nodes. If the clocks on your application servers jump about, so that the timestamps you send with write operations are wrong, you’re going to have problems.
Moral: if you are using virtual servers for your Cassandra setup, pay very close attention to how the system clocks are working and configure ntpd carefully.
Each case will be different, but for those interested this is what we did:
- Added “noapictimer irqpoll” to end of kernel line in /boot/menu.lst
- Added following line to /etc/rc.local
echo “jiffies” > /sys/devices/system/clocksource/clocksource0/current_clocksource
- Configured ntp better e.g. ntp.conf
# don’t give up so easily, even though this should never happen
tinker panic 0
# log your drift and stats and review them periodically
statistics loopstats peerstats clockstats
filegen loopstats file loopstats type day enable
filegen peerstats file peerstats type day enable
filegen clockstats file clockstats type day enable
# use lots of NTP servers to get closest synchronization to actual time
Our drift is now max 1-2ms under load and Cages/ZooKeeper/Cassandra is now serializing millions of transactions just fine.
Avoiding OOM: Pay Close Attention to Column Family Configuration
When you create new column families in a keyspace, Cassandra automatically configures them with sensible defaults for you. These defaults control things like the size of associated caches, which are actually very important to operation. Because the defaults are sensible for new installations, or for projects with relatively small numbesr of column families, it is easy to forget these parameters and concentrate purely on data design as you evolve. That’s what we did, so we had to learn the following the hard way!
Several Cassandra aficionados have expressed shock and horror at the number of column families we use in Fight My Monster: We now have more than 50. But this isn’t down to bad design. Really! Because our Fight My Monster is destination site, it provides wide ranging functionality and while some of the column families belong to legacy code and are being deleted or consolidated, the simple fact is that we need a lot of them. Unless you are a Web 2.0 site addressing a single very specific need, I would say the number of column families you use will increase inevitably. And anyway, you should still make sure you know about this…
So what’s the problem? The problem is that Cassanda is bootstrapped with a fixed maximum heap size. When it is running, there are therefore limits on how much data it can cache in memory at once, and if you misconfigure it sufficiently so that it reaches a situation where it needs more memory than is available, it will crash and dump core with an Out of Memory exception. Furthermore, if you have not made reasonable attempts at setting appropriate column family parameters, you can find yourself in a situation where a Cassandra node dies with OOM when you try to restart it, which is what happened to us.
First the basics: A Cassandra node sends new writes to a column family to its commit log, and then on to a in-memory data structure called a memtable where they are aggregated before being written to disk (memtables absorb overwrites and also act as read caches) . The memtable associated with a column family has three key parameters which affect memory consumption. There are some others too, which you can see by examining a column family configuration.
These three key parameters are 1. memtable_operations_in_millions 2. memtable_throughput_in_mb and 3. memtable_flush_after_mins. The first parameter controls the maximum number of writes that the memtable will hold in memory before flushing them to disk. So for example if this number is 0.75, it will hold 750,000 writes in memory before flushing them. The second parameter is the maximum total size of the values of the column that must be written before the memtable is flushed from memory. However, it only pertains to the size of the column values, not their names, which can also be important. The third parameter is the maximum time in minutes a memtable should reside in memory before it is flushed.
Configuring these parameters requires care and thought, and you should use nodetool to examine statistics for your column families to work out the best values. Each column family must be treated on a case by case basis. For example, if a column family stores lots of small columns, then the total calculated size of the values will be small in memory. But simple Java objects representing operations/column writes in memory have an overhead, and if you set memtable_operations_in_millions too high the actual maximum memory consumed by the memtable will be able to greatly exceed the memtable_throughput_in_mb value. It must be possible for Cassandra to allocate the maximum memory that all of your memtables might consume from its heap, while still leaving memory available for other functions. A suggested calculation has been:
heap size = memtable_throughput_in_mb * 3 * number of hot CFs + 1G + internal caches
Although in a perfect world database systems would be clever enough to configure these thresholds themselves, what you cannot do is leave a system with a large number of column families with the defaults.
In our case, we had legacy defaults which meant memtable_operations_in_millions=0.75, memtable_throughput_in_mb=160 and memtable_flush_after_mins=60 across all our column families. Since many of our hot column families store small pieces of data, the total memory overhead of many memtables would have been 400MB or more. Memtables fill up when you are under load, exactly when you don’t want OOM to raise its head!!! Under load, our memtables wanted more memory than the heap could provide.
I found a special way of triggering the problem. I forgot to restart a node after a rolling upgrade. By the time I realized it was down, there were a storm of hinted writes waiting for it from the other nodes (writes that the other nodes saved on its behalf for later application because it was not available to receive them at the time, a bit like your neighbor holding a package from DHL for you that arrived when you were out). When the node came up, these writes stormed into the commit logs, and then into the memtables which quickly filled up. Total memory was quickly exhausted before they could be flushed, leading the node to crash with OOM!
The Black Art of Recovering from OOM Meltdown
If you have not done so you need to check your column family configurations now because the situation can be worse than a node simply dying with an OOM exception. You can find as we did with our badly misconfigured system that a node cannot be restarted! Very possibly, many of the writes that caused your node to crash will remain in your commit logs. When you bring the node backup, Cassandra will try to replay these logs filling up the memtables again and if you’re unlucky causing another OOM core dump!
If anyone reaches this situation, I hope for their sake they find this post!!! Caveat emptor: This is not advice, it merely reflects what we did to solve our problems. I am not in any way responsible for the consequences if this doesn’t work That said, this is what you do…
Firstly you need to bring your service *down* (yes, your *site* or *game* needs to be turned off). In our case we use RF 3 so only the performance of our service had suffered up until this point. But when you start messing with the commit logs, you need to make sure no new writes are occurring…
Once no new writes can take place, proceed as follows. Firstly increase the heap allocated to Cassandra to the maximum value your server will support (i.e. using the -Xmx JVM parameter). Try restarting Cassandra again. If you are successful, it should replay all the logs successfully and flush the memtables to disk. In this case, your original commit logs will be deleted and new ones created. You can now stop Cassandra, reset the heap size, and restart. You can then reconfigure your column family parameters, and hopefully be in the clear.
If on the other hand this does not work you need to do the following:
- Move your commit logs to a backup directory e.g. mv /var/opt/cassandra/commitlogs/* /var/opt/cassandra/logsbackup
- Connect to a working Cassandra node using cassandra-cli and reconfigure your column family parameters appropriately. Make sure the heap on your down node can supply your maximum memtable needs.
- Restart the Cassandra node without its commit logs. It will now receive the schema updates from the other nodes. Because it’s memtables will be empty, it won’t crash and will be able to reconfigure its own copy of the schema
- Shutdown the Cassandra node, copy the commit logs back and restart it again. Unfortunately, if you have a sufficiently large number of column families, you have not been conservative enough and your commit log backlog is large enough, you will bomb out with OOM again!
- Copy out all your commit logs. Then copy back a single commit log and restart. Once this log is successfully replayed and deleted, you should then shutdown the node and repeat the process with the next commit log (for the purposes of speed, Cassandra processes commit logs in parallel so minimum load is created when you process your commit logs one by one).
Hopefully you will now be feeling better, you will have fixed your broken node and you can restart your service. You will now be a wiser person and never forget to configure your column families properly