You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org
This information is outdated.
Cassandra instance size is limited due to JVM GC and some architectural limitations. With the default CMS collector, the recommendation is to use heap sizes no larger than 8G in order to maintain reasonable GC pause times. With G1GC, we have been able to push it 16G without falling over, but the metrics are showing occasional timeout spikes coinciding with larger (5s) GC spikes.
The Cassandra documentation recommends: "Most workloads work best with a capacity under 500GB to 1TB per node depending on I/O. Maximum recommended capacity for Cassandra 1.2 and later is 3 to 5TB per node for uncompressed data." With our current compression levels of around 17%, this translates to no more than 500-850G of compressed storage per node.
This fits with our observations in practice. We started to see some (~a dozen per day) request timeouts at instance sizes of about 600G (compressed). At around 1.2-1.5T compressed storage per node (7-9T raw data), we additionally observed
- mean latencies growing by an order of magnitude
- timeout rates occasionally bursting into the double-digit percent of requests
- nodes crashing from OOM
- compaction falling behind, leading to large read latencies, retry rates and larger storage size
- failure to bootstrap new nodes
- metrics reporting stopping after a few hours of operation
Mapping of instances to hardware
Our current hardware is underutilized from a memory and cpu perspective, and while it may be possible to achieve higher node storage densities for our workload, it is clear that we won't be able to utilize the full 3T with a single Cassandra instance while maintaining acceptable latencies and reliability.
At a maximum instance size of about 600G compressed data, provisioning multiple Cassandra instances per hardware node (via systemd units, containers, or virtualization) will be needed for reasonable utilization.
Chassis and hardware
- ideally 8x 2.5" SSD; no hot-swap needed
- extras like redundant power not necessarily needed, redundancy is achieved through replication
- 1G ethernet for single instance, 10G for multi-instance box
- JBOD configuration; no HW RAID needed. For larger hardware nodes we could consider RAID-5 for lower recovery times after a single SSD failure.
About 1T of raw storage per instance, with a target utilization of no more than 600G.
Cassandra writes in large sequential chunks, which means that write amplification is minimal. Write rates are moderate at ~10M/s per disk, which means that consumer SSDs with an endurance of 2000+ erase cycles will theoretically last about 6.5 years. Once we increase the cluster size, write rates will further decrease and life expectancy accordingly increase.
- Samsung 850 Pro: Used in the current Eqiad cluster. About $490 at time of writing.
- Samsung 850 Evo: Same flash as the Pro, but smaller processor. Same endurance rating. About $380 at time of writing.
CPU and memory, per instance
- 4+ cores
- min 16G RAM
eqiad RESTBase Cluster
RESTBase was initially deployed in a single data-center configuration to 6 dual-8-core machines with 64GB of memory, and 3T of SSD storage. Cassandra is configured for a replication count of 3, and is using
Where we are (as of May 2015)
- Deployed across 3 data-center rows, (using
NetworkTopologyStrategy, 2 nodes per row)
- ~6T total storage (compressed, includes replication overhead), growing at a rate of ~60GB/day (~1T/node)
- Request latencies are nominal, ~100 request timeouts / 24 hours
- Cassandra is utilizing ~18G of RAM (RSS)
- CPU utilization is typically below 20%
Where we are going / what we expect
- ~54T total storage anticipated in the coming year
- 1 additional data-center, replication factor 3 (54T total compressed storage)
- The capacity to maintain service-levels in the face of a data-center row (=cassandra rack) outage
- Initial RESTBase hardware planning ticket: Dec 2014
- Expand RESTBase cluster capacity: May 2015; includes discussion of SSD characteristics
- The most important architectural limitation is probably the depth of log-structured merge trees. Large data sets with leveled compaction tend to use 5 sstable levels, which means that all data is written at least five times until it is fully merged into the lowest level. Tables with less than 200G per instance tend to get by with four levels instead. Really small tables with less than ~100G per instance only need three levels. Additionally, compaction parallelism is limited by the requirement to process non-overlapping key ranges at each level. This means that overlapping compactions at different levels often block each other, which can result in only a single CPU-bound thread working on a large table, and compaction falling behind as a result.
- RESTBase capacity planning for 2015/16