Cassandra/Upgrades and testing

Developing a Plan

Part of testing and evaluating a new version is developing a strategy for rolling it out to production. Some items to keep in mind when developing such a plan:

  • The process should begin with the upgrade of a single instance, a so-called canary. When feasible, consider using a node from the standby data center as the canary to limit user-facing impact in the event of a problem.
  • Establish a comprehensive set of Go/No-Go criteria in advance. Know what needs to be tested and measured, and what the results must be before deciding to continue, hold, or roll back.
  • Test the upgrade plan: when upgrading a development, beta, or staging environment, closely follow the process that will be used in production. You're not just vetting the software for production; you are also vetting the process for rolling it out.

Things to test before a major version upgrade

  • Upgrade one node at a time, and perform the full set of tests both during mixed-version operation and after the full upgrade.
  • Make sure to generate enough load to provoke failures. A simple technique is to run two or more HTML dumps from different offsets, or for different projects (see the load-generation sketch after this list).
  • Decommissions & bootstraps in mixed operation & after upgrade
    • Some bootstrap failures in this space only happen with large data sets / specific compaction strategies. In the past, we had consistent bootstrap failures on 2.1 when using Leveled Compaction and data sets greater than ~700G per instance.
      • Open question: Can we provoke common failure modes without having the full data set?
  • Performance: benchmark application throughput before and after the upgrade
    • In-memory performance: ab -c 300 -n 100000 <url>
    • More realistic performance: Multiple concurrent HTML dumps, as described above.
    • Check IO metrics (disk read & write throughput, iowait) and look for major changes
    • Check network metrics and look for major changes
  • Inspect logs on upgraded nodes for new errors or warnings
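
A minimal Node.js sketch of one way to generate sustained concurrent read load is shown below. The base URL, page titles, and concurrency are placeholders, and the actual HTML dump scripts work differently (they walk a project's full title list); this is only an illustration of keeping many requests in flight until interrupted.

  // Minimal load-generation sketch (Node.js 18+); endpoint and titles are hypothetical.
  const BASE = 'http://restbase.example.org/en.wikipedia.org/v1/page/html/'; // assumed URL
  const TITLES = ['Foo', 'Bar', 'Baz'];  // placeholder page titles
  const CONCURRENCY = 50;

  async function worker(id) {
    // Each worker fetches pages in a loop until the process is interrupted.
    for (let i = 0; ; i++) {
      const title = TITLES[(id + i) % TITLES.length];
      try {
        const res = await fetch(BASE + encodeURIComponent(title));
        await res.text();  // drain the body so the connection can be reused
      } catch (err) {
        console.error(`worker ${id}: ${err.message}`);
      }
    }
  }

  for (let id = 0; id < CONCURRENCY; id++) {
    worker(id);
  }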

Checklist

Node.js driver and native protocol versions

In the absence of an explicitly configured maxVersion, the Node.js driver uses a constant as the default. As the driver works through the contact points at startup, it attempts to use this default and, failing that, decrements the version and tries again. Once the first successful connection is established, the negotiated protocol version is used for all subsequent connections, and any node that does not support it is skipped entirely. This can result in severe imbalances of client connections during an upgrade if the default native protocol version is higher than the version supported by the nodes still running the old Cassandra release.

When upgrading, be aware of any changes to the default native protocol version, and ensure that protocolOptions.maxVersion is explicitly set to the highest protocol version supported by every node in the cluster (the least common denominator) until the upgrade is complete.
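
For illustration, a minimal sketch of pinning the protocol version with the DataStax Node.js driver follows; the contact points, data center, and keyspace are placeholders, and the right maxVersion depends on the versions involved (for example, v3 is the highest version common to 2.1 and 2.2 nodes).

  const cassandra = require('cassandra-driver');

  // Pin the native protocol version to the least common denominator during the
  // upgrade. Contact points, data center, and keyspace are placeholders.
  const client = new cassandra.Client({
    contactPoints: ['cassandra-1.example.org', 'cassandra-2.example.org'],
    localDataCenter: 'datacenter1',
    keyspace: 'example_keyspace',
    protocolOptions: {
      // e.g. v3 for a mixed 2.1/2.2 cluster
      maxVersion: cassandra.types.protocolVersion.v3
    }
  });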

See also: Mixed cluster versions and rolling upgrades

Past Upgrades

2.1.13 to 2.2.6

RESTBase