Incident documentation/20160531-RESTBase

=== Summary ===
On May 31, 2016, the Cassandra instances on restbase1007.eqiad.wmnet were upgraded from 2.1.13 to 2.2.6.  [[phab:T126629|Despite careful preparation]], the upgrade did not go as planned.
 
=== Timeline ===
 
==== 2016-05-31T16:15 ====
All 3 instances of restbase1007 were upgraded from 2.1.13 to 2.2.6.  Afterward it became apparent that Cassandra metrics reporting had ceased due to an unhandled exception propagating up to [https://github.com/wikimedia/cassandra-metrics-collector cassandra-metrics-collector] from Cassandra via JMX.
 
While there were no Cassandra metrics at this time, host-based metrics indicated a more than 10x increase in disk read throughput (from ~100MBps to over 1GBps), a 3x increase in network throughput, and an increase in the storage latency (99p) of RESTBase HTML revision requests from ~280ms to, at times, over 2s.
 
==== 2016-05-31T17:15 ====
The source of the unhandled exceptions was narrowed down to a histogram introduced in Cassandra 2.2, <code>ColUpdateTimeDeltaHistogram</code> (see [https://issues.apache.org/jira/browse/CASSANDRA-11931 CASSANDRA-11931]), and a patched version of [https://github.com/wikimedia/cassandra-metrics-collector cassandra-metrics-collector] was live-hacked into place, restoring Cassandra metrics.
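The live-hacked patch presumably works around the offending metric rather than letting its exception abort the entire reporting cycle. The collector itself is Java, so the fragment below is only a TypeScript sketch of that general isolation pattern, with hypothetical metric-reader types:

<syntaxhighlight lang="typescript">
// Illustrative only: the real collector is Java. The idea is to read each
// metric inside its own try/catch so that one broken MBean (here, the 2.2
// ColUpdateTimeDeltaHistogram) cannot silence every other Cassandra metric.
type MetricReader = () => number;            // hypothetical per-metric read function
type MetricReport = Record<string, number>;

function collectMetrics(readers: Map<string, MetricReader>): MetricReport {
  const report: MetricReport = {};
  for (const [name, read] of readers) {
    try {
      report[name] = read();                 // one JMX/MBean read per metric
    } catch (err) {
      // Log and skip the offending metric instead of letting the exception
      // propagate and abort the whole reporting cycle.
      console.warn(`skipping metric ${name}:`, err);
    }
  }
  return report;
}
</syntaxhighlight>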
 
With Cassandra metrics restored it became obvious that columnfamily read latency had increased from ~40ms to an oddly consistent 400ms on 1007-c<ref>https://issues.apache.org/jira/browse/CASSANDRA-11752</ref>, and that the 1007 instances (1007-c in particular) seemed to be shouldering the majority of client requests.
 
==== 2016-05-31T20:12 ====
The CQL native transport port was briefly disabled on 1007-c to gauge the effect.  During this period, the additional traffic to 1007-c shifted to 1007-{a,b}.
 
==== 2016-05-31T21:16 ====
The CQL native transport was disabled on 1007-{a,b,c}.  RESTBase service failures ensued across the cluster, requiring RESTBase restarts to correct.  With the client ports shut down and RESTBase restarted, client traffic rebalanced across the cluster.  Utilization on 1007 improved dramatically, with network throughput and CPU utilization normalizing.  Only disk read throughput remained aberrant, dropping somewhat from ~1.5GBps to ~1GBps (still many times the norm).
 
Investigations continued into the evening, but with the node somewhat stable and RESTBase 99p latencies under control, it was left in this state overnight.
 
==== 2016-06-01T17:47 ====
Largely on a hunch, <code>disk_access_mode</code> (which defaults to <code>auto</code>) was live-hacked to <code>mmap_index_only</code> in <code>cassandra.yaml</code>, and all instances were restarted.  In our environment, where all data at rest is compressed, this should have the effect of disabling the memory-mapped reads of compressed data newly introduced in Cassandra 2.2, while leaving index files memory-mapped.  With this setting in place, disk read throughput figures normalized immediately.
 
During the restart of the instances, the binary ports were open momentarily, long enough for some RESTBase instances to pick them up, which again resulted in service failures (and subsequent RESTBase restarts) once the ports were disabled again.
 
==== 2016-06-01T20:00 ====
With the instances in an otherwise normal state, the binary protocol ports were re-enabled, and the previously observed imbalance of client traffic returned.
 
Upon further investigation it was determined that, in the absence of an explicitly configured protocol version, the Node.js driver defaults to version 4 (the default for Cassandra 2.2.6 is v4; the default for 2.1.13 is v3).  When the driver connects to the first contact point it attempts to use this version; if that succeeds, it becomes the version used on all subsequent connections, and contact points that do not support it are ignored.  Since restbase1007 sorts first among our nodes, v4 became the de facto version everywhere but on 1007 itself (each node ignores itself as a contact point).  This explained both the imbalance of client traffic and the service failures when 1007 was restarted (since the 1007 instances were the only ones left in the connection pools).
 
==== 2016-06-01T23:57 ====
A change was deployed setting <code>maxVersion</code> to 3 for Cassandra client connections, after which client traffic was redistributed evenly across the cluster.
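For reference, capping the protocol version with the DataStax Node.js driver looks roughly like the sketch below. This is not RESTBase's actual configuration (RESTBase passes driver options through its own config); the contact point, data centre name, and keyspace are placeholders.

<syntaxhighlight lang="typescript">
import { Client } from 'cassandra-driver';

// Minimal sketch, not RESTBase's actual configuration. Capping maxVersion at 3
// keeps the driver on a protocol version that both 2.1.13 and 2.2.6 nodes
// speak, so negotiation no longer latches onto v4 from the first (upgraded)
// contact point.
const client = new Client({
  contactPoints: ['restbase1001.eqiad.wmnet'],  // placeholder contact point
  localDataCenter: 'eqiad',                     // required by recent driver versions; placeholder
  keyspace: 'example_keyspace',                 // placeholder keyspace
  protocolOptions: { maxVersion: 3 },           // cap negotiation at native protocol v3
});

client.connect().catch((err) => console.error('connection failed:', err));
</syntaxhighlight>

Once every node runs 2.2, the cap can be removed so that clients negotiate v4 again.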
 
=== Conclusions ===
There were three separate issues here that slipped by undetected during testing.  In the cases of the broken histogram and the aberrant disk read throughput, the problems only manifested under a production workload and/or production access patterns.
 
The issue centering on the selection of driver protocol version could have been caught if the same upgrade workflow had been applied to staging as was planned for production (i.e. if more complete testing had been carried out after the canary upgrade, but before the complete cluster upgrade).
 
=== Actionables ===
* Establish staging and test methodologies that more accurately reflect the production environment ([[phab:T136340|T136340]])
* Establish a document of best practices for future upgrades, including (but not limited to):
** A matching (step-for-step) upgrade process during testing (test the upgrade process as well as the software)
** A checklist that includes evaluating native protocol disparities, and updating driver options in advance
** Establishing, ahead of time, rigorous go/no-go gating criteria and a rollback strategy
** Canary upgrades to the stand-by DC, possibly combined with a controlled temporary switch over
* Continue with investigation of aberrant disk read throughput
** Attempt to reproduce in staging by generating queries over a larger distribution of records (e.g. multiple dump script runs from different title offsets); a rough sketch of such a query generator appears after this list. If successful, integrate this process into standardized testing.
* Follow up on lack of recency bias in histogram metrics ([https://issues.apache.org/jira/browse/CASSANDRA-11752 CASSANDRA-11752])
* Follow up on broken <code>ColUpdateTimeDeltaHistogram</code> metrics ([https://issues.apache.org/jira/browse/CASSANDRA-11931 CASSANDRA-11931])
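As a starting point for that reproduction attempt, a generator along the following lines could spread reads over a wide range of titles in several passes from different offsets. This is only a rough sketch (it assumes Node 18+ for the global <code>fetch</code>); the staging host, port, endpoint path, and title list are placeholders, not the dump script referenced above.

<syntaxhighlight lang="typescript">
// Rough sketch of a staging read-load generator; base URL and endpoint path
// are assumptions, not the actual dump script.
const STAGING_BASE = 'http://restbase-staging.example.wmnet:7231/en.wikipedia.org/v1'; // placeholder

export async function runPass(titles: string[], offset: number): Promise<void> {
  for (const title of titles.slice(offset)) {
    const res = await fetch(`${STAGING_BASE}/page/html/${encodeURIComponent(title)}`);
    await res.arrayBuffer(); // drain the body; only the storage read load matters
  }
}

export async function main(titles: string[]): Promise<void> {
  // Several passes starting at different offsets, mimicking multiple dump
  // runs over different slices of the title space.
  for (const offset of [0, 1000, 5000, 20000]) { // arbitrary example offsets
    await runPass(titles, offset);
  }
}
</syntaxhighlight>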
 
=== Footnotes ===
<references />
