You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Memcached for MediaWiki: Difference between revisions

From Wikitech-static
imported>Wolfgang Kandek
imported>Krinkle
No edit summary
 
(16 intermediate revisions by 6 users not shown)
Line 3: Line 3:
This page is about '''Memcached for MediaWiki'''.
This page is about '''Memcached for MediaWiki'''.


This page is not about other Memcached clusters in production, such as those for [[Thumbor]], [[Wikimedia Cloud Services]], [[Wikitech|Wikitech wiki]], and [[Swift]].
This page is not about other Memcached clusters in production, such as those for [[Thumbor]], [[Help:Cloud Services introduction|Wikimedia Cloud Services]], [[Wikitech|Wikitech wiki]], and [[Swift]].
[[File:Wikipedia Memcached flow 2020.png|thumb|400x400px|Wikipedia's Memcached configuration as of November 2020. Detailing the use of additional "onhost" Memcached instances, local to MW web servers (handled via Mcrouter route prefixes).]]
[[File:Wikipedia Memcached flow 2022.png|thumb|400x400px|MediaWiki's use of Memcached at WMF.]]


== Service ==
== Infrastructure ==
There are two logical pools of Memcached servers for MediaWiki:
There are two logical pools of memcached servers for MediaWiki:


* '''Main''': The main pool for has 18 shards and runs on the <code>mc10XX</code> hosts (in [[Eqiad cluster|Eqiad]]) and <code>mc20XX</code> hosts (in [[Codfw cluster|Codfw]]).
* '''Main''': The main pool for has 18 [[:en:Shard_(database_architecture)|shards]] and runs on the <code>mc10XX</code> hosts (in [[Eqiad cluster|Eqiad]]) and <code>mc20XX</code> hosts (in [[Codfw cluster|Codfw]]).
* '''Gutter''': The gutter pool has 3 shards and runs  on the <code>mc-gp100x</code> and <code>mc-gp200x</code> hosts, which have 10Gbit/s NICs instead of 1Gbit/s.
* '''Gutter''': The gutter pool has 3 shards per DC, and hosted on <code>mc-gp100x</code> and <code>mc-gp200x</code> hosts (launch task: [[phab:T244852|T244852]]).


See [[#Mcrouter]] for how these are used.
MediaWiki connects to memcached through a proxy called [[#Mcrouter]] which provides a number of benefits.


== Magic numbers ==
== Magic numbers ==
Line 19: Line 19:
** Interim value: 1 second.
** Interim value: 1 second.
* Mcrouter
* Mcrouter
** Gutter TTL: upto 5 minutes ([https://github.com/wikimedia/puppet/blob/9f9e389dac7830c7aabbf95803f51fd288cc6357/hieradata/role/common/mediawiki/appserver.yaml#L10 gutter_ttl]).
** Gutter TTL: upto 10 minutes ([https://github.com/wikimedia/puppet/blob/9f9e389dac7830c7aabbf95803f51fd288cc6357/hieradata/role/common/mediawiki/appserver.yaml#L10 gutter_ttl]).
** On-host tier TTL: upto 10 seconds.


== WANObjectCache ==
== WANObjectCache ==
'''WANObjectCache''' is the abstraction layer in MediaWiki PHP that deals with multi-datacenter concerns and Mcrouter. It builds on top of '''BagOStuff''' which is the generic key-value class that abstracts the Memcached protocol itself.
'''WANObjectCache''' (or '''WANCache''') is the primary interface in MediaWiki for interacting with Memcached and mcrouter. WANCache provides a developer-friendly API that naturally follow our best practices and transparently deals with the complex requirements of operating a platform of our scale. This includes preventing cache stampedes, avoiding cache misses for hot data through probabilistic and asynchronous regeneration prior to logical expiry, avoiding network congestion, supporting multiple versions of the software to run alongside each other (and apply purges to both, whilst storing values separately), avoiding cache polution during long-running processes or when databases are experiencing replication lag. The WANCache interface came out of the [[mw:Wikimedia_Performance_Team/Multi-DC_MediaWiki|Multi-DC MediaWiki initiative]] which required us to take these constraints more seriously, though they generally are not unique to Multi-DC and also significantly improved resilience and correctness during the 2015-2021 single-DC period.  
 
WANCache builds on top of '''BagOStuff''', which is the lower level key-value interface to Memcached and other storage backends.


See also:
See also:
Line 31: Line 34:
=== High level ===
=== High level ===
* '''Like a replica.''' There is generally no proactive setting of values during HTTP write actions. Instead, values are computed based on information from replica DBs, and computed on-demand using the <code>getWithSet(key, ttl, callable)</code> idiom. This means the application generally only expects cache values to be as up to date as a replica DB would be. Historically, it was common for MediaWiki to populate its cache during HTTP write actions instead. This meant that in a single-DC setup it could loosely be expected that the cache was as up-to-date as the master DB. As part of the multi-dc effort, this was changed starting in 2015, and thus its expectations were loosened to that of a replica DB.
* '''Like a replica.''' There is generally no proactive setting of values during HTTP write actions. Instead, values are computed based on information from replica DBs, and computed on-demand using the <code>getWithSet(key, ttl, callable)</code> idiom. This means the application generally only expects cache values to be as up to date as a replica DB would be. Historically, it was common for MediaWiki to populate its cache during HTTP write actions instead. This meant that in a single-DC setup it could loosely be expected that the cache was as up-to-date as the master DB. As part of the multi-dc effort, this was changed starting in 2015, and thus its expectations were loosened to that of a replica DB.
* '''No synchronisation.''' MediaWiki's WANObjectCache layer does not require synchronisation of cached values across data centers. Instead, it considers each datacenter's Memcached cluster as independent. Each populating its own values as-needed on dc-local app servers from dc-local replica DBs.
* '''No synchronisation.''' MediaWiki's WANCache layer does not require synchronisation of cached values across data centers. Instead, it considers each datacenter's Memcached cluster as independent. Each populating its own values as-needed on dc-local app servers from dc-local replica DBs.
* '''Tombstones (broadcasted purge)'''. During HTTP write actions, MediaWiki asks WANObjectCache to purge cache keys of which it has modified the source data. These purges take the form of short-lived Memcached keys known as "tombstones". We do not use the <code>DELETE</code> command because we want each data center to be able to populate its memcached independently, thus requiring no cross-dc master database connection, thus reading from a local replica, thus values ingested in the cache may be as stale as a replica can be. Implementing a Memcached purge as <code>DELETE</code> would mean both in the same DC and other DCs, the same key could be re-populated immediately with the same stale value we just deleted. Instead, WANObjectCache formulates its purge as a <code>SET</code> operation that stores a placeholder value known as a "tombstone" (lasts for approx. 10 seconds for local and remote replica DBs to catch up).
* '''Tombstones (broadcasted purge)'''. During HTTP write actions, MediaWiki asks WANCache to purge cache keys of which it has modified the source data. These purges take the form of short-lived Memcached keys known as "tombstones". We do not use the <code>DELETE</code> command because we want each data center to be able to populate its memcached independently, thus requiring no cross-dc master database connection, thus reading from a local replica, thus values ingested in the cache may be as stale as a replica can be. Implementing a Memcached purge as <code>DELETE</code> would mean both in the same DC and other DCs, the same key could be re-populated immediately with the same stale value we just deleted. Instead, WANCache formulates its purge as a <code>SET</code> operation that stores a placeholder value known as a "tombstone" (lasts for approx. 10 seconds for local and remote replica DBs to catch up).
* '''Interim values'''. Upon seeing such tombstone, WANObjectCache acts much like a cache miss, except that the newly computed value is not written back over the tombstone (as the computed value may be stale). Instead, to avoid a recompute stampede these maybe-stale values are stored as an "interim value" in a sister key which is only kept for a few seconds.
* '''Interim values'''. Upon seeing such tombstone, WANCache acts much like a cache miss, except that the newly computed value is not written back over the tombstone (as the computed value may be stale). Instead, to avoid a recompute stampede these maybe-stale values are stored as an "interim value" in a sister key which is only kept for a few seconds.


=== Memcached commands ===
=== Memcached commands ===
Line 43: Line 46:
Cross-dc:
Cross-dc:


* Purge traffic uses the <code>/*/mw-wan/</code> prefix to tell Mcouter to broadcast this to other pools and clusters as well. The actual command is generally <code>SET</code> as it needs to induce a "hold-off" period using the tombstone (per the above). In rare cases where a hold-off is not needed (e.g. if the purge is not related to a DB write), then the broadcasted event will use <code>DELETE</code>
* Purge traffic uses the <code>/*/mw-wan/</code> prefix to tell mcrouter to broadcast this to other pools and clusters as well. The actual command is generally <code>SET</code> as it needs to induce a "hold-off" period using the tombstone (per the above). In rare cases where a hold-off is not needed (e.g. if the purge is not related to a DB write), then the broadcasted event will use <code>DELETE</code>


=== Getting revision/page from WANObjectCache key ===
=== Getting revision/page from WANCache key ===
If you're trying to track down the specific revision text given an SqlBlobStore key, the somewhat convoluted procedure is documented at [[mw:Manual:Caching#Revision_text]].
If you're trying to track down the specific revision text given an SqlBlobStore key, the somewhat convoluted procedure is documented at [[mw:Manual:Caching#Revision_text]].


== Mcrouter ==
== Mcrouter ==
{{Main|Memcached for MediaWiki/mcrouter}}
=== Service ===
=== Service ===
There is a local Mcrouter instance on every app server.
There is a local mcrouter instance on every app server which offers the following features:
 
* consistent shards data across the memcached servers
There is also a cluster of 4-shard "'''Proxies'''" pool of Mcrouter instances in each data center for the purpose of receiving cross-dc Memcached commands to then proxy further to the dc-local app servers accordingly.
* connection pooling
 
* failover to <code>gutter pool</code> in case of a server unavailability
''TODO: Do these proxies also consider their dc-local gutterpool?''
* cross-dc replication via TLS [https://phabricator.wikimedia.org/T271967 T271967]
 
* onhost memcached
=== Routes ===
=== Routes ===
Each MediaWiki api/appserver sees memcached through a local proxy called [[Mcrouter]]<ref>[https://engineering.fb.com/web/introducing-mcrouter-a-memcached-protocol-router-for-scaling-memcached-deployments/ Introducing mcrouter: A memcached protocol router for scaling memcached deployments]</ref>. Each route applies consistent hashing on the key name to know where to send it.
Each MediaWiki api/appserver accesses memcached through its local Mcrouter instance <ref>[https://engineering.fb.com/web/introducing-mcrouter-a-memcached-protocol-router-for-scaling-memcached-deployments/ Introducing mcrouter: A memcached protocol router for scaling memcached deployments]</ref>. Mcrouter introduces the concepts of [https://github.com/facebook/mcrouter/wiki/Prefix-routing-setup routes] and pools and each route applies consistent hashing on the key name to know where to send it, i.e. which of the 18 shards for memcached.


There are several routes available through this, which are addressable via a route prefix that Mcrouter strips from the key before forwarding the Memcached command.
There are several routes available in our configuration, which are addressable via a route prefix that mcrouter '''strips from the key''' before forwarding the memcached command.


# '''Main route'''. This route is declared as <code>/$region/mw/</code> but is not addressed by MediaWiki as such. It routes to the dc-local "Main" pool shards. If a shard is perceived as unavailable from an appserver ("TKO") the local Mcrouter forwards all commands (incl gets, sets, and locks)  to a shard of the "Gutter" pool instead (see [[phab:T240684|T240684]], [[phab:T244852|T244852]]).
# '''Main route'''. This route is declared as <code>/$region/mw/</code> but is not addressed by MediaWiki as such. It routes to the dc-local "Main" pool shards. If a shard is perceived as unavailable from an appserver ("TKO") the local mcrouter forwards all commands (incl gets, sets, and locks)  to a shard of the "Gutter" pool instead (launch task: [[phab:T244852|T244852]]).
#* This route is used by the majority of traffic, through <code>WANObjectCache::getWithSet</code> calls in MediaWiki.
#* This route is used by the majority of traffic, through <code>WANObjectCache::getWithSet</code> calls in MediaWiki.
#* MediaWiki doesn't use the <code>/$region/mw/</code> prefix. Instead <code>/$region/mw/</code> is the default route and MediaWiki sends these commands without any routing prefix.
#* MediaWiki doesn't use the <code>/$region/mw/</code> prefix. Instead <code>/$region/mw/</code> is the default route and MediaWiki sends these commands without any routing prefix.
Line 72: Line 73:


=== Example ===
=== Example ===
 
The memcached key <code>WANCache:v:metawiki:translate-groups</code> (belongs to the [[mw:Extension:Translate|Translate extension]]) is formatted by the WANCache library. When Translate wants to get the value of this key, WANCache will send a <code>GET</code> command from MediaWiki to <code>localhost:11213</code>, where mcrouter is listening. The command is then further routed to <code>mc1022</code> (based on key hashing). MediaWiki it totally ignorant about the <code>mc[1,2]0XX</code> host, it only knows about sending commands to a localhost port. A mcrouter admin command helps figure out where keys are hashed/routed to:<syntaxhighlight lang="bash">
 
The Memcached key <code>WANCache:v:metawiki:translate-groups</code> (belongs to the [[mw:Extension:Translate|Translate extension]]) is formatted by the WANObjectCache library. When Translate wants to get the value of this key, WANObjectCache will send a <code>GET</code> command from MediaWiki to <code>localhost:11213</code>, where Mcrouter is listening. The command is then further routed to <code>mc1022</code> (based on key hashing). MediaWiki it totally ignorant about the <code>mc[1,2]0XX</code> host, it only knows about sending commands to a localhost port. A Mcrouter admin command helps figure out where keys are hashed/routed to:<syntaxhighlight lang="bash">
elukey@mw1345:~$ echo "get __mcrouter__.route(get,WANCache:v:metawiki:translate-groups)" | nc localhost 11213 -q 2
elukey@mw1345:~$ echo "get __mcrouter__.route(get,WANCache:v:metawiki:translate-groups)" | nc localhost 11213 -q 2
VALUE __mcrouter__.route(get,WANCache:v:metawiki:translate-groups) 0 16
VALUE __mcrouter__.route(get,WANCache:v:metawiki:translate-groups) 0 16
Line 84: Line 83:
</syntaxhighlight>Some things to notice:
</syntaxhighlight>Some things to notice:


* The special prefix <code>__mcrouter__.route</code> is intercepted by Mcrouter. These are admin commands, for which proxy returns directly without contacting the Memcached hosts. This function returns the target of the consistent hashing of the key name.
* The special prefix <code>__mcrouter__.route</code> is intercepted by mcrouter. These are admin commands, for which proxy returns directly without contacting the memcached hosts. This function returns the target of the consistent hashing of the key name.
* Mcrouter listens on port 11213 on all MediaWiki [[Application servers|app servers]], meanwhile on every <code>mc10XX</code> host memcached listens on port 11211.
* Mcrouter listens on port 11213 on all MediaWiki [[Application servers|app servers]], meanwhile on every <code>mc10XX</code> host memcached listens on port 11211.


Line 95: Line 94:
== Runbooks ==
== Runbooks ==
* [[Memcached for MediaWiki/Memcached server failure|Memcached server failure]]
* [[Memcached for MediaWiki/Memcached server failure|Memcached server failure]]
* [[Performance/Runbooks/Analyze memcached]] (How to use memkeys or cachedump)


== See also ==
== See also ==
Line 103: Line 103:
[[Category:Caching]]
[[Category:Caching]]
[[Category:MediaWiki production]]
[[Category:MediaWiki production]]
[[Category:SRE Service Operations]]

Latest revision as of 17:59, 15 May 2023

This page is about Memcached for MediaWiki.

This page is not about other Memcached clusters in production, such as those for Thumbor, Wikimedia Cloud Services, Wikitech wiki, and Swift.

MediaWiki's use of Memcached at WMF.

Infrastructure

There are two logical pools of memcached servers for MediaWiki:

  • Main: The main pool has 18 shards and runs on the mc10XX hosts (in Eqiad) and mc20XX hosts (in Codfw).
  • Gutter: The gutter pool has 3 shards per DC and is hosted on the mc-gp100x and mc-gp200x hosts (launch task: T244852).

MediaWiki connects to memcached through a proxy called Mcrouter (see the Mcrouter section below), which provides a number of benefits.

Magic numbers

  • WANCache (last updated: May 2020.)
    • Tombstone (aka "hold-off TTL"): 11 seconds.
    • Interim value: 1 second.
  • Mcrouter
    • Gutter TTL: up to 10 minutes (gutter_ttl).
    • On-host tier TTL: up to 10 seconds.

WANObjectCache

WANObjectCache (or WANCache) is the primary interface in MediaWiki for interacting with Memcached and mcrouter. WANCache provides a developer-friendly API that naturally follows our best practices and transparently deals with the complex requirements of operating a platform of our scale. This includes preventing cache stampedes, avoiding cache misses for hot data through probabilistic and asynchronous regeneration prior to logical expiry, avoiding network congestion, supporting multiple versions of the software running alongside each other (applying purges to both, whilst storing values separately), and avoiding cache pollution during long-running processes or when databases are experiencing replication lag. The WANCache interface came out of the Multi-DC MediaWiki initiative, which required us to take these constraints more seriously, though they are generally not unique to Multi-DC and also significantly improved resilience and correctness during the 2015-2021 single-DC period.

WANCache builds on top of BagOStuff, which is the lower-level key-value interface to Memcached and other storage backends.

See also:

High level

  • Like a replica. There is generally no proactive setting of values during HTTP write actions. Instead, values are computed based on information from replica DBs, and computed on-demand using the getWithSet(key, ttl, callable) idiom. This means the application generally only expects cache values to be as up to date as a replica DB would be. Historically, it was common for MediaWiki to populate its cache during HTTP write actions instead. This meant that in a single-DC setup it could loosely be expected that the cache was as up-to-date as the master DB. As part of the multi-dc effort, this was changed starting in 2015, and thus its expectations were loosened to that of a replica DB.
  • No synchronisation. MediaWiki's WANCache layer does not require synchronisation of cached values across data centers. Instead, it considers each data center's Memcached cluster as independent. Each cluster populates its own values as needed, on dc-local app servers from dc-local replica DBs.
  • Tombstones (broadcasted purge). During HTTP write actions, MediaWiki asks WANCache to purge the cache keys whose source data it has modified. These purges take the form of short-lived Memcached keys known as "tombstones". We do not use the DELETE command, because each data center populates its Memcached independently: it needs no cross-dc connection to the master database and instead reads from a local replica, so values ingested into the cache may be as stale as a replica can be. If a purge were implemented as DELETE, the same key could be re-populated immediately, in the same DC and in other DCs, with the same stale value we just deleted. Instead, WANCache formulates its purge as a SET operation that stores a placeholder value known as a "tombstone", which lasts for approx. 10 seconds to give local and remote replica DBs time to catch up.
  • Interim values. Upon seeing such a tombstone, WANCache acts much like it would on a cache miss, except that the newly computed value is not written back over the tombstone (as the computed value may be stale). Instead, to avoid a recompute stampede, these maybe-stale values are stored as an "interim value" in a sister key, which is only kept for a few seconds.
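
The tombstone and interim-value behaviour described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the WANObjectCache implementation: ToyCache, purge, get_with_set and the ":interim" sister-key suffix are hypothetical names, and the two TTLs are simply the values listed under "Magic numbers".

```python
import time

TOMBSTONE = object()  # placeholder value stored by a purge
HOLDOFF_TTL = 11      # seconds; "Tombstone" value from the magic numbers
INTERIM_TTL = 1       # seconds; "Interim value" from the magic numbers


class ToyCache:
    """In-memory stand-in for one data center's Memcached cluster."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or entry[1] <= self._clock():
            return None  # absent or expired
        return entry[0]

    def set(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)


def purge(cache, key):
    # A purge is a SET of a short-lived tombstone, not a DELETE.
    cache.set(key, TOMBSTONE, HOLDOFF_TTL)


def get_with_set(cache, key, ttl, compute):
    value = cache.get(key)
    if value is None:
        # Plain miss: compute (from a replica) and store normally.
        value = compute()
        cache.set(key, value, ttl)
        return value
    if value is TOMBSTONE:
        # Hold-off period: the tombstone is never overwritten.
        sister_key = key + ":interim"  # hypothetical sister-key name
        interim = cache.get(sister_key)
        if interim is None:
            # Possibly stale, so keep it only briefly to absorb a stampede.
            interim = compute()
            cache.set(sister_key, interim, INTERIM_TTL)
        return interim
    return value
```

In this sketch, repeated reads during the hold-off period trigger at most one recompute per interim TTL, and the main key is only repopulated once the tombstone has expired.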

Memcached commands

Intra-dc:

  • Read traffic from the getWithSet idiom results in a GETS command (getMulti) that fetches the main key, plus any sister keys that might exist.
  • Write traffic from the getWithSet idiom results in either ADD if the key was known to be absent, or Memcached->mergeViaCas if a value existed but either required (or was elected for) regeneration.

Cross-dc:

  • Purge traffic uses the /*/mw-wan/ prefix to tell mcrouter to broadcast the command to other pools and clusters as well. The actual command is generally SET, as it needs to induce a "hold-off" period using the tombstone (per the above). In rare cases where a hold-off is not needed (e.g. if the purge is not related to a DB write), the broadcasted event uses DELETE instead.
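
As an illustration of the cross-dc purge described above, the following sketch builds the raw memcached text-protocol request for both cases. The function name and the placeholder payload are hypothetical (the real tombstone value is produced by WANObjectCache and is not reproduced here); only the /*/mw-wan/ prefix, the SET/DELETE choice and the 11-second hold-off come from this page.

```python
BROADCAST_PREFIX = "/*/mw-wan/"  # route prefix that asks mcrouter to broadcast
HOLDOFF_TTL = 11                 # seconds; "Tombstone" value from the magic numbers


def build_purge_command(key: str, needs_holdoff: bool = True,
                        placeholder: bytes = b"TOMBSTONE") -> bytes:
    """Return the request bytes for a broadcast purge of `key`.

    `placeholder` is a made-up payload standing in for the real tombstone.
    """
    routed_key = BROADCAST_PREFIX + key
    if not needs_holdoff:
        # No DB write involved: a plain broadcast DELETE is enough.
        return f"delete {routed_key}\r\n".encode()
    # Text protocol: set <key> <flags> <exptime> <bytes>\r\n<data>\r\n
    header = f"set {routed_key} 0 {HOLDOFF_TTL} {len(placeholder)}\r\n"
    return header.encode() + placeholder + b"\r\n"
```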

Getting revision/page from WANCache key

If you're trying to track down the specific revision text given an SqlBlobStore key, the somewhat convoluted procedure is documented at mw:Manual:Caching#Revision_text.

Mcrouter

Service

There is a local mcrouter instance on every app server which offers the following features:

  • consistent sharding of data across the memcached servers
  • connection pooling
  • failover to the gutter pool in case a server is unavailable
  • cross-dc replication via TLS (T271967)
  • onhost memcached

Routes

Each MediaWiki api/appserver accesses memcached through its local Mcrouter instance [1]. Mcrouter introduces the concepts of routes and pools, and each route applies consistent hashing on the key name to decide where to send it, i.e. to which of the 18 memcached shards.
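
To illustrate what consistent hashing on the key name buys, here is a generic hash-ring sketch. It is not mcrouter's actual hash function or configuration; the HashRing class, the MD5-based hash and the number of points per shard are assumptions chosen for the example.

```python
import bisect
import hashlib


class HashRing:
    """Generic consistent-hash ring mapping key names to shard names."""

    def __init__(self, shards, points_per_shard=100):
        # Place several points per shard on the ring to even out the load.
        self._ring = sorted(
            (self._hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(points_per_shard)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(text):
        digest = hashlib.md5(text.encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def shard_for(self, key):
        # The first point clockwise from the key's hash owns the key.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

The useful property is that removing one shard only remaps the keys that lived on that shard; every other key keeps its previous destination.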

There are several routes available in our configuration, which are addressable via a route prefix that mcrouter strips from the key before forwarding the memcached command.

  1. Main route. This route is declared as /$region/mw/ but is not addressed by MediaWiki as such. It routes to the dc-local "Main" pool shards. If a shard is perceived as unavailable from an appserver ("TKO") the local mcrouter forwards all commands (incl gets, sets, and locks) to a shard of the "Gutter" pool instead (launch task: T244852).
    • This route is used by the majority of traffic, through WANObjectCache::getWithSet calls in MediaWiki.
    • MediaWiki doesn't use the /$region/mw/ prefix. Instead /$region/mw/ is the default route and MediaWiki sends these commands without any routing prefix.
    • Switchover to and from the gutter pool is decided by Mcrouter locally (per-appserver); it is not centrally coordinated. The keys stored in a gutter server have a reduced TTL.
  2. WAN route. This route is declared as /$region/mw-wan/. It routes to the dc-local "Main" pool shards as well as the "Proxies" for all non-local DCs.
    • This route is for internal use by MediaWiki's WANObjectCache to broadcast its purges ("tombstones"). This happens from calls to WANObjectCache::purge (invalidates a single key) or WANObjectCache::touchCheckKey (effectively invalidates many keys, through a shared "check" key; somewhat like the Varnish XKey mechanism).
    • This route is not used for storing "regular" values and is not exposed to any generic WANObjectCache::getWithSet or BagOStuff calls.
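
The prefix handling described above can be sketched as follows: a key either carries a /region/cluster/ route prefix, which is stripped before forwarding, or has no prefix and falls through to the default route. The names RoutedKey, split_route_prefix and is_broadcast are hypothetical, and using "eqiad" as the local region is an assumption for the example.

```python
from typing import NamedTuple


class RoutedKey(NamedTuple):
    region: str   # "*" means every region
    cluster: str  # e.g. "mw" or "mw-wan"
    key: str      # key as forwarded to memcached, prefix stripped


def split_route_prefix(raw_key, default_region="eqiad", default_cluster="mw"):
    """Split an optional /region/cluster/ route prefix off a key."""
    if not raw_key.startswith("/"):
        # No prefix: the default route applies (the Main route on this page).
        return RoutedKey(default_region, default_cluster, raw_key)
    parts = raw_key.split("/", 3)  # ["", region, cluster, key]
    if len(parts) != 4 or not all(parts[1:]):
        raise ValueError(f"malformed route prefix in {raw_key!r}")
    return RoutedKey(parts[1], parts[2], parts[3])


def is_broadcast(routed):
    return routed.region == "*"
```

For example, split_route_prefix("/*/mw-wan/WANCache:v:metawiki:translate-groups") yields the region "*", the cluster "mw-wan" and the bare key, while the same key without a prefix resolves to the default route.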

Example

The memcached key WANCache:v:metawiki:translate-groups (which belongs to the Translate extension) is formatted by the WANCache library. When Translate wants to get the value of this key, WANCache sends a GET command from MediaWiki to localhost:11213, where mcrouter is listening. The command is then routed further to mc1022 (based on key hashing). MediaWiki is entirely unaware of the mc[1,2]0XX hosts; it only knows about sending commands to a localhost port. An mcrouter admin command helps figure out where keys are hashed/routed to:

elukey@mw1345:~$ echo "get __mcrouter__.route(get,WANCache:v:metawiki:translate-groups)" | nc localhost 11213 -q 2
VALUE __mcrouter__.route(get,WANCache:v:metawiki:translate-groups) 0 16
10.64.0.83:11211
END

elukey@mw1345:~$ dig -x 10.64.0.83 +short
mc1022.eqiad.wmnet.

Some things to notice:

  • The special prefix __mcrouter__.route is intercepted by mcrouter. It denotes an admin command, which the proxy answers directly without contacting the memcached hosts. This function returns the target of the consistent hashing of the key name.
  • Mcrouter listens on port 11213 on all MediaWiki app servers, while on every mc10XX host memcached listens on port 11211.

To fetch a key's value and dump it to a file, it is sufficient to run:
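
For scripting the same lookup, the request and response shown above can be handled with two small helpers. This is a sketch: build_route_query and parse_get_response are hypothetical names, and sending the bytes to localhost:11213 (for example over a socket) is left out so the code has no dependency on a live mcrouter.

```python
def build_route_query(key: str, op: str = "get") -> bytes:
    """Build the admin request that asks mcrouter where `key` is routed."""
    return f"get __mcrouter__.route({op},{key})\r\n".encode()


def parse_get_response(raw: bytes):
    """Extract the data block from a memcached text-protocol GET response.

    Returns None when the response is just END (key not found).
    """
    if raw.startswith(b"END"):
        return None
    header, sep, rest = raw.partition(b"\r\n")
    fields = header.split(b" ")  # VALUE <key> <flags> <bytes>
    if not sep or len(fields) != 4 or fields[0] != b"VALUE":
        raise ValueError("unexpected response header")
    size = int(fields[3])
    return rest[:size]
```

Applied to the response in the transcript above, parse_get_response returns the 16-byte string 10.64.0.83:11211.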

elukey@mw1345:~$ echo "get WANCache:v:metawiki:translate-groups" | nc localhost 11213 -q 2 > dump.txt
elukey@mw1345:~$ du -hs dump.txt
380K	dump.txt

In this case the key's value is pretty big, and it needs PHP to be interpreted correctly (to unserialize it), but nonetheless we got some useful information (like the size of the value). This is useful when you need to quickly find out how big a value is, rather than inspect its content.
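
The dump also contains the response header, whose last field is the exact byte length of the value, so the size can be read without relying on du. A minimal sketch, assuming the dump begins with the usual VALUE header line; value_size_from_dump is a hypothetical helper name.

```python
def value_size_from_dump(dump: bytes) -> int:
    """Return the value length declared in the header of a dumped GET response."""
    header = dump.split(b"\n", 1)[0].rstrip(b"\r")
    fields = header.split(b" ")  # VALUE <key> <flags> <bytes>
    if len(fields) < 4 or fields[0] != b"VALUE":
        raise ValueError("dump does not start with a VALUE header")
    return int(fields[3])
```

For the file created above, this would be called as value_size_from_dump(open("dump.txt", "rb").read()).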

Runbooks

See also