
This is a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

CDN/History

From Wikitech

An overview of notable events and changes to our caching infrastructure

Old caching clusters

These former clusters no longer exist but remnants may exist in our repositories.

cache_bits
Used to exist just for static content and ResourceLoader, now decommissioned (traffic went to cache_text)
cache_maps
Served maps.wikimedia.org exclusively, which is now serviced by cache_upload
cache_misc
Miscellaneous lower-traffic / support services (e.g. phabricator, metrics, etherpad, graphite, etc). Now moved to cache_text.
cache_mobile
Was like cache_text but just for (m|zero)\. mobile hostnames, now decommissioned (traffic went to cache_text)
cache_parsoid
Legacy entrypoint for parsoid and related *oid services, now decommissioned (traffic goes via cache_text to RESTBase)

Through the years

2023

  • In Dec 2023, parser cache retention was raised back to 30 days (T280604).

2022

  • In April 2022, we replaced ATS with HAProxy for TLS termination and HTTP/2 (T290005). This changed the stack to:
    • HAProxy for TLS termination,
    • Varnish frontend, and
    • ATS backend.
  • In September 2022, Multi-DC MediaWiki was enabled worldwide. This means cache misses may now route to one of two data centers, rather than misses from all caching data centers routing to the same singular primary DC.

2021

  • In May 2021, ParserCache retention was temporarily reduced from 30 to 21 days due to reaching capacity limits (change 685181).

2020

  • In April 2020, one year after switching from Varnish to ATS as cache backend, the TTL was lowered from the 7 days set in 2016, down to 24 hours (T249627). With Varnish frontend also at 1 day and a grace-keep of 7 days, this means frontend objects may now outlive backend ones.
  • In June 2020, the Purged service was introduced. MediaWiki no longer uses multicast HTCP purging, but instead produces Kafka events for purging URLs, which local Purged instances on Varnish and ATS servers consume and apply by issuing local PURGE requests.
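
The purge flow described above (event in, local PURGE requests out) can be sketched roughly as follows. This is an illustration only, not the actual Purged implementation: the event field name `url`, the `LOCAL_CACHES` addresses and ports, and the function name `build_purge_requests` are all hypothetical, and the Kafka consumer is left out so the sketch only shows how a single event might be turned into one PURGE request per local cache layer. The requests are built but not sent.

```python
import json
from urllib.parse import urlsplit
from urllib.request import Request

# Hypothetical local listeners for the two cache layers on one host.
# The real addresses and ports are not given in this page.
LOCAL_CACHES = {
    "frontend": "http://127.0.0.1:8080",
    "backend": "http://127.0.0.1:8081",
}


def build_purge_requests(event_json):
    """Turn one purge event into one local PURGE request per cache layer."""
    event = json.loads(event_json)
    parts = urlsplit(event["url"])  # "url" is an assumed field name
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query

    requests = []
    for layer, base in LOCAL_CACHES.items():
        # The request targets the local cache, while the Host header
        # carries the public hostname so the cache can find the object.
        req = Request(base + path, method="PURGE",
                      headers={"Host": parts.netloc})
        requests.append((layer, req))
    return requests


if __name__ == "__main__":
    sample = '{"url": "https://en.wikipedia.org/wiki/Foo"}'
    for layer, req in build_purge_requests(sample):
        print(layer, req.get_method(), req.full_url, req.get_header("Host"))
```

In a real deployment, the events would come from a Kafka consumer loop and each request would be sent to the local cache. Both are omitted here to keep the sketch self-contained.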

2019

  • In 2019, we replaced the "backend" Varnish with Apache Traffic Server (ATS) for improved on-disk caching (T227432). The same year, we also replaced Nginx- with another layer of ATS for TLS termination (T231627). This was referred to as the "ATS sandwich", featuring ATS as both TLS terminator and backend cache, thus discontinuing Nginx- ("nginx minus") and the Varnish backend. This changed the stack to:
    • ATS for TLS termination (ats-tls),
    • Varnish frontend (varnish-fe), and
    • ATS backend (ats-be).
  • Discussion occurred on whether to evolve the ATS-TLS layer to subsume the responsibilities of Varnish-frontend one day.
  • In 2019, routing of requests between cache-only data centers and the application data center changed to no longer involve a second caching tier (T108580#7555566). The switch from Varnish to ATS for backend caching meant we no longer relied on Varnish's unstable disk caching (T142848), which had previously justified a "Tier 2" Varnish backend. This change also reduced complexity and bugs stemming from the same VCL applying multiple times, TTLs being out of sync between tiers (T108612), and purge race conditions (T133821). Lastly, paired with Envoy for internal HTTPS, this change let us adopt HTTPS for encryption of traffic between DCs, whereas previously (with Varnishes talking plain HTTP to each other) this relied on IPsec (T108580).

2015-2019

  • Prior to 2019, the stack for many years involved two Varnish layers serving as a frontend and a backend cache respectively. As such, in older documentation "Varnish" might sometimes also refer to the cache backend. The stack was as follows:
    • Nginx- for TLS termination and HTTP/2,
    • Varnish frontend, and
    • Varnish backend.
  • Prior to 2019, when a request was a cache miss or pass, it would route from the edge's Varnish backend through the primary DC's Varnish backend as well, before going to the MediaWiki application. This design increased resiliency against cache loss after restarts (especially given unstable disk-based caching in Varnish, T142848), and increased request coalescing and cache hits more generally. It was routed as follows (archived 2015 diagram):
    • Internet
    • >> Caching data center (LVS > Nginx > Varnish frontend > Varnish "Tier 1" backend)
    • >> Primary data center (Varnish "Tier 2" backend > LVS > MediaWiki).

2016

  • We decreased the max object TTL in Varnish from the long-standing 31 days down to 1 day for Varnish frontends, and 14 days for Varnish backends and MediaWiki (T124954). The parser cache remains at 31 days.
  • We deployed HTTP/2 support to the Wikimedia CDN, which at the time was composed of Nginx- and Varnish (T96848).
  • The cache_mobile cluster was merged into cache_text (T109286).

2013

  • From Jan to Dec 2013, we migrated all use of Squid in the CDN to Varnish. See Gerrit commits. (This does not include use of Squid as a non-caching proxy outside the CDN, such as url-downloader.)
  • Prevent white-washing of expired page-view HTML. Various static aspects of a page are not tracked or versioned. As such, once the max-age expires, an If-Modified-Since check must not return true, even if the database entry of the wiki page is unchanged (T46570).

2011

2010

2009

2004

Further reading