
Wikidata Query Service/Streaming Updater Rollout Plan

Revision as of 13:14, 4 August 2021 by imported>Aklapper ((Please add years to docs))

Important notes:

  • Week of Sept. 13, 2021: planned datacenter switch codfw -> eqiad
  • The revision map used to generate the initial state is available on Fridays (7am UTC)
  • Dumps should be considered available on Fridays on the mirror
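Since the revision map appears on Fridays at 7am UTC, picking a W0 comes down to finding the next such Friday. The following is a minimal sketch of that calculation; the function name `next_friday_7am_utc` is hypothetical and not part of any WDQS tooling.

```python
from datetime import datetime, timedelta, timezone

FRIDAY = 4  # datetime.weekday(): Monday is 0, Friday is 4


def next_friday_7am_utc(now: datetime) -> datetime:
    """Return the first Friday 07:00 UTC at or after `now` (timezone-aware)."""
    now = now.astimezone(timezone.utc)
    days_ahead = (FRIDAY - now.weekday()) % 7
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=7, minute=0, second=0, microsecond=0
    )
    # If it is already Friday but past 07:00 UTC, move to the following week.
    if candidate < now:
        candidate += timedelta(days=7)
    return candidate


if __name__ == "__main__":
    print(next_friday_7am_utc(datetime.now(timezone.utc)).isoformat())
```

The input must be a timezone-aware `datetime`, so that the conversion to UTC is unambiguous.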


General process

  • Notify users on ML&Wiki
  • Import
  • Switch traffic to eqiad only
  • Migrate all machines in codfw
  • Switch traffic to codfw (user impact starts)
  • Notify users on ML&Wiki (response)
  • Migrate all machines in eqiad
  • Re-open traffic to both DCs

Details

Before we start (week before):

  • query-preview.wikidata.org should be closed
  • stop the streaming updater if it is still running on k8s in eqiad/codfw as part of the k8s testing

Send a message to users on ML&Wiki with an estimate once W0 is known.

Deployment plan:

  • W0:Friday
    • depool wdqs2008 and ship a config patch to switch to the streaming-updater-consumer
    • start import on wdqs1009 and wdqs2008 with --skolemize: best case 10 days (importing on 2 machines maximizes the chances that at least one succeeds)
    • start import on wdqs2008: best case 10 days
    • from stat1004 generate the initial state to swift://rdf-streaming-updater-eqiad.thanos-swift/wikidata/savepoints/initial_state_$IMPORT_DATE and swift://rdf-streaming-updater-codfw.thanos-swift/wikidata/savepoints/initial_state_$IMPORT_DATE
    • start the updater producer on k8s@eqiad and k8s@codfw using the corresponding savepoint (note the time)
  • W1: monitor the import and react quickly (the import process is known to be fragile)
  • W2:Monday
    • if both imports worked, start the updater-consumer on wdqs1009 and wdqs2008 (if not automatically started)
    • if only one import worked, use data-transfer to ship the data to the other machine
    • wait for the lag to catch up (estimate: 1 to 2 days)
  • W2:Wednesday
    • switch all traffic to eqiad
    • start data-transfer + updater-consumer activation wdqs2008 -> all codfw machines (estimate: 2 to 3 days: 3h/machine * 7)
  • W3:Monday
    • Switch traffic to codfw: users are now impacted
    • Notify users
    • Monitor that everything works fine
  • W3:Wednesday
    • start data-transfer + updater-consumer activation wdqs1009 -> all eqiad machines (estimate: 2 to 3 days: 3h/machine * 10)
    • re-enable eqiad
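
The milestones above can be turned into calendar dates once W0 is fixed. The sketch below assumes that W0 is the calendar week containing the import Friday and that each later week starts on a Monday; under that reading W2:Monday falls 10 days after the import starts, which matches the best-case import time. The function names are hypothetical. The transfer figure is only the raw 3h/machine product from the plan, so the stated 2 to 3 days presumably includes slack.

```python
from datetime import date, timedelta

HOURS_PER_MACHINE = 3   # from the plan: 3h/machine
CODFW_MACHINES = 7      # from the plan: 3h/machine * 7
EQIAD_MACHINES = 10     # from the plan: 3h/machine * 10


def milestones(w0_friday: date) -> dict:
    """Map each milestone label to a date, given the W0 Friday."""
    if w0_friday.weekday() != 4:
        raise ValueError("W0 must be a Friday")
    w0_monday = w0_friday - timedelta(days=4)

    def day(week: int, weekday: int) -> date:
        # weekday: 0 = Monday, 2 = Wednesday, 4 = Friday
        return w0_monday + timedelta(weeks=week, days=weekday)

    return {
        "W0:Friday": day(0, 4),
        "W2:Monday": day(2, 0),
        "W2:Wednesday": day(2, 2),
        "W3:Monday": day(3, 0),
        "W3:Wednesday": day(3, 2),
    }


def transfer_hours(machines: int, hours_per_machine: int = HOURS_PER_MACHINE) -> int:
    """Raw sequential data-transfer time, without any slack."""
    return machines * hours_per_machine


if __name__ == "__main__":
    # Hypothetical W0; replace with the actual import Friday.
    for label, when in milestones(date(2021, 8, 6)).items():
        print(label, when.isoformat())
    print("codfw transfer hours:", transfer_hours(CODFW_MACHINES))
    print("eqiad transfer hours:", transfer_hours(EQIAD_MACHINES))
```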