Wikidata Query Service/Streaming Updater Rollout Plan


Important notes:

  • Week of Sept. 13, 2021: a datacenter switch (codfw -> eqiad) is planned, so this rollout should happen after that switch
  • The revision map used to generate the initial states is available on Fridays (7am UTC)
  • Dumps should be considered available on the mirror on Fridays
  • The plan does not depend on which DC is active for MediaWiki


General process

  • Notify users on ML&Wiki
  • Import
  • Switch traffic to eqiad only
  • Migrate all machines in codfw
  • Switch traffic to codfw (user impact starts)
  • Notify users on ML&Wiki (response)
  • Migrate all machines in eqiad
  • Re-open traffic to both DCs

Details

Before we start (week before):

  • query-preview.wikidata.org (backed by wdqs1009) should be closed because we will need wdqs1009 to do an import
  • stop the streaming updater if it is still running on k8s in eqiad/codfw as part of the k8s testing

Send a message to users on ML&Wiki with an estimate once W0 is known.

Deployment plan:

  • W0:Friday
    • depool wdqs2008 and ship a config patch to switch to the streaming-updater-consumer
    • start the import on wdqs1009 and wdqs2008 with --skolemize: best case 10 days (importing on 2 machines to maximize the chances of success)
    • from stat1004 generate the initial state to swift://rdf-streaming-updater-eqiad.thanos-swift/wikidata/savepoints/initial_state_$IMPORT_DATE and swift://rdf-streaming-updater-codfw.thanos-swift/wikidata/savepoints/initial_state_$IMPORT_DATE (see the savepoint-path sketch after this plan)
    • start the updater producer on k8s@eqiad and k8s@codfw using the corresponding savepoint (note the time)
      • Make sure that the retention of eqiad.rdf-streaming-updater.mutation and codfw.rdf-streaming-updater.mutation is set to 1 month on kafka-main in eqiad and codfw (ask Andrew Otto; see the retention-check sketch after this plan)
  • W1: monitor the import and react quickly (the import process is known to be fragile)
    • The import time must not be longer than the retention of the mutation topic
  • W2:Monday
    • if both imports worked, start the updater-consumer on wdqs1009 and wdqs2008 (if it has not started automatically)
    • if only one import worked, use data-transfer to ship the data to the other host
    • wait for the lag to catch up (EST: 1 to 2 days; see the lag-monitoring sketch after this plan)
  • W2:Wednesday
    • the lag on wdqs1009 and wdqs2008 is expected to be around 1 min
    • switch all traffic to eqiad
    • start data-transfer + updater-consumer activation, wdqs2008 -> all codfw machines (EST: 2 to 3 days; 3h/machine * 7)
      • Figure out the procedure to disable the old updater and activate the new updater on the machines only after the data-transfer is done
      • Figure out if there is a way to optimize and parallelize this process (see the transfer-scheduling sketch after this plan)
  • W3:Monday
    • Switch traffic to codfw: users are now impacted
    • Notify users
    • Monitor that everything works fine
    • if we have to roll back:
      • switch traffic to eqiad
      • data-transfer a journal from an eqiad machine that is not updated by the streaming updater to a codfw machine
      • propagate this journal to all codfw machines and re-enable the old updater
      • re-enable traffic to both DCs
  • W4:Monday
    • start data-transfer + updater-consumer activation wdqs1009 -> all eqiad machines (EST: 2 to 3 days; 3h/machine * 10)
      • except wdqs1010, which we could keep as a source for an emergency rollback
    • re-enable eqiad
    • if we have to roll back:
      • use wdqs1010 to propagate its journal everywhere and re-enable the old updater
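
The sketches below illustrate a few of the checks referenced in the plan. They are illustrative only: hosts, metric names, and helper functions that do not appear in the plan are placeholders, not actual tooling.

Savepoint paths (W0): a minimal sketch of how the per-DC initial-state paths can be derived from the import date. The YYYYMMDD date format and the helper function are assumptions; only the container names and the wikidata/savepoints prefix come from the plan.

```python
# Minimal sketch: build the Thanos-Swift savepoint URIs used when generating the
# initial state (W0). The YYYYMMDD date format and this helper are assumptions;
# the container names and the wikidata/savepoints prefix come from the plan above.
from datetime import date

SAVEPOINT_TEMPLATE = (
    "swift://rdf-streaming-updater-{dc}.thanos-swift/"
    "wikidata/savepoints/initial_state_{import_date}"
)

def initial_state_paths(import_date: str) -> dict:
    """Return the eqiad and codfw savepoint URIs for a given import date."""
    return {dc: SAVEPOINT_TEMPLATE.format(dc=dc, import_date=import_date)
            for dc in ("eqiad", "codfw")}

if __name__ == "__main__":
    today = date.today().strftime("%Y%m%d")  # assumed $IMPORT_DATE format
    for dc, path in initial_state_paths(today).items():
        print(f"{dc}: {path}")
```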
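
Retention check (W0/W1): the constraint that the import must finish within the retention of the mutation topics can be sanity-checked with a small script. This is a sketch only, assuming the confluent_kafka client is available; the broker address and the example initial-state date are placeholders, and "1 month" is approximated as 31 days.

```python
# Sketch: verify that the mutation topics keep data long enough to cover the import.
# Assumptions: confluent_kafka is installed, the bootstrap broker is a placeholder,
# and "1 month" is taken as 31 days; topic names come from the plan above.
from datetime import datetime, timezone

from confluent_kafka.admin import AdminClient, ConfigResource

BOOTSTRAP = "PLACEHOLDER-kafka-main-broker:9092"   # placeholder, not a real host
TOPICS = ["eqiad.rdf-streaming-updater.mutation",
          "codfw.rdf-streaming-updater.mutation"]
# Example date only: the Friday 7am UTC when the initial state was generated.
INITIAL_STATE_DATE = datetime(2021, 9, 17, 7, 0, tzinfo=timezone.utc)
ONE_MONTH_MS = 31 * 24 * 3600 * 1000

admin = AdminClient({"bootstrap.servers": BOOTSTRAP})
resources = [ConfigResource(ConfigResource.Type.TOPIC, t) for t in TOPICS]

for resource, future in admin.describe_configs(resources).items():
    entries = future.result()                      # dict: config name -> ConfigEntry
    retention_ms = int(entries["retention.ms"].value)
    elapsed_ms = (datetime.now(timezone.utc) - INITIAL_STATE_DATE).total_seconds() * 1000
    budget_days = (retention_ms - elapsed_ms) / (24 * 3600 * 1000)
    print(f"{resource}: retention={retention_ms}ms "
          f"({'OK' if retention_ms >= ONE_MONTH_MS else 'TOO SHORT'}), "
          f"~{budget_days:.1f} days of import/catch-up budget left")
```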
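
Lag monitoring (W2): the "wait for the lag to catch up" steps can be watched from a dashboard or polled from Prometheus. The sketch below uses the standard Prometheus instant-query API; the endpoint URL and the PromQL metric name are placeholders, not the actual WDQS metric names. Only the hosts and the ~1 min target come from the plan.

```python
# Sketch: poll Prometheus until the update lag on the two import hosts drops below
# a threshold. Endpoint and metric name are placeholders/assumptions.
import time

import requests

PROMETHEUS = "http://PLACEHOLDER-prometheus/api/v1/query"          # placeholder endpoint
LAG_QUERY = 'PLACEHOLDER_wdqs_lag_seconds{instance=~"wdqs1009.*|wdqs2008.*"}'  # placeholder metric
TARGET_SECONDS = 60   # W2 expectation: lag around 1 min

def current_lags():
    """Return {instance: lag_seconds} from an instant Prometheus query."""
    resp = requests.get(PROMETHEUS, params={"query": LAG_QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return {r["metric"].get("instance", "?"): float(r["value"][1]) for r in result}

while True:
    lags = current_lags()
    print(lags)
    if lags and all(lag <= TARGET_SECONDS for lag in lags.values()):
        print("lag has caught up everywhere")
        break
    time.sleep(300)   # check every 5 minutes
```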
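
Transfer scheduling (W2/W4): the open question about parallelizing the data-transfers can be explored with simple arithmetic. The sketch below only covers the transfers themselves (not updater activation or catch-up); the 3h/machine figure and the host counts come from the plan, while the parallelism values are hypothetical.

```python
# Sketch: rough wall-clock estimate for the data-transfer fan-out (W2: wdqs2008 ->
# 7 codfw hosts, W4: wdqs1009 -> 10 eqiad hosts). Pure arithmetic; the 3h/machine
# figure comes from the plan, the parallelism values are hypothetical.
import math

HOURS_PER_TRANSFER = 3

def wall_clock_hours(machines: int, parallelism: int) -> int:
    """Total hours if `parallelism` transfers can run at once (waves of transfers)."""
    waves = math.ceil(machines / parallelism)
    return waves * HOURS_PER_TRANSFER

for dc, machines in (("codfw", 7), ("eqiad", 10)):
    for parallelism in (1, 2, 3):
        print(f"{dc}: {machines} machines, parallelism {parallelism} -> "
              f"~{wall_clock_hours(machines, parallelism)}h of transfers")
```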