Wikidata Query Service/Streaming Updater Rollout Plan

Important notes:

  • Week of Sept. 13, 2021: planned datacenter switch codfw -> eqiad, so we should start this rollout after the switch
  • Revision maps used to generate the initial states are available on Fridays (7am UTC)
  • Dumps should be considered available on the mirror on Fridays
  • The plan does not depend on which DC is active for MediaWiki


General process

  • Notify users on the mailing list and wiki
  • Import
  • Switch traffic to eqiad only
  • Migrate all machines in codfw
  • Switch traffic to codfw (user impact starts)
  • Notify users on the mailing list and wiki (follow-up)
  • Migrate all machines in eqiad
  • Re-open traffic to both DCs
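
The traffic switches above (eqiad only, codfw only, both) are DNS discovery pool/depool operations. A minimal sketch with confctl, assuming wdqs is exposed as a discovery object literally named wdqs (the object name is an assumption, verify it before running):

  # Run from a host with confctl access; check the current state first.
  confctl --object-type discovery select 'dnsdisc=wdqs,name=codfw' get

  # Depool codfw so all traffic goes to eqiad:
  confctl --object-type discovery select 'dnsdisc=wdqs,name=codfw' set/pooled=false

  # Re-open traffic to both DCs at the end of the plan:
  confctl --object-type discovery select 'dnsdisc=wdqs,name=codfw' set/pooled=true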

Details

Before we start (week before):

  • query-preview.wikidata.org backed by wdqs1009 should be closed because we will need wdqs1009 to do an import
  • stop the streaming updater if still running on k8s eqiad/codfw as part of testing k8s
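
For the k8s step, a sketch assuming the updater runs as a standard helmfile-managed service named rdf-streaming-updater in the deployment-charts repo (the service name and paths are assumptions):

  # On the deployment host; destroy the release in each cluster it runs in.
  cd /srv/deployment-charts/helmfile.d/services/rdf-streaming-updater
  helmfile -e eqiad destroy
  helmfile -e codfw destroy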

Send a message to users on the mailing list and wiki with an estimate once W0 is known.

Deployment plan:

  • W0:Friday
    • depool wdqs2008 and ship a config patch to switch to the streaming-updater-consumer
    • start import on wdqs1009 and wdqs2008 with --skolemize: best case 10 days (import from 2 machines to maximize the chances of success; see the import sketch after this plan)
    • from stat1004 generate the initial state to swift://rdf-streaming-updater-eqiad.thanos-swift/wikidata/savepoints/initial_state_$IMPORT_DATE and swift://rdf-streaming-updater-codfw.thanos-swift/wikidata/savepoints/initial_state_$IMPORT_DATE
    • start the updater producer on k8s@eqiad and k8s@codfw using the corresponding savepoint (note the time; see the savepoint example after this plan)
      • Make sure that the retention of eqiad.rdf-streaming-updater.mutation and codfw.rdf-streaming-updater.mutation is set to 1 month on kafka-main in eqiad and codfw (ask Andrew Otto; see the retention example after this plan)
  • W1: monitor the import and react quickly (the import process is known to be fragile)
    • The import time must not exceed the retention of the mutation topic
  • W2:Monday
    • if both imports worked start the updater-consumer on wdqs1009 and wdqs2008 (if not automatically started)
    • if only one import worked, use data-transfer to ship the data to the other host
    • wait for the lag to catch up (EST: 1 to 2 days; see the lag-check example after this plan)
  • W2:Wednesday
    • it is expected that the lag on wdqs1009 and wdqs2008 is around 1 min
    • switch all traffic to eqiad
    • start data-transfer + updater-consumer activation, wdqs2008 -> all codfw machines (EST: 2 to 3 days: 3h/machine × 7; see the data-transfer sketch after this plan)
      • Figure out the procedure to disable the old updater and activate the new updater on the machines only after the data-transfer is done
      • Figure out if there is a way to optimize and parallelize this process
  • W3:Monday
    • Switch traffic to codfw: users are now impacted
    • Notify users
    • Monitor that everything works fine
  • W3:Wednesday
    • if we have to rollback:
      • switch traffic to eqiad
      • data-transfer a journal from an eqiad machine that is not updated by the streaming updater to a codfw machine
      • propagate this journal to all codfw machines and re-enable the old updater
      • re-enable traffic to both DCs
  • W4:Monday
    • start data-transfer + updater-consumer activation, wdqs1009 -> all eqiad machines (EST: 2 to 3 days: 3h/machine × 10)
      • except wdqs1010, which we could use as a source for emergency rollback
    • re-enable eqiad
    • if we have to rollback:
      • use wdqs1010 to propagate its journal everywhere and re-enable the old updater
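
Import sketch (W0:Friday). The exact commands depend on the deployed WDQS scripts; the dump path, the output directory, and the way --skolemize reaches the munger are assumptions, so verify against the deployed munge.sh before running:

  # On wdqs1009 / wdqs2008, after depooling the host:
  sudo depool
  cd /srv/deployment/wdqs/wdqs

  # Munge the latest TTL dump, skolemizing blank nodes
  # (the "-- --skolemize" pass-through is an assumption; check the script's help):
  ./munge.sh -f /srv/wdqs/latest-all.ttl.gz -d /srv/wdqs/munged -- --skolemize

  # Load the munged chunks into Blazegraph (best case ~10 days):
  ./loadData.sh -n wdq -d /srv/wdqs/munged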
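
Savepoint example (W0:Friday). Starting the producer from the generated initial state is a standard Flink start-from-savepoint; the jar name below is a placeholder and the job options are omitted:

  # -s resumes the job from the given savepoint; note the time when the job starts.
  flink run \
    -s swift://rdf-streaming-updater-eqiad.thanos-swift/wikidata/savepoints/initial_state_$IMPORT_DATE \
    streaming-updater-producer.jar
  # Repeat on k8s@codfw with the codfw savepoint path.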
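
Retention example (W0:Friday). Checking and raising the mutation-topic retention with stock Kafka tooling; the broker host is illustrative, and older brokers may need --zookeeper instead of --bootstrap-server:

  # 1 month as 31 days in milliseconds: 31*24*3600*1000 = 2678400000
  kafka-configs.sh --bootstrap-server kafka-main1001.eqiad.wmnet:9092 \
    --entity-type topics --entity-name eqiad.rdf-streaming-updater.mutation --describe
  kafka-configs.sh --bootstrap-server kafka-main1001.eqiad.wmnet:9092 \
    --entity-type topics --entity-name eqiad.rdf-streaming-updater.mutation \
    --alter --add-config retention.ms=2678400000
  # Repeat on kafka-main in codfw for codfw.rdf-streaming-updater.mutation.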
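
Lag-check example (W2). One way to watch catch-up with stock Kafka tooling, assuming each consumer registers a consumer group (the group name below is a placeholder, list the groups first):

  kafka-consumer-groups.sh --bootstrap-server kafka-main1001.eqiad.wmnet:9092 --list
  # The LAG column shows how far behind each partition is:
  kafka-consumer-groups.sh --bootstrap-server kafka-main1001.eqiad.wmnet:9092 \
    --describe --group wdqs1009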
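
Data-transfer sketch (W2:Wednesday and W4:Monday). These steps are normally driven by a spicerack cookbook; the cookbook name and flags below are assumptions, check the cookbooks repo before running:

  # From a cumin host, copy the journal to the target machine:
  sudo cookbook sre.wdqs.data-transfer \
    --source wdqs2008.codfw.wmnet --dest wdqs2001.codfw.wmnet \
    --reason "streaming updater rollout"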