
Difference between revisions of "Wikidata Query Service/Streaming Updater Rollout Plan"

imported>DCausse
(Replaced content with "Moved to phabricator at phab:T288231")
 
{{Template:Draft}}
Moved to phabricator at [[phab:T288231]]
 
Important notes:
* Week of Sept. 13, 2021: a datacenter switch codfw -> eqiad is planned, so the rollout should start after the switch
* The revision maps used to generate the initial states are available on Fridays (7am UTC)
* Dumps should be considered available on Fridays on the mirror
* The plan does not depend on which DC is active for MW
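
Since the revision maps appear on Fridays at 7am UTC, the earliest possible start of W0 can be derived from any given date. The following is a minimal sketch of that calculation; the function name <code>next_revision_map</code> is hypothetical and not part of any WDQS tooling.

```python
from datetime import datetime, timedelta, timezone

FRIDAY = 4  # datetime.weekday(): Monday is 0, so Friday is 4


def next_revision_map(now: datetime) -> datetime:
    """Return the first Friday 07:00 UTC at or after `now`.

    `now` must be timezone-aware; it is converted to UTC first.
    The function name is hypothetical, for illustration only.
    """
    now = now.astimezone(timezone.utc)
    # Same calendar day at 07:00 UTC, then move forward to Friday.
    candidate = now.replace(hour=7, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(FRIDAY - now.weekday()) % 7)
    # If it is already Friday but past 07:00, wait for next week.
    if candidate < now:
        candidate += timedelta(days=7)
    return candidate
```

For example, a call with Monday 13 September 2021 returns Friday 17 September 2021 at 07:00 UTC.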
 
 
= General process =
* Notify users on ML&Wiki
* Import
* Switch traffic to eqiad only
* Migrate all machines in codfw
* Switch traffic to codfw (user impact starts)
* Notify users on ML&Wiki (response)
* Migrate all machines in eqiad
* Re-open traffic to both DCs
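
The ordering above implies that a datacenter is only migrated while the other one serves all traffic. The sketch below encodes the traffic-related steps as data and checks that property. The tuple layout and the name <code>check_plan</code> are assumptions made for illustration, not an existing tool.

```python
# Each step: (description, DCs serving traffic during the step, DC being migrated).
# This encoding is an illustrative assumption, not part of the actual tooling.
STEPS = [
    ("Switch traffic to eqiad only", {"eqiad"}, None),
    ("Migrate all machines in codfw", {"eqiad"}, "codfw"),
    ("Switch traffic to codfw", {"codfw"}, None),
    ("Migrate all machines in eqiad", {"codfw"}, "eqiad"),
    ("Re-open traffic to both DCs", {"eqiad", "codfw"}, None),
]


def check_plan(steps):
    """Raise ValueError if a step leaves no DC serving traffic,
    or migrates a DC that is serving traffic at the same time."""
    for description, serving, migrating in steps:
        if not serving:
            raise ValueError(f"no DC serves traffic during: {description}")
        if migrating in serving:
            raise ValueError(f"{migrating} is migrated while serving: {description}")
    return True
```

Calling <code>check_plan(STEPS)</code> passes for the order listed here; swapping the two migration steps makes it raise.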
 
= Details =
Before we start (week before):
* query-preview.wikidata.org backed by wdqs1009 should be closed because we will need wdqs1009 to do an import
* stop the streaming updater if still running on k8s eqiad/codfw as part of testing k8s
 
Send a message to users on ML&wiki with an estimate once W0 is known.
 
Deployment plan:
* W0:Friday
** depool wdqs2008 and ship a config patch to switch to the streaming-updater-consumer
** start import on wdqs1009 and wdqs2008 with <code>--skolemize</code>: best case 10 days (import from 2 machines to maximize the chances of success)
** from stat1004 generate the initial state to <code>swift://rdf-streaming-updater-eqiad.thanos-swift/wikidata/savepoints/initial_state_$IMPORT_DATE</code> and <code>swift://rdf-streaming-updater-codfw.thanos-swift/wikidata/savepoints/initial_state_$IMPORT_DATE</code>
** start the updater producer on k8s@eqiad and k8s@codfw using the corresponding savepoint (note the time)
*** Make sure that the retention of <code>eqiad.rdf-streaming-updater.mutation</code> and <code>codfw.rdf-streaming-updater.mutation</code> is set to 1 month on kafka-main in eqiad and codfw (ask Andrew Otto)
* W1: monitor the import and react quickly (the import process is known to be fragile)
** The import time must not exceed the retention of the mutation topic
* W2:Monday
** if both imports worked start the updater-consumer on wdqs1009 and wdqs2008 (if not automatically started)
** if only one import worked use the data-transfer to ship the data
** wait for the lag to catch up (EST: 1 to 2 days)
* W2:Wednesday
** it is expected that the lag on wdqs1009 and wdqs2008 is around 1min
** switch all traffic to eqiad
** start data-transfer + updater-consumer activation, wdqs2008 -> all codfw machines (EST: 2 to 3 days: 3h/machine * 7)
*** Figure out the procedure to disable the old updater and activate the new updater on the machines only after the data-transfer is done
*** Figure out if there is a way to optimize and parallelize this process
* W3:Monday
** Switch traffic to codfw: '''users are now impacted'''
** Notify users
** Monitor that everything works fine
** if we have to rollback:
*** switch traffic to eqiad
*** data-transfer a journal from an eqiad machine that is not updated by the streaming updater to a codfw machine
*** propagate this journal to all codfw machines and re-enable the old updater
*** re-enable traffic to both DCs
* W4:Monday
** start data-transfer + updater-consumer activation wdqs1009 -> all eqiad machines (EST: 2 to 3 days: 3h/machine * 10)
*** except wdqs1010 that we could use as source for emergency rollback
** re-enable eqiad
** if we have to rollback:
*** use wdqs1010 to propagate its journal everywhere and re-enable the old updater
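
The schedule and its time constraints can be sanity-checked with a small calculation. The sketch below assumes that W0 to W4 are consecutive calendar weeks starting on Monday and that the "1 month" retention is roughly 30 days; both are assumptions, as are all function names. The <code>parallelism</code> parameter is purely hypothetical, since whether transfers can run in parallel is still an open question in the plan.

```python
from datetime import date, timedelta

IMPORT_BEST_CASE_DAYS = 10  # from the plan: "best case 10 days"
RETENTION_DAYS = 30         # assumption: "1 month" approximated as 30 days
HOURS_PER_MACHINE = 3       # from the plan: 3h/machine


def milestones(w0_friday: date) -> dict:
    """Map each milestone of the plan to a calendar date.

    Assumes W0..W4 are consecutive calendar weeks starting on Monday.
    """
    if w0_friday.weekday() != 4:
        raise ValueError("W0 must start on a Friday")
    w0_monday = w0_friday - timedelta(days=4)

    def day(week: int, weekday: int) -> date:
        return w0_monday + timedelta(weeks=week, days=weekday)

    return {
        "W0:Friday": w0_friday,
        "W2:Monday": day(2, 0),
        "W2:Wednesday": day(2, 2),
        "W3:Monday": day(3, 0),
        "W4:Monday": day(4, 0),
    }


def import_fits_retention(import_days: int, retention_days: int = RETENTION_DAYS) -> bool:
    """True if the mutation topic still holds every event once the import ends."""
    return import_days <= retention_days


def transfer_hours(machines: int, parallelism: int = 1) -> int:
    """Wall-clock hours of pure transfer time, in batches of `parallelism` machines.

    parallelism > 1 is hypothetical; the plan lists it as something to figure out.
    """
    batches = -(-machines // parallelism)  # ceiling division
    return batches * HOURS_PER_MACHINE
```

With a W0 of Friday 17 September 2021, W2:Monday falls on 27 September, exactly the 10-day best case for the import. Sequential transfers amount to 21 hours of copy time for 7 machines and 30 hours for 10, which does not include the time spent depooling, switching updaters, or waiting between machines.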

Latest revision as of 11:58, 15 September 2021

Moved to phabricator at phab:T288231