You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Server Admin Log: Difference between revisions

From Wikitech-static
imported>Labslogbot
(krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/264460/ (duration: 02m 30s) (logmsgbot))
imported>Stashbot
(ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1141 (T312863)', diff saved to https://phabricator.wikimedia.org/P32380 and previous config saved to /var/cache/conftool/dbconfig/20220814-085443-ladsgroup.json)
Line 1: Line 1:
== 2016-01-28 ==
== 2022-08-14 ==
* 01:03 logmsgbot: krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/264460/ (duration: 02m 30s)
* 08:54 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1141 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32380 and previous config saved to /var/cache/conftool/dbconfig/20220814-085443-ladsgroup.json
* 00:58 logmsgbot: krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/264066/ (duration: 02m 26s)
* 08:54 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance
* 00:46 logmsgbot: krenair@mira Synchronized php-1.27.0-wmf.11/extensions/Gather/resources: https://gerrit.wikimedia.org/r/#/c/266793/ and https://gerrit.wikimedia.org/r/#/c/266792/ (duration: 02m 23s)
* 08:54 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance
* 00:41 logmsgbot: krenair@mira Synchronized php-1.27.0-wmf.11/extensions/Flow/: https://gerrit.wikimedia.org/r/#/c/266939/ (duration: 02m 27s)
* 00:27 logmsgbot: krenair@mira Synchronized php-1.27.0-wmf.10/extensions/Flow/includes: https://gerrit.wikimedia.org/r/#/c/266938/ (duration: 02m 29s)
* 00:09 logmsgbot: krenair@mira Synchronized wmf-config/InitialiseSettings-labs.php: https://gerrit.wikimedia.org/r/#/c/266945/ (duration: 02m 36s)


== 2016-01-27 ==
== 2022-08-13 ==
* 22:36 robh: restarting parsoid-rt-client service on ruthenium
* 13:37 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance
* 22:29 ottomata: starting mysqldump of MobileWebSectionUsage_14321266 from db1047 into m4-master
* 13:37 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance
* 21:45 yurik: updated graphoid on scb*
* 13:37 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32379 and previous config saved to /var/cache/conftool/dbconfig/20220813-133713-ladsgroup.json
* 21:29 mdholloway: mobileapps deployed 6f35859
* 13:22 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P32378 and previous config saved to /var/cache/conftool/dbconfig/20220813-132207-ladsgroup.json
* 21:26 cscott: updated OCG to version 64050af0456a43344b32e3e93561a79207565eaf
* 13:07 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P32377 and previous config saved to /var/cache/conftool/dbconfig/20220813-130701-ladsgroup.json
* 21:26 logmsgbot: ori@mira Synchronized docroot and w: (no message) (duration: 02m 26s)
* 12:51 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32376 and previous config saved to /var/cache/conftool/dbconfig/20220813-125156-ladsgroup.json
* 19:48 YuviPanda: started nfs-exports daemon on labstore1001, had been dead for a few days
* 19:32 mutante: stat1002 - redis.exceptions.ConnectionError: Error connecting to mira.codfw.wmnet:6379. timed out.
* 19:31 mutante: stat1002 - running puppet, was reported as last run about 4 hours ago but not deactivated
* 19:14 logmsgbot: dduvall@mira rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.27.0-wmf.11
* 19:07 ejegg: set donation queue consumer time limit back to 90 sec
* 18:49 logmsgbot: jynus@mira Synchronized wmf-config/db-eqiad.php: Repool pc1006 after cloning (duration: 02m 25s)
* 18:48 bd808: HHVM on mw1019 still dying on a regular basis with "Lost parent, LightProcess exiting"
* 18:00 csteipp: deploy patch for T103239
* 17:50 csteipp: deploy patch for T97157
* 17:47 jynus: migrating ruthenium parsoid-test database to m5-master
* 17:27 elukey: rebooting analytics105* hosts to upgrade their kernel
* 17:16 elukey: rebooting analytics1035.eqiad.wmnet for kernel upgrade
* 16:23 ejegg: updated SmashPig from 072c7ec6ed94e7074ba35b7986d5dde94866fe2f to 97629339994bffe8831a9067f5e9c21fa423586b
* 16:22 logmsgbot: thcipriani@mira Synchronized php-1.27.0-wmf.11/extensions/CentralAuth/includes/CentralAuthUtils.php: SWAT: Preserve certain keys when updating central session [[gerrit:266672]] (duration: 02m 28s)
* 16:11 logmsgbot: thcipriani@mira Synchronized php-1.27.0-wmf.11/extensions/CentralAuth/includes/session/CentralAuthSessionProvider.php: SWAT: Avoid forceHTTPS cookie flapping if core and CA are setting the same cookie [[gerrit:266671]] (duration: 02m 26s)
* 16:03 elukey: rebooting analytics 1043 -> 1050 for kernel upgrade.
* 15:47 elukey: rebooting analytics 1026, 1040 -> 1042 due to kernel upgrade.
* 14:58 jynus: cloning parsercache contents from pc1003 to pc1006
* 14:45 elukey: rebooting analytics 1036 to 1039 for kernel upgrade
* 14:35 elukey: analytics 1035 hasn't been rebooted because it is a Hadoop Journal Node (will be restarted in the end)
* 14:04 elukey: rebooting analytics 1032 to 1035 for kernel upgrades
* 14:03 logmsgbot: jynus@mira Synchronized wmf-config/db-eqiad.php: Depool pc1003 for cloning to pc1006 (duration: 02m 30s)
* 13:59 jynus: about to move to new hardware/OS/mariadb-only for parsercache service
* 13:32 elukey: rebooting analytics1030/1031 for kernel upgrade
* 13:15 akosiaris: rebooting fermium for kernel upgrades
* 13:10 elukey: rebooting analytics1029 for kernel upgrade
* 12:29 moritzm: rebooting analytics1028 for kernel update
* 10:25 ema: restarting apache2 and hhvm on mw1119
* 03:19 logmsgbot: ebernhardson@mira Synchronized wmf-config/CirrusSearch-production.php: Correct invalid cirrus shard configuration (duration: 02m 59s)
* 02:55 logmsgbot: l10nupdate@tin ResourceLoader cache refresh completed at Wed Jan 27 02:55:21 UTC 2016 (duration 7m 13s)
* 02:48 logmsgbot: mwdeploy@tin sync-l10n completed (1.27.0-wmf.11) (duration: 10m 25s)
* 02:24 logmsgbot: mwdeploy@tin sync-l10n completed (1.27.0-wmf.10) (duration: 09m 51s)
* 01:59 logmsgbot: ori@mira Synchronized docroot and w: Icc4f6134b0: Add a speed experiment which inlines the top stylesheet (duration: 02m 28s)
* 01:29 MaxSem: on terbium: ran mwscript namespaceDupes.php --wiki=wuuwiki --source-pseudo-namespace='' --add-suffix=/renamed --fix
* 01:26 MaxSem: Fail, trying something else...
* 01:21 MaxSem: running mwscript namespaceDupes.php  --wiki=wuuwiki --move-talk --fix
* 00:52 logmsgbot: krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/266497/ (duration: 02m 26s)
* 00:48 logmsgbot: krenair@mira Synchronized w/static/images/project-logos/ukwikinews.png: https://gerrit.wikimedia.org/r/#/c/266497/ (duration: 02m 29s)
* 00:44 logmsgbot: krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/266161/ (duration: 02m 27s)
* 00:15 logmsgbot: ebernhardson@mira Synchronized php-1.27.0-wmf.11/extensions/CirrusSearch/: Allow pointing morelike queries at a specific datacenter (duration: 03m 04s)
* 00:10 logmsgbot: ebernhardson@mira Synchronized wmf-config/CirrusSearch-production.php: point morelike queries back at the eqiad cluster (duration: 05m 41s)
* 00:02 chasemp: enable puppet and codify the 192 thread count for nfsd


== 2016-01-26 ==
== 2022-08-12 ==
* 22:25 logmsgbot: dduvall@mira rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.27.0-wmf.11, for real this time
* 23:41 mutante: wikistats-bullseye:~$ /usr/lib/wikistats/update.php wp prefix blk ; /usr/lib/wikistats/update.php wp prefix kcg [[phab:T315121|T315121]]
* 22:17 logmsgbot: dduvall@mira rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.27.0-wmf.11
* 23:38 mutante: [mwmaint1002:~] $ sudo systemctl start mediawiki_job_initsitestats.timer [[phab:T315121|T315121]]
* 22:15 logmsgbot: dduvall@mira Synchronized php-1.27.0-wmf.11: syncing wmf.11 backports of session fixes (duration: 03m 55s)
* 22:14 ryankemper@cumin1001: END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 21:55 logmsgbot: ori@mira Synchronized docroot and w: I9b054d847a: New set of speed experiments (duration: 01m 29s)
* 21:48 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1071.eqiad.wmnet with OS bullseye
* 21:41 marxarelli: filed https://phabricator.wikimedia.org/T124828 for fatal in extensions/Echo
* 21:45 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddb2002-dev.codfw.wmnet with OS bullseye
* 21:22 marxarelli: Fatal error: Cannot redeclare class CallbackFilterIterator in /srv/mediawiki-staging/php-1.27.0-wmf.11/extensions/Echo/includes/iterator/CallbackFilterIterator.php on line 24
* 21:27 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1071.eqiad.wmnet with reason: host reimage
* 21:21 marxarelli: lint error found when running sync-dir 'Errors parsing /srv/mediawiki-staging/php-1.27.0-wmf.11/extensions/Echo/includes/iterator/CallbackFilterIterator.php'
* 21:25 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1071.eqiad.wmnet with reason: host reimage
* 21:11 marxarelli: sync-dir php linting failed
* 21:12 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1071.eqiad.wmnet with OS bullseye
* 21:02 marxarelli: resuming sync-dir and ignoring error as a known issue
* 21:10 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb2002-dev.codfw.wmnet with reason: host reimage
* 20:59 marxarelli: getting 'Lost parent, LightProcess exiting' when running sync-dir
* 21:06 andrew@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb2002-dev.codfw.wmnet with reason: host reimage
* 20:57 chasemp: drop labstore1001 nfs threads down to 192
* 21:06 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1053.eqiad.wmnet with OS bullseye
* 20:46 chasemp: stopping nfs on labstore1001
* 20:50 andrew@cumin1001: START - Cookbook sre.hosts.reimage for host clouddb2002-dev.codfw.wmnet with OS bullseye
* 20:46 marxarelli: modified wikiversions.php locally on mw1017 to promote all wikis to wmf.11 for initial testing
* 20:43 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1053.eqiad.wmnet with reason: host reimage
* 20:18 marxarelli: locally modified wikiversions.php and wikiversions.json on mw1017 for testing
* 20:39 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1053.eqiad.wmnet with reason: host reimage
* 20:14 marxarelli: running 'sync-common --verbose deployment.eqiad.wmnet' on mw1017 to sync wmf.11 for initial testing
* 20:24 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1053.eqiad.wmnet with OS bullseye
* 20:02 marxarelli: proceeding with train deploy. wmf.11 to mw1017, then group0
* 20:12 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1048.eqiad.wmnet with OS bullseye
* 19:46 akosiaris: issuing a varnish ban on all esams mobile frontend varnish for req.http.host .*wikimedia.org
* 19:55 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1048.eqiad.wmnet with reason: host reimage
* 19:45 akosiaris: issuing a varnish ban on all esams mobile backend varnish for req.http.host .*wikimedia.org
* 19:53 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1048.eqiad.wmnet with reason: host reimage
* 19:44 akosiaris: issuing a varnish ban on all ulsfo mobile frontend varnish for req.http.host .*wikimedia.org
* 19:42 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1048.eqiad.wmnet with OS bullseye
* 19:44 akosiaris: issuing a varnish ban on all ulsfo mobile backend varnish for req.http.host .*wikimedia.org
* 19:38 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1146:3314 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32375 and previous config saved to /var/cache/conftool/dbconfig/20220812-193822-ladsgroup.json
* 19:43 akosiaris: issuing a varnish ban on all codfw mobile frontend varnish for req.http.host .*wikimedia.org
* 19:38 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
* 19:36 akosiaris: issuing a varnish ban on all codfw mobile backend varnish for req.http.host .*wikimedia.org
* 19:38 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
* 19:36 akosiaris: issuing a varnish ban on all eqiad mobile frontend varnish for req.http.host .*wikimedia.org
* 19:38 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1121 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32374 and previous config saved to /var/cache/conftool/dbconfig/20220812-193801-ladsgroup.json
* 19:36 akosiaris: issuing a varnish ban on all eqiad mobile backend varnish for req.http.host .*wikimedia.org
* 19:33 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1054.eqiad.wmnet with OS bullseye
* 19:36 akosiaris: all of the above referred to cache_text
* 19:22 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P32373 and previous config saved to /var/cache/conftool/dbconfig/20220812-192255-ladsgroup.json
* 19:29 akosiaris: all of the above already done, back logging
* 19:12 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1054.eqiad.wmnet with reason: host reimage
* 19:29 akosiaris: issuing a varnish ban on all esams frontend varnish for req.http.host .*wikimedia.org
* 19:09 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1054.eqiad.wmnet with reason: host reimage
* 19:29 akosiaris: issuing a varnish ban on all esams backend varnish for req.http.host .*wikimedia.org
* 19:07 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P32372 and previous config saved to /var/cache/conftool/dbconfig/20220812-190749-ladsgroup.json
* 19:29 akosiaris: issuing a varnish ban on all ulsfo backend varnish for req.http.host .*wikimedia.org
* 18:58 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet with reason: Maint
* 19:29 akosiaris: issuing a varnish ban on all ulsfo frontend varnish for req.http.host .*wikimedia.org
* 18:58 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet with reason: Maint
* 19:28 akosiaris: issuing a varnish ban on all ulsfo backend varnish for req.http.host .*wikimedia.org
* 18:54 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1054.eqiad.wmnet with OS bullseye
* 19:28 akosiaris: issuing a varnish ban on all codfw frontend varnish for req.http.host .*wikimedia.org
* 18:52 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1121 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32371 and previous config saved to /var/cache/conftool/dbconfig/20220812-185243-ladsgroup.json
* 19:28 akosiaris: issuing a varnish ban on all codfw backend varnish for req.http.host .*wikimedia.org
* 18:48 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1066.eqiad.wmnet with OS bullseye
* 19:28 akosiaris: issuing a varnish ban on all eqiad frontend varnish for req.http.host .*wikimedia.org
* 18:25 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1066.eqiad.wmnet with reason: host reimage
* 19:14 akosiaris: issuing a varnish ban on all eqiad backend varnish for req.http.host .*wikimedia.org
* 18:22 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1066.eqiad.wmnet with reason: host reimage
* 19:02 marxarelli: backports to wmf.11 ready on mira but delaying train due to wikimedia.org outage
* 18:08 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1066.eqiad.wmnet with OS bullseye
* 18:44 _joe_: running salt --batch-size=20 -C 'G@cluster:appserver and G@site:eqiad' cmd.run 'puppet agent -t --tags mw-apache-config'
* 18:00 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1064.eqiad.wmnet with OS bullseye
* 18:27 robh: i broke icinga, but then i fixed it, icinga back to normal.
* 17:42 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1064.eqiad.wmnet with reason: host reimage
* 18:21 robh: icinga is broken, it seems it was from a change before mine, but my forced reload broke it
* 17:39 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1064.eqiad.wmnet with reason: host reimage
* 18:18 legoktm: running mwscript updateArticleCount.php --wiki=jawiki --update=1
* 17:24 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1064.eqiad.wmnet with OS bullseye
* 18:14 cmjohnson1: starting puppet on mw cluster
* 17:21 pt1979@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts netmon2002.wikimedia.org
* 18:14 robh: i broke icinga, fixing
* 17:21 pt1979@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts netmon2002.wikimedia.org
* 18:08 logmsgbot: jynus@mira Synchronized wmf-config/db-eqiad.php: Pool new parsercache pc1005 after cloning it from pc1002 (duration: 01m 28s)
* 17:19 pt1979@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netmon2002.wikimedia.org with OS bullseye
* 17:43 thcipriani: ltwiki collation updated 503623 rows processed
* 17:04 pt1979@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netmon2002.wikimedia.org with reason: host reimage
* 17:35 mutante: mw1258 - restart hhvm
* 17:01 pt1979@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on netmon2002.wikimedia.org with reason: host reimage
* 17:20 cmjohnson: disabling puppet on mw cluster
* 16:42 pt1979@cumin2002: START - Cookbook sre.hosts.reimage for host netmon2002.wikimedia.org with OS bullseye
* 17:02 thcipriani: running updateCollation on ltwiki
* 16:26 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1067.eqiad.wmnet with OS bullseye
* 17:01 logmsgbot: thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Set category collation to uca-lt on lt.wikipedia [[gerrit:266427]] (duration: 01m 33s)
* 16:21 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol2003-dev.wikimedia.org
* 16:55 logmsgbot: thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Namespace configuration on ur.wikipedia [[gerrit:265888]] (duration: 07m 10s)
* 16:21 andrew@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:36 logmsgbot: thcipriani@mira Synchronized w/static/images/project-logos/etwikiquote.png: SWAT: Update et.wikiquote logo [[gerrit:265623]] (duration: 01m 27s)
* 16:16 andrew@cumin1001: START - Cookbook sre.dns.netbox
* 16:31 logmsgbot: thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable SandboxLink on nl.wikiquote [[gerrit:265666]] (duration: 01m 26s)
* 16:11 andrew@cumin1001: START - Cookbook sre.hosts.decommission for hosts cloudcontrol2003-dev.wikimedia.org
* 16:26 logmsgbot: thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Namespaces configuration on sk.wikipedia [[gerrit:265896]] (duration: 01m 27s)
* 16:08 pt1979@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['netmon2002.wikimedia.org']
* 16:19 logmsgbot: thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Remove Transwiki namespace on wuu.wikipedia [[gerrit:265892]] and Add Portal namespace on wuu.wikipedia [[gerrit:265893]] (duration: 01m 27s)
* 16:03 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1067.eqiad.wmnet with reason: host reimage
* 16:12 logmsgbot: thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Namespace configuration for wuu.wikipedia [[gerrit:265891]] (duration: 01m 29s)
* 15:58 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1067.eqiad.wmnet with reason: host reimage
* 14:57 ema: Finished migration of mobile traffic to text cluster in esams https://phabricator.wikimedia.org/T109286
* 15:43 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1067.eqiad.wmnet with OS bullseye
* 14:48 chasemp: RPS on eth0 on labstores
* 15:37 pt1979@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['netmon2002.wikimedia.org']
* 14:39 bblack: upgrading packages (incl kernel) on all ulsfo caches (cp4xxx)
* 15:31 jbond@cumin2002: END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['netmon2002.wikimedia.org']
* 14:21 akosiaris: migrating alsafi,mx2001 back to 2004 for testing
* 15:31 jbond@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['netmon2002.wikimedia.org']
* 14:14 akosiaris: migrate alsafi,mx2001 back from ganeti2004 to fix a network misconfiguration
* 15:07 jbond@cumin2002: END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts netmon1002.wikimedia.org
* 13:32 moritzm: rebooted nescio/maerlant for kernel update
* 15:07 jbond@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts netmon1002.wikimedia.org
* 13:14 logmsgbot: jynus@mira Synchronized wmf-config/db-eqiad.php: Depool pc1002 for maintenance (clone to pc1005) (duration: 01m 39s)
* 15:04 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1061.eqiad.wmnet with OS bullseye
* 12:39 akosiaris: rolling reboot of ganeti200{1,2,3,4,5,6}.codfw.wmnet for kernel upgrade
* 14:46 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1061.eqiad.wmnet with reason: host reimage
* 12:10 moritzm: rebooting mx2001/mx1001 (with a delay in between) for kernel update
* 14:46 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2042.codfw.wmnet,service=varnish-fe
* 11:50 moritzm: rebooting etherpad1001 for kernel update
* 14:46 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2042.codfw.wmnet,service=ats-be
* 11:46 moritzm: rebooting bromine for kernel update
* 14:46 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2042.codfw.wmnet,service=ats-tls
* 10:50 ema: Starting migration of mobile traffic to text cluster in esams https://phabricator.wikimedia.org/T109286
* 14:43 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1061.eqiad.wmnet with reason: host reimage
* 09:30 hashar: restarting Jenkins to upgrade the gearman plugin with https://review.openstack.org/#/c/271543/
* 14:29 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maint
* 09:28 _joe_: finishing reboots of appservers in eqiad
* 14:28 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1061.eqiad.wmnet with OS bullseye
* 04:27 legoktm: restarted resetGlobalUserTokens.php after it lost mysql connection again
* 14:28 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maint
* 02:31 logmsgbot: l10nupdate@tin ResourceLoader cache refresh completed at Tue Jan 26 02:30:58 UTC 2016 (duration 7m 0s)
* 14:24 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1063.eqiad.wmnet with OS bullseye
* 02:24 logmsgbot: mwdeploy@tin sync-l10n completed (1.27.0-wmf.10) (duration: 09m 36s)
* 14:05 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1063.eqiad.wmnet with reason: host reimage
* 01:45 logmsgbot: krenair@mira Synchronized wmf-config/wikitech.php: https://gerrit.wikimedia.org/r/#/c/266453/ (duration: 01m 27s)
* 14:02 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1063.eqiad.wmnet with reason: host reimage
* 00:45 mobrovac: mobileapps deploying c2318b6
* 13:47 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1063.eqiad.wmnet with OS bullseye
* 00:40 logmsgbot: ebernhardson@mira Synchronized wmf-config/CommonSettings.php: (no message) (duration: 01m 25s)
* 13:41 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 00:37 logmsgbot: ebernhardson@mira Synchronized wmf-config/InitialiseSettings.php: SWAT bd808 (duration: 01m 34s)
* 06:01 ryankemper@puppetmaster1001: conftool action : set/pooled=yes:weight=10; selector: name=elastic10[8-9][0-9].*
* 00:32 logmsgbot: ebernhardson@mira Synchronized portals/: SWAT jgirault (duration: 01m 28s)
* 05:54 ryankemper@puppetmaster1001: conftool action : set/pooled=yes:weight=10; selector: name=elastic110.*
* 00:29 logmsgbot: ebernhardson@mira Synchronized wmf-config/InitialiseSettings.php: SWAT ebernhardson (duration: 01m 26s)
* 01:03 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1121 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32369 and previous config saved to /var/cache/conftool/dbconfig/20220812-010312-ladsgroup.json
* 00:27 logmsgbot: ebernhardson@mira Synchronized wmf-config/CirrusSearch-common.php: SWAT ebernhardson (duration: 01m 26s)
* 01:03 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
* 00:25 logmsgbot: ebernhardson@mira Synchronized wmf-config/CommonSettings.php: SWAT ebernhardson (duration: 01m 27s)
* 01:02 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
* 00:15 logmsgbot: ebernhardson@mira Synchronized wmf-config/CommonSettings.php: SWAT AaronSchulz (duration: 01m 26s)
* 01:02 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance
* 00:13 logmsgbot: ebernhardson@mira Synchronized wmf-config/filebackend-production.php: SWAT AaronSchulz (duration: 01m 26s)
* 01:02 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance
* 00:10 logmsgbot: ebernhardson@mira Synchronized wmf-config/CommonSettings.php: SWAT James_F (duration: 01m 26s)
* 01:02 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1147 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32368 and previous config saved to /var/cache/conftool/dbconfig/20220812-010233-ladsgroup.json
* 00:08 logmsgbot: ebernhardson@mira Synchronized wmf-config/InitialiseSettings.php: SWAT James_F (duration: 01m 35s)
* 00:47 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P32367 and previous config saved to /var/cache/conftool/dbconfig/20220812-004727-ladsgroup.json
* 00:32 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P32366 and previous config saved to /var/cache/conftool/dbconfig/20220812-003221-ladsgroup.json
* 00:17 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1147 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32365 and previous config saved to /var/cache/conftool/dbconfig/20220812-001715-ladsgroup.json
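
The entries above share one shape: a bullet, an HH:MM timestamp, an author (sometimes in user@host form), a colon, and a free-text message. A minimal sketch of splitting a line into those parts might look like the following. The names `LogEntry`, `ENTRY_RE`, and `parse_entry` are hypothetical and exist only for illustration; they are not part of any Wikimedia tooling, and the pattern is inferred from the entries on this page rather than from a documented format.

```python
import re
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    """One parsed log line (hypothetical type, for illustration only)."""
    time: str
    author: str
    message: str


# Assumed shape: "* HH:MM author: message". The author field stops at the
# first colon, so "logmsgbot" and "ladsgroup@cumin1001" both match.
ENTRY_RE = re.compile(
    r"^\*\s+(?P<time>\d{2}:\d{2})\s+(?P<author>[^\s:]+):\s+(?P<message>.*)$"
)


def parse_entry(line: str) -> Optional[LogEntry]:
    """Return a LogEntry for a bullet line, or None for headings and blanks."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    return LogEntry(match["time"], match["author"], match["message"])
```

Day headings such as `== 2022-08-12 ==` do not match the pattern and yield `None`, so a caller can track the current date separately while iterating over lines.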


== 2016-01-25 ==
== 2022-08-11 ==
* 23:14 logmsgbot: legoktm@mira Synchronized php-1.27.0-wmf.10/includes/parser/: live hacks, now committed (duration: 01m 27s)
* 21:30 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 23:07 logmsgbot: legoktm@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/266410/ (duration: 01m 35s)
* 21:29 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 22:52 logmsgbot: ori@mira Synchronized php-1.27.0-wmf.10/includes/parser/ParserOutput.php: Fix-up for ParserOutput.php@263 debug logging (duration: 01m 27s)
* 21:29 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 22:30 logmsgbot: legoktm@mira Synchronized php-1.27.0-wmf.10/includes/parser/: https://gerrit.wikimedia.org/r/#/c/266401/ + https://gerrit.wikimedia.org/r/#/c/266406/ + live hacks (duration: 01m 28s)
* 21:28 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 22:28 logmsgbot: legoktm@mira Synchronized php-1.27.0-wmf.10/includes/content/WikitextContent.php: https://gerrit.wikimedia.org/r/#/c/266401/ (duration: 01m 29s)
* 21:23 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 21:53 logmsgbot: hoo@mira Synchronized wmf-config/Wikibase-production.php: Disable (not yet deployed) commons category sidebar link overwrite in production (duration: 01m 28s)
* 21:22 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 21:47 mutante: nitrogen - shutdown -h now ....
* 21:22 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 21:45 mutante: alsafi - was reported down in icinga , is ganeti VM - fixed by just logging in as if it went to hibernate
* 21:21 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 21:37 mdholloway: mobileapps deployed 9252a22
* 21:16 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 21:30 mutante: nitrogen - stop puppet, stop salt, remove from stored configs / icinga
* 21:15 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:19 logmsgbot: hoo@mira Synchronized wmf-config/Wikibase-labs.php: (no message) (duration: 01m 28s)
* 21:15 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:14 chasemp: bump labstore nfs threads to 288 from 244
* 21:14 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 19:32 paravoid: eqiad: removing static routes for 6to4/Teredo to nitrogen (decommissioning our own relays)
* 21:04 thcipriani@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: revert [[gerrit:806944{{!}}Define default value for "wmgSiteLogoVariants" (T305692 T308620)]] (duration: 03m 15s)
* 19:10 bd808: Live hacking on mw1017 to debug 1.27.0-wmf.11 issues. All wikis there currently set to use 1.27.0-wmf.11.
* 20:59 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 19:05 chasemp: labstore1001 temp change to CFQ scheduler on 01/22/2015
* 20:58 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 19:04 chasemp: the nfsd thread change is on labstore1001
* 20:58 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 19:04 chasemp: nfsd has 224 threads atm and was bumped up over the weekend
* 20:57 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 18:58 ori: removed unused wikiversions.cdb on mira and tin
* 20:52 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 18:28 jynus: retroactively logging the depool of mw1217, mw1178 and mw1257 3 hours ago (Jan 25 15:45:26)
* 20:51 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 16:49 ema: Finished migration of mobile traffic to text cluster in ulsfo https://phabricator.wikimedia.org/T109286
* 20:50 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 16:38 logmsgbot: jynus@mira Synchronized wmf-config/db-eqiad.php: Preparing ips for new parsercache deployments (third try) (duration: 01m 35s)
* 20:49 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 16:26 logmsgbot: jynus@mira Synchronized wmf-config/db-eqiad.php: Preparing ips for new parsercache deployments (second try after running puppet) (duration: 03m 23s)
* 20:47 thcipriani@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:806944{{!}}Define default value for "wmgSiteLogoVariants" (T305692 T308620)]] (duration: 03m 07s)
* 16:25 _joe_: restarting salt-minion on all deployment targets
* 20:44 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 16:24 _joe_: running salt deploy.fixurl on all deployment targets
* 20:43 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 16:09 logmsgbot: jynus@mira Synchronized wmf-config/db-eqiad.php: Preparing ips for new parsercache deployments (duration: 03m 32s)
* 20:43 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 15:51 ejegg: updated DjangoBannerStats from a64fe0e373a978d3df0b7f1dd74ac4cc5c78d34e to 71df14d4d8b11f3ca0ef1eeb6c6e2db9be79103a
* 20:42 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 15:35 ema: Starting migration of mobile traffic to text cluster in ulsfo https://phabricator.wikimedia.org/T109286
* 20:29 thcipriani@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/VisualEditor/modules/ve-mw/preinit/ve.init.mw.DesktopArticleTarget.init.js: Backport: [[gerrit:822396{{!}}Do not show incompatible skin warning when page is not editable (T314952)]] (duration: 03m 16s)
* 15:14 chasemp: restart of pdns and pdns-recursor on labservices1001
* 20:27 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 14:56 logmsgbot: jynus@mira Synchronized wmf-config/db-eqiad.php: deploy new parsercache hardware (pc1004) substituting pc1001 (duration: 03m 25s)
* 20:26 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:16 elukey: ran kafka preferred-replica-election on kafka1022 to balance the leaders
* 20:26 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:07 elukey: restarting kafka on kafka1022
* 20:25 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 12:57 elukey: restarting kafka on kafka1013
* 20:23 mutante: merging change on prod phabricator host to allow scap deployment, part 1
* 12:38 elukey: restarting kafka on kafka1014
* 19:42 damilare: payments-wiki upgraded from {{Gerrit|cf5e1848}} to {{Gerrit|0894d75a}}
* 12:20 jynus: compressed and truncated iridium's phab daemons.log - it was taking 20% of disk space
* 19:41 mutante: disabling puppet on C:profile::phabricator::main
* 12:04 ema: restarting kafka on kafka1018
* 19:20 mvernon@cumin2002: END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-eqiad: upgrade to 3.11.13 [[phab:T309896|T309896]] - mvernon@cumin2002
* 11:26 jynus: stopping mysql at pc1001 and cloning to pc1004
* 17:58 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 10:55 logmsgbot: jynus@mira Synchronized wmf-config/db-eqiad.php: Depool pc1001 for maintenance (clone to pc1004) (duration: 01m 41s)
* 17:58 taavi@deploy1002: Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:822428{{!}}Fix labtestwiki database name servers (T310795)]] (duration: 03m 39s)
* 10:11 _joe_: switching the active deployment host to mira
* 17:57 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 09:56 ema: limiting GCLogFileSize and restarting kafka on kafka1012
* 17:57 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 09:31 _joe_: rolling reboot of the eqiad appserver cluster
* 17:56 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 09:27 moritzm: installed fuse security update on labnodepool1001 (the other fuse installations are on Ubuntu, which doesn't ship the udev rule, but uses mountall instead)
* 17:52 sukhe: testing ATS 9.1.3-1wm1 on cp3064: [[phab:T309651|T309651]]
* 07:47 paravoid: stat1002: umount -f /mnt/hdfs
* 17:49 pt1979@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host netmon2002.mgmt.codfw.wmnet with reboot policy FORCED
* 07:34 _joe_: rebooting alsafi, unresponsive to ssh
* 17:46 sukhe: testing ATS 9.1.3-1wm1 on cp3064: [[phab:T309651|T309651]]
* 07:24 _joe_: restarting hhvm on mw1148, stuck in HPHP::Treadmill::startRequest (__lll_lock_wait)
* 17:41 pt1979@cumin2002: START - Cookbook sre.hosts.provision for host netmon2002.mgmt.codfw.wmnet with reboot policy FORCED
* 07:23 _joe_: restarting hhvm on mw1143, stuck in HPHP::SynchronizableMulti::waitImpl (__pthread_cond_wait)
* 17:40 pt1979@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 03:10 logmsgbot: tstarling@tin Synchronized php-1.27.0-wmf.10/includes/parser/ParserCache.php: (no message) (duration: 00m 25s)
* 17:38 sukhe: testing ATS 9.1.3-1wm1 on cp1090: [[phab:T309651|T309651]]
* 03:03 logmsgbot: tstarling@tin Synchronized php-1.27.0-wmf.10/includes/parser/ParserCache.php: (no message) (duration: 00m 25s)
* 17:36 pt1979@cumin2002: START - Cookbook sre.dns.netbox
* 03:02 logmsgbot: tstarling@tin Synchronized php-1.27.0-wmf.10/includes/parser/ParserOutput.php: (no message) (duration: 00m 27s)
* 17:35 pt1979@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host netmon2002
* 02:30 logmsgbot: l10nupdate@tin ResourceLoader cache refresh completed at Mon Jan 25 02:30:13 UTC 2016 (duration 6m 52s)
* 17:34 pt1979@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host netmon2002
* 02:23 logmsgbot: mwdeploy@tin sync-l10n completed (1.27.0-wmf.10) (duration: 09m 09s)
* 17:33 sukhe: testing ATS 9.1.3-1wm1 on cp3065: [[phab:T309651|T309651]]
* 17:28 sukhe: testing ATS 9.1.3-1wm1 on cp1089: [[phab:T309651|T309651]]
* 17:19 bking@cumin1001: conftool action : set/weight=10:pooled=no; selector: service=elasticsearch-omega-ssl,name=elastic1100.eqiad.wmnet
* 17:18 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: service=elasticsearch-omega-ssl,name=elastic1100.eqiad.wmnet
* 17:15 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: service=search-omega-https,name=elastic1100.eqiad.wmnet
* 16:35 mvernon@cumin2002: START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-eqiad: upgrade to 3.11.13 [[phab:T309896|T309896]] - mvernon@cumin2002
* 16:30 mvernon@cumin2002: END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-codfw: upgrade to 3.11.13 [[phab:T309896|T309896]] - mvernon@cumin2002
* 16:29 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic[1100-1102].eqiad.wmnet with reason: [[phab:T309810|T309810]]
* 16:29 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic[1100-1102].eqiad.wmnet with reason: [[phab:T309810|T309810]]
* 16:26 inflatador: bking@elastic1054 attempting to ban elastic1100-1102 from cluster due to firewall issues
* 16:13 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: service=search-omega-https,name=elastic1100.eqiad.wmnet
* 16:12 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=elastic1100
* 15:15 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 15:09 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
* 14:58 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P32364 and previous config saved to /var/cache/conftool/dbconfig/20220811-145823-ladsgroup.json
* 14:55 inflatador: bking@cumin1001 running puppet agent across eqiad elastic hosts
* 14:48 ryankemper@cumin1001: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 14:43 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P32362 and previous config saved to /var/cache/conftool/dbconfig/20220811-144318-ladsgroup.json
* 14:28 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P32361 and previous config saved to /var/cache/conftool/dbconfig/20220811-142813-ladsgroup.json
* 14:28 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol1003.wikimedia.org
* 14:28 andrew@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 andrew@cumin1001: START - Cookbook sre.dns.netbox
* 14:19 andrew@cumin1001: START - Cookbook sre.hosts.decommission for hosts cloudcontrol1003.wikimedia.org
* 14:19 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 14:18 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol1004.wikimedia.org
* 14:18 andrew@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:18 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 14:18 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 14:17 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 14:17 ladsgroup@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:822375{{!}}Stop writing to the old templatelinks fields in s2 (T312865)]] (duration: 03m 25s)
* 14:16 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
* 14:16 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
* 14:16 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
* 14:15 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
* 14:13 andrew@cumin1001: START - Cookbook sre.dns.netbox
* 14:13 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P32360 and previous config saved to /var/cache/conftool/dbconfig/20220811-141309-ladsgroup.json
* 14:12 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 14:11 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 14:11 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 14:11 awight: EU backport window complete
* 14:10 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 14:10 awight@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/DiscussionTools/includes/CommentFormatter.php: Backport: [[gerrit:822149{{!}}CommentFormatter: Set 'data-mw-comment' even when reply tool disabled (T314707)]] (duration: 03m 31s)
* 14:09 andrew@cumin1001: START - Cookbook sre.hosts.decommission for hosts cloudcontrol1004.wikimedia.org
* 14:05 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 14:04 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 14:04 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 14:03 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:52 mvernon@cumin2002: START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: upgrade to 3.11.13 [[phab:T309896|T309896]] - mvernon@cumin2002
* 13:50 awight@deploy1002: Synchronized wmf-config: Config: [[gerrit:820666{{!}}Revert "Revert "testwiki: Add mediawiki.web_ui.interactions stream""]] (duration: 03m 10s)
* 13:48 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:47 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:47 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:46 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:36 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1060.eqiad.wmnet with OS bullseye
* 13:36 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:36 awight@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:822130{{!}}trwikiquote: Install WikiLove extension (T314895)]] (duration: 03m 30s)
* 13:35 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:35 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:34 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:33 filippo@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host logstash2003.codfw.wmnet
* 13:25 awight@deploy1002: Synchronized static/images: Config: [[gerrit:821330{{!}}Revert "trwiki: Change old and new vector logos for 500k articles"]] (part 3) (duration: 03m 09s)
* 13:21 awight@deploy1002: Synchronized logos/: Config: [[gerrit:821330{{!}}Revert "trwiki: Change old and new vector logos for 500k articles"]] (part 2) (duration: 03m 09s)
* 13:19 topranks: merging CR821781 to expose additional network info in puppet facts
* 13:18 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:18 awight@deploy1002: Synchronized wmf-config/: Config: [[gerrit:821330{{!}}Revert "trwiki: Change old and new vector logos for 500k articles"]] (part 1) (duration: 03m 13s)
* 13:17 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:17 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:16 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:14 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1060.eqiad.wmnet with reason: host reimage
* 13:11 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:11 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1060.eqiad.wmnet with reason: host reimage
* 13:10 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:10 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:09 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:08 awight@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:822073{{!}}Enable editor line numbering on all namespaces, for twwiki (T302852)]] (duration: 03m 42s)
* 12:56 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1060.eqiad.wmnet with OS bullseye
* 12:55 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 12:49 aikochou@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
* 12:46 aikochou@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
* 12:26 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=restbase2018.codfw.wmnet
* 12:26 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=restbase202[367].codfw.wmnet
* 12:17 elukey@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
* 12:17 elukey@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
* 12:17 elukey@deploy1002: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
* 12:16 elukey@deploy1002: helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
* 12:13 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host logstash2003.codfw.wmnet
* 12:11 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
* 12:10 elukey@deploy1002: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
* 12:09 elukey@deploy1002: helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
* 11:20 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
* 11:20 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
* 09:58 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
* 09:57 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
* 09:56 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
* 09:56 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
* 09:49 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance
* 09:49 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance
* 09:32 godog: arm keyholder on netmon2001
* 09:09 jbond: update gnutls28 on bullseye systems
* 09:00 jbond: update unzip
* 08:21 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
* 08:13 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
* 08:12 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
* 08:06 ayounsi@cumin1001: END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox interface ID cr3-ulsfo:xe-0/1/1
* 08:06 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr3-ulsfo:xe-0/1/1
* 07:58 ayounsi@cumin1001: END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox interface ID cr3-ulsfo:xe-0/1/1
* 07:57 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr3-ulsfo:xe-0/1/1
* 07:55 oblivian@cumin1001: conftool action : set/pooled=false; selector: dnsdisc=k8s-ingress-wikikube-rw,name=codfw
* 07:51 vgutierrez: rolling restart of pybal in eqsin and ulsfo
* 07:24 oblivian@cumin1001: conftool action : set/pooled=false; selector: dnsdisc=restbase-async,name=eqiad
* 07:24 oblivian@cumin1001: conftool action : set/pooled=true; selector: dnsdisc=shellbox-timeline
* 07:23 oblivian@cumin1001: conftool action : set/pooled=true; selector: dnsdisc=inference
* 07:19 _joe_: pooling all services in codfw
* 07:03 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1147 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32357 and previous config saved to /var/cache/conftool/dbconfig/20220811-070312-ladsgroup.json
* 07:03 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance
* 07:02 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance
* 07:02 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32356 and previous config saved to /var/cache/conftool/dbconfig/20220811-070252-ladsgroup.json
* 06:47 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P32355 and previous config saved to /var/cache/conftool/dbconfig/20220811-064746-ladsgroup.json
* 06:32 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P32354 and previous config saved to /var/cache/conftool/dbconfig/20220811-063240-ladsgroup.json
* 06:28 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
* 06:28 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
* 06:17 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32353 and previous config saved to /var/cache/conftool/dbconfig/20220811-061734-ladsgroup.json
* 06:17 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maint
* 06:17 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maint
* 06:06 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool db1162 ([[phab:T314368|T314368]] [[phab:T298555|T298555]] [[phab:T312863|T312863]] [[phab:T310011|T310011]] [[phab:T309311|T309311]] [[phab:T60674|T60674]] [[phab:T298560|T298560]] [[phab:T303603|T303603]] [[phab:T310485|T310485]])', diff saved to https://phabricator.wikimedia.org/P32352 and previous config saved to /var/cache/conftool/dbconfig/20220811-060625-ladsgroup.json
* 06:01 ladsgroup@cumin1001: dbctl commit (dc=all): 'Promote db1122 to s2 primary and set section read-write [[phab:T314368|T314368]]', diff saved to https://phabricator.wikimedia.org/P32351 and previous config saved to /var/cache/conftool/dbconfig/20220811-060113-ladsgroup.json
* 06:00 ladsgroup@cumin1001: dbctl commit (dc=all): 'Set s2 eqiad as read-only for maintenance - [[phab:T314368|T314368]]', diff saved to https://phabricator.wikimedia.org/P32350 and previous config saved to /var/cache/conftool/dbconfig/20220811-060042-ladsgroup.json
* 06:00 Amir1: Starting s2 eqiad failover from db1162 to db1122 - [[phab:T314368|T314368]]
* 05:19 ladsgroup@cumin1001: dbctl commit (dc=all): 'Set db1122 with weight 0 [[phab:T314368|T314368]]', diff saved to https://phabricator.wikimedia.org/P32349 and previous config saved to /var/cache/conftool/dbconfig/20220811-051913-ladsgroup.json
* 05:19 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s2 [[phab:T314368|T314368]]
* 05:18 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s2 [[phab:T314368|T314368]]
* m: chown -R librenms /srv/librenms/rrd/ on netmon1003 [[phab:T314972|T314972]]
* 03:51 cwhite: chown librenms /srv/librenms/rrd/* on netmon1003 [[phab:T314972|T314972]]
* 02:55 ejegg: civicrm upgraded from {{Gerrit|1f91ac2d}} to {{Gerrit|92467234}}
* 02:46 ejegg: updated process-control yaml files with @wmff alias
* 02:08 ejegg: civicrm rolled back from {{Gerrit|92467234}} to {{Gerrit|1f91ac2d}}
* 02:05 ejegg: civicrm upgraded from {{Gerrit|1f91ac2d}} to {{Gerrit|92467234}}
* 01:40 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 01:39 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 01:39 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 01:38 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 01:38 tstarling@deploy1002: Synchronized wmf-config/logging.php: (no justification provided) (duration: 03m 25s)
* 01:19 tstarling@puppetmaster1001: conftool action : set/pooled=true; selector: dnsdisc=(appservers{{!}}api)-ro,name=codfw
* 01:19 tstarling@puppetmaster1001: conftool action : set/pooled=true; selector: dnsdisc=sessionstore,name=codfw
* 00:58 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=varnish-fe
* 00:58 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=ats-be
* 00:58 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=ats-tls
* 00:57 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on cp2042.codfw.wmnet with reason: host down; depooled and will debug tomorrow
* 00:57 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on cp2042.codfw.wmnet with reason: host down; depooled and will debug tomorrow


== 2016-01-24 ==
== 2022-08-10 ==
* 02:31 logmsgbot: l10nupdate@tin ResourceLoader cache refresh completed at Sun Jan 24 02:31:21 UTC 2016 (duration 6m 58s)
* 21:25 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1016.eqiad.wmnet
* 02:24 logmsgbot: mwdeploy@tin sync-l10n completed (1.27.0-wmf.10) (duration: 09m 11s)
* 21:23 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1014.eqiad.wmnet
* 21:10 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 16 hosts with reason: [[phab:T309810|T309810]]
* 21:10 bking@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on 16 hosts with reason: [[phab:T309810|T309810]]
* 21:09 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on elastic[1101-1102].eqiad.wmnet with reason: [[phab:T309810|T309810]]
* 21:09 bking@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on elastic[1101-1102].eqiad.wmnet with reason: [[phab:T309810|T309810]]
* 21:00 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 21:00 cjming: end of UTC late backport window
* 20:59 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:59 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:59 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820533{{!}}Remove unused $wgEnableMWSuggest]] (duration: 03m 04s)
* 20:58 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:56 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820568{{!}}Enable new topic tool on dewiki (T313699)]] (duration: 03m 01s)
* 20:34 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:822093{{!}}testwiki: set $wgCdnMatchParameterOrder to false (T314868)]] (duration: 03m 20s)
* 20:31 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:30 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:30 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:30 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:19 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:18 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:18 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:17 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:12 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:09 bking@cumin1001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
* 20:08 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:08 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:08 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820646{{!}}Start writing to cuc_actor everywhere except s4 and s8 (T233004)]] (duration: 03m 15s)
* 20:07 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 19:51 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2053-2054].codfw.wmnet
* 19:51 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc[2053-2054].codfw.wmnet
* 19:35 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2019-2020].codfw.wmnet
* 19:35 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for parse[2019-2020].codfw.wmnet
* 19:35 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2016-2018].codfw.wmnet
* 19:35 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for parse[2016-2018].codfw.wmnet
* 19:34 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2036.codfw.wmnet
* 19:34 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc2036.codfw.wmnet
* 19:28 sukhe: testing ATS 9.1.3-1wm1 on cp4026: [[phab:T309651|T309651]]
* 19:09 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1087.eqiad.wmnet with OS bullseye
* 19:06 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1086.eqiad.wmnet with OS bullseye
* 18:55 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1087.eqiad.wmnet with reason: host reimage
* 18:51 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1086.eqiad.wmnet with reason: host reimage
* 18:50 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1087.eqiad.wmnet with reason: host reimage
* 18:49 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1086.eqiad.wmnet with reason: host reimage
* 18:47 bking@cumin1001: START - Cookbook sre.wdqs.data-transfer
* 18:38 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1087.eqiad.wmnet with OS bullseye
* 18:36 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1086.eqiad.wmnet with OS bullseye
* 18:22 urandom: truncating Cassandra hints (eqiad datacenter)  -- [[phab:T314941|T314941]]
* 18:13 urandom: truncating codfw Cassandra hints (eqiad datacenter)  -- [[phab:T314941|T314941]]
* 18:07 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main2005.codfw.wmnet
* 18:07 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for kafka-main2005.codfw.wmnet
* 18:05 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repool D8 DBs after PDU maint ([[phab:T310146|T310146]])', diff saved to https://phabricator.wikimedia.org/P32346 and previous config saved to /var/cache/conftool/dbconfig/20220810-180529-ladsgroup.json
* 17:42 otto@deploy1002: Finished deploy [analytics/refinery@6e47e0e]: Add missing changes to the deletion script - [[phab:T270433|T270433]] -  [analytics/refinery@6e47e0e] (duration: 05m 28s)
* 17:39 fnegri@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts labweb1002.wikimedia.org
* 17:39 fnegri@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 17:36 otto@deploy1002: Started deploy [analytics/refinery@6e47e0e]: Add missing changes to the deletion script - [[phab:T270433|T270433]] -  [analytics/refinery@6e47e0e]
* 17:35 fnegri@cumin1001: START - Cookbook sre.dns.netbox
* 17:34 otto@deploy1002: Finished deploy [analytics/refinery@6e47e0e] (hadoop-test): Add missing changes to the deletion script - [[phab:T270433|T270433]] - TEST [analytics/refinery@6e47e0e] (duration: 04m 19s)
* 17:30 fnegri@cumin1001: START - Cookbook sre.hosts.decommission for hosts labweb1002.wikimedia.org
* 17:30 otto@deploy1002: Started deploy [analytics/refinery@6e47e0e] (hadoop-test): Add missing changes to the deletion script - [[phab:T270433|T270433]] - TEST [analytics/refinery@6e47e0e]
* 17:09 dzahn@cumin2002: START - Cookbook sre.dns.netbox
* 17:08 otto@deploy1002: Started deploy [analytics/refinery@d4dd7e4] (hadoop-test): Add safety limits to refinery-drop-older-than - [[phab:T270433|T270433]] - TEST [analytics/refinery@d4dd7e4]
* 17:06 sukhe: testing ATS 9.1.3-1wm1 on cp4032: [[phab:T309651|T309651]]
* 17:06 urandom: flushing RESTBase Cassandra tables -row B- to (temporarily) free instance-data space -- [[phab:T314941|T314941]]
* 17:05 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on krb2001.codfw.wmnet with reason: btullis codfw maintenance
* 17:05 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on krb2001.codfw.wmnet with reason: btullis codfw maintenance
* 17:04 dzahn@cumin2002: START - Cookbook sre.hosts.decommission for hosts gerrit2001.wikimedia.org
* 17:02 sukhe: testing ATS 9.1.3-1wm1 on cp6008: [[phab:T309651|T309651]]
* 16:56 sukhe: testing ATS 9.1.3-1wm1 on cp6016: [[phab:T309651|T309651]]
* 16:55 fnegri@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts labweb1001.wikimedia.org
* 16:55 fnegri@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:32 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts gerrit2001.wikimedia.org
* 16:32 dzahn@cumin2002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
* 16:32 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes[2013-2014].codfw.wmnet
* 16:31 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for kubernetes[2013-2014].codfw.wmnet
* 16:31 jelto: kubectl uncordon kubernetes2014.codfw.wmnet
* 16:31 fnegri@cumin1001: START - Cookbook sre.dns.netbox
* 16:30 jelto: kubectl uncordon kubernetes2013.codfw.wmnet
* 16:29 urandom: restarting Cassandra (RESTBase) -row A- to apply r822110 -- [[phab:T314941|T314941]]
* 16:27 dzahn@cumin2002: START - Cookbook sre.dns.netbox
* 16:25 fnegri@cumin1001: START - Cookbook sre.hosts.decommission for hosts labweb1001.wikimedia.org
* 16:23 mutante: shutting down gerrit2001
* 16:23 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2034-2035].codfw.wmnet
* 16:23 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc[2034-2035].codfw.wmnet
* 16:22 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2016-2018].codfw.wmnet
* 16:22 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for parse[2016-2018].codfw.wmnet
* 16:16 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=maps2010.codfw.wmnet
* 16:16 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=sessionstore2003.codfw.wmnet
* 16:13 sukhe: reprepro -C component/trafficserver9 include buster-wikimedia trafficserver_9.1.3-1wm1_amd64.changes: [[phab:T309651|T309651]]
* 16:13 dzahn@cumin2002: START - Cookbook sre.hosts.decommission for hosts gerrit2001.wikimedia.org
* 16:11 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be[2039,2050,2056,2059].codfw.wmnet,thanos-be2004.codfw.wmnet with reason: PDU work
* 16:10 mvernon@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be[2039,2050,2056,2059].codfw.wmnet,thanos-be2004.codfw.wmnet with reason: PDU work
* 16:09 urandom: flushing tables in row D (RESTBase Cassandra cluster)  -- [[phab:T314941|T314941]]
* 15:54 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for gitlab-runner2004.codfw.wmnet
* 15:54 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for gitlab-runner2004.codfw.wmnet
* 15:53 sukhe: poweroff cp2041, 42 for PDU upgrade: rack D7
* 15:51 urandom: flushing tables in row B (RESTBase Cassandra cluster)  -- [[phab:T314941|T314941]]
* 15:49 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2004.codfw.wmnet with reason: PDU maintenance
* 15:49 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2004.codfw.wmnet with reason: PDU maintenance
* 15:46 urandom: flushing tables in row A (RESTBase Cassandra cluster)  -- [[phab:T314941|T314941]]
* 15:46 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2012.codfw.wmnet with reason: btullis codfw maintenance
* 15:46 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp[2041-2042].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
* 15:46 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2012.codfw.wmnet with reason: btullis codfw maintenance
* 15:46 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 3:00:00 on cp[2041-2042].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
* 15:46 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2011.codfw.wmnet with reason: btullis codfw maintenance
* 15:46 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2011.codfw.wmnet with reason: btullis codfw maintenance
* 15:45 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2010.codfw.wmnet with reason: btullis codfw maintenance
* 15:45 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2010.codfw.wmnet with reason: btullis codfw maintenance
* 15:45 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2009.codfw.wmnet with reason: btullis codfw maintenance
* 15:45 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2009.codfw.wmnet with reason: btullis codfw maintenance
* 15:37 urandom: (ephemerally) increasing hinted hand-off delivery rate limit to 16KB, RESTBase eqiad nodes  -- [[phab:T314941|T314941]]
* 15:34 jbond: remove puppetmaster[12]002 from production
* 15:30 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main2004.codfw.wmnet
* 15:30 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for kafka-main2004.codfw.wmnet
* 15:20 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2051-2052].codfw.wmnet
* 15:20 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc[2051-2052].codfw.wmnet
* 15:17 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc-gp2003.codfw.wmnet
* 15:17 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc-gp2003.codfw.wmnet
* 15:16 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2033.codfw.wmnet
* 15:16 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc2033.codfw.wmnet
* 15:14 _joe_: power off krb2002
* 15:14 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on krb2002.codfw.wmnet with reason: PDU maintenance
* 15:13 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on krb2002.codfw.wmnet with reason: PDU maintenance
* 15:13 _joe_: shutting down rdb2010,puppetmaster2002 for d5 maintenance
* 15:02 jelto: power off mc2035
* 15:01 jelto: power off mc2034
* 15:01 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mc2035.codfw.wmnet with reason: PDU swap
* 15:01 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on mc2035.codfw.wmnet with reason: PDU swap
* 15:01 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mc2034.codfw.wmnet with reason: PDU swap
* 15:01 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on mc2034.codfw.wmnet with reason: PDU swap
* 14:43 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: PDU Maint ([[phab:T310146|T310146]])
* 14:43 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: PDU Maint ([[phab:T310146|T310146]])
* 14:38 urandom: disabling reserved space on eqiad nodes (RESTBase), /dev/md2 (aka /srv/cassandra/instance-data) -- [[phab:T314941|T314941]]
* 14:28 jelto: power off kafka-main2004 gracefully
* 14:28 hnowlan: shutting down sessionstore2003
* 14:27 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=sessionstore2003.codfw.wmnet
* 14:27 sukhe: power off cp2039, cp2040 for PDU upgrade: rack D
* 14:27 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2003.codfw.wmnet with reason: PDU maintenance
* 14:27 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2003.codfw.wmnet with reason: PDU maintenance
* 14:25 jelto: power off mc-gp2003
* 14:25 jelto: power off mc2033
* 14:24 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:30:00 on kafka-main2004.codfw.wmnet with reason: PDU swap
* 14:23 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:30:00 on kafka-main2004.codfw.wmnet with reason: PDU swap
* 14:23 sukhe: depool codfw for PDU upgrade: rack D
* 14:23 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:30:00 on mc-gp2003.codfw.wmnet with reason: PDU swap
* 14:23 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:30:00 on mc-gp2003.codfw.wmnet with reason: PDU swap
* 14:23 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:30:00 on mc2033.codfw.wmnet with reason: PDU swap
* 14:23 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:30:00 on mc2033.codfw.wmnet with reason: PDU swap
* 14:15 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp20[39{{!}}40]\.codfw\.wmnet,service=ats-tls
* 14:13 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp[2039-2040].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
* 14:13 urandom: flushing Cassandra tables, restbase1030
* 14:13 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 3:00:00 on cp[2039-2040].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
* 14:13 urandom: flushing Cassandra tables, restbase1019
* 14:12 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on dns2002.wikimedia.org with reason: shutdown for PDU upgrade: rack D4
* 14:12 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 3:00:00 on dns2002.wikimedia.org with reason: shutdown for PDU upgrade: rack D4
* 14:11 urandom: flushing Cassandra tables, restbase1017 1018 1021 1024 1025 1026 1028 1029
* 14:05 urandom: flushing tables, restbase1016
* 13:52 hnowlan: powered up restbase2018
* 13:32 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on netmon1003.wikimedia.org with reason: pdu
* 13:32 filippo@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on netmon1003.wikimedia.org with reason: pdu
* 13:32 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on logstash2003.codfw.wmnet with reason: pdu
* 13:31 filippo@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on logstash2003.codfw.wmnet with reason: pdu
* 13:31 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on logstash2029.codfw.wmnet with reason: pdu
* 13:31 root@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on logstash2029.codfw.wmnet with reason: pdu
* 13:30 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: [[phab:T310146|T310146]]
* 13:30 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: [[phab:T310146|T310146]]
* 13:17 elukey: powering on restbase2027
* 13:12 elukey: powering on restbase2026
* 13:12 _joe_: powering on restbase2023
* 13:01 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32343 and previous config saved to /var/cache/conftool/dbconfig/20220810-130108-ladsgroup.json
* 13:01 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
* 13:00 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
* 12:37 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic[2072,2084-2085].codfw.wmnet with reason: [[phab:T310146|T310146]]
* 12:37 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic[2072,2084-2085].codfw.wmnet with reason: [[phab:T310146|T310146]]
* 12:27 jbond: remove confd from servers that shouldn't have it
* 12:05 ladsgroup@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/Echo/maintenance/removeOrphanedEvents.php: Backport: [[gerrit:821735{{!}}Run clean ups with removeOrphanedEvents in major batches (T310428)]] (duration: 03m 32s)
* 11:47 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 11:46 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 11:46 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 11:43 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 11:15 jbond@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS buster
* 10:54 jbond@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
* 10:51 jbond@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
* 10:37 jbond@cumin1001: START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster
* 10:31 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2181-2182].codfw.wmnet with reason: D6 PDU maint ([[phab:T310146|T310146]])
* 10:31 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2181-2182].codfw.wmnet with reason: D6 PDU maint ([[phab:T310146|T310146]])
* 10:26 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on maps2010.codfw.wmnet with reason: PDU maintenance
* 10:26 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on maps2010.codfw.wmnet with reason: PDU maintenance
* 10:26 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=maps2010.codfw.wmnet
* 10:25 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=maps2010.codfw.wmnet
* 10:25 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=maps2008.codfw.wmnet
* 10:24 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on restbase2018.codfw.wmnet with reason: PDU maintenance
* 10:24 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase2018.codfw.wmnet
* 10:24 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on restbase2018.codfw.wmnet with reason: PDU maintenance
* 10:23 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ml-serve2008.codfw.wmnet with reason: PDU maintenance
* 10:23 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on ml-serve2008.codfw.wmnet with reason: PDU maintenance
* 10:20 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ores2009.codfw.wmnet with reason: PDU maintenance
* 10:20 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on ores2009.codfw.wmnet with reason: PDU maintenance
* 10:19 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ores2008.codfw.wmnet with reason: PDU maintenance
* 10:19 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on ores2008.codfw.wmnet with reason: PDU maintenance
* 10:03 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase202[367].codfw.wmnet
* 10:02 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on restbase[2023,2026-2027].codfw.wmnet with reason: PDU maintenance
* 10:02 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on restbase[2023,2026-2027].codfw.wmnet with reason: PDU maintenance
* 09:53 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: D8 PDU Maint ([[phab:T310146|T310146]])
* 09:52 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: D8 PDU Maint ([[phab:T310146|T310146]])
* 09:51 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool D8 DBs for PDU maint ([[phab:T310146|T310146]])', diff saved to https://phabricator.wikimedia.org/P32341 and previous config saved to /var/cache/conftool/dbconfig/20220810-095059-ladsgroup.json
* 09:36 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2101,2130,2140].codfw.wmnet,dbproxy2004.codfw.wmnet with reason: D6 PDU maint ([[phab:T310146|T310146]])
* 09:36 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2101,2130,2140].codfw.wmnet,dbproxy2004.codfw.wmnet with reason: D6 PDU maint ([[phab:T310146|T310146]])
* 09:34 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool D6 dbs ([[phab:T310146|T310146]])', diff saved to https://phabricator.wikimedia.org/P32340 and previous config saved to /var/cache/conftool/dbconfig/20220810-093433-ladsgroup.json
* 09:31 jelto: depool services in codfw for upcoming PDU replacement - [[phab:T309956|T309956]]
* 09:30 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 8 hosts with reason: Maintenance
* 09:29 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 8 hosts with reason: Maintenance
* 09:29 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
* 09:29 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
* 09:28 jynus: shutdown backup2007 before pdu upgrade [[phab:T310146|T310146]]
* 09:16 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 09:15 ladsgroup@deploy1002: Synchronized php-1.39.0-wmf.23/maintenance/namespaceDupes.php: Backport: [[gerrit:821734{{!}}maintenance: Add support for links migration to namespaceDupes.php (T314711)]] (duration: 03m 18s)
* 09:15 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2093,2120,2129,2172].codfw.wmnet with reason: D5 PDU maint ([[phab:T310146|T310146]])
* 09:15 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2093,2120,2129,2172].codfw.wmnet with reason: D5 PDU maint ([[phab:T310146|T310146]])
* 09:14 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 09:14 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 09:13 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 09:10 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool D5 dbs ([[phab:T310146|T310146]])', diff saved to https://phabricator.wikimedia.org/P32339 and previous config saved to /var/cache/conftool/dbconfig/20220810-091038-ladsgroup.json
* 08:49 jynus: shutdown dbprov2003 before pdu upgrade [[phab:T310146|T310146]]
* 08:49 ayounsi@cumin1001: END (FAIL) - Cookbook sre.network.debug (exit_code=99) for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:48 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:48 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be2028.codfw.wmnet
* 08:48 mvernon@cumin1001: START - Cookbook sre.hosts.remove-downtime for ms-be2028.codfw.wmnet
* 08:42 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P32337 and previous config saved to /var/cache/conftool/dbconfig/20220810-084222-ladsgroup.json
* 08:37 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 08:37 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
* 08:36 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
* 08:36 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 08:36 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 08:35 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 08:35 ladsgroup@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:822037{{!}}Stop writing to the old templatelinks fields in s5 (T312865)]] (duration: 03m 29s)
* 08:32 jelto: power off gitlab-runner2004
* 08:31 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:30:00 on gitlab-runner2004.codfw.wmnet with reason: PDU swap
* 08:31 root@cumin1001: START - Cookbook sre.hosts.downtime for 8:30:00 on gitlab-runner2004.codfw.wmnet with reason: PDU swap
* 08:29 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be2028.codfw.wmnet with reason: Trying to fix full /
* 08:28 mvernon@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be2028.codfw.wmnet with reason: Trying to fix full /
* 08:28 ayounsi@cumin1001: END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:27 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:27 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P32336 and previous config saved to /var/cache/conftool/dbconfig/20220810-082718-ladsgroup.json
* 08:25 ayounsi@cumin1001: END (FAIL) - Cookbook sre.network.debug (exit_code=99) for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:25 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:24 ayounsi@cumin1001: END (FAIL) - Cookbook sre.network.debug (exit_code=99) for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:24 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:23 kart_: Run: mwscript namespaceDupes.php arywiki --fix ([[phab:T291737|T291737]])
* 08:13 jynus: restart replication on db1117:m1 [[phab:T309074|T309074]]
* 08:12 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P32335 and previous config saved to /var/cache/conftool/dbconfig/20220810-081213-ladsgroup.json
* 08:09 kartik@deploy1002: Finished scap: Backport: [[gerrit:821732{{!}}arywiki: change namespace translations, add unchanged namespaces and add old translations as aliases (T291737)]] (duration: 10m 37s)
* 07:59 kartik@deploy1002: Started scap: Backport: [[gerrit:821732{{!}}arywiki: change namespace translations, add unchanged namespaces and add old translations as aliases (T291737)]]
* 07:57 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P32334 and previous config saved to /var/cache/conftool/dbconfig/20220810-075708-ladsgroup.json
* 07:56 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P32333 and previous config saved to /var/cache/conftool/dbconfig/20220810-075636-ladsgroup.json
* 07:55 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 07:52 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 07:52 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 07:52 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance
* 07:52 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance
* 07:51 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 07:51 dcaro@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 07:46 dcaro@cumin1001: START - Cookbook sre.dns.netbox
* 07:39 dcaro@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 07:34 dcaro@cumin1001: START - Cookbook sre.dns.netbox
* 07:33 godog: depool thanos-fe2001 for debugging
* 07:11 kartik@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:821170{{!}}Enable SectionTranslation on testwiki with new MT support from Google (T313296)]] (duration: 05m 44s)
* 07:11 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 07:08 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 07:08 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 07:07 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 05:24 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on kubernetes[2013-2014].codfw.wmnet with reason: PDU maintenance
* 05:24 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on kubernetes[2013-2014].codfw.wmnet with reason: PDU maintenance
* 05:19 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on parse[2016-2020].codfw.wmnet with reason: PDU maintenance
* 05:19 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on parse[2016-2020].codfw.wmnet with reason: PDU maintenance
* 05:12 _joe_: starting to shut down servers in codfw for the PDU maintenance
* 05:09 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on 10 hosts with reason: PDU maintenance
* 05:09 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on 10 hosts with reason: PDU maintenance
* 05:09 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on mc-gp2003.codfw.wmnet with reason: PDU maintenance
* 05:09 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on mc-gp2003.codfw.wmnet with reason: PDU maintenance
* 05:06 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on mc2033.codfw.wmnet with reason: PDU maintenance
* 05:06 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on mc2033.codfw.wmnet with reason: PDU maintenance
* 05:05 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on 7 hosts with reason: PDU maintenance
* 05:05 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on 7 hosts with reason: PDU maintenance
* 02:34 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 02:33 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 02:33 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 02:32 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 02:07 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 02:06 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 02:06 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 02:05 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply


== 2016-01-23 ==
== 2022-08-09 ==
* 19:03 logmsgbot: ebernhardson@tin Synchronized wmf-config/CirrusSearch-production.php: config change to repoint morelike search from eqiad to codfw (duration: 00m 26s)
* 23:17 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1011.eqiad.wmnet
* 19:02 logmsgbot: ebernhardson@tin Synchronized php-1.27.0-wmf.10/extensions/CirrusSearch/: Support code for repointing morelike queries from eqiad to codfw (duration: 00m 30s)
* 23:07 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 19:00 ebernhardson: repoint most expensive search queries (morelike) at codfw cluster to reduce load. 1/2 of eqiad cluster maxed on cpu
* 23:06 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 16:47 Krinkle: mwscript deleteEqualMessages.php --wiki wowiki
* 22:51 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 13:25 jynus: upgrading and restarting db1046
* 22:51 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 13:13 jynus: db1046 maintenance finished- restarting mysql to apply latest configuration
* 22:49 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 02:32 logmsgbot: l10nupdate@tin ResourceLoader cache refresh completed at Sat Jan 23 02:32:15 UTC 2016 (duration 7m 3s)
* 22:49 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 02:25 logmsgbot: mwdeploy@tin sync-l10n completed (1.27.0-wmf.10) (duration: 09m 09s)
* 22:46 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1015.eqiad.wmnet
* 01:33 logmsgbot: bd808@tin rebuilt wikiversions.php and synchronized wikiversions files: Back to 1.27.0-wmf10 again after fixing l10n cache problems
* 22:31 ryankemper@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 01:28 logmsgbot: bd808@tin rebuilt wikiversions.php and synchronized wikiversions files: Temporarily back to 1.27.0-wmf11; need to rebuild l10n cache
* 22:31 ryankemper@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 01:16 logmsgbot: bd808@tin rebuilt wikiversions.php and synchronized wikiversions files: Revert all wikis to 1.27.0-wmf.10
* 22:28 bking@cumin1001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
* 00:08 logmsgbot: bd808@tin Synchronized php-1.27.0-wmf.11/extensions/CentralAuth/includes/session/CentralAuthSessionProvider.php: https://gerrit.wikimedia.org/r/#/c/265872/ (duration: 00m 25s)
* 22:02 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 00:07 logmsgbot: bd808@tin Synchronized php-1.27.0-wmf.11/includes/session/CookieSessionProvider.php: https://gerrit.wikimedia.org/r/#/c/265871/ (duration: 00m 25s)
* 22:02 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 21:57 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs2006.codfw.wmnet with reason: [[phab:T310146|T310146]]
* 21:57 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs2006.codfw.wmnet with reason: [[phab:T310146|T310146]]
* 21:53 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 21:52 bking@deploy1002: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
* 21:50 bking@deploy1002: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
* 21:49 bking@deploy1002: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
* 21:43 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop: apply
* 21:43 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop: apply
* 21:43 bking@deploy1002: helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
* 21:43 bking@deploy1002: helmfile [eqiad] START helmfile.d/services/changeprop: apply
* 21:43 bking@deploy1002: helmfile [staging] DONE helmfile.d/services/changeprop: apply
* 21:43 bking@deploy1002: helmfile [staging] START helmfile.d/services/changeprop: apply
* 21:08 bking@cumin1001: START - Cookbook sre.wdqs.data-transfer
* 21:00 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1014.eqiad.wmnet
* 20:56 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
* 20:55 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
* 20:55 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32332 and previous config saved to /var/cache/conftool/dbconfig/20220809-205548-ladsgroup.json
* 20:51 bking@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wdqs1014.eqiad.wmnet
* 20:51 bking@cumin1001: START - Cookbook sre.hosts.remove-downtime for wdqs1014.eqiad.wmnet
* 20:46 bking@cumin1001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
* 20:40 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P32331 and previous config saved to /var/cache/conftool/dbconfig/20220809-204042-ladsgroup.json
* 20:25 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P32330 and previous config saved to /var/cache/conftool/dbconfig/20220809-202536-ladsgroup.json
* 20:10 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32329 and previous config saved to /var/cache/conftool/dbconfig/20220809-201030-ladsgroup.json
* 19:57 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on wdqs1014.eqiad.wmnet with reason: [[phab:T314890|T314890]]
* 19:57 bking@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on wdqs1014.eqiad.wmnet with reason: [[phab:T314890|T314890]]
* 19:56 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on wdqs1016.eqiad.wmnet with reason: [[phab:T314890|T314890]]
* 19:56 bking@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on wdqs1016.eqiad.wmnet with reason: [[phab:T314890|T314890]]
* 19:55 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on wdqs1015.eqiad.wmnet with reason: [[phab:T314890|T314890]]
* 19:55 bking@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on wdqs1015.eqiad.wmnet with reason: [[phab:T314890|T314890]]
* 19:38 dcausse@deploy1002: helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply
* 19:36 dcausse@deploy1002: helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply
* 19:35 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1130.eqiad.wmnet with reason: Maintenance
* 19:35 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 10:00:00 on db1130.eqiad.wmnet with reason: Maintenance
* 19:25 bking@cumin1001: START - Cookbook sre.wdqs.data-transfer
* 18:09 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 18:06 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
* 17:54 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 17:47 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
* 17:38 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1072.eqiad.wmnet with OS bullseye
* 17:29 vgutierrez: test trafficserver 9.1.2-1wm2 in cp6016 - [[phab:T309651|T309651]]
* 17:15 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1072.eqiad.wmnet with reason: host reimage
* 17:13 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1072.eqiad.wmnet with reason: host reimage
* 17:00 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1072.eqiad.wmnet with OS bullseye
* 16:54 bking@cumin1001: END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
* 16:54 bking@cumin1001: START - Cookbook sre.elasticsearch.force-shard-allocation
* 16:53 bking@cumin1001: END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
* 16:53 bking@cumin1001: START - Cookbook sre.elasticsearch.force-shard-allocation
* 16:26 bking@deploy1002: helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
* 16:26 bking@deploy1002: helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
* 16:01 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1069.eqiad.wmnet with OS bullseye
* 15:45 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1069.eqiad.wmnet with reason: host reimage
* 15:42 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1069.eqiad.wmnet with reason: host reimage
* 15:32 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 15:31 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 15:31 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 15:30 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1069.eqiad.wmnet with OS bullseye
* 15:28 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 15:27 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1058.eqiad.wmnet with OS bullseye
* 15:08 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1058.eqiad.wmnet with reason: host reimage
* 15:05 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1058.eqiad.wmnet with reason: host reimage
* 14:59 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
* 14:59 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
* m: finished running 'homer "status:active" commit "netmon: Add the netmon1003 host as a syslog destination"' in the cumin1001 host. Homer reported no errors.
* 14:54 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
* 14:50 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1058.eqiad.wmnet with OS bullseye
* 14:28 bking@cumin1001: conftool action : set/pooled=false; selector: dnsdisc=wdqs,name=codfw
* 13:57 kevinbazira@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
* 13:57 kevinbazira@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
* m: Add the new netmon1003 host as a syslog destination in homer templates/common/system.conf https://gerrit.wikimedia.org/r/c/operations/homer/public/+/819124
* m: Successfully ran '# run-puppet-merge' in the netmon1002 and netmon1003 hosts.
* m: Running '# run-puppet-agent' in the netmon1003 host
* m: Running '# run-puppet-agent' in the netmon1002 host
* 13:47 ryankemper@cumin1001: END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
* 13:46 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.force-shard-allocation
* m: puppet-merge on puppetmaster2004.codfw.wmnet for patch 819179 succeeded
* m: Set netmon1003 as netmon_server and netmon1002 as a netmon_servers_failover in the Puppet repository https://gerrit.wikimedia.org/r/c/operations/puppet/+/819179
* m: authdns updated successfully
* m: Had to revert https://gerrit.wikimedia.org/r/c/operations/dns/+/819177 because I rebased my changes incorrectly, sent the new patch in https://gerrit.wikimedia.org/r/c/operations/dns/+/821746
* m: running '# authdns-update' in ns0.wikimedia.org
* m: Flip DNS for LibreNMS and Smokeping from netmon1002 to netmon1003 https://gerrit.wikimedia.org/r/c/operations/dns/+/819177
* 13:23 jynus: stop replication on db1117:m1 [[phab:T309074|T309074]]
* m: netmon1002 to netmon1003 failover
* 13:17 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
* 13:16 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
* 10:58 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
* 09:53 vgutierrez: rolling restart of pybal in eqsin - [[phab:T310070|T310070]]
* 09:25 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
* 09:24 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
* 09:24 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
* 09:12 vgutierrez: rolling restart of pybal in codfw - [[phab:T310070|T310070]]
* 08:47 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
* 08:30 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
* 08:28 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
* 08:27 elukey@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
* 08:27 elukey@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
* 08:27 elukey@deploy1002: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
* 08:26 elukey@deploy1002: helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
* 08:24 jynus: starting data check using es1021 and es2021, expect increased read traffic [[phab:T314559|T314559]]
* 08:21 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
* 06:22 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance
* 06:22 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance
* 06:19 Amir1: dbmaint s5@eqiad ([[phab:T312863|T312863]] [[phab:T312984|T312984]] [[phab:T310011|T310011]] [[phab:T310485|T310485]])
* 06:11 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1130.eqiad.wmnet with reason: Maint
* 06:11 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1130.eqiad.wmnet with reason: Maint
* 06:08 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool db1130 [[phab:T314370|T314370]]', diff saved to https://phabricator.wikimedia.org/P32323 and previous config saved to /var/cache/conftool/dbconfig/20220809-060836-ladsgroup.json
* 06:07 oblivian@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 06:02 ladsgroup@cumin1001: dbctl commit (dc=all): 'Promote db1100 to s5 primary and set section read-write [[phab:T314370|T314370]]', diff saved to https://phabricator.wikimedia.org/P32322 and previous config saved to /var/cache/conftool/dbconfig/20220809-060159-ladsgroup.json
* 06:01 ladsgroup@cumin1001: dbctl commit (dc=all): 'Set s5 eqiad as read-only for maintenance - [[phab:T314370|T314370]]', diff saved to https://phabricator.wikimedia.org/P32321 and previous config saved to /var/cache/conftool/dbconfig/20220809-060105-ladsgroup.json
* 06:00 Amir1: Starting s5 eqiad failover from db1130 to db1100 - [[phab:T314370|T314370]]
* 05:12 ladsgroup@cumin1001: dbctl commit (dc=all): 'Set db1100 with weight 0 [[phab:T314370|T314370]]', diff saved to https://phabricator.wikimedia.org/P32320 and previous config saved to /var/cache/conftool/dbconfig/20220809-051251-ladsgroup.json
* 05:12 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 22 hosts with reason: Primary switchover s5 [[phab:T314370|T314370]]
* 05:11 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on 22 hosts with reason: Primary switchover s5 [[phab:T314370|T314370]]
* 02:42 ejegg: SmashPig upgraded from {{Gerrit|9b97ea15}} to {{Gerrit|13e9e9cc}}
* 02:31 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1148 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32318 and previous config saved to /var/cache/conftool/dbconfig/20220809-023113-ladsgroup.json
* 02:31 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance
* 02:30 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance
* 02:30 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32317 and previous config saved to /var/cache/conftool/dbconfig/20220809-023052-ladsgroup.json
* 02:28 ejegg: payments-wiki upgraded from {{Gerrit|6880236d}} to {{Gerrit|cf5e1848}}
* 02:15 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P32316 and previous config saved to /var/cache/conftool/dbconfig/20220809-021546-ladsgroup.json
* 02:00 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P32315 and previous config saved to /var/cache/conftool/dbconfig/20220809-020040-ladsgroup.json
* 01:45 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32314 and previous config saved to /var/cache/conftool/dbconfig/20220809-014534-ladsgroup.json


== 2016-01-22 ==
== 2022-08-08 ==
* 23:43 logmsgbot: legoktm@tin Synchronized php-1.27.0-wmf.11/extensions/CentralAuth/includes/session/CentralAuthSessionProvider.php: https://gerrit.wikimedia.org/r/#/c/265870/ (duration: 00m 26s)
* 23:52 tstarling@deploy1002: Synchronized wmf-config/InitialiseSettings.php: clean up testwiki experiments [[phab:T314750|T314750]] (duration: 03m 19s)
* 23:42 logmsgbot: legoktm@tin Synchronized php-1.27.0-wmf.11/includes/session/CookieSessionProvider.php: https://gerrit.wikimedia.org/r/#/c/265869/ (duration: 00m 26s)
* 23:47 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 23:22 mobrovac: restbase cassandra truncating local_group_wiktionary_T_term_definition.data
* 23:46 tstarling@deploy1002: Synchronized wmf-config/CommonSettings.php: clean up testwiki experiments [[phab:T314750|T314750]] (duration: 03m 27s)
* 22:33 mdholloway: mobileapps deployed 2900faa
* 23:46 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 22:23 logmsgbot: twentyafterfour@tin Finished scap: deploy  https://gerrit.wikimedia.org/r/#/c/263415/ and clean up old branches (duration: 07m 02s)
* 23:46 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 22:16 logmsgbot: twentyafterfour@tin Started scap: deploy  https://gerrit.wikimedia.org/r/#/c/263415/ and clean up old branches
* 23:45 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 22:06 bblack: upgrading vhtcpd on all caches
* 23:32 eileen___: config revision changed from {{Gerrit|f5668044}} to 787cd0e0
* 22:05 eileen: upgrade Civicrm from b9ebf3d31aeab8120143cfbf6bc2df0f617341cf to c009af16944a6478bd0292422f5bb0151f7a22c1
* 23:32 eileen___: civicrm upgraded from {{Gerrit|497bddf7}} to {{Gerrit|1f91ac2d}}
* 21:49 logmsgbot: anomie@tin Synchronized php-1.27.0-wmf.11/includes/: Fix T124468, for real this time (duration: 00m 36s)
* 22:16 ryankemper@cumin1001: END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 21:48 logmsgbot: anomie@tin Synchronized php-1.27.0-wmf.11/includes/: Fix T124468 (duration: 00m 38s)
* 22:16 ryankemper@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host elastic1065.eqiad.wmnet with OS bullseye
* 21:17 legoktm: running migrateAccount.php --attachbroken over list of all unattached users (T74791)
* 21:53 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1065.eqiad.wmnet with reason: host reimage
* 20:04 mutante: ruthenium - rebooting for reinstall
* 21:50 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1065.eqiad.wmnet with reason: host reimage
* 19:42 logmsgbot: aaron@tin Synchronized wmf-config/CommonSettings.php: Revert "Bump $wgJobBackoffThrottling to lower the htmlcacheupdate backlog" (duration: 00m 32s)
* 21:36 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1065.eqiad.wmnet with OS bullseye
* 18:51 jynus: "repairing" enwiki.oldtable on dbstore1001
* 21:12 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1062.eqiad.wmnet with OS bullseye
* 18:40 logmsgbot: jynus@tin Synchronized wmf-config/db-eqiad.php: Aborting pc1001 maintenance (duration: 00m 31s)
* 20:53 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1062.eqiad.wmnet with reason: host reimage
* 18:15 legoktm: running CentralAuth's resetGlobalUserTokens.php to force session resets for all users T124440
* 20:50 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1062.eqiad.wmnet with reason: host reimage
* 18:02 logmsgbot: anomie@tin Synchronized php-1.27.0-wmf.11/includes/user/User.php: Fix T124414 (duration: 00m 33s)
* 20:36 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1062.eqiad.wmnet with OS bullseye
* 17:53 legoktm: manually attaching User:Mower Genetics and User:Themeetingplace because they made edits somehow (T74791)
* 20:32 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 17:46 logmsgbot: ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Stop logging the CirrusSearchRequests channel (duration: 00m 32s)
* 20:31 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 17:44 legoktm: running migrateAccount.php --attachbroken over lists on T74791
* 20:31 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 17:39 _joe_: removed an archived CirrusSearchRequests.log on fluorine, now we have enough room for the weekend
* 20:29 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 17:29 logmsgbot: anomie@tin Synchronized php-1.27.0-wmf.11/extensions/CentralAuth/includes: Fix T124406 (duration: 00m 35s)
* 20:28 cjming: end of UTC late backport window
* 17:05 mobrovac: mobileapps deploying bba45456
* 20:27 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 17:00 logmsgbot: reedy@tin Synchronized docroot and w: Extra noc symlinks (duration: 00m 32s)
* 20:27 cjming@deploy1002: Synchronized php-1.39.0-wmf.23/skins/Vector/resources/skins.vector.styles/layouts/grid.less: Backport: [[gerrit:821243{{!}}Fix grid blowout bug (T314756)]] (duration: 03m 26s)
* 16:58 logmsgbot: jynus@tin Synchronized wmf-config/InitialiseSettings.php: monolog: reduce on-disk logging of DBPerformance to warning (duration: 00m 32s)
* 20:12 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 16:47 jynus: truncating 100GB DBPerformance.log on fluorine, compressed backup available
* 20:11 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 16:46 logmsgbot: anomie@tin Synchronized php-1.27.0-wmf.11/extensions/CentralAuth/includes/session/CentralAuthSessionProvider.php: Fix T124409, part 2 (duration: 00m 32s)
* 20:11 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 16:46 logmsgbot: anomie@tin Synchronized php-1.27.0-wmf.11/includes/session/SessionBackend.php: Fix T124409, part 1 (duration: 00m 33s)
* 20:11 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:817785{{!}}Disable sticky header edit A/B test for pilot wikis (T312296)]] (duration: 03m 35s)
* 16:41 cmjohnson1: Troubleshooting mw1228
* 20:08 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 16:36 _joe_: all api appservers in eqiad have been restarted
* 17:34 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1088.eqiad.wmnet with OS bullseye
* 16:21 ori: restarted statsv on hafnium
* 17:15 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1088.eqiad.wmnet with reason: host reimage
* 15:53 ema: Finished migrating mobile traffic to text cluster in codfw (Mexico + green US states on this map https://phabricator.wikimedia.org/T114659)
* 17:12 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1088.eqiad.wmnet with reason: host reimage
* 15:39 gwicke: aqs: increased compression block size on per-article table from 128k to 256k; expectation is to further increase compression ratio & reduce seeks on rotating disks
* 17:00 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1088.eqiad.wmnet with OS bullseye
* 15:22 Reedy: created translate tables on ruwikimedia T121766
* 16:54 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1085.eqiad.wmnet with OS bullseye
* 14:18 paravoid: cr1-eqord: turning up BGP with Zayo
* 16:49 ryankemper@cumin1001: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 13:08 logmsgbot: ori@tin Synchronized php-1.27.0-wmf.10/extensions/MobileFrontend: I08cdf37a1: Use TitleSquidURLs hook to purge mobile URLs directly (Bug: T124165) (duration: 00m 33s)
* 16:43 pt1979@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:05 logmsgbot: ori@tin Synchronized wmf-config/InitialiseSettings.php: If443f3c80: monolog: explicitly declare logstash as debug for sessions (duration: 00m 34s)
* 16:41 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
* 12:31 ema: Starting migration of mobile traffic to text cluster https://phabricator.wikimedia.org/T109286
* 16:39 pt1979@cumin2002: START - Cookbook sre.dns.netbox
* 11:35 logmsgbot: oblivian@tin Synchronized wmf-config/InitialiseSettings.php: Re-synching (duration: 00m 31s)
* 16:38 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
* 11:25 logmsgbot: oblivian@tin Synchronized wmf-config/InitialiseSettings.php: Stop writing session logs to fluorine (duration: 01m 25s)
* 16:26 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1085.eqiad.wmnet with OS bullseye
* 11:17 bblack: codfw LVS under etcd/conftool control now, like ulsfo
* 16:24 bking@cumin1001: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host elastic1085.eqiad.wmnet with OS bullseye
* 10:57 logmsgbot: jynus@tin Synchronized wmf-config/db-eqiad.php: Depool pc1001 for maintenance (duration: 02m 48s)
* 16:19 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
* 10:45 _joe_: rolling restarting the API cluster in eqiad
* 16:16 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
* 10:34 _joe_: rolling restart of all api appservers in eqiad
* 16:16 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 10:07 _joe_: dropping api logs from 2015 on fluorine
* 16:14 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
* 09:10 _joe_: rolling restart of imagescalers in eqiad
* 16:12 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
* 08:48 _joe_: powercycling ms-be1002, blank console, down
* 16:10 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
* 08:46 _joe_: rebooting mw1001 with a new kernel
* 16:09 ryankemper@cumin1001: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 08:07 _joe_: upgrading kernel on all mw hosts in eqiad
* 16:04 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1085.eqiad.wmnet with OS bullseye
* 05:07 logmsgbot: tstarling@tin Synchronized php-1.27.0-wmf.11/includes/parser/ParserCache.php: (no message) (duration: 01m 28s)
* 16:00 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1084.eqiad.wmnet with OS bullseye
* 02:42 logmsgbot: tstarling@tin Synchronized php-1.27.0-wmf.11/includes/parser/ParserCache.php: (no message) (duration: 01m 28s)
* 15:58 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 02:40 logmsgbot: tstarling@tin Synchronized php-1.27.0-wmf.11/includes/OutputPage.php: (no message) (duration: 01m 32s)
* 15:47 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1084.eqiad.wmnet with reason: host reimage
* 02:30 logmsgbot: mwdeploy@tin sync-l10n completed (1.27.0-wmf.11) (duration: 09m 31s)
* 15:46 sukhe: upload reprepro -C main include bullseye-wikimedia python-pynetbox_6.6.0-1+wmf11u1_amd64.changes
* 01:44 logmsgbot: catrope@tin Finished scap: Deploying OATHAuth and WikimediaMessages i18n changes (duration: 30m 52s)
* 15:45 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1084.eqiad.wmnet with reason: host reimage
* 01:37 gwicke: restbase cassandra: increased compression chunk size from 256 to 512k on wikimedia and wikipedia html and data-parsoid
* 15:37 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2021.codfw.wmnet with reason: Maint
* 01:13 logmsgbot: catrope@tin Started scap: Deploying OATHAuth and WikimediaMessages i18n changes
* 15:37 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2021.codfw.wmnet with reason: Maint
* 01:08 eileen: Updating CiviCRM from cb5e20c29d7376920c45eb5c343e6ee464217833 to b9ebf3d31aeab8120143cfbf6bc2df0f617341cf
* 15:32 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1084.eqiad.wmnet with OS bullseye
* 00:19 logmsgbot: ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Add ability for OfficeWiki sysops to add and remove flood group rights from themselves. (duration: 01m 27s)
* 14:59 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
* 00:14 logmsgbot: ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: enable EventBus extension on mediawikiwiki (duration: 01m 27s)
* 14:55 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
* 00:10 logmsgbot: ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: enable sandboxlink on ladwiki and don't send messages to autocreated accounts on metawiki (duration: 01m 27s)
* 14:47 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cp5001.eqsin.wmnet with reason: depooled: faulty DIMM: [[phab:T314256|T314256]]
* 00:08 logmsgbot: ebernhardson@tin Synchronized wmf-config/throttle.php: Santiago Editatón throttle rule (duration: 01m 27s)
* 14:46 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on cp5001.eqsin.wmnet with reason: depooled: faulty DIMM: [[phab:T314256|T314256]]
* 00:02 logmsgbot: ebernhardson@tin Synchronized wmf-config/CirrusSearch-production.php: configure cirrus completion suggester recycling (duration: 01m 29s)
* 14:34 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
* 00:00 logmsgbot: ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: configure cirrus completion suggester recycling (duration: 01m 28s)
* 14:11 kevinbazira@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
* 13:03 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:01 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:01 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 12:58 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 12:56 urbanecm@deploy1002: Synchronized wmf-config/CommonSettings.php: {{Gerrit|77fd5abdd7d9462869259e1511bbcf2d7ce62246}}: Growth: Add new rights to wgAvailableRights (duration: 03m 24s)
* 12:30 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1102.eqiad.wmnet
* 12:09 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 12:09 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 12:06 urbanecm@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/GrowthExperiments/: {{Gerrit|3eaf155678b7313c55dcca0cd39ab29f73eead37}}: MentorTools: Do not use MentorWeightManager ([[phab:T314362|T314362]]) (duration: 03m 31s)
* 12:04 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 11:43 btullis@cumin1001: START - Cookbook sre.hosts.reboot-single for host an-worker1102.eqiad.wmnet
* 11:21 jelto@cumin1001: conftool action : set/pooled=yes; selector: name=kubernetes2022.codfw.wmnet
* 11:21 jelto: kubectl uncordon kubernetes2022.codfw.wmnet
* 10:43 Amir1: Removing db2079 from orchestrator ([[phab:T313885|T313885]])
* 10:39 Amir1: Removing db2079 from zarcillo ([[phab:T313885|T313885]])
* 10:35 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2079.codfw.wmnet
* 10:35 ladsgroup@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 10:30 ladsgroup@cumin1001: START - Cookbook sre.dns.netbox
* 10:25 ladsgroup@cumin1001: START - Cookbook sre.hosts.decommission for hosts db2079.codfw.wmnet
* 10:18 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on db2079.codfw.wmnet with reason: Decom
* 10:17 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on db2079.codfw.wmnet with reason: Decom
* 08:41 jbond: deploy libtirpc update
* 07:57 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1144:3314 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32310 and previous config saved to /var/cache/conftool/dbconfig/20220808-075723-ladsgroup.json
* 07:57 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
* 07:57 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
* 07:57 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32309 and previous config saved to /var/cache/conftool/dbconfig/20220808-075702-ladsgroup.json
* 07:53 godog: grow sda/sdb 3 by 100G on thanos-be2001 - [[phab:T314275|T314275]]
* 07:50 godog: grow sda/sdb 3 by 100G on thanos-be1004 - [[phab:T314275|T314275]]
* 07:41 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P32308 and previous config saved to /var/cache/conftool/dbconfig/20220808-074156-ladsgroup.json
* 07:32 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 07:27 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 07:27 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 07:26 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P32307 and previous config saved to /var/cache/conftool/dbconfig/20220808-072650-ladsgroup.json
* 07:23 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 07:22 kartik@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820815{{!}}trwikivoyage: Create rollbacker user group (T314678)]] (duration: 03m 17s)
* 07:18 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 07:17 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 07:17 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 07:16 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 07:11 elukey: restart rsyslog on ml-serve2007
* 07:11 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 07:11 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32306 and previous config saved to /var/cache/conftool/dbconfig/20220808-071144-ladsgroup.json
* 07:10 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 07:10 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 07:09 kartik@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820261{{!}}Enable SectionTranslation on 10 Wikipedias where ContentTranslation is default (T308829)]] (duration: 03m 15s)
* 07:09 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 07:06 XioNoX: add CSP headers to Netbox - [[phab:T296356|T296356]]
* 07:05 elukey: restart rsyslog on ml-serve-ctrl2001


== 2016-01-21 ==
== 2022-08-07 ==
* 22:46 legoktm: started running migratePass0.php (CentralAuth) on group1 wikis
* 19:58 taavi: taavi@mwmaint1002 ~ $ echo "https://upload.wikimedia.org/wikipedia/commons/1/15/Keep_tidy_ask.svg" {{!}} mwscript purgeList.php --wiki enwiki # [[phab:T314712|T314712]]
* 22:24 logmsgbot: thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.27.0-wmf.11
* 13:52 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1142 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32305 and previous config saved to /var/cache/conftool/dbconfig/20220807-135204-ladsgroup.json
* 22:23 legoktm: started running migratePass0.php (CentralAuth) on group0 wikis
* 13:51 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance
* 21:35 ejegg: re-enabled low-level fundraising banner campaigns
* 13:51 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance
* 21:30 ejegg: reverted donatewiki maintenance message
* 13:51 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32304 and previous config saved to /var/cache/conftool/dbconfig/20220807-135143-ladsgroup.json
* 21:19 ejegg: updated paymentswiki from a7785baa7b40b442ecf0b60d47572502d0759780 to 1817327b4b0919ebe26bbd8b9d84fac1bd7ddb03
* 13:36 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P32303 and previous config saved to /var/cache/conftool/dbconfig/20220807-133637-ladsgroup.json
* 21:13 andrewbogott: all reachable labs instances are now running security-patched kernels.
* 13:21 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P32302 and previous config saved to /var/cache/conftool/dbconfig/20220807-132131-ladsgroup.json
* 21:12 logmsgbot: thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: cswiktionary to 1.27.0-wmf.11
* 13:06 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32301 and previous config saved to /var/cache/conftool/dbconfig/20220807-130625-ladsgroup.json
* 21:12 ejegg: disabled low-level fundraising banner campaigns
* 12:06 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1141 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32300 and previous config saved to /var/cache/conftool/dbconfig/20220807-120610-ladsgroup.json
* 21:12 andrewbogott: all labvirt10xx hosts are now running the latest utopic kernel
* 12:06 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance
* 21:09 ejegg: replaced form on donatewiki with maintenance notice
* 12:05 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance
* 21:08 logmsgbot: thcipriani@tin Synchronized php-1.27.0-wmf.11/includes/session/SessionManager.php: SessionManager: Notify AuthPlugin when auto-creating accounts [[gerrit:265578]] (duration: 01m 26s)
* 12:05 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32299 and previous config saved to /var/cache/conftool/dbconfig/20220807-120549-ladsgroup.json
* 21:01 andrewbogott: rebooting labvirt1010
* 11:50 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P32298 and previous config saved to /var/cache/conftool/dbconfig/20220807-115043-ladsgroup.json
* 20:51 andrewbogott: rebooting labvirt1009
* 11:35 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P32297 and previous config saved to /var/cache/conftool/dbconfig/20220807-113537-ladsgroup.json
* 20:33 andrewbogott: rebooting labvirt1007
* 11:20 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32296 and previous config saved to /var/cache/conftool/dbconfig/20220807-112031-ladsgroup.json
* 20:33 logmsgbot: dduvall@tin Synchronized php-1.27.0-wmf.11/includes/user/BotPassword.php: deploy fix for T124335 (duration: 01m 29s)
* 20:27 mobrovac: restbase deploy end of 79a4d27
* 20:20 mobrovac: restbase deploy start of 79a4d27
* 20:16 andrewbogott: rebooting labvirt1006
* 19:58 mobrovac: mobileapps deploying 68c09e
* 19:54 logmsgbot: dduvall@tin rebuilt wikiversions.php and synchronized wikiversions files: rollback cswiktionary to 1.27.0-wmf.10
* 19:54 andrewbogott: rebooting labvirt1005
* 19:32 andrewbogott: rebooting labvirt1004
* 19:31 logmsgbot: dduvall@tin Synchronized php-1.27.0-wmf.11/extensions/CentralAuth/includes/session/CentralAuthTokenSessionProvider.php: deploy https://gerrit.wikimedia.org/r/#/c/265545/ for 1.27.0-wmf.11 (duration: 01m 28s)
* 19:24 mobrovac: restbase rolling-restart after firejail inclusion
* 19:22 mobrovac: restbase re-enabling puppet in prod
* 19:14 andrewbogott: rebooting labvirt1003
* 18:57 logmsgbot: dduvall@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.27.0-wmf.11
* 18:53 marxarelli: starting train promotion of group1 to 1.27.0-wmf.11
* 18:52 marxarelli: sync to mw2020 failed due to host key verification failure; sync to mw2087/mw2039/mw2098 failed due to connection failures
* 18:47 marxarelli: 4 apache sync failures during sync-file, appear to be known issues
* 18:46 andrewbogott: rebooting labvirt1002
* 18:43 logmsgbot: dduvall@tin Synchronized php-1.27.0-wmf.11/includes/session/PHPSessionHandler.php: deploy follow-up warning fix for T124126 (duration: 01m 28s)
* 18:43 mobrovac: restbase disabling puppet in prod for testing firejail in staging
* 18:41 akosiaris: enable puppet and salt-minion on sca100{1,2}.eqiad.wmnet
* 18:39 akosiaris: depool sca1001, sca1002 for citoid
* 18:34 akosiaris: pool scb1001, scb1002 for citoid
* 18:07 andrewbogott: rebooting labvirt1001
* 17:57 akosiaris: depool sca1001,sca1002 for graphoid pybal config
* 17:49 logmsgbot: thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Really enable ContentTranslationCorpora [[gerrit:265514]] (duration: 01m 29s)
* 17:48 akosiaris: add scb1001, scb1002 in pybal graphoid config
* 17:30 akosiaris: disabled puppet and salt-minion on sca1001, sca1002 for graphoid upgrade
* 17:24 logmsgbot: thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: Enable ContentTranslationCorpora Part II [[gerrit:265459]] (duration: 01m 28s)
* 17:22 logmsgbot: thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable ContentTranslationCorpora Part I [[gerrit:265459]] (duration: 01m 28s)
* 17:12 _joe_: restarting pybal on the main balancers in ulsfo to consume from etcd
* 17:02 andrewbogott: rebooting labvirt1008
* 16:42 jynus: batch-converting m4-master (log) tables from innodb to tokudb
* 16:42 logmsgbot: thcipriani@tin Synchronized php-1.27.0-wmf.11/extensions/MobileFrontend/MobileFrontend.php: SWAT: Use TitleSquidURLs hook to purge mobile URLs directly Part II [[gerrit:265486]] (duration: 01m 28s)
* 16:40 logmsgbot: thcipriani@tin Synchronized php-1.27.0-wmf.11/extensions/MobileFrontend/includes/MobileFrontend.hooks.php: SWAT: Use TitleSquidURLs hook to purge mobile URLs directly Part I [[gerrit:265486]] (duration: 01m 28s)
* 16:35 ottomata: stopped eventlogging mysql consumers for long downtime: https://phabricator.wikimedia.org/T120187
* 16:28 logmsgbot: thcipriani@tin Synchronized php-1.27.0-wmf.10/extensions/MobileApp/config/config.json: SWAT: Roll out RESTBase usage to Android Beta app: 100% [[gerrit:265117]] (duration: 01m 27s)
* 16:22 logmsgbot: thcipriani@tin Synchronized php-1.27.0-wmf.11/extensions/MobileApp/config/config.json: SWAT: Roll out RESTBase usage to Android Beta app: 100% [[gerrit:265118]] (duration: 01m 28s)
* 16:20 ottomata: started eventlogging mysql consumers
* 16:19 paravoid: deactivating GTT BGP peering on cr2-eqiad
* 16:05 logmsgbot: thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: wgRCWatchCategoryMembership true on dewiki [[gerrit:264732]] (duration: 01m 28s)
* 15:59 ottomata: stopping eventlogging mysql consumers for https://phabricator.wikimedia.org/T123546
* 14:37 paravoid: upgraded cr2-codfw to JunOS 13.3R8.7
* 13:20 _joe_: rolling reboot of imagescalers, jobrunners in codfw
* 12:10 paravoid: upgrading cr1-codfw to JunOS 13.3R8.7
* 11:27 _joe_: restarting pybal on lvs4003, switching to etcd
* 11:25 _joe_: restarting pybal on lvs4004, switching to etcd
* 11:09 jynus: adding new version of mariadb to carbon for jessie (10.0.23-1)
* 10:19 _joe_: mw2098 doesn't reboot, console unreachable
* 10:10 jynus: mw2098.codfw.wmnet failed to sync
* 10:10 logmsgbot: jynus@tin Synchronized wmf-config/db-eqiad.php: Restore s5 DB configuration (duration: 01m 57s)
* 09:53 _joe_: rolling reboot of the codfw appserver layer
* 09:27 _joe_: powercycled mw1162, memory exhaustion
* 08:01 _joe_: upgrading all codfw appserver layer's kernel to linux-image-3.13.0-76-generic
* 02:56 logmsgbot: l10nupdate@tin ResourceLoader cache refresh completed at Thu Jan 21 02:56:44 UTC 2016 (duration 7m 9s)
* 02:49 logmsgbot: mwdeploy@tin sync-l10n completed (1.27.0-wmf.11) (duration: 09m 39s)
* 02:27 logmsgbot: mwdeploy@tin sync-l10n completed (1.27.0-wmf.10) (duration: 09m 33s)
* 02:24 mobrovac: citoid deploying 3a1b6c8648
* 02:16 ori: Restarting jobrunner service on job runners to ensure I180856917 gets picked up
* 01:47 mutante: nitrogen - install package upgrades
* 01:15 bd808: Restarted logstash on logstash1003
* 01:14 bd808: Restarted logstash on logstash1002
* 01:04 logmsgbot: maxsem@tin Synchronized wmf-config/: https://gerrit.wikimedia.org/r/#/c/265395/ (duration: 00m 32s)
* 00:56 logmsgbot: maxsem@tin Synchronized php-1.27.0-wmf.11/extensions/GeoData/: https://gerrit.wikimedia.org/r/#/c/265409/ (duration: 00m 33s)
* 00:50 logmsgbot: maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/265142/ (duration: 00m 32s)


== 2016-01-20 ==
== 2022-08-06 ==
* 23:56 logmsgbot: reedy@tin Synchronized php-1.27.0-wmf.10/extensions/SemanticForms/: fix wikitech again (duration: 00m 34s)
* 17:59 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1149 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32295 and previous config saved to /var/cache/conftool/dbconfig/20220806-175916-ladsgroup.json
* 23:06 bd808: Restarted logstash on logstash1001
* 17:59 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance
* 23:04 bd808: Logstash1001 went nuts and decided that instead of 2016 it would go back to the start of 2015 after 2015-12-31T23:59
* 17:58 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance
* 22:54 bd808: no HHVM log events in logstash since 2015-12-31T23:59:44.000Z
* 03:10 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 22:48 bd808: HHVM log messages not being recorded in Logstash; bd808 to investigate
* 03:09 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 22:38 logmsgbot: tgr@tin Synchronized php-1.27.0-wmf.11/includes/: T124143,T124126 (duration: 00m 36s)
* 03:09 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 22:06 logmsgbot: anomie@tin Synchronized php-1.27.0-wmf.11/extensions/OAuth: Deploy fix for T124224 (duration: 00m 32s)
* 03:08 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 22:04 logmsgbot: anomie@tin Synchronized php-1.27.0-wmf.2/extensions/OAuth: Deploy fix for T124224 (duration: 00m 34s)
* 03:03 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 21:51 logmsgbot: reedy@tin Synchronized php-1.27.0-wmf.11/extensions/SemanticResultFormats: Fix wikitech log noise (duration: 00m 31s)
* 03:02 krinkle@deploy1002: Synchronized w/: {{Gerrit|I9067d47fab0324}} (duration: 03m 25s)
* 21:50 logmsgbot: reedy@tin Synchronized php-1.27.0-wmf.11/extensions/SemanticMediaWiki: Fix wikitech log noise (duration: 00m 34s)
* 03:02 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 21:48 subbu: finished deploying parsoid sha f1ddfb88
* 03:02 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 21:41 subbu: synced new parsoid code; restarted parsoid on wtp1001 as a canary
* 03:01 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 21:35 subbu: starting parsoid deploy
* 02:41 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 21:32 thcipriani: reverted group1 wikis to 1.27.0-wmf.10 due to session errors.
* 02:39 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 21:30 logmsgbot: thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.27.0-wmf.10
* 02:39 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 21:14 andrewbogott: rebooting labvirt1011
* 02:38 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 21:08 logmsgbot: reedy@tin Synchronized php-1.27.0-wmf.11/extensions/SemanticForms/: Fix fatal on wikitech (duration: 00m 36s)
* 02:38 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance
* 20:37 akosiaris: s#/dev/md1#/dev/mapper/tank-data# on labvirt1010, reverted by puppet with Notice: /Stage[main]/Role::Labs::Openstack::Nova::Compute/Mount[/var/lib/nova/instances]/device: device changed '/dev/mapper/tank-data' to '/dev/md1'
* 02:37 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance
* 20:37 akosiaris: s#/dev/md1#/dev/mapper/tank-data#
* 02:33 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 19:32 logmsgbot: dduvall@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.27.0-wmf.11
* 02:32 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 19:14 marxarelli: including labswiki and labtestwiki in group1 promotion after all
* 02:32 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 19:09 marxarelli: starting promotion of group1, but holding back labswiki and labtestwiki until Jan 21 'all' promotion
* 02:31 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 18:54 paravoid: manually triggering an ubuntu mirror update ("sudo -u mirror /usr/local/sbin/update-ubuntu-mirror" on carbon)
* 18:41 jynus: schema change on wikidatawiki (wb_terms) finished; slaves already catching up
* 18:34 mutante: restart hhvm on mw1206
* 18:32 godog: bounce stuck hhvm on mw1205
* 18:06 paravoid: turning up BGP with Zayo in codfw
* 17:48 jynus: restarting replication on db1026 after schema change
* 17:09 gwicke: restbase cassandra: set DTCS max_window_size_seconds to 70736000, large enough to accommodate a two-year window
* 16:56 logmsgbot: thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Set default graph vega version back to 1 [[gerrit:265289]] (duration: 00m 32s)
* 16:46 logmsgbot: thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Add davidabian.com to wgCopyUploadsDomains [[gerrit:265286]] (duration: 00m 32s)
* 16:42 logmsgbot: thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: Change default graph version param. Part II [[gerrit:265282]] (duration: 00m 32s)
* 16:42 logmsgbot: thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Change default graph version param. Part I [[gerrit:265282]] (duration: 00m 36s)
* 16:33 logmsgbot: thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Add davidabian.com to wgCopyUploadsDomains [[gerrit:259003]] (duration: 00m 32s)
* 16:21 logmsgbot: thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Add *.bodleian.ox.ac.uk to wgCopyUploadsDomains [[gerrit:265165]] (duration: 00m 33s)
* 16:19 logmsgbot: thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Add *.archives.gov to wgCopyUploadsDomains [[gerrit:265163]] (duration: 00m 32s)
* 16:13 godog: bounce hhvm on mw1191 and syntaxlight runaway processes
* 16:05 logmsgbot: thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Disable active gadget user stats on enwiki since it takes too long [[gerrit:265185]] (duration: 00m 32s)
* 14:52 logmsgbot: reedy@tin Synchronized php-1.27.0-wmf.11/vendor/: Fix ?PHP properly from commit (duration: 00m 36s)
* 14:50 godog: powercycle mw1123, hhvm oom
* 14:47 ema: Finished reverting migration of mobile traffic to text cluster in codfw https://phabricator.wikimedia.org/T109286
* 14:24 logmsgbot: hoo@tin Synchronized wmf-config/db-eqiad.php: Set db1045 load to 0 (duration: 00m 32s)
* 14:23 logmsgbot: reedy@tin Synchronized php-1.27.0-wmf.11/: consistency (duration: 02m 38s)
* 14:15 logmsgbot: hoo@tin Synchronized wmf-config/db-eqiad.php: Re-Pool lagged db1045 (duration: 00m 35s)
* 14:14 _joe_: synchronizing /srv/deployment manually between the two deployment servers for the first time
* 14:11 logmsgbot: hoo@tin Synchronized wmf-config/db-eqiad.php: Has not been synced before (duration: 00m 32s)
* 14:07 logmsgbot: reedy@tin Synchronized php-1.27.0-wmf.10/: consistency (duration: 02m 38s)
* 13:58 logmsgbot: reedy@tin Synchronized php-1.27.0-wmf.11/extensions/Validator/: noop for wikitech deploy (duration: 00m 32s)
* 13:58 logmsgbot: reedy@tin Synchronized php-1.27.0-wmf.11/extensions/SemanticMediaWiki/: noop for wikitech deploy (duration: 00m 34s)
* 13:57 logmsgbot: reedy@tin Synchronized php-1.27.0-wmf.11/extensions/SemanticResultFormats/: noop for wikitech deploy (duration: 00m 33s)
* 13:41 ema: Revert migration of mobile traffic to text cluster in codfw https://phabricator.wikimedia.org/T109286
* 12:55 akosiaris: restart hhvm on mw1130
* 12:43 jynus: performing alter table on db1026 (ETA: 5 hours)
* 12:20 logmsgbot: jynus@tin Synchronized wmf-config/db-eqiad.php: Setting s5 master as recentchanges role (duration: 00m 32s)
* 12:04 jynus: trying schema change on wikidata (wb_terms)
* 09:36 akosiaris: gnt-instance modify -H disk_aio=native cygnus.codfw.wmnet
* 09:18 akosiaris: offline fr_archive volume on nas1001-a
* 09:15 akosiaris: unexport /vol/fr_archive on nas1001-a
* 07:56 _joe_: powercycling mw1162, unable to login from console, memory exhaustion
* 07:24 logmsgbot: ebernhardson@tin Synchronized php-1.27.0-wmf.10/extensions/CirrusSearch/includes/DataSender.php: stop checking for frozen indices while codfw elasticsearch recovers (duration: 01m 42s)
* 06:24 ebernhardson: codfw elasticsearch cluster stopped responding during load test, idling test to see if it recovers
* 03:44 logmsgbot: l10nupdate@tin ResourceLoader cache refresh completed at Wed Jan 20 03:44:48 UTC 2016 (duration 7m 29s)
* 03:37 logmsgbot: mwdeploy@tin sync-l10n completed (1.27.0-wmf.11) (duration: 16m 21s)
* 03:02 logmsgbot: mwdeploy@tin sync-l10n completed (1.27.0-wmf.10) (duration: 10m 06s)
* 02:35 logmsgbot: mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 11m 20s)
* 01:27 logmsgbot: aaron@tin Synchronized wmf-config: Configure $wgCdnReboundPurgeDelay (duration: 00m 32s)
* 01:01 mobrovac: restbase deploy end of d621b76
* 00:57 logmsgbot: krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/264917/ (duration: 00m 32s)
* 00:56 legoktm: delete from localuser where lu_name ="Αντώνης Μανιός" and lu_wiki ="mediawikiwiki" limit 1 on centralauth db for T119736
* 00:53 logmsgbot: krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/264920/ (duration: 00m 33s)
* 00:49 logmsgbot: krenair@tin Synchronized php-1.27.0-wmf.10/extensions/MobileFrontend/includes/api/ApiMobileView.php: https://gerrit.wikimedia.org/r/#/c/264973/ (duration: 00m 32s)
* 00:49 mobrovac: restbase deploy start of d621b76
* 00:38 logmsgbot: krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/264961/ (duration: 00m 31s)
* 00:37 logmsgbot: krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/264961/ (duration: 00m 33s)
* 00:22 logmsgbot: krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/264260/ (duration: 00m 32s)
* 00:21 logmsgbot: krenair@tin Synchronized wmf-config/InitialiseSettings-labs.php: https://gerrit.wikimedia.org/r/#/c/264260/ (duration: 00m 32s)
* 00:17 logmsgbot: krenair@tin Synchronized php-1.27.0-wmf.10/extensions/CirrusSearch: https://gerrit.wikimedia.org/r/#/c/265146/ (duration: 00m 33s)
* 00:10 logmsgbot: krenair@tin Synchronized php-1.27.0-wmf.10/extensions/CirrusSearch/includes/ElasticsearchIntermediary.php: https://gerrit.wikimedia.org/r/#/c/264989/ (duration: 00m 32s)


== 2016-01-19 ==
== 2022-08-05 ==
* 23:33 logmsgbot: aaron@tin Synchronized wmf-config/CommonSettings.php: Bump $wgJobBackoffThrottling to lower the htmlcacheupdate backlog (duration: 00m 32s)
* 22:20 dcausse@deploy1002: Finished deploy [wikimedia/discovery/analytics@71fe016]: Fix schedule_interval for image_recommendation_weekly (duration: 02m 01s)
* 23:22 logmsgbot: krenair@tin Synchronized wmf-config/wikitech.php: https://gerrit.wikimedia.org/r/265145 (duration: 02m 24s)
* 22:18 dcausse@deploy1002: Started deploy [wikimedia/discovery/analytics@71fe016]: Fix schedule_interval for image_recommendation_weekly
* 23:19 logmsgbot: dduvall@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.27.0-wmf.11
* 17:08 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1195.eqiad.wmnet with OS bullseye
* 23:13 logmsgbot: dduvall@tin Finished scap: testwiki to php-1.27.0-wmf.11 and rebuild l10n cache (duration: 72m 03s)
* 16:54 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1194.eqiad.wmnet with OS bullseye
* 22:01 logmsgbot: dduvall@tin Started scap: testwiki to php-1.27.0-wmf.11 and rebuild l10n cache
* 16:53 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1195.eqiad.wmnet with reason: host reimage
* 21:35 logmsgbot: krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/265135 (duration: 00m 32s)
* 16:49 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1195.eqiad.wmnet with reason: host reimage
* 21:33 logmsgbot: krenair@tin Synchronized dblists/nonglobal.dblist: https://gerrit.wikimedia.org/r/265135 (duration: 03m 21s)
* 16:41 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1194.eqiad.wmnet with reason: host reimage
* 21:33 ema: Finished migrating mobile traffic to text cluster in codfw (Mexico + green US states on this map https://phabricator.wikimedia.org/T114659)
* 16:37 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1194.eqiad.wmnet with reason: host reimage
* 21:15 logmsgbot: dduvall@tin scap failed: CalledProcessError Command '/usr/local/bin/mwscript mergeMessageFileList.php --wiki="testwiki" --list-file="/srv/mediawiki-staging/wmf-config/extension-list" --output="/tmp/tmp.qyk48j8kem" ' returned non-zero exit status 1 (duration: 16m 11s)
* 16:34 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1195.eqiad.wmnet with OS bullseye
* 20:59 Krenair: sync-common on labtestweb2001
* 16:27 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp203[56]\.codfw\.wmnet,service=varnish-fe
* 20:58 logmsgbot: dduvall@tin Started scap: testwiki to php-1.27.0-wmf.11 and rebuild l10n cache
* 16:27 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp203[56]\.codfw\.wmnet,service=ats-be
* 20:48 mutante: tin: deleted unused things from /srv/deployment (T120157)
* 16:27 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp203[56]\.codfw\.wmnet,service=ats-tls
* 20:46 logmsgbot: catrope@tin Synchronized wmf-config/InitialiseSettings.php: Disable global AbuseFilters on non-global wikis (duration: 02m 04s)
* 16:26 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1194.eqiad.wmnet with OS bullseye
* 20:25 logmsgbot: dduvall@tin scap failed: CalledProcessError Command '/usr/local/bin/mwscript mergeMessageFileList.php --wiki="labtestwiki" --list-file="/srv/mediawiki-staging/wmf-config/extension-list" --output="/tmp/tmp.jRNpeW67FO" ' returned non-zero exit status 1 (duration: 01m 31s)
* 16:25 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1193.eqiad.wmnet with OS bullseye
* 20:23 logmsgbot: dduvall@tin Started scap: testwiki to php-1.27.0-wmf.11 and rebuild l10n cache
* 16:21 pt1979@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host db1192.eqiad.wmnet with OS bullseye
* 20:13 mutante: ruthenium: disable puppet, copy data over to osmium (screen)
* 16:12 dcausse@deploy1002: Finished deploy [wikimedia/discovery/analytics@8489923]: [[phab:T304954|T304954]]: Automate imagesuggestion imports (duration: 02m 03s)
* 20:12 mutante: ruthenium: service mysql stop
* 16:11 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1193.eqiad.wmnet with reason: host reimage
* 19:15 logmsgbot: catrope@tin Synchronized wmf-config/CommonSettings.php: EventBus plumbing (duration: 00m 30s)
* 16:11 milimetric@deploy1002: Finished deploy [analytics/refinery@fe7bf9e]: Hotfix for webrequest load refine, now with FORCE :) (duration: 06m 09s)
* 19:14 logmsgbot: catrope@tin Synchronized wmf-config/InitialiseSettings.php: Disable Flow on wikitech; add EventBus plumbing (duration: 00m 31s)
* 16:10 dcausse@deploy1002: Started deploy [wikimedia/discovery/analytics@8489923]: [[phab:T304954|T304954]]: Automate imagesuggestion imports
* 19:13 logmsgbot: catrope@tin Synchronized wmf-config/extension-list: Add EventBus (duration: 00m 31s)
* 16:07 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1193.eqiad.wmnet with reason: host reimage
* 19:00 marxarelli: starting branch cut for 1.27.0-wmf.11
* 16:07 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1192.eqiad.wmnet with reason: host reimage
* 18:42 ema: Starting migration of mobile traffic to text cluster https://phabricator.wikimedia.org/T109286
* 16:05 milimetric@deploy1002: Started deploy [analytics/refinery@fe7bf9e]: Hotfix for webrequest load refine, now with FORCE :)
* 17:54 logmsgbot: krenair@tin Synchronized php-1.27.0-wmf.10/extensions/UploadWizard/UploadWizard.config.php: https://gerrit.wikimedia.org/r/#/c/264969/ (duration: 00m 31s)
* 16:04 milimetric@deploy1002: Finished deploy [analytics/refinery@fe7bf9e]: Hotfix for webrequest load refine (duration: 34m 38s)
* 16:51 logmsgbot: krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/264964/ (duration: 00m 31s)
* 16:03 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1192.eqiad.wmnet with reason: host reimage
* 16:47 logmsgbot: krenair@tin Synchronized php-1.27.0-wmf.10/extensions/Graph/modules/graph-loader.js: https://gerrit.wikimedia.org/r/#/c/264715/ (duration: 00m 31s)
* 15:55 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1193.eqiad.wmnet with OS bullseye
* 16:45 logmsgbot: krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/264469/ (duration: 00m 31s)
* 15:52 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1191.eqiad.wmnet with OS bullseye
* 16:41 logmsgbot: krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/264437/ (duration: 00m 32s)
* 15:51 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1192.eqiad.wmnet with OS bullseye
* 14:58 cmjohnson1: reseating asw-c-eqiad uplink module (xe-1/1/0 and xe-1/1/2)
* 15:42 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1190.eqiad.wmnet with OS bullseye
* 14:29 jynus: reimporting some fawiki tables from production into labsdb hosts
* 15:38 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1191.eqiad.wmnet with reason: host reimage
* 13:52 godog: powercycle ms-be1001
* 15:34 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1191.eqiad.wmnet with reason: host reimage
* 13:51 paravoid: powercycling alsafi
* 15:30 milimetric@deploy1002: Started deploy [analytics/refinery@fe7bf9e]: Hotfix for webrequest load refine
* 02:53 logmsgbot: l10nupdate@tin ResourceLoader cache refresh completed at Tue Jan 19 02:53:40 UTC 2016 (duration 7m 0s)
* 15:28 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1190.eqiad.wmnet with reason: host reimage
* 02:46 logmsgbot: mwdeploy@tin sync-l10n completed (1.27.0-wmf.10) (duration: 09m 21s)
* 15:25 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1190.eqiad.wmnet with reason: host reimage
* 02:26 logmsgbot: mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 10m 40s)
* 15:24 jbond: upload trapperkeeper-metrics-clojure to puppet7 component
* 15:22 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1191.eqiad.wmnet with OS bullseye
* 15:19 jbond: upload puppetlabs-http-client-clojure to puppet7 component
* 15:17 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 15:16 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 15:16 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 15:15 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 15:14 dancy@deploy1002: Finished scap: Backport for [[gerrit:820653]] scap gitignore: ignore all files under the `scap` directory (duration: 04m 41s)
* 15:11 jbond: upload jolokia to puppet7 component
* 15:10 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1185.eqiad.wmnet with OS bullseye
* 15:09 dancy@deploy1002: Started scap: Backport for [[gerrit:820653]] scap gitignore: ignore all files under the `scap` directory
* 15:09 jbond: upload test-chuck-clojure to puppet7 component
* 15:05 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1190.eqiad.wmnet with OS bullseye
* 15:04 jbond: upload test-check-clojure to puppet7 component
* 14:57 jbond: upload nippy-clojure to puppet7 component
* 14:56 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1185.eqiad.wmnet with reason: host reimage
* 14:52 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1185.eqiad.wmnet with reason: host reimage
* 14:43 jbond: upload fressian to puppet7 component
* 14:40 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1185.eqiad.wmnet with OS bullseye
* 14:40 jbond: upload test-generative-clojure to puppet7 component
* 14:35 pt1979@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:34 jbond: upload data-generators-clojure to puppet7 component
* 14:31 pt1979@cumin2002: START - Cookbook sre.dns.netbox
* 14:23 jbond: upload encore-clojure to puppet7 component
* 14:17 jbond: upload truss-clojure to puppet7 component
* 14:13 jbond: upload structured-logging-clojure to puppet7 component
* 14:06 jbond: upload murphy-clojure to puppet7 component
* 13:57 jbond: upload logstash-logback-encoder-7.2 to puppet7 component
* 13:49 jbond: upload kitchensink-clojure to puppet7 component
* 13:27 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool hosts with fragile power supply ([[phab:T314559|T314559]] [[phab:T314628|T314628]])', diff saved to https://phabricator.wikimedia.org/P32292 and previous config saved to /var/cache/conftool/dbconfig/20220805-132709-ladsgroup.json
* 13:12 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on db2095.codfw.wmnet with reason: Maintenance
* 13:12 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on db2095.codfw.wmnet with reason: Maintenance
* 13:09 sukhe: repool codfw
* 13:02 jbond: upload honeysql-clojure to puppet7 component
* 12:53 _joe_: progressive repool of services in codfw
* 12:24 moritzm: installing nano bugfix updates from bullseye point release
* 11:50 hnowlan@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
* 11:40 hnowlan@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
* 11:37 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repool after PDU maint on D3 ([[phab:T310146|T310146]])', diff saved to https://phabricator.wikimedia.org/P32291 and previous config saved to /var/cache/conftool/dbconfig/20220805-113729-ladsgroup.json
* 11:35 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repool after PDU maint on C6 ([[phab:T310145|T310145]])', diff saved to https://phabricator.wikimedia.org/P32290 and previous config saved to /var/cache/conftool/dbconfig/20220805-113555-ladsgroup.json
* 11:34 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repool after PDU maint on C5 ([[phab:T310145|T310145]])', diff saved to https://phabricator.wikimedia.org/P32289 and previous config saved to /var/cache/conftool/dbconfig/20220805-113436-ladsgroup.json
* 10:46 hnowlan@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
* 10:36 hnowlan@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
* 10:17 hnowlan@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
* 10:12 Amir1: dbmaint at s4@codfw ([[phab:T312863|T312863]])
* 10:07 hnowlan@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
* 09:04 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 12 hosts with reason: Maintenance
* 09:03 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 12 hosts with reason: Maintenance
* 09:03 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance
* 09:03 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance
* 00:53 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8 days, 0:00:00 on gerrit2001.wikimedia.org with reason: decom, replaced by gerrit2002
* 00:53 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 8 days, 0:00:00 on gerrit2001.wikimedia.org with reason: decom, replaced by gerrit2002
* 00:53 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for gerrit2002.wikimedia.org
* 00:53 dzahn@cumin1001: START - Cookbook sre.hosts.remove-downtime for gerrit2002.wikimedia.org
* 00:52 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8 days, 0:00:00 on gerrit2002.wikimedia.org with reason: decom, replaced by gerrit2002
* 00:52 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 8 days, 0:00:00 on gerrit2002.wikimedia.org with reason: decom, replaced by gerrit2002
* 00:18 mutante: restarting gerrit for config change - removing old replica [[phab:T313250|T313250]]


== 2016-01-18 ==
== 2022-08-04 ==
* 23:26 logmsgbot: krenair@tin Synchronized multiversion/MWMultiVersion.php: https://gerrit.wikimedia.org/r/264895 (duration: 00m 31s)
* 23:07 mutante: switching gerrit-replica.wikimedia.org to new machine gerrit2002, dropping gerrit-replica-new.wikimedia.org [[phab:T313250|T313250]]
* 23:08 logmsgbot: krenair@tin Synchronized wmf-config: https://gerrit.wikimedia.org/r/#/c/264786/ (duration: 00m 32s)
* 21:07 ryankemper@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 22:55 logmsgbot: krenair@tin rebuilt wikiversions.php and synchronized wikiversions files: (no message)
* 20:59 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 22:55 logmsgbot: krenair@tin Synchronized dblists: (no message) (duration: 00m 31s)
* 20:57 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 22:53 logmsgbot: krenair@tin Synchronized w/static/images/project-logos/wikitech.png: https://gerrit.wikimedia.org/r/#/c/264786/ (duration: 00m 31s)
* 20:57 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 17:30 logmsgbot: krenair@tin Synchronized wmf-config/InitialiseSettings-labs.php: https://gerrit.wikimedia.org/r/264758 - labs-only change (duration: 00m 36s)
* 20:56 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 14:24 godog: powercycle praseodymium
* 20:56 thcipriani@deploy1002: Finished scap: Backport for [[gerrit:819774]] tkwiki: Update wordmark (duration: 06m 12s)
* 10:42 godog: powercycle ms-be2016, high load avg
* 20:51 ryankemper@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 10:16 godog: dist-upgrade ms-be3002 to trusty
* 20:51 ryankemper@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 02:57 logmsgbot: l10nupdate@tin ResourceLoader cache refresh completed at Mon Jan 18 02:57:41 UTC 2016 (duration 7m 8s)
* 20:51 ryankemper@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 02:50 logmsgbot: mwdeploy@tin sync-l10n completed (1.27.0-wmf.10) (duration: 08m 39s)
* 20:50 thcipriani@deploy1002: Started scap: Backport for [[gerrit:819774]] tkwiki: Update wordmark
* 02:49 YuviPanda: updated annualreport for foks
* 20:48 thcipriani@deploy1002: Finished scap: Backport for [[gerrit:812391]] [config]: Add click event logging for mobile and desktop (duration: 39m 16s)
* 02:30 logmsgbot: mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 11m 38s)
* 20:45 ryankemper@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 20:24 ryankemper@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 20:23 ryankemper@deploy1002: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
* 20:22 ryankemper@deploy1002: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
* 20:16 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:15 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:15 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:14 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:13 ryankemper@deploy1002: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
* 20:13 ryankemper@deploy1002: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
* 20:10 ryankemper@deploy1002: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
* 20:09 ryankemper@deploy1002: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
* 20:08 thcipriani@deploy1002: Started scap: Backport for [[gerrit:812391]] [config]: Add click event logging for mobile and desktop
* 19:59 ryankemper@deploy1002: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
* 19:55 dancy@deploy1002: rebuilt and synchronized wikiversions files: resync
* 19:49 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for thanos-be2001.codfw.wmnet
* 19:49 mvernon@cumin1001: START - Cookbook sre.hosts.remove-downtime for thanos-be2001.codfw.wmnet
* 19:44 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 8 hosts
* 19:44 mvernon@cumin1001: START - Cookbook sre.hosts.remove-downtime for 8 hosts
* 19:42 Emperor: rebooting thanos-be2001 to fix drive ordering
* 19:37 bking@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for elastic2071.codfw.wmnet
* 19:37 bking@cumin1001: START - Cookbook sre.hosts.remove-downtime for elastic2071.codfw.wmnet
* 19:31 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2071.codfw.wmnet with reason: [[phab:T310146|T310146]]
* 19:31 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2071.codfw.wmnet with reason: [[phab:T310146|T310146]]
* 19:13 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 19:12 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 19:12 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 19:12 ryankemper@deploy1002: helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
* 19:11 ryankemper@deploy1002: helmfile [eqiad] START helmfile.d/services/changeprop: apply
* 19:11 dancy: There were many errors during php-fpm restart due to failure to contact http://lvs2009:9090/pools/appservers-https_443/mw2361.codfw.wmnet and the like.
* 19:11 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 19:10 dancy@deploy1002: rebuilt and synchronized wikiversions files: group2 wikis to 1.39.0-wmf.23 refs [[phab:T308076|T308076]]
* 19:09 ryankemper@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop: apply
* 19:09 ryankemper@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop: apply
* 19:05 otto@deploy1002: helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: sync
* 19:04 otto@deploy1002: helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: sync
* 19:04 otto@deploy1002: helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: sync
* 19:03 otto@deploy1002: helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: sync
* 19:03 otto@deploy1002: helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: sync
* 19:02 otto@deploy1002: helmfile [staging] START helmfile.d/services/eventgate-analytics-external: sync
* 19:02 ottomata: roll-restarting eventgate-analytics-external to pick up backwards incompatible schema change - [[phab:T314151|T314151]]
* 18:47 ryankemper@deploy1002: helmfile [staging] DONE helmfile.d/services/changeprop: apply
* 18:46 ryankemper@deploy1002: helmfile [staging] START helmfile.d/services/changeprop: apply
* 18:41 cwhite: poweroff kafka-logging2003 - [[phab:T310145|T310145]]
* 18:39 dzahn@cumin2002: conftool action : set/pooled=yes; selector: dc=codfw,name=mw237[0-6].codfw.wmnet
* 18:39 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 7 hosts
* 18:39 dzahn@cumin2002: START - Cookbook sre.hosts.remove-downtime for 7 hosts
* 18:35 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2369.codfw.wmnet
* 18:35 dzahn@cumin2002: START - Cookbook sre.hosts.remove-downtime for mw2369.codfw.wmnet
* 18:35 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2368.codfw.wmnet
* 18:35 dzahn@cumin2002: START - Cookbook sre.hosts.remove-downtime for mw2368.codfw.wmnet
* 18:35 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2367.codfw.wmnet
* 18:35 dzahn@cumin2002: START - Cookbook sre.hosts.remove-downtime for mw2367.codfw.wmnet
* 18:35 dzahn@cumin2002: conftool action : set/pooled=yes; selector: dc=codfw,name=mw2369.codfw.wmnet
* 18:35 dzahn@cumin2002: conftool action : set/pooled=yes; selector: dc=codfw,name=mw2368.codfw.wmnet
* 18:35 dzahn@cumin2002: conftool action : set/pooled=yes; selector: dc=codfw,name=mw2367.codfw.wmnet
* 18:35 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2366.codfw.wmnet
* 18:35 dzahn@cumin2002: START - Cookbook sre.hosts.remove-downtime for mw2366.codfw.wmnet
* 18:34 dzahn@cumin2002: conftool action : set/pooled=yes; selector: dc=codfw,name=mw2366.codfw.wmnet
* 18:30 dzahn@cumin2002: conftool action : set/pooled=yes; selector: dc=codfw,name=mw2279.codfw.wmnet
* 18:30 dzahn@cumin2002: conftool action : set/pooled=yes; selector: dc=codfw,name=mw2278.codfw.wmnet
* 18:29 dzahn@cumin2002: conftool action : set/pooled=yes; selector: dc=codfw,name=mw2277.codfw.wmnet
* 18:29 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2276.codfw.wmnet
* 18:29 dzahn@cumin2002: START - Cookbook sre.hosts.remove-downtime for mw2276.codfw.wmnet
* 18:29 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2275.codfw.wmnet
* 18:29 dzahn@cumin2002: START - Cookbook sre.hosts.remove-downtime for mw2275.codfw.wmnet
* 18:29 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2274.codfw.wmnet
* 18:29 dzahn@cumin2002: START - Cookbook sre.hosts.remove-downtime for mw2274.codfw.wmnet
* 18:29 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2273.codfw.wmnet
* 18:29 dzahn@cumin2002: START - Cookbook sre.hosts.remove-downtime for mw2273.codfw.wmnet
* 18:26 milimetric@deploy1002: Finished deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288] (duration: 02m 39s)
* 18:24 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2272.codfw.wmnet
* 18:24 dzahn@cumin2002: START - Cookbook sre.hosts.remove-downtime for mw2272.codfw.wmnet
* 18:24 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2271.codfw.wmnet
* 18:24 dzahn@cumin2002: START - Cookbook sre.hosts.remove-downtime for mw2271.codfw.wmnet
* 18:23 milimetric@deploy1002: Started deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288]
* 18:23 milimetric@deploy1002: Finished deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288] (duration: 01m 32s)
* 18:23 dzahn@cumin2002: conftool action : set/pooled=yes; selector: dc=codfw,name=mw2276.codfw.wmnet
* 18:23 dzahn@cumin2002: conftool action : set/pooled=yes; selector: dc=codfw,name=mw2275.codfw.wmnet
* 18:23 dzahn@cumin2002: conftool action : set/pooled=yes; selector: dc=codfw,name=mw2274.codfw.wmnet
* 18:22 dzahn@cumin2002: conftool action : set/pooled=yes; selector: dc=codfw,name=mw2273.codfw.wmnet
* 18:22 dzahn@cumin2002: conftool action : set/pooled=yes; selector: dc=codfw,name=mw2272.codfw.wmnet
* 18:22 Emperor: shutdown moss-fe2001.codfw.wmnet,ms-fe2011.codfw.wmnet,ms-be20[34,35,42,48,68].codfw.wmnet PDU work [[phab:T310145|T310145]]
* 18:22 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 8 hosts with reason: PDU work
* 18:21 milimetric@deploy1002: Started deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288]
* 18:21 mvernon@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 8 hosts with reason: PDU work
* 18:21 milimetric@deploy1002: Finished deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288] (duration: 00m 03s)
* 18:21 milimetric@deploy1002: Started deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288]
* 18:21 milimetric@deploy1002: Finished deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288] (duration: 00m 03s)
* 18:21 milimetric@deploy1002: Started deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288]
* 18:20 milimetric@deploy1002: Finished deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288] (duration: 01m 49s)
* 18:20 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 9 hosts
* 18:20 mvernon@cumin1001: START - Cookbook sre.hosts.remove-downtime for 9 hosts
* 18:19 milimetric@deploy1002: Started deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288]
* 18:14 mutante: mw2272 and upwards: scap pull, checking monitoring, repooling.. one by one
* 18:13 dzahn@cumin2002: conftool action : set/pooled=yes; selector: dc=codfw,name=mw2271.codfw.wmnet
* 18:12 btullis@deploy1002: Finished deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288] (duration: 00m 51s)
* 18:11 btullis@deploy1002: Started deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288]
* 18:06 btullis@deploy1002: Finished deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288] (duration: 01m 54s)
* 18:04 btullis@deploy1002: Started deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288]
* 17:55 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2009.codfw.wmnet with reason: shutdown for PDU upgrade
* 17:55 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2009.codfw.wmnet with reason: shutdown for PDU upgrade
* 17:43 mutante: maps2008 - downtime and shutdown for D3 maintenance
* 17:42 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on maps2008.codfw.wmnet with reason: codfw reboots
* 17:42 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on maps2008.codfw.wmnet with reason: codfw reboots
* 17:42 mutante: thumbor2006 - downtime and shutdown for D3 maintenance
* 17:42 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on thumbor2006.codfw.wmnet with reason: codfw reboots
* 17:41 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on thumbor2006.codfw.wmnet with reason: codfw reboots
* 17:39 mutante: mw2386 - systemctl reset-failed
* 17:31 mutante: phab2001 - systemctl restart ssh-phab, attempting to clear Icinga pybal alerts, related to reboots
* 17:30 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on dns2001.wikimedia.org with reason: shutdown for PDU upgrade
* 17:30 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 3:00:00 on dns2001.wikimedia.org with reason: shutdown for PDU upgrade
* 17:29 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dns2001.wikimedia.org with reason: shutdown for PDU upgrade
* 17:29 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on dns2001.wikimedia.org with reason: shutdown for PDU upgrade
* 17:28 Amir1: dbmaint at s4@eqiad ([[phab:T312863|T312863]])
* 17:26 bd808@deploy1002: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
* 17:26 bd808@deploy1002: helmfile [eqiad] START helmfile.d/services/developer-portal: apply
* 17:24 bd808@deploy1002: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
* 17:23 bd808@deploy1002: helmfile [codfw] START helmfile.d/services/developer-portal: apply
* 17:23 bd808@deploy1002: helmfile [staging] DONE helmfile.d/services/developer-portal: apply
* 17:23 bd808@deploy1002: helmfile [staging] START helmfile.d/services/developer-portal: apply
* 17:20 mutante: [an-launcher1002:~] $ sudo systemctl reset-failed
* 17:20 mvernon@cumin1001: conftool action : set/pooled=no; selector: name=ms-fe2012.codfw.wmnet
* 17:18 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
* 17:18 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
* 17:18 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2038.codfw.wmnet,service=varnish-fe
* 17:18 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2038.codfw.wmnet,service=ats-be
* 17:18 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2038.codfw.wmnet,service=ats-tls
* 17:16 Emperor: shutdown of moss-fe2002.codfw.wmnet,ms-be20[37,38,43,61,65,69].codfw.wmnet,ms-fe2012.codfw.wmnet,thanos-fe2003.codfw.wmnet for power work [[phab:T310146|T310146]]
* 17:16 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cp[2035-2036].codfw.wmnet with reason: shutdown for PDU upgrade
* 17:15 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 4:00:00 on cp[2035-2036].codfw.wmnet with reason: shutdown for PDU upgrade
* 17:15 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 9 hosts with reason: PDU work
* 17:15 mvernon@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 9 hosts with reason: PDU work
* 17:15 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[56]\.codfw\.wmnet,service=varnish-fe
* 17:15 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[56]\.codfw\.wmnet,service=ats-be
* 17:15 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[56]\.codfw\.wmnet,service=ats-tls
* 17:13 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be[2036,2049,2054].codfw.wmnet,thanos-be2003.codfw.wmnet
* 17:13 mvernon@cumin1001: START - Cookbook sre.hosts.remove-downtime for ms-be[2036,2049,2054].codfw.wmnet,thanos-be2003.codfw.wmnet
* 17:12 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2037.codfw.wmnet,service=varnish-fe
* 17:12 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2037.codfw.wmnet,service=ats-be
* 17:12 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2037.codfw.wmnet,service=ats-tls
* 17:12 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2050.codfw.wmnet with reason: [[phab:T310146|T310146]]
* 17:12 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2050.codfw.wmnet with reason: [[phab:T310146|T310146]]
* 17:11 ebysans@deploy1002: Finished deploy [analytics/refinery@2553288] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@2553288] (duration: 00m 04s)
* 17:11 ebysans@deploy1002: Started deploy [analytics/refinery@2553288] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@2553288]
* 17:11 ebysans@deploy1002: Finished deploy [analytics/refinery@2553288] (thin): Regular analytics weekly train THIN [analytics/refinery@2553288] (duration: 00m 07s)
* 17:10 ebysans@deploy1002: Started deploy [analytics/refinery@2553288] (thin): Regular analytics weekly train THIN [analytics/refinery@2553288]
* 17:10 ebysans@deploy1002: Finished deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288] (duration: 00m 15s)
* 17:09 ebysans@deploy1002: Started deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288]
* 17:07 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lvs2010.codfw.wmnet with reason: shutdown for PDU upgrade
* 17:07 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on lvs2010.codfw.wmnet with reason: shutdown for PDU upgrade
* 16:55 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=maps2008.codfw.wmnet
* 16:51 ebysans@deploy1002: Finished deploy [analytics/refinery@2553288] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@2553288] (duration: 07m 14s)
* 16:45 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=restbase2016.codfw.wmnet
* 16:45 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=restbase202[05].codfw.wmnet
* 16:45 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase202[05].codfw.wmnet
* 16:45 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=maps2007.codfw.wmnet
* 16:43 ebysans@deploy1002: Started deploy [analytics/refinery@2553288] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@2553288]
* 16:43 ebysans@deploy1002: Finished deploy [analytics/refinery@2553288] (thin): Regular analytics weekly train THIN [analytics/refinery@2553288] (duration: 00m 07s)
* 16:43 ebysans@deploy1002: Started deploy [analytics/refinery@2553288] (thin): Regular analytics weekly train THIN [analytics/refinery@2553288]
* 16:37 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 18 hosts
* 16:37 jayme@cumin1001: START - Cookbook sre.hosts.remove-downtime for 18 hosts
* 16:35 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2059.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 16:35 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2059.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 16:34 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kafka-main2003.codfw.wmnet with reason: PDU swap
* 16:34 ebysans@deploy1002: Finished deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288] (duration: 00m 20s)
* 16:34 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on kafka-main2003.codfw.wmnet with reason: PDU swap
* 16:34 ebysans@deploy1002: Started deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288]
* 16:32 ebysans@deploy1002: Finished deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288] (duration: 29m 59s)
* 16:30 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool D3 for PDU maint', diff saved to https://phabricator.wikimedia.org/P32286 and previous config saved to /var/cache/conftool/dbconfig/20220804-163037-ladsgroup.json
* 16:28 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 16:28 ladsgroup@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820376{{!}}Start reading from new templatelinks columns in commons (T306673)]] (duration: 03m 00s)
* 16:27 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 16:27 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 16:26 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 16:17 brett: deploying authdns - geodns: Map out African countries by DC latency ([[phab:T311472|T311472]])
* 16:12 cwhite: poweroff logstash2028 - [[phab:T310145|T310145]]
* 16:06 Emperor: shutdown ms-be20[39,49,54].codfw.wmnet,thanos-be2003 for PDU swap [[phab:T310145|T310145]]
* 16:03 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be[2036,2049,2054].codfw.wmnet,thanos-be2003.codfw.wmnet with reason: PDU work
* 16:02 mvernon@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be[2036,2049,2054].codfw.wmnet,thanos-be2003.codfw.wmnet with reason: PDU work
* 16:02 ebysans@deploy1002: Started deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288]
* 15:50 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2048.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 15:50 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2048.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 15:43 damilare: payments-wiki upgraded from {{Gerrit|0e4a5b3b}} to {{Gerrit|6880236d}}
* 15:37 _joe_: uncordoning ml-serve200<nowiki>{</nowiki>1,6<nowiki>}</nowiki>
* 15:27 sukhe: power off cp2037,cp2038: PDU upgrade
* 15:25 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:30:00 on phab2001.codfw.wmnet with reason: PDU swap
* 15:25 jelto: power off phab2001
* 15:25 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 3:30:00 on phab2001.codfw.wmnet with reason: PDU swap
* 15:25 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cp[2037-2038].codfw.wmnet with reason: shutdown for PDU upgrade
* 15:24 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 4:00:00 on cp[2037-2038].codfw.wmnet with reason: shutdown for PDU upgrade
* 15:24 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[78]\.codfw\.wmnet,service=varnish-fe
* 15:23 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[78]\.codfw\.wmnet,service=ats-be
* 15:23 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[78]\.codfw\.wmnet,service=ats-tls
* 15:21 XioNoX: un-drain codfw-ulsfo link - [[phab:T310310|T310310]]
* 15:21 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db[2116,2127,2167-2168].codfw.wmnet,es2022.codfw.wmnet with reason: Maintenance ([[phab:T310145|T310145]])
* 15:20 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 10:00:00 on db[2116,2127,2167-2168].codfw.wmnet,es2022.codfw.wmnet with reason: Maintenance ([[phab:T310145|T310145]])
* 15:20 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool C6 for PDU maint ([[phab:T310145|T310145]])', diff saved to https://phabricator.wikimedia.org/P32285 and previous config saved to /var/cache/conftool/dbconfig/20220804-151958-ladsgroup.json
* 15:16 btullis@cumin1001: END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
* 15:16 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on restbase[2016,2020,2025].codfw.wmnet with reason: PDU maintenance
* 15:16 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 3:00:00 on restbase[2016,2020,2025].codfw.wmnet with reason: PDU maintenance
* 15:13 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db[2114,2126,2166].codfw.wmnet with reason: Maintenance ([[phab:T310145|T310145]])
* 15:13 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 10:00:00 on db[2114,2126,2166].codfw.wmnet with reason: Maintenance ([[phab:T310145|T310145]])
* 15:13 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp203[12]\.codfw\.wmnet,service=varnish-fe
* 15:13 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp203[12]\.codfw\.wmnet,service=ats-be
* 15:13 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp203[12]\.codfw\.wmnet,service=ats-tls
* 15:12 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be[2058,2064].codfw.wmnet
* 15:12 mvernon@cumin1001: START - Cookbook sre.hosts.remove-downtime for ms-be[2058,2064].codfw.wmnet
* 15:11 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool hosts for PDU maint ([[phab:T310145|T310145]])', diff saved to https://phabricator.wikimedia.org/P32284 and previous config saved to /var/cache/conftool/dbconfig/20220804-151121-ladsgroup.json
* 15:09 godog: poweroff logstash2002 - [[phab:T310145|T310145]]
* 15:07 _joe_: powering down mc203<nowiki>{</nowiki>0,1<nowiki>}</nowiki>
* 15:07 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on logstash2002.codfw.wmnet with reason: pdu
* 15:06 root@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on logstash2002.codfw.wmnet with reason: pdu
* 15:05 btullis@cumin1001: START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
* 14:58 jelto: power off mc20[30-31]
* 14:56 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mc[2030-2031].codfw.wmnet with reason: PDU swap
* 14:56 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on mc[2030-2031].codfw.wmnet with reason: PDU swap
* 14:56 XioNoX: draining codfw-ulsfo link - [[phab:T310310|T310310]]
* 14:36 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=maps2009.codfw.wmnet
* 14:35 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=maps2007.codfw.wmnet
* 14:35 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase2025.codfw.wmnet
* 14:35 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase2020.codfw.wmnet
* 14:35 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase2016.codfw.wmnet
* 14:32 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs2011.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 14:31 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs2011.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 14:25 jelto: power off gitlab-runner2003
* 14:25 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:30:00 on gitlab-runner2003.codfw.wmnet with reason: PDU swap
* 14:25 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs2001.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 14:24 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs2001.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 14:24 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:30:00 on gitlab-runner2003.codfw.wmnet with reason: PDU swap
* 14:23 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2032.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 14:22 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2032.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 14:22 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on logstash2035.codfw.wmnet with reason: pdu
* 14:22 godog: poweroff logstash2035 - [[phab:T310145|T310145]]
* 14:22 root@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on logstash2035.codfw.wmnet with reason: pdu
* 14:21 Emperor: shutdown ms-be20[58,64].codfw.wmnet for PDU swap [[phab:T310145|T310145]]
* 14:20 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 14:19 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 14:19 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 14:18 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 14:14 Lucas_WMDE: UTC afternoon backport+config window done
* 14:13 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/CommonSettings-labs.php: Config: [[gerrit:820454{{!}}Remove unused $wgMathUseRestBase (T274436)]] (duration: 03m 01s)
* 14:13 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 14:12 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 14:12 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 14:11 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 14:06 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 14:05 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/CommonSettings-labs.php: Config: [[gerrit:820254{{!}}CommonSettings-labs: Fix usage of $wgSFSValidateIPListLocationMD5]] (duration: 02m 51s)
* 14:05 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 14:05 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 14:05 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2033.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 14:04 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2033.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 14:04 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:59 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/wikitech.php: Config: [[gerrit:820255{{!}}wikitech: Remove old LDAP config vars]] (duration: 02m 54s)
* 13:59 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:58 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be[2058,2064].codfw.wmnet with reason: PDU work
* 13:58 mvernon@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be[2058,2064].codfw.wmnet with reason: PDU work
* 13:58 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:58 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:57 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:52 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:51 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:51 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:50 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:48 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:820404{{!}}Remove unused $wgIncludejQueryMigrate (T280944)]] (2/2) (duration: 03m 03s)
* 13:45 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:45 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820404{{!}}Remove unused $wgIncludejQueryMigrate (T280944)]] (1/2) (duration: 02m 58s)
* 13:44 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:44 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:43 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:40 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2066.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 13:39 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2066.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 13:38 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:37 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:820402{{!}}Remove unused $wgLegacyJavaScriptGlobals (T72470)]] (2/2) (duration: 02m 59s)
* 13:37 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:37 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:36 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:34 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820402{{!}}Remove unused $wgLegacyJavaScriptGlobals (T72470)]] (1/2) (duration: 02m 58s)
* 13:31 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:30 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:30 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:29 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:26 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/SearchSettingsForSDC.php: Config: [[gerrit:820397{{!}}Remove unused $wgWBCSEnableDispatchingQueryBuilder]] (duration: 03m 01s)
* 13:24 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:23 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:23 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:19 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:17 taavi@deploy1002: Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:820441{{!}}Remove unused CA P3P config]] (duration: 03m 09s)
* 13:14 jbond: introduce new puppetmaster backends puppetmaster[12]004
* 13:14 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2065.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 13:14 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2065.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 13:14 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:13 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:13 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:12 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:11 taavi@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:819175{{!}}QuickSurveys: Deploy research incentive survey to Bengali wiki (T314333)]] (duration: 03m 26s)
* 13:07 moritzm: installing jetty9 security updates
* 12:48 moritzm: installing Linux 4.19.249 kernels on Buster hosts
* 12:03 jbond: send sretest100[12] and idp-test2001 to the new puppetmaster[12]004 servers to test
* 11:46 moritzm: installing Linux 5.10.127-2 kernels on Bullseye hosts
* 11:41 jmm@cumin2002: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2017.codfw.wmnet to cluster codfw and group D
* 11:41 moritzm: installing libpgjava security updates
* 11:37 jmm@cumin2002: START - Cookbook sre.ganeti.addnode for new host ganeti2017.codfw.wmnet to cluster codfw and group D
* 11:34 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2017.codfw.wmnet
* 11:26 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host ganeti2017.codfw.wmnet
* 11:09 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2017.codfw.wmnet with OS bullseye
* 10:53 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=restbase2015.codfw.wmnet
* 10:53 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=restbase2022.codfw.wmnet
* 10:52 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2017.codfw.wmnet with reason: host reimage
* 10:49 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2017.codfw.wmnet with reason: host reimage
* 10:30 jmm@cumin2002: START - Cookbook sre.hosts.reimage for host ganeti2017.codfw.wmnet with OS bullseye
* 10:27 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2017.codfw.wmnet with reason: Remove node for eventual reimage, [[phab:T311686|T311686]]
* 10:27 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2017.codfw.wmnet with reason: Remove node for eventual reimage, [[phab:T311686|T311686]]
* 10:19 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 9:00:00 on 32 hosts with reason: PDU swap
* 10:19 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 9:00:00 on 32 hosts with reason: PDU swap
* 10:03 Lucas_WMDE: stashbot temporarily parted and lost several logs between 9:42 UTC and 9:49 UTC; mainly mwdebug helmfile start/done, also ayounsi sre.deploy.python-code cookbook to cumin1001, cumin2002; see IRC logs
* 10:02 ayounsi@cumin1001: END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: update requirements + wmf-netbox - ayounsi@cumin1001
* 10:00 jynus: stop db2099 [[phab:T310145|T310145]]
* 10:00 ayounsi@cumin1001: START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: update requirements + wmf-netbox - ayounsi@cumin1001
* 09:39 jelto: power off mw22[71-79].codfw.wmnet
* 09:38 urbanecm@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/GrowthExperiments/includes/EventLogging/SpecialEditGrowthConfigLogger.php: {{Gerrit|ba67dd940217e9f786f4349b4da0fe088475fde9}}: SpecialEditGrowthConfigLogger: Update schema version ([[phab:T314173|T314173]], [[phab:T312148|T312148]]) (duration: 03m 18s)
* 09:37 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 09:37 marostegui@cumin1001: dbctl commit (dc=all): 'Add db2177 to s3 [[phab:T311494|T311494]]', diff saved to https://phabricator.wikimedia.org/P32282 and previous config saved to /var/cache/conftool/dbconfig/20220804-093704-marostegui.json
* 09:36 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 09:36 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 09:35 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|ddcd333015bb58a98709a5005a5db7e8519dd0a5}}: testwiki: Growth: Assign enrollasmentor to * ([[phab:T310905|T310905]]) (duration: 03m 41s)
* 09:35 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 09:32 jelto: set/pooled=inactive mw22[71-79].codfw.wmnet
* 09:31 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 9:30:00 on 9 hosts with reason: PDU swap
* 09:31 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 9:30:00 on 9 hosts with reason: PDU swap
* 09:30 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 09:29 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 09:29 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 09:28 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 09:27 ayounsi@cumin1001: END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: wmf-netbox.py update - ayounsi@cumin1001
* 09:26 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2089.codfw.wmnet
* 09:26 marostegui@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 09:26 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|0614a39bf15252c95a96565dd7c986237f3d3323}}: testwiki: Growth: Switch to structured mentor list ([[phab:T310905|T310905]]) (duration: 03m 38s)
* 09:25 ayounsi@cumin1001: START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: wmf-netbox.py update - ayounsi@cumin1001
* 09:23 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 09:23 marostegui@cumin1001: START - Cookbook sre.dns.netbox
* 09:22 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 09:22 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 09:21 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 09:18 marostegui@cumin1001: START - Cookbook sre.hosts.decommission for hosts db2089.codfw.wmnet
* 09:12 jelto@cumin1001: conftool action : set/pooled=inactive; selector: name=kubernetes2022.codfw.wmnet
* 09:03 oblivian@mwmaint1002: pull aborted:  (duration: 00m 06s)
* 08:58 moritzm: installing gsasl security updates
* 08:57 oblivian@mwmaint1002: pull aborted:  (duration: 00m 18s)
* 08:48 moritzm: draining ganeti2017 [[phab:T311686|T311686]]
* 08:45 jelto: power off kubernetes2022
* 08:43 oblivian@deploy1002: Synchronized README: testing new scap configuration (duration: 03m 18s)
* 08:39 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:22:00 on kubernetes2022.codfw.wmnet with reason: PDU swap
* 08:38 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 10:22:00 on kubernetes2022.codfw.wmnet with reason: PDU swap
* 08:37 jelto@cumin1001: conftool action : set/pooled=no; selector: name=kubernetes2022.codfw.wmnet
* 08:35 jelto: kubectl drain kubernetes2022.codfw.wmnet
* 08:32 jelto: kubectl cordon kubernetes2022.codfw.wmnet
* 08:28 moritzm: imported gsasl 1.8.0-8+wmf1 to stretch-wikimedia
* 08:26 jelto: power off mc2049 and mc2050
* 08:24 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:36:00 on mc[2049-2050].codfw.wmnet with reason: PDU swap
* 08:24 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 10:36:00 on mc[2049-2050].codfw.wmnet with reason: PDU swap
* 08:22 oblivian@mwmaint1002: pull aborted:  (duration: 00m 11s)
* 08:19 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1132, db111, db1127, db1143', diff saved to https://phabricator.wikimedia.org/P32281 and previous config saved to /var/cache/conftool/dbconfig/20220804-081958-root.json
* 08:19 jelto: power off mc2047 and mc2048
* 08:16 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:45:00 on mc[2047-2048].codfw.wmnet with reason: PDU swap
* 08:16 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 10:45:00 on mc[2047-2048].codfw.wmnet with reason: PDU swap
* 08:04 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd2002.codfw.wmnet with reason: Switch instance to plain disks, [[phab:T311686|T311686]]
* 08:04 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd2002.codfw.wmnet with reason: Switch instance to plain disks, [[phab:T311686|T311686]]
* 07:55 marostegui: Remove grants for 208.80.154.160/208.80.155.109 [[phab:T314528|T314528]]
* 07:49 marostegui@cumin1001: dbctl commit (dc=all): 'Remove db2089 from dbctl [[phab:T313799|T313799]]', diff saved to https://phabricator.wikimedia.org/P32280 and previous config saved to /var/cache/conftool/dbconfig/20220804-074957-marostegui.json
* 07:47 godog: grow sda/sdb 3 by 100G on thanos-be2002 - [[phab:T314275|T314275]]
* 07:46 godog: grow sda/sdb 3 by 100G on thanos-be1003 - [[phab:T314275|T314275]]
* 07:30 jmm@cumin2002: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2030.codfw.wmnet to cluster codfw and group A
* 07:29 jmm@cumin2002: START - Cookbook sre.ganeti.addnode for new host ganeti2030.codfw.wmnet to cluster codfw and group A
* 07:18 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2030.codfw.wmnet
* 07:16 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2135,2160].codfw.wmnet with reason: codfw pdu maintenance
* 07:16 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2135,2160].codfw.wmnet with reason: codfw pdu maintenance
* 07:11 ayounsi@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 07:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host ganeti2030.codfw.wmnet
* 07:09 jmm@cumin2002: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2030.codfw.wmnet to cluster codfw and group A
* 07:09 jmm@cumin2002: START - Cookbook sre.ganeti.addnode for new host ganeti2030.codfw.wmnet to cluster codfw and group A
* 07:09 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es[2023-2025].codfw.wmnet with reason: codfw pdu maintenance
* 07:08 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es[2023-2025].codfw.wmnet with reason: codfw pdu maintenance
* 07:05 ayounsi@cumin1001: START - Cookbook sre.dns.netbox
* 07:03 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2030.codfw.wmnet
* 07:02 ayounsi@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 06:58 ayounsi@cumin1001: START - Cookbook sre.dns.netbox
* 06:58 ayounsi@cumin1001: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
* 06:58 ayounsi@cumin1001: START - Cookbook sre.dns.netbox
* 06:54 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host ganeti2030.codfw.wmnet
* 06:06 _joe_: restarted memcached on mc2038 to pick up the actual production configuration
* 05:59 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2030.codfw.wmnet with OS bullseye
* 05:49 kart_: Updated cxserver to 2022-08-04-022612-production ([[phab:T313296|T313296]], [[phab:T308248|T308248]])
* 05:44 kartik@deploy1002: helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
* 05:43 kartik@deploy1002: helmfile [eqiad] START helmfile.d/services/cxserver: apply
* 05:42 kartik@deploy1002: helmfile [codfw] DONE helmfile.d/services/cxserver: apply
* 05:41 kartik@deploy1002: helmfile [codfw] START helmfile.d/services/cxserver: apply
* 05:40 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2030.codfw.wmnet with reason: host reimage
* 05:39 kartik@deploy1002: helmfile [staging] DONE helmfile.d/services/cxserver: apply
* 05:38 kartik@deploy1002: helmfile [staging] START helmfile.d/services/cxserver: apply
* 05:36 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2030.codfw.wmnet with reason: host reimage
* 05:22 jmm@cumin2002: START - Cookbook sre.hosts.reimage for host ganeti2030.codfw.wmnet with OS bullseye
* 05:17 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2030.codfw.wmnet with reason: Remove node for eventual reimage, [[phab:T311686|T311686]]
* 05:16 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2030.codfw.wmnet with reason: Remove node for eventual reimage, [[phab:T311686|T311686]]
* 04:38 ejegg: payments-wiki upgraded from {{Gerrit|712df4ce}} to {{Gerrit|0e4a5b3b}}
* 04:29 TimStarling: on mw2377 fiddling with CPU frequency control and doing benchmarks
* 04:09 krinkle@mwmaint1002: pull aborted:  (duration: 00m 05s)
* 01:23 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32278 and previous config saved to /var/cache/conftool/dbconfig/20220804-012341-marostegui.json
* 01:08 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P32277 and previous config saved to /var/cache/conftool/dbconfig/20220804-010834-marostegui.json
* 00:53 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P32276 and previous config saved to /var/cache/conftool/dbconfig/20220804-005328-marostegui.json
* 00:38 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32275 and previous config saved to /var/cache/conftool/dbconfig/20220804-003822-marostegui.json
* 00:36 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1146:3312 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32274 and previous config saved to /var/cache/conftool/dbconfig/20220804-003611-marostegui.json
* 00:36 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance
* 00:35 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance
* 00:35 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32273 and previous config saved to /var/cache/conftool/dbconfig/20220804-003549-marostegui.json
* 00:20 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P32272 and previous config saved to /var/cache/conftool/dbconfig/20220804-002043-marostegui.json
* 00:06 mutante: gerrit - [2022-08-04 00:05:33,173] Replication to gerrit2@gerrit2002.wikimedia.org:/srv/gerrit/git/analytics/geowiki.git started.. [[phab:T313250|T313250]]
* 00:06 mutante: gerrit - [2022-08-04 00:05:33,173] Replication to gerrit2@gerrit2002.wikimedia.org:/srv/gerrit/git/analytics/geowiki.git started... [CONTEXT pushOneId="83ad5008" ]
* 00:05 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P32271 and previous config saved to /var/cache/conftool/dbconfig/20220804-000536-marostegui.json
* 00:03 mutante: gerrit - service restart to deploy config change to add second replica [[phab:T313250|T313250]]
* 00:01 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit.wikimedia.org with reason: service restart
* 00:00 dzahn@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit.wikimedia.org with reason: service restart
* 00:00 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit1001.wikimedia.org with reason: service restart


== 2016-01-17 ==
== 2022-08-03 ==
* 04:58 YuviPanda: started restbase on restbase1002
* 23:59 dzahn@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit1001.wikimedia.org with reason: service restart
* 02:53 logmsgbot: l10nupdate@tin ResourceLoader cache refresh completed at Sun Jan 17 02:53:19 UTC 2016 (duration 6m 59s)
* 23:50 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32270 and previous config saved to /var/cache/conftool/dbconfig/20220803-235030-marostegui.json
* 02:46 logmsgbot: mwdeploy@tin sync-l10n completed (1.27.0-wmf.10) (duration: 08m 53s)
* 22:50 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1170:3312 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32269 and previous config saved to /var/cache/conftool/dbconfig/20220803-225015-marostegui.json
* 02:26 logmsgbot: mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 10m 41s)
* 22:50 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
* 01:47 paravoid: restarting HHVM on mw1120, mw1125, mw1127, mw1132, mw1148; OOM
* 22:49 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
* 22:49 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 9 hosts with reason: Maintenance
* 22:49 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on 9 hosts with reason: Maintenance
* 22:49 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance
* 22:49 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance
* 22:48 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
* 22:48 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
* 22:48 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32268 and previous config saved to /var/cache/conftool/dbconfig/20220803-224827-marostegui.json
* 22:33 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P32267 and previous config saved to /var/cache/conftool/dbconfig/20220803-223321-marostegui.json
* 22:18 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P32266 and previous config saved to /var/cache/conftool/dbconfig/20220803-221815-marostegui.json
* 22:03 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32265 and previous config saved to /var/cache/conftool/dbconfig/20220803-220309-marostegui.json
* 22:00 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32264 and previous config saved to /var/cache/conftool/dbconfig/20220803-220057-marostegui.json
* 22:00 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
* 22:00 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
* 22:00 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance
* 22:00 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance
* 22:00 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32263 and previous config saved to /var/cache/conftool/dbconfig/20220803-220007-marostegui.json
* 21:45 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P32262 and previous config saved to /var/cache/conftool/dbconfig/20220803-214501-marostegui.json
* 21:44 damilare: payments-wiki updated from {{Gerrit|e1b6036a}} to {{Gerrit|712df4ce}}
* 21:37 ryankemper@cumin1001: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster plugin upgrade - ryankemper@cumin1001 - [[phab:T314078|T314078]]
* 21:35 ryankemper@deploy1002: helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
* 21:35 ryankemper@deploy1002: helmfile [eqiad] START helmfile.d/services/changeprop: apply
* 21:30 ryankemper@deploy1002: helmfile [staging] DONE helmfile.d/services/changeprop: apply
* 21:30 ryankemper@deploy1002: helmfile [staging] START helmfile.d/services/changeprop: apply
* 21:29 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P32261 and previous config saved to /var/cache/conftool/dbconfig/20220803-212955-marostegui.json
* 21:14 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32260 and previous config saved to /var/cache/conftool/dbconfig/20220803-211449-marostegui.json
* 21:12 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1105:3312 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32259 and previous config saved to /var/cache/conftool/dbconfig/20220803-211237-marostegui.json
* 21:12 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
* 21:12 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
* 21:12 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1122 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32258 and previous config saved to /var/cache/conftool/dbconfig/20220803-211216-marostegui.json
* 21:03 ejegg: updated standalone SmashPig deployment from {{Gerrit|8e8f0017}} to {{Gerrit|9b97ea15}}
* 21:02 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 21:01 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 21:01 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 21:00 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:57 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P32257 and previous config saved to /var/cache/conftool/dbconfig/20220803-205710-marostegui.json
* 20:55 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:55 ebernhardson@deploy1002: Synchronized wmf-config/CirrusSearch-production.php: Config: [[gerrit:820223{{!}}cirrus: Set ElasticaWrite partition count for cloudelastic to 3]] (duration: 03m 29s)
* 20:54 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:54 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:53 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:48 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:48 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:48 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:43 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:43 urbanecm@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/VisualEditor/includes/VisualEditorParsoidClient.php: {{Gerrit|a804fe18f1e14795ba7836d3ebf6c361bb1538a7}}: Update call to PageConfigFactory::create to use new signature ([[phab:T314523|T314523]]) (duration: 03m 25s)
* 20:42 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P32256 and previous config saved to /var/cache/conftool/dbconfig/20220803-204204-marostegui.json
* 20:39 urbanecm@deploy1002: sync-file aborted: {{Gerrit|a804fe18f1e14795ba7836d3ebf6c361bb1538a7}}: Update call to PageConfigFactory::create to use new signature ([[phab:T314523|T314523]]) (duration: 00m 00s)
* 20:38 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:36 urbanecm@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/DiscussionTools/: {{Gerrit|b840eef86837aed3e566885110e93b2ca9ab5f42}}: Fix ReplyLinksController#teardown (duration: 03m 27s)
* 20:34 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:34 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:33 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:31 urbanecm@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/CirrusSearch/: {{Gerrit|70a18f5846111a0dfe8ba473daf384cbb8e88804}}: Add explicit partitioning key to ElasticaWrite ([[phab:T314426|T314426]]) (duration: 03m 13s)
* 20:28 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:28 urbanecm@deploy1002: Synchronized php-1.39.0-wmf.22/extensions/CirrusSearch/: {{Gerrit|9961e9bc8f5873f8ddc8a11108de0a7bfcb14ae6}}: Add explicit partitioning key to ElasticaWrite ([[phab:T314426|T314426]]) (duration: 03m 23s)
* 20:28 cwhite@cumin2002: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host logstash2032.codfw.wmnet
* 20:27 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:27 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:27 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1122 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32255 and previous config saved to /var/cache/conftool/dbconfig/20220803-202658-marostegui.json
* 20:23 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:21 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1122 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32254 and previous config saved to /var/cache/conftool/dbconfig/20220803-202146-marostegui.json
* 20:21 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance
* 20:21 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance
* 20:21 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32253 and previous config saved to /var/cache/conftool/dbconfig/20220803-202125-marostegui.json
* 20:14 rzl@deploy1002: helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
* 20:13 rzl@deploy1002: helmfile [codfw] START helmfile.d/services/mobileapps: apply
* 20:13 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:12 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|195f8090b9694be65c937cea108ff4f6400972ec}}: Start writing to cuc_actor on test wikis ([[phab:T233004|T233004]]) (duration: 03m 27s)
* 20:09 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:09 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:09 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:08 cwhite@cumin2002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) logstash2032.codfw.wmnet on all recursors
* 20:08 cwhite@cumin2002: START - Cookbook sre.dns.wipe-cache logstash2032.codfw.wmnet on all recursors
* 20:08 cwhite@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 20:07 mutante: gerrit - adding second replica [[phab:T313250|T313250]]
* 20:06 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P32252 and previous config saved to /var/cache/conftool/dbconfig/20220803-200619-marostegui.json
* 20:04 cwhite@cumin2002: START - Cookbook sre.dns.netbox
* 20:03 cwhite@cumin2002: START - Cookbook sre.ganeti.makevm for new host logstash2032.codfw.wmnet
* 20:00 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes2012.codfw.wmnet
* 20:00 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for kubernetes2012.codfw.wmnet
* 20:00 rzl@deploy1002: conftool action : set/pooled=yes; selector: name=kubernetes2012.codfw.wmnet
* 19:51 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P32251 and previous config saved to /var/cache/conftool/dbconfig/20220803-195113-marostegui.json
* 19:40 ryankemper: [[phab:T314078|T314078]] Forgot to mention, restart is at `ryankemper@cumin1001` tmux session `codfw_restarts`
* 19:39 ryankemper: [[phab:T314078|T314078]] Rolling upgrade of codfw hosts; after this all of eqiad/codfw will have the new plugin version and we can resume the `search-loader` instances: `sudo -E cookbook sre.elasticsearch.rolling-operation search_codfw "codfw cluster plugin upgrade" --upgrade --nodes-per-run 3 --start-datetime 2022-08-03T19:38:10 --task-id [[phab:T314078|T314078]]`
* 19:38 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster plugin upgrade - ryankemper@cumin1001 - [[phab:T314078|T314078]]
* 19:36 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32250 and previous config saved to /var/cache/conftool/dbconfig/20220803-193607-marostegui.json
* 19:33 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32249 and previous config saved to /var/cache/conftool/dbconfig/20220803-193354-marostegui.json
* 19:33 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance
* 19:33 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance
* 19:33 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1129 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32248 and previous config saved to /var/cache/conftool/dbconfig/20220803-193334-marostegui.json
* 19:25 mutante: gerrit1001 - rsyncing /var/lib/gerrit/review_site/ over to gerrit2002 815401
* 19:18 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P32247 and previous config saved to /var/cache/conftool/dbconfig/20220803-191828-marostegui.json
* 19:03 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P32246 and previous config saved to /var/cache/conftool/dbconfig/20220803-190321-marostegui.json
* 18:56 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes2011.codfw.wmnet
* 18:56 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for kubernetes2011.codfw.wmnet
* 18:56 rzl@deploy1002: conftool action : set/pooled=yes; selector: name=kubernetes2011.codfw.wmnet
* 18:33 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2027,2037].codfw.wmnet
* 18:33 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc[2027,2037].codfw.wmnet
* 18:23 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 18:16 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 18:16 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 18:16 dancy@deploy1002: Synchronized php: group1 wikis to 1.39.0-wmf.23  refs [[phab:T308076|T308076]] (duration: 03m 37s)
* 18:15 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 18:12 dancy@deploy1002: rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.23  refs [[phab:T308076|T308076]]
* 17:58 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubestage2002.codfw.wmnet
* 17:58 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for kubestage2002.codfw.wmnet
* 17:57 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2025-2026].codfw.wmnet
* 17:57 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc[2025-2026].codfw.wmnet
* 17:57 bking@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for elastic2044.codfw.wmnet
* 17:57 bking@cumin1001: START - Cookbook sre.hosts.remove-downtime for elastic2044.codfw.wmnet
* 17:56 bking@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for elastic2043.codfw.wmnet
* 17:56 bking@cumin1001: START - Cookbook sre.hosts.remove-downtime for elastic2043.codfw.wmnet
* 17:55 ottomata: increasing partitions from 5 to 6 for *.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite topics in Kafka main-eqiad and main-codfw - [[phab:T314426|T314426]]
* 17:55 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be2055.codfw.wmnet
* 17:55 mvernon@cumin1001: START - Cookbook sre.hosts.remove-downtime for ms-be2055.codfw.wmnet
* 17:50 rzl@cumin1001: conftool action : set/pooled=yes; selector: name=kubestage2002.codfw.wmnet
* 17:38 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2008-2010].codfw.wmnet
* 17:38 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for parse[2008-2010].codfw.wmnet
* 17:23 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=restbase20[12]4.codfw.wmnet
* 17:14 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 6 hosts
* 17:14 mvernon@cumin1001: START - Cookbook sre.hosts.remove-downtime for 6 hosts
* 17:08 ryankemper: [[phab:T310145|T310145]] `elastic2031` and `wcqs2002` powered off in preparation for C1 maintenance
* 17:06 jayme@cumin1001: conftool action : set/pooled=yes; selector: name=(kubernetes2020.codfw.wmnet{{!}}kubernetes2009.codfw.wmnet{{!}}kubernetes2010.codfw.wmnet)
* 17:00 btullis@cumin1001: END (FAIL) - Cookbook sre.aqs.roll-restart (exit_code=99) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
* 16:48 Emperor: shutdown moss-fe2001.codfw.wmnet,ms-fe2011.codfw.wmnet,ms-be20[34,35,42,48,55,68].codfw.wmnet PDU work [[phab:T310145|T310145]]
* 16:47 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 8 hosts with reason: PDU work
* 16:47 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on gerrit2002.wikimedia.org with reason: in setup / flapping
* 16:47 mvernon@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 8 hosts with reason: PDU work
* 16:47 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on gerrit2002.wikimedia.org with reason: in setup / flapping
* 16:46 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be[2033,2047].codfw.wmnet,thanos-be2002.codfw.wmnet
* 16:46 mvernon@cumin1001: START - Cookbook sre.hosts.remove-downtime for ms-be[2033,2047].codfw.wmnet,thanos-be2002.codfw.wmnet
* 16:40 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2046.codfw.wmnet
* 16:40 jayme@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc2046.codfw.wmnet
* 16:39 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 10 hosts
* 16:39 jayme@cumin1001: START - Cookbook sre.hosts.remove-downtime for 10 hosts
* 16:38 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2023.codfw.wmnet
* 16:38 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc2023.codfw.wmnet
* 16:37 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on gitlab-runner2002.codfw.wmnet with reason: PDU swap
* 16:37 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 0:30:00 on gitlab-runner2002.codfw.wmnet with reason: PDU swap
* 16:35 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mc[2025-2026].codfw.wmnet with reason: PDU swap
* 16:35 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 0:30:00 on mc[2025-2026].codfw.wmnet with reason: PDU swap
* 16:32 jelto: power off mc2025-2026
* 16:31 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for rdb2008.codfw.wmnet
* 16:30 jayme@cumin1001: START - Cookbook sre.hosts.remove-downtime for rdb2008.codfw.wmnet
* 16:28 btullis@cumin1001: START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
* 16:28 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes[2009-2010,2020].codfw.wmnet
* 16:27 jayme@cumin1001: START - Cookbook sre.hosts.remove-downtime for kubernetes[2009-2010,2020].codfw.wmnet
* 16:11 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 12 hosts
* 16:11 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for 12 hosts
* 16:08 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 15 hosts
* 16:08 jayme@cumin1001: START - Cookbook sre.hosts.remove-downtime for 15 hosts
* 16:08 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs[2005-2008].codfw.wmnet
* 16:08 mvernon@cumin1001: START - Cookbook sre.hosts.remove-downtime for aqs[2005-2008].codfw.wmnet
* 15:59 Emperor: shutdown ms-be20[33,47],thanos-be2002 prior to PDU work [[phab:T310070|T310070]]
* 15:58 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be[2033,2047].codfw.wmnet,thanos-be2002.codfw.wmnet with reason: PDU work
* 15:58 mvernon@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be[2033,2047].codfw.wmnet,thanos-be2002.codfw.wmnet with reason: PDU work
* 15:52 jelto: pooling mw2259-2270 again
* 15:45 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32242 and previous config saved to /var/cache/conftool/dbconfig/20220803-154515-marostegui.json
* 15:38 vgutierrez: clearing ats-be cache on cp6008 - [[phab:T309651|T309651]]
* 15:38 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 15:38 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 15:37 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 15:37 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 15:36 elukey: powercycle kafka-logging2003 - not responsive to serial console
* 15:36 urbanecm@deploy1002: Synchronized php-1.39.0-wmf.22/extensions/GrowthExperiments/includes/NewcomerTasks/AddImage/ServiceImageRecommendationProvider.php: {{Gerrit|4438957e78e0012aff646e52dc16a4fb796cfd6b}}: ServiceImageRecommendationProvider: Add extra logging when no JSON response received ([[phab:T313973|T313973]]) (duration: 03m 04s)
* 15:35 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on maps2009.codfw.wmnet with reason: PDU maintenance
* 15:35 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 3:00:00 on maps2009.codfw.wmnet with reason: PDU maintenance
* 15:34 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=maps2009.codfw.wmnet
* 15:32 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on restbase2024.codfw.wmnet with reason: PDU maintenance
* 15:32 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on restbase2024.codfw.wmnet with reason: PDU maintenance
* 15:32 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase2024.codfw.wmnet
* 15:30 vgutierrez: clearing ats-be cache on cp6016 - [[phab:T309651|T309651]]
* 15:30 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P32241 and previous config saved to /var/cache/conftool/dbconfig/20220803-153009-marostegui.json
* 15:24 jayme@cumin1001: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) _etcd._tcp.eqsin.wmnet on all recursors
* 15:24 jayme@cumin1001: START - Cookbook sre.dns.wipe-cache _etcd._tcp.eqsin.wmnet on all recursors
* 15:24 jayme@cumin1001: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) _etcd._tcp.ulsfo.wmnet on all recursors
* 15:24 jayme@cumin1001: START - Cookbook sre.dns.wipe-cache _etcd._tcp.ulsfo.wmnet on all recursors
* 15:24 jayme@cumin1001: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) _etcd._tcp.codfw.wmnet on all recursors
* 15:24 jayme@cumin1001: START - Cookbook sre.dns.wipe-cache _etcd._tcp.codfw.wmnet on all recursors
* 15:21 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=restbase2021.codfw.wmnet
* 15:19 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2030.codfw.wmnet with reason: [[phab:T310070|T310070]]
* 15:19 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2030.codfw.wmnet with reason: [[phab:T310070|T310070]]
* 15:15 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P32240 and previous config saved to /var/cache/conftool/dbconfig/20220803-151502-marostegui.json
* 15:10 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for conf2004.codfw.wmnet
* 15:10 jayme@cumin1001: START - Cookbook sre.hosts.remove-downtime for conf2004.codfw.wmnet
* 15:04 jelto: power off mc2023
* 14:59 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32239 and previous config saved to /var/cache/conftool/dbconfig/20220803-145956-marostegui.json
* 14:59 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mc2023.codfw.wmnet with reason: PDU swap
* 14:59 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 0:30:00 on mc2023.codfw.wmnet with reason: PDU swap
* 14:58 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32238 and previous config saved to /var/cache/conftool/dbconfig/20220803-145849-marostegui.json
* 14:58 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1172.eqiad.wmnet with reason: Maintenance
* 14:58 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1172.eqiad.wmnet with reason: Maintenance
* 14:58 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1109 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32237 and previous config saved to /var/cache/conftool/dbconfig/20220803-145828-marostegui.json
* 14:56 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 14:56 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 14:56 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 14:53 dancy@deploy1002: Pruned MediaWiki: 1.39.0-wmf.19 (duration: 05m 37s)
* 14:51 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 14:47 dancy@deploy1002: Pruned MediaWiki: 1.39.0-wmf.21 (duration: 06m 13s)
* 14:46 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 14:46 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 14:46 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 14:45 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 14:43 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1109', diff saved to https://phabricator.wikimedia.org/P32236 and previous config saved to /var/cache/conftool/dbconfig/20220803-144322-marostegui.json
* 14:34 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2029.codfw.wmnet with reason: [[phab:T310070|T310070]]
* 14:33 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2029.codfw.wmnet with reason: [[phab:T310070|T310070]]
* 14:32 Emperor: shutdown aqs200[5-8] prior to PDU work [[phab:T310070|T310070]]
* 14:31 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on aqs[2005-2008].codfw.wmnet with reason: PDU work
* 14:31 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on thumbor[2003-2004].codfw.wmnet with reason: PDU swap
* 14:31 mvernon@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on aqs[2005-2008].codfw.wmnet with reason: PDU work
* 14:31 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 0:30:00 on thumbor[2003-2004].codfw.wmnet with reason: PDU swap
* 14:28 jelto: power off thumbor2003 and thumbor2004
* 14:28 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1109', diff saved to https://phabricator.wikimedia.org/P32235 and previous config saved to /var/cache/conftool/dbconfig/20220803-142816-marostegui.json
* 14:27 moritzm: upgrading ganeti/esams to Ganeti 3.0.2 [[phab:T312637|T312637]]
* 14:13 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1109 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32234 and previous config saved to /var/cache/conftool/dbconfig/20220803-141310-marostegui.json
* 14:11 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1109 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32233 and previous config saved to /var/cache/conftool/dbconfig/20220803-141103-marostegui.json
* 14:10 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1109.eqiad.wmnet with reason: Maintenance
* 14:10 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1109.eqiad.wmnet with reason: Maintenance
* 14:10 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32232 and previous config saved to /var/cache/conftool/dbconfig/20220803-141042-marostegui.json
* 14:06 moritzm: installing freetype security updates on bullseye
* 13:57 cdanis: ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕙☕ sudo cumin 'P<nowiki>{</nowiki>R:Class = Confd<nowiki>}</nowiki>' 'systemctl restart confd'
* 13:55 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P32231 and previous config saved to /var/cache/conftool/dbconfig/20220803-135536-marostegui.json
* 13:46 cdanis: ✔️ cdanis@deploy1002.eqiad.wmnet ~ 🕙☕ sudo systemctl restart confd
* 13:40 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P32230 and previous config saved to /var/cache/conftool/dbconfig/20220803-134030-marostegui.json
* 13:30 moritzm: installing Java 8 security updates for Buster
* 13:25 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32229 and previous config saved to /var/cache/conftool/dbconfig/20220803-132524-marostegui.json
* 13:24 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:23 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:23 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:19 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1099:3318 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32228 and previous config saved to /var/cache/conftool/dbconfig/20220803-131916-marostegui.json
* 13:19 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance
* 13:19 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance
* 13:18 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1114 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32227 and previous config saved to /var/cache/conftool/dbconfig/20220803-131855-marostegui.json
* 13:18 sukhe: depool codfw for PDU upgrade: CR 819798
* 13:16 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:16 urbanecm@deploy1002: Synchronized wmf-config/MetaContactPages.php: {{Gerrit|f89f02e306a1fa580fa41ba56de978f4208ea672}}: Amend license request contact form per Legal ([[phab:T303359|T303359]]) (duration: 09m 27s)
* 13:12 jbond: introduce puppetmaster[12]004 for now as offline
* 13:11 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:09 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on kafka-logging2003.codfw.wmnet with reason: pdu
* 13:09 root@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on kafka-logging2003.codfw.wmnet with reason: pdu
* 13:07 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:07 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:06 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:05 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2044.codfw.wmnet with reason: [[phab:T310070|T310070]]
* 13:05 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2044.codfw.wmnet with reason: [[phab:T310070|T310070]]
* 13:04 pt1979@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:03 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P32226 and previous config saved to /var/cache/conftool/dbconfig/20220803-130348-marostegui.json
* 12:59 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2043.codfw.wmnet with reason: [[phab:T310070|T310070]]
* 12:59 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2043.codfw.wmnet with reason: [[phab:T310070|T310070]]
* 12:56 pt1979@cumin1001: START - Cookbook sre.dns.netbox
* 12:48 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P32224 and previous config saved to /var/cache/conftool/dbconfig/20220803-124842-marostegui.json
* 12:40 moritzm: uploaded openjdk-8 8u342-b07-1~deb10u1  to component/jdk8 for buster-wikimedia (rebuild of latest Java 8 security update)
* 12:36 oblivian@deploy1002: helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
* 12:36 oblivian@deploy1002: helmfile [codfw] START helmfile.d/services/mobileapps: apply
* 12:33 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1114 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32223 and previous config saved to /var/cache/conftool/dbconfig/20220803-123336-marostegui.json
* 12:29 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1114 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32222 and previous config saved to /var/cache/conftool/dbconfig/20220803-122929-marostegui.json
* 12:29 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1114.eqiad.wmnet with reason: Maintenance
* 12:28 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1114.eqiad.wmnet with reason: Maintenance
* 12:28 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1116.eqiad.wmnet with reason: Maintenance
* 12:28 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1116.eqiad.wmnet with reason: Maintenance
* 12:28 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32221 and previous config saved to /var/cache/conftool/dbconfig/20220803-122819-marostegui.json
* 12:16 ebysans@deploy1002: Finished deploy [airflow-dags/analytics@614f7b2]: (no justification provided) (duration: 00m 11s)
* 12:16 ebysans@deploy1002: Started deploy [airflow-dags/analytics@614f7b2]: (no justification provided)
* 12:13 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P32220 and previous config saved to /var/cache/conftool/dbconfig/20220803-121313-marostegui.json
* 11:58 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P32219 and previous config saved to /var/cache/conftool/dbconfig/20220803-115807-marostegui.json
* 11:57 marostegui@cumin1001: dbctl commit (dc=all): 'Add db2176 to s1 [[phab:T311494|T311494]]', diff saved to https://phabricator.wikimedia.org/P32218 and previous config saved to /var/cache/conftool/dbconfig/20220803-115706-marostegui.json
* 11:49 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cumin2002.codfw.wmnet with reason: PDU maintenance, [[phab:T310145|T310145]]
* 11:49 root@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cumin2002.codfw.wmnet with reason: PDU maintenance, [[phab:T310145|T310145]]
* 11:46 jayme@cumin1001: conftool action : set/weight=10; selector: name=(kubernetes2019.codfw.wmnet{{!}}kubernetes2021.codfw.wmnet{{!}}kubernetes2022.codfw.wmnet{{!}}kubernetes2018.codfw.wmnet{{!}}kubernetes2020.codfw.wmnet)
* 11:43 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32217 and previous config saved to /var/cache/conftool/dbconfig/20220803-114301-marostegui.json
* 11:41 jayme@cumin1001: conftool action : set/pooled=inactive; selector: name=(kubernetes2020.codfw.wmnet{{!}}kubernetes2009.codfw.wmnet{{!}}kubernetes2010.codfw.wmnet{{!}}kubernetes2011.codfw.wmnet{{!}}kubernetes2012.codfw.wmnet{{!}}kubestage2002.codfw.wmnet)
* 11:38 hnowlan@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host restbase2022.codfw.wmnet
* 11:37 hnowlan@cumin1001: START - Cookbook sre.hosts.reboot-single for host restbase2022.codfw.wmnet
* 11:35 jbond@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 11:32 jbond@cumin2002: START - Cookbook sre.dns.netbox
* 11:26 oblivian@puppetmaster1001: conftool action : set/pooled=false; selector: name=codfw,dnsdisc=wdqs
* 11:22 oblivian@puppetmaster1001: conftool action : set/pooled=true; selector: name=codfw,dnsdisc=kartotherian
* 11:22 oblivian@puppetmaster1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=restbase-backend
* 11:21 oblivian@puppetmaster1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=restbase-async
* 11:17 _joe_: depooling codfw services from all traffic
* 10:54 jmm@cumin2002: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2011.codfw.wmnet to cluster codfw and group C
* 10:53 jmm@cumin2002: START - Cookbook sre.ganeti.addnode for new host ganeti2011.codfw.wmnet to cluster codfw and group C
* 10:50 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2011.codfw.wmnet
* 10:47 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on kubestage2002.codfw.wmnet with reason: PDU swap
* 10:46 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on kubestage2002.codfw.wmnet with reason: PDU swap
* 10:42 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32216 and previous config saved to /var/cache/conftool/dbconfig/20220803-104246-marostegui.json
* 10:42 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1177.eqiad.wmnet with reason: Maintenance
* 10:42 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1177.eqiad.wmnet with reason: Maintenance
* 10:42 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1126 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32215 and previous config saved to /var/cache/conftool/dbconfig/20220803-104224-marostegui.json
* 10:41 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host ganeti2011.codfw.wmnet
* 10:40 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase201[45].codfw.wmnet
* 10:38 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase2022.codfw.wmnet
* 10:38 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on restbase[2014-2015,2021-2022].codfw.wmnet with reason: PDU maintenance
* 10:38 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on restbase[2014-2015,2021-2022].codfw.wmnet with reason: PDU maintenance
* 10:37 jelto: shutdown kubestage2002 kubernetes2020 kubernetes2009 kubernetes2010 kubernetes2011 kubernetes2012
* 10:30 oblivian@cumin1001: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) proton.discovery.wmnet on all recursors
* 10:30 oblivian@cumin1001: START - Cookbook sre.dns.wipe-cache proton.discovery.wmnet on all recursors
* 10:29 oblivian@cumin1001: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) mathoid.discovery.wmnet on all recursors
* 10:29 oblivian@cumin1001: START - Cookbook sre.dns.wipe-cache mathoid.discovery.wmnet on all recursors
* 10:27 oblivian@cumin1001: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) proton.discovery.wmnet on all recursors
* 10:27 oblivian@cumin1001: START - Cookbook sre.dns.wipe-cache proton.discovery.wmnet on all recursors
* 10:27 oblivian@cumin1001: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) mathoid.discovery.wmnet on all recursors
* 10:27 oblivian@cumin1001: START - Cookbook sre.dns.wipe-cache mathoid.discovery.wmnet on all recursors
* 10:27 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P32213 and previous config saved to /var/cache/conftool/dbconfig/20220803-102718-marostegui.json
* 10:23 jelto@cumin1001: conftool action : set/pooled=no; selector: name=kubernetes2012.codfw.wmnet
* 10:23 jelto@cumin1001: conftool action : set/pooled=no; selector: name=kubernetes2011.codfw.wmnet
* 10:22 jelto@cumin1001: conftool action : set/pooled=no; selector: name=kubernetes2010.codfw.wmnet
* 10:22 jelto@cumin1001: conftool action : set/pooled=no; selector: name=kubernetes2009.codfw.wmnet
* 10:22 jelto@cumin1001: conftool action : set/pooled=no; selector: name=kubernetes2020.codfw.wmnet
* 10:20 jelto@cumin1001: conftool action : set/pooled=no; selector: name=kubestage2002.codfw.wmnet
* 10:14 oblivian@cumin1001: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) proton.discovery.wmnet on all recursors
* 10:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2011.codfw.wmnet with OS bullseye
* 10:14 oblivian@cumin1001: START - Cookbook sre.dns.wipe-cache proton.discovery.wmnet on all recursors
* 10:14 oblivian@cumin1001: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) mathoid.discovery.wmnet on all recursors
* 10:14 oblivian@cumin1001: START - Cookbook sre.dns.wipe-cache mathoid.discovery.wmnet on all recursors
* 10:12 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P32212 and previous config saved to /var/cache/conftool/dbconfig/20220803-101212-marostegui.json
* 09:57 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1126 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32211 and previous config saved to /var/cache/conftool/dbconfig/20220803-095706-marostegui.json
* 09:56 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2011.codfw.wmnet with reason: host reimage
* 09:56 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase2021.codfw.wmnet
* 09:56 jelto: kubectl drain --ignore-daemonsets --delete-local-data kubernetes2012.codfw.wmnet
* 09:56 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1126 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32210 and previous config saved to /var/cache/conftool/dbconfig/20220803-095559-marostegui.json
* 09:55 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1126.eqiad.wmnet with reason: Maintenance
* 09:55 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1126.eqiad.wmnet with reason: Maintenance
* 09:55 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32209 and previous config saved to /var/cache/conftool/dbconfig/20220803-095538-marostegui.json
* 09:55 hnowlan@puppetmaster1001: conftool action : set/weight=10; selector: name=restbase2027.codfw.wmnet
* 09:54 jelto: kubectl drain --ignore-daemonsets --delete-local-data kubernetes2011.codfw.wmnet
* 09:54 oblivian@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
* 09:54 oblivian@cumin1001: START - Cookbook sre.discovery.service-route
* 09:54 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2011.codfw.wmnet with reason: host reimage
* 09:52 jelto: kubectl drain --ignore-daemonsets --delete-local-data kubernetes2010.codfw.wmnet
* 09:50 jelto: kubectl drain --ignore-daemonsets --delete-local-data kubernetes2009.codfw.wmnet
* 09:49 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 49 hosts with reason: PDU swap
* 09:48 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 49 hosts with reason: PDU swap
* 09:47 jelto: kubectl drain --ignore-daemonsets kubernetes2020.codfw.wmnet
* 09:46 jelto: kubectl cordon kubernetes2020.codfw.wmnet kubernetes2009.codfw.wmnet kubernetes2010.codfw.wmnet kubernetes2011.codfw.wmnet kubernetes2012.codfw.wmnet
* 09:43 jelto: kubectl drain --ignore-daemonsets kubestage2002.codfw.wmnet
* 09:43 vgutierrez: rolling restart of pybal in codfw lvs instances - [[phab:T310070|T310070]]
* 09:42 jelto: kubectl cordon kubestage2002
* 09:40 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P32208 and previous config saved to /var/cache/conftool/dbconfig/20220803-094032-marostegui.json
* 09:35 jmm@cumin2002: START - Cookbook sre.hosts.reimage for host ganeti2011.codfw.wmnet with OS bullseye
* 09:34 ebysans@deploy1002: Finished deploy [airflow-dags/analytics@674bb8b]: (no justification provided) (duration: 00m 10s)
* 09:33 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2090.codfw.wmnet
* 09:33 marostegui@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 09:33 ebysans@deploy1002: Started deploy [airflow-dags/analytics@674bb8b]: (no justification provided)
* 09:33 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2011.codfw.wmnet with reason: Remove node for eventual reimage, [[phab:T311686|T311686]]
* 09:32 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2011.codfw.wmnet with reason: Remove node for eventual reimage, [[phab:T311686|T311686]]
* 09:29 marostegui@cumin1001: START - Cookbook sre.dns.netbox
* 09:25 marostegui@cumin1001: START - Cookbook sre.hosts.decommission for hosts db2090.codfw.wmnet
* 09:25 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P32207 and previous config saved to /var/cache/conftool/dbconfig/20220803-092525-marostegui.json
* 09:24 oblivian@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
* 09:24 oblivian@cumin1001: START - Cookbook sre.discovery.service-route
* 09:24 oblivian@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
* 09:24 oblivian@cumin1001: START - Cookbook sre.discovery.service-route
* 09:23 oblivian@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
* 09:23 oblivian@cumin1001: START - Cookbook sre.discovery.service-route
* 09:22 oblivian@cumin1001: END (FAIL) - Cookbook sre.discovery.service-route (exit_code=99)
* 09:22 oblivian@cumin1001: START - Cookbook sre.discovery.service-route
* 09:20 marostegui@cumin1001: dbctl commit (dc=all): 'Remove db2090 from dbctl [[phab:T314109|T314109]]', diff saved to https://phabricator.wikimedia.org/P32206 and previous config saved to /var/cache/conftool/dbconfig/20220803-092053-marostegui.json
* 09:20 jelto@cumin1001: START - Cookbook sre.hosts.reboot-single for host mc2024.codfw.wmnet
* 09:15 jelto: power on mc2024
* 09:10 XioNoX: configure BGP on the esams-drmrs link - [[phab:T307221|T307221]]
* 09:10 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32205 and previous config saved to /var/cache/conftool/dbconfig/20220803-091019-marostegui.json
* 09:09 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1101:3318 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32204 and previous config saved to /var/cache/conftool/dbconfig/20220803-090912-marostegui.json
* 09:09 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
* 09:08 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2031.codfw.wmnet
* 09:08 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
* 09:08 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
* 09:08 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
* 09:08 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32203 and previous config saved to /var/cache/conftool/dbconfig/20220803-090836-marostegui.json
* 09:07 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2032.codfw.wmnet
* 09:06 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2042.codfw.wmnet
* 09:05 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2043.codfw.wmnet
* 09:04 jynus: stop backup2006 backup2009 for [[phab:T310070|T310070]]
* 09:00 jelto@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host mc2024.codfw.wmnet
* 09:00 jelto@cumin1001: START - Cookbook sre.hosts.reboot-single for host mc2024.codfw.wmnet
* 08:59 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp2031.codfw.wmnet
* 08:59 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp2032.codfw.wmnet
* 08:58 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host mc2042.codfw.wmnet
* 08:58 jelto@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host mc2024.codfw.wmnet
* 08:58 jelto@cumin1001: START - Cookbook sre.hosts.reboot-single for host mc2024.codfw.wmnet
* 08:57 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host mc2043.codfw.wmnet
* 08:57 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2041.codfw.wmnet
* 08:54 XioNoX: put the esams-drmrs link in service - [[phab:T307221|T307221]]
* 08:53 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P32202 and previous config saved to /var/cache/conftool/dbconfig/20220803-085330-marostegui.json
* 08:53 ayounsi@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 08:51 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single for host mc2041.codfw.wmnet
* 08:49 ayounsi@cumin1001: START - Cookbook sre.dns.netbox
* 08:47 ayounsi@cumin1001: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
* 08:41 ayounsi@cumin1001: START - Cookbook sre.dns.netbox
* 08:38 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P32201 and previous config saved to /var/cache/conftool/dbconfig/20220803-083824-marostegui.json
* 08:23 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32200 and previous config saved to /var/cache/conftool/dbconfig/20220803-082318-marostegui.json
* 08:19 jynus: stop db2098 for [[phab:T310070|T310070]]
* 08:17 oblivian@puppetmaster1001: conftool action : set/pooled=true; selector: dnsdisc=(appservers{{!}}api)-ro,name=codfw
* 08:15 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2072.codfw.wmnet
* 08:15 marostegui@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 07:54 marostegui@cumin1001: START - Cookbook sre.dns.netbox
* 07:49 marostegui@cumin1001: START - Cookbook sre.hosts.decommission for hosts db2072.codfw.wmnet
* 07:48 marostegui@cumin1001: dbctl commit (dc=all): 'Remove db2072 from dbctl [[phab:T313911|T313911]]', diff saved to https://phabricator.wikimedia.org/P32199 and previous config saved to /var/cache/conftool/dbconfig/20220803-074806-marostegui.json
* 07:23 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 07:22 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32197 and previous config saved to /var/cache/conftool/dbconfig/20220803-072253-marostegui.json
* 07:22 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
* 07:22 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
* 07:22 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1167.eqiad.wmnet with reason: Maintenance
* 07:22 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1167.eqiad.wmnet with reason: Maintenance
* 07:22 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32196 and previous config saved to /var/cache/conftool/dbconfig/20220803-072214-marostegui.json
* 07:19 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 14 hosts with reason: codfw pdu maintenance
* 07:19 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 14 hosts with reason: codfw pdu maintenance
* 07:18 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2134,2160].codfw.wmnet with reason: codfw pdu maintenance
* 07:18 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2134,2160].codfw.wmnet with reason: codfw pdu maintenance
* 07:18 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 14 hosts with reason: codfw pdu maintenance
* 07:17 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 14 hosts with reason: codfw pdu maintenance
* 07:17 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es[2020-2022].codfw.wmnet with reason: codfw pdu maintenance
* 07:17 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es[2020-2022].codfw.wmnet with reason: codfw pdu maintenance
* 07:17 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 10 hosts with reason: codfw pdu maintenance
* 07:16 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 10 hosts with reason: codfw pdu maintenance
* 07:16 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 07:16 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2096,2101,2115,2131].codfw.wmnet with reason: codfw pdu maintenance
* 07:16 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 07:16 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2096,2101,2115,2131].codfw.wmnet with reason: codfw pdu maintenance
* 07:11 kartik@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:819227{{!}}CX: Set MT threshold for publishing in Armenian WP to 80% (T313208)]] (duration: 03m 49s)
* 07:09 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 07:07 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P32195 and previous config saved to /var/cache/conftool/dbconfig/20220803-070708-marostegui.json
* 07:05 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd2002.codfw.wmnet with reason: Switch instance to plain disks, [[phab:T311686|T311686]]
* 07:05 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd2002.codfw.wmnet with reason: Switch instance to plain disks, [[phab:T311686|T311686]]
* 07:00 moritzm: draining ganeti2011 [[phab:T311686|T311686]]
* 06:56 godog: grow sda/sdb 3 by 100G on thanos-be2003 - [[phab:T314275|T314275]]
* 06:56 godog: grow sda/sdb 3 by 100G on thanos-be1002 - [[phab:T314275|T314275]]
* 06:52 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P32194 and previous config saved to /var/cache/conftool/dbconfig/20220803-065202-marostegui.json
* 06:46 godog: power up centrallog2002 and prometheus2005 - [[phab:T310070|T310070]]
* 06:38 jmm@cumin2002: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2013.codfw.wmnet to cluster codfw and group C
* 06:37 jmm@cumin2002: START - Cookbook sre.ganeti.addnode for new host ganeti2013.codfw.wmnet to cluster codfw and group C
* 06:36 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32193 and previous config saved to /var/cache/conftool/dbconfig/20220803-063656-marostegui.json
* 06:31 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32192 and previous config saved to /var/cache/conftool/dbconfig/20220803-063148-marostegui.json
* 06:31 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1178.eqiad.wmnet with reason: Maintenance
* 06:31 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1178.eqiad.wmnet with reason: Maintenance
* 06:31 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 13 hosts with reason: Maintenance
* 06:31 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on 13 hosts with reason: Maintenance
* 06:31 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2161.codfw.wmnet with reason: Maintenance
* 06:30 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db2161.codfw.wmnet with reason: Maintenance
* 06:30 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1111 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32191 and previous config saved to /var/cache/conftool/dbconfig/20220803-063045-marostegui.json
* 06:15 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P32190 and previous config saved to /var/cache/conftool/dbconfig/20220803-061538-marostegui.json
* 06:00 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P32189 and previous config saved to /var/cache/conftool/dbconfig/20220803-060032-marostegui.json
* 05:45 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1111 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32188 and previous config saved to /var/cache/conftool/dbconfig/20220803-054526-marostegui.json
* 05:41 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1111 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32187 and previous config saved to /var/cache/conftool/dbconfig/20220803-054106-marostegui.json
* 05:41 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1111.eqiad.wmnet with reason: Maintenance
* 05:40 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1111.eqiad.wmnet with reason: Maintenance
* 05:40 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
* 05:40 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance


== 2016-01-16 ==
== 2022-08-02 ==
* 19:52 andrewbogott: renaming and reimaging labcontrol2001 -> labtestweb2001
* 22:39 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 15:57 milimetric: piwik is taking events on bohrium but the interface can't complete the queries to load because there's too much data. Mysql is maxing the CPU but it seems ok for now, will check again Monday.
* 22:32 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 15:22 milimetric: restarted mysql on bohrium because it had stopped working (probably due to piwik performance problems)
* 22:32 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 03:02 logmsgbot: l10nupdate@tin ResourceLoader cache refresh completed at Sat Jan 16 03:02:21 UTC 2016 (duration 6m 57s)
* 22:25 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 02:55 logmsgbot: mwdeploy@tin sync-l10n completed (1.27.0-wmf.10) (duration: 08m 35s)
* 22:15 mutante: gerrit - syncing data (/srv/gerrit /var/lib/gerrit2/review_site  /home) again after gerrit2002 was reimaged with buster [[phab:T313250|T313250]] [[phab:T313972|T313972]]
* 02:35 logmsgbot: mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 18m 55s)
* 22:04 dancy@deploy1002: Finished deploy [gerrit/gerrit@94c5028]: (no justification provided) (duration: 00m 06s)
* 22:04 dancy@deploy1002: Started deploy [gerrit/gerrit@94c5028]: (no justification provided)
* 22:00 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 21:59 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 21:59 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 21:58 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 21:58 dancy@deploy1002: rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.23  refs [[phab:T308076|T308076]]
* 21:53 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 21:47 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 21:46 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 21:40 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 21:35 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 21:29 ladsgroup@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/CirrusSearch/includes/Sanity/Checker.php: Backport: [[gerrit:819621{{!}}Fix appending of join conds (T312421 T314439)]] (duration: 03m 15s)
* 21:28 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 21:28 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 21:27 bking@cumin1001: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: deploy wmf-elasticsearch-search-plugins pkg - bking@cumin1001 - [[phab:T314078|T314078]]
* 21:21 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 21:11 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gerrit2002.wikimedia.org with OS buster
* 21:01 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 21:01 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 21:00 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:59 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:58 dancy@deploy1002: rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.22  refs [[phab:T308076|T308076]]
* 20:54 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:53 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:53 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:53 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gerrit2002.wikimedia.org with reason: host reimage
* 20:52 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:51 dzahn@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on gerrit2002.wikimedia.org with reason: host reimage
* 20:50 dancy@deploy1002: rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.23  refs [[phab:T308076|T308076]]
* 20:38 mutante: re-imaging gerrit2002 with buster because it's on bullseye: it needs git-fat, which has not been ported to python3 yet, and that blocks upgrading gerrit machines [[phab:T313250|T313250]] [[phab:T243027|T243027]] [[phab:T279509|T279509]]
* 20:37 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:36 dzahn@cumin2002: START - Cookbook sre.hosts.reimage for host gerrit2002.wikimedia.org with OS buster
* 20:36 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:36 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:36 urbanecm: UTC evening B&C window done
* 20:35 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:33 urbanecm@deploy1002: Synchronized php-1.39.0-wmf.23/includes/Rest/Handler/HTMLTransformInput.php: {{Gerrit|69e91528a5c6f372af520307dc2f4227b9981442}}: ParsoidHandler: fix page bundle input with no orig HTML (duration: 03m 22s)
* 20:30 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:29 urbanecm@deploy1002: Synchronized php-1.39.0-wmf.23/includes/Rest/Handler/ParsoidHandler.php: {{Gerrit|322a960e3777bc01fa8823908340c36e3851a648}}: ParsoidHandler: pass metrics object to HTMLTransformInput (duration: 03m 19s)
* 20:27 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:27 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:22 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:20 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|5fac0aaf8e76a6f8cc3302771eac068e4f866e5f}}: GrowthExperiments: Remove wgGEHomepageTutorialTitle (duration: 03m 26s)
* 20:06 dancy@deploy1002: Finished scap: Backport for [[gerrit:819612]] Revert "Bump wikimedia/parsoid to 0.16.0-a18" (duration: 11m 30s)
* 20:01 dancy@deploy1002: Finished deploy [gerrit/gerrit@94c5028]: (no justification provided) (duration: 00m 05s)
* 20:01 dancy@deploy1002: Started deploy [gerrit/gerrit@94c5028]: (no justification provided)
* 19:59 dancy@deploy1002: Finished deploy [gerrit/gerrit@94c5028]: (no justification provided) (duration: 00m 01s)
* 19:59 dancy@deploy1002: Started deploy [gerrit/gerrit@94c5028]: (no justification provided)
* 19:55 dancy@deploy1002: Started scap: Backport for [[gerrit:819612]] Revert "Bump wikimedia/parsoid to 0.16.0-a18"
* 19:42 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 19:37 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2034.codfw.wmnet,service=ats-tls
* 19:37 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2034.codfw.wmnet,service=varnish-fe
* 19:37 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2034.codfw.wmnet,service=ats-be
* 19:36 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 19:36 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 19:36 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2033.codfw.wmnet,service=ats-tls
* 19:36 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2033.codfw.wmnet,service=varnish-fe
* 19:36 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2033.codfw.wmnet,service=ats-be
* 19:36 mvernon@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be[2041,2046].codfw.wmnet
* 19:35 mvernon@cumin2002: START - Cookbook sre.hosts.remove-downtime for ms-be[2041,2046].codfw.wmnet
* 19:29 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 19:28 mvernon@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for thanos-fe2002.codfw.wmnet
* 19:28 mvernon@cumin2002: START - Cookbook sre.hosts.remove-downtime for thanos-fe2002.codfw.wmnet
* 19:26 mvernon@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-fe2010.codfw.wmnet
* 19:26 mvernon@cumin2002: START - Cookbook sre.hosts.remove-downtime for ms-fe2010.codfw.wmnet
* 19:21 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2032.codfw.wmnet,service=ats-tls
* 19:21 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2032.codfw.wmnet,service=varnish-fe
* 19:21 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2032.codfw.wmnet,service=ats-be
* 19:17 rzl@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on mc2038.codfw.wmnet with reason: install
* 19:17 rzl@cumin2002: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on mc2038.codfw.wmnet with reason: install
* 19:13 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2031.codfw.wmnet,service=ats-tls
* 19:13 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2031.codfw.wmnet,service=varnish-fe
* 19:13 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2031.codfw.wmnet,service=ats-be
* 19:11 mutante: gerrit1001 - rsyncing /home/ to gerrit2002:/srv/home-gerrit1001.wikimedia.org [[phab:T313250|T313250]]
* 19:01 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on gerrit2002.wikimedia.org with reason: new machine
* 19:01 dzahn@cumin2002: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on gerrit2002.wikimedia.org with reason: new machine
* 18:55 dancy@deploy1002: Finished scap: testwikis wikis to 1.39.0-wmf.23  refs [[phab:T308076|T308076]] (duration: 50m 39s)
* 18:54 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 18:52 ejegg: updated payments-wiki from {{Gerrit|589bb64e}} to {{Gerrit|e1b6036a}} (just i18n changes in extensions)
* 18:47 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 18:47 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 18:46 bking@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: deploy wmf-elasticsearch-search-plugins pkg - bking@cumin1001 - [[phab:T314078|T314078]]
* 18:46 rzl@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mc2038.codfw.wmnet with reason: install
* 18:45 rzl@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on mc2038.codfw.wmnet with reason: install
* 18:41 rzl@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2038.codfw.wmnet
* 18:41 rzl@cumin2002: START - Cookbook sre.hosts.remove-downtime for mc2038.codfw.wmnet
* 18:39 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 18:19 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 18:18 rzl@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2038.codfw.wmnet with reason: install
* 18:18 rzl@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on mc2038.codfw.wmnet with reason: install
* 18:17 rzl@cumin2002: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2038.codfw.wmnet with reason: install
* 18:17 rzl@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2038.codfw.wmnet with reason: install
* 18:17 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 18:17 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 18:16 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lvs2008.codfw.wmnet with reason: shutdown for PDU upgrade
* 18:16 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on lvs2008.codfw.wmnet with reason: shutdown for PDU upgrade
* 18:16 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 18:11 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 18:08 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 18:08 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 18:05 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 18:04 dancy@deploy1002: Started scap: testwikis wikis to 1.39.0-wmf.23  refs [[phab:T308076|T308076]]
* 17:52 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1168 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32185 and previous config saved to /var/cache/conftool/dbconfig/20220802-175233-marostegui.json
* 17:43 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool db2159', diff saved to https://phabricator.wikimedia.org/P32184 and previous config saved to /var/cache/conftool/dbconfig/20220802-174311-ladsgroup.json
* 17:37 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P32183 and previous config saved to /var/cache/conftool/dbconfig/20220802-173723-marostegui.json
* 17:35 moritzm: installing node-moment security updates
* 17:32 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic[2041-2042,2057].codfw.wmnet with reason: [[phab:T310070|T310070]]
* 17:32 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic[2041-2042,2057].codfw.wmnet with reason: [[phab:T310070|T310070]]
* 17:27 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2013.codfw.wmnet
* 17:25 moritzm: installing fribidi security updates
* 17:22 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P32182 and previous config saved to /var/cache/conftool/dbconfig/20220802-172217-marostegui.json
* 17:20 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2030.codfw.wmnet,service=ats-tls
* 17:20 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2030.codfw.wmnet,service=varnish-fe
* 17:20 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2030.codfw.wmnet,service=ats-be
* 17:18 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host ganeti2013.codfw.wmnet
* 17:07 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1168 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32181 and previous config saved to /var/cache/conftool/dbconfig/20220802-170711-marostegui.json
* 17:06 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc[2042-2043].codfw.wmnet with reason: shutdown for PDU upgrade
* 17:06 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on mc[2042-2043].codfw.wmnet with reason: shutdown for PDU upgrade
* 17:05 Emperor: ms-be20[31,32,41,46].codfw.wmnet,ms-fe2010.codfw.wmnet,thanos-fe2002.codfw.wmnet downtime for PDU work [[phab:T309957|T309957]]
* 17:05 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1168 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32180 and previous config saved to /var/cache/conftool/dbconfig/20220802-170503-marostegui.json
* 17:04 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance
* 17:04 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance
* 17:04 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
* 17:04 mvernon@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: shutdown for PDU replacement
* 17:04 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
* 17:04 mvernon@cumin2002: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: shutdown for PDU replacement
* 17:04 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance
* 17:03 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance
* 17:03 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance
* 17:03 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance
* 17:03 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1131 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32179 and previous config saved to /var/cache/conftool/dbconfig/20220802-170333-marostegui.json
* 17:01 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2029.codfw.wmnet,service=ats-tls
* 17:01 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2029.codfw.wmnet,service=varnish-fe
* 17:01 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2029.codfw.wmnet,service=ats-be
* 17:00 mvernon@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be[2030,2045,2052].codfw.wmnet
* 17:00 mvernon@cumin2002: START - Cookbook sre.hosts.remove-downtime for ms-be[2030,2045,2052].codfw.wmnet
* 16:57 btullis@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host an-airflow1004.eqiad.wmnet
* 16:54 hnowlan@deploy1002: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync
* 16:53 hnowlan@deploy1002: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync
* 16:51 hnowlan@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
* 16:49 hnowlan@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
* 16:48 hnowlan@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
* 16:48 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P32178 and previous config saved to /var/cache/conftool/dbconfig/20220802-164827-marostegui.json
* 16:38 hnowlan@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
* 16:35 hnowlan@deploy1002: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: sync
* 16:35 hnowlan@deploy1002: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: sync
* 16:33 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P32177 and previous config saved to /var/cache/conftool/dbconfig/20220802-163321-marostegui.json
* 16:29 dancy@mwmaint1002: pull aborted:  (duration: 00m 07s)
* 16:25 rzl: rzl@stat1007:~$ sudo systemctl stop wmde-analytics-daily-early # wedged, timer will restart it now with max_runtime_seconds
* 16:18 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1131 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32176 and previous config saved to /var/cache/conftool/dbconfig/20220802-161815-marostegui.json
* 16:16 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1131 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32175 and previous config saved to /var/cache/conftool/dbconfig/20220802-161607-marostegui.json
* 16:16 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance
* 16:15 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance
* 16:15 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1165 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32174 and previous config saved to /var/cache/conftool/dbconfig/20220802-161545-marostegui.json
* 16:10 btullis@cumin1001: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-airflow1004.eqiad.wmnet on all recursors
* 16:10 btullis@cumin1001: START - Cookbook sre.dns.wipe-cache an-airflow1004.eqiad.wmnet on all recursors
* 16:10 btullis@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:05 btullis@cumin1001: START - Cookbook sre.dns.netbox
* 16:05 btullis@cumin1001: START - Cookbook sre.ganeti.makevm for new host an-airflow1004.eqiad.wmnet
* 16:00 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P32173 and previous config saved to /var/cache/conftool/dbconfig/20220802-160039-marostegui.json
* 15:51 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2056.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:50 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2056.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:49 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2040.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:49 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2040.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:46 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2039.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:45 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2039.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:45 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P32172 and previous config saved to /var/cache/conftool/dbconfig/20220802-154533-marostegui.json
* 15:37 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc[2040-2041].codfw.wmnet with reason: shutdown for PDU upgrade
* 15:37 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on mc[2040-2041].codfw.wmnet with reason: shutdown for PDU upgrade
* 15:36 bking@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host elastic2037.codfw.wmnet
* 15:36 bking@cumin1001: START - Cookbook sre.hosts.reboot-single for host elastic2037.codfw.wmnet
* 15:30 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1165 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32171 and previous config saved to /var/cache/conftool/dbconfig/20220802-153027-marostegui.json
* 15:28 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1165 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32170 and previous config saved to /var/cache/conftool/dbconfig/20220802-152818-marostegui.json
* 15:28 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
* 15:27 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
* 15:27 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance
* 15:27 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance
* 15:27 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32169 and previous config saved to /var/cache/conftool/dbconfig/20220802-152740-marostegui.json
* 15:24 moritzm: installing gnupg2 security updates
* 15:15 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2024.codfw.wmnet with reason: shutdown for PDU upgrade
* 15:15 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on mc2024.codfw.wmnet with reason: shutdown for PDU upgrade
* 15:13 jbond@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetmaster1004.eqiad.wmnet with OS buster
* 15:12 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P32167 and previous config saved to /var/cache/conftool/dbconfig/20220802-151234-marostegui.json
* 15:10 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ganeti-test[2001-2003].codfw.wmnet with reason: Power down for PDU maintenance, [[phab:T310070|T310070]]
* 15:10 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 4:00:00 on ganeti-test[2001-2003].codfw.wmnet with reason: Power down for PDU maintenance, [[phab:T310070|T310070]]
* 15:08 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on thanos-be2001.codfw.wmnet with reason: pdu
* 15:08 root@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on thanos-be2001.codfw.wmnet with reason: pdu
* 15:07 mvernon@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be[2030,2045,2052].codfw.wmnet with reason: shutdown for PDU replacement
* 15:07 mvernon@cumin2002: START - Cookbook sre.hosts.downtime for 3:00:00 on ms-be[2030,2045,2052].codfw.wmnet with reason: shutdown for PDU replacement
* 15:06 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mc-gp2002.codfw.wmnet with reason: Power down for PDU maintenance, [[phab:T310070|T310070]]
* 15:06 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 4:00:00 on mc-gp2002.codfw.wmnet with reason: Power down for PDU maintenance, [[phab:T310070|T310070]]
* 15:04 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2037.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:04 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2037.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:01 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: shutdown for PDU upgrade
* 15:00 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: shutdown for PDU upgrade
* 14:59 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2025.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 14:59 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2025.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 14:58 oblivian@puppetmaster1001: conftool action : set/pooled=false; selector: dnsdisc=(appservers{{!}}api)-ro,name=codfw
* 14:57 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P32166 and previous config saved to /var/cache/conftool/dbconfig/20220802-145728-marostegui.json
* 14:54 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2060.codfw.wmnet with OS bullseye
* 14:53 jbond@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetmaster1004.eqiad.wmnet with reason: host reimage
* 14:50 moritzm: uploaded gnupg2 2.1.18-8~deb9u4+wmf1 to stretch-wikimedia
* 14:50 jbond@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on puppetmaster1004.eqiad.wmnet with reason: host reimage
* 14:42 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32164 and previous config saved to /var/cache/conftool/dbconfig/20220802-144222-marostegui.json
* 14:40 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1113:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32163 and previous config saved to /var/cache/conftool/dbconfig/20220802-144013-marostegui.json
* 14:40 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
* 14:39 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
* 14:39 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32162 and previous config saved to /var/cache/conftool/dbconfig/20220802-143952-marostegui.json
* 14:37 jbond@cumin1001: START - Cookbook sre.hosts.reimage for host puppetmaster1004.eqiad.wmnet with OS buster
* 14:32 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2060.codfw.wmnet with reason: host reimage
* 14:28 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2060.codfw.wmnet with reason: host reimage
* 14:24 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P32161 and previous config saved to /var/cache/conftool/dbconfig/20220802-142446-marostegui.json
* 14:23 Emperor: shutdown ms-be20[30,45,52] for PDU work [[phab:T309957|T309957]]
* 14:22 mvernon@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be[2030,2045,2052].codfw.wmnet with reason: shutdown for PDU replacement
* 14:21 mvernon@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on ms-be[2030,2045,2052].codfw.wmnet with reason: shutdown for PDU replacement
* 14:12 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic2060.codfw.wmnet with OS bullseye
* 14:09 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P32160 and previous config saved to /var/cache/conftool/dbconfig/20220802-140940-marostegui.json
* 14:05 jbond@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetmaster2004.codfw.wmnet with OS buster
* 14:04 godog: grow sda/sdb 3 by 100G on thanos-be1001 - [[phab:T314275|T314275]]
* 14:03 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on centrallog2002.codfw.wmnet with reason: pdu
* 14:03 root@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on centrallog2002.codfw.wmnet with reason: pdu
* 14:01 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on prometheus2005.codfw.wmnet with reason: pdu
* 14:01 root@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on prometheus2005.codfw.wmnet with reason: pdu
* 13:57 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2030.codfw.wmnet,service=ats-tls
* 13:57 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2032.codfw.wmnet,service=ats-be
* 13:57 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2031.codfw.wmnet,service=ats-be
* 13:56 godog: schedule poweroff for centrallog2002 at 16 utc - [[phab:T310070|T310070]]
* 13:54 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[3-4].codfw.wmnet,service=ats-be
* 13:54 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32159 and previous config saved to /var/cache/conftool/dbconfig/20220802-135435-marostegui.json
* 13:53 godog: depool and poweroff prometheus2005 - [[phab:T310070|T310070]]
* 13:53 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[3-4].codfw.wmnet,service=ats-tls
* 13:53 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[3-4].codfw.wmnet,service=ats-tls
* 13:53 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[3-4].codfw.wmnet,service=varnish-fe
* 13:52 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1098:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32158 and previous config saved to /var/cache/conftool/dbconfig/20220802-135226-marostegui.json
* 13:52 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
* 13:52 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[1-2].codfw.wmnet,service=ats-tls
* 13:51 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
* 13:51 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32157 and previous config saved to /var/cache/conftool/dbconfig/20220802-135155-marostegui.json
* 13:51 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[1-2].codfw.wmnet,service=ats-tls
* 13:51 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[1-2].codfw.wmnet,service=varnish-fe
* 13:50 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2030.codfw.wmnet,service=ats-be
* 13:50 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2030.codfw.wmnet,service=varnish-fe
* 13:50 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2030.codfw.wmnet,service=ats-be
* 13:50 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2029.codfw.wmnet,service=ats-tls
* 13:50 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2029.codfw.wmnet,service=varnish-fe
* 13:50 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2029.codfw.wmnet,service=ats-be
* 13:45 jbond@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetmaster2004.codfw.wmnet with reason: host reimage
* 13:42 jbond@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on puppetmaster2004.codfw.wmnet with reason: host reimage
* 13:42 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:42 Lucas_WMDE: UTC afternoon backport+config window done
* 13:41 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2013.codfw.wmnet with OS bullseye
* 13:41 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:41 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:40 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:754933{{!}}Enable usage tracking for statement for cebwiki (T296384)]] – expected to gradually increase number of wbc_entity_usage and probably recentchanges rows on cebwiki, but not too much, see task for details (duration: 03m 06s)
* 13:40 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:39 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2028.codfw.wmnet with OS bullseye
* 13:36 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P32156 and previous config saved to /var/cache/conftool/dbconfig/20220802-133648-marostegui.json
* 13:35 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:34 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/Wikibase.php: Config: [[gerrit:754937{{!}}Introduce $wmgEntityUsageModifierLimitsStatement (T296384)]] (2/2) (duration: 03m 21s)
* 13:34 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:34 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:33 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:31 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:754937{{!}}Introduce $wmgEntityUsageModifierLimitsStatement (T296384)]] (1/2) (duration: 03m 16s)
* 13:30 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ganeti2028.codfw.wmnet with reason: Power down for PDU maintenance, [[phab:T309957|T309957]]
* 13:30 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 4:00:00 on ganeti2028.codfw.wmnet with reason: Power down for PDU maintenance, [[phab:T309957|T309957]]
* 13:27 jbond@cumin2002: START - Cookbook sre.hosts.reimage for host puppetmaster2004.codfw.wmnet with OS buster
* 13:24 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2013.codfw.wmnet with reason: host reimage
* 13:24 vgutierrez: restarting ATS 9.x instances to apply https://gerrit.wikimedia.org/r/819585 - [[phab:T309651|T309651]]
* 13:23 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2028.codfw.wmnet with reason: host reimage
* 13:21 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P32155 and previous config saved to /var/cache/conftool/dbconfig/20220802-132142-marostegui.json
* 13:19 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2013.codfw.wmnet with reason: host reimage
* 13:19 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2028.codfw.wmnet with reason: host reimage
* 13:17 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:17 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:17 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:16 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:15 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|a4499e5ac23a0558bed276e2b74134590afc5c95}}: Revert "testwiki: Add mediawiki.web_ui.interactions stream" ([[phab:T314151|T314151]], [[phab:T311268|T311268]]) (duration: 03m 19s)
* 13:10 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:09 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:09 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:09 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|c2fb8a58d8f62e29a15ebee26198e79e4597d24c}}: Enable RealtimePreview on Group 0 wikis ([[phab:T314150|T314150]]) (duration: 03m 21s)
* 13:08 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:06 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32154 and previous config saved to /var/cache/conftool/dbconfig/20220802-130636-marostegui.json
* 13:04 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1096:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32153 and previous config saved to /var/cache/conftool/dbconfig/20220802-130428-marostegui.json
* 13:04 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance
* 13:04 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance
* 13:04 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
* 13:03 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
* 13:03 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1180 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32152 and previous config saved to /var/cache/conftool/dbconfig/20220802-130351-marostegui.json
* 13:02 jmm@cumin2002: START - Cookbook sre.hosts.reimage for host ganeti2013.codfw.wmnet with OS bullseye
* 13:00 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic2028.codfw.wmnet with OS bullseye
* 13:00 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2013.codfw.wmnet with reason: Remove node for eventual reimage, [[phab:T311686|T311686]]
* 12:59 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2013.codfw.wmnet with reason: Remove node for eventual reimage, [[phab:T311686|T311686]]
* 12:48 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P32151 and previous config saved to /var/cache/conftool/dbconfig/20220802-124845-marostegui.json
* 12:33 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P32150 and previous config saved to /var/cache/conftool/dbconfig/20220802-123338-marostegui.json
* 12:18 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1180 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32149 and previous config saved to /var/cache/conftool/dbconfig/20220802-121832-marostegui.json
* 12:16 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1180 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32148 and previous config saved to /var/cache/conftool/dbconfig/20220802-121624-marostegui.json
* 12:16 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance
* 12:15 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance
* 12:13 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 12:12 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 12:12 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 12:11 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 12:01 marostegui: dbmaint x1@eqiad [[phab:T314087|T314087]]
* 11:57 marostegui: dbmaint s7@eqiad [[phab:T314377|T314377]]
* 11:57 marostegui: dbmaint s3@eqiad [[phab:T314377|T314377]]
* 11:57 marostegui: dbmaint s8@eqiad [[phab:T314377|T314377]]
* 11:55 marostegui: dbmaint s8@eqiad [[phab:T314377|T314377]]
* 11:54 marostegui: dbmaint s3@eqiad [[phab:T314377|T314377]]
* 11:50 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
* 11:48 marostegui: dbmaint s7@eqiad [[phab:T314377|T314377]]
* 11:46 marostegui: dbmaint s4@eqiad [[phab:T314377|T314377]]
* 11:35 elukey: restart rsyslog on ml-serve1006
* 10:50 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on an-worker1082.eqiad.wmnet with reason: [[phab:T312626|T312626]] btullis
* 10:50 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on an-worker1082.eqiad.wmnet with reason: [[phab:T312626|T312626]] btullis
* 10:49 godog: grow sda3 by 100G on thanos-be2004 - [[phab:T314275|T314275]]
* 10:42 btullis@puppetmaster1001: conftool action : set/pooled=inactive; selector: cluster=wikireplicas-b,name=dbproxy1018.eqiad.wmnet
* 10:42 btullis@puppetmaster1001: conftool action : set/pooled=yes; selector: cluster=wikireplicas-b,name=dbproxy1019.eqiad.wmnet
* 10:35 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 10:34 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 10:34 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 10:34 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 10:33 marostegui@cumin1001: dbctl commit (dc=all): 'db1181 (re)pooling @ 100%: After restart', diff saved to https://phabricator.wikimedia.org/P32147 and previous config saved to /var/cache/conftool/dbconfig/20220802-103318-root.json
* 10:18 marostegui@cumin1001: dbctl commit (dc=all): 'db1181 (re)