You are browsing a read-only backup copy of Wikitech. The primary site can be found at wikitech.wikimedia.org

Server Admin Log: Difference between revisions

imported>Stashbot
(bd808@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'toolhub' for release 'main' .)
imported>Stashbot
(mutante: wikistats-bullseye:~$ /usr/lib/wikistats/update.php wp prefix blk ; /usr/lib/wikistats/update.php wp prefix kcg T315121)
(291 intermediate revisions by 2 users not shown)
== 2021-10-01 ==
== 2022-08-12 ==
* 23:19 bd808@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'toolhub' for release 'main' .
* 23:41 mutante: wikistats-bullseye:~$ /usr/lib/wikistats/update.php wp prefix blk ; /usr/lib/wikistats/update.php wp prefix kcg [[phab:T315121|T315121]]
* 22:27 mutante: puppetmaster2001 - systemctl reset-failed
* 23:38 mutante: [mwmaint1002:~] $ sudo systemctl start mediawiki_job_initsitestats.timer [[phab:T315121|T315121]]
* 22:16 mutante: puppetmaster2001 systemctl disable geoip_update_ipinfo.timer
* 22:14 ryankemper@cumin1001: END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 22:15 mutante: puppetmaster2001 - sudo /usr/local/bin/geoipupdate_job after adding new shell command and timer - successfully downloaded enterprise database for [[phab:T288844|T288844]]
* 21:48 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1071.eqiad.wmnet with OS bullseye
* 21:56 bd808@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'toolhub' for release 'main' .
* 21:45 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddb2002-dev.codfw.wmnet with OS bullseye
* 21:44 mutante: puppetmasters - temp. disabling puppet one more time, now for a different deploy, to fetch an additional MaxMind database - [[phab:T288844|T288844]]
* 21:27 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1071.eqiad.wmnet with reason: host reimage
* 21:19 mutante: puppetmaster2001 - puppet removed cron sync_volatile and cron sync_ca - starting and verifying new timers: 'systemctl status sync-puppet-volatile', 'systemctl status sync-puppet-ca' [[phab:T273673|T273673]]
* 21:25 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1071.eqiad.wmnet with reason: host reimage
* 21:12 mutante: puppetmaster1002, puppetmaster1003, puppetmaster2002, puppetmaster2003: re-enabled puppet, they are backends. backends don't have the sync cron/job/timer, so noop as well, just like 1004/1005/2004/2005. this just leaves the actual change on 2001 - [[phab:T273673|T273673]]
* 21:12 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1071.eqiad.wmnet with OS bullseye
* 21:07 mutante: puppetmaster1004, puppetmaster1005, puppetmaster2004, puppetmaster2005: re-enabled puppet, they are "insetup" role
* 21:10 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb2002-dev.codfw.wmnet with reason: host reimage
* 21:06 mbsantos@deploy1002: Finished deploy [kartotherian/deploy@d309a6e] (eqiad): tegola: reduce load to 50% during the weekend (duration: 00m 54s)
* 21:06 andrew@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb2002-dev.codfw.wmnet with reason: host reimage
* 21:05 mbsantos@deploy1002: Started deploy [kartotherian/deploy@d309a6e] (eqiad): tegola: reduce load to 50% during the weekend
* 21:06 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1053.eqiad.wmnet with OS bullseye
* 21:05 mutante: puppetmaster1001 - re-enabled puppet, noop as expected, the passive host pulls from the active one, so only 2001 has the cron/job/timer
* 20:50 andrew@cumin1001: START - Cookbook sre.hosts.reimage for host clouddb2002-dev.codfw.wmnet with OS bullseye
* 21:05 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:43 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1053.eqiad.wmnet with reason: host reimage
* 21:02 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:39 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1053.eqiad.wmnet with reason: host reimage
* 21:01 legoktm@deploy1002: Synchronized wmf-config/CommonSettings.php: Revert "Have PdfHandler use Shellbox on Commons for 10% of requests" (duration: 00m 59s)
* 20:24 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1053.eqiad.wmnet with OS bullseye
* 20:58 mutante: temp disabling puppet on puppetmasters - deploying gerrit:724115 (gerrit:723310) [[phab:T273673|T273673]]
* 20:12 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1048.eqiad.wmnet with OS bullseye
* 18:58 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-db1002.eqiad.wmnet with reason: REIMAGE
* 19:55 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1048.eqiad.wmnet with reason: host reimage
* 18:56 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-db1001.eqiad.wmnet with reason: REIMAGE
* 19:53 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1048.eqiad.wmnet with reason: host reimage
* 18:55 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-db1002.eqiad.wmnet with reason: REIMAGE
* 19:42 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1048.eqiad.wmnet with OS bullseye
* 18:53 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-db1001.eqiad.wmnet with reason: REIMAGE
* 19:38 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1146:3314 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32375 and previous config saved to /var/cache/conftool/dbconfig/20220812-193822-ladsgroup.json
* 18:07 robh@cumin1001: END (ERROR) - Cookbook sre.experimental.reimage (exit_code=97) for host an-db1001.eqiad.wmnet
* 19:38 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
* 18:05 robh@cumin1001: START - Cookbook sre.experimental.reimage for host an-db1001.eqiad.wmnet
* 19:38 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
* 17:58 effie: depool mw1025, mw1319, mw1312 for test
* 19:38 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1121 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32374 and previous config saved to /var/cache/conftool/dbconfig/20220812-193801-ladsgroup.json
* 16:20 dancy: testing upcoming Scap 4.0.2 release on beta
* 19:33 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1054.eqiad.wmnet with OS bullseye
* 14:04 bblack: C:envoyproxy (appservers and others): restarting envoyproxy
* 19:22 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P32373 and previous config saved to /var/cache/conftool/dbconfig/20220812-192255-ladsgroup.json
* 14:04 bblack: C:envoyproxy (appservers and others): ca-certificates updated via cumin to workaround [[phab:T292291|T292291]] issues
* 19:12 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1054.eqiad.wmnet with reason: host reimage
* 13:45 elukey@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
* 19:09 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1054.eqiad.wmnet with reason: host reimage
* 13:45 elukey@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
* 19:07 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P32372 and previous config saved to /var/cache/conftool/dbconfig/20220812-190749-ladsgroup.json
* 13:23 bblack: manually trying LE expired root workaround on mwdebug1001 with puppet disabled ...
* 18:58 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet with reason: Maint
* 13:12 gehel@cumin1001: START - Cookbook sre.wdqs.data-reload
* 18:58 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet with reason: Maint
* 13:11 gehel@cumin1001: END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97)
* 18:54 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1054.eqiad.wmnet with OS bullseye
* 13:11 gehel@cumin1001: START - Cookbook sre.wdqs.data-reload
* 18:52 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1121 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32371 and previous config saved to /var/cache/conftool/dbconfig/20220812-185243-ladsgroup.json
* 13:10 gehel@cumin1001: START - Cookbook sre.wdqs.data-reload
* 18:48 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1066.eqiad.wmnet with OS bullseye
* 11:42 jgiannelos@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' .
* 18:25 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1066.eqiad.wmnet with reason: host reimage
* 11:11 jynus: manually migrating some vms out of ganeti1009 to avoid excessive memory pressure
* 18:22 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1066.eqiad.wmnet with reason: host reimage
* 10:58 marostegui@cumin1001: dbctl commit (dc=all): 'db1164 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17413 and previous config saved to /var/cache/conftool/dbconfig/20211001-105849-root.json
* 18:08 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1066.eqiad.wmnet with OS bullseye
* 10:57 marostegui@cumin1001: dbctl commit (dc=all): 'db1177 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17412 and previous config saved to /var/cache/conftool/dbconfig/20211001-105735-root.json
* 18:00 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1064.eqiad.wmnet with OS bullseye
* 10:43 jgiannelos@deploy1002: Finished deploy [kartotherian/deploy@d4caf6d] (eqiad): Increase mirrored traffic to 100% for eqiad (duration: 00m 49s)
* 17:42 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1064.eqiad.wmnet with reason: host reimage
* 10:43 marostegui@cumin1001: dbctl commit (dc=all): 'db1164 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17411 and previous config saved to /var/cache/conftool/dbconfig/20211001-104345-root.json
* 17:39 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1064.eqiad.wmnet with reason: host reimage
* 10:43 jgiannelos@deploy1002: Started deploy [kartotherian/deploy@d4caf6d] (eqiad): Increase mirrored traffic to 100% for eqiad
* 17:24 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1064.eqiad.wmnet with OS bullseye
* 10:42 marostegui@cumin1001: dbctl commit (dc=all): 'db1177 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17410 and previous config saved to /var/cache/conftool/dbconfig/20211001-104232-root.json
* 17:21 pt1979@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts netmon2002.wikimedia.org
* 10:28 marostegui@cumin1001: dbctl commit (dc=all): 'db1164 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17409 and previous config saved to /var/cache/conftool/dbconfig/20211001-102841-root.json
* 17:21 pt1979@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts netmon2002.wikimedia.org
* 10:27 marostegui@cumin1001: dbctl commit (dc=all): 'db1177 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17408 and previous config saved to /var/cache/conftool/dbconfig/20211001-102728-root.json
* 17:19 pt1979@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netmon2002.wikimedia.org with OS bullseye
* 10:13 marostegui@cumin1001: dbctl commit (dc=all): 'db1164 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17407 and previous config saved to /var/cache/conftool/dbconfig/20211001-101338-root.json
* 17:04 pt1979@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netmon2002.wikimedia.org with reason: host reimage
* 10:12 marostegui@cumin1001: dbctl commit (dc=all): 'db1177 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17406 and previous config saved to /var/cache/conftool/dbconfig/20211001-101224-root.json
* 17:01 pt1979@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on netmon2002.wikimedia.org with reason: host reimage
* 10:00 jgiannelos@deploy1002: Finished deploy [kartotherian/deploy@c123ab9] (eqiad): Increase mirrored traffic to 80% for eqiad (duration: 00m 51s)
* 16:42 pt1979@cumin2002: START - Cookbook sre.hosts.reimage for host netmon2002.wikimedia.org with OS bullseye
* 09:59 jgiannelos@deploy1002: Started deploy [kartotherian/deploy@c123ab9] (eqiad): Increase mirrored traffic to 80% for eqiad
* 16:26 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1067.eqiad.wmnet with OS bullseye
* 09:58 marostegui@cumin1001: dbctl commit (dc=all): 'db1164 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17405 and previous config saved to /var/cache/conftool/dbconfig/20211001-095834-root.json
* 16:21 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol2003-dev.wikimedia.org
* 09:57 marostegui@cumin1001: dbctl commit (dc=all): 'db1177 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17404 and previous config saved to /var/cache/conftool/dbconfig/20211001-095720-root.json
* 16:21 andrew@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 09:55 marostegui: Upgrade db1164 and db1177
* 16:16 andrew@cumin1001: START - Cookbook sre.dns.netbox
* 09:54 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1177 and db1164 for upgrade', diff saved to https://phabricator.wikimedia.org/P17403 and previous config saved to /var/cache/conftool/dbconfig/20211001-095433-marostegui.json
* 16:11 andrew@cumin1001: START - Cookbook sre.hosts.decommission for hosts cloudcontrol2003-dev.wikimedia.org
* 09:49 marostegui@cumin1001: dbctl commit (dc=all): 'db1172 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17402 and previous config saved to /var/cache/conftool/dbconfig/20211001-094913-root.json
* 16:08 pt1979@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['netmon2002.wikimedia.org']
* 09:49 marostegui@cumin1001: dbctl commit (dc=all): 'db1135 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17401 and previous config saved to /var/cache/conftool/dbconfig/20211001-094902-root.json
* 16:03 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1067.eqiad.wmnet with reason: host reimage
* 09:38 urbanecm: [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/updateMenteeData.php --wiki=frwiki --force # to get an idea about timing for [[phab:T290609|T290609]], runs in a tmux session under my account
* 15:58 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1067.eqiad.wmnet with reason: host reimage
* 09:34 marostegui@cumin1001: dbctl commit (dc=all): 'db1172 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17400 and previous config saved to /var/cache/conftool/dbconfig/20211001-093410-root.json
* 15:43 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1067.eqiad.wmnet with OS bullseye
* 09:33 marostegui@cumin1001: dbctl commit (dc=all): 'db1135 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17399 and previous config saved to /var/cache/conftool/dbconfig/20211001-093358-root.json
* 15:37 pt1979@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['netmon2002.wikimedia.org']
* 09:25 jmm@cumin2002: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testvm2002.codfw.wmnet
* 15:31 jbond@cumin2002: END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['netmon2002.wikimedia.org']
* 09:19 marostegui@cumin1001: dbctl commit (dc=all): 'db1172 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17398 and previous config saved to /var/cache/conftool/dbconfig/20211001-091906-root.json
* 15:31 jbond@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['netmon2002.wikimedia.org']
* 09:18 marostegui@cumin1001: dbctl commit (dc=all): 'db1135 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17397 and previous config saved to /var/cache/conftool/dbconfig/20211001-091854-root.json
* 15:07 jbond@cumin2002: END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts netmon1002.wikimedia.org
* 09:04 marostegui@cumin1001: dbctl commit (dc=all): 'db1172 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17396 and previous config saved to /var/cache/conftool/dbconfig/20211001-090402-root.json
* 15:07 jbond@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts netmon1002.wikimedia.org
* 09:03 marostegui@cumin1001: dbctl commit (dc=all): 'db1135 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17395 and previous config saved to /var/cache/conftool/dbconfig/20211001-090351-root.json
* 15:04 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1061.eqiad.wmnet with OS bullseye
* 09:02 jmm@cumin2002: START - Cookbook sre.ganeti.makevm for new host testvm2002.codfw.wmnet
* 14:46 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1061.eqiad.wmnet with reason: host reimage
* 09:00 _joe_: restarting pybal low-traffic in eqiad to pick up the drop of proxyfetch to kubernetes services
* 14:46 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2042.codfw.wmnet,service=varnish-fe
* 08:48 marostegui@cumin1001: dbctl commit (dc=all): 'db1172 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17394 and previous config saved to /var/cache/conftool/dbconfig/20211001-084859-root.json
* 14:46 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2042.codfw.wmnet,service=ats-be
* 08:48 marostegui@cumin1001: dbctl commit (dc=all): 'db1135 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17393 and previous config saved to /var/cache/conftool/dbconfig/20211001-084847-root.json
* 14:46 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2042.codfw.wmnet,service=ats-tls
* 08:44 marostegui: Upgrade db1135 and db1172
* 14:43 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1061.eqiad.wmnet with reason: host reimage
* 08:44 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1172 for upgrade', diff saved to https://phabricator.wikimedia.org/P17392 and previous config saved to /var/cache/conftool/dbconfig/20211001-084435-marostegui.json
* 14:29 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maint
* 08:44 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1135 for upgrade', diff saved to https://phabricator.wikimedia.org/P17391 and previous config saved to /var/cache/conftool/dbconfig/20211001-084411-marostegui.json
* 14:28 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1061.eqiad.wmnet with OS bullseye
* 08:43 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2080 [[phab:T290868|T290868]]', diff saved to https://phabricator.wikimedia.org/P17390 and previous config saved to /var/cache/conftool/dbconfig/20211001-084345-marostegui.json
* 14:28 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maint
* 08:17 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet
* 14:24 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1063.eqiad.wmnet with OS bullseye
* 08:15 _joe_: restarting pybal in codfw to pick up config changes
* 14:05 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1063.eqiad.wmnet with reason: host reimage
* 08:09 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet
* 14:02 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1063.eqiad.wmnet with reason: host reimage
* 08:04 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on testvm[2001,2003].codfw.wmnet with reason: Ganeti tests
* 13:47 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1063.eqiad.wmnet with OS bullseye
* 08:03 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 8:00:00 on testvm[2001,2003].codfw.wmnet with reason: Ganeti tests
* 13:41 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 06:28 marostegui@cumin1001: dbctl commit (dc=all): 'db1114 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17388 and previous config saved to /var/cache/conftool/dbconfig/20211001-062846-root.json
* 06:01 ryankemper@puppetmaster1001: conftool action : set/pooled=yes:weight=10; selector: name=elastic10[8-9][0-9].*
* 06:27 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' .
* 05:54 ryankemper@puppetmaster1001: conftool action : set/pooled=yes:weight=10; selector: name=elastic110.*
* 06:24 marostegui@cumin1001: dbctl commit (dc=all): 'db1119 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17387 and previous config saved to /var/cache/conftool/dbconfig/20211001-062453-root.json
* 01:03 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1121 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32369 and previous config saved to /var/cache/conftool/dbconfig/20220812-010312-ladsgroup.json
* 06:13 marostegui@cumin1001: dbctl commit (dc=all): 'db1114 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17386 and previous config saved to /var/cache/conftool/dbconfig/20211001-061342-root.json
* 01:03 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
* 06:13 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' .
* 01:02 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
* 06:09 marostegui@cumin1001: dbctl commit (dc=all): 'db1119 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17385 and previous config saved to /var/cache/conftool/dbconfig/20211001-060949-root.json
* 01:02 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance
* 05:58 marostegui@cumin1001: dbctl commit (dc=all): 'db1114 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17384 and previous config saved to /var/cache/conftool/dbconfig/20211001-055838-root.json
* 01:02 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance
* 05:54 marostegui@cumin1001: dbctl commit (dc=all): 'db1119 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17383 and previous config saved to /var/cache/conftool/dbconfig/20211001-055445-root.json
* 01:02 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1147 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32368 and previous config saved to /var/cache/conftool/dbconfig/20220812-010233-ladsgroup.json
* 05:43 marostegui@cumin1001: dbctl commit (dc=all): 'db1114 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17382 and previous config saved to /var/cache/conftool/dbconfig/20211001-054335-root.json
* 00:47 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P32367 and previous config saved to /var/cache/conftool/dbconfig/20220812-004727-ladsgroup.json
* 05:39 marostegui@cumin1001: dbctl commit (dc=all): 'db1119 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17381 and previous config saved to /var/cache/conftool/dbconfig/20211001-053942-root.json
* 00:32 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P32366 and previous config saved to /var/cache/conftool/dbconfig/20220812-003221-ladsgroup.json
* 05:28 marostegui@cumin1001: dbctl commit (dc=all): 'db1114 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17380 and previous config saved to /var/cache/conftool/dbconfig/20211001-052831-root.json
* 00:17 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1147 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32365 and previous config saved to /var/cache/conftool/dbconfig/20220812-001715-ladsgroup.json
* 05:26 marostegui: Upgrade db1114
* 05:25 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1114 for upgrade', diff saved to https://phabricator.wikimedia.org/P17379 and previous config saved to /var/cache/conftool/dbconfig/20211001-052509-marostegui.json
* 05:24 marostegui@cumin1001: dbctl commit (dc=all): 'db1119 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17378 and previous config saved to /var/cache/conftool/dbconfig/20211001-052438-root.json
* 05:22 marostegui: Upgrade db1119
* 05:21 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1119 for upgrade', diff saved to https://phabricator.wikimedia.org/P17377 and previous config saved to /var/cache/conftool/dbconfig/20211001-052133-marostegui.json
* 04:00 legoktm@deploy1002: Synchronized wmf-config/CommonSettings.php: Have PdfHandler use Shellbox on Commons for 10% of requests ([[phab:T289228|T289228]]) (duration: 00m 59s)
* 04:00 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 03:57 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 03:24 bd808@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'toolhub' for release 'main' .
* 03:15 bd808@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'toolhub' for release 'main' .


== 2021-09-30 ==
== 2022-08-11 ==
* 23:57 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 21:30 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 23:54 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 21:29 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 23:51 reedy@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Put a https protocol into values (duration: 01m 00s)
* 21:29 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 23:48 dpifke@deploy1002: Finished deploy [statsv/statsv@afeff42]: Deploy statsv with Kafka TLS support (not yet enabled) [[phab:T290131|T290131]] (duration: 00m 05s)
* 21:28 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 23:48 dpifke@deploy1002: Started deploy [statsv/statsv@afeff42]: Deploy statsv with Kafka TLS support (not yet enabled) [[phab:T290131|T290131]]
* 21:23 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 23:41 dpifke@deploy1002: Finished deploy [performance/coal@1be49f8]: Deploy Coal with Kafka TLS support (not yet enabled) [[phab:T290131|T290131]] (duration: 01m 07s)
* 21:22 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 23:40 dpifke@deploy1002: Started deploy [performance/coal@1be49f8]: Deploy Coal with Kafka TLS support (not yet enabled) [[phab:T290131|T290131]]
* 21:22 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 23:39 dpifke@deploy1002: Finished deploy [performance/navtiming@29264fb]: Deploy Navtiming with Kafka TLS support (not yet enabled) [[phab:T290131|T290131]] (duration: 00m 05s)
* 21:21 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 23:39 dpifke@deploy1002: Started deploy [performance/navtiming@29264fb]: Deploy Navtiming with Kafka TLS support (not yet enabled) [[phab:T290131|T290131]]
* 21:16 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 23:34 ejegg: updated Fundraising CiviCRM from {{Gerrit|d4da344274}} to {{Gerrit|d74e9aa0a1}}
* 21:15 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 22:09 legoktm@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'shellbox-media' for release 'main' .
* 21:15 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 22:07 legoktm@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'shellbox-media' for release 'main' .
* 21:14 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 22:06 legoktm@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'shellbox-media' for release 'main' .
* 21:04 thcipriani@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: revert [[gerrit:806944{{!}}Define default value for "wmgSiteLogoVariants" (T305692 T308620)]] (duration: 03m 15s)
* 21:53 legoktm@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'shellbox-timeline' for release 'main' .
* 20:59 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 21:06 eileen: civicrm revision changed from {{Gerrit|2ecb8f0bcd}} to {{Gerrit|d4da344274}}, config revision is {{Gerrit|77cb7ec866}}
* 20:58 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:54 ryankemper: [WCQS] `ryankemper@wcqs1003:~$ sudo pool` (merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/725110 to unbreak readiness probe)
* 20:58 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:54 topranks: Routinator on rpki1001 upgraded to 0.10.0 and working again after force refresh.
* 20:57 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:49 brennen: gitlab1001: upgrade to 14.2.5 complete
* 20:52 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:32 brennen: gitlab2001, gitlab1001: downtime for upgrades to 14.2.5
* 20:51 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:18 ryankemper: [WCQS] `ryankemper@wcqs1003:~$ sudo depool` (not sure why pybal can't depool it, the other 2 servers are pooled)
* 20:50 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 19:51 topranks: Updating routinator on rpki1001 [[phab:T291543|T291543]]
* 20:49 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 19:39 legoktm@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'shellbox-timeline' for release 'main' .
* 20:47 thcipriani@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:806944{{!}}Define default value for "wmgSiteLogoVariants" (T305692 T308620)]] (duration: 03m 07s)
* 19:38 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:44 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 19:37 legoktm@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'shellbox-timeline' for release 'main' .
* 20:43 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 19:36 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:43 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 19:26 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:42 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 19:21 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:29 thcipriani@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/VisualEditor/modules/ve-mw/preinit/ve.init.mw.DesktopArticleTarget.init.js: Backport: [[gerrit:822396{{!}}Do not show incompatible skin warning when page is not editable (T314952)]] (duration: 03m 16s)
* 19:14 jhuneidi@deploy1002: rebuilt and synchronized wikiversions files: all wikis to 1.38.0-wmf.2 refs [[phab:T281166|T281166]]
* 20:27 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 19:11 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:26 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 19:08 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:26 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 19:07 thcipriani@deploy1002: Synchronized php-1.38.0-wmf.2/extensions/MobileFrontend: Backport: [[gerrit:724979{{!}}Fix search within pages alignment (T292107)]] (duration: 01m 09s)
* 20:25 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 19:05 thcipriani@deploy1002: Synchronized php-1.38.0-wmf.1/extensions/EventBus/includes/EventBus.php: Backport: [[gerrit:724481{{!}}Guard against undefined index notice when setting x-client-ip (T288853)]] (duration: 01m 09s)
* 20:23 mutante: merging change on prod phabricator host to allow scap deployment, part 1
* 19:04 thcipriani@deploy1002: Synchronized php-1.38.0-wmf.2/extensions/EventBus/includes/EventBus.php: Backport: [[gerrit:724480{{!}}Guard against undefined index notice when setting x-client-ip (T288853)]] (duration: 01m 09s)
* 19:42 damilare: payments-wiki upgraded from {{Gerrit|cf5e1848}} to {{Gerrit|0894d75a}}
* 18:58 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 19:41 mutante: disabling puppet on C:profile::phabricator::main
* 18:58 thcipriani@deploy1002: Synchronized php-1.38.0-wmf.2/skins/Vector/resources/skins.vector.styles.legacy/components/MenuDropdown.less: Backport: [[gerrit:724798{{!}}Restore original more menu padding in legacy Vector (T289163)]] (duration: 01m 08s)
* 19:20 mvernon@cumin2002: END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-eqiad: upgrade to 3.11.13 [[phab:T309896|T309896]] - mvernon@cumin2002
* 18:54 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 17:58 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 18:44 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 17:58 taavi@deploy1002: Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:822428{{!}}Fix labtestwiki database name servers (T310795)]] (duration: 03m 39s)
* 18:43 thcipriani@deploy1002: Scap failed!: Call to mwscript eval.php stderr: not empty
* 17:57 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 18:42 moritzm: imported gitlab 14.2.5 to thirdparty/gitlab [[phab:T292219|T292219]]
* 17:57 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 18:41 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 17:56 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 18:38 thcipriani@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:704167{{!}}Use Wikimania's logo in a new vector (T286405)]] Part III (duration: 01m 07s)
* 17:52 sukhe: testing ATS 9.1.3-1wm1 on cp3064: [[phab:T309651|T309651]]
* 18:37 thcipriani@deploy1002: Synchronized static/images/mobile/copyright/wikimania-wordmark.svg: Config: [[gerrit:704167{{!}}Use Wikimania's logo in a new vector (T286405)]] Part II (duration: 01m 07s)
* 17:49 pt1979@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host netmon2002.mgmt.codfw.wmnet with reboot policy FORCED
* 18:35 thcipriani@deploy1002: Synchronized static/images/mobile/copyright/wikimania.svg: Config: [[gerrit:704167{{!}}Use Wikimania's logo in a new vector (T286405)]] part I (duration: 01m 07s)
* 17:46 sukhe: testing ATS 9.1.3-1wm1 on cp3064: [[phab:T309651|T309651]]
* 18:32 cmooney@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 17:41 pt1979@cumin2002: START - Cookbook sre.hosts.provision for host netmon2002.mgmt.codfw.wmnet with reboot policy FORCED
* 18:31 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 17:40 pt1979@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 18:31 thcipriani@deploy1002: Synchronized wmf-config: Config: [[gerrit:724514{{!}}Enable sticky header on beta cluster (T289721)]] (duration: 01m 08s)
* 17:38 sukhe: testing ATS 9.1.3-1wm1 on cp1090: [[phab:T309651|T309651]]
* 18:29 cmooney@cumin1001: START - Cookbook sre.dns.netbox
* 17:36 pt1979@cumin2002: START - Cookbook sre.dns.netbox
* 18:28 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 17:35 pt1979@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host netmon2002
* 18:27 otto@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts thorium.eqiad.wmnet
* 17:34 pt1979@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host netmon2002
* 18:22 bd808@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'toolhub' for release 'main' .
* 17:33 sukhe: testing ATS 9.1.3-1wm1 on cp3065: [[phab:T309651|T309651]]
* 18:20 thcipriani@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:724861{{!}}Disable legacy media dom on a few more wikis (T51097)]] (duration: 01m 08s)
* 17:28 sukhe: testing ATS 9.1.3-1wm1 on cp1089: [[phab:T309651|T309651]]
* 18:07 bd808@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'toolhub' for release 'main' .
* 17:19 bking@cumin1001: conftool action : set/weight=10:pooled=no; selector: service=elasticsearch-omega-ssl,name=elastic1100.eqiad.wmnet
* 17:49 elukey@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
* 17:18 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: service=elasticsearch-omega-ssl,name=elastic1100.eqiad.wmnet
* 17:49 elukey@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
* 17:15 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: service=search-omega-https,name=elastic1100.eqiad.wmnet
* 17:49 otto@cumin1001: START - Cookbook sre.hosts.decommission for hosts thorium.eqiad.wmnet
* 16:35 mvernon@cumin2002: START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-eqiad: upgrade to 3.11.13 [[phab:T309896|T309896]] - mvernon@cumin2002
* 17:42 bstorm: updating packages for thirdparty/kubeadm-k8s-1-20 and thirdparty/kubeadm-k8s-1-19 in stretch-wikimedia on apt1001 [[phab:T292131|T292131]]
* 16:30 mvernon@cumin2002: END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-codfw: upgrade to 3.11.13 [[phab:T309896|T309896]] - mvernon@cumin2002
* 17:09 jgiannelos@deploy1002: Finished deploy [kartotherian/deploy@8fbf87c] (eqiad): Increase mirrored traffic to 50% for eqiad (duration: 00m 55s)
* 16:29 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic[1100-1102].eqiad.wmnet with reason: [[phab:T309810|T309810]]
* 17:08 jgiannelos@deploy1002: Started deploy [kartotherian/deploy@8fbf87c] (eqiad): Increase mirrored traffic to 50% for eqiad
* 16:29 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic[1100-1102].eqiad.wmnet with reason: [[phab:T309810|T309810]]
* 17:03 jgiannelos@deploy1002: Finished deploy [kartotherian/deploy@8fbf87c] (eqiad): Increase mirrored traffic to 50% for eqiad (duration: 00m 08s)
* 16:26 inflatador: bking@elastic1054 attempting to ban elastic1100-1102 from cluster due to firewall issues
* 17:02 jgiannelos@deploy1002: Started deploy [kartotherian/deploy@8fbf87c] (eqiad): Increase mirrored traffic to 50% for eqiad
* 16:13 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: service=search-omega-https,name=elastic1100.eqiad.wmnet
* 17:02 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' .
* 16:12 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=elastic1100
* 17:00 jgiannelos@deploy1002: Finished deploy [kartotherian/deploy@8fbf87c] (eqiad): Increase mirrored traffic to 50% for eqiad (duration: 00m 11s)
* 15:15 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 17:00 jgiannelos@deploy1002: Started deploy [kartotherian/deploy@8fbf87c] (eqiad): Increase mirrored traffic to 50% for eqiad
* 15:09 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
* 16:49 sukhe: restart dnsdist.service on doh[1001-1002,2001-2002,3001-3002,4001-4002,5001-5002].wikimedia.org
* 14:58 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P32364 and previous config saved to /var/cache/conftool/dbconfig/20220811-145823-ladsgroup.json
* 16:43 jgiannelos@deploy1002: Finished deploy [kartotherian/deploy@67a4d22]: Increase mirrored traffic to 10% (duration: 02m 33s)
* 14:55 inflatador: bking@cumin1001 running puppet agent across eqiad elastic hosts
* 16:40 jgiannelos@deploy1002: Started deploy [kartotherian/deploy@67a4d22]: Increase mirrored traffic to 10%
* 14:48 ryankemper@cumin1001: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 16:38 jgiannelos@deploy1002: Finished deploy [kartotherian/deploy@67a4d22] (eqiad): Increase mirrored traffic to 10% (duration: 00m 40s)
* 14:43 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P32362 and previous config saved to /var/cache/conftool/dbconfig/20220811-144318-ladsgroup.json
* 16:37 jgiannelos@deploy1002: Started deploy [kartotherian/deploy@67a4d22] (eqiad): Increase mirrored traffic to 10%
* 14:28 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P32361 and previous config saved to /var/cache/conftool/dbconfig/20220811-142813-ladsgroup.json
* 16:37 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 14:28 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol1003.wikimedia.org
* 16:33 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 14:28 andrew@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:32 hnowlan: Ran `GRANT pg_monitor TO prometheus` for maps in eqiad and codfw to fix empty prometheus connection metrics
* 14:24 andrew@cumin1001: START - Cookbook sre.dns.netbox
* 16:30 jgiannelos@deploy1002: Finished deploy [kartotherian/deploy@67a4d22] (eqiad): Increase mirrored traffic to 10% (duration: 00m 16s)
* 14:19 andrew@cumin1001: START - Cookbook sre.hosts.decommission for hosts cloudcontrol1003.wikimedia.org
* 16:30 jgiannelos@deploy1002: Started deploy [kartotherian/deploy@67a4d22] (eqiad): Increase mirrored traffic to 10%
* 14:19 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 16:24 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 14:18 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol1004.wikimedia.org
* 16:20 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 14:18 andrew@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:11 ladsgroup@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:725032{{!}}Disable jQuery migrate in metawiki (T280944)]] (duration: 01m 09s)
* 14:18 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 16:08 ladsgroup@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:725019{{!}}Enable dispatching via job to 10 prod wikis]] (duration: 01m 09s)
* 14:18 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 15:55 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' .
* 14:17 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 14:36 elukey: drop /etc/helmfile-defaults/private/backup_old_paths from deploy1002 (old data not needed anymore)
* 14:17 ladsgroup@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:822375{{!}}Stop writing to the old templatelinks fields in s2 (T312865)]] (duration: 03m 25s)
* 14:33 marostegui@cumin1001: dbctl commit (dc=all): 'db1111 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17374 and previous config saved to /var/cache/conftool/dbconfig/20210930-143325-root.json
* 14:16 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
* 14:30 marostegui@cumin1001: dbctl commit (dc=all): 'db1134 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17373 and previous config saved to /var/cache/conftool/dbconfig/20210930-143044-root.json
* 14:16 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
* 14:18 marostegui@cumin1001: dbctl commit (dc=all): 'db1111 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17372 and previous config saved to /var/cache/conftool/dbconfig/20210930-141822-root.json
* 14:16 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
* 14:15 marostegui@cumin1001: dbctl commit (dc=all): 'db1134 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17370 and previous config saved to /var/cache/conftool/dbconfig/20210930-141540-root.json
* 14:15 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
* 14:03 marostegui@cumin1001: dbctl commit (dc=all): 'db1111 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17369 and previous config saved to /var/cache/conftool/dbconfig/20210930-140318-root.json
* 14:13 andrew@cumin1001: START - Cookbook sre.dns.netbox
* 14:00 marostegui@cumin1001: dbctl commit (dc=all): 'db1134 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17368 and previous config saved to /var/cache/conftool/dbconfig/20210930-140037-root.json
* 14:13 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P32360 and previous config saved to /var/cache/conftool/dbconfig/20220811-141309-ladsgroup.json
* 13:48 marostegui@cumin1001: dbctl commit (dc=all): 'db1111 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17367 and previous config saved to /var/cache/conftool/dbconfig/20210930-134815-root.json
* 14:12 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:45 marostegui@cumin1001: dbctl commit (dc=all): 'db1134 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17366 and previous config saved to /var/cache/conftool/dbconfig/20210930-134533-root.json
* 14:11 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:42 jmm@cumin2002: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testvm2001.codfw.wmnet
* 14:11 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:40 jgiannelos@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' .
* 14:11 awight: EU backport window complete
* 13:38 akosiaris@deploy1002: helmfile [eqiad] DONE helmfile.d/admin 'apply'.
* 14:10 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:38 akosiaris@deploy1002: helmfile [eqiad] START helmfile.d/admin 'apply'.
* 14:10 awight@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/DiscussionTools/includes/CommentFormatter.php: Backport: [[gerrit:822149{{!}}CommentFormatter: Set 'data-mw-comment' even when reply tool disabled (T314707)]] (duration: 03m 31s)
* 13:37 akosiaris@deploy1002: helmfile [codfw] DONE helmfile.d/admin 'apply'.
* 14:09 andrew@cumin1001: START - Cookbook sre.hosts.decommission for hosts cloudcontrol1004.wikimedia.org
* 13:36 akosiaris@deploy1002: helmfile [codfw] START helmfile.d/admin 'apply'.
* 14:05 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:33 marostegui@cumin1001: dbctl commit (dc=all): 'db1111 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17365 and previous config saved to /var/cache/conftool/dbconfig/20210930-133311-root.json
* 14:04 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:30 marostegui@cumin1001: dbctl commit (dc=all): 'db1134 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17364 and previous config saved to /var/cache/conftool/dbconfig/20210930-133029-root.json
* 14:04 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:29 marostegui: Upgrade db1111
* 14:03 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:28 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1111 for upgrade', diff saved to https://phabricator.wikimedia.org/P17363 and previous config saved to /var/cache/conftool/dbconfig/20210930-132831-marostegui.json
* 13:52 mvernon@cumin2002: START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: upgrade to 3.11.13 [[phab:T309896|T309896]] - mvernon@cumin2002
* 13:27 marostegui: Upgrade db1134
* 13:50 awight@deploy1002: Synchronized wmf-config: Config: [[gerrit:820666{{!}}Revert "Revert "testwiki: Add mediawiki.web_ui.interactions stream""]] (duration: 03m 10s)
* 13:27 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1134 (s1) for upgrade', diff saved to https://phabricator.wikimedia.org/P17362 and previous config saved to /var/cache/conftool/dbconfig/20210930-132700-marostegui.json
* 13:48 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:26 marostegui: Upgrade db1133
* 13:47 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:26 jmm@cumin2002: START - Cookbook sre.ganeti.makevm for new host testvm2001.codfw.wmnet
* 13:47 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:02 urbanecm: Start server-side upload for 2 video files ([[phab:T292096|T292096]], [[phab:T291492|T291492]])
* 13:46 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:01 marostegui@cumin1001: dbctl commit (dc=all): 'db1099:3311 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17361 and previous config saved to /var/cache/conftool/dbconfig/20210930-130116-root.json
* 13:36 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1060.eqiad.wmnet with OS bullseye
* 13:01 marostegui@cumin1001: dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17360 and previous config saved to /var/cache/conftool/dbconfig/20210930-130109-root.json
* 13:36 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 12:46 marostegui@cumin1001: dbctl commit (dc=all): 'db1099:3311 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17359 and previous config saved to /var/cache/conftool/dbconfig/20210930-124612-root.json
* 13:36 awight@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:822130{{!}}trwikiquote: Install WikiLove extension (T314895)]] (duration: 03m 30s)
* 12:46 marostegui@cumin1001: dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17358 and previous config saved to /var/cache/conftool/dbconfig/20210930-124606-root.json
* 13:35 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 12:31 Reedy: downloading files for [[phab:T290900|T290900]] in screen on mwmaint1002
* 13:35 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 12:31 marostegui@cumin1001: dbctl commit (dc=all): 'db1099:3311 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17357 and previous config saved to /var/cache/conftool/dbconfig/20210930-123109-root.json
* 13:34 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 12:31 marostegui@cumin1001: dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17356 and previous config saved to /var/cache/conftool/dbconfig/20210930-123101-root.json
* 13:33 filippo@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host logstash2003.codfw.wmnet
* 12:18 jgiannelos@deploy1002: Finished deploy [kartotherian/deploy@35b9174]: tegola: remove mirror_threshold variable because of parsing errors (duration: 00m 17s)
* 13:25 awight@deploy1002: Synchronized static/images: Config: [[gerrit:821330{{!}}Revert "trwiki: Change old and new vector logos for 500k articles"]] (part 3) (duration: 03m 09s)
* 12:18 jgiannelos@deploy1002: Started deploy [kartotherian/deploy@35b9174]: tegola: remove mirror_threshold variable because of parsing errors
* 13:21 awight@deploy1002: Synchronized logos/: Config: [[gerrit:821330{{!}}Revert "trwiki: Change old and new vector logos for 500k articles"]] (part 2) (duration: 03m 09s)
* 12:17 moritzm: adapted MX records to point to both mx1001.wikimedia.org and mx2001.wikimedia.org with equal weights [[phab:T286911|T286911]]
* 13:19 topranks: merging CR821781 to expose additional network info in puppet facts
* 12:17 jgiannelos@deploy1002: Finished deploy [kartotherian/deploy@35b9174]: tegola: remove mirror_threshold variable because of parsing errors (duration: 00m 16s)
* 13:18 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 12:16 jgiannelos@deploy1002: Started deploy [kartotherian/deploy@35b9174]: tegola: remove mirror_threshold variable because of parsing errors
* 13:18 awight@deploy1002: Synchronized wmf-config/: Config: [[gerrit:821330{{!}}Revert "trwiki: Change old and new vector logos for 500k articles"]] (part 1) (duration: 03m 13s)
* 12:16 marostegui@cumin1001: dbctl commit (dc=all): 'db1099:3311 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17355 and previous config saved to /var/cache/conftool/dbconfig/20210930-121605-root.json
* 13:17 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 12:15 marostegui@cumin1001: dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17354 and previous config saved to /var/cache/conftool/dbconfig/20210930-121558-root.json
* 13:17 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 12:14 jgiannelos@deploy1002: Finished deploy [kartotherian/deploy@35b9174]: tegola: remove mirror_threshold variable because of parsing errors (duration: 00m 15s)
* 13:16 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 12:13 jgiannelos@deploy1002: Started deploy [kartotherian/deploy@35b9174]: tegola: remove mirror_threshold variable because of parsing errors
* 13:14 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1060.eqiad.wmnet with reason: host reimage
* 12:13 jgiannelos@deploy1002: Finished deploy [kartotherian/deploy@35b9174]: tegola: remove mirror_threshold variable because of parsing errors (duration: 00m 15s)
* 13:11 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 12:13 jgiannelos@deploy1002: Started deploy [kartotherian/deploy@35b9174]: tegola: remove mirror_threshold variable because of parsing errors
* 13:11 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1060.eqiad.wmnet with reason: host reimage
* 12:11 jgiannelos@deploy1002: Finished deploy [kartotherian/deploy@35b9174]: tegola: remove mirror_threshold variable because of parsing errors (duration: 00m 10s)
* 13:10 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 12:10 jgiannelos@deploy1002: Started deploy [kartotherian/deploy@35b9174]: tegola: remove mirror_threshold variable because of parsing errors
* 13:10 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 12:10 jgiannelos@deploy1002: Finished deploy [kartotherian/deploy@35b9174]: tegola: remove mirror_threshold variable because of parsing errors (duration: 00m 01s)
* 13:09 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 12:10 jgiannelos@deploy1002: Started deploy [kartotherian/deploy@35b9174]: tegola: remove mirror_threshold variable because of parsing errors
* 13:08 awight@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:822073{{!}}Enable editor line numbering on all namespaces, for twwiki (T302852)]] (duration: 03m 42s)
* 12:03 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 12:56 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1060.eqiad.wmnet with OS bullseye
* 12:01 marostegui@cumin1001: dbctl commit (dc=all): 'db1099:3311 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17353 and previous config saved to /var/cache/conftool/dbconfig/20210930-120102-root.json
* 12:55 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 12:00 marostegui@cumin1001: dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17352 and previous config saved to /var/cache/conftool/dbconfig/20210930-120054-root.json
* 12:49 aikochou@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
* 12:00 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 12:46 aikochou@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
* 11:58 hnowlan: imported wikidiff2_1.13.0-1/php-wikidiff2_1.13.0-1_amd64.deb to buster-wikimedia component/php72
* 12:26 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=restbase2018.codfw.wmnet
* 11:56 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1099 (s1 and s8) for upgrade', diff saved to https://phabricator.wikimedia.org/P17351 and previous config saved to /var/cache/conftool/dbconfig/20210930-115631-marostegui.json
* 12:26 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=restbase202[367].codfw.wmnet
* 11:51 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 12:17 elukey@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
* 11:47 jgiannelos@deploy1002: Finished deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin (duration: 00m 03s)
* 12:17 elukey@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
* 11:47 jgiannelos@deploy1002: Started deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin
* 12:17 elukey@deploy1002: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
* 11:47 jgiannelos@deploy1002: Finished deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin (duration: 00m 01s)
* 12:16 elukey@deploy1002: helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
* 11:47 jgiannelos@deploy1002: Started deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin
* 12:13 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host logstash2003.codfw.wmnet
* 11:47 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 12:11 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
* 11:46 jgiannelos@deploy1002: Finished deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin (duration: 00m 01s)
* 12:10 elukey@deploy1002: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
* 11:46 jgiannelos@deploy1002: Started deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin
* 12:09 elukey@deploy1002: helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
* 11:44 effie: downgrading scap to 3.17.1-1 on maps* hosts - [[phab:T291990|T291990]]
* 11:20 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
* 11:43 kartik@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:724732{{!}}Make reply tool available as opt-out almost everywhere (phase 3) (T288485)]] (duration: 01m 07s)
* 11:20 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
* 11:37 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 09:58 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
* 11:35 kartik@deploy1002: Synchronized php-1.38.0-wmf.2/extensions/DiscussionTools: Backport: [[gerrit:724789{{!}}Add a link to preferences within the Reply and New Discussion Tools (T291002)]] (duration: 01m 08s)
* 09:57 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
* 11:34 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 09:56 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
* 11:30 kartik@deploy1002: Synchronized php-1.38.0-wmf.1/extensions/DiscussionTools: Backport: [[gerrit:724788{{!}}Add a link to preferences within the Reply and New Discussion Tools (T291002)]] (duration: 01m 09s)
* 09:56 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
* 11:25 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 09:49 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance
* 11:21 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 09:49 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance
* 11:14 kartik@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:724458{{!}}Enable SectionTranslation in Igbo, Hausa, Yoruba Wikipedias (T290175)]] (duration: 01m 08s)
* 09:32 godog: arm keyholder on netmon2001
* 11:05 elukey@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
* 09:09 jbond: update gnutls28 on bullseye systems
* 11:05 elukey@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
* 09:00 jbond: update unzip
* 10:13 akosiaris: upgrade znuny to 6.0.37
* 08:21 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
* 10:06 godog: test bounce logstash on logstash1023
* 08:13 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
* 08:21 moritzm: installing nettle security updates on stretch
* 08:12 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
* 08:10 jmm@cumin2002: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testvm2003.codfw.wmnet
* 08:06 ayounsi@cumin1001: END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox interface ID cr3-ulsfo:xe-0/1/1
* 07:49 jmm@cumin2002: START - Cookbook sre.ganeti.makevm for new host testvm2003.codfw.wmnet
* 08:06 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr3-ulsfo:xe-0/1/1
* 07:31 jgiannelos@deploy1002: Finished deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin (duration: 00m 06s)
* 07:58 ayounsi@cumin1001: END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox interface ID cr3-ulsfo:xe-0/1/1
* 07:31 jgiannelos@deploy1002: Started deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin
* 07:57 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr3-ulsfo:xe-0/1/1
* 07:03 legoktm@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'shellbox-syntaxhighlight' for release 'main' .
* 07:55 oblivian@cumin1001: conftool action : set/pooled=false; selector: dnsdisc=k8s-ingress-wikikube-rw,name=codfw
* 06:58 legoktm@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'shellbox-syntaxhighlight' for release 'main' .
* 07:51 vgutierrez: rolling restart of pybal in eqsin and ulsfo
* 06:56 legoktm@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'shellbox-syntaxhighlight' for release 'main' .
* 07:24 oblivian@cumin1001: conftool action : set/pooled=false; selector: dnsdisc=restbase-async,name=eqiad
* 06:48 marostegui: Deploy schema change on s8 codfw (lag will show up) [[phab:T270620|T270620]]
* 07:24 oblivian@cumin1001: conftool action : set/pooled=true; selector: dnsdisc=shellbox-timeline
* 06:01 marostegui: Deploy schema change on s1 codfw (lag will show up) [[phab:T270620|T270620]]
* 07:23 oblivian@cumin1001: conftool action : set/pooled=true; selector: dnsdisc=inference
* 05:53 marostegui: Deploy schema change on s3 codfw (lag will show up) [[phab:T270620|T270620]]
* 07:19 _joe_: pooling all services in codfw
* 05:52 marostegui: Deploy schema change on s7 codfw (lag will show up) [[phab:T270620|T270620]]
* 07:03 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1147 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32357 and previous config saved to /var/cache/conftool/dbconfig/20220811-070312-ladsgroup.json
* 05:47 marostegui: Deploy schema change on s5 codfw (lag will show up) [[phab:T270620|T270620]]
* 07:03 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance
* 05:45 marostegui: Deploy schema change on s4 codfw (lag will show up) [[phab:T270620|T270620]]
* 07:02 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance
* 05:45 marostegui: Deploy schema change on s2 codfw (lag will show up) [[phab:T270620|T270620]]
* 07:02 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32356 and previous config saved to /var/cache/conftool/dbconfig/20220811-070252-ladsgroup.json
* 06:47 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P32355 and previous config saved to /var/cache/conftool/dbconfig/20220811-064746-ladsgroup.json
* 06:32 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P32354 and previous config saved to /var/cache/conftool/dbconfig/20220811-063240-ladsgroup.json
* 06:28 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
* 06:28 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
* 06:17 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32353 and previous config saved to /var/cache/conftool/dbconfig/20220811-061734-ladsgroup.json
* 06:17 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maint
* 06:17 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maint
* 06:06 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool db1162 ([[phab:T314368|T314368]] [[phab:T298555|T298555]] [[phab:T312863|T312863]] [[phab:T310011|T310011]] [[phab:T309311|T309311]] [[phab:T60674|T60674]] [[phab:T298560|T298560]] [[phab:T303603|T303603]] [[phab:T310485|T310485]])', diff saved to https://phabricator.wikimedia.org/P32352 and previous config saved to /var/cache/conftool/dbconfig/20220811-060625-ladsgroup.json
* 06:01 ladsgroup@cumin1001: dbctl commit (dc=all): 'Promote db1122 to s2 primary and set section read-write [[phab:T314368|T314368]]', diff saved to https://phabricator.wikimedia.org/P32351 and previous config saved to /var/cache/conftool/dbconfig/20220811-060113-ladsgroup.json
* 06:00 ladsgroup@cumin1001: dbctl commit (dc=all): 'Set s2 eqiad as read-only for maintenance - [[phab:T314368|T314368]]', diff saved to https://phabricator.wikimedia.org/P32350 and previous config saved to /var/cache/conftool/dbconfig/20220811-060042-ladsgroup.json
* 06:00 Amir1: Starting s2 eqiad failover from db1162 to db1122 - [[phab:T314368|T314368]]
* 05:19 ladsgroup@cumin1001: dbctl commit (dc=all): 'Set db1122 with weight 0 [[phab:T314368|T314368]]', diff saved to https://phabricator.wikimedia.org/P32349 and previous config saved to /var/cache/conftool/dbconfig/20220811-051913-ladsgroup.json
* 05:19 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s2 [[phab:T314368|T314368]]
* 05:18 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s2 [[phab:T314368|T314368]]
* m: chown -R librenms /srv/librenms/rrd/ on netmon1003 [[phab:T314972|T314972]]
* 03:51 cwhite: chown librenms /srv/librenms/rrd/* on netmon1003 [[phab:T314972|T314972]]
* 02:55 ejegg: civicrm upgraded from {{Gerrit|1f91ac2d}} to {{Gerrit|92467234}}
* 02:46 ejegg: updated process-control yaml files with @wmff alias
* 02:08 ejegg: civicrm rolled back from {{Gerrit|92467234}} to {{Gerrit|1f91ac2d}}
* 02:05 ejegg: civicrm upgraded from {{Gerrit|1f91ac2d}} to {{Gerrit|92467234}}
* 01:40 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 01:39 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 01:39 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 01:38 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 01:38 tstarling@deploy1002: Synchronized wmf-config/logging.php: (no justification provided) (duration: 03m 25s)
* 01:19 tstarling@puppetmaster1001: conftool action : set/pooled=true; selector: dnsdisc=(appservers{{!}}api)-ro,name=codfw
* 01:19 tstarling@puppetmaster1001: conftool action : set/pooled=true; selector: dnsdisc=sessionstore,name=codfw
* 00:58 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=varnish-fe
* 00:58 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=ats-be
* 00:58 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=ats-tls
* 00:57 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on cp2042.codfw.wmnet with reason: host down; depooled and will debug tomorrow
* 00:57 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on cp2042.codfw.wmnet with reason: host down; depooled and will debug tomorrow


== 2021-09-29 ==
== 2022-08-10 ==
* 23:20 bd808@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'toolhub' for release 'main' .
* 21:25 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1016.eqiad.wmnet
* 23:05 bd808@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'toolhub' for release 'main' .
* 21:23 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1014.eqiad.wmnet
* 23:02 bd808@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'toolhub' for release 'main' .
* 21:10 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 16 hosts with reason: [[phab:T309810|T309810]]
* 22:10 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 21:10 bking@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on 16 hosts with reason: [[phab:T309810|T309810]]
* 22:07 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 21:09 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on elastic[1101-1102].eqiad.wmnet with reason: [[phab:T309810|T309810]]
* 21:57 legoktm@deploy1002: Synchronized php-1.38.0-wmf.2/extensions/timeline/includes/Timeline.php: Catch TimelineException from fixMap() ([[phab:T292126|T292126]]) (duration: 01m 07s)
* 21:09 bking@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on elastic[1101-1102].eqiad.wmnet with reason: [[phab:T309810|T309810]]
* 21:48 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 21:00 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 21:44 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 21:00 cjming: end of UTC late backport window
* 21:37 legoktm@deploy1002: Synchronized php-1.38.0-wmf.2/extensions/timeline/includes/Timeline.php: Bump Timeline::CACHE_VERSION (duration: 01m 08s)
* 20:59 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:34 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:59 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:31 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:59 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820533{{!}}Remove unused $wgEnableMWSuggest]] (duration: 03m 04s)
* 20:22 jhuneidi@deploy1002: Synchronized php: group1 wikis to 1.38.0-wmf.2  refs [[phab:T281166|T281166]] (duration: 01m 08s)
* 20:58 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:21 jhuneidi@deploy1002: rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.2  refs [[phab:T281166|T281166]]
* 20:56 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820568{{!}}Enable new topic tool on dewiki (T313699)]] (duration: 03m 01s)
* 20:20 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:34 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:822093{{!}}testwiki: set $wgCdnMatchParameterOrder to false (T314868)]] (duration: 03m 20s)
* 20:16 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:31 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:16 jhuneidi@deploy1002: Finished scap: Fix pywikibot feature detection (duration: 13m 38s)
* 20:30 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:02 jhuneidi@deploy1002: Started scap: Fix pywikibot feature detection
* 20:30 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 19:17 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:30 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 19:13 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:19 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 19:06 legoktm@deploy1002: Synchronized php-1.38.0-wmf.2/extensions/timeline/scripts/renderTimeline.sh: Fix passing temp directory to EasyTimeline.pl (duration: 01m 07s)
* 20:18 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 18:59 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:18 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 18:55 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:17 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 18:52 dancy@deploy1002: Synchronized php-1.38.0-wmf.2/skins/MinervaNeue/resources/skins.minerva.base.styles/ui.less: Backport: [[gerrit:724787{{!}}Search header should be vertically centered, not top aligned(take 2) (T292071)]] (duration: 01m 08s)
* 20:12 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 17:26 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:09 bking@cumin1001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
* 17:22 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:08 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 17:14 ladsgroup@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:724776{{!}}Fully enable change dispatching via jobs on test wikis]], Part I (duration: 01m 09s)
* 20:08 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 17:13 ladsgroup@deploy1002: Synchronized wmf-config/Wikibase.php: Config: [[gerrit:724776{{!}}Fully enable change dispatching via jobs on test wikis]], Part I (duration: 01m 07s)
* 20:08 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820646{{!}}Start writing to cuc_actor everywhere except s4 and s8 (T233004)]] (duration: 03m 15s)
* 16:52 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:07 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 16:47 pt1979@cumin2002: END (PASS) - Cookbook sre.experimental.reimage (exit_code=0) for host thumbor2006.codfw.wmnet
* 19:51 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2053-2054].codfw.wmnet
* 16:47 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 19:51 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc[2053-2054].codfw.wmnet
* 16:43 akosiaris: start hbal -L -G row_B -X on ganeti01.svc.codfw.wmnet . Rows C and D are fine
* 19:35 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2019-2020].codfw.wmnet
* 16:42 akosiaris: start hbal -L -G row_A -X on ganeti01.svc.codfw.wmnet
* 19:35 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for parse[2019-2020].codfw.wmnet
* 16:40 akosiaris: migrate kubemaster2001 off ganeti2007 and to ganeti2008 due to memory starvation on ganeti2007
* 19:35 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2016-2018].codfw.wmnet
* 16:37 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 19:35 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for parse[2016-2018].codfw.wmnet
* 16:34 pt1979@cumin2002: START - Cookbook sre.experimental.reimage for host thumbor2006.codfw.wmnet
* 19:34 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2036.codfw.wmnet
* 16:33 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 19:34 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc2036.codfw.wmnet
* 16:25 lucaswerkmeister-wmde@deploy1002: Synchronized php-1.38.0-wmf.2/extensions/WikimediaBadges/: Backport: [[gerrit:724561{{!}}Handle missing items in WikibaseClientSiteLinksForItemHandler (T291953)]] (duration: 01m 08s)
* 19:28 sukhe: testing ATS 9.1.3-1wm1 on cp4026: [[phab:T309651|T309651]]
* 16:24 lucaswerkmeister-wmde@deploy1002: Synchronized php-1.38.0-wmf.1/extensions/WikimediaBadges/: Backport: [[gerrit:724560{{!}}Handle missing items in WikibaseClientSiteLinksForItemHandler (T291953)]] (duration: 01m 10s)
* 19:09 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1087.eqiad.wmnet with OS bullseye
* 15:58 pt1979@cumin2002: END (FAIL) - Cookbook sre.experimental.reimage (exit_code=99) for host thumbor2006.codfw.wmnet
* 19:06 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1086.eqiad.wmnet with OS bullseye
* 15:53 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 18:55 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1087.eqiad.wmnet with reason: host reimage
* 15:50 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 18:51 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1086.eqiad.wmnet with reason: host reimage
* 15:45 Amir1: disabled cron dispatching for mediawikiwiki
* 18:50 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1087.eqiad.wmnet with reason: host reimage
* 15:44 ladsgroup@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:724765{{!}}Enable change dispatching via jobs in wikidatawiki (T48643)]] (duration: 01m 08s)
* 18:49 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1086.eqiad.wmnet with reason: host reimage
* 15:44 pt1979@cumin2002: START - Cookbook sre.experimental.reimage for host thumbor2006.codfw.wmnet
* 18:47 bking@cumin1001: START - Cookbook sre.wdqs.data-transfer
* 15:39 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2035.codfw.wmnet
* 18:38 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1087.eqiad.wmnet with OS bullseye
* 15:39 ladsgroup@deploy1002: Synchronized php-1.38.0-wmf.2/extensions/Wikibase/client: Backport: [[gerrit:724558{{!}}Track time until dispatched recent changes are inserted (T291962)]] (duration: 01m 10s)
* 18:36 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1086.eqiad.wmnet with OS bullseye
* 15:24 pt1979@cumin2002: END (PASS) - Cookbook sre.experimental.reimage (exit_code=0) for host thumbor2006.codfw.wmnet
* 18:22 urandom: truncating Cassandra hints (eqiad datacenter)  -- [[phab:T314941|T314941]]
* 15:22 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host ms-be2035.codfw.wmnet
* 18:13 urandom: truncating codfw Cassandra hints (eqiad datacenter)  -- [[phab:T314941|T314941]]
* 15:21 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet
* 18:07 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main2005.codfw.wmnet
* 15:12 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet
* 18:07 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for kafka-main2005.codfw.wmnet
* 15:02 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' .
* 18:05 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repool D8 DBs after PDU maint ([[phab:T310146|T310146]])', diff saved to https://phabricator.wikimedia.org/P32346 and previous config saved to /var/cache/conftool/dbconfig/20220810-180529-ladsgroup.json
* 14:51 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet
* 17:42 otto@deploy1002: Finished deploy [analytics/refinery@6e47e0e]: Add missing changes to the deletion script - [[phab:T270433|T270433]] - [analytics/refinery@6e47e0e] (duration: 05m 28s)
* 14:45 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet
* 17:39 fnegri@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts labweb1002.wikimedia.org
* 14:38 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 17:39 fnegri@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:35 jmm@cumin2002: START - Cookbook sre.dns.netbox
* 17:36 otto@deploy1002: Started deploy [analytics/refinery@6e47e0e]: Add missing changes to the deletion script - [[phab:T270433|T270433]] -  [analytics/refinery@6e47e0e]
* 14:25 jmm@cumin2002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts testvm2001.codfw.wmnet
* 17:35 fnegri@cumin1001: START - Cookbook sre.dns.netbox
* 14:17 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts testvm2001.codfw.wmnet
* 17:34 otto@deploy1002: Finished deploy [analytics/refinery@6e47e0e] (hadoop-test): Add missing changes to the deletion script - [[phab:T270433|T270433]] - TEST [analytics/refinery@6e47e0e] (duration: 04m 19s)
* 14:08 dcausse@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' .
* 17:30 fnegri@cumin1001: START - Cookbook sre.hosts.decommission for hosts labweb1002.wikimedia.org
* 14:07 elukey@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
* 17:30 otto@deploy1002: Started deploy [analytics/refinery@6e47e0e] (hadoop-test): Add missing changes to the deletion script - [[phab:T270433|T270433]] - TEST [analytics/refinery@6e47e0e]
* 14:07 elukey@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
* 17:09 dzahn@cumin2002: START - Cookbook sre.dns.netbox
* 14:04 pt1979@cumin2002: START - Cookbook sre.experimental.reimage for host thumbor2006.codfw.wmnet
* 17:08 otto@deploy1002: Started deploy [analytics/refinery@d4dd7e4] (hadoop-test): Add safety limits to refinery-drop-older-than - [[phab:T270433|T270433]] - TEST [analytics/refinery@d4dd7e4]
* 14:01 dcausse@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' .
* 17:06 sukhe: testing ATS 9.1.3-1wm1 on cp4032: [[phab:T309651|T309651]]
* 13:38 elukey@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
* 17:06 urandom: flushing RESTBase Cassandra tables -row B- to (temporarily) free instance-data space -- [[phab:T314941|T314941]]
* 13:38 elukey@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
* 17:05 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on krb2001.codfw.wmnet with reason: btullis codfw maintenance
* 13:34 volans@cumin2002: END (PASS) - Cookbook sre.experimental.reimage (exit_code=0) for host sretest1001.eqiad.wmnet
* 17:05 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on krb2001.codfw.wmnet with reason: btullis codfw maintenance
* 13:31 dcausse@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' .
* 17:04 dzahn@cumin2002: START - Cookbook sre.hosts.decommission for hosts gerrit2001.wikimedia.org
* 13:11 otto@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
* 17:02 sukhe: testing ATS 9.1.3-1wm1 on cp6008: [[phab:T309651|T309651]]
* 13:11 otto@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' .
* 16:56 sukhe: testing ATS 9.1.3-1wm1 on cp6016: [[phab:T309651|T309651]]
* 13:09 volans@cumin2002: START - Cookbook sre.experimental.reimage for host sretest1001.eqiad.wmnet
* 16:55 fnegri@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts labweb1001.wikimedia.org
* 13:09 otto@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' .
* 16:55 fnegri@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:09 otto@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
* 16:32 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts gerrit2001.wikimedia.org
* 13:08 elukey@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
* 16:32 dzahn@cumin2002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
* 13:08 elukey@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
* 16:32 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes[2013-2014].codfw.wmnet
* 13:04 elukey@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
* 16:31 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for kubernetes[2013-2014].codfw.wmnet
* 13:04 elukey@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
* 16:31 jelto: kubectl uncordon kubernetes2014.codfw.wmnet
* 12:09 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 16:31 fnegri@cumin1001: START - Cookbook sre.dns.netbox
* 12:06 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 16:30 jelto: kubectl uncordon kubernetes2013.codfw.wmnet
* 11:56 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 16:29 urandom: restarting Cassandra (RESTBase) -row A- to apply r822110 -- [[phab:T314941|T314941]]
* 11:52 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 16:27 dzahn@cumin2002: START - Cookbook sre.dns.netbox
* 11:48 Lucas_WMDE: EU backport+config window done
* 16:25 fnegri@cumin1001: START - Cookbook sre.hosts.decommission for hosts labweb1001.wikimedia.org
* 11:48 lucaswerkmeister-wmde@deploy1002: Synchronized php-1.38.0-wmf.2/skins/MinervaNeue/skinStyles/mobile.startup/Overlay.less: Backport: [[gerrit:724553{{!}}Revert "Search header should be vertically centered, not top aligned." (T292030)]] (duration: 01m 07s)
* 16:23 mutante: shutting down gerrit2001
* 11:43 lucaswerkmeister-wmde@deploy1002: Synchronized php-1.38.0-wmf.2/extensions/Wikibase/repo/includes/Store/Sql/SqlSiteLinkConflictLookup.php: Backport: [[gerrit:724371{{!}}Use CONN_TRX_AUTOCOMMIT in SqlSiteLinkConflictLookup (T291377)]] (duration: 01m 07s)
* 16:23 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2034-2035].codfw.wmnet
* 11:43 volans@cumin2002: END (PASS) - Cookbook sre.experimental.reimage (exit_code=0) for host sretest1001.eqiad.wmnet
* 16:23 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc[2034-2035].codfw.wmnet
* 11:42 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 16:22 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2016-2018].codfw.wmnet
* 11:38 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 16:22 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for parse[2016-2018].codfw.wmnet
* 11:26 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 16:16 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=maps2010.codfw.wmnet
* 11:26 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:722279{{!}}Enable line numbering on all namespaces (pilot wikis) (T280027)]] (duration: 01m 09s)
* 16:16 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=sessionstore2003.codfw.wmnet
* 11:22 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 16:13 sukhe: reprepro -C component/trafficserver9 include buster-wikimedia trafficserver_9.1.3-1wm1_amd64.changes: [[phab:T309651|T309651]]
* 11:20 lucaswerkmeister-wmde@deploy1002: Synchronized php-1.38.0-wmf.1/extensions/DiscussionTools/modules/dt.ui.ReplyWidget.js: Backport: [[gerrit:724378{{!}}Fix almost all errors codes being logged as `http-0` (T290514)]] (duration: 01m 09s)
* 16:13 dzahn@cumin2002: START - Cookbook sre.hosts.decommission for hosts gerrit2001.wikimedia.org
* 11:17 lucaswerkmeister-wmde@deploy1002: Synchronized php-1.38.0-wmf.2/extensions/DiscussionTools/modules/dt.ui.ReplyWidget.js: Backport: [[gerrit:724379{{!}}Fix almost all errors codes being logged as `http-0` (T290514)]] (duration: 01m 09s)
* 16:11 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be[2039,2050,2056,2059].codfw.wmnet,thanos-be2004.codfw.wmnet with reason: PDU work
* 11:16 volans@cumin2002: START - Cookbook sre.experimental.reimage for host sretest1001.eqiad.wmnet
* 16:10 mvernon@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be[2039,2050,2056,2059].codfw.wmnet,thanos-be2004.codfw.wmnet with reason: PDU work
* 11:15 volans@cumin2002: END (ERROR) - Cookbook sre.experimental.reimage (exit_code=97) for host sretest1001.eqiad.wmnet
* 16:09 urandom: flushing tables in row D (RESTBase Cassandra cluster)  -- [[phab:T314941|T314941]]
* 10:35 volans@cumin2002: START - Cookbook sre.experimental.reimage for host sretest1001.eqiad.wmnet
* 15:54 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for gitlab-runner2004.codfw.wmnet
* 10:34 volans@cumin2002: END (ERROR) - Cookbook sre.experimental.reimage (exit_code=97) for host sretest1001.eqiad.wmnet
* 15:54 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for gitlab-runner2004.codfw.wmnet
* 10:24 volans@cumin2002: START - Cookbook sre.experimental.reimage for host sretest1001.eqiad.wmnet
* 15:53 sukhe: poweroff cp2041, 42 for PDU upgrade: rack D7
* 10:02 volans@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on cumin1001.eqiad.wmnet with reason: testing latest change
* 15:51 urandom: flushing tables in row B (RESTBase Cassandra cluster)  -- [[phab:T314941|T314941]]
* 10:00 volans@cumin1001: START - Cookbook sre.hosts.downtime for 0:05:00 on cumin1001.eqiad.wmnet with reason: testing latest change
* 15:49 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2004.codfw.wmnet with reason: PDU maintenance
* 09:54 godog: bounce mtail on centrallog* - [[phab:T246470|T246470]]
* 15:49 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2004.codfw.wmnet with reason: PDU maintenance
* 09:47 dcausse@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' .
* 15:46 urandom: flushing tables in row A (RESTBase Cassandra cluster) -- [[phab:T314941|T314941]]
* 09:40 jgiannelos@deploy1002: Finished deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin (duration: 00m 11s)
* 15:46 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2012.codfw.wmnet with reason: btullis codfw maintenance
* 09:39 jgiannelos@deploy1002: Started deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin
* 15:46 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp[2041-2042].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
* 08:58 dcausse@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' .
* 15:46 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2012.codfw.wmnet with reason: btullis codfw maintenance
* 08:39 dcausse@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' .
* 15:46 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 3:00:00 on cp[2041-2042].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
* 08:22 ema: fleet-wide rm /etc/rsyslog.d/00-abort-unclean-config.conf && systemctl restart rsyslog
* 15:46 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2011.codfw.wmnet with reason: btullis codfw maintenance
* 07:51 godog: fail sdg on be2036 - [[phab:T291988|T291988]]
* 15:46 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2011.codfw.wmnet with reason: btullis codfw maintenance
* 07:25 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2081 [[phab:T290868|T290868]]', diff saved to https://phabricator.wikimedia.org/P17345 and previous config saved to /var/cache/conftool/dbconfig/20210929-072520-marostegui.json
* 15:45 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2010.codfw.wmnet with reason: btullis codfw maintenance
* 07:15 oblivian@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 15:45 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2010.codfw.wmnet with reason: btullis codfw maintenance
* 06:15 marostegui: Deploy schema change on s8 codfw (lag will show up) [[phab:T283499|T283499]]
* 15:45 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2009.codfw.wmnet with reason: btullis codfw maintenance
* 06:10 ryankemper: [[phab:T289517|T289517]] Ran puppet across query_service fleet `sudo cumin -b 6 'P<nowiki>{</nowiki>w*qs*<nowiki>}</nowiki>' 'sudo run-puppet-agent'`
* 15:45 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2009.codfw.wmnet with reason: btullis codfw maintenance
* 06:09 ryankemper: [[phab:T289517|T289517]] Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/720746 (fix dcat-ap loading)
* 15:37 urandom: (ephemerally) increasing hinted hand-off delivery rate limit to 16KB, RESTBase eqiad nodes  -- [[phab:T314941|T314941]]
* 05:56 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2103 [[phab:T290865|T290865]]', diff saved to https://phabricator.wikimedia.org/P17344 and previous config saved to /var/cache/conftool/dbconfig/20210929-055645-marostegui.json
* 15:34 jbond: remove puppetmaster[12]002 from production
* 04:50 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2081 [[phab:T290868|T290868]]', diff saved to https://phabricator.wikimedia.org/P17342 and previous config saved to /var/cache/conftool/dbconfig/20210929-045033-marostegui.json
* 15:30 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main2004.codfw.wmnet
* 03:18 eileen: civicrm revision changed from {{Gerrit|a0bc324a61}} to {{Gerrit|2ecb8f0bcd}}, config revision is {{Gerrit|77cb7ec866}}
* 15:30 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for kafka-main2004.codfw.wmnet
* 03:01 eileen: civicrm revision changed from {{Gerrit|1b7bae4033}} to {{Gerrit|a0bc324a61}}, config revision is {{Gerrit|77cb7ec866}}
* 15:20 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2051-2052].codfw.wmnet
* 03:00 eileen: civicrm revision changed from {{Gerrit|a480bf03c9}} to {{Gerrit|1b7bae4033}}, config revision is {{Gerrit|77cb7ec866}}
* 15:20 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc[2051-2052].codfw.wmnet
* 02:36 legoktm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Have PdfHandler/PagedTiffHandler use Shellbox on all wikis but Commons (duration: 01m 07s)
* 15:17 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc-gp2003.codfw.wmnet
* 00:52 eileen: civicrm revision changed from {{Gerrit|a1929b3dfd}} to {{Gerrit|a480bf03c9}}, config revision is {{Gerrit|77cb7ec866}}
* 15:17 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc-gp2003.codfw.wmnet
* 00:27 legoktm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Have SyntaxHighlight use Shellbox on all wikis (duration: 01m 18s)
* 15:16 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2033.codfw.wmnet
* 00:21 ryankemper: [[phab:T280001|T280001]] `ryankemper@authdns1001:~$ sudo -i authdns-update` following merge of https://gerrit.wikimedia.org/r/c/operations/dns/+/724538
* 15:16 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc2033.codfw.wmnet
* 00:19 ryankemper: [[phab:T280001|T280001]] Okay now we're clear to proceed to https://wikitech.wikimedia.org/wiki/LVS#For_active/active_services; merging https://gerrit.wikimedia.org/r/c/operations/dns/+/724538
* 15:14 _joe_: power off krb2002
* 00:15 ryankemper: [[phab:T280001|T280001]] `ryankemper@cumin1001:~$ sudo cumin 'A:icinga or A:dns-auth' run-puppet-agent` per https://wikitech.wikimedia.org/wiki/LVS#Make_the_service_page,_add_discovery_resources
* 15:14 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on krb2002.codfw.wmnet with reason: PDU maintenance
* 00:14 ryankemper: [[phab:T280001|T280001]] Moving wcqs state from `monitoring_setup` to `production`; merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/724536
* 15:13 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on krb2002.codfw.wmnet with reason: PDU maintenance
* 15:13 _joe_: shutting down rdb2010,puppetmaster2002 for d5 maintenance
* 15:02 jelto: power off mc2035
* 15:01 jelto: power off mc2034
* 15:01 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mc2035.codfw.wmnet with reason: PDU swap
* 15:01 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on mc2035.codfw.wmnet with reason: PDU swap
* 15:01 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mc2034.codfw.wmnet with reason: PDU swap
* 15:01 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on mc2034.codfw.wmnet with reason: PDU swap
* 14:43 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: PDU Maint ([[phab:T310146|T310146]])
* 14:43 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: PDU Maint ([[phab:T310146|T310146]])
* 14:38 urandom: disabling reserved space on eqiad nodes (RESTBase), /dev/md2 (aka /srv/cassandra/instance-data) -- [[phab:T314941|T314941]]
* 14:28 jelto: power off kafka-main2004 gracefully
* 14:28 hnowlan: shutting down sessionstore2003
* 14:27 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=sessionstore2003.codfw.wmnet
* 14:27 sukhe: power off cp2039, cp2040 for PDU upgrade: rack D
* 14:27 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2003.codfw.wmnet with reason: PDU maintenance
* 14:27 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2003.codfw.wmnet with reason: PDU maintenance
* 14:25 jelto: power off mc-gp2003
* 14:25 jelto: power off mc2033
* 14:24 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:30:00 on kafka-main2004.codfw.wmnet with reason: PDU swap
* 14:23 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:30:00 on kafka-main2004.codfw.wmnet with reason: PDU swap
* 14:23 sukhe: depool codfw for PDU upgrade: rack D
* 14:23 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:30:00 on mc-gp2003.codfw.wmnet with reason: PDU swap
* 14:23 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:30:00 on mc-gp2003.codfw.wmnet with reason: PDU swap
* 14:23 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:30:00 on mc2033.codfw.wmnet with reason: PDU swap
* 14:23 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:30:00 on mc2033.codfw.wmnet with reason: PDU swap
* 14:15 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp20[39{{!}}40]\.codfw\.wmnet,service=ats-tls
* 14:13 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp[2039-2040].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
* 14:13 urandom: flushing Cassandra tables, restbase1030
* 14:13 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 3:00:00 on cp[2039-2040].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
* 14:13 urandom: flushing Cassandra tables, restbase1019
* 14:12 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on dns2002.wikimedia.org with reason: shutdown for PDU upgrade: rack D4
* 14:12 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 3:00:00 on dns2002.wikimedia.org with reason: shutdown for PDU upgrade: rack D4
* 14:11 urandom: flushing Cassandra tables, restbase1017 1018 1021 1024 1025 1026 1028 1029
* 14:05 urandom: flushing tables, restbase1016
* 13:52 hnowlan: powered up restbase2018
* 13:32 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on netmon1003.wikimedia.org with reason: pdu
* 13:32 filippo@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on netmon1003.wikimedia.org with reason: pdu
* 13:32 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on logstash2003.codfw.wmnet with reason: pdu
* 13:31 filippo@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on logstash2003.codfw.wmnet with reason: pdu
* 13:31 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on logstash2029.codfw.wmnet with reason: pdu
* 13:31 root@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on logstash2029.codfw.wmnet with reason: pdu
* 13:30 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: [[phab:T310146|T310146]]
* 13:30 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: [[phab:T310146|T310146]]
* 13:17 elukey: powering on restbase2027
* 13:12 elukey: powering on restbase2026
* 13:12 _joe_: powering on restbase2023
* 13:01 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32343 and previous config saved to /var/cache/conftool/dbconfig/20220810-130108-ladsgroup.json
* 13:01 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
* 13:00 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
* 12:37 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic[2072,2084-2085].codfw.wmnet with reason: [[phab:T310146|T310146]]
* 12:37 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic[2072,2084-2085].codfw.wmnet with reason: [[phab:T310146|T310146]]
* 12:27 jbond: remove confd from servers that shouldn't have it
* 12:05 ladsgroup@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/Echo/maintenance/removeOrphanedEvents.php: Backport: [[gerrit:821735{{!}}Run clean ups with removeOrphanedEvents in major batches (T310428)]] (duration: 03m 32s)
* 11:47 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 11:46 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 11:46 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 11:43 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 11:15 jbond@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS buster
* 10:54 jbond@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
* 10:51 jbond@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
* 10:37 jbond@cumin1001: START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster
* 10:31 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2181-2182].codfw.wmnet with reason: D6 PDU maint ([[phab:T310146|T310146]])
* 10:31 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2181-2182].codfw.wmnet with reason: D6 PDU maint ([[phab:T310146|T310146]])
* 10:26 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on maps2010.codfw.wmnet with reason: PDU maintenance
* 10:26 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on maps2010.codfw.wmnet with reason: PDU maintenance
* 10:26 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=maps2010.codfw.wmnet
* 10:25 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=maps2010.codfw.wmnet
* 10:25 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=maps2008.codfw.wmnet
* 10:24 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on restbase2018.codfw.wmnet with reason: PDU maintenance
* 10:24 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase2018.codfw.wmnet
* 10:24 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on restbase2018.codfw.wmnet with reason: PDU maintenance
* 10:23 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ml-serve2008.codfw.wmnet with reason: PDU maintenance
* 10:23 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on ml-serve2008.codfw.wmnet with reason: PDU maintenance
* 10:20 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ores2009.codfw.wmnet with reason: PDU maintenance
* 10:20 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on ores2009.codfw.wmnet with reason: PDU maintenance
* 10:19 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ores2008.codfw.wmnet with reason: PDU maintenance
* 10:19 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on ores2008.codfw.wmnet with reason: PDU maintenance
* 10:03 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase202[367].codfw.wmnet
* 10:02 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on restbase[2023,2026-2027].codfw.wmnet with reason: PDU maintenance
* 10:02 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on restbase[2023,2026-2027].codfw.wmnet with reason: PDU maintenance
* 09:53 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: D8 PDU Maint ([[phab:T310146|T310146]])
* 09:52 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: D8 PDU Maint ([[phab:T310146|T310146]])
* 09:51 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool D8 DBs for PDU maint ([[phab:T310146|T310146]])', diff saved to https://phabricator.wikimedia.org/P32341 and previous config saved to /var/cache/conftool/dbconfig/20220810-095059-ladsgroup.json
* 09:36 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2101,2130,2140].codfw.wmnet,dbproxy2004.codfw.wmnet with reason: D6 PDU maint ([[phab:T310146|T310146]])
* 09:36 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2101,2130,2140].codfw.wmnet,dbproxy2004.codfw.wmnet with reason: D6 PDU maint ([[phab:T310146|T310146]])
* 09:34 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool D6 dbs ([[phab:T310146|T310146]])', diff saved to https://phabricator.wikimedia.org/P32340 and previous config saved to /var/cache/conftool/dbconfig/20220810-093433-ladsgroup.json
* 09:31 jelto: depool services in codfw for upcoming PDU replacement - [[phab:T309956|T309956]]
* 09:30 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 8 hosts with reason: Maintenance
* 09:29 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 8 hosts with reason: Maintenance
* 09:29 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
* 09:29 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
* 09:28 jynus: shutdown backup2007 before pdu upgrade [[phab:T310146|T310146]]
* 09:16 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 09:15 ladsgroup@deploy1002: Synchronized php-1.39.0-wmf.23/maintenance/namespaceDupes.php: Backport: [[gerrit:821734{{!}}maintenance: Add support for links migration to namespaceDupes.php (T314711)]] (duration: 03m 18s)
* 09:15 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2093,2120,2129,2172].codfw.wmnet with reason: D5 PDU maint ([[phab:T310146|T310146]])
* 09:15 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2093,2120,2129,2172].codfw.wmnet with reason: D5 PDU maint ([[phab:T310146|T310146]])
* 09:14 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 09:14 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 09:13 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 09:10 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool D5 dbs ([[phab:T310146|T310146]])', diff saved to https://phabricator.wikimedia.org/P32339 and previous config saved to /var/cache/conftool/dbconfig/20220810-091038-ladsgroup.json
* 08:49 jynus: shutdown dbprov2003 before pdu upgrade [[phab:T310146|T310146]]
* 08:49 ayounsi@cumin1001: END (FAIL) - Cookbook sre.network.debug (exit_code=99) for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:48 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:48 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be2028.codfw.wmnet
* 08:48 mvernon@cumin1001: START - Cookbook sre.hosts.remove-downtime for ms-be2028.codfw.wmnet
* 08:42 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P32337 and previous config saved to /var/cache/conftool/dbconfig/20220810-084222-ladsgroup.json
* 08:37 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 08:37 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
* 08:36 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
* 08:36 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 08:36 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 08:35 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 08:35 ladsgroup@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:822037{{!}}Stop writing to the old templatelinks fields in s5 (T312865)]] (duration: 03m 29s)
* 08:32 jelto: power off gitlab-runner2004
* 08:31 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:30:00 on gitlab-runner2004.codfw.wmnet with reason: PDU swap
* 08:31 root@cumin1001: START - Cookbook sre.hosts.downtime for 8:30:00 on gitlab-runner2004.codfw.wmnet with reason: PDU swap
* 08:29 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be2028.codfw.wmnet with reason: Trying to fix full /
* 08:28 mvernon@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be2028.codfw.wmnet with reason: Trying to fix full /
* 08:28 ayounsi@cumin1001: END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:27 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:27 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P32336 and previous config saved to /var/cache/conftool/dbconfig/20220810-082718-ladsgroup.json
* 08:25 ayounsi@cumin1001: END (FAIL) - Cookbook sre.network.debug (exit_code=99) for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:25 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:24 ayounsi@cumin1001: END (FAIL) - Cookbook sre.network.debug (exit_code=99) for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:24 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:23 kart_: Run: mwscript namespaceDupes.php arywiki --fix ([[phab:T291737|T291737]])
* 08:13 jynus: restart replication on db1117:m1 [[phab:T309074|T309074]]
* 08:12 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P32335 and previous config saved to /var/cache/conftool/dbconfig/20220810-081213-ladsgroup.json
* 08:09 kartik@deploy1002: Finished scap: Backport: [[gerrit:821732{{!}}arywiki: change namespace translations, add unchanged namespaces and add old translations as aliases (T291737)]] (duration: 10m 37s)
* 07:59 kartik@deploy1002: Started scap: Backport: [[gerrit:821732{{!}}arywiki: change namespace translations, add unchanged namespaces and add old translations as aliases (T291737)]]
* 07:57 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P32334 and previous config saved to /var/cache/conftool/dbconfig/20220810-075708-ladsgroup.json
* 07:56 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P32333 and previous config saved to /var/cache/conftool/dbconfig/20220810-075636-ladsgroup.json
* 07:55 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 07:52 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 07:52 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 07:52 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance
* 07:52 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance
* 07:51 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 07:51 dcaro@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 07:46 dcaro@cumin1001: START - Cookbook sre.dns.netbox
* 07:39 dcaro@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 07:34 dcaro@cumin1001: START - Cookbook sre.dns.netbox
* 07:33 godog: depool thanos-fe2001 for debugging
* 07:11 kartik@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:821170{{!}}Enable SectionTranslation on testwiki with new MT support from Google (T313296)]] (duration: 05m 44s)
* 07:11 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 07:08 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 07:08 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 07:07 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 05:24 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on kubernetes[2013-2014].codfw.wmnet with reason: PDU maintenance
* 05:24 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on kubernetes[2013-2014].codfw.wmnet with reason: PDU maintenance
* 05:19 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on parse[2016-2020].codfw.wmnet with reason: PDU maintenance
* 05:19 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on parse[2016-2020].codfw.wmnet with reason: PDU maintenance
* 05:12 _joe_: starting to shut down servers in codfw for the PDU maintenance
* 05:09 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on 10 hosts with reason: PDU maintenance
* 05:09 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on 10 hosts with reason: PDU maintenance
* 05:09 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on mc-gp2003.codfw.wmnet with reason: PDU maintenance
* 05:09 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on mc-gp2003.codfw.wmnet with reason: PDU maintenance
* 05:06 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on mc2033.codfw.wmnet with reason: PDU maintenance
* 05:06 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on mc2033.codfw.wmnet with reason: PDU maintenance
* 05:05 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on 7 hosts with reason: PDU maintenance
* 05:05 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on 7 hosts with reason: PDU maintenance
* 02:34 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 02:33 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 02:33 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 02:32 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 02:07 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 02:06 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 02:06 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 02:05 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply


== 2021-09-28 ==
== 2022-08-09 ==
* 23:53 ryankemper: [[phab:T280001|T280001]] New icinga checks are green, will proceed to next step of moving wcqs state from `monitoring_setup` -> `production`
* 23:17 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1011.eqiad.wmnet
* 23:07 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 23:49 ryankemper: [[phab:T280001|T280001]] New icinga alerts showing up as expected following wcqs state change to `monitoring_setup`: `LVS wcqs codfw port 443/tcp - Wikimedia Commons Query Service IPv4` and `LVS wcqs eqiad port 443/tcp - Wikimedia Commons Query Service IPv4`
* 23:06 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 23:45 ryankemper: [[phab:T280001|T280001]] Changing wcqs state from `lvs_setup` to `monitoring_setup`: `ryankemper@cumin1001:~$ sudo cumin 'A:icinga' 'run-puppet-agent'`
* 22:51 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 23:14 ryankemper: [[phab:T282117|T282117]] `error: plugin_geoip: Invalid resource name 'disc-wcqs' detected from zonefile lookup` We must be missing a line, reverting change to fix
* 22:51 bking@deploy1002: helmfile [codfw] START helmfile.d/services/
* 23:14 ryankemper: [[phab:T282117|T282117]] `ryankemper@authdns1001:~$ sudo -i authdns-update` following merge of https://gerrit.wikimedia.org/r/724520
* 23:13 pt1979@cumin2002: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2419.codfw.wmnet with reason: REIMAGE
* 23:11 pt1979@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2419.codfw.wmnet with reason: REIMAGE
* 22:46 pt1979@cumin2002: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2418.codfw.wmnet with reason: REIMAGE
* 22:44 pt1979@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2418.codfw.wmnet with reason: REIMAGE
* 22:41 legoktm@deploy1002: Finished scap: Fix erroneous en-gb translations in 1.38.0-


== 2021-09-26 ==
== 2022-08-08 ==
* 14:51 volker-e@deploy1002: Finished deploy [design/style-guide@aac0ae9]: Deploy design/style-guide: {{Gerrit|aac0ae9}} “Apps”: Fix image path (#490) (duration: 00m 06s)
* 23:52 tstarling@deploy1002: Synchronized wmf-config/InitialiseSettings.php: clean up testwiki experiments [[phab:T314750|T314750]] (duration: 03m 19s)
* 14:51 volker-e@deploy1002: Started deploy [design/style-guide@aac0ae9]: Deploy design/style-guide: {{Gerrit|aac0ae9}} “Apps”: Fix image path (#490)
* 23:47 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 03:16 legoktm: killed queries on db1099
* 23:46 tstarling@deploy1002: Synchronized wmf-config/CommonSettings.php: clean up testwiki experiments [[phab:T314750|T314750]] (duration: 03m 27s)
* 03:14 legoktm: killing queries on db1105
* 23:46 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 23:46 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 23:45 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 23:32 eileen___: config revision changed from {{Gerrit|f5668044}} to 787cd0e0
* 23:32 eileen___: civicrm upgraded from {{Gerrit|497bddf7}} to {{Gerrit|1f91ac2d}}
* 22:16 ryankemper@cumin1001: END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 22:16 ryankemper@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host elastic1065.eqiad.wmnet with OS bullseye
* 21:53 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1065.eqiad.wmnet with reason: host reimage
* 21:50 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1065.eqiad.wmnet with reason: host reimage
* 21:36 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1065.eqiad.wmnet with OS bullseye
* 21:12 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1062.eqiad.wmnet with OS bullseye
* 20:53 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1062.eqiad.wmnet with reason: host reimage
* 20:50 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1062.eqiad.wmnet with reason: host reimage
* 20:36 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1062.eqiad.wmnet with OS bullseye
* 20:32 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:31 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:31 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:29 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 20:28 cjming: end of UTC late backport window
* 20:27 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:27 cjming@deploy1002: Synchronized php-1.39.0-wmf.23/skins/Vector/resources/skins.vector.styles/layouts/grid.less: Backport: [[gerrit:821243{{!}}Fix grid blowout bug (T314756)]] (duration: 03m 26s)
* 20:12 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:11 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:11 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:11 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:817785{{!}}Disable sticky header edit A/B test for pilot wikis (T312296)]] (duration: 03m 35s)
* 20:08 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 17:34 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1088.eqiad.wmnet with OS bullseye
* 17:15 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1088.eqiad.wmnet with reason: host reimage
* 17:12 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1088.eqiad.wmnet with reason: host reimage
* 17:00 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1088.eqiad.wmnet with OS bullseye
* 16:54 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1085.eqiad.wmnet with OS bullseye
* 16:49 ryankemper@cumin1001: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 16:43 pt1979@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:41 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
* 16:39 pt1979@cumin2002: START - Cookbook sre.dns.netbox
* 16:38 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
* 16:26 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1085.eqiad.wmnet with OS bullseye
* 16:24 bking@cumin1001: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host elastic1085.eqiad.wmnet with OS bullseye
* 16:19 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
* 16:16 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
* 16:16 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 16:14 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
* 16:12 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
* 16:10 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
* 16:09 ryankemper@cumin1001: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 16:04 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1085.eqiad.wmnet with OS bullseye
* 16:00 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1084.eqiad.wmnet with OS bullseye
* 15:58 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 15:47 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1084.eqiad.wmnet with reason: host reimage
* 15:46 sukhe: upload reprepro -C main include bullseye-wikimedia python-pynetbox_6.6.0-1+wmf11u1_amd64.changes
* 15:45 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1084.eqiad.wmnet with reason: host reimage
* 15:37 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2021.codfw.wmnet with reason: Maint
* 15:37 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2021.codfw.wmnet with reason: Maint
* 15:32 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1084.eqiad.wmnet with OS bullseye
* 14:59 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
* 14:55 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
* 14:47 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cp5001.eqsin.wmnet with reason: depooled: faulty DIMM: [[phab:T314256|T314256]]
* 14:46 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on cp5001.eqsin.wmnet with reason: depooled: faulty DIMM: [[phab:T314256|T314256]]
* 14:34 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
* 14:11 kevinbazira@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
* 13:03 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:01 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:01 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 12:58 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 12:56 urbanecm@deploy1002: Synchronized wmf-config/CommonSettings.php: {{Gerrit|77fd5abdd7d9462869259e1511bbcf2d7ce62246}}: Growth: Add new rights to wgAvailableRights (duration: 03m 24s)
* 12:30 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1102.eqiad.wmnet
* 12:09 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 12:09 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 12:06 urbanecm@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/GrowthExperiments/: {{Gerrit|3eaf155678b7313c55dcca0cd39ab29f73eead37}}: MentorTools: Do not use MentorWeightManager ([[phab:T314362|T314362]]) (duration: 03m 31s)
* 12:04 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 11:43 btullis@cumin1001: START - Cookbook sre.hosts.reboot-single for host an-worker1102.eqiad.wmnet
* 11:21 jelto@cumin1001: conftool action : set/pooled=yes; selector: name=kubernetes2022.codfw.wmnet
* 11:21 jelto: kubectl uncordon kubernetes2022.codfw.wmnet
* 10:43 Amir1: Removing db2079 from orchestrator ([[phab:T313885|T313885]])
* 10:39 Amir1: Removing db2079 from zarcillo ([[phab:T313885|T313885]])
* 10:35 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2079.codfw.wmnet
* 10:35 ladsgroup@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 10:30 ladsgroup@cumin1001: START - Cookbook sre.dns.netbox
* 10:25 ladsgroup@cumin1001: START - Cookbook sre.hosts.decommission for hosts db2079.codfw.wmnet
* 10:18 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on db2079.codfw.wmnet with reason: Decom
* 10:17 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on db2079.codfw.wmnet with reason: Decom
* 08:41 jbond: deploy libtirpc update
* 07:57 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1144:3314 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32310 and previous config saved to /var/cache/conftool/dbconfig/20220808-075723-ladsgroup.json
* 07:57 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
* 07:57 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
* 07:57 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32309 and previous config saved to /var/cache/conftool/dbconfig/20220808-075702-ladsgroup.json
* 07:53 godog: grow sda/sdb 3 by 100G on thanos-be2001 - [[phab:T314275|T314275]]
* 07:50 godog: grow sda/sdb 3 by 100G on thanos-be1004 - [[phab:T314275|T314275]]
* 07:41 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P32308 and previous config saved to /var/cache/conftool/dbconfig/20220808-074156-ladsgroup.json
* 07:32 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 07:27 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 07:27 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 07:26 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P32307 and previous config saved to /var/cache/conftool/dbconfig/20220808-072650-ladsgroup.json
* 07:23 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 07:22 kartik@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820815{{!}}trwikivoyage: Create rollbacker user group (T314678)]] (duration: 03m 17s)
* 07:18 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 07:17 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 07:17 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 07:16 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 07:11 elukey: restart rsyslog on ml-serve2007
* 07:11 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 07:11 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32306 and previous config saved to /var/cache/conftool/dbconfig/20220808-071144-ladsgroup.json
* 07:10 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 07:10 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 07:09 kartik@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820261{{!}}Enable SectionTranslation on 10 Wikipedias where ContentTranslation is default (T308829)]] (duration: 03m 15s)
* 07:09 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 07:06 XioNoX: add CSP headers to Netbox - [[phab:T296356|T296356]]
* 07:05 elukey: restart rsyslog on ml-serve-ctrl2001


== 2021-09-25 ==
== 2022-08-07 ==
* 02:00 legoktm@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'shellbox-timeline' for release 'main' .
* 19:58 taavi: taavi@mwmaint1002 ~ $ echo "https://upload.wikimedia.org/wikipedia/commons/1/15/Keep_tidy_ask.svg" {{!}} mwscript purgeList.php --wiki enwiki # [[phab:T314712|T314712]]
* 01:27 legoktm@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'shellbox-timeline' for release 'main' .
* 13:52 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1142 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32305 and previous config saved to /var/cache/conftool/dbconfig/20220807-135204-ladsgroup.json
* 01:24 legoktm@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'shellbox-timeline' for release 'main' .
* 13:51 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance
* 13:51 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance
* 13:51 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32304 and previous config saved to /var/cache/conftool/dbconfig/20220807-135143-ladsgroup.json
* 13:36 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P32303 and previous config saved to /var/cache/conftool/dbconfig/20220807-133637-ladsgroup.json
* 13:21 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P32302 and previous config saved to /var/cache/conftool/dbconfig/20220807-132131-ladsgroup.json
* 13:06 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32301 and previous config saved to /var/cache/conftool/dbconfig/20220807-130625-ladsgroup.json
* 12:06 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1141 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32300 and previous config saved to /var/cache/conftool/dbconfig/20220807-120610-ladsgroup.json
* 12:06 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance
* 12:05 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance
* 12:05 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32299 and previous config saved to /var/cache/conftool/dbconfig/20220807-120549-ladsgroup.json
* 11:50 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P32298 and previous config saved to /var/cache/conftool/dbconfig/20220807-115043-ladsgroup.json
* 11:35 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P32297 and previous config saved to /var/cache/conftool/dbconfig/20220807-113537-ladsgroup.json
* 11:20 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32296 and previous config saved to /var/cache/conftool/dbconfig/20220807-112031-ladsgroup.json


== 2021-09-24 ==
== 2022-08-06 ==
* 20:00 volker-e@deploy1002: Finished deploy [design/style-guide@362c6b1]: Deploy design/style-guide: {{Gerrit|362c6b1}} “Components”: Fix index link (#489) (duration: 00m 06s)
* 17:59 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1149 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32295 and previous config saved to /var/cache/conftool/dbconfig/20220806-175916-ladsgroup.json
* 20:00 volker-e@deploy1002: Started deploy [design/style-guide@362c6b1]: Deploy design/style-guide: {{Gerrit|362c6b1}} “Components”: Fix index link (#489)
* 17:59 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance
* 19:33 volker-e@deploy1002: Finished deploy [design/style-guide@6585e79]: Deploy design/style-guide: {{Gerrit|6585e79}} “Apps”: Add Apps x Design System section (#487) (duration: 00m 07s)
* 17:58 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance
* 19:33 volker-e@deploy1002: Started deploy [design/style-guide@6585e79]: Deploy design/style-guide: {{Gerrit|6585e79}} “Apps”: Add Apps x Design System section (#487)
* 03:10 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 19:07 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 03:09 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 19:04 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 03:09 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 18:57 legoktm@deploy1002: Synchronized php-1.38.0-wmf.1/includes/MovePage.php: MovePage: don't create a recent change for a redirect ([[phab:T291677|T291677]]) (duration: 00m 57s)
* 03:08 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 18:54 legoktm@deploy1002: Synchronized php-1.38.0-wmf.1/extensions/PageTriage/: Revert "Remove deprecated date.js library" ([[phab:T291675|T291675]]) (duration: 00m 59s)
* 03:03 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 18:53 legoktm@deploy1002: sync-file aborted: (no justification provided) (duration: 00m 00s)
* 03:02 krinkle@deploy1002: Synchronized w/: {{Gerrit|I9067d47fab0324}} (duration: 03m 25s)
* 18:13 legoktm@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'zotero' for release 'production' .
* 03:02 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 18:12 legoktm@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'zotero' for release 'production' .
* 03:02 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 17:20 elukey@cumin1001: END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) restart MirrorMaker for Kafka A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons. - elukey@cumin1001
* 03:01 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 17:02 elukey@cumin1001: START - Cookbook sre.kafka.roll-restart-mirror-maker restart MirrorMaker for Kafka A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons. - elukey@cumin1001
* 02:41 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 16:35 elukey@cumin1001: END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. - elukey@cumin1001
* 02:39 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 15:59 elukey@cumin1001: END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons. - elukey@cumin1001
* 02:39 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 15:53 elukey@cumin1001: START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons. - elukey@cumin1001
* 02:38 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 15:52 elukey@cumin1001: END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons. - elukey@cumin1001
* 02:38 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance
* 15:46 elukey@cumin1001: START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons. - elukey@cumin1001
* 02:37 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance
* 15:23 elukey@cumin1001: END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons. - elukey@cumin1001
* 02:33 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 15:17 elukey@cumin1001: START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons. - elukey@cumin1001
* 02:32 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 15:09 elukey: sudo cumin -m async -b2  "c:profile::analytics::cluster::hdfs_mount"  "umount /mnt/hdfs" "mount /mnt/hdfs" - [[phab:T288625|T288625]]
* 02:32 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 14:32 bd808@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'toolhub' for release 'main' .
* 02:31 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 14:07 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:03 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
* 13:31 Amir1: start of rebuilding metadata of images in commons to make them use json
* 13:24 elukey@cumin1001: START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. - elukey@cumin1001
* 11:58 effie: upgrading scap on canaries - [[phab:T291095|T291095]]
* 11:39 jiji@cumin1001: conftool action : set/pooled=true; selector: dnsdisc=tegola-vector-tiles
* 11:32 effie: uploading scap-4.0.0 to buster-wikimedia and stretch-wikimedia
* 11:17 effie: restart pybal in low traffic load balancers
* 10:44 jynus: corrupting and fixing image metadata on testwiki before running script on commons [[phab:T290462|T290462]]
* 10:16 elukey@cumin1001: END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid public cluster: Roll restart of Druid jvm daemons. - elukey@cumin1001
* 10:11 btullis@cumin1001: END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99) restart masters for Hadoop analytics cluster: Restart of jvm daemons. - btullis@cumin1001
* 09:39 jynus: upgrade and restart db2099
* 09:32 btullis@cumin1001: START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop analytics cluster: Restart of jvm daemons. - btullis@cumin1001
* 09:29 btullis@cumin1001: END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop test cluster: Restart of jvm daemons. - btullis@cumin1001
* 09:25 marostegui: Rename flaggedimages on db1096(ruwiki) and db1098(arwiki) [[phab:T290340|T290340]]
* 09:25 elukey@cumin1001: START - Cookbook sre.druid.roll-restart-workers for Druid public cluster: Roll restart of Druid jvm daemons. - elukey@cumin1001
* 09:09 jynus: upgrade and restart db2139, db2101
* 09:03 btullis@cumin1001: START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop test cluster: Restart of jvm daemons. - btullis@cumin1001
* 08:35 elukey@cumin1001: END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. - elukey@cumin1001
* 08:22 jynus: upgrade and restart db2098 [[phab:T290868|T290868]]
* 08:20 elukey@cumin1001: END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons. - elukey@cumin1001
* 08:08 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mx2002.wikimedia.org
* 07:59 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts mx2002.wikimedia.org
* 07:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mx1002.wikimedia.org
* 07:34 elukey@cumin1001: START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons. - elukey@cumin1001
* 07:17 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade. - elukey@cumin1001
* 07:11 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts mx1002.wikimedia.org
* 07:01 elukey@cumin1001: START - Cookbook sre.hadoop.roll-restart-workers restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade. - elukey@cumin1001
* 07:01 elukey@cumin1001: END (ERROR) - Cookbook sre.hadoop.roll-restart-workers (exit_code=97) restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade. - elukey@cumin1001
* 07:00 elukey@cumin1001: START - Cookbook sre.hadoop.roll-restart-workers restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade. - elukey@cumin1001
* 06:55 elukey@cumin1001: START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. - elukey@cumin1001
* 06:53 elukey@cumin1001: END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid test cluster: Roll restart of Druid jvm daemons. - elukey@cumin1001
* 06:44 elukey@cumin1001: START - Cookbook sre.druid.roll-restart-workers for Druid test cluster: Roll restart of Druid jvm daemons. - elukey@cumin1001
* 06:41 elukey@cumin1001: END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) for Presto analytics cluster: Roll restart of all Presto's jvm daemons. - elukey@cumin1001
* 06:30 elukey@cumin1001: START - Cookbook sre.presto.roll-restart-workers for Presto analytics cluster: Roll restart of all Presto's jvm daemons. - elukey@cumin1001
* 06:26 elukey: restart archiva on archiva1002 to pick up new openjdk upgrades
* 06:11 marostegui@cumin1001: dbctl commit (dc=all): 'db1177 (re)pooling @ 100%: After fixing some indexes [[phab:T291584|T291584]]', diff saved to https://phabricator.wikimedia.org/P17324 and previous config saved to /var/cache/conftool/dbconfig/20210924-061105-root.json
* 05:56 marostegui@cumin1001: dbctl commit (dc=all): 'db1177 (re)pooling @ 75%: After fixing some indexes [[phab:T291584|T291584]]', diff saved to https://phabricator.wikimedia.org/P17323 and previous config saved to /var/cache/conftool/dbconfig/20210924-055601-root.json
* 05:40 marostegui@cumin1001: dbctl commit (dc=all): 'db1177 (re)pooling @ 50%: After fixing some indexes [[phab:T291584|T291584]]', diff saved to https://phabricator.wikimedia.org/P17322 and previous config saved to /var/cache/conftool/dbconfig/20210924-054057-root.json
* 05:25 marostegui@cumin1001: dbctl commit (dc=all): 'db1177 (re)pooling @ 25%: After fixing some indexes [[phab:T291584|T291584]]', diff saved to https://phabricator.wikimedia.org/P17321 and previous config saved to /var/cache/conftool/dbconfig/20210924-052554-root.json
* 05:10 marostegui@cumin1001: dbctl commit (dc=all): 'db1177 (re)pooling @ 10%: After fixing some indexes [[phab:T291584|T291584]]', diff saved to https://phabricator.wikimedia.org/P17320 and previous config saved to /var/cache/conftool/dbconfig/20210924-051050-root.json
* 05:07 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1177 [[phab:T291584|T291584]]', diff saved to https://phabricator.wikimedia.org/P17319 and previous config saved to /var/cache/conftool/dbconfig/20210924-050739-marostegui.json
* 01:27 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 01:23 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 01:16 krinkle@deploy1002: Synchronized wmf-config/profiler.php: {{Gerrit|I25f4b70b9d4b}} (duration: 00m 57s)
* 00:39 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 00:39 legoktm@deploy1002: Synchronized php-1.38.0-wmf.1/resources/src/mediawiki.searchSuggest/searchSuggest.js: Hiding fallback button depends on HTML order ([[phab:T291272|T291272]]) (duration: 00m 57s)
* 00:36 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
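
The entries above share a common shape: a bullet, an `HH:MM` timestamp, an actor (either `user` or `user@host`), a colon, and a free-text message that may carry `[[phab:T…|T…]]` task links. As a minimal sketch of how such lines could be split into fields, assuming that shape holds, the Python below uses a regular expression. The names `Entry`, `parse_entry`, `ENTRY_RE` and `TASK_RE` are hypothetical and are not part of any Wikimedia tooling.

```python
import re
from typing import List, NamedTuple, Optional

# Assumed line shape: "* HH:MM actor: message" (actor contains no spaces or colons).
ENTRY_RE = re.compile(r"^\* (\d{2}):(\d{2}) ([^\s:]+): (.*)$")
# Phabricator task links as they appear in this log: [[phab:T123|T123]]
TASK_RE = re.compile(r"\[\[phab:(T\d+)\|")


class Entry(NamedTuple):
    hour: int
    minute: int
    actor: str
    message: str
    tasks: List[str]


def parse_entry(line: str) -> Optional[Entry]:
    """Split one log bullet into fields; return None if the line does not match."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    hour, minute, actor, message = match.groups()
    return Entry(int(hour), int(minute), actor, message, TASK_RE.findall(message))
```

Lines that do not fit the assumed shape, such as headings or entries cut off before the actor, yield `None` rather than a partial record, so a caller can count or skip them explicitly.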


== 2021-09-23 ==
== 2022-08-05 ==
* 23:38 foks: running wm-scripts/mcdc2021/populateEditCount.php on each wiki (s1 thru s8 simultaneously) https://phabricator.wikimedia.org/T291668
* 22:20 dcausse@deploy1002: Finished deploy [wikimedia/discovery/analytics@71fe016]: Fix schedule_interval for image_recommendation_weekly (duration: 02m 01s)
* 22:58 bd808@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'toolhub' for release 'main' .
* 22:18 dcausse@deploy1002: Started deploy [wikimedia/
* 22:58 foks: creating `mcdc2021_edits` table on each wiki for elections voterlist https://phabricator.wikimedia.org/T291668
* 22:37 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 22:34 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 22:33 reedy@deploy1002: Synchronized php-1.38.0-wmf.1/extensions/SecurePoll/cli/wm-scripts/: [[phab:T291668|T291668]] (duration: 00m 57s)
* 22:27 ryankemper: [[phab:T280001|T280001]] `ryankemper@cumin1001:~$ sudo cumin 'P<nowiki>{</nowiki>puppetmaster*<nowiki>}</nowiki>' 'sudo rm -fv /var/run/confd-template/.wcqs*'` complete, forcing recheck
* 22:27 ryankemper: [[phab:T280001|T280001]] The pooling of the `wcqs*` hosts has gotten `/srv/config-master/pybal/$<nowiki>{</nowiki>DC<nowiki>}</nowiki>/wcqs` to render, but we need to clear away the stale error files to get rid of the associated warnings `Stale template error files present for '/srv/config-master/pybal/$<nowiki>{</nowiki>DC<nowiki>}</nowiki>/wcqs'` => `sudo rm -fv /var/run/confd-template/.wcqs*`
* 22:20 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 22:18 ryankemper: [[phab:T280001|T280001]] `ryankemper@puppetmaster1001:/srv$ sudo confctl select 'name=wcqs.*' set/pooled=yes:weight=10`
* 22:17 ryankemper@puppetmaster1001: conftool action : set/pooled=yes:weight=10; selector: name=wcqs.*
* 22:17 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 22:13 ryankemper: [[phab:T280001|T280001]] [codfw] `root@lvs2010:/home/ryankemper# ipvsadm -Dt 10.2.2.67:443` and `root@lvs2009:/home/ryankemper# ipvsadm -Dt 10.2.2.67:443`
* 22:13 ryankemper: [[phab:T280001|T280001]] [eqiad] `root@lvs1016:/home/ryankemper# ipvsadm -Dt 10.2.1.67:443` and `root@lvs1015:/home/ryankemper# ipvsadm -Dt 10.2.1.67:443`
* 22:06 ryankemper: [[phab:T280001|T280001]] Restarted pybal on low-traffic primaries: `ryankemper@cumin1001:~$ sudo cumin 'P<nowiki>{</nowiki>lvs2009*,lvs1015*<nowiki>}</nowiki>' 'sudo systemctl restart pybal'`
* 22:06 ryankemper: [[phab:T280001|T280001]] Waited 120s and checked https://icinga.wikimedia.org/alerts, proceeding to primary low-traffic hosts `lvs2009` and `lvs1015`
* 22:05 ryankemper: [[phab:T280001|T280001]] [Cleanup required] `TCP  10.2.1.67:443 wrr` shows up on `ryankemper@lvs1016:~$ sudo ipvsadm -L -n` and `TCP  10.2.2.67:443 wrr` shows up on `ryankemper@lvs2010:~$ sudo ipvsadm -L -n` (erroneous)
* 22:05 ryankemper: [[phab:T280001|T280001]] [Sanity check] `TCP  10.2.2.67:443 wrr` shows up on `ryankemper@lvs1016:~$ sudo ipvsadm -L -n` and `TCP  10.2.1.67:443 wrr` shows up on `ryankemper@lvs2010:~$ sudo ipvsadm -L -n` as expected
* 22:04 ryankemper: [[phab:T280001|T280001]] Restarted pybal on low-traffic backups: `ryankemper@cumin1001:~$ sudo cumin 'P<nowiki>{</nowiki>lvs2010*,lvs1016*<nowiki>}</nowiki>' 'sudo systemctl restart pybal'`
* 22:03 ryankemper: [[phab:T280001|T280001]] Restarting pybal on low-traffic backups `lvs2010` and `lvs1016`...
* 22:03 ryankemper: [[phab:T280001|T280001]] Ran puppet on all lvs hosts: `ryankemper@cumin1001:~$ sudo cumin 'O:lvs::balancer' 'sudo run-puppet-agent'`
* 22:00 ryankemper: [[phab:T280001|T280001]] Running puppet on all lvs hosts: `ryankemper@cumin1001:~$ sudo cumin 'O:lvs::balancer' 'sudo run-puppet-agent'`...
* 21:59 ryankemper: [[phab:T280001|T280001]] Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/723315, ran puppet agent on `wcqs*` to fix `local lo:LVS destination IPs`
* 21:59 ryankemper: [[phab:T280001|T280001]] Swapped the netbox IPAM addresses back, after erroneously swapping them earlier. `sre.dns.netbox` cookbook run complete as well
* 21:57 ryankemper@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 21:53 ryankemper@cumin1001: START - Cookbook sre.dns.netbox
* 21:43 bd808@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'toolhub' for release 'main' .
* 14:35 pt1979@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:35 pt1979@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:29 pt1979@cumin2002: START - Cookbook sre.dns.netbox
* 14:34 jbond: upload data-generators-clojure to puppet7 component
* 14:20 mutante: mw2264 - scap pull
* 14:31 pt1979@cumin2002: START - Cookbook sre.dns.netbox
* 14:18 elukey@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
* 14:18 elukey@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
* 13:11 jiji@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mc1027.eqiad.wmnet
* 13:10 dcausse: installing openjdk-8-dbg on wdqs2007
* 13:04 jiji@cumin1001: START - Cookbook sre.hosts.decommission for hosts mc1027.eqiad.wmnet
* 13:02 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc1023.eqiad.wmnet
* 12:48 jiji@cumin1001: START - Cookbook sre.hosts.decommission for hosts mc1023.eqiad.wmnet
* 12:46 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc[1035-1036].eqiad.wmnet
* 12:32 jiji@cumin1001: START - Cookbook sre.hosts.decommission for hosts mc[1035-1036].eqiad.wmnet
* 12:12 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc[1028-1032].eqiad.wmnet
* 12:03 joal@deploy1002: Finished deploy [analytics/refinery@7208d3d] (thin): Analytics hotfix deploy (bis) THIN [analytics/refinery@7208d3d] (duration: 00m 06s)
* 12:03 joal@deploy1002: Started deploy [analytics/refinery@7208d3d] (thin): Analytics hotfix deploy (bis) THIN [analytics/refinery@7208d3d]
* 12:03 joal@deploy1002: Finished deploy [analytics/refinery@7208d3d]: Analytics hotfix deploy (bis)[analytics/refinery@7208d3d] (duration: 19m 16s)
* 11:56 dcausse@deploy1002: Finished deploy [wdqs/wdqs@8361ac9]: ban queries from a generic UA (duration: 19m 21s)
* 11:44 joal@deploy1002: Started deploy [analytics/refinery@7208d3d]: Analytics hotfix deploy (bis)[analytics/refinery@7208d3d]
* 11:42 marostegui: Remove flaggedrevs_stats2 and flaggedrevs_stats from enwiki - [[phab:T289050|T289050]]
* 11:37 dcausse@deploy1002: Started deploy [wdqs/wdqs@8361ac9]: ban queries from a generic UA
* 11:36 dcausse@deploy1002: Finished deploy [wdqs/wdqs@8361ac9]: ban queries from a generic UA (duration: 01m 07s)
* 11:35 dcausse@deploy1002: Started deploy [wdqs/wdqs@8361ac9]: ban queries from a generic UA
* 10:58 jiji@cumin1001: START - Cookbook sre.hosts.decommission for hosts mc[1028-1032].eqiad.wmnet
* 10:54 jiji@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mc[1025-1026].eqiad.wmnet
* 10:47 joal@deploy1002: Finished deploy [analytics/aqs/deploy@d273fde] (aqs-next): Deploy latest code on AQS new servers - test after failures (duration: 00m 32s)
* 10:46 joal@deploy1002: Started deploy [analytics/aqs/deploy@d273fde] (aqs-next): Deploy latest code on AQS new servers - test after failures
* 10:45 joal@deploy1002: deploy aborted: Deploy latest code on AQS new servers - test after failures (duration: 00m 05s)
* 10:45 joal@deploy1002: Started deploy [analytics/aqs/deploy@d273fde] (aqs-test): Deploy latest code on AQS new servers - test after failures
* 10:29 hnowlan@deploy1002: Finished deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts (duration: 00m 03s)
* 10:29 hnowlan@deploy1002: Started deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts
* 10:22 hnowlan@deploy1002: Finished deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts (duration: 00m 55s)
* 10:21 hnowlan@deploy1002: Started deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts
* 10:17 hnowlan@deploy1002: Finished deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts (duration: 00m 36s)
* 10:16 hnowlan@deploy1002: Started deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts
* 10:08 hnowlan@deploy1002: Finished deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts (duration: 00m 45s)
* 10:08 hnowlan@deploy1002: Started deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts
* 10:05 hnowlan@deploy1002: Finished deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts (duration: 00m 36s)
* 10:04 hnowlan@deploy1002: Started deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts
* 10:02 hnowlan@deploy1002: Finished deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts (duration: 01m 25s)
* 10:01 hnowlan@deploy1002: Started deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts
* 10:00 hnowlan@deploy1002: Finished deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts (duration: 01m 53s)
* 09:58 hnowlan@deploy1002: Started deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts
* 09:57 hnowlan@deploy1002: Finished deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts (duration: 00m 09s)
* 09:57 hnowlan@deploy1002: Started deploy [analytics/aqs/deploy@d273fde] (aqs-next): deploying aqs to inactive aqs-next hosts
* 09:32 joal@deploy1002: Finished deploy [analytics/refinery@4ff8979] (thin): Analytics hotfix deploy THIN [analytics/refinery@4ff8979] (duration: 00m 07s)
* 09:32 joal@deploy1002: Started deploy [analytics/refinery@4ff8979] (thin): Analytics hotfix deploy THIN [analytics/refinery@4ff8979]
* 09:26 joal@deploy1002: Finished deploy [analytics/refinery@4ff8979]: Analytics hotfix deploy [analytics/refinery@4ff8979] (duration: 17m 36s)
* 09:25 jiji@cumin1001: START - Cookbook sre.hosts.decommission for hosts mc[1025-1026].eqiad.wmnet
* 09:15 jelto@deploy1002: helmfile [codfw] DONE helmfile.d/admin 'apply'.
* 09:14 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc1022.eqiad.wmnet
* 09:13 jelto@deploy1002: helmfile [codfw] START helmfile.d/admin 'apply'.
* 09:09 akosiaris@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mathoid' for release 'production' .
* 09:09 jelto@deploy1002: helmfile [eqiad] DONE helmfile.d/admin 'apply'.
* 09:09 joal@deploy1002: Started deploy [analytics/refinery@4ff8979]: Analytics hotfix deploy [analytics/refinery@4ff8979]
* 09:08 jelto@deploy1002: helmfile [eqiad] START helmfile.d/admin 'apply'.
* 09:06 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mathoid' for release 'production' .
* 09:03 jelto@deploy1002: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
* 09:03 jelto@deploy1002: helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
* 08:53 jelto@deploy1002: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 08:52 jelto@deploy1002: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 08:45 ema: cp-eqsin: clean apt cache to free up some space [[phab:T290305|T290305]]
* 08:45 jiji@cumin1001: START - Cookbook sre.hosts.decommission for hosts mc1022.eqiad.wmnet
* 08:23 akosiaris@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'staging' .
* 07:43 legoktm: uploaded pygments 2.10.0+dfsg-1~wmf1 to apt.wm.o in component/pygments
* 07:42 marostegui: Remove flaggedrevs_stats2 and flaggedrevs_stats from several s3 wikis - [[phab:T289050|T289050]]
* 07:10 godog: more weight to ms-be20[62-65] - [[phab:T288458|T288458]]
* 07:01 marostegui@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 06:57 marostegui@cumin1001: START - Cookbook sre.dns.netbox
* 06:45 elukey: run `apt-get clean` on cp5012 to free some space (94% of the root partition used)
* 06:12 marostegui@cumin1001: dbctl commit (dc=all): 'db2138:3314 (re)pooling @ 100%: Slowly repool after reimage [[phab:T288803|T288803]]', diff saved to https://phabricator.wikimedia.org/P17203 and previous config saved to /var/cache/conftool/dbconfig/20210903-061204-root.json
* 06:11 marostegui@cumin1001: dbctl commit (dc=all): 'db2138:3312 (re)pooling @ 100%: Slowly repool after reimage [[phab:T288803|T288803]]', diff saved to https://phabricator.wikimedia.org/P17202 and previous config saved to /var/cache/conftool/dbconfig/20210903-061138-root.json
* 05:57 marostegui@cumin1001: dbctl commit (dc=all): 'db2138:3314 (re


== 2021-08-04 ==
== 2022-08-04 ==
* 22:18 ebernhardson@deploy1002: Finished deploy [wikimedia/discovery/analytics@34cd541]: gerrit:709835 and 709836 (duration: 06m 52s)
* 23:07 mutante: switching gerrit-replica.wikimedia.org to new machine gerrit2002, dropping gerrit-replica-new.wikimedia.org [[phab:T313250|T313250]]
* 22:11 ebernhardson@deploy1002: Started deploy [wikimedia/discovery/analytics@34cd541]: gerrit:709835 and 709836
* 21:07 ryankemper@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 20:56 legoktm@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:59 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:21 legoktm@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:57 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 19:22 dduvall: 1.37.0-wmf.17 promoted to group1. no new errors or troubling error rates spotted (
* 20:57 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:56 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:56 thcipriani@deploy1002: Finished scap: Backport for [[gerrit:819774]] tkwiki: Update wordmark (duration: 06m 12s)
* 20:51 ryankemper@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 20:51 ryankemper@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 20:51 ryankemper@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 20:50 thcipriani@deploy1002: Started scap: Backport for [[gerrit:819774]] tkwiki: Update wordmark
* 20:48 thcipriani@deploy1002: Finished scap: Backport for [[gerrit:812391]] [config]: Add click event logging for mobile and desktop (duration: 39m 16s)
* 20:45 ryankemper@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 20:24 ryankemper@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 20:23 ryankemper@deploy1002: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
* 20:22 ryankemper@deploy1002: helmfile [staging] START helmfile.d/


== 2021-08-03 ==
== 2022-08-03 ==
* 23:34 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 23:59 dzahn@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit1001.wikimedia.org with reason: service restart
* 23:28 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 23:50 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32270 and previous config saved to /var/cache/conftool/dbconfig/20220803-235030-marostegui.json
* 23:17 ebernhardson@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:709770{{!}}Re-enable commonswiki sister search (T277225)]] (duration:
* 22:50 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1170:3312 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32269 and previous config saved to /var/cache/conftool/dbconfig/20220803-225015-marostegui.json
* 22:50 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
* 22:49 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
* 22:49 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 9 hosts with reason: Maintenance
* 22:49 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on 9 hosts with reason: Maintenance
* 22:49 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance
* 22:49 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance
* 22:48 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
* 22:48 marostegui@cumin1001: START - Cookbook


== 2021-08-02 ==
== 2022-08-02 ==
* 23:58 legoktm@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 22:39 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 23:50 legoktm@cumin1001: START - Cookbook sre.dns.netbox
* 22:32 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 23:38 legoktm@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'shellbox-constraints' for release 'main' .
* 22:32 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 23:38 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 22:25 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 23:38 legoktm@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'shellbox-constraints' for release 'main' .
* 22:15 mutante: gerrit - syncing data (/srv/gerrit /var/lib/gerrit2/review_site  /home) again after gerrit2002 was reimaged with buster [[phab:T313250|T313250]] [[phab:T313972|T313972]]
* 23:35 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 22:04 dancy@deploy1002: Finished deploy [gerrit/gerrit@94c5028]: (no justification provided) (duration: 00m 06s)
* 23:28 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 22:04 dancy@deploy1002: Started deploy [gerrit/gerrit@94c5028]: (no justification provided)
* 23:25 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 22:00 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 23:21 legoktm: Previous sync also deployed {{Gerrit|c38998f03f}} "Stop enabling DPL on new wikis" ([[phab:T287380|T287380]])
* 21:59 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 23:18 legoktm@deploy1002: Synchronized dblists/: Move ruwikinews to large wikis dblist (2/2) (duration: 00m 56s)
* 21:59 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 23:16 legoktm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Move ruwikinews to large wikis dblist (1/2) (duration: 00m 57s)
* 21:58 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 21:31 tzatziki: removing 1 file for legal compliance
* 21:58 dancy@deploy1002: rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.23  refs [[phab:T308076|T308076]]
* 21:16 tzatziki: removing 7 files for legal compliance
* 21:53 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 19:35 razzi@cumin1001: END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid's jvm daemons. - razzi@cumin1001
* 21:47 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 19:15 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 21:46 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 19:13 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 21:40 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 19:05 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 21:35 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 19:01 urbanecm: Run extensions/GrowthExperiments/maintenance/initWikiConfig.php on a couple of wikis to init on-wiki config for Growth features ([[phab:T287868|T287868]], [[phab:T287874|T287874]], [[phab:T287873|T287873]])
* 21:29 ladsgroup@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/CirrusSearch/includes/Sanity/Checker.php: Backport: [[gerrit:819621{{!}}Fix appending of join conds (T312421 T314439)]] (duration: 03m 15s)
* 19:00 urbanecm: Morning B&C window completed
* 21:28 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 19:00 urbanecm@deploy1002: Synchronized dblists/growthexperiments.dblist: {{Gerrit|bebf4a9819f80e19cbb94f115f47c1ff4d05b7d2}}: Enable Growth features on a couple of wikis in dark mode ([[phab:T287868|T287868]], [[phab:T287874|T287874]], [[phab:T287873|T287873]]; 2/2) (duration: 00m 56s)
* 21:28 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 18:59 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|bebf4a9819f80e19cbb94f115f47c1ff4d05b7d2}}: Enable Growth features on a couple of wikis in dark mode ([[phab:T287868|T287868]], [[phab:T287874|T287874]], [[phab:T287873|T287873]]; 1/2) (duration: 00m 57s)
* 21:27 bking@cumin1001: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: deploy wmf-elasticsearch-search-plugins pkg - bking@cumin1001 - [[phab:T314078|T314078]]
* 18:58 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 21:21 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 18:57 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 21:11 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gerrit2002.wikimedia.org with OS buster
* 18:50 otto@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Stream config for android_notification_interaction - [[phab:T287652|T287652]] (duration: 00m 56s)
* 21:01 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 18:50 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 21:01 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 18:49 razzi@cumin1001: START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid's jvm daemons. - razzi@cumin1001
* 21:00 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 18:49 urbanecm: Run extensions/GrowthExperiments/maintenance/initWikiConfig.php on a couple of wikis to init on-wiki config for Growth features ([[phab:T287876|T287876]], [[phab:T287871|T287871]], [[phab:T287878|T287878]], [[phab:T287880|T287880]], [[phab:T287875|T287875]], [[phab:T287879|T287879]], [[phab:T287872|T287872]])
* 20:59 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 18:48 razzi@cumin1001: END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid public cluster: Roll restart of Druid's jvm daemons. - razzi@cumin1001
* 20:58 dancy@deploy1002: rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.22  refs [[phab:T308076|T308076]]
* 18:47 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:54 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 18:46 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|18cd360773a2a236f9817ac0a4eaf3790b6d8cff}}: Growth features: Enable features in dark mode on a few wikis ([[phab:T287876|T287876]], [[phab:T287871|T287871]], [[phab:T287878|T287878]], [[phab:T287880|T287880]], [[phab:T287875|T287875]], [[phab:T287879|T287879]], [[phab:T287872|T287872]]; 2/2) (duration: 00m 56s)
* 20:53 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 18:45 urbanecm@deploy1002: Synchronized dblists/growthexperiments.dblist: {{Gerrit|18cd360773a2a236f9817ac0a4eaf3790b6d8cff}}: Growth features: Enable features in dark mode on a few wikis ([[phab:T287876|T287876]], [[phab:T287871|T287871]], [[phab:T287878|T287878]], [[phab:T287880|T287880]], [[phab:T287875|T287875]], [[phab:T287879|T287879]], [[phab:T287872|T287872]]; 1/2) (duration: 00m 56s)
* 20:53 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 18:41 urbanecm: Create GrowthExperiments database tables for a bunch of wikis ([[phab:T287876|T287876]], [[phab:T287871|T287871]], [[phab:T287878|T287878]], [[phab:T287880|T287880]], [[phab:T287875|T287875]], [[phab:T287879|T287879]], [[phab:T287872|T287872]])
* 20:53 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gerrit2002.wikimedia.org with reason: host reimage
* 18:40 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|ee47f9d9a867f0bc419928c010579fb4f6fea425}}: Add rollbacker group for kswiki ([[phab:T286789|T286789]]) (duration: 00m 56s)
* 20:52 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 18:35 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:51 dzahn@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on gerrit2002.wikimedia.org with reason: host reimage
* 18:33 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|eec997cf88437fc6e2e27a835301aef968c548c4}}: Enable SUL autologin for wikimania.wikimedia.org ([[phab:T285197|T285197]]) (duration: 00m 55s)
* 20:50 dancy@deploy1002: rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.23  refs [[phab:T308076|T308076]]
* 18:33 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:38 mutante: re-imaging gerrit2002 with buster - because it's on bullseye, needs git-fat and that has not been ported to python3 yet which blocks upgrading gerrit machines otherwise [[phab:T313250|T313250]] [[phab:T243027|T243027]] [[phab:T279509|T279509]]
* 18:31 urbanecm@deploy1002: Synchronized php-1.37.0-wmf.16/extensions/GrowthExperiments/modules/: {{Gerrit|05cf1d6de1695d2e38531f3fecb26381f4dc0b1d}}: Add a link: Show article extract instead of description in the link inspector ([[phab:T287636|T287636]]; 2/2) (duration: 00m 56s)
* 20:37 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 18:30 urbanecm@deploy1002: Synchronized php-1.37.0-wmf.16/extensions/GrowthExperiments/extension.json: {{Gerrit|05cf1d6de1695d2e38531f3fecb26381f4dc0b1d}}: Add a link: Show article extract instead of description in the link inspector ([[phab:T287636|T287636]]; 1/2) (duration: 00m 57s)
* 20:36 dzahn@cumin2002: START - Cookbook sre.hosts.reimage for host gerrit2002.wikimedia.org with OS buster
* 18:25 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:36 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 18:22 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:36 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 18:20 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|cc8ca452e66994c211efd684b7ed3810bdc84aaf}}: Add tewikisource as import source for tewikibooks ([[phab:T286978|T286978]]) (duration: 00m 56s)
* 20:36 urbanecm: UTC evening B&C window done
* 18:15 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|11e96bab3375d604126619169964a2db96808152}}: Add media.defense.gov to the wgCopyUploadsDomains allowlist of Wikimedia Commons ([[phab:T287264|T287264]]) (duration: 00m 56s)
* 20:35 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 18:11 urbanecm@deploy1002: Synchronized static/images/project-logos/: {{Gerrit|97b68972108feaf52ab328991f563617f3594d81}}: Remove unused enwiki celebration logos ([[phab:T272108|T272108]]) (duration: 00m 57s)
* 20:33 urbanecm@deploy1002: Synchronized php-1.39.0-wmf.23/includes/Rest/Handler/HTMLTransformInput.php: {{Gerrit|69e91528a5c6f372af520307dc2f4227b9981442}}: ParsoidHandler: fix page bundle input with no orig HTML (duration: 03m 22s)
* 18:07 urbanecm@deploy1002: Synchronized static/images/project-logos/: {{Gerrit|16f97941b7d8eacc9bddae7bc570e03b031bead2}}: Remove unused eswiki celebration logos ([[phab:T280908|T280908]]) (duration: 00m 57s)
* 20:30 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 17:57 razzi@cumin1001: START - Cookbook sre.druid.roll-restart-workers for Druid public cluster: Roll restart of Druid's jvm daemons. - razzi@cumin1001
* 20:29 urbanecm@deploy1002: Synchronized php-1.39.0-wmf.23/includes/Rest/Handler/ParsoidHandler.php: {{Gerrit|322a960e3777bc01fa8823908340c36e3851a648}}: ParsoidHandler: pass metrics object to HTMLTransformInput (duration: 03m 19s)
* 15:44 jynus: remove s2 from db1139 [[phab:T287230|T287230]]
* 20:27 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 14:37 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lvs2010.codfw.wmnet with reason: NIC maintenance
* 20:27 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 14:37 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on lvs2010.codfw.wmnet with reason: NIC maintenance
* 20:22 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:21 volans: uploaded spicerack_0.0.57 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia
* 20:20 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|5fac0aaf8e76a6f8cc3302771eac068e4f866e5f}}: GrowthExperiments: Remove wgGEHomepageTutorialTitle (duration: 03m 26s)
* 13:03 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit1001.wikimedia.org with reason: apply 706049
* 20:06 dancy@deploy1002: Finished scap: Backport for [[gerrit:819612]] Revert "Bump wikimedia/parsoid to 0.16.0-a18" (duration: 11m 30s)
* 13:03 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit1001.wikimedia.org with reason: apply 706049
* 20:01 dancy@deploy1002: Finished deploy [gerrit/gerrit@94c5028]: (no justification provided) (duration: 00m 05s)
* 13:02 mutante: gerrit1001 - restarting service after 706049
* 20:01 dancy@deploy1002: Started deploy [gerrit/gerrit@94c5028]: (no justification provided)
* 13:00 mutante: gerrit1001 - re-enabling puppet, deploying sshd listening / firewall change
* 19:59 dancy@deploy1002: Finished deploy [gerrit/gerrit@94c5028]: (no justification provided) (duration: 00m 01s)
* 12:38 mutante: gerrit2001 - restarting gerrit after deploying 706049
* 19:59 dancy@deploy1002: Started deploy [gerrit/gerrit@94c5028]: (no justification provided)
* 12:20 mutante: gerrit servers: disabling puppet
* 19:55 dancy@deploy1002: Started scap: Backport for [[gerrit:819612]] Revert "Bump wikimedia/parsoid to 0.16.0-a18"
* 12:10 reedy@deploy1002: Synchronized php-1.37.0-wmf.16/extensions/MobileFrontend/: [[phab:T287528|T287528]] (duration: 00m 57s)
* 19:42 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 12:08 reedy@deploy1002: Synchronized php-1.37.0-wmf.16/extensions/SecurePoll/: [[phab:T287780|T287780]] (duration: 00m 57s)
* 19:37 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2034.codfw.wmnet,service=ats-tls
* 11:54 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw1288.eqiad.wmnet
* 19:37 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2034.codfw.wmnet,service=varnish-fe
* 11:42 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw1288.eqiad.wmnet
* 19:37 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2034.codfw.wmnet,service=ats-be
* 11:42 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw1271.eqiad.wmnet
* 19:36 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 11:40 reedy@deploy1002: Synchronized php-1.37.0-wmf.16/extensions/SecurePoll/: [[phab:T287782|T287782]] (duration: 00m 56s)
* 19:36 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 11:31 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw1271.eqiad.wmnet
* 19:36 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2033.codfw.wmnet,service=ats-tls
* 11:29 hashar: restarting gerrit primary server on gerrit1001
* 19:36 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2033.codfw.wmnet,service=varnish-fe
* 11:27 hashar: restarting Jenkins on contint2001
* 19:36 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2033.codfw.wmnet,service=ats-be
* 11:27 hashar: restarting Jenkins on contint1001
* 19:36 mvernon@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be[2041,2046].codfw.wmnet
* 11:26 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 19:35 mvernon@cumin2002: START - Cookbook sre.hosts.remove-downtime for ms-be[2041,2046].codfw.wmnet
* 11:24 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 19:29 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 11:18 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw1288.eqiad.wmnet
* 19:28 mvernon@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for thanos-fe2002.codfw.wmnet
* 11:17 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 19:28 mvernon@cumin2002: START - Cookbook sre.hosts.remove-downtime for thanos-fe2002.codfw.wmnet
* 11:16 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw1271.eqiad.wmnet
* 19:26 mvernon@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-fe2010.codfw.wmnet
* 11:14 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 19:26 mvernon@cumin2002: START - Cookbook sre.hosts.remove-downtime for ms-fe2010.codfw.wmnet
* 11:13 urbanecm: EU B&C window completed
* 19:21 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2032.codfw.wmnet,service=ats-tls
* 11:10 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|43020b72e8f466188d738aa73f2023f3017804d0}}: votewiki: Enable Single Transferable Vote ([[phab:T283728|T283728]]) (duration: 00m 57s)
* 19:21 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2032.codfw.wmnet,service=varnish-fe
* 11:08 moritzm: installing openjdk-11 security updates
* 19:21 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2032.codfw.wmnet,service=ats-be
* 11:05 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|26bcaafdcd57b1b7a78f9e0ad000325baaf36a72}}: Restore logging for mediamoderation script to better understand high error rate occurring when running script ([[phab:T287511|T287511]]) (duration: 00m 57s)
* 19:17 rzl@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on mc2038.codfw.wmnet with reason: install
* 07:53 moritzm: catch up bullseye installs with latest state of testing
* 19:17 rzl@cumin2002: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on mc2038.codfw.wmnet with reason: install
* 07:24 moritzm: installing libsndfile security updates on buster
* 19:13 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2031.codfw.wmnet,service=ats-tls
* 07:12 moritzm: installing aspell security updates
* 19:13 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2031.codfw.wmnet,service=varnish-fe
* 05:01 oblivian@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 19:13 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2031.codfw.wmnet,service=ats-be
* 04:58 oblivian@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 19:11 mutante: gerrit1001 - rsyncing /home/ to gerrit2002:/srv/home-gerrit1001.wikimedia.org [[phab:T313250|T313250]]
* 02:01 tstarling@deploy1002: Synchronized src/defines.php: for consistency only, should have no production impact (duration: 00m 57s)
* 19:01 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on gerrit2002.wikimedia.org with reason: new machine
* 19:01 dzahn@cumin2002: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on gerrit2002.wikimedia.org with reason: new machine
* 18:55 dancy@deploy1002: Finished scap: testwikis wikis to 1.39.0-wmf.23 refs [[phab:T308076|T308076]] (duration: 50m 39s)
* 18:54 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 18:52 ejegg: updated payments-wiki from {{Gerrit|589bb64e}} to {{Gerrit|e1b6036a}} (just i18n changes in extensions)
* 18:47 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 18:47 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 18:46 bking@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: deploy wmf-elasticsearch-search-plugins pkg - bking@cumin1001 - [[phab:T314078|T314078]]
* 18:46 rzl@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mc2038.codfw.wmnet with reason: install
* 18:45 rzl@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on mc2038.codfw.wmnet with reason: install
* 18:41 rzl@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2038.codfw.wmnet
* 18:41 rzl@cumin2002: START - Cookbook sre.hosts.remove-downtime for mc2038.codfw.wmnet
* 18:39 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 18:19 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 18:18 rzl@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2038.codfw.wmnet with reason: install
* 18:18 rzl@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on mc2038.codfw.wmnet with reason: install
* 18:17 rzl@cumin2002: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2038.codfw.wmnet with reason: install
* 18:17 rzl@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2038.codfw.wmnet with reason: install
* 18:17 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 18:17 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 18:16 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lvs2008.codfw.wmnet with reason: shutdown for PDU upgrade
* 18:16 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on lvs2008.codfw.wmnet with reason: shutdown for PDU upgrade
* 18:16 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 18:11 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 18:08 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 18:08 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 18:05 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 18:04 dancy@deploy1002: Started scap: testwikis wikis to 1.39.0-wmf.23 refs [[phab:T308076|T308076]]
* 17:52 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1168 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32185 and previous config saved to /var/cache/conftool/dbconfig/20220802-175233-marostegui.json
* 17:43 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool db2159', diff saved to https://phabricator.wikimedia.org/P32184 and previous config saved to /var/cache/conftool/dbconfig/20220802-174311-ladsgroup.json
* 17:37 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P32183 and previous config saved to /var/cache/conftool/dbconfig/20220802-173723-marostegui.json
* 17:35 moritzm: installing node-moment security updates
* 17:32 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic[2041-2042,2057].codfw.wmnet with reason: [[phab:T310070|T310070]]
* 17:32 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic[2041-2042,2057].codfw.wmnet with reason: [[phab:T310070|T310070]]
* 17:27 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2013.codfw.wmnet
* 17:25 moritzm: installing fribidi security updates
* 17:22 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P32182 and previous config saved to /var/cache/conftool/dbconfig/20220802-172217-marostegui.json
* 17:20 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2030.codfw.wmnet,service=ats-tls
* 17:20 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2030.codfw.wmnet,service=varnish-fe
* 17:20 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2030.codfw.wmnet,service=ats-be
* 17:18 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host ganeti2013.codfw.wmnet
* 17:07 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1168 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32181 and previous config saved to /var/cache/conftool/dbconfig/20220802-170711-marostegui.json
* 17:06 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc[2042-2043].codfw.wmnet with reason: shutdown for PDU upgrade
* 17:06 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on mc[2042-2043].codfw.wmnet with reason: shutdown for PDU upgrade
* 17:05 Emperor: ms-be20[31,32,41,46].codfw.wmnet,ms-fe2010.codfw.wmnet,thanos-fe2002.codfw.wmnet downtime for PDU work [[phab:T309957|T309957]]
* 17:05 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1168 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32180 and previous config saved to /var/cache/conftool/dbconfig/20220802-170503-marostegui.json
* 17:04 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance
* 17:04 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance
* 17:04 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
* 17:04 mvernon@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: shutdown for PDU replacement
* 17:04 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
* 17:04 mvernon@cumin2002: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: shutdown for PDU replacement
* 17:04 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance
* 17:03 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance
* 17:03 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance
* 17:03 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance
* 17:03 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1131 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32179 and previous config saved to /var/cache/conftool/dbconfig/20220802-170333-marostegui.json
* 17:01 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2029.codfw.wmnet,service=ats-tls
* 17:01 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2029.codfw.wmnet,service=varnish-fe
* 17:01 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2029.codfw.wmnet,service=ats-be
* 17:00 mvernon@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be[2030,2045,2052].codfw.wmnet
* 17:00 mvernon@cumin2002: START - Cookbook sre.hosts.remove-downtime for ms-be[2030,2045,2052].codfw.wmnet
* 16:57 btullis@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host an-airflow1004.eqiad.wmnet
* 16:54 hnowlan@deploy1002: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync
* 16:53 hnowlan@deploy1002: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync
* 16:51 hnowlan@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
* 16:49 hnowlan@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
* 16:48 hnowlan@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
* 16:48 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P32178 and previous config saved to /var/cache/conftool/dbconfig/20220802-164827-marostegui.json
* 16:38 hnowlan@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
* 16:35 hnowlan@deploy1002: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: sync
* 16:35 hnowlan@deploy1002: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: sync
* 16:33 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P32177 and previous config saved to /var/cache/conftool/dbconfig/20220802-163321-marostegui.json
* 16:29 dancy@mwmaint1002: pull aborted:  (duration: 00m 07s)
* 16:25 rzl: rzl@stat1007:~$ sudo systemctl stop wmde-analytics-daily-early  # wedged, timer will restart it now with max_runtime_seconds
* 16:18 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1131 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32176 and previous config saved to /var/cache/conftool/dbconfig/20220802-161815-marostegui.json
* 16:16 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1131 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32175 and previous config saved to /var/cache/conftool/dbconfig/20220802-161607-marostegui.json
* 16:16 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance
* 16:15 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance
* 16:15 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1165 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32174 and previous config saved to /var/cache/conftool/dbconfig/20220802-161545-marostegui.json
* 16:10 btullis@cumin1001: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-airflow1004.eqiad.wmnet on all recursors
* 16:10 btullis@cumin1001: START - Cookbook sre.dns.wipe-cache an-airflow1004.eqiad.wmnet on all recursors
* 16:10 btullis@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:05 btullis@cumin1001: START - Cookbook sre.dns.netbox
* 16:05 btullis@cumin1001: START - Cookbook sre.ganeti.makevm for new host an-airflow1004.eqiad.wmnet
* 16:00 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P32173 and previous config saved to /var/cache/conftool/dbconfig/20220802-160039-marostegui.json
* 15:51 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2056.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:50 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2056.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:49 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2040.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:49 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2040.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:46 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2039.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:45 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2039.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:45 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P32172 and previous config saved to /var/cache/conftool/dbconfig/20220802-154533-marostegui.json
* 15:37 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc[2040-2041].codfw.wmnet with reason: shutdown for PDU upgrade
* 15:37 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on mc[2040-2041].codfw.wmnet with reason: shutdown for PDU upgrade
* 15:36 bking@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host elastic2037.codfw.wmnet
* 15:36 bking@cumin1001: START - Cookbook sre.hosts.reboot-single for host elastic2037.codfw.wmnet
* 15:30 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1165 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32171 and previous config saved to /var/cache/conftool/dbconfig/20220802-153027-marostegui.json
* 15:28 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1165 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32170 and previous config saved to /var/cache/conftool/dbconfig/20220802-152818-marostegui.json
* 15:28 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
* 15:27 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
* 15:27 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance
* 15:27 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance
* 15:27 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32169 and previous config saved to /var/cache/conftool/dbconfig/20220802-152740-marostegui.json
* 15:24 moritzm: installing gnupg2 security updates
* 15:15 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2024.codfw.wmnet with reason: shutdown for PDU upgrade
* 15:15 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on mc2024.codfw.wmnet with reason: shutdown for PDU upgrade
* 15:13 jbond@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetmaster1004.eqiad.wmnet with OS buster
* 15:12 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P32167 and previous config saved to /var/cache/conftool/dbconfig/20220802-151234-marostegui.json
* 15:10 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ganeti-test[2001-2003].codfw.wmnet with reason: Power down for PDU maintenance, [[phab:T310070|T310070]]
* 15:10 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 4:00:00 on ganeti-test[2001-2003].codfw.wmnet with reason: Power down for PDU maintenance, [[phab:T310070|T310070]]
* 15:08 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on thanos-be2001.codfw.wmnet with reason: pdu
* 15:08 root@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on thanos-be2001.codfw.wmnet with reason: pdu
* 15:07 mvernon@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be[2030,2045,2052].codfw.wmnet with reason: shutdown for PDU replacement
* 15:07 mvernon@cumin2002: START - Cookbook sre.hosts.downtime for 3:00:00 on ms-be[2030,2045,2052].codfw.wmnet with reason: shutdown for PDU replacement
* 15:06 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mc-gp2002.codfw.wmnet with reason: Power down for PDU maintenance, [[phab:T310070|T310070]]
* 15:06 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 4:00:00 on mc-gp2002.codfw.wmnet with reason: Power down for PDU maintenance, [[phab:T310070|T310070]]
* 15:04 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2037.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:04 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2037.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:01 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: shutdown for PDU upgrade
* 15:00 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: shutdown for PDU upgrade
* 14:59 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2025.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 14:59 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2025.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 14:58 oblivian@puppetmaster1001: conftool action : set/pooled=false; selector: dnsdisc=(appservers{{!}}api)-ro,name=codfw
* 14:57 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P32166 and previous config saved to /var/cache/conftool/dbconfig/20220802-145728-marostegui.json
* 14:54 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2060.codfw.wmnet with OS bullseye
* 14:53 jbond@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetmaster1004.eqiad.wmnet with reason: host reimage
* 14:50 moritzm: uploaded gnupg2 2.1.18-8~deb9u4+wmf1 to stretch-wikimedia
* 14:50 jbond@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on puppetmaster1004.eqiad.wmnet with reason: host reimage
* 14:42 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32164 and previous config saved to /var/cache/conftool/dbconfig/20220802-144222-marostegui.json
* 14:40 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1113:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32163 and previous config saved to /var/cache/conftool/dbconfig/20220802-144013-marostegui.json
* 14:40 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
* 14:39 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
* 14:39 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32162 and previous config saved to /var/cache/conftool/dbconfig/20220802-143952-marostegui.json
* 14:37 jbond@cumin1001: START - Cookbook sre.hosts.reimage for host puppetmaster1004.eqiad.wmnet with OS buster
* 14:32 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2060.codfw.wmnet with reason: host reimage
* 14:28 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2060.codfw.wmnet with reason: host reimage
* 14:24 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P32161 and previous config saved to /var/cache/conftool/dbconfig/20220802-142446-marostegui.json
* 14:23 Emperor: shutdown ms-be20[30,45,52] for PDU work [[phab:T309957|T309957]]
* 14:22 mvernon@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be[2030,2045,2052].codfw.wmnet with reason: shutdown for PDU replacement
* 14:21 mvernon@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on ms-be[2030,2045,2052].codfw.wmnet with reason: shutdown for PDU replacement
* 14:12 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic2060.codfw.wmnet with OS bullseye
* 14:09 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P32160 and previous config saved to /var/cache/conftool/dbconfig/20220802-140940-marostegui.json
* 14:05 jbond@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetmaster2004.codfw.wmnet with OS buster
* 14:04 godog: grow sda/sdb 3 by 100G on thanos-be1001 - [[phab:T314275|T314275]]
* 14:03 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on centrallog2002.codfw.wmnet with reason: pdu
* 14:03 root@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on centrallog2002.codfw.wmnet with reason: pdu
* 14:01 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on prometheus2005.codfw.wmnet with reason: pdu
* 14:01 root@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on prometheus2005.codfw.wmnet with reason: pdu
* 13:57 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2030.codfw.wmnet,service=ats-tls
* 13:57 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2032.codfw.wmnet,service=ats-be
* 13:57 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2031.codfw.wmnet,service=ats-be
* 13:56 godog: schedule poweroff for centrallog2002 at 16 utc - [[phab:T310070|T310070]]
* 13:54 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[3-4].codfw.wmnet,service=ats-be
* 13:54 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32159 and previous config saved to /var/cache/conftool/dbconfig/20220802-135435-marostegui.json
* 13:53 godog: depool and poweroff prometheus2005 - [[phab:T310070|T310070]]
* 13:53 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[3-4].codfw.wmnet,service=ats-tls
* 13:53 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[3-4].codfw.wmnet,service=ats-tls
* 13:53 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[3-4].codfw.wmnet,service=varnish-fe
* 13:52 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1098:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32158 and previous config saved to /var/cache/conftool/dbconfig/20220802-135226-marostegui.json
* 13:52 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
* 13:52 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[1-2].codfw.wmnet,service=ats-tls
* 13:51 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
* 13:51 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32157 and previous config saved to /var/cache/conftool/dbconfig/20220802-135155-marostegui.json
* 13:51 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[1-2].codfw.wmnet,service=ats-tls
* 13:51 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[1-2].codfw.wmnet,service=varnish-fe
* 13:50 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2030.codfw.wmnet,service=ats-be
* 13:50 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2030.codfw.wmnet,service=varnish-fe
* 13:50 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2030.codfw.wmnet,service=ats-be
* 13:50 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2029.codfw.wmnet,service=ats-tls
* 13:50 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2029.codfw.wmnet,service=varnish-fe
* 13:50 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2029.codfw.wmnet,service=ats-be
* 13:45 jbond@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetmaster2004.codfw.wmnet with reason: host reimage
* 13:42 jbond@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on puppetmaster2004.codfw.wmnet with reason: host reimage
* 13:42 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:42 Lucas_WMDE: UTC afternoon backport+config window done
* 13:41 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2013.codfw.wmnet with OS bullseye
* 13:41 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:41 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:40 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:754933{{!}}Enable usage tracking for statement for cebwiki (T296384)]] – expected to gradually increase number of wbc_entity_usage and probably recentchanges rows on cebwiki, but not too much, see task for details (duration: 03m 06s)
* 13:40 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:39 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2028.codfw.wmnet with OS bullseye
* 13:36 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P32156 and previous config saved to /var/cache/conftool/dbconfig/20220802-133648-marostegui.json
* 13:35 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:34 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/Wikibase.php: Config: [[gerrit:754937{{!}}Introduce $wmgEntityUsageModifierLimitsStatement (T296384)]] (2/2) (duration: 03m 21s)
* 13:34 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:34 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:33 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:31 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:754937{{!}}Introduce $wmgEntityUsageModifierLimitsStatement (T296384)]] (1/2) (duration: 03m 16s)
* 13:30 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ganeti2028.codfw.wmnet with reason: Power down for PDU maintenance, [[phab:T309957|T309957]]
* 13:30 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 4:00:00 on ganeti2028.codfw.wmnet with reason: Power down for PDU maintenance, [[phab:T309957|T309957]]
* 13:27 jbond@cumin2002: START - Cookbook sre.hosts.reimage for host puppetmaster2004.codfw.wmnet with OS buster
* 13:24 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2013.codfw.wmnet with reason: host reimage
* 13:24 vgutierrez: restarting ATS 9.x instances to apply https://gerrit.wikimedia.org/r/819585 - [[phab:T309651|T309651]]
* 13:23 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2028.codfw.wmnet with reason: host reimage
* 13:21 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P32155 and previous config saved to /var/cache/conftool/dbconfig/20220802-132142-marostegui.json
* 13:19 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2013.codfw.wmnet with reason: host reimage
* 13:19 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2028.codfw.wmnet with reason: host reimage
* 13:17 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:17 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:17 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:16 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:15 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|a4499e5ac23a0558bed276e2b74134590afc5c95}}: Revert "testwiki: Add mediawiki.web_ui.interactions stream" ([[phab:T314151|T314151]], [[phab:T311268|T311268]]) (duration: 03m 19s)
* 13:10 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:09 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:09 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:09 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|c2fb8a58d8f62e29a15ebee26198e79e4597d24c}}: Enable RealtimePreview on Group 0 wikis ([[phab:T314150|T314150]]) (duration: 03m 21s)
* 13:08 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:06 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32154 and previous config saved to /var/cache/conftool/dbconfig/20220802-130636-marostegui.json
* 13:04 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1096:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32153 and previous config saved to /var/cache/conftool/dbconfig/20220802-130428-marostegui.json
* 13:04 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance
* 13:04 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance
* 13:04 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
* 13:03 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
* 13:03 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1180 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32152 and previous config saved to /var/cache/conftool/dbconfig/20220802-130351-marostegui.json
* 13:02 jmm@cumin2002: START - Cookbook sre.hosts.reimage for host ganeti2013.codfw.wmnet with OS bullseye
* 13:00 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic2028.codfw.wmnet with OS bullseye
* 13:00 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2013.codfw.wmnet with reason: Remove node for eventual reimage, [[phab:T311686|T311686]]
* 12:59 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2013.codfw.wmnet with reason: Remove node for eventual reimage, [[phab:T311686|T311686]]
* 12:48 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P32151 and previous config saved to /var/cache/conftool/dbconfig/20220802-124845-marostegui.json
* 12:33 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P32150 and previous config saved to /var/cache/conftool/dbconfig/20220802-123338-marostegui.json
* 12:18 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1180 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32149 and previous config saved to /var/cache/conftool/dbconfig/20220802-121832-marostegui.json
* 12:16 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1180 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32148 and previous config saved to /var/cache/conftool/dbconfig/20220802-121624-marostegui.json
* 12:16 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance
* 12:15 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance
* 12:13 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 12:12 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 12:12 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 12:11 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 12:01 marostegui: dbmaint x1@eqiad [[phab:T314087|T314087]]
* 11:57 marostegui: dbmaint s7@eqiad [[phab:T314377|T314377]]
* 11:57 marostegui: dbmaint s3@eqiad [[phab:T314377|T314377]]
* 11:57 marostegui: dbmaint s8@eqiad [[phab:T314377|T314377]]
* 11:55 marostegui: dbmaint s8@eqiad [[phab:T314377|T314377]]
* 11:54 marostegui: dbmaint s3@eqiad [[phab:T314377|T314377]]
* 11:50 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
* 11:48 marostegui: dbmaint s7@eqiad [[phab:T314377|T314377]]
* 11:46 marostegui: dbmaint s4@eqiad [[phab:T314377|T314377]]
* 11:35 elukey: restart rsyslog on ml-serve1006
* 10:50 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on an-worker1082.eqiad.wmnet with reason: [[phab:T312626|T312626]] btullis
* 10:50 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on an-worker1082.eqiad.wmnet with reason: [[phab:T312626|T312626]] btullis
* 10:49 godog: grow sda3 by 100G on thanos-be2004 - [[phab:T314275|T314275]]
* 10:42 btullis@puppetmaster1001: conftool action : set/pooled=inactive; selector: cluster=wikireplicas-b,name=dbproxy1018.eqiad.wmnet
* 10:42 btullis@puppetmaster1001: conftool action : set/pooled=yes; selector: cluster=wikireplicas-b,name=dbproxy1019.eqiad.wmnet
* 10:35 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 10:34 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 10:34 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 10:34 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 10:33 marostegui@cumin1001: dbctl commit (dc=all): 'db1181 (re)pooling @ 100%: After restart', diff saved to https://phabricator.wikimedia.org/P32147 and previous config saved to /var/cache/conftool/dbconfig/20220802-103318-root.json
* 10:18 marostegui@cumin1001: dbctl commit (dc=all): 'db1181 (re)pooling @ 75%: After restart', diff saved to https://phabricator.wikimedia.org/P32146 and previous config saved to /var/cache/conftool/dbconfig/20220802-101813-root.json
* 10:15 marostegui@cumin1001: dbctl commit (dc=all): 'Add db2175 to s2 [[phab:T311494|T311494]]', diff saved to https://phabricator.wikimedia.org/P32145 and previous config saved to /var/cache/conftool/dbconfig/20220802-101522-marostegui.json
* 10:12 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1019.eqiad.wmnet with OS bullseye
* 10:05 jynus: shutdown dbprov2002 backup2005 backup2008 [[phab:T310070|T310070]]
* 10:03 marostegui@cumin1001: dbctl commit (dc=all): 'db1181 (re)pooling @ 50%: After restart', diff saved to https://phabricator.wikimedia.org/P32144 and previous config saved to /var/cache/conftool/dbconfig/20220802-100308-root.json
* 10:03 marostegui@cumin1001: dbctl commit (dc=all): 'db1174 (re)pooling @ 100%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32143 and previous config saved to /var/cache/conftool/dbconfig/20220802-100304-root.json
* 09:54 marostegui@cumin1001: dbctl commit (dc=all): 'Remove db2079 from dbctl [[phab:T313885|T313885]]', diff saved to https://phabricator.wikimedia.org/P32141 and previous config saved to /var/cache/conftool/dbconfig/20220802-095455-marostegui.json
* 09:52 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1019.eqiad.wmnet with reason: host reimage
* 09:49 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1019.eqiad.wmnet with reason: host reimage
* 09:49 btullis@cumin1001: END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons.
* 09:48 marostegui@cumin1001: dbctl commit (dc=all): 'db1181 (re)pooling @ 25%: After restart', diff saved to https://phabricator.wikimedia.org/P32140 and previous config saved to /var/cache/conftool/dbconfig/20220802-094804-root.json
* 09:47 marostegui@cumin1001: dbctl commit (dc=all): 'db1174 (re)pooling @ 75%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32139 and previous config saved to /var/cache/conftool/dbconfig/20220802-094759-root.json
* 09:44 godog: grow sdb3 by 100G on thanos-be2004 - [[phab:T314275|T314275]]
* 09:43 btullis@cumin1001: START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons.
* 09:42 btullis@cumin1001: END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons.
* 09:37 btullis@cumin1001: START - Cookbook sre.hosts.reimage for host dbproxy1019.eqiad.wmnet with OS bullseye
* 09:36 btullis@cumin1001: START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons.
* 09:33 marostegui@cumin1001: dbctl commit (dc=all): 'db1181 (re)pooling @ 10%: After restart', diff saved to https://phabricator.wikimedia.org/P32138 and previous config saved to /var/cache/conftool/dbconfig/20220802-093259-root.json
* 09:32 marostegui@cumin1001: dbctl commit (dc=all): 'db1174 (re)pooling @ 50%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32137 and previous config saved to /var/cache/conftool/dbconfig/20220802-093254-root.json
* 09:30 btullis@puppetmaster1001: conftool action : set/pooled=no; selector: cluster=wikireplicas-b,name=dbproxy1019.eqiad.wmnet
* 09:30 btullis@puppetmaster1001: conftool action : set/pooled=yes; selector: cluster=wikireplicas-b,name=dbproxy1018.eqiad.wmnet
* 09:28 btullis@cumin1001: END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons.
* 09:26 btullis@puppetmaster1001: conftool action : set/pooled=inactive; selector: cluster=wikireplicas-a,name=dbproxy1019.eqiad.wmnet
* 09:25 btullis@puppetmaster1001: conftool action : set/pooled=yes; selector: cluster=wikireplicas-a,name=dbproxy1018.eqiad.wmnet
* 09:22 btullis@cumin1001: START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons.
* 09:17 marostegui@cumin1001: dbctl commit (dc=all): 'db1181 (re)pooling @ 5%: After restart', diff saved to https://phabricator.wikimedia.org/P32136 and previous config saved to /var/cache/conftool/dbconfig/20220802-091754-root.json
* 09:17 marostegui@cumin1001: dbctl commit (dc=all): 'db1174 (re)pooling @ 10%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32135 and previous config saved to /var/cache/conftool/dbconfig/20220802-091749-root.json
* 09:15 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2143', diff saved to https://phabricator.wikimedia.org/P32134 and previous config saved to /var/cache/conftool/dbconfig/20220802-091518-root.json
* 09:02 marostegui@cumin1001: dbctl commit (dc=all): 'db1181 (re)pooling @ 2%: After restart', diff saved to https://phabricator.wikimedia.org/P32133 and previous config saved to /var/cache/conftool/dbconfig/20220802-090250-root.json
* 09:02 marostegui@cumin1001: dbctl commit (dc=all): 'db1174 (re)pooling @ 5%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32132 and previous config saved to /var/cache/conftool/dbconfig/20220802-090245-root.json
* 08:47 marostegui@cumin1001: dbctl commit (dc=all): 'db1181 (re)pooling @ 1%: After restart', diff saved to https://phabricator.wikimedia.org/P32131 and previous config saved to /var/cache/conftool/dbconfig/20220802-084745-root.json
* 08:47 marostegui@cumin1001: dbctl commit (dc=all): 'db1174 (re)pooling @ 1%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32130 and previous config saved to /var/cache/conftool/dbconfig/20220802-084740-root.json
* 08:46 marostegui: stop mysql on db2095 db2107 db2109 db2137 db2147 db2159 db2160 pc2012 for pdu maintenance on codfw b5 [[phab:T310070|T310070]]
* 07:49 moritzm: upgrading drmrs ganeti clusters to 3.0.2 [[phab:T312637|T312637]]
* 07:33 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2005.codfw.wmnet with reason: Switch instance to plain disks, [[phab:T311686|T311686]]
* 07:33 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2005.codfw.wmnet with reason: Switch instance to plain disks, [[phab:T311686|T311686]]
* 07:22 godog: bounce icinga on alert2001 - [[phab:T314353|T314353]]
* 07:18 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2005.codfw.wmnet with reason: Switch instance to DRBD, [[phab:T311686|T311686]]
* 07:18 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2005.codfw.wmnet with reason: Switch instance to DRBD, [[phab:T311686|T311686]]
* 06:58 elukey: restart rsyslog on ml-serve2006
* 06:56 ladsgroup@deploy1002: Synchronized php-1.39.0-wmf.22/extensions/FlaggedRevs/maintenance/pruneRevData.php: Backport: [[gerrit:819077{{!}}pruneRevData: Make cleaning in larger batches (T296380)]] (duration: 03m 26s)
* 06:56 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 06:55 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 06:55 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 06:54 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 06:46 godog: bounce icinga on alert1001 - [[phab:T314353|T314353]]
* 05:48 marostegui@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts db2088.codfw.wmnet
* 05:48 marostegui@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 05:44 marostegui@cumin1001: START - Cookbook sre.dns.netbox
* 05:35 marostegui@cumin1001: START - Cookbook sre.hosts.decommission for hosts db2088.codfw.wmnet
* 05:29 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1181', diff saved to https://phabricator.wikimedia.org/P32127 and previous config saved to /var/cache/conftool/dbconfig/20220802-052923-root.json
* 05:24 marostegui: dbmaint x1@eqiad [[phab:T314087|T314087]]
* 04:17 ryankemper: [Elastic] Small amendment to my earlier statement; based off epoch time `be_x_oldwiki_titlesuggest_1659407912` was not an old index hanging around after a reindex operation, but rather the new one that the reindex operation was trying to create, but had not yet finished (therefore didn't switch over the aliases). It presumably got interrupted by the reimage of `elastic2059`.
* 04:15 ryankemper: [Elastic] Blew away red index like so: `ryankemper@cumin1001:~$ curl -XDELETE https://search.svc.codfw.wmnet:9243/be_x_oldwiki_titlesuggest_1659407912`. Cluster is back to `green` status.
* 04:07 ryankemper: [Elastic] Per `curl -s https://search.svc.codfw.wmnet:9243/_cat/aliases {{!}} grep -i be_x` I see `be_x_oldwiki_titlesuggest` alias points to `be_x_oldwiki_titlesuggest_1658396688`. I think this means the red index is an old index from an in-progress reindex operation. I likely just need to delete `be_x_oldwiki_titlesuggest_1659407912` but doing some quick digging first
* 04:04 ryankemper: [Elastic] Red cluster status in main codfw elasticsearch cluster (`https://search.svc.codfw.wmnet:9243`); culprit appears to be index `be_x_oldwiki_titlesuggest_1659407912`. Confusingly it has 2 replicas set so it's not clear to me how we got into this state starting from green (in the past we've gone into red status from indices that erroneously had 0 replicas in production)
* 03:47 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 03:46 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 03:46 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 03:45 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 03:40 krinkle@deploy1002: Synchronized multiversion/: {{Gerrit|I0802db272695}} (duration: 03m 10s)
* 03:40 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 03:39 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 03:39 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 03:38 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 03:34 krinkle@deploy1002: Synchronized wmf-config/: {{Gerrit|I9b89c0ff5c2}} (duration: 03m 32s)
* 03:33 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 03:32 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 03:32 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 03:31 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 03:27 krinkle@deploy1002: Synchronized multiversion/: {{Gerrit|I6e97d39a3}}, {{Gerrit|Ib843ebced31}} (duration: 03m 30s)
* 03:26 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 03:25 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 03:25 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 03:24 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 03:22 krinkle@mwmaint1002: pull aborted:  (duration: 00m 11s)
* 03:21 krinkle@deploy1002: Synchronized wmf-config/CommonSettings.php: {{Gerrit|I39a2b86065}} (duration: 03m 19s)
* 03:20 ryankemper@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host elastic2059.codfw.wmnet with OS bullseye
* 03:15 krinkle@deploy1002: Synchronized multiversion/: {{Gerrit|Ieaea60a991e5611}} (duration: 03m 03s)
* 03:14 krinkle@mwmaint2002: pull aborted:  (duration: 01m 36s)
* 03:14 krinkle@mwmaint1002: pull aborted:  (duration: 01m 31s)
* 03:13 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 03:12 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 03:12 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 03:11 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 02:58 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2059.codfw.wmnet with reason: host reimage
* 02:54 ryankemper: [WDQS] `ryankemper@wdqs1012:~$ sudo systemctl restart wdqs-blazegraph.service` to clear `Query Service HTTP Port` && `WDQS SPARQL` alerts
* 02:53 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2059.codfw.wmnet with reason: host reimage
* 02:36 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic2059.codfw.wmnet with OS bullseye
* 02:31 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 02:30 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 02:30 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 02:29 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 02:09 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 02:08 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 02:08 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 02:07 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 00:41 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 00:40 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 00:40 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 00:39 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 00:35 krinkle@deploy1002: Synchronized wmf-config/CommonSettings.php: {{Gerrit|Ieaea60a991e5}} (duration: 03m 10s)
* 00:29 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 00:28 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 00:28 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 00:27 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 00:23 krinkle@deploy1002: Synchronized multiversion/: {{Gerrit|Ia3406eba4ab8bb}} (duration: 03m 22s)
* 00:17 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 00:16 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 00:16 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 00:15 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 00:05 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 00:04 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 00:04 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 00:03 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply


== 2022-08-01 ==
* 23:59 krinkle@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|Id1ce285631f5}}, {{Gerrit|I194d419fbfe}} (duration: 03m 09s)
* 23:58 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 23:57 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 23:57 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 23:56 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 21:08 moritzm: drain ganeti2028 [[phab:T309957|T309957]]
* 21:03 mutante: gerrit2002 - mkdir /var/lib/gerrit2/review_site {{!}} gerrit1001 - rsyncing /var/lib/gerrit2/review_site/ to gerrit2002 [[phab:T313250|T313250]] [[phab:T313972|T313972]]
* 21:01 urbanecm: UTC late backport window done
* 21:00 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|461e0709a8987b110f669b74afc38c706b616e5d}}: itwiki: Change robot policy on NS2 and NS3 ([[phab:T314165|T314165]]) (duration: 03m 18s)
* 20:58 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:57 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:57 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:57 mutante: phab1001 - rsyncing repo data /srv/repos/ to phab2002 (in addition to phab1004 previously) [[phab:T313360|T313360]]
* 20:56 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:55 urbanecm: [urbanecm@mwmaint1002 ~]$ mwscript namespaceDupes.php --wiki=mnwwiktionary  --fix # [[phab:T314023|T314023]]
* 20:53 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|ba8c17759b7e737a6757792ad4136ff3af00030c}}: mnwwiktionary: Create Appendix namespace ([[phab:T314023|T314023]]) (duration: 03m 09s)
* 20:51 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:50 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:50 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:49 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:48 urbanecm: [urbanecm@mwmaint1002 ~]$ mwscript updateArticleCount.php --wiki=viwikibooks --update # [[phab:T314239|T314239]]
* 20:47 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|c19c3e36ab}}: DiscussionTools: Make new reply buttons available at mediawiki.org ([[phab:T314076|T314076]]); {{Gerrit|24db016c4}}: viwikibooks: Change wgArticleCountMethod to any ([[phab:T314239|T314239]]) (duration: 03m 10s)
* 20:35 daniel@deploy1002: Synchronized php-1.39.0-wmf.22/includes/Rest/Handler: Fix: [[gerrit:819129{{!}}Parsoid REST handler: allow pagebundle input without original HTML.]] (duration: 03m 15s)
* 20:25 urbanecm: Purge https://en.wikipedia.org/static/images/mobile/copyright/wikipedia-wordmark-ne.svg ([[phab:T311700|T311700]])
* 20:21 daniel@deploy1002: Synchronized static/images/mobile/copyright/wikipedia-wordmark-ne.svg: Config: [[gerrit:818614{{!}}newiki: Update wordmark (T311700)]] (duration: 03m 17s)
* 20:17 daniel@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:818614{{!}}newiki: Update wordmark (T311700)]] (duration: 03m 32s)
* 20:14 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:13 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:13 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:12 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:07 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:06 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:06 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:05 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:03 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2054.codfw.wmnet with OS bullseye
* 19:41 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2054.codfw.wmnet with reason: host reimage
* 19:35 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2054.codfw.wmnet with reason: host reimage
* 19:12 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic2054.codfw.wmnet with OS bullseye
* 18:56 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2031.codfw.wmnet with OS bullseye
* 18:44 mutante: gitlab - moved data_persistence group to new parent, under /repos/
* 18:34 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2031.codfw.wmnet with reason: host reimage
* 18:32 mutante: gitlab - created group 'data_persistence' - added Ladsgroup and upgraded from member to maintainer
* 18:27 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2031.codfw.wmnet with reason: host reimage
* 18:12 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic2031.codfw.wmnet with OS bullseye
* 17:58 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2025.codfw.wmnet with OS bullseye
* 17:37 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2025.codfw.wmnet with reason: host reimage
* 17:31 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2025.codfw.wmnet with reason: host reimage
* 17:18 ryankemper: [[phab:T289135|T289135]] [[phab:T314078|T314078]] Manually reimaging remaining codfw stretch hosts (`elastic[2025,2031,2054,2059-2060]`) to bullseye, one host at a time, waiting for green cluster status to return between each run. `ryankemper@cumin1001` tmux session `codfw_reimage`
* 17:16 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic2025.codfw.wmnet with OS bullseye
* 17:08 bking@cumin1001: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - [[phab:T289135|T289135]]
* 17:08 bking@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - [[phab:T289135|T289135]]
* 17:06 mutante: alert1001 - systemctl restart nsca - pinged by fundraising tech because fundraising hosts have the "passive check is awol" issue again ([[phab:T196336|T196336]])
* 16:25 moritzm: installing tcpdump updates from bullseye point release
* 16:23 cwhite@puppetmaster1001: conftool action : set/pooled=yes; selector: dc=codfw,cluster=kibana7,name=logstash2023.codfw.wmnet
* 16:16 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1018.eqiad.wmnet with OS bullseye
* 16:10 cwhite@puppetmaster1001: conftool action : set/pooled=no; selector: dc=codfw,cluster=kibana7,name=logstash2023.codfw.wmnet
* 15:57 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1018.eqiad.wmnet with reason: host reimage
* 15:54 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1018.eqiad.wmnet with reason: host reimage
* 15:41 btullis@cumin1001: START - Cookbook sre.hosts.reimage for host dbproxy1018.eqiad.wmnet with OS bullseye
* 15:39 mvernon@cumin1001: END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase1016.eqiad.wmnet: Canary testing of 3.11.13 on Restbase [[phab:T309896|T309896]] - mvernon@cumin1001
* 15:33 pt1979@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 15:29 pt1979@cumin2002: START - Cookbook sre.dns.netbox
* 15:29 mvernon@cumin1001: START - Cookbook sre.cassandra.roll-restart for nodes matching restbase1016.eqiad.wmnet: Canary testing of 3.11.13 on Restbase [[phab:T309896|T309896]] - mvernon@cumin1001
* 15:14 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:818127{{!}}Beta: add configuration for redirect badges (T313896)]] (2/2, should be a no-op) (duration: 03m 30s)
* 15:11 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 15:11 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/Wikibase.php: Config: [[gerrit:818127{{!}}Beta: add configuration for redirect badges (T313896)]] (1/2, should be a no-op) (duration: 03m 15s)
* 15:10 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 15:10 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 15:09 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 14:54 btullis@puppetmaster1001: conftool action : set/pooled=no; selector: cluster=wikireplicas-a,name=dbproxy1018.eqiad.wmnet
* 14:53 btullis@puppetmaster1001: conftool action : set/pooled=yes; selector: cluster=wikireplicas-a,name=dbproxy1019.eqiad.wmnet
* 14:42 moritzm: installing openjdk-11 security updates
* 14:39 btullis@puppetmaster1001: conftool action : set/pooled=inactive; selector: cluster=wikireplicas-a,name=dbproxy1019.eqiad.wmnet
* 14:39 btullis@puppetmaster1001: conftool action : set/pooled=yes; selector: cluster=wikireplicas-a,name=dbproxy1018.eqiad.wmnet
* 14:38 btullis@puppetmaster1001: conftool action : set/pooled=no; selector: cluster=wikireplicas-a,name=dbproxy1018.eqiad.wmnet
* 14:34 btullis@puppetmaster1001: conftool action : set/pooled=yes; selector: cluster=wikireplicas-a,name=dbproxy1019.eqiad.wmnet
* 14:30 elukey@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
* 14:30 elukey@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
* 14:29 elukey@deploy1002: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
* 14:29 elukey@deploy1002: helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
* 14:29 elukey@deploy1002: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
* 14:29 elukey@deploy1002: helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
* 14:29 elukey@deploy1002: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
* 14:29 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on gitlab1004.wikimedia.org with reason: upgrade gitlab1004 to new version
* 14:29 elukey@deploy1002: helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
* 14:28 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 0:15:00 on gitlab1004.wikimedia.org with reason: upgrade gitlab1004 to new version
* 14:14 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 14:13 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 14:13 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 14:13 urbanecm@deploy1002: Synchronized php-1.39.0-wmf.22/skins/Vector/: {{Gerrit|b5007c5f1c389deb344c5bb99e950b4190436cab}}: Revert "styles: Unify on standard external link icon" (duration: 03m 16s)
* 14:12 bking@cumin1001: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - [[phab:T289135|T289135]]
* 14:12 bking@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - [[phab:T289135|T289135]]
* 14:12 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 14:05 bking@cumin1001: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - [[phab:T289135|T289135]]
* 14:04 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2044.codfw.wmnet with OS bullseye
* 14:04 urbanecm@deploy1002: Synchronized wmf-config/logos.php: {{Gerrit|bcb7b0d4d07b454a169804d7b1011ec3f2530c00}}: Adjust width-height ratio of logo to fix display issue ([[phab:T310961|T310961]]; 2/2) (duration: 03m 17s)
* 14:04 urbanecm: Purge https://en.wikipedia.org/static/images/project-logos/srwikisource<nowiki>{</nowiki>.png;-1.5x.png;-2x.png<nowiki>}</nowiki> ([[phab:T310961|T310961]])
* 14:02 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 14:01 urbanecm@deploy1002: Synchronized static/images/project-logos/: {{Gerrit|bcb7b0d4d07b454a169804d7b1011ec3f2530c00}}: srwikisource: Adjust width-height ratio of logo to fix display issue ([[phab:T310961|T310961]]; 1/2) (duration: 03m 41s)
* 14:01 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 14:01 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:58 urbanecm: UTC afternoon backport window is going to overflow by a couple of minutes
* 13:58 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:48 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2044.codfw.wmnet with reason: host reimage
* 13:44 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2044.codfw.wmnet with reason: host reimage
* 13:24 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic2044.codfw.wmnet with OS bullseye
* 13:22 bking@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - [[phab:T289135|T289135]]
* 11:50 moritzm: installing openjdk-8 security updates for stretch
* 11:43 moritzm: uploaded openjdk-8 8u342-b07-1~deb9u1 for stretch-wikimedia
* 10:27 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1112 ([[phab:T314041|T314041]])', diff saved to https://phabricator.wikimedia.org/P32124 and previous config saved to /var/cache/conftool/dbconfig/20220801-102714-ladsgroup.json
* 10:12 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P32123 and previous config saved to /var/cache/conftool/dbconfig/20220801-101208-ladsgroup.json
* 10:09 vgutierrez: test ATS 9.1.2 on cp6016 - [[phab:T309651|T309651]]
* 10:05 vgutierrez: test ATS 9.1.2 on cp6008 - [[phab:T309651|T309651]]
* 10:00 ebysans@deploy1002: Finished deploy [airflow-dags/analytics@4da9195]: (no justification provided) (duration: 00m 19s)
* 10:00 ebysans@deploy1002: Started deploy [airflow-dags/analytics@4da9195]: (no justification provided)
* 09:57 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P32122 and previous config saved to /var/cache/conftool/dbconfig/20220801-095702-ladsgroup.json
* 09:56 ebysans@deploy1002: Finished deploy [airflow-dags/analytics@85585b0]: (no justification provided) (duration: 00m 05s)
* 09:56 ebysans@deploy1002: Started deploy [airflow-dags/analytics@85585b0]: (no justification provided)
* 09:41 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1112 ([[phab:T314041|T314041]])', diff saved to https://phabricator.wikimedia.org/P32121 and previous config saved to /var/cache/conftool/dbconfig/20220801-094156-ladsgroup.json
* 09:38 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1112 ([[phab:T314041|T314041]])', diff saved to https://phabricator.wikimedia.org/P32120 and previous config saved to /var/cache/conftool/dbconfig/20220801-093845-ladsgroup.json
* 09:38 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
* 09:38 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
* 09:38 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1112.eqiad.wmnet with reason: Maintenance
* 09:38 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on db1112.eqiad.wmnet with reason: Maintenance
* 09:36 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance
* 09:36 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on 6 hosts with reason: Maintenance
* 09:36 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2105.codfw.wmnet with reason: Maintenance
* 09:35 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on db2105.codfw.wmnet with reason: Maintenance
* 09:21 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner2004.codfw.wmnet
* 09:10 jelto@cumin1001: START - Cookbook sre.hosts.reboot-single for host gitlab-runner2004.codfw.wmnet
* 09:10 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner2003.codfw.wmnet
* 09:01 jelto@cumin1001: START - Cookbook sre.hosts.reboot-single for host gitlab-runner2003.codfw.wmnet
* 09:00 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner2002.codfw.wmnet
* 08:59 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 08:58 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 08:58 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 08:57 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 08:53 ladsgroup@deploy1002: Synchronized php-1.39.0-wmf.22/includes/api: Backport: [[gerrit:818562{{!}}api: Support for links migration in ApiQueryBacklinks (T312865 T314112)]] (duration: 03m 01s)
* 08:51 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 08:50 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 08:50 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 08:50 jelto@cumin1001: START - Cookbook sre.hosts.reboot-single for host gitlab-runner2002.codfw.wmnet
* 08:50 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner1004.eqiad.wmnet
* 08:48 godog: thanos-be2004: copy quarantined and tmp off sdb3 and into sdb4 for analysis and to free space - [[phab:T314275|T314275]]
* 08:48 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 08:47 ladsgroup@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:818998{{!}}Stop writing to the old templatelinks columns in itwikisource (T312865)]] (duration: 03m 12s)
* 08:43 vgutierrez: rolling upgrade of HAProxy to version 2.4.18
* 08:43 kevinbazira@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
* 08:41 kevinbazira@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
* 08:39 jelto@cumin1001: START - Cookbook sre.hosts.reboot-single for host gitlab-runner1004.eqiad.wmnet
* 08:39 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner1003.eqiad.wmnet
* 08:28 jelto@cumin1001: START - Cookbook sre.hosts.reboot-single for host gitlab-runner1003.eqiad.wmnet
* 08:25 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner1002.eqiad.wmnet
* 08:14 jelto@cumin1001: START - Cookbook sre.hosts.reboot-single for host gitlab-runner1002.eqiad.wmnet
* 06:19 oblivian@puppetmaster1001: conftool action : set/pooled=true; selector: dnsdisc=(appservers{{!}}api)-ro,name=codfw
* 06:14 oblivian@puppetmaster1001: conftool action : set/ttl=10; selector: dnsdisc=appservers-ro
* 06:13 oblivian@puppetmaster1001: conftool action : set/ttl=10; selector: dnsdisc=appserver-ro
* 06:13 oblivian@puppetmaster1001: conftool action : set/ttl=10; selector: dnsdisc=(appserver{{!}}api)-ro
* 05:43 moritzm: installing Linux 5.10.127-2 on Gitlab runners
* 01:00 krinkle@deploy1002: Synchronized multiversion/: {{Gerrit|Ic0dbcba9f60f20a}} (duration: 03m 31s)
* 00:57 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 00:56 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 00:56 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 00:53 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 00:45 krinkle@deploy1002: Synchronized multiversion/MWMultiVersion.php: {{Gerrit|I9d363abd7cfef}} (duration: 03m 17s)
* 00:43 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 00:42 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 00:42 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 00:39 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
== Archives ==
See [[Server Admin Log/Archives]].
<noinclude>
<noinclude>

Revision as of 23:41, 12 August 2022

2022-08-12

  • 23:41 mutante: wikistats-bullseye:~$ /usr/lib/wikistats/update.php wp prefix blk ; /usr/lib/wikistats/update.php wp prefix kcg T315121
  • 23:38 mutante: [mwmaint1002:~] $ sudo systemctl start mediawiki_job_initsitestats.timer T315121
  • 22:14 ryankemper@cumin1001: END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
  • 21:48 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1071.eqiad.wmnet with OS bullseye
  • 21:45 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddb2002-dev.codfw.wmnet with OS bullseye
  • 21:27 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1071.eqiad.wmnet with reason: host reimage
  • 21:25 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1071.eqiad.wmnet with reason: host reimage
  • 21:12 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1071.eqiad.wmnet with OS bullseye
  • 21:10 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb2002-dev.codfw.wmnet with reason: host reimage
  • 21:06 andrew@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb2002-dev.codfw.wmnet with reason: host reimage
  • 21:06 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1053.eqiad.wmnet with OS bullseye
  • 20:50 andrew@cumin1001: START - Cookbook sre.hosts.reimage for host clouddb2002-dev.codfw.wmnet with OS bullseye
  • 20:43 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1053.eqiad.wmnet with reason: host reimage
  • 20:39 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1053.eqiad.wmnet with reason: host reimage
  • 20:24 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1053.eqiad.wmnet with OS bullseye
  • 20:12 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1048.eqiad.wmnet with OS bullseye
  • 19:55 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1048.eqiad.wmnet with reason: host reimage
  • 19:53 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1048.eqiad.wmnet with reason: host reimage
  • 19:42 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1048.eqiad.wmnet with OS bullseye
  • 19:38 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1146:3314 (T312863)', diff saved to https://phabricator.wikimedia.org/P32375 and previous config saved to /var/cache/conftool/dbconfig/20220812-193822-ladsgroup.json
  • 19:38 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
  • 19:38 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
  • 19:38 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1121 (T312863)', diff saved to https://phabricator.wikimedia.org/P32374 and previous config saved to /var/cache/conftool/dbconfig/20220812-193801-ladsgroup.json
  • 19:33 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1054.eqiad.wmnet with OS bullseye
  • 19:22 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P32373 and previous config saved to /var/cache/conftool/dbconfig/20220812-192255-ladsgroup.json
  • 19:12 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1054.eqiad.wmnet with reason: host reimage
  • 19:09 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1054.eqiad.wmnet with reason: host reimage
  • 19:07 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P32372 and previous config saved to /var/cache/conftool/dbconfig/20220812-190749-ladsgroup.json
  • 18:58 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet with reason: Maint
  • 18:58 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet with reason: Maint
  • 18:54 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1054.eqiad.wmnet with OS bullseye
  • 18:52 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1121 (T312863)', diff saved to https://phabricator.wikimedia.org/P32371 and previous config saved to /var/cache/conftool/dbconfig/20220812-185243-ladsgroup.json
  • 18:48 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1066.eqiad.wmnet with OS bullseye
  • 18:25 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1066.eqiad.wmnet with reason: host reimage
  • 18:22 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1066.eqiad.wmnet with reason: host reimage
  • 18:08 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1066.eqiad.wmnet with OS bullseye
  • 18:00 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1064.eqiad.wmnet with OS bullseye
  • 17:42 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1064.eqiad.wmnet with reason: host reimage
  • 17:39 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1064.eqiad.wmnet with reason: host reimage
  • 17:24 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1064.eqiad.wmnet with OS bullseye
  • 17:21 pt1979@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts netmon2002.wikimedia.org
  • 17:21 pt1979@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts netmon2002.wikimedia.org
  • 17:19 pt1979@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netmon2002.wikimedia.org with OS bullseye
  • 17:04 pt1979@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netmon2002.wikimedia.org with reason: host reimage
  • 17:01 pt1979@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on netmon2002.wikimedia.org with reason: host reimage
  • 16:42 pt1979@cumin2002: START - Cookbook sre.hosts.reimage for host netmon2002.wikimedia.org with OS bullseye
  • 16:26 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1067.eqiad.wmnet with OS bullseye
  • 16:21 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol2003-dev.wikimedia.org
  • 16:21 andrew@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 16:16 andrew@cumin1001: START - Cookbook sre.dns.netbox
  • 16:11 andrew@cumin1001: START - Cookbook sre.hosts.decommission for hosts cloudcontrol2003-dev.wikimedia.org
  • 16:08 pt1979@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['netmon2002.wikimedia.org']
  • 16:03 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1067.eqiad.wmnet with reason: host reimage
  • 15:58 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1067.eqiad.wmnet with reason: host reimage
  • 15:43 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1067.eqiad.wmnet with OS bullseye
  • 15:37 pt1979@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['netmon2002.wikimedia.org']
  • 15:31 jbond@cumin2002: END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['netmon2002.wikimedia.org']
  • 15:31 jbond@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['netmon2002.wikimedia.org']
  • 15:07 jbond@cumin2002: END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts netmon1002.wikimedia.org
  • 15:07 jbond@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts netmon1002.wikimedia.org
  • 15:04 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1061.eqiad.wmnet with OS bullseye
  • 14:46 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1061.eqiad.wmnet with reason: host reimage
  • 14:46 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2042.codfw.wmnet,service=varnish-fe
  • 14:46 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2042.codfw.wmnet,service=ats-be
  • 14:46 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2042.codfw.wmnet,service=ats-tls
  • 14:43 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1061.eqiad.wmnet with reason: host reimage
  • 14:29 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maint
  • 14:28 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1061.eqiad.wmnet with OS bullseye
  • 14:28 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maint
  • 14:24 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1063.eqiad.wmnet with OS bullseye
  • 14:05 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1063.eqiad.wmnet with reason: host reimage
  • 14:02 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1063.eqiad.wmnet with reason: host reimage
  • 13:47 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1063.eqiad.wmnet with OS bullseye
  • 13:41 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
  • 06:01 ryankemper@puppetmaster1001: conftool action : set/pooled=yes:weight=10; selector: name=elastic10[8-9][0-9].*
  • 05:54 ryankemper@puppetmaster1001: conftool action : set/pooled=yes:weight=10; selector: name=elastic110.*
  • 01:03 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1121 (T312863)', diff saved to https://phabricator.wikimedia.org/P32369 and previous config saved to /var/cache/conftool/dbconfig/20220812-010312-ladsgroup.json
  • 01:03 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
  • 01:02 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
  • 01:02 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance
  • 01:02 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance
  • 01:02 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1147 (T312863)', diff saved to https://phabricator.wikimedia.org/P32368 and previous config saved to /var/cache/conftool/dbconfig/20220812-010233-ladsgroup.json
  • 00:47 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P32367 and previous config saved to /var/cache/conftool/dbconfig/20220812-004727-ladsgroup.json
  • 00:32 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P32366 and previous config saved to /var/cache/conftool/dbconfig/20220812-003221-ladsgroup.json
  • 00:17 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1147 (T312863)', diff saved to https://phabricator.wikimedia.org/P32365 and previous config saved to /var/cache/conftool/dbconfig/20220812-001715-ladsgroup.json

2022-08-11

  • 21:30 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 21:29 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 21:29 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 21:28 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 21:23 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 21:22 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 21:22 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 21:21 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 21:16 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 21:15 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 21:15 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 21:14 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 21:04 thcipriani@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: revert Define default value for "wmgSiteLogoVariants" (T305692 T308620) (duration: 03m 15s)
  • 20:59 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 20:58 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 20:58 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 20:57 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 20:52 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 20:51 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 20:50 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 20:49 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 20:47 thcipriani@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Define default value for "wmgSiteLogoVariants" (T305692 T308620) (duration: 03m 07s)
  • 20:44 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 20:43 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 20:43 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 20:42 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 20:29 thcipriani@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/VisualEditor/modules/ve-mw/preinit/ve.init.mw.DesktopArticleTarget.init.js: Backport: Do not show incompatible skin warning when page is not editable (T314952) (duration: 03m 16s)
  • 20:27 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 20:26 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 20:26 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 20:25 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 20:23 mutante: merging change on prod phabricator host to allow scap deployment, part 1
  • 19:42 damilare: payments-wiki upgraded from cf5e1848 to 0894d75a
  • 19:41 mutante: disabling puppet on C:profile::phabricator::main
  • 19:20 mvernon@cumin2002: END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-eqiad: upgrade to 3.11.13 T309896 - mvernon@cumin2002
  • 17:58 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 17:58 taavi@deploy1002: Synchronized wmf-config/CommonSettings.php: Config: Fix labtestwiki database name servers (T310795) (duration: 03m 39s)
  • 17:57 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 17:57 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 17:56 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 17:52 sukhe: testing ATS 9.1.3-1wm1 on cp3064: T309651
  • 17:49 pt1979@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host netmon2002.mgmt.codfw.wmnet with reboot policy FORCED
  • 17:46 sukhe: testing ATS 9.1.3-1wm1 on cp3064: T3096515
  • 17:41 pt1979@cumin2002: START - Cookbook sre.hosts.provision for host netmon2002.mgmt.codfw.wmnet with reboot policy FORCED
  • 17:40 pt1979@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 17:38 sukhe: testing ATS 9.1.3-1wm1 on cp1090: T309651
  • 17:36 pt1979@cumin2002: START - Cookbook sre.dns.netbox
  • 17:35 pt1979@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host netmon2002
  • 17:34 pt1979@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host netmon2002
  • 17:33 sukhe: testing ATS 9.1.3-1wm1 on cp3065: T309651
  • 17:28 sukhe: testing ATS 9.1.3-1wm1 on cp1089: T309651
  • 17:19 bking@cumin1001: conftool action : set/weight=10:pooled=no; selector: service=elasticsearch-omega-ssl,name=elastic1100.eqiad.wmnet
  • 17:18 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: service=elasticsearch-omega-ssl,name=elastic1100.eqiad.wmnet
  • 17:15 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: service=search-omega-https,name=elastic1100.eqiad.wmnet
  • 16:35 mvernon@cumin2002: START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-eqiad: upgrade to 3.11.13 T309896 - mvernon@cumin2002
  • 16:30 mvernon@cumin2002: END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-codfw: upgrade to 3.11.13 T309896 - mvernon@cumin2002
  • 16:29 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic[1100-1102].eqiad.wmnet with reason: T309810
  • 16:29 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic[1100-1102].eqiad.wmnet with reason: T309810
  • 16:26 inflatador: bking@elastic1054 attempting to ban elastic1100-1102 from cluster due to firewall issues
  • 16:13 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: service=search-omega-https,name=elastic1100.eqiad.wmnet
  • 16:12 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=elastic1100
  • 15:15 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 15:09 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
  • 14:58 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P32364 and previous config saved to /var/cache/conftool/dbconfig/20220811-145823-ladsgroup.json
  • 14:55 inflatador: bking@cumin1001 running puppet agent across eqiad elastic hosts
  • 14:48 ryankemper@cumin1001: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
  • 14:43 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P32362 and previous config saved to /var/cache/conftool/dbconfig/20220811-144318-ladsgroup.json
  • 14:28 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P32361 and previous config saved to /var/cache/conftool/dbconfig/20220811-142813-ladsgroup.json
  • 14:28 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol1003.wikimedia.org
  • 14:28 andrew@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 14:24 andrew@cumin1001: START - Cookbook sre.dns.netbox
  • 14:19 andrew@cumin1001: START - Cookbook sre.hosts.decommission for hosts cloudcontrol1003.wikimedia.org
  • 14:19 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 14:18 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol1004.wikimedia.org
  • 14:18 andrew@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 14:18 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 14:18 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 14:17 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 14:17 ladsgroup@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Stop writing to the old templatelinks fields in s2 (T312865) (duration: 03m 25s)
  • 14:16 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
  • 14:16 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
  • 14:16 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
  • 14:15 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
  • 14:13 andrew@cumin1001: START - Cookbook sre.dns.netbox
  • 14:13 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P32360 and previous config saved to /var/cache/conftool/dbconfig/20220811-141309-ladsgroup.json
  • 14:12 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 14:11 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 14:11 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 14:11 awight: EU backport window complete
  • 14:10 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 14:10 awight@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/DiscussionTools/includes/CommentFormatter.php: Backport: CommentFormatter: Set 'data-mw-comment' even when reply tool disabled (T314707) (duration: 03m 31s)
  • 14:09 andrew@cumin1001: START - Cookbook sre.hosts.decommission for hosts cloudcontrol1004.wikimedia.org
  • 14:05 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 14:04 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 14:04 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 14:03 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 13:52 mvernon@cumin2002: START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: upgrade to 3.11.13 T309896 - mvernon@cumin2002
  • 13:50 awight@deploy1002: Synchronized wmf-config: Config: Revert "Revert "testwiki: Add mediawiki.web_ui.interactions stream"" (duration: 03m 10s)
  • 13:48 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 13:47 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 13:47 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 13:46 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 13:36 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1060.eqiad.wmnet with OS bullseye
  • 13:36 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 13:36 awight@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: trwikiquote: Install WikiLove extension (T314895) (duration: 03m 30s)
  • 13:35 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 13:35 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 13:34 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 13:33 filippo@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host logstash2003.codfw.wmnet
  • 13:25 awight@deploy1002: Synchronized static/images: Config: Revert "trwiki: Change old and new vector logos for 500k articles" (part 3) (duration: 03m 09s)
  • 13:21 awight@deploy1002: Synchronized logos/: Config: Revert "trwiki: Change old and new vector logos for 500k articles" (part 2) (duration: 03m 09s)
  • 13:19 topranks: merging CR821781 to expose additional network info in puppet facts
  • 13:18 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 13:18 awight@deploy1002: Synchronized wmf-config/: Config: Revert "trwiki: Change old and new vector logos for 500k articles" (part 1) (duration: 03m 13s)
  • 13:17 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 13:17 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 13:16 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 13:14 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1060.eqiad.wmnet with reason: host reimage
  • 13:11 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 13:11 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1060.eqiad.wmnet with reason: host reimage
  • 13:10 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 13:10 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 13:09 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 13:08 awight@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Enable editor line numbering on all namespaces, for twwiki (T302852) (duration: 03m 42s)
  • 12:56 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1060.eqiad.wmnet with OS bullseye
  • 12:55 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
  • 12:49 aikochou@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
  • 12:46 aikochou@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
  • 12:26 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=restbase2018.codfw.wmnet
  • 12:26 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=restbase202[367].codfw.wmnet
  • 12:17 elukey@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
  • 12:17 elukey@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
  • 12:17 elukey@deploy1002: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
  • 12:16 elukey@deploy1002: helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
  • 12:13 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host logstash2003.codfw.wmnet
  • 12:11 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
  • 12:10 elukey@deploy1002: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
  • 12:09 elukey@deploy1002: helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
  • 11:20 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
  • 11:20 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
  • 09:58 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
  • 09:57 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
  • 09:56 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
  • 09:56 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
  • 09:49 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance
  • 09:49 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance
  • 09:32 godog: arm keyholder on netmon2001
  • 09:09 jbond: update gnutls28 on bullseye systems
  • 09:00 jbond: update unzip
  • 08:21 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
  • 08:13 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
  • 08:12 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
  • 08:06 ayounsi@cumin1001: END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox interface ID cr3-ulsfo:xe-0/1/1
  • 08:06 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr3-ulsfo:xe-0/1/1
  • 07:58 ayounsi@cumin1001: END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox interface ID cr3-ulsfo:xe-0/1/1
  • 07:57 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr3-ulsfo:xe-0/1/1
  • 07:55 oblivian@cumin1001: conftool action : set/pooled=false; selector: dnsdisc=k8s-ingress-wikikube-rw,name=codfw
  • 07:51 vgutierrez: rolling restart of pybal in eqsin and ulsfo
  • 07:24 oblivian@cumin1001: conftool action : set/pooled=false; selector: dnsdisc=restbase-async,name=eqiad
  • 07:24 oblivian@cumin1001: conftool action : set/pooled=true; selector: dnsdisc=shellbox-timeline
  • 07:23 oblivian@cumin1001: conftool action : set/pooled=true; selector: dnsdisc=inference
  • 07:19 _joe_: pooling all services in codfw
  • 07:03 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1147 (T312863)', diff saved to https://phabricator.wikimedia.org/P32357 and previous config saved to /var/cache/conftool/dbconfig/20220811-070312-ladsgroup.json
  • 07:03 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance
  • 07:02 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance
  • 07:02 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1160 (T312863)', diff saved to https://phabricator.wikimedia.org/P32356 and previous config saved to /var/cache/conftool/dbconfig/20220811-070252-ladsgroup.json
  • 06:47 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P32355 and previous config saved to /var/cache/conftool/dbconfig/20220811-064746-ladsgroup.json
  • 06:32 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P32354 and previous config saved to /var/cache/conftool/dbconfig/20220811-063240-ladsgroup.json
  • 06:28 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
  • 06:28 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
  • 06:17 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1160 (T312863)', diff saved to https://phabricator.wikimedia.org/P32353 and previous config saved to /var/cache/conftool/dbconfig/20220811-061734-ladsgroup.json
  • 06:17 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maint
  • 06:17 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maint
  • 06:06 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool db1162 (T314368 T298555 T312863 T310011 T309311 T60674 T298560 T303603 T310485)', diff saved to https://phabricator.wikimedia.org/P32352 and previous config saved to /var/cache/conftool/dbconfig/20220811-060625-ladsgroup.json
  • 06:01 ladsgroup@cumin1001: dbctl commit (dc=all): 'Promote db1122 to s2 primary and set section read-write T314368', diff saved to https://phabricator.wikimedia.org/P32351 and previous config saved to /var/cache/conftool/dbconfig/20220811-060113-ladsgroup.json
  • 06:00 ladsgroup@cumin1001: dbctl commit (dc=all): 'Set s2 eqiad as read-only for maintenance - T314368', diff saved to https://phabricator.wikimedia.org/P32350 and previous config saved to /var/cache/conftool/dbconfig/20220811-060042-ladsgroup.json
  • 06:00 Amir1: Starting s2 eqiad failover from db1162 to db1122 - T314368
  • 05:19 ladsgroup@cumin1001: dbctl commit (dc=all): 'Set db1122 with weight 0 T314368', diff saved to https://phabricator.wikimedia.org/P32349 and previous config saved to /var/cache/conftool/dbconfig/20220811-051913-ladsgroup.json
  • 05:19 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s2 T314368
  • 05:18 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s2 T314368
  • m: chown -R librenms /srv/librenms/rrd/ on netmon1003 T314972
  • 03:51 cwhite: chown librenms /srv/librenms/rrd/* on netmon1003 T314972
  • 02:55 ejegg: civicrm upgraded from 1f91ac2d to 92467234
  • 02:46 ejegg: updated process-control yaml files with @wmff alias
  • 02:08 ejegg: civicrm rolled back from 92467234 to 1f91ac2d
  • 02:05 ejegg: civicrm upgraded from 1f91ac2d to 92467234
  • 01:40 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 01:39 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 01:39 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 01:38 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 01:38 tstarling@deploy1002: Synchronized wmf-config/logging.php: (no justification provided) (duration: 03m 25s)
  • 01:19 tstarling@puppetmaster1001: conftool action : set/pooled=true; selector: dnsdisc=(appservers|api)-ro,name=codfw
  • 01:19 tstarling@puppetmaster1001: conftool action : set/pooled=true; selector: dnsdisc=sessionstore,name=codfw
  • 00:58 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=varnish-fe
  • 00:58 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=ats-be
  • 00:58 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=ats-tls
  • 00:57 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on cp2042.codfw.wmnet with reason: host down; depooled and will debug tomorrow
  • 00:57 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on cp2042.codfw.wmnet with reason: host down; depooled and will debug tomorrow

2022-08-10

  • 21:25 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1016.eqiad.wmnet
  • 21:23 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1014.eqiad.wmnet
  • 21:10 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 16 hosts with reason: T309810
  • 21:10 bking@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on 16 hosts with reason: T309810
  • 21:09 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on elastic[1101-1102].eqiad.wmnet with reason: T309810
  • 21:09 bking@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on elastic[1101-1102].eqiad.wmnet with reason: T309810
  • 21:00 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 21:00 cjming: end of UTC late backport window
  • 20:59 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 20:59 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 20:59 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Remove unused $wgEnableMWSuggest (duration: 03m 04s)
  • 20:58 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 20:56 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Enable new topic tool on dewiki (T313699) (duration: 03m 01s)
  • 20:34 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: testwiki: set $wgCdnMatchParameterOrder to false (T314868) (duration: 03m 20s)
  • 20:31 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 20:30 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 20:30 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 20:30 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 20:19 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 20:18 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 20:18 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 20:17 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 20:12 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 20:09 bking@cumin1001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
  • 20:08 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 20:08 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 20:08 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Start writing to cuc_actor everywhere except s4 and s8 (T233004) (duration: 03m 15s)
  • 20:07 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 19:51 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2053-2054].codfw.wmnet
  • 19:51 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc[2053-2054].codfw.wmnet
  • 19:35 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2019-2020].codfw.wmnet
  • 19:35 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for parse[2019-2020].codfw.wmnet
  • 19:35 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2016-2018].codfw.wmnet
  • 19:35 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for parse[2016-2018].codfw.wmnet
  • 19:34 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2036.codfw.wmnet
  • 19:34 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc2036.codfw.wmnet
  • 19:28 sukhe: testing ATS 9.1.3-1wm1 on cp4026: T309651
  • 19:09 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1087.eqiad.wmnet with OS bullseye
  • 19:06 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1086.eqiad.wmnet with OS bullseye
  • 18:55 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1087.eqiad.wmnet with reason: host reimage
  • 18:51 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1086.eqiad.wmnet with reason: host reimage
  • 18:50 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1087.eqiad.wmnet with reason: host reimage
  • 18:49 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1086.eqiad.wmnet with reason: host reimage
  • 18:47 bking@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 18:38 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1087.eqiad.wmnet with OS bullseye
  • 18:36 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1086.eqiad.wmnet with OS bullseye
  • 18:22 urandom: truncating Cassandra hints (eqiad datacenter) -- T314941
  • 18:13 urandom: truncating codfw Cassandra hints (eqiad datacenter) -- T314941
  • 18:07 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main2005.codfw.wmnet
  • 18:07 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for kafka-main2005.codfw.wmnet
  • 18:05 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repool D8 DBs after PDU maint (T310146)', diff saved to https://phabricator.wikimedia.org/P32346 and previous config saved to /var/cache/conftool/dbconfig/20220810-180529-ladsgroup.json
  • 17:42 otto@deploy1002: Finished deploy [analytics/refinery@6e47e0e]: Add missing changes to the deletion script - T270433 - [analytics/refinery@6e47e0e] (duration: 05m 28s)
  • 17:39 fnegri@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts labweb1002.wikimedia.org
  • 17:39 fnegri@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 17:36 otto@deploy1002: Started deploy [analytics/refinery@6e47e0e]: Add missing changes to the deletion script - T270433 - [analytics/refinery@6e47e0e]
  • 17:35 fnegri@cumin1001: START - Cookbook sre.dns.netbox
  • 17:34 otto@deploy1002: Finished deploy [analytics/refinery@6e47e0e] (hadoop-test): Add missing changes to the deletion script - T270433 - TEST [analytics/refinery@6e47e0e] (duration: 04m 19s)
  • 17:30 fnegri@cumin1001: START - Cookbook sre.hosts.decommission for hosts labweb1002.wikimedia.org
  • 17:30 otto@deploy1002: Started deploy [analytics/refinery@6e47e0e] (hadoop-test): Add missing changes to the deletion script - T270433 - TEST [analytics/refinery@6e47e0e]
  • 17:09 dzahn@cumin2002: START - Cookbook sre.dns.netbox
  • 17:08 otto@deploy1002: Started deploy [analytics/refinery@d4dd7e4] (hadoop-test): Add safety limits to refinery-drop-older-than - T270433 - TEST [analytics/refinery@d4dd7e4]
  • 17:06 sukhe: testing ATS 9.1.3-1wm1 on cp4032: T309651
  • 17:06 urandom: flushing RESTBase Cassandra tables -row B- to (temporarily) free instance-data space -- T314941
  • 17:05 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on krb2001.codfw.wmnet with reason: btullis codfw maintenance
  • 17:05 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on krb2001.codfw.wmnet with reason: btullis codfw maintenance
  • 17:04 dzahn@cumin2002: START - Cookbook sre.hosts.decommission for hosts gerrit2001.wikimedia.org
  • 17:02 sukhe: testing ATS 9.1.3-1wm1 on cp6008: T309651
  • 16:56 sukhe: testing ATS 9.1.3-1wm1 on cp6016: T309651
  • 16:55 fnegri@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts labweb1001.wikimedia.org
  • 16:55 fnegri@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 16:32 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts gerrit2001.wikimedia.org
  • 16:32 dzahn@cumin2002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
  • 16:32 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes[2013-2014].codfw.wmnet
  • 16:31 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for kubernetes[2013-2014].codfw.wmnet
  • 16:31 jelto: kubectl uncordon kubernetes2014.codfw.wmnet
  • 16:31 fnegri@cumin1001: START - Cookbook sre.dns.netbox
  • 16:30 jelto: kubectl uncordon kubernetes2013.codfw.wmnet
  • 16:29 urandom: restarting Cassandra (RESTBase) -row A- to apply r822110 -- T314941
  • 16:27 dzahn@cumin2002: START - Cookbook sre.dns.netbox
  • 16:25 fnegri@cumin1001: START - Cookbook sre.hosts.decommission for hosts labweb1001.wikimedia.org
  • 16:23 mutante: shutting down gerrit2001
  • 16:23 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2034-2035].codfw.wmnet
  • 16:23 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc[2034-2035].codfw.wmnet
  • 16:22 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2016-2018].codfw.wmnet
  • 16:22 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for parse[2016-2018].codfw.wmnet
  • 16:16 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=maps2010.codfw.wmnet
  • 16:16 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=sessionstore2003.codfw.wmnet
  • 16:13 sukhe: reprepro -C component/trafficserver9 include buster-wikimedia trafficserver_9.1.3-1wm1_amd64.changes: T309651
  • 16:13 dzahn@cumin2002: START - Cookbook sre.hosts.decommission for hosts gerrit2001.wikimedia.org
  • 16:11 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be[2039,2050,2056,2059].codfw.wmnet,thanos-be2004.codfw.wmnet with reason: PDU work
  • 16:10 mvernon@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be[2039,2050,2056,2059].codfw.wmnet,thanos-be2004.codfw.wmnet with reason: PDU work
  • 16:09 urandom: flushing tables in row D (RESTBase Cassandra cluster) -- T314941
  • 15:54 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for gitlab-runner2004.codfw.wmnet
  • 15:54 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for gitlab-runner2004.codfw.wmnet
  • 15:53 sukhe: poweroff cp2041, 42 for PDU upgrade: rack D7
  • 15:51 urandom: flushing tables in row B (RESTBase Cassandra cluster) -- T314941
  • 15:49 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2004.codfw.wmnet with reason: PDU maintenance
  • 15:49 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2004.codfw.wmnet with reason: PDU maintenance
  • 15:46 urandom: flushing tables in row A (RESTBase Cassandra cluster) -- T314941
  • 15:46 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2012.codfw.wmnet with reason: btullis codfw maintenance
  • 15:46 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp[2041-2042].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
  • 15:46 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2012.codfw.wmnet with reason: btullis codfw maintenance
  • 15:46 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 3:00:00 on cp[2041-2042].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
  • 15:46 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2011.codfw.wmnet with reason: btullis codfw maintenance
  • 15:46 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2011.codfw.wmnet with reason: btullis codfw maintenance
  • 15:45 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2010.codfw.wmnet with reason: btullis codfw maintenance
  • 15:45 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2010.codfw.wmnet with reason: btullis codfw maintenance
  • 15:45 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2009.codfw.wmnet with reason: btullis codfw maintenance
  • 15:45 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2009.codfw.wmnet with reason: btullis codfw maintenance
  • 15:37 urandom: (ephemerally) increasing hinted hand-off delivery rate limit to 16KB, RESTBase eqiad nodes -- T314941
  • 15:34 jbond: remove puppetmaster[12]002 from production
  • 15:30 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main2004.codfw.wmnet
  • 15:30 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for kafka-main2004.codfw.wmnet
  • 15:20 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2051-2052].codfw.wmnet
  • 15:20 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc[2051-2052].codfw.wmnet
  • 15:17 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc-gp2003.codfw.wmnet
  • 15:17 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc-gp2003.codfw.wmnet
  • 15:16 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2033.codfw.wmnet
  • 15:16 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc2033.codfw.wmnet
  • 15:14 _joe_: power off krb2002
  • 15:14 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on krb2002.codfw.wmnet with reason: PDU maintenance
  • 15:13 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on krb2002.codfw.wmnet with reason: PDU maintenance
  • 15:13 _joe_: shutting down rdb2010,puppetmaster2002 for d5 maintenance
  • 15:02 jelto: power off mc2035
  • 15:01 jelto: power off mc2034
  • 15:01 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mc2035.codfw.wmnet with reason: PDU swap
  • 15:01 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on mc2035.codfw.wmnet with reason: PDU swap
  • 15:01 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mc2034.codfw.wmnet with reason: PDU swap
  • 15:01 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on mc2034.codfw.wmnet with reason: PDU swap
  • 14:43 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: PDU Maint (T310146)
  • 14:43 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: PDU Maint (T310146)
  • 14:38 urandom: disabling reserved space on eqiad nodes (RESTBase), /dev/md2 (aka /srv/cassandra/instance-data) -- T314941
  • 14:28 jelto: power off kafka-main2004 gracefully
  • 14:28 hnowlan: shutting down sessionstore2003
  • 14:27 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=sessionstore2003.codfw.wmnet
  • 14:27 sukhe: power off cp2039, cp2040 for PDU upgrade: rack D
  • 14:27 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2003.codfw.wmnet with reason: PDU maintenance
  • 14:27 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2003.codfw.wmnet with reason: PDU maintenance
  • 14:25 jelto: power off mc-gp2003
  • 14:25 jelto: power off mc2033
  • 14:24 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:30:00 on kafka-main2004.codfw.wmnet with reason: PDU swap
  • 14:23 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:30:00 on kafka-main2004.codfw.wmnet with reason: PDU swap
  • 14:23 sukhe: depool codfw for PDU upgrade: rack D
  • 14:23 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:30:00 on mc-gp2003.codfw.wmnet with reason: PDU swap
  • 14:23 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:30:00 on mc-gp2003.codfw.wmnet with reason: PDU swap
  • 14:23 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:30:00 on mc2033.codfw.wmnet with reason: PDU swap
  • 14:23 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:30:00 on mc2033.codfw.wmnet with reason: PDU swap
  • 14:15 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp20[39|40]\.codfw\.wmnet,service=ats-tls
  • 14:13 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp[2039-2040].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
  • 14:13 urandom: flushing Cassandra tables, restbase1030
  • 14:13 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 3:00:00 on cp[2039-2040].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
  • 14:13 urandom: flushing Cassandra tables, restbase1019
  • 14:12 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on dns2002.wikimedia.org with reason: shutdown for PDU upgrade: rack D4
  • 14:12 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 3:00:00 on dns2002.wikimedia.org with reason: shutdown for PDU upgrade: rack D4
  • 14:11 urandom: flushing Cassandra tables, restbase1017 1018 1021 1024 1025 1026 1028 1029
  • 14:05 urandom: flushing tables, restbase1016
  • 13:52 hnowlan: powered up restbase2018
  • 13:32 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on netmon1003.wikimedia.org with reason: pdu
  • 13:32 filippo@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on netmon1003.wikimedia.org with reason: pdu
  • 13:32 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on logstash2003.codfw.wmnet with reason: pdu
  • 13:31 filippo@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on logstash2003.codfw.wmnet with reason: pdu
  • 13:31 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on logstash2029.codfw.wmnet with reason: pdu
  • 13:31 root@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on logstash2029.codfw.wmnet with reason: pdu
  • 13:30 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: T310146
  • 13:30 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: T310146
  • 13:17 elukey: powering on restbase2027
  • 13:12 elukey: powering on restbase2026
  • 13:12 _joe_: powering on restbase2023
  • 13:01 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1160 (T312863)', diff saved to https://phabricator.wikimedia.org/P32343 and previous config saved to /var/cache/conftool/dbconfig/20220810-130108-ladsgroup.json
  • 13:01 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
  • 13:00 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
  • 12:37 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic[2072,2084-2085].codfw.wmnet with reason: T310146
  • 12:37 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic[2072,2084-2085].codfw.wmnet with reason: T310146
  • 12:27 jbond: remove confd from servers that shouldn't have it
  • 12:05 ladsgroup@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/Echo/maintenance/removeOrphanedEvents.php: Backport: Run clean ups with removeOrphanedEvents in major batches (T310428) (duration: 03m 32s)
  • 11:47 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 11:46 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 11:46 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 11:43 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 11:15 jbond@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS buster
  • 10:54 jbond@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
  • 10:51 jbond@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
  • 10:37 jbond@cumin1001: START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster
  • 10:31 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2181-2182].codfw.wmnet with reason: D6 PDU maint (T310146)
  • 10:31 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2181-2182].codfw.wmnet with reason: D6 PDU maint (T310146)
  • 10:26 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on maps2010.codfw.wmnet with reason: PDU maintenance
  • 10:26 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on maps2010.codfw.wmnet with reason: PDU maintenance
  • 10:26 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=maps2010.codfw.wmnet
  • 10:25 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=maps2010.codfw.wmnet
  • 10:25 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=maps2008.codfw.wmnet
  • 10:24 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on restbase2018.codfw.wmnet with reason: PDU maintenance
  • 10:24 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase2018.codfw.wmnet
  • 10:24 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on restbase2018.codfw.wmnet with reason: PDU maintenance
  • 10:23 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ml-serve2008.codfw.wmnet with reason: PDU maintenance
  • 10:23 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on ml-serve2008.codfw.wmnet with reason: PDU maintenance
  • 10:20 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ores2009.codfw.wmnet with reason: PDU maintenance
  • 10:20 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on ores2009.codfw.wmnet with reason: PDU maintenance
  • 10:19 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ores2008.codfw.wmnet with reason: PDU maintenance
  • 10:19 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on ores2008.codfw.wmnet with reason: PDU maintenance
  • 10:03 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase202[367].codfw.wmnet
  • 10:02 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on restbase[2023,2026-2027].codfw.wmnet with reason: PDU maintenance
  • 10:02 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on restbase[2023,2026-2027].codfw.wmnet with reason: PDU maintenance
  • 09:53 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: D8 PDU Maint (T310146)
  • 09:52 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: D8 PDU Maint (T310146)
  • 09:51 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool D8 DBs for PDU maint (T310146)', diff saved to https://phabricator.wikimedia.org/P32341 and previous config saved to /var/cache/conftool/dbconfig/20220810-095059-ladsgroup.json
  • 09:36 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2101,2130,2140].codfw.wmnet,dbproxy2004.codfw.wmnet with reason: D6 PDU maint (T310146)
  • 09:36 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2101,2130,2140].codfw.wmnet,dbproxy2004.codfw.wmnet with reason: D6 PDU maint (T310146)
  • 09:34 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool D6 dbs (T310146)', diff saved to https://phabricator.wikimedia.org/P32340 and previous config saved to /var/cache/conftool/dbconfig/20220810-093433-ladsgroup.json
  • 09:31 jelto: depool services in codfw for upcoming PDU replacement - T309956
  • 09:30 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 8 hosts with reason: Maintenance
  • 09:29 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 8 hosts with reason: Maintenance
  • 09:29 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
  • 09:29 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
  • 09:28 jynus: shutdown backup2007 before pdu upgrade T310146
  • 09:16 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 09:15 ladsgroup@deploy1002: Synchronized php-1.39.0-wmf.23/maintenance/namespaceDupes.php: Backport: maintenance: Add support for links migration to namespaceDupes.php (T314711) (duration: 03m 18s)
  • 09:15 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2093,2120,2129,2172].codfw.wmnet with reason: D5 PDU maint (T310146)
  • 09:15 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2093,2120,2129,2172].codfw.wmnet with reason: D5 PDU maint (T310146)
  • 09:14 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 09:14 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 09:13 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 09:10 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool D5 dbs (T310146)', diff saved to https://phabricator.wikimedia.org/P32339 and previous config saved to /var/cache/conftool/dbconfig/20220810-091038-ladsgroup.json
  • 08:49 jynus: shutdown dbprov2003 before pdu upgrade T310146
  • 08:49 ayounsi@cumin1001: END (FAIL) - Cookbook sre.network.debug (exit_code=99) for Netbox interface ID cr1-drmrs:xe-0/1/2
  • 08:48 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
  • 08:48 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be2028.codfw.wmnet
  • 08:48 mvernon@cumin1001: START - Cookbook sre.hosts.remove-downtime for ms-be2028.codfw.wmnet
  • 08:42 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P32337 and previous config saved to /var/cache/conftool/dbconfig/20220810-084222-ladsgroup.json
  • 08:37 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 08:37 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
  • 08:36 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
  • 08:36 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 08:36 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 08:35 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 08:35 ladsgroup@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Stop writing to the old templatelinks fields in s5 (T312865) (duration: 03m 29s)
  • 08:32 jelto: power off gitlab-runner2004
  • 08:31 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:30:00 on gitlab-runner2004.codfw.wmnet with reason: PDU swap
  • 08:31 root@cumin1001: START - Cookbook sre.hosts.downtime for 8:30:00 on gitlab-runner2004.codfw.wmnet with reason: PDU swap
  • 08:29 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be2028.codfw.wmnet with reason: Trying to fix full /
  • 08:28 mvernon@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be2028.codfw.wmnet with reason: Trying to fix full /
  • 08:28 ayounsi@cumin1001: END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox interface ID cr1-drmrs:xe-0/1/2
  • 08:27 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
  • 08:27 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P32336 and previous config saved to /var/cache/conftool/dbconfig/20220810-082718-ladsgroup.json
  • 08:25 ayounsi@cumin1001: END (FAIL) - Cookbook sre.network.debug (exit_code=99) for Netbox interface ID cr1-drmrs:xe-0/1/2
  • 08:25 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
  • 08:24 ayounsi@cumin1001: END (FAIL) - Cookbook sre.network.debug (exit_code=99) for Netbox interface ID cr1-drmrs:xe-0/1/2
  • 08:24 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
  • 08:23 kart_: Run: mwscript namespaceDupes.php arywiki --fix (T291737)
  • 08:13 jynus: restart replication on db1117:m1 T309074
  • 08:12 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P32335 and previous config saved to /var/cache/conftool/dbconfig/20220810-081213-ladsgroup.json
  • 08:09 kartik@deploy1002: Finished scap: Backport: arywiki: change namespace translations, add unchanged namespaces and add old translations as aliases (T291737) (duration: 10m 37s)
  • 07:59 kartik@deploy1002: Started scap: Backport: arywiki: change namespace translations, add unchanged namespaces and add old translations as aliases (T291737)
  • 07:57 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P32334 and previous config saved to /var/cache/conftool/dbconfig/20220810-075708-ladsgroup.json
  • 07:56 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P32333 and previous config saved to /var/cache/conftool/dbconfig/20220810-075636-ladsgroup.json
  • 07:55 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 07:52 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 07:52 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 07:52 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance
  • 07:52 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance
  • 07:51 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 07:51 dcaro@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 07:46 dcaro@cumin1001: START - Cookbook sre.dns.netbox
  • 07:39 dcaro@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 07:34 dcaro@cumin1001: START - Cookbook sre.dns.netbox
  • 07:33 godog: depool thanos-fe2001 for debugging
  • 07:11 kartik@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Enable SectionTranslation on testwiki with new MT support from Google (T313296) (duration: 05m 44s)
  • 07:11 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 07:08 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 07:08 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 07:07 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 05:24 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on kubernetes[2013-2014].codfw.wmnet with reason: PDU maintenance
  • 05:24 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on kubernetes[2013-2014].codfw.wmnet with reason: PDU maintenance
  • 05:19 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on parse[2016-2020].codfw.wmnet with reason: PDU maintenance
  • 05:19 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on parse[2016-2020].codfw.wmnet with reason: PDU maintenance
  • 05:12 _joe_: starting to shut down servers in codfw for the PDU maintenance
  • 05:09 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on 10 hosts with reason: PDU maintenance
  • 05:09 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on 10 hosts with reason: PDU maintenance
  • 05:09 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on mc-gp2003.codfw.wmnet with reason: PDU maintenance
  • 05:09 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on mc-gp2003.codfw.wmnet with reason: PDU maintenance
  • 05:06 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on mc2033.codfw.wmnet with reason: PDU maintenance
  • 05:06 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on mc2033.codfw.wmnet with reason: PDU maintenance
  • 05:05 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on 7 hosts with reason: PDU maintenance
  • 05:05 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on 7 hosts with reason: PDU maintenance
  • 02:34 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 02:33 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 02:33 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 02:32 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 02:07 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 02:06 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 02:06 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 02:05 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply

2022-08-09

  • 23:17 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1011.eqiad.wmnet
  • 23:07 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
  • 23:06 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
  • 22:51 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
  • 22:51 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
  • 22:49 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
  • 22:49 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
  • 22:46 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1015.eqiad.wmnet
  • 22:31 ryankemper@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
  • 22:31 ryankemper@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
  • 22:28 bking@cumin1001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
  • 22:02 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
  • 22:02 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
  • 21:57 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs2006.codfw.wmnet with reason: T310146
  • 21:57 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs2006.codfw.wmnet with reason: T310146
  • 21:53 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
  • 21:52 bking@deploy1002: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
  • 21:50 bking@deploy1002: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
  • 21:49 bking@deploy1002: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
  • 21:43 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop: apply
  • 21:43 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop: apply
  • 21:43 bking@deploy1002: helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
  • 21:43 bking@deploy1002: helmfile [eqiad] START helmfile.d/services/changeprop: apply
  • 21:43 bking@deploy1002: helmfile [staging] DONE helmfile.d/services/changeprop: apply
  • 21:43 bking@deploy1002: helmfile [staging] START helmfile.d/services/changeprop: apply
  • 21:08 bking@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 21:00 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1014.eqiad.wmnet
  • 20:56 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
  • 20:55 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
  • 20:55 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148 (T312863)', diff saved to https://phabricator.wikimedia.org/P32332 and previous config saved to /var/cache/conftool/dbconfig/20220809-205548-ladsgroup.json
  • 20:51 bking@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wdqs1014.eqiad.wmnet
  • 20:51 bking@cumin1001: START - Cookbook sre.hosts.remove-downtime for wdqs1014.eqiad.wmnet
  • 20:46 bking@cumin1001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
  • 20:40 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P32331 and previous config saved to /var/cache/conftool/dbconfig/20220809-204042-ladsgroup.json
  • 20:25 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P32330 and previous config saved to /var/cache/conftool/dbconfig/20220809-202536-ladsgroup.json
  • 20:10 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148 (T312863)', diff saved to https://phabricator.wikimedia.org/P32329 and previous config saved to /var/cache/conftool/dbconfig/20220809-201030-ladsgroup.json
  • 19:57 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on wdqs1014.eqiad.wmnet with reason: T314890
  • 19:57 bking@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on wdqs1014.eqiad.wmnet with reason: T314890
  • 19:56 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on wdqs1016.eqiad.wmnet with reason: T314890
  • 19:56 bking@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on wdqs1016.eqiad.wmnet with reason: T314890
  • 19:55 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on wdqs1015.eqiad.wmnet with reason: T314890
  • 19:55 bking@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on wdqs1015.eqiad.wmnet with reason: T314890
  • 19:38 dcausse@deploy1002: helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply
  • 19:36 dcausse@deploy1002: helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply
  • 19:35 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1130.eqiad.wmnet with reason: Maintenance
  • 19:35 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 10:00:00 on db1130.eqiad.wmnet with reason: Maintenance
  • 19:25 bking@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 18:09 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 18:06 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
  • 17:54 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 17:47 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
  • 17:38 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1072.eqiad.wmnet with OS bullseye
  • 17:29 vgutierrez: test trafficserver 9.1.2-1wm2 in cp6016 - T309651
  • 17:15 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1072.eqiad.wmnet with reason: host reimage
  • 17:13 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1072.eqiad.wmnet with reason: host reimage
  • 17:00 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1072.eqiad.wmnet with OS bullseye
  • 16:54 bking@cumin1001: END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
  • 16:54 bking@cumin1001: START - Cookbook sre.elasticsearch.force-shard-allocation
  • 16:53 bking@cumin1001: END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
  • 16:53 bking@cumin1001: START - Cookbook sre.elasticsearch.force-shard-allocation
  • 16:26 bking@deploy1002: helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
  • 16:26 bking@deploy1002: helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
  • 16:01 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1069.eqiad.wmnet with OS bullseye
  • 15:45 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1069.eqiad.wmnet with reason: host reimage
  • 15:42 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1069.eqiad.wmnet with reason: host reimage
  • 15:32 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 15:31 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 15:31 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 15:30 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1069.eqiad.wmnet with OS bullseye
  • 15:28 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 15:27 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1058.eqiad.wmnet with OS bullseye
  • 15:08 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1058.eqiad.wmnet with reason: host reimage
  • 15:05 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1058.eqiad.wmnet with reason: host reimage
  • 14:59 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
  • 14:59 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
  • m: finished running 'homer "status:active" commit "netmon: Add the netmon1003 host as a syslog destination"' in the cumin1001 host. Homer reported no errors.
  • 14:54 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
  • 14:50 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1058.eqiad.wmnet with OS bullseye
  • 14:28 bking@cumin1001: conftool action : set/pooled=false; selector: dnsdisc=wdqs,name=codfw
  • 13:57 kevinbazira@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
  • 13:57 kevinbazira@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
  • m: Add the new netmon1003 host as a syslog destination in homer templates/common/system.conf https://gerrit.wikimedia.org/r/c/operations/homer/public/+/819124
  • m: Successfully ran '# run-puppet-merge' in the netmon1002 and netmon1003 hosts.
  • m: Running '# run-puppet-agent' in the netmon1003 host
  • m: Running '# run-puppet-agent' in the netmon1002 host
  • 13:47 ryankemper@cumin1001: END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
  • 13:46 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.force-shard-allocation
  • m: puppet-merge on puppetmaster2004.codfw.wmnet for patch 819179 succeeded
  • m: Set netmon1003 as netmon_server and netmon1002 as a netmon_servers_failover in the Puppet repository https://gerrit.wikimedia.org/r/c/operations/puppet/+/819179
  • m: authdns updated successfully
  • m: Had to revert https://gerrit.wikimedia.org/r/c/operations/dns/+/819177 because I rebased my changes incorrectly, sent the new patch in https://gerrit.wikimedia.org/r/c/operations/dns/+/821746
  • m: running '# authdns-update' in ns0.wikimedia.org
  • m: Flip DNS for LibreNMS and Smokeping from netmon1002 to netmon1003 https://gerrit.wikimedia.org/r/c/operations/dns/+/819177
  • 13:23 jynus: stop replication on db1117:m1 T309074
  • m: netmon1002 to netmon1003 failover
  • 13:17 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
  • 13:16 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
  • 10:58 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
  • 09:53 vgutierrez: rolling restart of pybal in eqsin - T310070
  • 09:25 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
  • 09:24 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
  • 09:24 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
  • 09:12 vgutierrez: rolling restart of pybal in codfw - T310070
  • 08:47 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
  • 08:30 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
  • 08:28 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
  • 08:27 elukey@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
  • 08:27 elukey@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
  • 08:27 elukey@deploy1002: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
  • 08:26 elukey@deploy1002: helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
  • 08:24 jynus: starting data check using es1021 and es2021, expect increased read traffic T314559
  • 08:21 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
  • 06:22 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance
  • 06:22 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance
  • 06:19 Amir1: dbmaint s5@eqiad (T312863 T312984 T310011 T310485)
  • 06:11 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1130.eqiad.wmnet with reason: Maint
  • 06:11 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1130.eqiad.wmnet with reason: Maint
  • 06:08 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool db1130 T314370', diff saved to https://phabricator.wikimedia.org/P32323 and previous config saved to /var/cache/conftool/dbconfig/20220809-060836-ladsgroup.json
  • 06:07 oblivian@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
  • 06:02 ladsgroup@cumin1001: dbctl commit (dc=all): 'Promote db1100 to s5 primary and set section read-write T314370', diff saved to https://phabricator.wikimedia.org/P32322 and previous config saved to /var/cache/conftool/dbconfig/20220809-060159-ladsgroup.json
  • 06:01 ladsgroup@cumin1001: dbctl commit (dc=all): 'Set s5 eqiad as read-only for maintenance - T314370', diff saved to https://phabricator.wikimedia.org/P32321 and previous config saved to /var/cache/conftool/dbconfig/20220809-060105-ladsgroup.json
  • 06:00 Amir1: Starting s5 eqiad failover from db1130 to db1100 - T314370
  • 05:12 ladsgroup@cumin1001: dbctl commit (dc=all): 'Set db1100 with weight 0 T314370', diff saved to https://phabricator.wikimedia.org/P32320 and previous config saved to /var/cache/conftool/dbconfig/20220809-051251-ladsgroup.json
  • 05:12 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 22 hosts with reason: Primary switchover s5 T314370
  • 05:11 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on 22 hosts with reason: Primary switchover s5 T314370
  • 02:42 ejegg: SmashPig upgraded from 9b97ea15 to 13e9e9cc
  • 02:31 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1148 (T312863)', diff saved to https://phabricator.wikimedia.org/P32318 and previous config saved to /var/cache/conftool/dbconfig/20220809-023113-ladsgroup.json
  • 02:31 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance
  • 02:30 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance
  • 02:30 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T312863)', diff saved to https://phabricator.wikimedia.org/P32317 and previous config saved to /var/cache/conftool/dbconfig/20220809-023052-ladsgroup.json
  • 02:28 ejegg: payments-wiki upgraded from 6880236d to cf5e1848
  • 02:15 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P32316 and previous config saved to /var/cache/conftool/dbconfig/20220809-021546-ladsgroup.json
  • 02:00 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P32315 and previous config saved to /var/cache/conftool/dbconfig/20220809-020040-ladsgroup.json
  • 01:45 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T312863)', diff saved to https://phabricator.wikimedia.org/P32314 and previous config saved to /var/cache/conftool/dbconfig/20220809-014534-ladsgroup.json

2022-08-08

  • 23:52 tstarling@deploy1002: Synchronized wmf-config/InitialiseSettings.php: clean up testwiki experiments T314750 (duration: 03m 19s)
  • 23:47 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 23:46 tstarling@deploy1002: Synchronized wmf-config/CommonSettings.php: clean up testwiki experiments T314750 (duration: 03m 27s)
  • 23:46 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 23:46 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 23:45 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 23:32 eileen___: config revision changed from f5668044 to 787cd0e0
  • 23:32 eileen___: civicrm upgraded from 497bddf7 to 1f91ac2d
  • 22:16 ryankemper@cumin1001: END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
  • 22:16 ryankemper@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host elastic1065.eqiad.wmnet with OS bullseye
  • 21:53 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1065.eqiad.wmnet with reason: host reimage
  • 21:50 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1065.eqiad.wmnet with reason: host reimage
  • 21:36 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1065.eqiad.wmnet with OS bullseye
  • 21:12 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1062.eqiad.wmnet with OS bullseye
  • 20:53 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1062.eqiad.wmnet with reason: host reimage
  • 20:50 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1062.eqiad.wmnet with reason: host reimage
  • 20:36 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1062.eqiad.wmnet with OS bullseye
  • 20:32 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 20:31 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 20:31 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 20:29 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
  • 20:28 cjming: end of UTC late backport window
  • 20:27 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 20:27 cjming@deploy1002: Synchronized php-1.39.0-wmf.23/skins/Vector/resources/skins.vector.styles/layouts/grid.less: Backport: Fix grid blowout bug (T314756) (duration: 03m 26s)
  • 20:12 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 20:11 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 20:11 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 20:11 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Disable sticky header edit A/B test for pilot wikis (T312296) (duration: 03m 35s)
  • 20:08 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 17:34 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1088.eqiad.wmnet with OS bullseye
  • 17:15 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1088.eqiad.wmnet with reason: host reimage
  • 17:12 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1088.eqiad.wmnet with reason: host reimage
  • 17:00 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1088.eqiad.wmnet with OS bullseye
  • 16:54 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1085.eqiad.wmnet with OS bullseye
  • 16:49 ryankemper@cumin1001: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
  • 16:43 pt1979@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 16:41 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
  • 16:39 pt1979@cumin2002: START - Cookbook sre.dns.netbox
  • 16:38 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
  • 16:26 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1085.eqiad.wmnet with OS bullseye
  • 16:24 bking@cumin1001: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host elastic1085.eqiad.wmnet with OS bullseye
  • 16:19 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
  • 16:16 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
  • 16:16 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
  • 16:14 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
  • 16:12 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
  • 16:10 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
  • 16:09 ryankemper@cumin1001: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
  • 16:04 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1085.eqiad.wmnet with OS bullseye
  • 16:00 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1084.eqiad.wmnet with OS bullseye
  • 15:58 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
  • 15:47 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1084.eqiad.wmnet with reason: host reimage
  • 15:46 sukhe: upload reprepro -C main include bullseye-wikimedia python-pynetbox_6.6.0-1+wmf11u1_amd64.changes
  • 15:45 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1084.eqiad.wmnet with reason: host reimage
  • 15:37 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2021.codfw.wmnet with reason: Maint
  • 15:37 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2021.codfw.wmnet with reason: Maint
  • 15:32 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1084.eqiad.wmnet with OS bullseye
  • 14:59 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
  • 14:55 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
  • 14:47 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cp5001.eqsin.wmnet with reason: depooled: faulty DIMM: T314256
  • 14:46 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on cp5001.eqsin.wmnet with reason: depooled: faulty DIMM: T314256
  • 14:34 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
  • 14:11 kevinbazira@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
  • 13:03 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 13:01 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 13:01 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 12:58 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 12:56 urbanecm@deploy1002: Synchronized wmf-config/CommonSettings.php: 77fd5ab: Growth: Add new rights to wgAvailableRights (duration: 03m 24s)
  • 12:30 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1102.eqiad.wmnet
  • 12:09 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 12:09 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 12:06 urbanecm@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/GrowthExperiments/: 3eaf155: MentorTools: Do not use MentorWeightManager (T314362) (duration: 03m 31s)
  • 12:04 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 11:43 btullis@cumin1001: START - Cookbook sre.hosts.reboot-single for host an-worker1102.eqiad.wmnet
  • 11:21 jelto@cumin1001: conftool action : set/pooled=yes; selector: name=kubernetes2022.codfw.wmnet
  • 11:21 jelto: kubectl uncordon kubernetes2022.codfw.wmnet
  • 10:43 Amir1: Removing db2079 from orchestrator (T313885)
  • 10:39 Amir1: Removing db2079 from zarcillo (T313885)
  • 10:35 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2079.codfw.wmnet
  • 10:35 ladsgroup@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 10:30 ladsgroup@cumin1001: START - Cookbook sre.dns.netbox
  • 10:25 ladsgroup@cumin1001: START - Cookbook sre.hosts.decommission for hosts db2079.codfw.wmnet
  • 10:18 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on db2079.codfw.wmnet with reason: Decom
  • 10:17 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on db2079.codfw.wmnet with reason: Decom
  • 08:41 jbond: deploy libtirpc update
  • 07:57 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1144:3314 (T312863)', diff saved to https://phabricator.wikimedia.org/P32310 and previous config saved to /var/cache/conftool/dbconfig/20220808-075723-ladsgroup.json
  • 07:57 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
  • 07:57 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
  • 07:57 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142 (T312863)', diff saved to https://phabricator.wikimedia.org/P32309 and previous config saved to /var/cache/conftool/dbconfig/20220808-075702-ladsgroup.json
  • 07:53 godog: grow sda/sdb 3 by 100G on thanos-be2001 - T314275
  • 07:50 godog: grow sda/sdb 3 by 100G on thanos-be1004 - T314275
  • 07:41 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P32308 and previous config saved to /var/cache/conftool/dbconfig/20220808-074156-ladsgroup.json
  • 07:32 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 07:27 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 07:27 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 07:26 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P32307 and previous config saved to /var/cache/conftool/dbconfig/20220808-072650-ladsgroup.json
  • 07:23 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 07:22 kartik@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: trwikivoyage: Create rollbacker user group (T314678) (duration: 03m 17s)
  • 07:18 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 07:17 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 07:17 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 07:16 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 07:11 elukey: restart rsyslog on ml-serve2007
  • 07:11 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 07:11 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142 (T312863)', diff saved to https://phabricator.wikimedia.org/P32306 and previous config saved to /var/cache/conftool/dbconfig/20220808-071144-ladsgroup.json
  • 07:10 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 07:10 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 07:09 kartik@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Enable SectionTranslation on 10 Wikipedias where ContentTranslation is default (T308829) (duration: 03m 15s)
  • 07:09 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 07:06 XioNoX: add CSP headers to Netbox - T296356
  • 07:05 elukey: restart rsyslog on ml-serve-ctrl2001

2022-08-07

  • 19:58 taavi: taavi@mwmaint1002 ~ $ echo "https://upload.wikimedia.org/wikipedia/commons/1/15/Keep_tidy_ask.svg" | mwscript purgeList.php --wiki enwiki # T314712
  • 13:52 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1142 (T312863)', diff saved to https://phabricator.wikimedia.org/P32305 and previous config saved to /var/cache/conftool/dbconfig/20220807-135204-ladsgroup.json
  • 13:51 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance
  • 13:51 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance
  • 13:51 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141 (T312863)', diff saved to https://phabricator.wikimedia.org/P32304 and previous config saved to /var/cache/conftool/dbconfig/20220807-135143-ladsgroup.json
  • 13:36 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P32303 and previous config saved to /var/cache/conftool/dbconfig/20220807-133637-ladsgroup.json
  • 13:21 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P32302 and previous config saved to /var/cache/conftool/dbconfig/20220807-132131-ladsgroup.json
  • 13:06 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141 (T312863)', diff saved to https://phabricator.wikimedia.org/P32301 and previous config saved to /var/cache/conftool/dbconfig/20220807-130625-ladsgroup.json
  • 12:06 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1141 (T312863)', diff saved to https://phabricator.wikimedia.org/P32300 and previous config saved to /var/cache/conftool/dbconfig/20220807-120610-ladsgroup.json
  • 12:06 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance
  • 12:05 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance
  • 12:05 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149 (T312863)', diff saved to https://phabricator.wikimedia.org/P32299 and previous config saved to /var/cache/conftool/dbconfig/20220807-120549-ladsgroup.json
  • 11:50 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P32298 and previous config saved to /var/cache/conftool/dbconfig/20220807-115043-ladsgroup.json
  • 11:35 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P32297 and previous config saved to /var/cache/conftool/dbconfig/20220807-113537-ladsgroup.json
  • 11:20 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149 (T312863)', diff saved to https://phabricator.wikimedia.org/P32296 and previous config saved to /var/cache/conftool/dbconfig/20220807-112031-ladsgroup.json

2022-08-06

  • 17:59 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1149 (T312863)', diff saved to https://phabricator.wikimedia.org/P32295 and previous config saved to /var/cache/conftool/dbconfig/20220806-175916-ladsgroup.json
  • 17:59 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance
  • 17:58 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance
  • 03:10 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 03:09 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 03:09 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 03:08 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 03:03 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 03:02 krinkle@deploy1002: Synchronized w/: I9067d4 (duration: 03m 25s)
  • 03:02 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 03:02 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 03:01 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 02:41 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 02:39 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 02:39 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 02:38 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 02:38 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance
  • 02:37 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance
  • 02:33 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 02:32 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 02:32 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 02:31 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply

2022-08-05

  • 22:20 dcausse@deploy1002: Finished deploy [wikimedia/discovery/analytics@71fe016]: Fix schedule_interval for image_recommendation_weekly (duration: 02m 01s)
  • 22:18 dcausse@deploy1002: Started deploy [wikimedia/discovery/analytics@71fe016]: Fix schedule_interval for image_recommendation_weekly
  • 17:08 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1195.eqiad.wmnet with OS bullseye
  • 16:54 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1194.eqiad.wmnet with OS bullseye
  • 16:53 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1195.eqiad.wmnet with reason: host reimage
  • 16:49 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1195.eqiad.wmnet with reason: host reimage
  • 16:41 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1194.eqiad.wmnet with reason: host reimage
  • 16:37 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1194.eqiad.wmnet with reason: host reimage
  • 16:34 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1195.eqiad.wmnet with OS bullseye
  • 16:27 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp203[56]\.codfw\.wmnet,service=varnish-fe
  • 16:27 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp203[56]\.codfw\.wmnet,service=ats-be
  • 16:27 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp203[56]\.codfw\.wmnet,service=ats-tls
  • 16:26 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1194.eqiad.wmnet with OS bullseye
  • 16:25 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1193.eqiad.wmnet with OS bullseye
  • 16:21 pt1979@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host db1192.eqiad.wmnet with OS bullseye
  • 16:12 dcausse@deploy1002: Finished deploy [wikimedia/discovery/analytics@8489923]: T304954: Automate imagesuggestion imports (duration: 02m 03s)
  • 16:11 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1193.eqiad.wmnet with reason: host reimage
  • 16:11 milimetric@deploy1002: Finished deploy [analytics/refinery@fe7bf9e]: Hotfix for webrequest load refine, now with FORCE :) (duration: 06m 09s)
  • 16:10 dcausse@deploy1002: Started deploy [wikimedia/discovery/analytics@8489923]: T304954: Automate imagesuggestion imports
  • 16:07 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1193.eqiad.wmnet with reason: host reimage
  • 16:07 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1192.eqiad.wmnet with reason: host reimage
  • 16:05 milimetric@deploy1002: Started deploy [analytics/refinery@fe7bf9e]: Hotfix for webrequest load refine, now with FORCE :)
  • 16:04 milimetric@deploy1002: Finished deploy [analytics/refinery@fe7bf9e]: Hotfix for webrequest load refine (duration: 34m 38s)
  • 16:03 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1192.eqiad.wmnet with reason: host reimage
  • 15:55 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1193.eqiad.wmnet with OS bullseye
  • 15:52 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1191.eqiad.wmnet with OS bullseye
  • 15:51 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1192.eqiad.wmnet with OS bullseye
  • 15:42 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1190.eqiad.wmnet with OS bullseye
  • 15:38 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1191.eqiad.wmnet with reason: host reimage
  • 15:34 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1191.eqiad.wmnet with reason: host reimage
  • 15:30 milimetric@deploy1002: Started deploy [analytics/refinery@fe7bf9e]: Hotfix for webrequest load refine
  • 15:28 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1190.eqiad.wmnet with reason: host reimage
  • 15:25 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1190.eqiad.wmnet with reason: host reimage
  • 15:24 jbond: upload trapperkeeper-metrics-clojure to puppet7 component
  • 15:22 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1191.eqiad.wmnet with OS bullseye
  • 15:19 jbond: upload puppetlabs-http-client-clojure to puppet7 component
  • 15:17 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 15:16 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 15:16 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 15:15 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 15:14 dancy@deploy1002: Finished scap: Backport for gerrit:820653 scap gitignore: ignore all files under the `scap` directory (duration: 04m 41s)
  • 15:11 jbond: upload jolokia to puppet7 component
  • 15:10 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1185.eqiad.wmnet with OS bullseye
  • 15:09 dancy@deploy1002: Started scap: Backport for gerrit:820653 scap gitignore: ignore all files under the `scap` directory
  • 15:09 jbond: upload test-chuck-clojure to puppet7 component
  • 15:05 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1190.eqiad.wmnet with OS bullseye
  • 15:04 jbond: upload test-check-clojure to puppet7 component
  • 14:57 jbond: upload nippy-clojure to puppet7 component
  • 14:56 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1185.eqiad.wmnet with reason: host reimage
  • 14:52 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1185.eqiad.wmnet with reason: host reimage
  • 14:43 jbond: upload fressian to puppet7 component
  • <