Server Admin Log: Difference between revisions

imported>Stashbot
(cjming: end of UTC late backport & config window)
imported>Stashbot
(mutante: wikistats-bullseye:~$ /usr/lib/wikistats/update.php wp prefix blk ; /usr/lib/wikistats/update.php wp prefix kcg T315121)
(229 intermediate revisions by 2 users not shown)
== 2021-12-10 ==
== 2022-08-12 ==
* 00:36 cjming: end of UTC late backport & config window
* 23:41 mutante: wikistats-bullseye:~$ /usr/lib/wikistats/update.php wp prefix blk ; /usr/lib/wikistats/update.php wp prefix kcg [[phab:T315121|T315121]]
* 00:34 cjming@deploy1002: Synchronized php-1.38.0-wmf.12/extensions/MediaSearch/resources/components/QuickView.vue: Backport: [[gerrit:745646{{!}}Search_result_page_id should be integer (T297400)]] (duration: 00m 55s)
* 23:38 mutante: [mwmaint1002:~] $ sudo systemctl start mediawiki_job_initsitestats.timer [[phab:T315121|T315121]]
* 00:33 cjming@deploy1002: Synchronized php-1.38.0-wmf.12/skins/Vector: Backport: [[gerrit:745607{{!}}Update A/B test enrollment name (T292587)]] (duration: 00m 56s)
* 22:14 ryankemper@cumin1001: END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 00:31 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 21:48 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1071.eqiad.wmnet with OS bullseye
* 00:30 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 21:45 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddb2002-dev.codfw.wmnet with OS bullseye
* 00:23 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:745385{{!}}Revert "VE on zh.wiki: Enable single-edit-tab mode" (T296269)]] (duration: 00m 56s)
* 21:27 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1071.eqiad.wmnet with reason: host reimage
* 00:23 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 21:25 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1071.eqiad.wmnet with reason: host reimage
* 00:22 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 21:12 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1071.eqiad.wmnet with OS bullseye
* 00:17 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:745598{{!}}Update WebABTestEnrollment name (T295972)]] (duration: 00m 57s)
* 21:10 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb2002-dev.codfw.wmnet with reason: host reimage
* 00:11 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 21:06 andrew@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb2002-dev.codfw.wmnet with reason: host reimage
* 00:09 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 21:06 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1053.eqiad.wmnet with OS bullseye
* 20:50 andrew@cumin1001: START - Cookbook sre.hosts.reimage for host clouddb2002-dev.codfw.wmnet with OS bullseye
* 20:43 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1053.eqiad.wmnet with reason: host reimage
* 20:39 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1053.eqiad.wmnet with reason: host reimage
* 20:24 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1053.eqiad.wmnet with OS bullseye
* 20:12 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1048.eqiad.wmnet with OS bullseye
* 19:55 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1048.eqiad.wmnet with reason: host reimage
* 19:53 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1048.eqiad.wmnet with reason: host reimage
* 19:42 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1048.eqiad.wmnet with OS bullseye
* 19:38 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1146:3314 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32375 and previous config saved to /var/cache/conftool/dbconfig/20220812-193822-ladsgroup.json
* 19:38 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
* 19:38 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
* 19:38 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1121 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32374 and previous config saved to /var/cache/conftool/dbconfig/20220812-193801-ladsgroup.json
* 19:33 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1054.eqiad.wmnet with OS bullseye
* 19:22 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P32373 and previous config saved to /var/cache/conftool/dbconfig/20220812-192255-ladsgroup.json
* 19:12 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1054.eqiad.wmnet with reason: host reimage
* 19:09 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1054.eqiad.wmnet with reason: host reimage
* 19:07 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P32372 and previous config saved to /var/cache/conftool/dbconfig/20220812-190749-ladsgroup.json
* 18:58 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet with reason: Maint
* 18:58 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet with reason: Maint
* 18:54 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1054.eqiad.wmnet with OS bullseye
* 18:52 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1121 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32371 and previous config saved to /var/cache/conftool/dbconfig/20220812-185243-ladsgroup.json
* 18:48 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1066.eqiad.wmnet with OS bullseye
* 18:25 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1066.eqiad.wmnet with reason: host reimage
* 18:22 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1066.eqiad.wmnet with reason: host reimage
* 18:08 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1066.eqiad.wmnet with OS bullseye
* 18:00 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1064.eqiad.wmnet with OS bullseye
* 17:42 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1064.eqiad.wmnet with reason: host reimage
* 17:39 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1064.eqiad.wmnet with reason: host reimage
* 17:24 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1064.eqiad.wmnet with OS bullseye
* 17:21 pt1979@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts netmon2002.wikimedia.org
* 17:21 pt1979@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts netmon2002.wikimedia.org
* 17:19 pt1979@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netmon2002.wikimedia.org with OS bullseye
* 17:04 pt1979@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netmon2002.wikimedia.org with reason: host reimage
* 17:01 pt1979@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on netmon2002.wikimedia.org with reason: host reimage
* 16:42 pt1979@cumin2002: START - Cookbook sre.hosts.reimage for host netmon2002.wikimedia.org with OS bullseye
* 16:26 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1067.eqiad.wmnet with OS bullseye
* 16:21 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol2003-dev.wikimedia.org
* 16:21 andrew@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:16 andrew@cumin1001: START - Cookbook sre.dns.netbox
* 16:11 andrew@cumin1001: START - Cookbook sre.hosts.decommission for hosts cloudcontrol2003-dev.wikimedia.org
* 16:08 pt1979@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['netmon2002.wikimedia.org']
* 16:03 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1067.eqiad.wmnet with reason: host reimage
* 15:58 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1067.eqiad.wmnet with reason: host reimage
* 15:43 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1067.eqiad.wmnet with OS bullseye
* 15:37 pt1979@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['netmon2002.wikimedia.org']
* 15:31 jbond@cumin2002: END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['netmon2002.wikimedia.org']
* 15:31 jbond@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['netmon2002.wikimedia.org']
* 15:07 jbond@cumin2002: END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts netmon1002.wikimedia.org
* 15:07 jbond@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts netmon1002.wikimedia.org
* 15:04 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1061.eqiad.wmnet with OS bullseye
* 14:46 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1061.eqiad.wmnet with reason: host reimage
* 14:46 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2042.codfw.wmnet,service=varnish-fe
* 14:46 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2042.codfw.wmnet,service=ats-be
* 14:46 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2042.codfw.wmnet,service=ats-tls
* 14:43 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1061.eqiad.wmnet with reason: host reimage
* 14:29 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maint
* 14:28 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1061.eqiad.wmnet with OS bullseye
* 14:28 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maint
* 14:24 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1063.eqiad.wmnet with OS bullseye
* 14:05 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1063.eqiad.wmnet with reason: host reimage
* 14:02 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1063.eqiad.wmnet with reason: host reimage
* 13:47 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1063.eqiad.wmnet with OS bullseye
* 13:41 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 06:01 ryankemper@puppetmaster1001: conftool action : set/pooled=yes:weight=10; selector: name=elastic10[8-9][0-9].*
* 05:54 ryankemper@puppetmaster1001: conftool action : set/pooled=yes:weight=10; selector: name=elastic110.*
* 01:03 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1121 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32369 and previous config saved to /var/cache/conftool/dbconfig/20220812-010312-ladsgroup.json
* 01:03 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
* 01:02 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
* 01:02 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance
* 01:02 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance
* 01:02 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1147 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32368 and previous config saved to /var/cache/conftool/dbconfig/20220812-010233-ladsgroup.json
* 00:47 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P32367 and previous config saved to /var/cache/conftool/dbconfig/20220812-004727-ladsgroup.json
* 00:32 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P32366 and previous config saved to /var/cache/conftool/dbconfig/20220812-003221-ladsgroup.json
* 00:17 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1147 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32365 and previous config saved to /var/cache/conftool/dbconfig/20220812-001715-ladsgroup.json


== 2021-12-09 ==
== 2022-08-11 ==
* 22:14 dancy@deploy1002: Synchronized README: testing https://gerrit.wikimedia.org/r/745572 (duration: 00m 55s)
* 21:30 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 22:07 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 21:29 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 22:06 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 21:29 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 22:05 ladsgroup@deploy1002: Synchronized php-1.38.0-wmf.12/extensions/FlaggedRevs/maintenance/pruneRevData.php: Backport: [[gerrit:745383{{!}}Fix the mistake in passing parameter (T296380)]] (duration: 02m 11s)
* 21:28 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 21:15 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 21:23 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:58 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
* 21:22 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:09 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 21:22 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:07 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 21:21 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:04 dancy@deploy1002: rebuilt and synchronized wikiversions files: group2 wikis to 1.38.0-wmf.12  refs [[phab:T293953|T293953]]
* 21:16 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 19:54 legoktm: deployed patch for [[phab:T297416|T297416]]
* 21:15 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 19:40 majavah: deployed patch for [[phab:T297416|T297416]]
* 21:15 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 19:29 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite2003.codfw.wmnet
* 21:14 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 19:23 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host graphite2003.codfw.wmnet
* 21:04 thcipriani@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: revert [[gerrit:806944{{!}}Define default value for "wmgSiteLogoVariants" (T305692 T308620)]] (duration: 03m 15s)
* 19:21 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|4e6cba0d3590446bb02815b65ba1c4ae9ed7bfac}}: VE on zh.wiki: Enable single-edit-tab mode ([[phab:T296269|T296269]]) (duration: 01m 05s)
* 20:59 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 19:20 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:58 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 19:19 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:58 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 19:14 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|06bd3e627cfef805f8b56be2b38b9125471b1410}}: kartographer: Enable tegola on frwiki (duration: 01m 05s)
* 20:57 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 19:13 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:52 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 19:12 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:51 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 19:09 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|c1e95519682f0bd6633fb9fd7f49e7a664ec9f87}}: Deploy sticky header and A/B test enrollment to office, test wikis ([[phab:T295972|T295972]]) (duration: 01m 06s)
* 20:50 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 18:51 cwhite: powercycle graphite2003 [[phab:T297265|T297265]]
* 20:49 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 18:17 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 20:47 thcipriani@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:806944{{!}}Define default value for "wmgSiteLogoVariants" (T305692 T308620)]] (duration: 03m 07s)
* 18:15 elukey: kafka-main2003 back in service with the old OS (stretch). Re-created a new puppet host key and signed it on the puppet master
* 20:44 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 18:11 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
* 20:43 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 17:48 cwhite: point kibana7 to OpenSearch in codfw [[phab:T288621|T288621]]
* 20:43 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 17:46 elukey@cumin1001: END (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for kafka-main2003.codfw.wmnet: Renew puppet certificate - elukey@cumin1001
* 20:42 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 17:46 elukey@cumin1001: START - Cookbook sre.puppet.renew-cert for kafka-main2003.codfw.wmnet: Renew puppet certificate - elukey@cumin1001
* 20:29 thcipriani@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/VisualEditor/modules/ve-mw/preinit/ve.init.mw.DesktopArticleTarget.init.js: Backport: [[gerrit:822396{{!}}Do not show incompatible skin warning when page is not editable (T314952)]] (duration: 03m 16s)
* 17:40 elukey@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2003.codfw.wmnet with OS buster
* 20:27 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 17:15 elukey@cumin1001: START - Cookbook sre.hosts.reimage for host kafka-main2003.codfw.wmnet with OS buster
* 20:26 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 17:00 elukey: stop kafka* on kafka-main2003 as pre-step before reimaging
* 20:26 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 16:56 hnowlan: remove restbase certificates and configuration entries for decommissioned hosts
* 20:25 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 16:54 mvernon@deploy1002: Synchronized private/PrivateSettings.php: Update swift config [[phab:T296767|T296767]] (duration: 01m 05s)
* 20:23 mutante: merging change on prod phabricator host to allow scap deployment, part 1
* 16:22 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 19:42 damilare: payments-wiki upgraded from {{Gerrit|cf5e1848}} to {{Gerrit|0894d75a}}
* 16:17 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
* 19:41 mutante: disabling puppet on C:profile::phabricator::main
* 15:44 elukey: run `ipmitool -I lanplus -H "kafka-main2003.mgmt.codfw.wmnet" -U root -E mc reset cold` from cumin2001
* 19:20 mvernon@cumin2002: END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-eqiad: upgrade to 3.11.13 [[phab:T309896|T309896]] - mvernon@cumin2002
* 14:39 moritzm: installing postgres security updates on eqiad maps master (and replicas)
* 17:58 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 14:37 jayme: updated calico chart to calico-0.1.15 on all kubernetes clusters (introducing IPAMConfig) - [[phab:T296303|T296303]]
* 17:58 taavi@deploy1002: Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:822428{{!}}Fix labtestwiki database name servers (T310795)]] (duration: 03m 39s)
* 14:36 jayme@deploy1002: helmfile [eqiad] DONE helmfile.d/admin 'apply'.
* 17:57 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 14:30 jayme@deploy1002: helmfile [eqiad] START helmfile.d/admin 'apply'.
* 17:57 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 14:30 jayme@deploy1002: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
* 17:56 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 14:29 jayme@deploy1002: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
* 17:52 sukhe: testing ATS 9.1.3-1wm1 on cp3064: [[phab:T309651|T309651]]
* 14:26 jayme@deploy1002: helmfile [codfw] DONE helmfile.d/admin 'apply'.
* 17:49 pt1979@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host netmon2002.mgmt.codfw.wmnet with reboot policy FORCED
* 14:22 jayme@deploy1002: helmfile [codfw] START helmfile.d/admin 'apply'.
* 17:46 sukhe: testing ATS 9.1.3-1wm1 on cp3064: [[phab:T309651|T309651]]
* 14:21 jayme@deploy1002: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
* 17:41 pt1979@cumin2002: START - Cookbook sre.hosts.provision for host netmon2002.mgmt.codfw.wmnet with reboot policy FORCED
* 14:20 jayme@deploy1002: helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
* 17:40 pt1979@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:20 moritzm: installing postgres security updates on codfw maps master (and replicas)
* 17:38 sukhe: testing ATS 9.1.3-1wm1 on cp1090: [[phab:T309651|T309651]]
* 14:17 jayme@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
* 17:36 pt1979@cumin2002: START - Cookbook sre.dns.netbox
* 14:15 jayme@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
* 17:35 pt1979@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host netmon2002
* 14:14 moritzm: installing python-babel security updates
* 17:34 pt1979@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host netmon2002
* 14:07 moritzm: installing cups security updates on stretch hosts
* 17:33 sukhe: testing ATS 9.1.3-1wm1 on cp3065: [[phab:T309651|T309651]]
* 13:58 moritzm: installing postgres security updates on netboxdb hosts
* 17:28 sukhe: testing ATS 9.1.3-1wm1 on cp1089: [[phab:T309651|T309651]]
* 13:55 moritzm: installing postgres security updates on puppetdb2002
* 17:19 bking@cumin1001: conftool action : set/weight=10:pooled=no; selector: service=elasticsearch-omega-ssl,name=elastic1100.eqiad.wmnet
* 13:39 moritzm: installing tar security updates on stretch
* 17:18 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: service=elasticsearch-omega-ssl,name=elastic1100.eqiad.wmnet
* 13:31 jayme@deploy1002: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
* 17:15 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: service=search-omega-https,name=elastic1100.eqiad.wmnet
* 13:29 jayme@deploy1002: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
* 16:35 mvernon@cumin2002: START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-eqiad: upgrade to 3.11.13 [[phab:T309896|T309896]] - mvernon@cumin2002
* 13:29 jayme@deploy1002: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
* 16:30 mvernon@cumin2002: END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-codfw: upgrade to 3.11.13 [[phab:T309896|T309896]] - mvernon@cumin2002
* 13:29 jayme@deploy1002: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
* 16:29 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic[1100-1102].eqiad.wmnet with reason: [[phab:T309810|T309810]]
* 13:29 jayme@deploy1002: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 16:29 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic[1100-1102].eqiad.wmnet with reason: [[phab:T309810|T309810]]
* 13:28 jayme@deploy1002: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 16:26 inflatador: bking@elastic1054 attempting to ban elastic1100-1102 from cluster due to firewall issues
* 13:28 jayme@deploy1002: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 16:13 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: service=search-omega-https,name=elastic1100.eqiad.wmnet
* 13:28 jayme@deploy1002: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 16:12 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=elastic1100
* 13:23 jayme@deploy1002: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 15:15 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:23 jayme@deploy1002: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 15:09 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
* 12:59 ladsgroup@deploy1002: Synchronized php-1.38.0-wmf.12/includes/media/DjVuHandler.php: Backport: [[gerrit:745379{{!}}media: Invalidate all file-djvu WAN caches (T296001)]] (duration: 01m 05s)
* 14:58 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P32364 and previous config saved to /var/cache/conftool/dbconfig/20220811-145823-ladsgroup.json
* 12:57 ladsgroup@deploy1002: Synchronized php-1.38.0-wmf.12/extensions/FlaggedRevs/maintenance/pruneRevData.php: Backport: [[gerrit:745378{{!}}Change logic of pruneChange to allow deleting rows more flexibly (T296380)]] (duration: 01m 05s)
* 14:55 inflatador: bking@cumin1001 running puppet agent across eqiad elastic hosts
* 12:57 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 14:48 ryankemper@cumin1001: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 12:56 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 14:43 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P32362 and previous config saved to /var/cache/conftool/dbconfig/20220811-144318-ladsgroup.json
* 12:55 ladsgroup@deploy1002: Synchronized php-1.38.0-wmf.9/extensions/FlaggedRevs/maintenance/pruneRevData.php: Backport: [[gerrit:745239{{!}}Major fixes to maintenance/pruneRevData.php (T290769)]] (duration: 01m 05s)
* 14:28 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P32361 and previous config saved to /var/cache/conftool/dbconfig/20220811-142813-ladsgroup.json
* 12:50 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 14:28 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol1003.wikimedia.org
* 12:49 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 14:28 andrew@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 12:47 kharlan@deploy1002: Synchronized php-1.38.0-wmf.12/extensions/GrowthExperiments/includes/NewcomerTasks/TaskSuggester/CacheDecorator.php: Backport: [[gerrit:745371{{!}}CacheDecorator: Bump cache version (T297248)]] (duration: 01m 05s)
* 14:24 andrew@cumin1001: START - Cookbook sre.dns.netbox
* 12:43 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 14:19 andrew@cumin1001: START - Cookbook sre.hosts.decommission for hosts cloudcontrol1003.wikimedia.org
* 12:42 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 14:19 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 12:23 ladsgroup@deploy1002: Synchronized wmf-config/config/zhwiki.yaml: Config: [[gerrit:744066{{!}}Enable VE on zh.wiki, but only for logged-in users (T296269)]] (duration: 01m 05s)
* 14:18 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol1004.wikimedia.org
* 12:20 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 14:18 andrew@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 12:19 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 14:18 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 12:16 ladsgroup@deploy1002: Synchronized dblists/visualeditor-nondefault.dblist: Config: [[gerrit:744066{{!}}Enable VE on zh.wiki, but only for logged-in users (T296269)]] (duration: 01m 05s)
* 14:18 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 12:14 ladsgroup@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:744066{{!}}Enable VE on zh.wiki, but only for logged-in users (T296269)]] (duration: 01m 06s)
* 14:17 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 12:13 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 14:17 ladsgroup@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:822375{{!}}Stop writing to the old templatelinks fields in s2 (T312865)]] (duration: 03m 25s)
* 12:12 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 14:16 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
* 12:00 topranks: Changing export policy applied on ulsfo CRs for local confed to not rewrite next-hop for routes learnt from other WMF POPs ([[phab:T295672|T295672]])
* 14:16 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
* 11:44 topranks: Re-enabling multihop BGP session from cr1-eqiad to cr2-eqord ([[phab:T295672|T295672]])
* 14:16 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
* 11:38 moritzm: added ganeti2027 to ganeti codfw cluster [[phab:T294139|T294139]]
* 14:15 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
* 11:30 hnowlan@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:13 andrew@cumin1001: START - Cookbook sre.dns.netbox
* 11:23 hnowlan@cumin1001: START - Cookbook sre.dns.netbox
* 14:13 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P32360 and previous config saved to /var/cache/conftool/dbconfig/20220811-141309-ladsgroup.json
* 11:20 btullis@cumin1001: END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons.
* 14:12 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 11:20 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 14:11 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 11:19 topranks: Changing export policy applied on eqiad CRs for local confed to not rewrite next-hop for routes learnt from other WMF POPs ([[phab:T295672|T295672]])
* 14:11 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 11:19 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 14:11 awight: EU backport window complete
* 11:13 elukey: reboot ores2001 (lost connectivity, we suspect some weird problem with the NIC, but no traces in the kernel logs)
* 14:10 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 11:10 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 14:10 awight@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/DiscussionTools/includes/CommentFormatter.php: Backport: [[gerrit:822149{{!}}CommentFormatter: Set 'data-mw-comment' even when reply tool disabled (T314707)]] (duration: 03m 31s)
* 11:09 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 14:09 andrew@cumin1001: START - Cookbook sre.hosts.decommission for hosts cloudcontrol1004.wikimedia.org
* 11:07 hnowlan@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:05 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 11:07 lucaswerkmeister-wmde@deploy1002: Synchronized php-1.38.0-wmf.12/extensions/WikibaseLexeme/resources/widgets/: Backport: [[gerrit:745372{{!}}Fix LexemeHeader and GlossWidget mounting (T297328)]] (duration: 01m 06s)
* 14:04 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 11:07 btullis@cumin1001: START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons.
* 14:04 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 11:06 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet
* 14:03 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 11:01 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet
* 13:52 mvernon@cumin2002: START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: upgrade to 3.11.13 [[phab:T309896|T309896]] - mvernon@cumin2002
* 11:00 hnowlan@cumin1001: START - Cookbook sre.dns.netbox
* 13:50 awight@deploy1002: Synchronized wmf-config: Config: [[gerrit:820666{{!}}Revert "Revert "testwiki: Add mediawiki.web_ui.interactions stream""]] (duration: 03m 10s)
* 10:58 btullis@cumin1001: END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons.
* 13:48 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 10:58 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet
* 13:47 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 10:52 oblivian@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 13:47 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 10:48 oblivian@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 13:46 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 10:47 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet
* 13:36 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1060.eqiad.wmnet with OS bullseye
* 10:45 btullis@cumin1001: START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons.
* 13:36 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 10:02 vgutierrez: pool durum2002
* 13:36 awight@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:822130{{!}}trwikiquote: Install WikiLove extension (T314895)]] (duration: 03m 30s)
* 10:00 vgutierrez: depool durum2002
* 13:35 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 09:24 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet
* 13:35 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 09:20 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet
* 13:34 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 09:10 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2020.codfw.wmnet with OS buster
* 13:33 filippo@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host logstash2003.codfw.wmnet
* 13:25 awight@deploy1002: Synchronized static/images: Config: [[gerrit:821330{{!}}Revert "trwiki: Change old and new vector logos for 500k articles"]] (part 3) (duration: 03m 09s)
* 13:21 awight@deploy1002: Synchronized logos/: Config: [[gerrit:821330{{!}}Revert "trwiki: Change old and new vector logos for 500k articles"]] (part 2) (duration: 03m 09s)
* 13:19 topranks: merging CR821781 to expose additional network info in puppet facts
* 13:18 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:18 awight@deploy1002: Synchronized wmf-config/: Config: [[gerrit:821330{{!}}Revert "trwiki: Change old and new vector logos for 500k articles"]] (part 1) (duration: 03m 13s)
* 13:17 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:17 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:16 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:14 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1060.eqiad.wmnet with reason: host reimage
* 13:11 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:11 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1060.eqiad.wmnet with reason: host reimage
* 13:10 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:10 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:09 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:08 awight@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:822073{{!}}Enable editor line numbering on all namespaces, for twwiki (T302852)]] (duration: 03m 42s)
* 12:56 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1060.eqiad.wmnet with OS bullseye
* 12:55 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 12:49 aikochou@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
* 12:46 aikochou@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
* 12:26 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=restbase2018.codfw.wmnet
* 12:26 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=restbase202[367].codfw.wmnet
* 12:17 elukey@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
* 12:17 elukey@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
* 12:17 elukey@deploy1002: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
* 12:16 elukey@deploy1002: helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
* 12:13 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host logstash2003.codfw.wmnet
* 12:11 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
* 12:10 elukey@deploy1002: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
* 12:09 elukey@deploy1002: helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
* 11:20 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
* 11:20 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
* 09:58 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
* 09:57 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
* 09:56 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
* 09:56 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
* 09:49 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance
* 09:49 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance
* 09:32 godog: arm keyholder on netmon2001
* 09:09 jbond: update gnutls28 on bullseye systems
* 09:00 jbond: update unzip
* 08:21 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
* 08:13 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
* 08:12 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
* 08:06 ayounsi@cumin1001: END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox interface ID cr3-ulsfo:xe-0/1/1
* 08:06 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr3-ulsfo:xe-0/1/1
* 07:58 ayounsi@cumin1001: END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox interface ID cr3-ulsfo:xe-0/1/1
* 07:57 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr3-ulsfo:xe-0/1/1
* 07:55 oblivian@cumin1001: conftool action : set/pooled=false; selector: dnsdisc=k8s-ingress-wikikube-rw,name=codfw
* 07:51 vgutierrez: rolling restart of pybal in eqsin and ulsfo
* 07:24 oblivian@cumin1001: conftool action : set/pooled=false; selector: dnsdisc=restbase-async,name=eqiad
* 07:24 oblivian@cumin1001: conftool action : set/pooled=true; selector: dnsdisc=shellbox-timeline
* 07:23 oblivian@cumin1001: conftool action : set/pooled=true; selector: dnsdisc=inference
* 07:19 _joe_: pooling all services in codfw
* 07:03 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1147 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32357 and previous config saved to /var/cache/conftool/dbconfig/20220811-070312-ladsgroup.json
* 07:03 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance
* 07:02 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance
* 07:02 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32356 and previous config saved to /var/cache/conftool/dbconfig/20220811-070252-ladsgroup.json
* 06:47 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P32355 and previous config saved to /var/cache/conftool/dbconfig/20220811-064746-ladsgroup.json
* 06:32 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P32354 and previous config saved to /var/cache/conftool/dbconfig/20220811-063240-ladsgroup.json
* 06:28 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
* 06:28 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
* 06:17 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32353 and previous config saved to /var/cache/conftool/dbconfig/20220811-061734-ladsgroup.json
* 06:17 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maint
* 06:17 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maint
* 06:06 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool db1162 ([[phab:T314368|T314368]] [[phab:T298555|T298555]] [[phab:T312863|T312863]] [[phab:T310011|T310011]] [[phab:T309311|T309311]] [[phab:T60674|T60674]] [[phab:T298560|T298560]] [[phab:T303603|T303603]] [[phab:T310485|T310485]])', diff saved to https://phabricator.wikimedia.org/P32352 and previous config saved to /var/cache/conftool/dbconfig/20220811-060625-ladsgroup.json
* 06:01 ladsgroup@cumin1001: dbctl commit (dc=all): 'Promote db1122 to s2 primary and set section read-write [[phab:T314368|T314368]]', diff saved to https://phabricator.wikimedia.org/P32351 and previous config saved to /var/cache/conftool/dbconfig/20220811-060113-ladsgroup.json
* 06:00 ladsgroup@cumin1001: dbctl commit (dc=all): 'Set s2 eqiad as read-only for maintenance - [[phab:T314368|T314368]]', diff saved to https://phabricator.wikimedia.org/P32350 and previous config saved to /var/cache/conftool/dbconfig/20220811-060042-ladsgroup.json
* 06:00 Amir1: Starting s2 eqiad failover from db1162 to db1122 - [[phab:T314368|T314368]]
* 05:19 ladsgroup@cumin1001: dbctl commit (dc=all): 'Set db1122 with weight 0 [[phab:T314368|T314368]]', diff saved to https://phabricator.wikimedia.org/P32349 and previous config saved to /var/cache/conftool/dbconfig/20220811-051913-ladsgroup.json
* 05:19 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s2 [[phab:T314368|T314368]]
* 05:18 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s2 [[phab:T314368|T314368]]
* m: chown -R librenms /srv/librenms/rrd/ on netmon1003 [[phab:T314972|T314972]]
* 03:51 cwhite: chown librenms /srv/librenms/rrd/* on netmon1003 [[phab:T314972|T314972]]
* 02:55 ejegg: civicrm upgraded from {{Gerrit|1f91ac2d}} to {{Gerrit|92467234}}
* 02:46 ejegg: updated process-control yaml files with @wmff alias
* 02:08 ejegg: civicrm rolled back from {{Gerrit|92467234}} to {{Gerrit|1f91ac2d}}
* 02:05 ejegg: civicrm upgraded from {{Gerrit|1f91ac2d}} to {{Gerrit|92467234}}
* 01:40 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 01:39 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 01:39 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 01:38 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 01:38 tstarling@deploy1002: Synchronized wmf-config/logging.php: (no justification provided) (duration: 03m 25s)
* 01:19 tstarling@puppetmaster1001: conftool action : set/pooled=true; selector: dnsdisc=(appservers{{!}}api)-ro,name=codfw
* 01:19 tstarling@puppetmaster1001: conftool action : set/pooled=true; selector: dnsdisc=sessionstore,name=codfw
* 00:58 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=varnish-fe
* 00:58 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=ats-be
* 00:58 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=ats-tls
* 00:57 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on cp2042.codfw.wmnet with reason: host down; depooled and will debug tomorrow
* 00:57 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on cp2042.codfw.wmnet with reason: host down; depooled and will debug tomorrow


== 2021-12-09 ==
== 2022-08-10 ==
* 08:19 jmm@cumin2002: START - Cookbook sre.hosts.reimage for host ganeti2020.codfw.wmnet with OS buster
* 21:25 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1016.eqiad.wmnet
* 08:13 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2020.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage
* 21:23 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1014.eqiad.wmnet
* 08:12 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ganeti2020.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage
* 21:10 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 16 hosts with reason: [[phab:T309810|T309810]]
* 03:37 cwhite: bounce superset on an-tool1010 and 1005 to pick up statsd changes [[phab:T247963|T247963]]
* 21:10 bking@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on 16 hosts with reason: [[phab:T309810|T309810]]
* 03:34 cwhite: bounce navtiming on webperf1001 to pick up statsd changes [[phab:T297265|T297265]]
* 21:09 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on elastic[1101-1102].eqiad.wmnet with reason: [[phab:T309810|T309810]]
* 03:32 cwhite@deploy1002: Synchronized wmf-config/ProductionServices.php: fail over statsd to graphite2003 [[phab:T297265|T297265]] (duration: 01m 05s)
* 21:09 bking@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on elastic[1101-1102].eqiad.wmnet with reason: [[phab:T309810|T309810]]
* 03:28 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 21:00 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 03:27 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 21:00 cjming: end of UTC late backport window
* 02:54 cwhite: failover statsd ingest host to graphite2003 [[phab:T297265|T297265]]
* 20:59 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 00:26 rzl: graphite1004.mgmt: /admin1-> racadm serveraction powercycle ([[phab:T297265|T297265]])
* 20:59 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 00:17 legoktm: deployed updated patches for [[phab:T297322|T297322]]
* 20:59 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820533{{!}}Remove unused $wgEnableMWSuggest]] (duration: 03m 04s)
* 00:11 rzl: rzl@graphite1004:~$ sudo shutdown -r now [[phab:T297265|T297265]]
* 20:58 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:56 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820568{{!}}Enable new topic tool on dewiki (T313699)]] (duration: 03m 01s)
* 20:34 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:822093{{!}}testwiki: set $wgCdnMatchParameterOrder to false (T314868)]] (duration: 03m 20s)
* 20:31 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:30 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:30 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:30 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:19 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:18 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:18 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:17 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:12 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:09 bking@cumin1001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
* 20:08 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:08 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:08 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820646{{!}}Start writing to cuc_actor everywhere except s4 and s8 (T233004)]] (duration: 03m 15s)
* 20:07 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 19:51 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2053-2054].codfw.wmnet
* 19:51 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc[2053-2054].codfw.wmnet
* 19:35 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2019-2020].codfw.wmnet
* 19:35 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for parse[2019-2020].codfw.wmnet
* 19:35 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2016-2018].codfw.wmnet
* 19:35 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for parse[2016-2018].codfw.wmnet
* 19:34 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2036.codfw.wmnet
* 19:34 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc2036.codfw.wmnet
* 19:28 sukhe: testing ATS 9.1.3-1wm1 on cp4026: [[phab:T309651|T309651]]
* 19:09 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1087.eqiad.wmnet with OS bullseye
* 19:06 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1086.eqiad.wmnet with OS bullseye
* 18:55 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1087.eqiad.wmnet with reason: host reimage
* 18:51 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1086.eqiad.wmnet with reason: host reimage
* 18:50 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1087.eqiad.wmnet with reason: host reimage
* 18:49 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1086.eqiad.wmnet with reason: host reimage
* 18:47 bking@cumin1001: START - Cookbook sre.wdqs.data-transfer
* 18:38 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1087.eqiad.wmnet with OS bullseye
* 18:36 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1086.eqiad.wmnet with OS bullseye
* 18:22 urandom: truncating Cassandra hints (eqiad datacenter)  -- [[phab:T314941|T314941]]
* 18:13 urandom: truncating codfw Cassandra hints (eqiad datacenter)  -- [[phab:T314941|T314941]]
* 18:07 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main2005.codfw.wmnet
* 18:07 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for kafka-main2005.codfw.wmnet
* 18:05 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repool D8 DBs after PDU maint ([[phab:T310146|T310146]])', diff saved to https://phabricator.wikimedia.org/P32346 and previous config saved to /var/cache/conftool/dbconfig/20220810-180529-ladsgroup.json
* 17:42 otto@deploy1002: Finished deploy [analytics/refinery@6e47e0e]: Add missing changes to the deletion script - [[phab:T270433|T270433]] -  [analytics/refinery@6e47e0e] (duration: 05m 28s)
* 17:39 fnegri@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts labweb1002.wikimedia.org
* 17:39 fnegri@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 17:36 otto@deploy1002: Started deploy [analytics/refinery@6e47e0e]: Add missing changes to the deletion script - [[phab:T270433|T270433]] -  [analytics/refinery@6e47e0e]
* 17:35 fnegri@cumin1001: START - Cookbook sre.dns.netbox
* 17:34 otto@deploy1002: Finished deploy [analytics/refinery@6e47e0e] (hadoop-test): Add missing changes to the deletion script - [[phab:T270433|T270433]] - TEST [analytics/refinery@6e47e0e] (duration: 04m 19s)
* 17:30 fnegri@cumin1001: START - Cookbook sre.hosts.decommission for hosts labweb1002.wikimedia.org
* 17:30 otto@deploy1002: Started deploy [analytics/refinery@6e47e0e] (hadoop-test): Add missing changes to the deletion script - [[phab:T270433|T270433]] - TEST [analytics/refinery@6e47e0e]
* 17:09 dzahn@cumin2002: START - Cookbook sre.dns.netbox
* 17:08 otto@deploy1002: Started deploy [analytics/refinery@d4dd7e4] (hadoop-test): Add safety limits to refinery-drop-older-than - [[phab:T270433|T270433]] - TEST [analytics/refinery@d4dd7e4]
* 17:06 sukhe: testing ATS 9.1.3-1wm1 on cp4032: [[phab:T309651|T309651]]
* 17:06 urandom: flushing RESTBase Cassandra tables -row B- to (temporarily) free instance-data space -- [[phab:T314941|T314941]]
* 17:05 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on krb2001.codfw.wmnet with reason: btullis codfw maintenance
* 17:05 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on krb2001.codfw.wmnet with reason: btullis codfw maintenance
* 17:04 dzahn@cumin2002: START - Cookbook sre.hosts.decommission for hosts gerrit2001.wikimedia.org
* 17:02 sukhe: testing ATS 9.1.3-1wm1 on cp6008: [[phab:T309651|T309651]]
* 16:56 sukhe: testing ATS 9.1.3-1wm1 on cp6016: [[phab:T309651|T309651]]
* 16:55 fnegri@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts labweb1001.wikimedia.org
* 16:55 fnegri@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:32 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts gerrit2001.wikimedia.org
* 16:32 dzahn@cumin2002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
* 16:32 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes[2013-2014].codfw.wmnet
* 16:31 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for kubernetes[2013-2014].codfw.wmnet
* 16:31 jelto: kubectl uncordon kubernetes2014.codfw.wmnet
* 16:31 fnegri@cumin1001: START - Cookbook sre.dns.netbox
* 16:30 jelto: kubectl uncordon kubernetes2013.codfw.wmnet
* 16:29 urandom: restarting Cassandra (RESTBase) -row A- to apply r822110 -- [[phab:T314941|T314941]]
* 16:27 dzahn@cumin2002: START - Cookbook sre.dns.netbox
* 16:25 fnegri@cumin1001: START - Cookbook sre.hosts.decommission for hosts labweb1001.wikimedia.org
* 16:23 mutante: shutting down gerrit2001
* 16:23 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2034-2035].codfw.wmnet
* 16:23 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc[2034-2035].codfw.wmnet
* 16:22 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2016-2018].codfw.wmnet
* 16:22 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for parse[2016-2018].codfw.wmnet
* 16:16 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=maps2010.codfw.wmnet
* 16:16 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=sessionstore2003.codfw.wmnet
* 16:13 sukhe: reprepro -C component/trafficserver9 include buster-wikimedia trafficserver_9.1.3-1wm1_amd64.changes: [[phab:T309651|T309651]]
* 16:13 dzahn@cumin2002: START - Cookbook sre.hosts.decommission for hosts gerrit2001.wikimedia.org
* 16:11 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be[2039,2050,2056,2059].codfw.wmnet,thanos-be2004.codfw.wmnet with reason: PDU work
* 16:10 mvernon@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be[2039,2050,2056,2059].codfw.wmnet,thanos-be2004.codfw.wmnet with reason: PDU work
* 16:09 urandom: flushing tables in row D (RESTBase Cassandra cluster)  -- [[phab:T314941|T314941]]
* 15:54 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for gitlab-runner2004.codfw.wmnet
* 15:54 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for gitlab-runner2004.codfw.wmnet
* 15:53 sukhe: poweroff cp2041, 42 for PDU upgrade: rack D7
* 15:51 urandom: flushing tables in row B (RESTBase Cassandra cluster)  -- [[phab:T314941|T314941]]
* 15:49 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2004.codfw.wmnet with reason: PDU maintenance
* 15:49 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2004.codfw.wmnet with reason: PDU maintenance
* 15:46 urandom: flushing tables in row A (RESTBase Cassandra cluster)  -- [[phab:T314941|T314941]]
* 15:46 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2012.codfw.wmnet with reason: btullis codfw maintenance
* 15:46 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp[2041-2042].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
* 15:46 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2012.codfw.wmnet with reason: btullis codfw maintenance
* 15:46 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 3:00:00 on cp[2041-2042].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
* 15:46 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2011.codfw.wmnet with reason: btullis codfw maintenance
* 15:46 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2011.codfw.wmnet with reason: btullis codfw maintenance
* 15:45 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2010.codfw.wmnet with reason: btullis codfw maintenance
* 15:45 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2010.codfw.wmnet with reason: btullis codfw maintenance
* 15:45 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2009.codfw.wmnet with reason: btullis codfw maintenance
* 15:45 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2009.codfw.wmnet with reason: btullis codfw maintenance
* 15:37 urandom: (ephemerally) increasing hinted hand-off delivery rate limit to 16KB, RESTBase eqiad nodes  -- [[phab:T314941|T314941]]
* 15:34 jbond: remove puppetmaster[12]002 from production
* 15:30 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main2004.codfw.wmnet
* 15:30 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for kafka-main2004.codfw.wmnet
* 15:20 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2051-2052].codfw.wmnet
* 15:20 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc[2051-2052].codfw.wmnet
* 15:17 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc-gp2003.codfw.wmnet
* 15:17 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc-gp2003.codfw.wmnet
* 15:16 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2033.codfw.wmnet
* 15:16 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc2033.codfw.wmnet
* 15:14 _joe_: power off krb2002
* 15:14 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on krb2002.codfw.wmnet with reason: PDU maintenance
* 15:13 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on krb2002.codfw.wmnet with reason: PDU maintenance
* 15:13 _joe_: shutting down rdb2010,puppetmaster2002 for d5 maintenance
* 15:02 jelto: power off mc2035
* 15:01 jelto: power off mc2034
* 15:01 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mc2035.codfw.wmnet with reason: PDU swap
* 15:01 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on mc2035.codfw.wmnet with reason: PDU swap
* 15:01 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mc2034.codfw.wmnet with reason: PDU swap
* 15:01 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on mc2034.codfw.wmnet with reason: PDU swap
* 14:43 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: PDU Maint ([[phab:T310146|T310146]])
* 14:43 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: PDU Maint ([[phab:T310146|T310146]])
* 14:38 urandom: disabling reserved space on eqiad nodes (RESTBase), /dev/md2 (aka /srv/cassandra/instance-data) -- [[phab:T314941|T314941]]
* 14:28 jelto: power off kafka-main2004 gracefully
* 14:28 hnowlan: shutting down sessionstore2003
* 14:27 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=sessionstore2003.codfw.wmnet
* 14:27 sukhe: power off cp2039, cp2040 for PDU upgrade: rack D
* 14:27 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2003.codfw.wmnet with reason: PDU maintenance
* 14:27 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2003.codfw.wmnet with reason: PDU maintenance
* 14:25 jelto: power off mc-gp2003
* 14:25 jelto: power off mc2033
* 14:24 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:30:00 on kafka-main2004.codfw.wmnet with reason: PDU swap
* 14:23 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:30:00 on kafka-main2004.codfw.wmnet with reason: PDU swap
* 14:23 sukhe: depool codfw for PDU upgrade: rack D
* 14:23 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:30:00 on mc-gp2003.codfw.wmnet with reason: PDU swap
* 14:23 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:30:00 on mc-gp2003.codfw.wmnet with reason: PDU swap
* 14:23 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:30:00 on mc2033.codfw.wmnet with reason: PDU swap
* 14:23 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:30:00 on mc2033.codfw.wmnet with reason: PDU swap
* 14:15 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp20[39{{!}}40]\.codfw\.wmnet,service=ats-tls
* 14:13 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp[2039-2040].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
* 14:13 urandom: flushing Cassandra tables, restbase1030
* 14:13 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 3:00:00 on cp[2039-2040].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
* 14:13 urandom: flushing Cassandra tables, restbase1019
* 14:12 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on dns2002.wikimedia.org with reason: shutdown for PDU upgrade: rack D4
* 14:12 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 3:00:00 on dns2002.wikimedia.org with reason: shutdown for PDU upgrade: rack D4
* 14:11 urandom: flushing Cassandra tables, restbase1017 1018 1021 1024 1025 1026 1028 1029
* 14:05 urandom: flushing tables, restbase1016
* 13:52 hnowlan: powered up restbase2018
* 13:32 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on netmon1003.wikimedia.org with reason: pdu
* 13:32 filippo@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on netmon1003.wikimedia.org with reason: pdu
* 13:32 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on logstash2003.codfw.wmnet with reason: pdu
* 13:31 filippo@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on logstash2003.codfw.wmnet with reason: pdu
* 13:31 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on logstash2029.codfw.wmnet with reason: pdu
* 13:31 root@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on logstash2029.codfw.wmnet with reason: pdu
* 13:30 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: [[phab:T310146|T310146]]
* 13:30 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: [[phab:T310146|T310146]]
* 13:17 elukey: powering on restbase2027
* 13:12 elukey: powering on restbase2026
* 13:12 _joe_: powering on restbase2023
* 13:01 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32343 and previous config saved to /var/cache/conftool/dbconfig/20220810-130108-ladsgroup.json
* 13:01 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
* 13:00 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
* 12:37 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic[2072,2084-2085].codfw.wmnet with reason: [[phab:T310146|T310146]]
* 12:37 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic[2072,2084-2085].codfw.wmnet with reason: [[phab:T310146|T310146]]
* 12:27 jbond: remove confd from servers that shouldn't have it
* 12:05 ladsgroup@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/Echo/maintenance/removeOrphanedEvents.php: Backport: [[gerrit:821735{{!}}Run clean ups with removeOrphanedEvents in major batches (T310428)]] (duration: 03m 32s)
* 11:47 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 11:46 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 11:46 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 11:43 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 11:15 jbond@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS buster
* 10:54 jbond@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
* 10:51 jbond@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
* 10:37 jbond@cumin1001: START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster
* 10:31 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2181-2182].codfw.wmnet with reason: D6 PDU maint ([[phab:T310146|T310146]])
* 10:31 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2181-2182].codfw.wmnet with reason: D6 PDU maint ([[phab:T310146|T310146]])
* 10:26 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on maps2010.codfw.wmnet with reason: PDU maintenance
* 10:26 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on maps2010.codfw.wmnet with reason: PDU maintenance
* 10:26 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=maps2010.codfw.wmnet
* 10:25 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=maps2010.codfw.wmnet
* 10:25 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=maps2008.codfw.wmnet
* 10:24 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on restbase2018.codfw.wmnet with reason: PDU maintenance
* 10:24 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase2018.codfw.wmnet
* 10:24 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on restbase2018.codfw.wmnet with reason: PDU maintenance
* 10:23 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ml-serve2008.codfw.wmnet with reason: PDU maintenance
* 10:23 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on ml-serve2008.codfw.wmnet with reason: PDU maintenance
* 10:20 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ores2009.codfw.wmnet with reason: PDU maintenance
* 10:20 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on ores2009.codfw.wmnet with reason: PDU maintenance
* 10:19 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ores2008.codfw.wmnet with reason: PDU maintenance
* 10:19 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on ores2008.codfw.wmnet with reason: PDU maintenance
* 10:03 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase202[367].codfw.wmnet
* 10:02 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on restbase[2023,2026-2027].codfw.wmnet with reason: PDU maintenance
* 10:02 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on restbase[2023,2026-2027].codfw.wmnet with reason: PDU maintenance
* 09:53 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: D8 PDU Maint ([[phab:T310146|T310146]])
* 09:52 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: D8 PDU Maint ([[phab:T310146|T310146]])
* 09:51 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool D8 DBs for PDU maint ([[phab:T310146|T310146]])', diff saved to https://phabricator.wikimedia.org/P32341 and previous config saved to /var/cache/conftool/dbconfig/20220810-095059-ladsgroup.json
* 09:36 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2101,2130,2140].codfw.wmnet,dbproxy2004.codfw.wmnet with reason: D6 PDU maint ([[phab:T310146|T310146]])
* 09:36 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2101,2130,2140].codfw.wmnet,dbproxy2004.codfw.wmnet with reason: D6 PDU maint ([[phab:T310146|T310146]])
* 09:34 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool D6 dbs ([[phab:T310146|T310146]])', diff saved to https://phabricator.wikimedia.org/P32340 and previous config saved to /var/cache/conftool/dbconfig/20220810-093433-ladsgroup.json
* 09:31 jelto: depool services in codfw for upcoming PDU replacement - [[phab:T309956|T309956]]
* 09:30 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 8 hosts with reason: Maintenance
* 09:29 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 8 hosts with reason: Maintenance
* 09:29 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
* 09:29 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
* 09:28 jynus: shutdown backup2007 before pdu upgrade [[phab:T310146|T310146]]
* 09:16 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 09:15 ladsgroup@deploy1002: Synchronized php-1.39.0-wmf.23/maintenance/namespaceDupes.php: Backport: [[gerrit:821734{{!}}maintenance: Add support for links migration to namespaceDupes.php (T314711)]] (duration: 03m 18s)
* 09:15 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2093,2120,2129,2172].codfw.wmnet with reason: D5 PDU maint ([[phab:T310146|T310146]])
* 09:15 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2093,2120,2129,2172].codfw.wmnet with reason: D5 PDU maint ([[phab:T310146|T310146]])
* 09:14 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 09:14 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 09:13 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 09:10 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool D5 dbs ([[phab:T310146|T310146]])', diff saved to https://phabricator.wikimedia.org/P32339 and previous config saved to /var/cache/conftool/dbconfig/20220810-091038-ladsgroup.json
* 08:49 jynus: shutdown dbprov2003 before pdu upgrade [[phab:T310146|T310146]]
* 08:49 ayounsi@cumin1001: END (FAIL) - Cookbook sre.network.debug (exit_code=99) for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:48 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:48 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be2028.codfw.wmnet
* 08:48 mvernon@cumin1001: START - Cookbook sre.hosts.remove-downtime for ms-be2028.codfw.wmnet
* 08:42 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P32337 and previous config saved to /var/cache/conftool/dbconfig/20220810-084222-ladsgroup.json
* 08:37 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 08:37 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
* 08:36 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
* 08:36 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 08:36 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 08:35 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 08:35 ladsgroup@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:822037{{!}}Stop writing to the old templatelinks fields in s5 (T312865)]] (duration: 03m 29s)
* 08:32 jelto: power off gitlab-runner2004
* 08:31 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:30:00 on gitlab-runner2004.codfw.wmnet with reason: PDU swap
* 08:31 root@cumin1001: START - Cookbook sre.hosts.downtime for 8:30:00 on gitlab-runner2004.codfw.wmnet with reason: PDU swap
* 08:29 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be2028.codfw.wmnet with reason: Trying to fix full /
* 08:28 mvernon@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be2028.codfw.wmnet with reason: Trying to fix full /
* 08:28 ayounsi@cumin1001: END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:27 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:27 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P32336 and previous config saved to /var/cache/conftool/dbconfig/20220810-082718-ladsgroup.json
* 08:25 ayounsi@cumin1001: END (FAIL) - Cookbook sre.network.debug (exit_code=99) for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:25 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:24 ayounsi@cumin1001: END (FAIL) - Cookbook sre.network.debug (exit_code=99) for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:24 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:23 kart_: Run: mwscript namespaceDupes.php arywiki --fix ([[phab:T291737|T291737]])
* 08:13 jynus: restart replication on db1117:m1 [[phab:T309074|T309074]]
* 08:12 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P32335 and previous config saved to /var/cache/conftool/dbconfig/20220810-081213-ladsgroup.json
* 08:09 kartik@deploy1002: Finished scap: Backport: [[gerrit:821732{{!}}arywiki: change namespace translations, add unchanged namespaces and add old translations as aliases (T291737)]] (duration: 10m 37s)
* 07:59 kartik@deploy1002: Started scap: Backport: [[gerrit:821732{{!}}arywiki: change namespace translations, add unchanged namespaces and add old translations as aliases (T291737)]]
* 07:57 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P32334 and previous config saved to /var/cache/conftool/dbconfig/20220810-075708-ladsgroup.json
* 07:56 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P32333 and previous config saved to /var/cache/conftool/dbconfig/20220810-075636-ladsgroup.json
* 07:55 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 07:52 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 07:52 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 07:52 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance
* 07:52 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance
* 07:51 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 07:51 dcaro@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 07:46 dcaro@cumin1001: START - Cookbook sre.dns.netbox
* 07:39 dcaro@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 07:34 dcaro@cumin1001: START - Cookbook sre.dns.netbox
* 07:33 godog: depool thanos-fe2001 for debugging
* 07:11 kartik@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:821170{{!}}Enable SectionTranslation on testwiki with new MT support from Google (T313296)]] (duration: 05m 44s)
* 07:11 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 07:08 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 07:08 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 07:07 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 05:24 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on kubernetes[2013-2014].codfw.wmnet with reason: PDU maintenance
* 05:24 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on kubernetes[2013-2014].codfw.wmnet with reason: PDU maintenance
* 05:19 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on parse[2016-2020].codfw.wmnet with reason: PDU maintenance
* 05:19 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on parse[2016-2020].codfw.wmnet with reason: PDU maintenance
* 05:12 _joe_: starting to shut down servers in codfw for the PDU maintenance
* 05:09 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on 10 hosts with reason: PDU maintenance
* 05:09 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on 10 hosts with reason: PDU maintenance
* 05:09 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on mc-gp2003.codfw.wmnet with reason: PDU maintenance
* 05:09 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on mc-gp2003.codfw.wmnet with reason: PDU maintenance
* 05:06 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on mc2033.codfw.wmnet with reason: PDU maintenance
* 05:06 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on mc2033.codfw.wmnet with reason: PDU maintenance
* 05:05 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on 7 hosts with reason: PDU maintenance
* 05:05 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on 7 hosts with reason: PDU maintenance
* 02:34 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 02:33 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 02:33 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 02:32 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 02:07 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 02:06 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 02:06 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 02:05 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply


== 2021-12-08 ==
== 2022-08-09 ==
* 22:18 legoktm@deploy1002: Synchronized php-1.38.0-wmf.9/includes/actions/: [[phab:T297322|T297322]] (duration: 01m 05s)
* 23:17 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1011.eqiad.wmnet
* 22:16 legoktm@deploy1002: Synchronized php-1.38.0-wmf.12/includes/actions/: [[phab:T297322|T297322]] (duration: 01m 05s)
* 23:07 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 21:48 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 23:06 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 21:46 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 22:51 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 21:43 dancy@deploy1002: Synchronized php: group1 wikis to 1.38.0-wmf.12  refs [[phab:T293953|T293953]] (duration: 01m 04s)
* 22:51 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 21:41 dancy@deploy1002: rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.12  refs [[phab:T293953|T293953]]
* 22:49 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 21:30 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 22:49 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 21:30 ladsgroup@deploy1002: Synchronized php-1.38.0-wmf.12/extensions/WikidataPageBanner/includes/WikidataPageBanner.php: Backport: [[gerrit:745240{{!}}Make sure 'enable-toc' key is set (T297318)]] (duration: 01m 05s)
* 22:46 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1015.eqiad.wmnet
* 21:29 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 22:31 ryankemper@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 20:53 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 22:31 ryankemper@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 20:51 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 22:28 bking@cumin1001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
* 20:49 taavi@deploy1002: Synchronized php-1.38.0-wmf.12/extensions/3D/src/Hooks.php: Backport: [[gerrit:745238{{!}}Remove use of $wgUseAjax]] (duration: 01m 07s)
* 22:02 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 20:30 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 22:02 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 20:29 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 21:57 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs2006.codfw.wmnet with reason: [[phab:T310146|T310146]]
* 20:28 jhathaway: enable exim on mx2001
* 21:57 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs2006.codfw.wmnet with reason: [[phab:T310146|T310146]]
* 20:27 jgiannelos@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' .
* 21:53 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 20:27 jgiannelos@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' .
* 21:52 bking@deploy1002: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
* 20:26 jgiannelos@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' .
* 21:50 bking@deploy1002: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
* 20:23 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 21:49 bking@deploy1002: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
* 20:22 dancy@deploy1002: Synchronized php: group1 wikis to 1.38.0-wmf.9  refs [[phab:T293953|T293953]] (duration: 01m 04s)
* 21:43 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop: apply
* 20:22 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 21:43 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop: apply
* 20:21 dancy@deploy1002: rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.9  refs [[phab:T293953|T293953]]
* 21:43 bking@deploy1002: helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
* 20:17 dancy@deploy1002: Synchronized php: group1 wikis to 1.38.0-wmf.12  refs [[phab:T293953|T293953]] (duration: 01m 05s)
* 21:43 bking@deploy1002: helmfile [eqiad] START helmfile.d/services/changeprop: apply
* 20:16 dancy@deploy1002: rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.12  refs [[phab:T293953|T293953]]
* 21:43 bking@deploy1002: helmfile [staging] DONE helmfile.d/services/changeprop: apply
* 20:11 jgiannelos@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' .
* 21:43 bking@deploy1002: helmfile [staging] START helmfile.d/services/changeprop: apply
* 19:12 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 21:08 bking@cumin1001: START - Cookbook sre.wdqs.data-transfer
* 19:08 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 21:00 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1014.eqiad.wmnet
* 19:05 majavah: utc evening deploys done
* 20:56 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
* 19:05 taavi@deploy1002: Synchronized wmf-config/interwiki.php: Config: [[gerrit:745301{{!}}Update interwiki cache]] (duration: 01m 06s)
* 20:55 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
* 16:36 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:55 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32332 and previous config saved to /var/cache/conftool/dbconfig/20220809-205548-ladsgroup.json
* 16:35 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:51 bking@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wdqs1014.eqiad.wmnet
* 16:33 taavi@deploy1002: Synchronized php-1.38.0-wmf.12/extensions/CodeMirror/resources/modules/ve-cm/ve.ui.CodeMirror.init.less: Backport: [[gerrit:744803{{!}}Fix invalid reference to core resources/ directory (T296639)]] (duration: 01m 06s)
* 20:51 bking@cumin1001: START - Cookbook sre.hosts.remove-downtime for wdqs1014.eqiad.wmnet
* 15:49 krinkle@deploy1002: Synchronized php-1.38.0-wmf.12/resources/src/mediawiki.base/: {{Gerrit|Ie9fa768c0dc1}} (duration: 01m 06s)
* 20:46 bking@cumin1001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
* 15:28 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:40 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P32331 and previous config saved to /var/cache/conftool/dbconfig/20220809-204042-ladsgroup.json
* 15:27 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:25 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P32330 and previous config saved to /var/cache/conftool/dbconfig/20220809-202536-ladsgroup.json
* 15:04 Amir1: removing rest of wikiuser@localhost ([[phab:T296537|T296537]])
* 20:10 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32329 and previous config saved to /var/cache/conftool/dbconfig/20220809-201030-ladsgroup.json
* 14:17 moritzm: drain primary/secondary instance off ganeti2020 [[phab:T296622|T296622]]
* 19:57 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on wdqs1014.eqiad.wmnet with reason: [[phab:T314890|T314890]]
* 14:01 moritzm: installing nss regression updates for stretch
* 19:57 bking@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on wdqs1014.eqiad.wmnet with reason: [[phab:T314890|T314890]]
* 13:57 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ml-etcd2003.codfw.wmnet with reason: switch to drbd storage
* 19:56 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on wdqs1016.eqiad.wmnet with reason: [[phab:T314890|T314890]]
* 13:57 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on ml-etcd2003.codfw.wmnet with reason: switch to drbd storage
* 19:56 bking@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on wdqs1016.eqiad.wmnet with reason: [[phab:T314890|T314890]]
* 13:56 moritzm: drain primary/secondary instance off ganeti2015 [[phab:T296622|T296622]]
* 19:55 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on wdqs1015.eqiad.wmnet with reason: [[phab:T314890|T314890]]
* 13:40 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2014.codfw.wmnet
* 19:55 bking@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on wdqs1015.eqiad.wmnet with reason: [[phab:T314890|T314890]]
* 13:34 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host ganeti2014.codfw.wmnet
* 19:38 dcausse@deploy1002: helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply
* 13:04 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2013.codfw.wmnet
* 19:36 dcausse@deploy1002: helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply
* 13:00 ema: powercycle cp5006 [[phab:T290005|T290005]]
* 19:35 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1130.eqiad.wmnet with reason: Maintenance
* 12:59 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host ganeti2013.codfw.wmnet
* 19:35 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 10:00:00 on db1130.eqiad.wmnet with reason: Maintenance
* 10:43 ema@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp5006.eqsin.wmnet
* 19:25 bking@cumin1001: START - Cookbook sre.wdqs.data-transfer
* 10:42 ema: depool cp5006, the host is down [[phab:T290005|T290005]]#7555417
* 18:09 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 10:35 ema: cp3051: repool w/ single backend experiment enabled [[phab:T288106|T288106]]
* 18:06 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
* 10:23 ema: cp3051: stop ats-be and clear its cache [[phab:T288106|T288106]]
* 17:54 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 10:22 ema: cp3051: depool to enable single backend experiment [[phab:T288106|T288106]]
* 17:47 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
* 10:16 Amir1: ladsgroup@mwmaint1002:~$ mwscript createAndPromote.php --wiki=testwiki --custom-groups=steward --force "Dom walden"
* 17:38 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1072.eqiad.wmnet with OS bullseye
* 10:05 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2014.codfw.wmnet with OS buster
* 17:29 vgutierrez: test trafficserver 9.1.2-1wm2 in cp6016 - [[phab:T309651|T309651]]
* 09:58 majavah: remove all users from obsolete "shell" and "clouadmin" groups on labtestwiki (labtestwikitech.wikimedia.org)
* 17:15 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1072.eqiad.wmnet with reason: host reimage
* 09:33 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 17:13 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1072.eqiad.wmnet with reason: host reimage
* 09:32 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 17:00 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1072.eqiad.wmnet with OS bullseye
* 09:29 dcausse: restarting blazegraph on wdqs1006 (jvm stuck for 24h)
* 16:54 bking@cumin1001: END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
* 09:23 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 16:54 bking@cumin1001: START - Cookbook sre.elasticsearch.force-shard-allocation
* 09:23 jmm@cumin2002: START - Cookbook sre.hosts.reimage for host ganeti2014.codfw.wmnet with OS buster
* 16:53 bking@cumin1001: END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
* 09:19 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 16:53 bking@cumin1001: START - Cookbook sre.elasticsearch.force-shard-allocation
* 09:18 taavi@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:743659{{!}}Remove UserMerge rights from labswiki (wikitech)]] (duration: 01m 07s)
* 16:26 bking@deploy1002: helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
* 09:05 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2013.codfw.wmnet with OS buster
* 16:26 bking@deploy1002: helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
* 08:34 jmm@cumin2002: START - Cookbook sre.hosts.reimage for host ganeti2013.codfw.wmnet with OS buster
* 16:01 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1069.eqiad.wmnet with OS bullseye
* 04:50 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1028.eqiad.wmnet with OS buster
* 15:45 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1069.eqiad.wmnet with reason: host reimage
* 03:37 andrew@cumin1001: START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster
* 15:42 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1069.eqiad.wmnet with reason: host reimage
* 02:09 legoktm: powercycle graphite1004 via mgmt
* 15:32 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 00:51 ebernhardson@deploy1002: Synchronized php-1.38.0-wmf.12/extensions/GrowthExperiments/includes/NewcomerTasks/AddImage/AddImageSubmissionHandler.php: backport window for 744896 (duration: 01m 05s)
* 15:31 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 00:42 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 15:31 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 00:41 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 15:30 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1069.eqiad.wmnet with OS bullseye
* 00:20 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 15:28 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 00:13 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 15:27 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1058.eqiad.wmnet with OS bullseye
* 00:09 ebernhardson@deploy1002: Synchronized wmf-config/InitialiseSettings.php: [[phab:T296897|T296897]] Move cirrus traffic back to eqiad (duration: 01m 08s)
* 15:08 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1058.eqiad.wmnet with reason: host reimage
* 15:05 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1058.eqiad.wmnet with reason: host reimage
* 14:59 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
* 14:59 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
* m: finished running 'homer "status:active" commit "netmon: Add the netmon1003 host as a syslog destination"' in the cumin1001 host. Homer reported no errors.
* 14:54 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
* 14:50 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1058.eqiad.wmnet with OS bullseye
* 14:28 bking@cumin1001: conftool action : set/pooled=false; selector: dnsdisc=wdqs,name=codfw
* 13:57 kevinbazira@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
* 13:57 kevinbazira@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
* m: Add the new netmon1003 host as a syslog destination in homer templates/common/system.conf https://gerrit.wikimedia.org/r/c/operations/homer/public/+/819124
* m: Successfully ran '# run-puppet-merge' in the netmon1002 and netmon1003 hosts.
* m: Running '# run-puppet-agent' in the netmon1003 host
* m: Running '# run-puppet-agent' in the netmon1002 host
* 13:47 ryankemper@cumin1001: END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
* 13:46 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.force-shard-allocation
* m: puppet-merge on puppetmaster2004.codfw.wmnet for patch 819179 succeeded
* m: Set netmon1003 as netmon_server and netmon1002 as a netmon_servers_failover in the Puppet repository https://gerrit.wikimedia.org/r/c/operations/puppet/+/819179
* m: authdns updated successfully
* m: Had to revert https://gerrit.wikimedia.org/r/c/operations/dns/+/819177 because I rebased my changes incorrectly, sent the new patch in https://gerrit.wikimedia.org/r/c/operations/dns/+/821746
* m: running '# authdns-update' in ns0.wikimedia.org
* m: Flip DNS for LibreNMS and Smokeping from netmon1002 to netmon1003 https://gerrit.wikimedia.org/r/c/operations/dns/+/819177
* 13:23 jynus: stop replication on db1117:m1 [[phab:T309074|T309074]]
* m: netmon1002 to netmon1003 failover
* 13:17 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
* 13:16 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
* 10:58 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
* 09:53 vgutierrez: rolling restart of pybal in eqsin - [[phab:T310070|T310070]]
* 09:25 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
* 09:24 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
* 09:24 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
* 09:12 vgutierrez: rolling restart of pybal in codfw - [[phab:T310070|T310070]]
* 08:47 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
* 08:30 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
* 08:28 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
* 08:27 elukey@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
* 08:27 elukey@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
* 08:27 elukey@deploy1002: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
* 08:26 elukey@deploy1002: helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
* 08:24 jynus: starting data check using es1021 and es2021, expect increased read traffic [[phab:T314559|T314559]]
* 08:21 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
* 06:22 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance
* 06:22 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance
* 06:19 Amir1: dbmaint s5@eqiad ([[phab:T312863|T312863]] [[phab:T312984|T312984]] [[phab:T310011|T310011]] [[phab:T310485|T310485]])
* 06:11 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1130.eqiad.wmnet with reason: Maint
* 06:11 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1130.eqiad.wmnet with reason: Maint
* 06:08 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool db1130 [[phab:T314370|T314370]]', diff saved to https://phabricator.wikimedia.org/P32323 and previous config saved to /var/cache/conftool/dbconfig/20220809-060836-ladsgroup.json
* 06:07 oblivian@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 06:02 ladsgroup@cumin1001: dbctl commit (dc=all): 'Promote db1100 to s5 primary and set section read-write [[phab:T314370|T314370]]', diff saved to https://phabricator.wikimedia.org/P32322 and previous config saved to /var/cache/conftool/dbconfig/20220809-060159-ladsgroup.json
* 06:01 ladsgroup@cumin1001: dbctl commit (dc=all): 'Set s5 eqiad as read-only for maintenance - [[phab:T314370|T314370]]', diff saved to https://phabricator.wikimedia.org/P32321 and previous config saved to /var/cache/conftool/dbconfig/20220809-060105-ladsgroup.json
* 06:00 Amir1: Starting s5 eqiad failover from db1130 to db1100 - [[phab:T314370|T314370]]
* 05:12 ladsgroup@cumin1001: dbctl commit (dc=all): 'Set db1100 with weight 0 [[phab:T314370|T314370]]', diff saved to https://phabricator.wikimedia.org/P32320 and previous config saved to /var/cache/conftool/dbconfig/20220809-051251-ladsgroup.json
* 05:12 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 22 hosts with reason: Primary switchover s5 [[phab:T314370|T314370]]
* 05:11 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on 22 hosts with reason: Primary switchover s5 [[phab:T314370|T314370]]
* 02:42 ejegg: SmashPig upgraded from {{Gerrit|9b97ea15}} to {{Gerrit|13e9e9cc}}
* 02:31 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1148 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32318 and previous config saved to /var/cache/conftool/dbconfig/20220809-023113-ladsgroup.json
* 02:31 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance
* 02:30 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance
* 02:30 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32317 and previous config saved to /var/cache/conftool/dbconfig/20220809-023052-ladsgroup.json
* 02:28 ejegg: payments-wiki upgraded from {{Gerrit|6880236d}} to {{Gerrit|cf5e1848}}
* 02:15 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P32316 and previous config saved to /var/cache/conftool/dbconfig/20220809-021546-ladsgroup.json
* 02:00 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P32315 and previous config saved to /var/cache/conftool/dbconfig/20220809-020040-ladsgroup.json
* 01:45 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32314 and previous config saved to /var/cache/conftool/dbconfig/20220809-014534-ladsgroup.json


== 2021-12-07 ==
== 2022-08-08 ==
* 23:21 andrew@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1028.eqiad.wmnet with OS buster
* 23:52 tstarling@deploy1002: Synchronized wmf-config/InitialiseSettings.php: clean up testwiki experiments [[phab:T314750|T314750]] (duration: 03m 19s)
* 23:01 jgleeson: updated payments-wiki from {{Gerrit|4a4ef51d}} to {{Gerrit|2e164062}}
* 23:47 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 22:56 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 23:46 tstarling@deploy1002: Synchronized wmf-config/CommonSettings.php: clean up testwiki experiments [[phab:T314750|T314750]] (duration: 03m 27s)
* 22:49 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 23:46 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 22:33 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 23:46 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 22:27 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 23:45 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 22:19 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 23:32 eileen___: config revision changed from {{Gerrit|f5668044}} to {{Gerrit|787cd0e0}}
* 22:18 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 23:32 eileen___: civicrm upgraded from {{Gerrit|497bddf7}} to {{Gerrit|1f91ac2d}}
* 22:15 dancy@deploy1002: rebuilt and synchronized wikiversions files: group0 wikis to 1.38.0-wmf.12  refs [[phab:T293953|T293953]]
* 22:16 ryankemper@cumin1001: END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 22:12 dancy@deploy1002: Pruned MediaWiki: 1.38.0-wmf.7 (duration: 04m 18s)
* 22:16 ryankemper@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host elastic1065.eqiad.wmnet with OS bullseye
* 22:07 dancy@deploy1002: Finished scap: testwikis wikis to 1.38.0-wmf.12  refs [[phab:T293953|T293953]] (duration: 44m 14s)
* 21:53 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1065.eqiad.wmnet with reason: host reimage
* 22:07 andrew@cumin1001: START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster
* 21:50 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1065.eqiad.wmnet with reason: host reimage
* 22:06 andrew@cumin1001: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1028.eqiad.wmnet with OS buster
* 21:36 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1065.eqiad.wmnet with OS bullseye
* 21:26 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 21:12 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1062.eqiad.wmnet with OS bullseye
* 21:25 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:53 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1062.eqiad.wmnet with reason: host reimage
* 21:23 dancy@deploy1002: Started scap: testwikis wikis to 1.38.0-wmf.12  refs [[phab:T293953|T293953]]
* 20:50 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1062.eqiad.wmnet with reason: host reimage
* 21:18 andrew@cumin1001: START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster
* 20:36 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1062.eqiad.wmnet with OS bullseye
* 19:58 ebernhardson@deploy1002: Finished deploy [wdqs/wdqs@c21117f] (wcqs): Deploy version 0.3.95 to wcqs (duration: 01m 48s)
* 20:32 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 19:56 ebernhardson@deploy1002: Started deploy [wdqs/wdqs@c21117f] (wcqs): Deploy version 0.3.95 to wcqs
* 20:31 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 19:56 cmooney@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1028.eqiad.wmnet with OS buster
* 20:31 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 19:49 eileen: revision civicrm {{Gerrit|311382de}} -> {{Gerrit|c47cf762}}
* 20:29 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 19:26 legoktm: upgrading scap to 4.1.0 everywhere ([[phab:T296867|T296867]])
* 20:28 cjming: end of UTC late backport window
* 19:26 cmooney@cumin1001: START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster
* 20:27 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 19:18 herron: graphite1004.mgmt: racadm serveraction powercycle
* 20:27 cjming@deploy1002: Synchronized php-1.39.0-wmf.23/skins/Vector/resources/skins.vector.styles/layouts/grid.less: Backport: [[gerrit:821243{{!}}Fix grid blowout bug (T314756)]] (duration: 03m 26s)
* 19:13 ebernhardson: start outage recovery for commonswiki against eqiad cirrus cluster after snapshot restore
* 20:12 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 18:47 jgiannelos@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' .
* 20:11 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 18:46 jgiannelos@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' .
* 20:11 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 18:45 jgiannelos@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' .
* 20:11 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:817785{{!}}Disable sticky header edit A/B test for pilot wikis (T312296)]] (duration: 03m 35s)
* 18:38 jgiannelos@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' .
* 20:08 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 18:33 jgiannelos@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' .
* 17:34 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1088.eqiad.wmnet with OS bullseye
* 18:27 jgiannelos@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' .
* 17:15 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1088.eqiad.wmnet with reason: host reimage
* 17:54 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2002.codfw.wmnet
* 17:12 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1088.eqiad.wmnet with reason: host reimage
* 17:51 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host rpki2002.codfw.wmnet
* 17:00 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1088.eqiad.wmnet with OS bullseye
* 17:41 herron: graphite1004.mgmt: racadm serveraction powercycle
* 16:54 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1085.eqiad.wmnet with OS bullseye
* 17:40 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet
* 16:49 ryankemper@cumin1001: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 17:36 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet
* 16:43 pt1979@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 17:36 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet
* 16:41 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
* 17:35 root@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'miscweb' for release 'main' .
* 16:39 pt1979@cumin2002: START - Cookbook sre.dns.netbox
* 17:34 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet
* 16:38 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
* 17:33 root@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'miscweb' for release 'main' .
* 16:26 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1085.eqiad.wmnet with OS bullseye
* 17:32 root@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'miscweb' for release 'main' .
* 16:24 bking@cumin1001: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host elastic1085.eqiad.wmnet with OS bullseye
* 17:31 root@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'miscweb' for release 'main' .
* 16:19 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
* 17:27 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 16:16 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
* 17:27 root@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'miscweb' for release 'main' .
* 16:16 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 17:26 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 16:14 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
* 17:25 root@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'miscweb' for release 'main' .
* 16:12 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
* 17:19 root@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' .
* 16:10 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
* 16:25 Amir1: deleting broken flaggedtemplates rows on dewiki ([[phab:T297094|T297094]])
* 16:09 ryankemper@cumin1001: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 16:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host copernicium.wikimedia.org
* 16:04 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1085.eqiad.wmnet with OS bullseye
* 16:08 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host copernicium.wikimedia.org
* 16:00 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1084.eqiad.wmnet with OS bullseye
* 16:07 root@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' .
* 15:58 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 16:07 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid2002.codfw.wmnet
* 15:47 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1084.eqiad.wmnet with reason: host reimage
* 16:04 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host failoid2002.codfw.wmnet
* 15:46 sukhe: upload reprepro -C main include bullseye-wikimedia python-pynetbox_6.6.0-1+wmf11u1_amd64.changes
* 16:02 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica1003.wikimedia.org
* 15:45 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1084.eqiad.wmnet with reason: host reimage
* 16:00 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host ldap-replica1003.wikimedia.org
* 15:37 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2021.codfw.wmnet with reason: Maint
* 15:55 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica2005.wikimedia.org
* 15:37 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2021.codfw.wmnet with reason: Maint
* 15:53 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host ldap-replica2005.wikimedia.org
* 15:32 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1084.eqiad.wmnet with OS bullseye
* 15:53 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 14:59 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
* 15:52 kormat@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2084.codfw.wmnet with reason: Reracking [[phab:T296930|T296930]]
* 14:55 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
* 15:52 kormat@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2084.codfw.wmnet with reason: Reracking [[phab:T296930|T296930]]
* 14:47 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cp5001.eqsin.wmnet with reason: depooled: faulty DIMM: [[phab:T314256|T314256]]
* 15:50 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica2006.wikimedia.org
* 14:46 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on cp5001.eqsin.wmnet with reason: depooled: faulty DIMM: [[phab:T314256|T314256]]
* 15:48 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 14:34 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
* 15:47 ladsgroup@deploy1002: Synchronized php-1.38.0-wmf.9/extensions/FlaggedRevs/backend/FlaggedRevision.php: Backport: [[gerrit:744797{{!}}Do not inject rev id of template when it's empty]] (duration: 00m 57s)
* 14:11 kevinbazira@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
* 15:44 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host ldap-replica2006.wikimedia.org
* 13:03 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 15:33 sukhe@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 10 hosts with reason: debugging bird/anycast-hc issues
* 13:01 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 15:33 sukhe@cumin1001: START - Cookbook sre.hosts.downtime for 0:30:00 on 10 hosts with reason: debugging bird/anycast-hc issues
* 13:01 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 15:25 pt1979@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2026.codfw.wmnet with OS buster
* 12:58 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 15:21 sukhe@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on durum2002.codfw.wmnet with reason: debugging bird/anycast-hc issues
* 12:56 urbanecm@deploy1002: Synchronized wmf-config/CommonSettings.php: {{Gerrit|77fd5abdd7d9462869259e1511bbcf2d7ce62246}}: Growth: Add new rights to wgAvailableRights (duration: 03m 24s)
* 15:21 sukhe@cumin1001: START - Cookbook sre.hosts.downtime for 0:30:00 on durum2002.codfw.wmnet with reason: debugging bird/anycast-hc issues
* 12:30 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1102.eqiad.wmnet
* 15:14 sukhe: running authdns-update for Gerrit:744094
* 12:09 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 15:09 pt1979@cumin2002: START - Cookbook sre.hosts.reimage for host restbase2026.codfw.wmnet with OS buster
* 12:09 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 14:38 jbond: re-enable puppet fleet-wide post monitoring refactor 744787
* 12:06 urbanecm@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/GrowthExperiments/: {{Gerrit|3eaf155678b7313c55dcca0cd39ab29f73eead37}}: MentorTools: Do not use MentorWeightManager ([[phab:T314362|T314362]]) (duration: 03m 31s)
* 14:28 godog: reboot graphite1004 - [[phab:T297180|T297180]]
* 12:04 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 14:15 Amir1: fixing heartbeat grants for wikiuser across the cluster ([[phab:T296537|T296537]])
* 11:43 btullis@cumin1001: START - Cookbook sre.hosts.reboot-single for host an-worker1102.eqiad.wmnet
* 14:11 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti[2013-2014].codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage
* 11:21 jelto@cumin1001: conftool action : set/pooled=yes; selector: name=kubernetes2022.codfw.wmnet
* 14:11 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ganeti[2013-2014].codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage
* 11:21 jelto: kubectl uncordon kubernetes2022.codfw.wmnet
* 14:07 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2006.codfw.wmnet with reason: switch to drbd storage
* 10:43 Amir1: Removing db2079 from orchestrator ([[phab:T313885|T313885]])
* 14:07 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2006.codfw.wmnet with reason: switch to drbd storage
* 10:39 Amir1: Removing db2079 from zarcillo ([[phab:T313885|T313885]])
* 13:52 Amir1: removing wikiuser@localhost on s6 ([[phab:T296537|T296537]])
* 10:35 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2079.codfw.wmnet
* 13:45 pt1979@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2026.codfw.wmnet with OS buster
* 10:35 ladsgroup@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2004.codfw.wmnet with reason: switch to drbd storage
* 10:30 ladsgroup@cumin1001: START - Cookbook sre.dns.netbox
* 13:42 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2004.codfw.wmnet with reason: switch to drbd storage
* 10:25 ladsgroup@cumin1001: START - Cookbook sre.hosts.decommission for hosts db2079.codfw.wmnet
* 13:40 godog: reboot graphite2003 - [[phab:T297180|T297180]]
* 10:18 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on db2079.codfw.wmnet with reason: Decom
* 13:39 jbond: disable puppet fleet-wide to roll out 744787
* 10:17 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on db2079.codfw.wmnet with reason: Decom
* 13:26 pt1979@cumin2002: START - Cookbook sre.hosts.reimage for host restbase2026.codfw.wmnet with OS buster
* 08:41 jbond: deploy libtirpc update
* 13:16 jelto: update GitLab to 14.4.4-ce.0
* 07:57 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1144:3314 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32310 and previous config saved to /var/cache/conftool/dbconfig/20220808-075723-ladsgroup.json
* 13:07 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ganeti2014.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage
* 07:57 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
* 13:07 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on ganeti2014.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage
* 07:57 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
* 12:46 Lucas_WMDE: UTC morning backport+config window done
* 07:57 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32309 and previous config saved to /var/cache/conftool/dbconfig/20220808-075702-ladsgroup.json
* 12:46 Lucas_WMDE: deployed [[gerrit:744071{{!}}Update termbox to 2021-12-06-171243-production (T297006)]]
* 07:53 godog: grow sda/sdb 3 by 100G on thanos-be2001 - [[phab:T314275|T314275]]
* 12:44 lucaswerkmeister-wmde@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'production' .
* 07:50 godog: grow sda/sdb 3 by 100G on thanos-be1004 - [[phab:T314275|T314275]]
* 12:42 lucaswerkmeister-wmde@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'termbox' for release 'production' .
* 07:41 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P32308 and previous config saved to /var/cache/conftool/dbconfig/20220808-074156-ladsgroup.json
* 12:39 lucaswerkmeister-wmde@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' .
* 07:32 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 12:39 lucaswerkmeister-wmde@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
* 07:27 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 12:24 jbond: merge refactor of monitoring classes 725045
* 07:27 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 12:16 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18071 and previous config saved to /var/cache/conftool/dbconfig/20211207-121655-marostegui.json
* 07:26 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P32307 and previous config saved to /var/cache/conftool/dbconfig/20220808-072650-ladsgroup.json
* 12:10 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 07:23 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 12:09 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:744043{{!}}Enable reply tool by default on mediawikiwiki (T296444)]] (duration: 00m 57s)
* 07:22 kartik@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820815{{!}}trwikivoyage: Create rollbacker user group (T314678)]] (duration: 03m 17s)
* 12:09 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 07:18 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 12:01 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P18070 and previous config saved to /var/cache/conftool/dbconfig/20211207-120150-marostegui.json
* 07:17 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 11:51 moritzm: draining primary/secondary instances off ganeti2014 [[phab:T296622|T296622]]
* 07:17 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 11:46 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P18069 and previous config saved to /var/cache/conftool/dbconfig/20211207-114645-marostegui.json
* 07:16 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 11:38 cmooney@cumin1001: END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host cloudvirt1028.eqiad.wmnet
* 07:11 elukey: restart rsyslog on ml-serve2007
* 11:32 cmooney@cumin1001: START - Cookbook sre.hosts.dhcp for host cloudvirt1028.eqiad.wmnet
* 07:11 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 11:31 topranks: removing IP addressing on cloudvirt1028 manually and forcing DHCP to debug reimage failure ([[phab:T296906|T296906]])
* 07:11 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32306 and previous config saved to /var/cache/conftool/dbconfig/20220808-071144-ladsgroup.json
* 11:31 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18068 and previous config saved to /var/cache/conftool/dbconfig/20211207-113140-marostegui.json
* 07:10 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 11:30 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18067 and previous config saved to /var/cache/conftool/dbconfig/20211207-113005-marostegui.json
* 07:10 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 11:30 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db[1155-1156].eqiad.wmnet with reason: Maintenance [[phab:T277354|T277354]]
* 07:09 kartik@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820261{{!}}Enable SectionTranslation on 10 Wikipedias where ContentTranslation is default (T308829)]] (duration: 03m 15s)
* 11:29 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db[1155-1156].eqiad.wmnet with reason: Maintenance [[phab:T277354|T277354]]
* 07:09 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 11:27 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18066 and previous config saved to /var/cache/conftool/dbconfig/20211207-112707-marostegui.json
* 07:06 XioNoX: add CSP headers to Netbox - [[phab:T296356|T296356]]
* 11:26 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2004.codfw.wmnet with reason: switch to drbd storage
* 07:05 elukey: restart rsyslog on ml-serve-ctrl2001
* 11:26 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2004.codfw.wmnet with reason: switch to drbd storage
* 11:12 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P18065 and previous config saved to /var/cache/conftool/dbconfig/20211207-111203-marostegui.json
* 11:11 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2012.codfw.wmnet
* 11:06 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host ganeti2012.codfw.wmnet
* 10:56 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P18064 and previous config saved to /var/cache/conftool/dbconfig/20211207-105658-marostegui.json
* 10:41 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18063 and previous config saved to /var/cache/conftool/dbconfig/20211207-104153-marostegui.json
* 10:40 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18062 and previous config saved to /var/cache/conftool/dbconfig/20211207-104018-marostegui.json
* 10:40 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1182.eqiad.wmnet with reason: Maintenance [[phab:T277354|T277354]]
* 10:40 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on db1182.eqiad.wmnet with reason: Maintenance [[phab:T277354|T277354]]
* 10:40 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18061 and previous config saved to /var/cache/conftool/dbconfig/20211207-104010-marostegui.json
* 10:32 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2016.codfw.wmnet
* 10:26 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host ganeti2016.codfw.wmnet
* 10:25 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 10:25 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P18060 and previous config saved to /var/cache/conftool/dbconfig/20211207-102505-marostegui.json
* 10:24 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 10:13 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ganeti2013.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage
* 10:13 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on ganeti2013.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage
* 10:11 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2012.codfw.wmnet with OS buster
* 10:10 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P18059 and previous config saved to /var/cache/conftool/dbconfig/20211207-101001-marostegui.json
* 10:01 marostegui: Deploy schema change on mailman (m5) [[phab:T286552|T286552]]
* 09:54 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18058 and previous config saved to /var/cache/conftool/dbconfig/20211207-095456-marostegui.json
* 09:53 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1170:3312 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18057 and previous config saved to /var/cache/conftool/dbconfig/20211207-095319-marostegui.json
* 09:53 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1170.eqiad.wmnet with reason: Maintenance [[phab:T277354|T277354]]
* 09:53 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on db1170.eqiad.wmnet with reason: Maintenance [[phab:T277354|T277354]]
* 09:53 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18056 and previous config saved to /var/cache/conftool/dbconfig/20211207-095312-marostegui.json
* 09:40 XioNoX: codfw, normalize VRRP - [[phab:T289241|T289241]]
* 09:38 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P18055 and previous config saved to /var/cache/conftool/dbconfig/20211207-093807-marostegui.json
* 09:38 XioNoX: cr2-codfw - FPC 1 PIC 1 Need bounce - [[phab:T289241|T289241]]
* 09:34 XioNoX: move all VRRP primary to cr1-codfw - [[phab:T289241|T289241]]
* 09:31 XioNoX: cr1-codfw - FPC 1 PIC 0 Need bounce - [[phab:T289241|T289241]]
* 09:29 jmm@cumin2002: START - Cookbook sre.hosts.reimage for host ganeti2012.codfw.wmnet with OS buster
* 09:27 XioNoX: move all VRRP primary to cr2-codfw - [[phab:T289241|T289241]]
* 09:26 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2016.codfw.wmnet with OS buster
* 09:23 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P18054 and previous config saved to /var/cache/conftool/dbconfig/20211207-092302-marostegui.json
* 09:07 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18053 and previous config saved to /var/cache/conftool/dbconfig/20211207-090758-marostegui.json
* 09:06 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1146:3312 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18052 and previous config saved to /var/cache/conftool/dbconfig/20211207-090620-marostegui.json
* 09:06 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1146.eqiad.wmnet with reason: Maintenance [[phab:T277354|T277354]]
* 09:06 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on db1146.eqiad.wmnet with reason: Maintenance [[phab:T277354|T277354]]
* 09:06 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1129 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18051 and previous config saved to /var/cache/conftool/dbconfig/20211207-090613-marostegui.json
* 08:55 moritzm: draining primary/secondary instances off ganeti2013 [[phab:T296622|T296622]]
* 08:51 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P18050 and previous config saved to /var/cache/conftool/dbconfig/20211207-085108-marostegui.json
* 08:47 jmm@cumin2002: START - Cookbook sre.hosts.reimage for host ganeti2016.codfw.wmnet with OS buster
* 08:45 jmm@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2016.codfw.wmnet with OS buster
* 08:36 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P18049 and previous config saved to /var/cache/conftool/dbconfig/20211207-083604-marostegui.json
* 08:21 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1129 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18048 and previous config saved to /var/cache/conftool/dbconfig/20211207-082059-marostegui.json
* 08:19 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1129 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18047 and previous config saved to /var/cache/conftool/dbconfig/20211207-081936-marostegui.json
* 08:19 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1129.eqiad.wmnet with reason: Maintenance [[phab:T277354|T277354]]
* 08:19 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on db1129.eqiad.wmnet with reason: Maintenance [[phab:T277354|T277354]]
* 08:19 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1162 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18046 and previous config saved to /var/cache/conftool/dbconfig/20211207-081928-marostegui.json
* 08:04 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P18045 and previous config saved to /var/cache/conftool/dbconfig/20211207-080424-marostegui.json
* 07:49 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P18044 and previous config saved to /var/cache/conftool/dbconfig/20211207-074919-marostegui.json
* 07:46 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 07:43 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 07:39 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|2178202b86acd50b713d939c4bcfedf7d2fa93e7}}: Deploy Growth mentor dashboard to all wikis ([[phab:T278920|T278920]]) (duration: 00m 58s)
* 07:37 oblivian@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 07:34 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1162 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18043 and previous config saved to /var/cache/conftool/dbconfig/20211207-073413-marostegui.json
* 07:33 oblivian@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 07:32 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1162 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18042 and previous config saved to /var/cache/conftool/dbconfig/20211207-073252-marostegui.json
* 07:32 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1162.eqiad.wmnet with reason: Maintenance [[phab:T277354|T277354]]
* 07:32 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on db1162.eqiad.wmnet with reason: Maintenance [[phab:T277354|T277354]]
* 07:23 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 8 hosts with reason: Maintenance [[phab:T277354|T277354]]
* 07:23 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on 8 hosts with reason: Maintenance [[phab:T277354|T277354]]
* 07:23 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18041 and previous config saved to /var/cache/conftool/dbconfig/20211207-072311-marostegui.json
* 07:16 marostegui: power off db2074, db2078, db2101, db2130, dbproxy2004 [[phab:T296930|T296930]]
* 07:08 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P18040 and previous config saved to /var/cache/conftool/dbconfig/20211207-070806-marostegui.json
* 06:53 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P18039 and previous config saved to /var/cache/conftool/dbconfig/20211207-065301-marostegui.json
* 06:37 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18038 and previous config saved to /var/cache/conftool/dbconfig/20211207-063756-marostegui.json
* 06:36 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1105:3312 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18037 and previous config saved to /var/cache/conftool/dbconfig/20211207-063621-marostegui.json
* 06:36 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1105.eqiad.wmnet with reason: Maintenance [[phab:T277354|T277354]]
* 06:36 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on db1105.eqiad.wmnet with reason: Maintenance [[phab:T277354|T277354]]
* 06:35 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [[phab:T277354|T277354]]
* 06:35 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [[phab:T277354|T277354]]
* 06:31 marostegui@cumin1001: dbctl commit (dc=all): 'After maintenance db1100 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18036 and previous config saved to /var/cache/conftool/dbconfig/20211207-063140-marostegui.json
* 06:16 marostegui@cumin1001: dbctl commit (dc=all): 'After maintenance db1100', diff saved to https://phabricator.wikimedia.org/P18035 and previous config saved to /var/cache/conftool/dbconfig/20211207-061635-marostegui.json
* 06:14 marostegui: Apply SET GLOBAL innodb_checksum_algorithm=full_crc32; on db1107 [[phab:T287244|T287244]]
* 06:01 marostegui@cumin1001: dbctl commit (dc=all): 'After maintenance db1100', diff saved to https://phabricator.wikimedia.org/P18034 and previous config saved to /var/cache/conftool/dbconfig/20211207-060130-marostegui.json
* 05:58 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2074 and db2130 [[phab:T296930|T296930]]', diff saved to https://phabricator.wikimedia.org/P18033 and previous config saved to /var/cache/conftool/dbconfig/20211207-055808-marostegui.json
* 05:46 marostegui@cumin1001: dbctl commit (dc=all): 'After maintenance db1100 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18032 and previous config saved to /var/cache/conftool/dbconfig/20211207-054625-marostegui.json
* 05:45 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1100 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18031 and previous config saved to /var/cache/conftool/dbconfig/20211207-054506-marostegui.json
* 05:45 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1100.eqiad.wmnet with reason: Maintenance [[phab:T277354|T277354]]
* 05:45 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on db1100.eqiad.wmnet with reason: Maintenance [[phab:T277354|T277354]]
* 00:10 cwhite: end codfw opensearch upgrade [[phab:T288621|T288621]]


== 2021-12-06 ==
== 2022-08-07 ==
* 22:19 mstyles@deploy1002: Synchronized php-1.38.0-wmf.9/includes/content/ContentModelChange.php: Deploy security patch for [[phab:T271037|T271037]] (duration: 00m 56s)
* 19:58 taavi: taavi@mwmaint1002 ~ $ echo "https://upload.wikimedia.org/wikipedia/commons/1/15/Keep_tidy_ask.svg" {{!}} mwscript purgeList.php --wiki enwiki # [[phab:T314712|T314712]]
* 20:14 cwhite: begin codfw opensearch upgrade [[phab:T288621|T288621]]
* 13:52 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1142 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32305 and previous config saved to /var/cache/conftool/dbconfig/20220807-135204-ladsgroup.json
* 20:14 cwhite: begin codfw opensearch upgrade [[phab:T288612|T288612]]
* 13:51 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance
* 19:58 legoktm: trying new dump of Special:CodeReview on mwmaint1002 ([[phab:T205361|T205361]])
* 13:51 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance
* 19:26 legoktm: installing php-yaml on all appservers
* 13:51 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32304 and previous config saved to /var/cache/conftool/dbconfig/20220807-135143-ladsgroup.json
* 19:08 damilare: updated civicrm from {{Gerrit|b82183b9}} to {{Gerrit|311382de}}
* 13:36 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P32303 and previous config saved to /var/cache/conftool/dbconfig/20220807-133637-ladsgroup.json
* 19:04 taavi@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:742835{{!}}bnwikibooks: add autopatrolled and patroller user groups (T296640)]] (duration: 00m 56s)
* 13:21 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P32302 and previous config saved to /var/cache/conftool/dbconfig/20220807-132131-ladsgroup.json
* 19:03 cmooney@cumin1001: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1028.eqiad.wmnet with OS buster
* 13:06 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32301 and previous config saved to /var/cache/conftool/dbconfig/20220807-130625-ladsgroup.json
* 19:02 cmooney@cumin1001: START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster
* 12:06 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1141 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32300 and previous config saved to /var/cache/conftool/dbconfig/20220807-120610-ladsgroup.json
* 19:02 cmooney@cumin1001: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1028.eqiad.wmnet with OS buster
* 12:06 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance
* 19:00 cmooney@cumin1001: START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster
* 12:05 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance
* 18:52 cmooney@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1028.eqiad.wmnet with OS buster
* 12:05 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32299 and previous config saved to /var/cache/conftool/dbconfig/20220807-120549-ladsgroup.json
* 18:45 cmooney@cumin1001: START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster
* 11:50 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P32298 and previous config saved to /var/cache/conftool/dbconfig/20220807-115043-ladsgroup.json
* 18:43 cmooney@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1028.eqiad.wmnet with OS buster
* 11:35 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P32297 and previous config saved to /var/cache/conftool/dbconfig/20220807-113537-ladsgroup.json
* 18:34 cmooney@cumin1001: START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster
* 11:20 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32296 and previous config saved to /var/cache/conftool/dbconfig/20220807-112031-ladsgroup.json
* 18:00 majavah: "foreachwiki namespaceDupes.php --fix {{!}} tee namespaceDupes-[[phab:T293839|T293839]]-fix.txt" FINISHED about 15 minutes ago [[phab:T293839|T293839]]
* 17:27 ebernhardson@deploy1002: Synchronized wmf-config/InitialiseSettings.php: [[phab:T296897|T296897]] Move cirrus traffic to codfw (duration: 00m 56s)
* 16:24 majavah: starting "foreachwiki namespaceDupes.php --fix {{!}} tee namespaceDupes-[[phab:T293839|T293839]]-fix.txt" in mwmaint1002 screen session, [[phab:T293839|T293839]]
* 15:55 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ganeti2012.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage
* 15:55 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on ganeti2012.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage
* 14:45 elukey: roll restart of nfacctd on netflow* nodes to pick up the new CA bundle for librdkafka
* 14:19 moritzm: draining primary/secondary instances off ganeti2012 [[phab:T296622|T296622]]
* 14:06 jmm@cumin2002: START - Cookbook sre.hosts.reimage for host ganeti2016.codfw.wmnet with OS buster
* 14:00 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|4d8a75d5f01e8e2cf724e19db2e9bcc12fb8f5f4}}: Deploy Growth features on zhwiki in dark mode ([[phab:T287884|T287884]]) (duration: 00m 56s)
* 13:56 urbanecm: [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/initWikiConfig.php --wiki=zhwiki --phab=[[phab:T287884|T287884]]
* 13:52 urbanecm: [urbanecm@mwmaint1002 ~]$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=zhwiki growthexperiments # [[phab:T287884|T287884]]
* 13:31 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ganeti2016.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage
* 13:31 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on ganeti2016.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage
* 13:30 oblivian@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 13:25 oblivian@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 13:03 majavah: $ mwscript namespaceDupes.php --wiki barwiki --fix --add-prefix=BROKEN # [[phab:T293839|T293839]]
* 12:58 majavah: mwscript namespaceDupes.php --wiki skwiki --fix --add-prefix=BROKEN # [[phab:T293839|T293839]]
* 12:54 majavah: mwscript namespaceDupes.php --wiki skwiki --fix # [[phab:T293839|T293839]]
* 12:50 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2011.codfw.wmnet with reason: readding to cluster after reimage
* 12:50 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ganeti2011.codfw.wmnet with reason: readding to cluster after reimage
* 12:48 taavi@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:734383{{!}}Set default two-letter NS_PROJECT aliases (T293839)]] (duration: 00m 55s)
* 12:41 taavi@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:743533{{!}}Enable Autopatroller level page protection for English Wiktionary (T296580)]] (duration: 00m 56s)
* 12:28 taavi@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:743529{{!}}Enable SandboxLink extension for bnwikivoyage (T296637)]] (duration: 00m 55s)
* 12:22 taavi@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:743528{{!}}Enable groups autopatrolled and patroller for bnwikivoyage (T296637)]] (duration: 00m 56s)
* 12:15 taavi@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:743158{{!}}Enable SectionTranslation in Malayalam, Malay, Azerbaijani, Tamil, Bashkir and Albanian WPs (T285842)]] (duration: 00m 56s)
* 12:08 taavi@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:742833{{!}}hewiki: add "templateeditor" permission group (T296769)]] (duration: 00m 57s)
* 11:48 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2011.codfw.wmnet
* 11:41 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host ganeti2011.codfw.wmnet
* 11:28 Amir1: dropping wikiadmin@localhost from all of s3 ([[phab:T296511|T296511]])
* 11:21 Amir1: dropping wikiadmin@localhost from all of s2 ([[phab:T296511|T296511]])
* 11:12 moritzm: draining primary/secondary instances off ganeti2016 [[phab:T296622|T296622]]
* 10:38 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ml-etcd2003.codfw.wmnet with reason: switch to drbd storage
* 10:38 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on ml-etcd2003.codfw.wmnet with reason: switch to drbd storage
* 10:36 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2011.codfw.wmnet with OS buster
* 10:31 oblivian@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 10:28 oblivian@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 10:23 moritzm: draining primary/secondary instances off ganeti2015 [[phab:T296622|T296622]]
* 09:58 jmm@cumin2002: START - Cookbook sre.hosts.reimage for host ganeti2011.codfw.wmnet with OS buster
* 09:09 elukey: move kafka main codfw to fixed uid/gid for the kafka user (requires a stop/start of all daemons) - [[phab:T296982|T296982]]
* 08:13 moritzm: installing remaining icu security updates on buster


== 2021-12-04 ==
== 2022-08-06 ==
* 01:14 mutante: mx2001 - did not come back from reboot, did not get IP on interface, could not start ferm, logged in via console with root password, in /etc/network/interfaces replaced all "ens5" with "ens13", rebooted again, selected previous kernel version
* 17:59 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1149 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32295 and previous config saved to /var/cache/conftool/dbconfig/20220806-175916-ladsgroup.json
* 00:54 mutante: rebooting mx2001
* 17:59 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance
* 00:31 jynus: manually restarting clamav on otrs1001 after being killed
* 17:58 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance
* 03:10 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 03:09 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 03:09 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 03:08 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 03:03 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 03:02 krinkle@deploy1002: Synchronized w/: {{Gerrit|I9067d47fab0324}} (duration: 03m 25s)
* 03:02 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 03:02 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 03:01 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 02:41 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 02:39 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 02:39 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 02:38 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 02:38 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance
* 02:37 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance
* 02:33 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 02:32 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 02:32 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 02:31 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply


== 2021-12-03 ==
== 2022-08-05 ==
* 20:29 cstone: revision changed from {{Gerrit|2c2e22cd}} to {{Gerrit|b82183b9}}
* 22:20 dcausse@deploy1002: Finished deploy [wikimedia/discovery/analytics@71fe016]: Fix schedule_interval for image_recommendation_weekly (duration: 02m 01s)
* 17:56 andrew@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1028.eqiad.wmnet with OS buster
* 22:18 dcausse@deploy1002: Started deploy [wikimedia/discovery/analytics@71fe016]: Fix schedule_interval for image_recommendation_weekly
* 17:47 andrew@cumin1001: START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster
* 17:08 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1195.eqiad.wmnet with OS bullseye
* 17:47 andrew@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1028.eqiad.wmnet with OS buster
* 16:54 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1194.eqiad.wmnet with OS bullseye
* 17:35 andrew@cumin1001: START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster
* 16:53 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1195.eqiad.wmnet with reason: host reimage
* 17:35 andrew@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1028.eqiad.wmnet with OS buster
* 16:49 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1195.eqiad.wmnet with reason: host reimage
* 17:35 razzi@cumin1001: END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
* 16:41 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1194.eqiad.wmnet with reason: host reimage
* 17:22 razzi@cumin1001: START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
* 16:37 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1194.eqiad.wmnet with reason: host reimage
* 16:56 andrew@cumin1001: START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster
* 16:34 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1195.eqiad.wmnet with OS bullseye
* 16:56 andrew@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1028.eqiad.wmnet with OS buster
* 16:27 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp203[56]\.codfw\.wmnet,service=varnish-fe
* 16:44 andrew@cumin1001: START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster
* 16:27 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp203[56]\.codfw\.wmnet,service=ats-be
* 16:42 andrew@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1028.eqiad.wmnet with OS buster
* 16:27 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp203[56]\.codfw\.wmnet,service=ats-tls
* 16:42 andrew@cumin1001: START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster
* 16:26 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1194.eqiad.wmnet with OS bullseye
* 16:39 andrew@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1028.eqiad.wmnet with OS buster
* 16:25 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1193.eqiad.wmnet with OS bullseye
* 16:39 andrew@cumin1001: START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster
* 16:21 pt1979@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host db1192.eqiad.wmnet with OS bullseye
* 14:25 jelto@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host gitlab-runner2001.codfw.wmnet
* 16:12 dcausse@deploy1002: Finished deploy [wikimedia/discovery/analytics@8489923]: [[phab:T304954|T304954]]: Automate imagesuggestion imports (duration: 02m 03s)
* 14:10 jelto@cumin1001: START - Cookbook sre.ganeti.makevm for new host gitlab-runner2001.codfw.wmnet
* 16:11 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1193.eqiad.wmnet with reason: host reimage
* 12:53 moritzm: installing nss security updates on stretch
* 16:11 milimetric@deploy1002: Finished deploy [analytics/refinery@fe7bf9e]: Hotfix for webrequest load refine, now with FORCE :) (duration: 06m 09s)
* 12:37 moritzm: draining primary/secondary instances off ganeti2007 [[phab:T296622|T296622]]
* 16:10 dcausse@deploy1002: Started deploy [wikimedia/discovery/analytics@8489923]: [[phab:T304954|T304954]]: Automate imagesuggestion imports
* 12:33 jmm@cumin2002: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2022.codfw.wmnet to ganeti01.svc.codfw.wmnet
* 16:07 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1193.eqiad.wmnet with reason: host reimage
* 12:33 jmm@cumin2002: START - Cookbook sre.ganeti.addnode for new host ganeti2022.codfw.wmnet to ganeti01.svc.codfw.wmnet
* 16:07 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1192.eqiad.wmnet with reason: host reimage
* 12:30 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet
* 16:05 milimetric@deploy1002: Started deploy [analytics/refinery@fe7bf9e]: Hotfix for webrequest load refine, now with FORCE :)
* 12:26 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet
* 16:04 milimetric@deploy1002: Finished deploy [analytics/refinery@fe7bf9e]: Hotfix for webrequest load refine (duration: 34m 38s)
* 12:13 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2022.codfw.wmnet with OS buster
* 16:03 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1192.eqiad.wmnet with reason: host reimage
* 11:30 jmm@cumin2002: START - Cookbook sre.hosts.reimage for host ganeti2022.codfw.wmnet with OS buster
* 15:55 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1193.eqiad.wmnet with OS bullseye
* 11:27 jmm@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2011.codfw.wmnet with OS buster
* 15:52 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1191.eqiad.wmnet with OS bullseye
* 11:08 jmm@cumin2002: START - Cookbook sre.hosts.reimage for host ganeti2011.codfw.wmnet with OS buster
* 15:51 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1192.eqiad.wmnet with OS bullseye
* 11:06 jynus: stop and shutdown db1102 [[phab:T296546|T296546]]
* 15:42 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1190.eqiad.wmnet with OS bullseye
* 11:01 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2011.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage
* 15:38 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1191.eqiad.wmnet with reason: host reimage
* 11:01 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2011.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage
* 15:34 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1191.eqiad.wmnet with reason: host reimage
* 09:38 moritzm: draining primary/secondary instances off ganeti2011 [[phab:T296622|T296622]]
* 15:30 milimetric@deploy1002: Started deploy [analytics/refinery@fe7bf9e]: Hotfix for webrequest load refine
* 09:25 jmm@cumin2002: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2009.codfw.wmnet to ganeti01.svc.codfw.wmnet
* 15:28 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1190.eqiad.wmnet with reason: host reimage
* 09:24 jmm@cumin2002: START - Cookbook sre.ganeti.addnode for new host ganeti2009.codfw.wmnet to ganeti01.svc.codfw.wmnet
* 15:25 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1190.eqiad.wmnet with reason: host reimage
* 09:23 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2009.codfw.wmnet
* 15:24 jbond: upload trapperkeeper-metrics-clojure to puppet7 component
* 09:18 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host ganeti2009.codfw.wmnet
* 15:22 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1191.eqiad.wmnet with OS bullseye
* 09:15 marostegui@cumin1001: dbctl commit (dc=all): 'After maintenance db1161 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18019 and previous config saved to /var/cache/conftool/dbconfig/20211203-091537-marostegui.json
* 15:19 jbond: upload puppetlabs-http-client-clojure to puppet7 component
* 09:00 marostegui@cumin1001: dbctl commit (dc=all): 'After maintenance db1161', diff saved to https://phabricator.wikimedia.org/P18018 and previous config saved to /var/cache/conftool/dbconfig/20211203-090033-marostegui.json
* 15:17 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 08:58 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2009.codfw.wmnet with OS buster
* 15:16 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 08:45 marostegui@cumin1001: dbctl commit (dc=all): 'After maintenance db1161', diff saved to https://phabricator.wikimedia.org/P18017 and previous config saved to /var/cache/conftool/dbconfig/20211203-084528-marostegui.json
* 15:16 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 08:44 oblivian@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 15:15 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 08:43 oblivian@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 15:14 dancy@deploy1002: Finished scap: Backport for [[gerrit:820653]] scap gitignore: ignore all files under the `scap` directory (duration: 04m 41s)
* 08:30 marostegui@cumin1001: dbctl commit (dc=all): 'After maintenance db1161 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18016 and previous config saved to /var/cache/conftool/dbconfig/20211203-083023-marostegui.json
* 15:11 jbond: upload jolokia to puppet7 component
* 08:30 jmm@cumin2002: START - Cookbook sre.hosts.reimage for host ganeti2009.codfw.wmnet with OS buster
* 15:10 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1185.eqiad.wmnet with OS bullseye
* 08:29 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1161 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18015 and previous config saved to /var/cache/conftool/dbconfig/20211203-082859-marostegui.json
* 15:09 dancy@deploy1002: Started scap: Backport for [[gerrit:820653]] scap gitignore: ignore all files under the `scap` directory
* 08:28 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db[1154,1161].eqiad.wmnet with reason: Maintenance [[phab:T277354|T277354]]
* 15:09 jbond: upload test-chuck-clojure to puppet7 component
* 08:28 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db[1154,1161].eqiad.wmnet with reason: Maintenance [[phab:T277354|T277354]]
* 15:05 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1190.eqiad.wmnet with OS bullseye
* 08:28 marostegui@cumin1001: dbctl commit (dc=all): 'After maintenance db1110 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18014 and previous config saved to /var/cache/conftool/dbconfig/20211203-082848-marostegui.json
* 15:04 jbond: upload test-check-clojure to puppet7 component
* 08:13 marostegui@cumin1001: dbctl commit (dc=all): 'After maintenance db1110', diff saved to https://phabricator.wikimedia.org/P18013 and previous config saved to /var/cache/conftool/dbconfig/20211203-081343-marostegui.json
* 14:57 jbond: upload nippy-clojure to puppet7 component
* 07:58 marostegui@cumin1001: dbctl commit (dc=all): 'After maintenance db1110', diff saved to https://phabricator.wikimedia.org/P18012 and previous config saved to /var/cache/conftool/dbconfig/20211203-075839-marostegui.json
* 14:56 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1185.eqiad.wmnet with reason: host reimage
* 07:43 marostegui@cumin1001: dbctl commit (dc=all): 'After maintenance db1110 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18011 and previous config saved to /var/cache/conftool/dbconfig/20211203-074334-marostegui.json
* 14:52 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1185.eqiad.wmnet with reason: host reimage
* 07:39 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1110 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18010 and previous config saved to /var/cache/conftool/dbconfig/20211203-073910-marostegui.json
* 14:43 jbond: upload fressian to puppet7 component
* 07:39 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1110.eqiad.wmnet with reason: Maintenance [[phab:T277354|T277354]]
* 14:40 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1185.eqiad.wmnet with OS bullseye
* 07:39 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on db1110.eqiad.wmnet with reason: Maintenance [[phab:T277354|T277354]]
* 14:40 jbond: upload test-generative-clojure to puppet7 component
* 07:34 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1150.eqiad.wmnet with reason: Maintenance [[phab:T277354|T277354]]
* 14:35 pt1979@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 07:34 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on db1150.eqiad.wmnet with reason: Maintenance [[phab:T277354|T277354]]
* 14:34 jbond: upload data-generators-clojure to puppet7 component
* 07:34 marostegui@cumin1001: dbctl commit (dc=all): 'After maintenance db1144:3315 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18009 and previous config saved to /var/cache/conftool/dbconfig/20211203-073404-marostegui.json
* 14:31 pt1979@cumin2002: START - Cookbook sre.dns.netbox
* 07:19 marostegui@cumin1001: dbctl commit (dc=all): 'After maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P18008 and previous config saved to /var/cache/conftool/dbconfig/20211203-071900-marostegui.json
* 14:23 jbond: upload encore-clojure to puppet7 component
* 07:03 marostegui@cumin1001: dbctl commit (dc=all): 'After maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P18007 and previous config saved to /var/cache/conftool/dbconfig/20211203-070355-marostegui.json
* 14:17 jbond: upload truss-clojure to puppet7 component
* 06:48 marostegui@cumin1001: dbctl commit (dc=all): 'After maintenance db1144:3315 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18006 and previous config saved to /var/cache/conftool/dbconfig/20211203-064850-marostegui.json
* 14:13 jbond: upload structured-logging-clojure to puppet7 component
* 06:28 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1125.eqiad.wmnet with OS bullseye
* 14:06 jbond: upload murphy-clojure to puppet7 component
* 06:20 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1144:3315 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18005 and previous config saved to /var/cache/conftool/dbconfig/20211203-062019-marostegui.json
* 13:57 jbond: upload logstash-logback-encoder-7.2 to puppet7 component
* 06:20 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1144.eqiad.wmnet with reason: Maintenance [[phab:T277354|T277354]]
* 13:49 jbond: upload kitchensink-clojure to puppet7 component
* 06:20 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on db1144.eqiad.wmnet with reason: Maintenance [[phab:T277354|T277354]]
* 13:27 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool hosts with fragile power supply ([[phab:T314559|T314559]] [[phab:T314628|T314628]])', diff saved to https://phabricator.wikimedia.org/P32292 and previous config saved to /var/cache/conftool/dbconfig/20220805-132709-ladsgroup.json
* 06:20 marostegui@cumin1001: dbctl commit (dc=all): 'After maintenance db1113:3315 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18004 and previous config saved to /var/cache/conftool/dbconfig/20211203-062011-marostegui.json
* 13:12 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on db2095.codfw.wmnet with reason: Maintenance
* 06:05 marostegui@cumin1001: dbctl commit (dc=all): 'After maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P18003 and previous config saved to /var/cache/conftool/dbconfig/20211203-060506-marostegui.json
* 13:12 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on db2095.codfw.wmnet with reason: Maintenance
* 06:02 marostegui@cumin1001: START - Cookbook sre.hosts.reimage for host db1125.eqiad.wmnet with OS bullseye
* 13:09 sukhe: repool codfw
* 05:50 marostegui@cumin1001: dbctl commit (dc=all): 'After maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P18002 and previous config saved to /var/cache/conftool/dbconfig/20211203-055001-marostegui.json
* 13:02 jbond: upload honeysql-clojure to puppet7 component
* 05:34 marostegui@cumin1001: dbctl commit (dc=all): 'After maintenance db1113:3315 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18001 and previous config saved to /var/cache/conftool/dbconfig/20211203-053457-marostegui.json
* 12:53 _joe_: progressive repool of services in codfw
* 05:30 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1113:3315 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P18000 and previous config saved to /var/cache/conftool/dbconfig/20211203-053032-marostegui.json
* 12:24 moritzm: installing nano bugfix updates from bullseye point release
* 05:30 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1113.eqiad.wmnet with reason: Maintenance [[phab:T277354|T277354]]
* 11:50 hnowlan@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
* 05:30 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on db1113.eqiad.wmnet with reason: Maintenance [[phab:T277354|T277354]]
* 11:40 hnowlan@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
* 01:39 pt1979@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2025.codfw.wmnet with OS buster
* 11:37 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repool after PDU maint on D3 ([[phab:T310146|T310146]])', diff saved to https://phabricator.wikimedia.org/P32291 and previous config saved to /var/cache/conftool/dbconfig/20220805-113729-ladsgroup.json
* 01:06 pt1979@cumin2002: START - Cookbook sre.hosts.reimage for host restbase2025.codfw.wmnet with OS buster
* 11:35 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repool after PDU maint on C6 ([[phab:T310145|T310145]])', diff saved to https://phabricator.wikimedia.org/P32290 and previous config saved to /var/cache/conftool/dbconfig/20220805-113555-ladsgroup.json
* 01:05 pt1979@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2024.codfw.wmnet with OS buster
* 11:34 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repool after PDU maint on C5 ([[phab:T310145|T310145]])', diff saved to https://phabricator.wikimedia.org/P32289 and previous config saved to /var/cache/conftool/dbconfig/20220805-113436-ladsgroup.json
* 01:01 tgr: UTC late deploys done
* 10:46 hnowlan@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
* 01:00 tgr@deploy1002: Synchronized php-1.38.0-wmf.9/extensions/GrowthExperiments: Backport: [[gerrit:743177{{!}}Add an image: Add test version of GEInfoboxTemplates (T291232)]] (duration: 00m 57s)
* 10:36 hnowlan@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
* 00:44 rzl: rzl@apt1001:~$ sudo -i reprepro -C main include buster-wikimedia /home/rzl/python3-imagecatalog/imagecatalog_0.0.1-1_amd64.changes
* 10:17 hnowlan@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
* 00:37 tgr@deploy1002: Synchronized php-1.38.0-wmf.9/extensions/GrowthExperiments/includes: Backport: [[gerrit:743178{{!}}Avoid references to TemplateCollectionFeature]] step2 (duration: 00m 56s)
* 10:12 Amir1: dbmaint at s4@codfw ([[phab:T312863|T312863]])
* 00:36 tgr@deploy1002: Synchronized php-1.38.0-wmf.9/extensions/GrowthExperiments/includes/Config/Validation/GrowthConfigValidation.php: Backport: [[gerrit:743178{{!}}Avoid references to TemplateCollectionFeature]] step 1 (duration: 00m 56s)
* 10:07 hnowlan@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
* 00:33 pt1979@cumin2002: START - Cookbook sre.hosts.reimage for host restbase2024.codfw.wmnet with OS buster
* 09:04 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 12 hosts with reason: Maintenance
* 09:03 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 12 hosts with reason: Maintenance
* 09:03 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance
* 09:03 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance
* 00:53 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8 days, 0:00:00 on gerrit2001.wikimedia.org with reason: decom, replaced by gerrit2002
* 00:53 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 8 days, 0:00:00 on gerrit2001.wikimedia.org with reason: decom, replaced by gerrit2002
* 00:53 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for gerrit2002.wikimedia.org
* 00:53 dzahn@cumin1001: START - Cookbook sre.hosts.remove-downtime for gerrit2002.wikimedia.org
* 00:52 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8 days, 0:00:00 on gerrit2002.wikimedia.org with reason: decom, replaced by gerrit2002
* 00:52 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 8 days, 0:00:00 on gerrit2002.wikimedia.org with reason: decom, replaced by gerrit2002
* 00:18 mutante: restarting gerrit for config change - removing old replica [[phab:T313250|T313250]]


== 2021-12-02 ==
== 2022-08-04 ==
* 20:05 legoktm: re-pooling mw1414 following testing
* 23:07 mutante: switching gerrit-replica.wikimedia.org to new machine gerrit2002, dropping gerrit-replica-new.wikimedia.org [[phab:T313250|T313250]]
* 19:35 legoktm: installing yaml PHP extension on canaries
* 21:07 ryankemper@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 19:29 andrewbogott: upgrading wikitech-static deb packages as well as moving to mediawiki 1.37.0
* 20:59 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 19:26 majavah: UTC evening deploys done
* 20:57 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 19:26 taavi@deploy1002: Synchronized php-1.38.0-wmf.9/extensions/WikimediaEvents/modules/ext.wikimediaEvents/webUIScroll.js:
* 20:57 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:56 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:56 thcipriani@deploy1002: Finished scap: Backport for [[gerrit:819774]] tkwiki: Update wordmark (duration: 06m 12s)
* 20:51 ryankemper@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 20:51 ryankemper@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 20:51 ryankemper@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 20:50 thcipriani@deploy1002: Started scap: Backport for [[gerrit:819774]] tkwiki: Update wordmark
* 20:48 thcipriani@deploy1002: Finished scap: Backport for [[gerrit:812391]] [config]: Add click event logging for mobile and desktop (duration: 39m 16s)
* 20:45 ryankemper@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 20:24 ryankemper@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 20:23 ryankemper@deploy1002: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
* 20:22 ryankemper@deploy1002: helmfile [staging] START helmfile.d/


== 2021-12-01 ==
== 2022-08-03 ==
* 22:15 otto@deploy1002: Finished deploy [airflow-dags/analytics@2f59257] (hadoop-test): (no justification provided) (duration: 00m 07s)
* 23:59 dzahn@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit1001.wikimedia.org with reason: service restart
* 22:15 otto@deploy1002: Started deploy [airflow-dags/analytics@2f59257] (hadoop-test): (no justification provided)
* 23:50 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32270 and previous config saved to /var/cache/conftool/dbconfig/20220803-235030-marostegui.json
* 22:13 otto@deploy1002: Finished deploy [airflow-dags/analytics@2f59257] (hadoop-test): (no justification provided) (duration: 00m 07s)
* 22:50 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1170:3312 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32269 and previous config saved to /var/cache/conftool/dbconfig/20220803-225015-marostegui.json
* 22:13 otto@deploy1002: Started deploy [airflow-dags/analytics@2f59257] (hadoop-test): (no justification provided)
* 22:50 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
* 22
* 22:49 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
* 22:49 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 9 hosts with reason: Maintenance
* 22:49 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on 9 hosts with reason: Maintenance
* 22:49 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance
* 22:49 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance
* 22:48 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
* 22:48 marostegui@cumin1001: START - Cookbook


== 2021-11-30 ==
== 2022-08-02 ==
* 23:59 dzahn@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' .
* 22:39 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 23:57 mutante: deploy1002 - kube_env miscweb staging ; helmfile -e staging destroy
* 22:32 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 23:56 dzahn@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' .
* 22:32 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 23:46 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 22:25 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 23:45 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 22:15 mutante: gerrit - syncing data (/srv/gerrit /var/lib/gerrit2/review_site  /home) again after gerrit2002 was reimaged with buster [[phab:T313250|T313250]] [[phab:T313972|T313972]]
* 23:09 mutante: gerrit - added Majavah to wmf-deployment group for [[phab:T296777|T296777]]
* 22:04 dancy@deploy1002: Finished deploy [gerrit/gerrit@94c5028]: (no justification provided) (duration: 00m 06s)
* 22:30 krinkle@deploy1002: Finished deploy [integration/docroot@2af7007]: {{Gerrit|Ia89b6591639e5}} (duration: 00m 09s)
* 22:04 dancy@deploy1002: Started deploy [gerrit/gerrit@94c5028]: (no justification provided)
* 22:30 krinkle@deploy1002: Started deploy [integration/docroot@2af7007]: {{Gerrit|Ia89b6591639e5}}
* 22:00 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 22:21 mutante: welcome Majavah to MediaWiki deployers ([[phab:T296777|T296777]])
* 21:59 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:09 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|5443b78f197b782238632966891d721859733a74}}: uzwiki: Deploy Growth features to newcomers ([[phab:T294245|T294245]]) (duration: 00m 57s)
* 21:59 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 18:09 legoktm: uploaded php-yaml for component/php72 ([[phab:T296331|T296331]])
* 21:58 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 18:08 vgutierrez: restart haproxy on cp3064 - [[phab:T290005|T290005]]
* 21:58 dancy@deploy1002: rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.23  refs [[phab:T308076|T308076]]
* 17:44 jynus@cumin1001: dbctl commit (dc=all): 'Repool db1163 fully', diff saved to https://phabricator.wikimedia.org/P17912 and previous config saved to /var/cache/conftool/dbconfig/20211130-174434-jynus.json
* 21:53 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 17:39 jynus@cumin1001: dbctl commit (dc=all): 'Repool db1163 at 50%', diff saved to https://phabricator.wikimedia.org/P17911 and previous config saved to /var/cache/conftool/dbconfig/20211130-173935-jynus.json
* 21:47 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 17:35 jynus@cumin1001: dbctl commit (dc=all): 'Repool db1163 at 25%', diff saved to https://phabricator.wikimedia.org/P17910 and previous config saved to /var/cache/conftool/dbconfig/20211130-173517-jynus.json
* 21:46 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 17:34 moritzm: installing libvorbis security updates
* 21:40 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 17:15 jynus@cumin1001: dbctl commit (dc=all): 'Repool db1163 at 5%', diff saved to https://phabricator.wikimedia.org/P17908 and previous config saved to /var/cache/conftool/dbconfig/20211130-171550-jynus.json
* 21:35 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 17:00 jynus: move db1139:s1 under db1118
* 21:29 ladsgroup@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/CirrusSearch/includes/Sanity/Checker.php: Backport: [[gerrit:819621{{!}}Fix appending of join conds (T312421 T314439)]] (duration: 03m 15s)
* 16:57 jynus@cumin1001: dbctl commit (dc=all): 'Depool db1163 fully', diff saved to https://phabricator.wikimedia.org/P17907 and previous config saved to /var/cache/conftool/dbconfig/20211130-165718-jynus.json
* 21:28 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 16:29 XioNoX: Move cr2-codfw lumen transit link to BO cable - [[phab:T289241|T289241]]
* 21:28 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 16:26 XioNoX: Move cr2-codfw eqord link to BO cable - [[phab:T289241|T289241]]
* 21:27 bking@cumin1001: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: deploy wmf-elasticsearch-search-plugins pkg - bking@cumin1001 - [[phab:T314078|T314078]]
* 16:23 XioNoX: Move cr2-codfw pfw3 link to BO cable - [[phab:T289241|T289241]]
* 21:21 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 16:20 Emperor: reboot ms-be2059 to fix device enumeration order re [[phab:T295563|T295563]]
* 21:11 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gerrit2002.wikimedia.org with OS buster
* 16:14 jynus@cumin1001: dbctl commit (dc=all): 'Depool db1163 at 25%', diff saved to https://phabricator.wikimedia.org/P17906 and previous config saved to /var/cache/conftool/dbconfig/20211130-161457-jynus.json
* 21:01 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 16:13 XioNoX: cr2-codfw bounce fpc 1 pic 0 (vrrp backup) - [[phab:T289241|T289241]]
* 21:01 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 16:07 jynus@cumin1001: dbctl commit (dc=all): 'Depool db1163 at 50%', diff saved to https://phabricator.wikimedia.org/P17905 and previous config saved to /var/cache/conftool/dbconfig/20211130-160748-jynus.json
* 21:00 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 16:06 bblack: lvs2007 - repooling into service
* 20:59 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 16:01 bblack: lvs2007 - depooling for network maint - do not push LVS config changes please!
* 20:58 dancy@deploy1002: rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.22  refs [[phab:T308076|T308076]]
* 15:41 jbond@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts puppetboard2001.codfw.wmnet
* 20:54 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 15:41 jbond@cumin1001: START - Cookbook sre.hosts.decommission for hosts puppetboard2001.codfw.wmnet
* 20:53 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 15:38 jbond@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts puppetboard2001.codfw.wmnet
* 20:53 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 15:37 jbond@cumin1001: START - Cookbook sre.hosts.decommission for hosts puppetboard2001.codfw.wmnet
* 20:53 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gerrit2002.wikimedia.org with reason: host reimage
* 15:32 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:52 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 15:29 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:51 dzahn@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on gerrit2002.wikimedia.org with reason: host reimage
* 15:23 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:50 dancy@deploy1002: rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.23  refs [[phab:T308076|T308076]]
* 15:22 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:38 mutante: re-imaging gerrit2002 with buster - because it's on bullseye, needs git-fat and that has not been ported to python3 yet which blocks upgrading gerrit machines otherwise [[phab:T313250|T313250]] [[phab:T243027|T243027]] [[phab:T279509|T279509]]
* 15:16 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:37 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 15:15 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:36 dzahn@cumin2002: START - Cookbook sre.hosts.reimage for host gerrit2002.wikimedia.org with OS buster
* 15:12 jforrester@deploy1002: Synchronized multiversion/MWMultiVersion.php: Add wikifunctions hard-coded value to setSiteInfoForWiki for Beta Cluster [[phab:T284162|T284162]] (duration: 00m 56s)
* 20:36 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 15:09 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:36 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 15:08 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:36 urbanecm: UTC evening B&C window done
* 13:45 elukey@cumin1001: END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade.
* 20:35 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:25 elukey@cumin1001: START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade.
* 20:33 urbanecm@deploy1002: Synchronized php-1.39.0-wmf.23/includes/Rest/Handler/HTMLTransformInput.php: {{Gerrit|69e91528a5c6f372af520307dc2f4227b9981442}}: ParsoidHandler: fix page bundle input with no orig HTML (duration: 03m 22s)
* 13:11 marostegui@cumin1001: dbctl commit (dc=all): 'After maintenance db2114 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P17904 and previous config saved to /var/cache/conftool/dbconfig/20211130-131124-marostegui.json
* 20:30 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:05 topranks: Running homer against CR routers to adjust loopback4 filter enabling local NTP queries for status.  [[phab:T296623|T296623]]
* 20:29 urbanecm@deploy1002: Synchronized php-1.39.0-wmf.23/includes/Rest/Handler/ParsoidHandler.php: {{Gerrit|322a960e3777bc01fa8823908340c36e3851a648}}: ParsoidHandler: pass metrics object to HTMLTransformInput (duration: 03m 19s)
* 12:56 marostegui@cumin1001: dbctl commit (dc=all): 'After maintenance db2114 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P17903 and previous config saved to /var/cache/conftool/dbconfig/20211130-125620-marostegui.json
* 20:27 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 12:41 marostegui@cumin1001: dbctl commit (dc=all): 'After maintenance db2114 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P17902 and previous config saved to /var/cache/conftool/dbconfig/20211130-124115-marostegui.json
* 20:27 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 12:26 marostegui@cumin1001: dbctl commit (dc=all): 'After maintenance db2114 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P17901 and previous config saved to /var/cache/conftool/dbconfig/20211130-122610-marostegui.json
* 20:22 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 12:25 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db2114 ([[phab:T277354|T277354]])', diff saved to https://phabricator.wikimedia.org/P17900 and previous config saved to /var/cache/conftool/dbconfig/20211130-122555-marostegui.json
* 20:20 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|5fac0aaf8e76a6f8cc3302771eac068e4f866e5f}}: GrowthExperiments: Remove wgGEHomepageTutorialTitle (duration: 03m 26s)
* 12:25 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2114.codfw.wmnet with reason: Maintenance [[phab:T277354|T277354]]
* 20:06 dancy@deploy1002: Finished scap: Backport for [[gerrit:819612]] Revert "Bump wikimedia/parsoid to 0.16.0-a18" (duration: 11m 30s)
* 12:25 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on db2114.codfw.wmnet with reason: Maintenance [[phab:T277354|T277354]]
* 20:01 dancy@deploy1002: Finished deploy [gerrit/gerrit@94c5028]: (no justification provided) (duration: 00m 05s)
* 12:09 jbond@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts puppetboard1001.eqiad.wmnet
* 20:01 dancy@deploy1002: Started deploy [gerrit/gerrit@94c5028]: (no justification provided)
* 12:02 jbond@cumin1001: START - Cookbook sre.hosts.decommission for hosts puppetboard1001.eqiad.wmnet
* 19:59 dancy@deploy1002: Finished deploy [gerrit/gerrit@94c5028]: (no justification provided) (duration: 00m 01s)
* 11:50 moritzm: running "sudo gnt-cluster renew-crypto --new-node-certificates --new-rapi-certificate --new-spice-certificate" for Ganeti codfw cluster [[phab:T296622|T296622]]
* 19:59 dancy@deploy1002: Started deploy [gerrit/gerrit@94c5028]: (no justification provided)
* 11:01 hnowlan: restarting tilerator, kartotherian and tileratorui for updates in eqiad
* 19:55 dancy@deploy1002: Started scap: Backport for [[gerrit:819612]] Revert "Bump wikimedia/parsoid to 0.16.0-a18"
* 11:01 hnowlan: restarting tilerator, kartotherian and tileratorui in codfw
* 19:42 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 10:39 elukey: rollout wmf-certificates 0~20211129-1 fleet wide (add group/others permissions to the cert bundle)
* 19:37 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2034.codfw.wmnet,service=ats-tls
* 10:30 lucaswerkmeister-wmde@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'production' .
* 19:37 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2034.codfw.wmnet,service=varnish-fe
* 10:29 lucaswerkmeister-wmde@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'termbox' for release 'production' .
* 19:37 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2034.codfw.wmnet,service=ats-be
* 09:58 moritzm: installing remaining ICU security updates
* 19:36 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 09:06 Amir1: dropping wikiadmin@localhost from all pooled replicas of s6 ([[phab:T296511|T296511]])
* 19:36 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 08:24 dcausse: restarting blazegraph on wdqs1006 (jvm stuck for 6hours)
* 19:36 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2033.codfw.wmnet,service=ats-tls
* 08:14 Amir1: revoking DROP from wikiadmin on all pooled replicas ([[phab:T249683|T249683]])
* 19:36 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2033.codfw.wmnet,service=varnish-fe
* 03:46 ejegg: updated payments-wiki from {{Gerrit|dbc92132}} to {{Gerrit|4a4ef51d}}
* 19:36 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2033.codfw.wmnet,service=ats-be
* 02:05 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 19:36 mvernon@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be[2041,2046].codfw.wmnet
* 02:04 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 19:35 mvernon@cumin2002: START - Cookbook sre.hosts.remove-downtime for ms-be[2041,2046].codfw.wmnet
* 00:22 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 19:29 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 00:20 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 19:28 mvernon@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for thanos-fe2002.codfw.wmnet
* 00:17 catrope@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:742524{{!}}Enable scroll tracking for all users (T292586)]] (duration: 00m 55s)
* 19:28 mvernon@cumin2002: START - Cookbook sre.hosts.remove-downtime for thanos-fe2002.codfw.wmnet
* 00:14 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 19:26 mvernon@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-fe2010.codfw.wmnet
* 00:14 catrope@deploy1002: Synchronized php-1.38.0-wmf.9/extensions/WikimediaEvents/modules/ext.wikimediaEvents/readingDepth.js: Backport: [[gerrit:742517{{!}}Provide fallback for config variable when not present]] (duration: 00m 55s)
* 19:26 mvernon@cumin2002: START - Cookbook sre.hosts.remove-downtime for ms-fe2010.codfw.wmnet
* 00:13 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 19:21 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2032.codfw.wmnet,service=ats-tls
* 00:13 catrope@deploy1002: Synchronized wmf-config/flaggedrevs.php: Config: [[gerrit:738530{{!}}allow sysops to set/remove reviewer group on ckbwiki (T294696)]] (duration: 00m 55s)
* 19:21 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2032.codfw.wmnet,service=varnish-fe
* 19:21 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2032.codfw.wmnet,service=ats-be
* 19:17 rzl@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on mc2038.codfw.wmnet with reason: install
* 19:17 rzl@cumin2002: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on mc2038.codfw.wmnet with reason: install
* 19:13 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2031.codfw.wmnet,service=ats-tls
* 19:13 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2031.codfw.wmnet,service=varnish-fe
* 19:13 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2031.codfw.wmnet,service=ats-be
* 19:11 mutante: gerrit1001 - rsyncing /home/ to gerrit2002:/srv/home-gerrit1001.wikimedia.org [[phab:T313250|T313250]]
* 19:01 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on gerrit2002.wikimedia.org with reason: new machine
* 19:01 dzahn@cumin2002: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on gerrit2002.wikimedia.org with reason: new machine
* 18:55 dancy@deploy1002: Finished scap: testwikis wikis to 1.39.0-wmf.23  refs [[phab:T308076|T308076]] (duration: 50m 39s)
* 18:54 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 18:52 ejegg: updated payments-wiki from {{Gerrit|589bb64e}} to {{Gerrit|e1b6036a}} (just i18n changes in extensions)
* 18:47 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 18:47 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 18:46 bking@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: deploy wmf-elasticsearch-search-plugins pkg - bking@cumin1001 - [[phab:T314078|T314078]]
* 18:46 rzl@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mc2038.codfw.wmnet with reason: install
* 18:45 rzl@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on mc2038.codfw.wmnet with reason: install
* 18:41 rzl@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2038.codfw.wmnet
* 18:41 rzl@cumin2002: START - Cookbook sre.hosts.remove-downtime for mc2038.codfw.wmnet
* 18:39 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 18:19 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 18:18 rzl@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2038.codfw.wmnet with reason: install
* 18:18 rzl@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on mc2038.codfw.wmnet with reason: install
* 18:17 rzl@cumin2002: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2038.codfw.wmnet with reason: install
* 18:17 rzl@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2038.codfw.wmnet with reason: install
* 18:17 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 18:17 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 18:16 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lvs2008.codfw.wmnet with reason: shutdown for PDU upgrade
* 18:16 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on lvs2008.codfw.wmnet with reason: shutdown for PDU upgrade
* 18:16 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 18:11 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 18:08 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 18:08 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 18:05 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 18:04 dancy@deploy1002: Started scap: testwikis wikis to 1.39.0-wmf.23  refs [[phab:T308076|T308076]]
* 17:52 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1168 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32185 and previous config saved to /var/cache/conftool/dbconfig/20220802-175233-marostegui.json
* 17:43 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool db2159', diff saved to https://phabricator.wikimedia.org/P32184 and previous config saved to /var/cache/conftool/dbconfig/20220802-174311-ladsgroup.json
* 17:37 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P32183 and previous config saved to /var/cache/conftool/dbconfig/20220802-173723-marostegui.json
* 17:35 moritzm: installing node-moment security updates
* 17:32 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic[2041-2042,2057].codfw.wmnet with reason: [[phab:T310070|T310070]]
* 17:32 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic[2041-2042,2057].codfw.wmnet with reason: [[phab:T310070|T310070]]
* 17:27 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2013.codfw.wmnet
* 17:25 moritzm: installing fribidi security updates
* 17:22 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P32182 and previous config saved to /var/cache/conftool/dbconfig/20220802-172217-marostegui.json
* 17:20 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2030.codfw.wmnet,service=ats-tls
* 17:20 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2030.codfw.wmnet,service=varnish-fe
* 17:20 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2030.codfw.wmnet,service=ats-be
* 17:18 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host ganeti2013.codfw.wmnet
* 17:07 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1168 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32181 and previous config saved to /var/cache/conftool/dbconfig/20220802-170711-marostegui.json
* 17:06 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc[2042-2043].codfw.wmnet with reason: shutdown for PDU upgrade
* 17:06 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on mc[2042-2043].codfw.wmnet with reason: shutdown for PDU upgrade
* 17:05 Emperor: ms-be20[31,32,41,46].codfw.wmnet,ms-fe2010.codfw.wmnet,thanos-fe2002.codfw.wmnet downtime for PDU work [[phab:T309957|T309957]]
* 17:05 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1168 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32180 and previous config saved to /var/cache/conftool/dbconfig/20220802-170503-marostegui.json
* 17:04 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance
* 17:04 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance
* 17:04 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
* 17:04 mvernon@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: shutdown for PDU replacement
* 17:04 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
* 17:04 mvernon@cumin2002: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: shutdown for PDU replacement
* 17:04 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance
* 17:03 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance
* 17:03 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance
* 17:03 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance
* 17:03 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1131 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32179 and previous config saved to /var/cache/conftool/dbconfig/20220802-170333-marostegui.json
* 17:01 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2029.codfw.wmnet,service=ats-tls
* 17:01 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2029.codfw.wmnet,service=varnish-fe
* 17:01 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2029.codfw.wmnet,service=ats-be
* 17:00 mvernon@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be[2030,2045,2052].codfw.wmnet
* 17:00 mvernon@cumin2002: START - Cookbook sre.hosts.remove-downtime for ms-be[2030,2045,2052].codfw.wmnet
* 16:57 btullis@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host an-airflow1004.eqiad.wmnet
* 16:54 hnowlan@deploy1002: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync
* 16:53 hnowlan@deploy1002: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync
* 16:51 hnowlan@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
* 16:49 hnowlan@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
* 16:48 hnowlan@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
* 16:48 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P32178 and previous config saved to /var/cache/conftool/dbconfig/20220802-164827-marostegui.json
* 16:38 hnowlan@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
* 16:35 hnowlan@deploy1002: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: sync
* 16:35 hnowlan@deploy1002: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: sync
* 16:33 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P32177 and previous config saved to /var/cache/conftool/dbconfig/20220802-163321-marostegui.json
* 16:29 dancy@mwmaint1002: pull aborted:  (duration: 00m 07s)
* 16:25 rzl: rzl@stat1007:~$ sudo systemctl stop wmde-analytics-daily-early  # wedged, timer will restart it now with max_runtime_seconds
* 16:18 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1131 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32176 and previous config saved to /var/cache/conftool/dbconfig/20220802-161815-marostegui.json
* 16:16 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1131 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32175 and previous config saved to /var/cache/conftool/dbconfig/20220802-161607-marostegui.json
* 16:16 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance
* 16:15 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance
* 16:15 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1165 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32174 and previous config saved to /var/cache/conftool/dbconfig/20220802-161545-marostegui.json
* 16:10 btullis@cumin1001: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-airflow1004.eqiad.wmnet on all recursors
* 16:10 btullis@cumin1001: START - Cookbook sre.dns.wipe-cache an-airflow1004.eqiad.wmnet on all recursors
* 16:10 btullis@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:05 btullis@cumin1001: START - Cookbook sre.dns.netbox
* 16:05 btullis@cumin1001: START - Cookbook sre.ganeti.makevm for new host an-airflow1004.eqiad.wmnet
* 16:00 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P32173 and previous config saved to /var/cache/conftool/dbconfig/20220802-160039-marostegui.json
* 15:51 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2056.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:50 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2056.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:49 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2040.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:49 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2040.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:46 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2039.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:45 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2039.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:45 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P32172 and previous config saved to /var/cache/conftool/dbconfig/20220802-154533-marostegui.json
* 15:37 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc[2040-2041].codfw.wmnet with reason: shutdown for PDU upgrade
* 15:37 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on mc[2040-2041].codfw.wmnet with reason: shutdown for PDU upgrade
* 15:36 bking@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host elastic2037.codfw.wmnet
* 15:36 bking@cumin1001: START - Cookbook sre.hosts.reboot-single for host elastic2037.codfw.wmnet
* 15:30 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1165 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32171 and previous config saved to /var/cache/conftool/dbconfig/20220802-153027-marostegui.json
* 15:28 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1165 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32170 and previous config saved to /var/cache/conftool/dbconfig/20220802-152818-marostegui.json
* 15:28 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
* 15:27 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
* 15:27 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance
* 15:27 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance
* 15:27 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32169 and previous config saved to /var/cache/conftool/dbconfig/20220802-152740-marostegui.json
* 15:24 moritzm: installing gnupg2 security updates
* 15:15 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2024.codfw.wmnet with reason: shutdown for PDU upgrade
* 15:15 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on mc2024.codfw.wmnet with reason: shutdown for PDU upgrade
* 15:13 jbond@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetmaster1004.eqiad.wmnet with OS buster
* 15:12 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P32167 and previous config saved to /var/cache/conftool/dbconfig/20220802-151234-marostegui.json
* 15:10 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ganeti-test[2001-2003].codfw.wmnet with reason: Power down for PDU maintenance, [[phab:T310070|T310070]]
* 15:10 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 4:00:00 on ganeti-test[2001-2003].codfw.wmnet with reason: Power down for PDU maintenance, [[phab:T310070|T310070]]
* 15:08 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on thanos-be2001.codfw.wmnet with reason: pdu
* 15:08 root@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on thanos-be2001.codfw.wmnet with reason: pdu
* 15:07 mvernon@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be[2030,2045,2052].codfw.wmnet with reason: shutdown for PDU replacement
* 15:07 mvernon@cumin2002: START - Cookbook sre.hosts.downtime for 3:00:00 on ms-be[2030,2045,2052].codfw.wmnet with reason: shutdown for PDU replacement
* 15:06 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mc-gp2002.codfw.wmnet with reason: Power down for PDU maintenance, [[phab:T310070|T310070]]
* 15:06 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 4:00:00 on mc-gp2002.codfw.wmnet with reason: Power down for PDU maintenance, [[phab:T310070|T310070]]
* 15:04 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2037.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:04 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2037.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:01 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: shutdown for PDU upgrade
* 15:00 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: shutdown for PDU upgrade
* 14:59 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2025.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 14:59 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2025.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 14:58 oblivian@puppetmaster1001: conftool action : set/pooled=false; selector: dnsdisc=(appservers{{!}}api)-ro,name=codfw
* 14:57 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P32166 and previous config saved to /var/cache/conftool/dbconfig/20220802-145728-marostegui.json
* 14:54 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2060.codfw.wmnet with OS bullseye
* 14:53 jbond@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetmaster1004.eqiad.wmnet with reason: host reimage
* 14:50 moritzm: uploaded gnupg2 2.1.18-8~deb9u4+wmf1 to stretch-wikimedia
* 14:50 jbond@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on puppetmaster1004.eqiad.wmnet with reason: host reimage
* 14:42 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32164 and previous config saved to /var/cache/conftool/dbconfig/20220802-144222-marostegui.json
* 14:40 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1113:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32163 and previous config saved to /var/cache/conftool/dbconfig/20220802-144013-marostegui.json
* 14:40 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
* 14:39 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
* 14:39 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32162 and previous config saved to /var/cache/conftool/dbconfig/20220802-143952-marostegui.json
* 14:37 jbond@cumin1001: START - Cookbook sre.hosts.reimage for host puppetmaster1004.eqiad.wmnet with OS buster
* 14:32 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2060.codfw.wmnet with reason: host reimage
* 14:28 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2060.codfw.wmnet with reason: host reimage
* 14:24 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P32161 and previous config saved to /var/cache/conftool/dbconfig/20220802-142446-marostegui.json
* 14:23 Emperor: shutdown ms-be20[30,45,52] for PDU work [[phab:T309957|T309957]]
* 14:22 mvernon@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be[2030,2045,2052].codfw.wmnet with reason: shutdown for PDU replacement
* 14:21 mvernon@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on ms-be[2030,2045,2052].codfw.wmnet with reason: shutdown for PDU replacement
* 14:12 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic2060.codfw.wmnet with OS bullseye
* 14:09 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P32160 and previous config saved to /var/cache/conftool/dbconfig/20220802-140940-marostegui.json
* 14:05 jbond@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetmaster2004.codfw.wmnet with OS buster
* 14:04 godog: grow sda/sdb 3 by 100G on thanos-be1001 - [[phab:T314275|T314275]]
* 14:03 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on centrallog2002.codfw.wmnet with reason: pdu
* 14:03 root@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on centrallog2002.codfw.wmnet with reason: pdu
* 14:01 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on prometheus2005.codfw.wmnet with reason: pdu
* 14:01 root@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on prometheus2005.codfw.wmnet with reason: pdu
* 13:57 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2030.codfw.wmnet,service=ats-tls
* 13:57 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2032.codfw.wmnet,service=ats-be
* 13:57 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2031.codfw.wmnet,service=ats-be
* 13:56 godog: schedule poweroff for centrallog2002 at 16 utc - [[phab:T310070|T310070]]
* 13:54 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[3-4].codfw.wmnet,service=ats-be
* 13:54 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32159 and previous config saved to /var/cache/conftool/dbconfig/20220802-135435-marostegui.json
* 13:53 godog: depool and poweroff prometheus2005 - [[phab:T310070|T310070]]
* 13:53 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[3-4].codfw.wmnet,service=ats-tls
* 13:53 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[3-4].codfw.wmnet,service=ats-tls
* 13:53 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[3-4].codfw.wmnet,service=varnish-fe
* 13:52 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1098:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32158 and previous config saved to /var/cache/conftool/dbconfig/20220802-135226-marostegui.json
* 13:52 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
* 13:52 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[1-2].codfw.wmnet,service=ats-tls
* 13:51 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
* 13:51 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32157 and previous config saved to /var/cache/conftool/dbconfig/20220802-135155-marostegui.json
* 13:51 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[1-2].codfw.wmnet,service=ats-tls
* 13:51 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[1-2].codfw.wmnet,service=varnish-fe
* 13:50 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2030.codfw.wmnet,service=ats-be
* 13:50 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2030.codfw.wmnet,service=varnish-fe
* 13:50 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2030.codfw.wmnet,service=ats-be
* 13:50 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2029.codfw.wmnet,service=ats-tls
* 13:50 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2029.codfw.wmnet,service=varnish-fe
* 13:50 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2029.codfw.wmnet,service=ats-be
* 13:45 jbond@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetmaster2004.codfw.wmnet with reason: host reimage
* 13:42 jbond@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on puppetmaster2004.codfw.wmnet with reason: host reimage
* 13:42 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:42 Lucas_WMDE: UTC afternoon backport+config window done
* 13:41 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2013.codfw.wmnet with OS bullseye
* 13:41 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:41 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:40 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:754933{{!}}Enable usage tracking for statement for cebwiki (T296384)]] – expected to gradually increase number of wbc_entity_usage and probably recentchanges rows on cebwiki, but not too much, see task for details (duration: 03m 06s)
* 13:40 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:39 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2028.codfw.wmnet with OS bullseye
* 13:36 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P32156 and previous config saved to /var/cache/conftool/dbconfig/20220802-133648-marostegui.json
* 13:35 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:34 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/Wikibase.php: Config: [[gerrit:754937{{!}}Introduce $wmgEntityUsageModifierLimitsStatement (T296384)]] (2/2) (duration: 03m 21s)
* 13:34 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:34 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:33 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:31 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:754937{{!}}Introduce $wmgEntityUsageModifierLimitsStatement (T296384)]] (1/2) (duration: 03m 16s)
* 13:30 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ganeti2028.codfw.wmnet with reason: Power down for PDU maintenance, [[phab:T309957|T309957]]
* 13:30 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 4:00:00 on ganeti2028.codfw.wmnet with reason: Power down for PDU maintenance, [[phab:T309957|T309957]]
* 13:27 jbond@cumin2002: START - Cookbook sre.hosts.reimage for host puppetmaster2004.codfw.wmnet with OS buster
* 13:24 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2013.codfw.wmnet with reason: host reimage
* 13:24 vgutierrez: restarting ATS 9.x instances to apply https://gerrit.wikimedia.org/r/819585 - [[phab:T309651|T309651]]
* 13:23 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2028.codfw.wmnet with reason: host reimage
* 13:21 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P32155 and previous config saved to /var/cache/conftool/dbconfig/20220802-132142-marostegui.json
* 13:19 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2013.codfw.wmnet with reason: host reimage
* 13:19 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2028.codfw.wmnet with reason: host reimage
* 13:17 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:17 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:17 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:16 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:15 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|a4499e5ac23a0558bed276e2b74134590afc5c95}}: Revert "testwiki: Add mediawiki.web_ui.interactions stream" ([[phab:T314151|T314151]], [[phab:T311268|T311268]]) (duration: 03m 19s)
* 13:10 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:09 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:09 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:09 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|c2fb8a58d8f62e29a15ebee26198e79e4597d24c}}: Enable RealtimePreview on Group 0 wikis ([[phab:T314150|T314150]]) (duration: 03m 21s)
* 13:08 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:06 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32154 and previous config saved to /var/cache/conftool/dbconfig/20220802-130636-marostegui.json
* 13:04 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1096:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32153 and previous config saved to /var/cache/conftool/dbconfig/20220802-130428-marostegui.json
* 13:04 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance
* 13:04 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance
* 13:04 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
* 13:03 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
* 13:03 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1180 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32152 and previous config saved to /var/cache/conftool/dbconfig/20220802-130351-marostegui.json
* 13:02 jmm@cumin2002: START - Cookbook sre.hosts.reimage for host ganeti2013.codfw.wmnet with OS bullseye
* 13:00 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic2028.codfw.wmnet with OS bullseye
* 13:00 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2013.codfw.wmnet with reason: Remove node for eventual reimage, [[phab:T311686|T311686]]
* 12:59 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2013.codfw.wmnet with reason: Remove node for eventual reimage, [[phab:T311686|T311686]]
* 12:48 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P32151 and previous config saved to /var/cache/conftool/dbconfig/20220802-124845-marostegui.json
* 12:33 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P32150 and previous config saved to /var/cache/conftool/dbconfig/20220802-123338-marostegui.json
* 12:18 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1180 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32149 and previous config saved to /var/cache/conftool/dbconfig/20220802-121832-marostegui.json
* 12:16 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1180 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32148 and previous config saved to /var/cache/conftool/dbconfig/20220802-121624-marostegui.json
* 12:16 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance
* 12:15 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance
* 12:13 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 12:12 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 12:12 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 12:11 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 12:01 marostegui: dbmaint x1@eqiad [[phab:T314087|T314087]]
* 11:57 marostegui: dbmaint s7@eqiad [[phab:T314377|T314377]]
* 11:57 marostegui: dbmaint s3@eqiad [[phab:T314377|T314377]]
* 11:57 marostegui: dbmaint s8@eqiad [[phab:T314377|T314377]]
* 11:55 marostegui: dbmaint s8@eqiad [[phab:T314377|T314377]]
* 11:54 marostegui: dbmaint s3@eqiad [[phab:T314377|T314377]]
* 11:50 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
* 11:48 marostegui: dbmaint s7@eqiad [[phab:T314377|T314377]]
* 11:46 marostegui: dbmaint s4@eqiad [[phab:T314377|T314377]]
* 11:35 elukey: restart rsyslog on ml-serve1006
* 10:50 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on an-worker1082.eqiad.wmnet with reason: [[phab:T312626|T312626]] btullis
* 10:50 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on an-worker1082.eqiad.wmnet with reason: [[phab:T312626|T312626]] btullis
* 10:49 godog: grow sda3 by 100G on thanos-be2004 - [[phab:T314275|T314275]]
* 10:42 btullis@puppetmaster1001: conftool action : set/pooled=inactive; selector: cluster=wikireplicas-b,name=dbproxy1018.eqiad.wmnet
* 10:42 btullis@puppetmaster1001: conftool action : set/pooled=yes; selector: cluster=wikireplicas-b,name=dbproxy1019.eqiad.wmnet
* 10:35 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 10:34 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 10:34 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 10:34 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 10:33 marostegui@cumin1001: dbctl commit (dc=all): 'db1181 (re)pooling @ 100%: After restart', diff saved to https://phabricator.wikimedia.org/P32147 and previous config saved to /var/cache/conftool/dbconfig/20220802-103318-root.json
* 10:18 marostegui@cumin1001: dbctl commit (dc=all): 'db1181 (re)pooling @ 75%: After restart', diff saved to https://phabricator.wikimedia.org/P32146 and previous config saved to /var/cache/conftool/dbconfig/20220802-101813-root.json
* 10:15 marostegui@cumin1001: dbctl commit (dc=all): 'Add db2175 to s2 [[phab:T311494|T311494]]', diff saved to https://phabricator.wikimedia.org/P32145 and previous config saved to /var/cache/conftool/dbconfig/20220802-101522-marostegui.json
* 10:12 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1019.eqiad.wmnet with OS bullseye
* 10:05 jynus: shutdown dbprov2002 backup2005 backup2008 [[phab:T310070|T310070]]
* 10:03 marostegui@cumin1001: dbctl commit (dc=all): 'db1181 (re)pooling @ 50%: After restart', diff saved to https://phabricator.wikimedia.org/P32144 and previous config saved to /var/cache/conftool/dbconfig/20220802-100308-root.json
* 10:03 marostegui@cumin1001: dbctl commit (dc=all): 'db1174 (re)pooling @ 100%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32143 and previous config saved to /var/cache/conftool/dbconfig/20220802-100304-root.json
* 09:54 marostegui@cumin1001: dbctl commit (dc=all): 'Remove db2079 from dbctl [[phab:T313885|T313885]]', diff saved to https://phabricator.wikimedia.org/P32141 and previous config saved to /var/cache/conftool/dbconfig/20220802-095455-marostegui.json
* 09:52 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1019.eqiad.wmnet with reason: host reimage
* 09:49 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1019.eqiad.wmnet with reason: host reimage
* 09:49 btullis@cumin1001: END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons.
* 09:48 marostegui@cumin1001: dbctl commit (dc=all): 'db1181 (re)pooling @ 25%: After restart', diff saved to https://phabricator.wikimedia.org/P32140 and previous config saved to /var/cache/conftool/dbconfig/20220802-094804-root.json
* 09:47 marostegui@cumin1001: dbctl commit (dc=all): 'db1174 (re)pooling @ 75%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32139 and previous config saved to /var/cache/conftool/dbconfig/20220802-094759-root.json
* 09:44 godog: grow sdb3 by 100G on thanos-be2004 - [[phab:T314275|T314275]]
* 09:43 btullis@cumin1001: START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons.
* 09:42 btullis@cumin1001: END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons.
* 09:37 btullis@cumin1001: START - Cookbook sre.hosts.reimage for host dbproxy1019.eqiad.wmnet with OS bullseye
* 09:36 btullis@cumin1001: START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons.
* 09:33 marostegui@cumin1001: dbctl commit (dc=all): 'db1181 (re)pooling @ 10%: After restart', diff saved to https://phabricator.wikimedia.org/P32138 and previous config saved to /var/cache/conftool/dbconfig/20220802-093259-root.json
* 09:32 marostegui@cumin1001: dbctl commit (dc=all): 'db1174 (re)pooling @ 50%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32137 and previous config saved to /var/cache/conftool/dbconfig/20220802-093254-root.json
* 09:30 btullis@puppetmaster1001: conftool action : set/pooled=no; selector: cluster=wikireplicas-b,name=dbproxy1019.eqiad.wmnet
* 09:30 btullis@puppetmaster1001: conftool action : set/pooled=yes; selector: cluster=wikireplicas-b,name=dbproxy1018.eqiad.wmnet
* 09:28 btullis@cumin1001: END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons.
* 09:26 btullis@puppetmaster1001: conftool action : set/pooled=inactive; selector: cluster=wikireplicas-a,name=dbproxy1019.eqiad.wmnet
* 09:25 btullis@puppetmaster1001: conftool action : set/pooled=yes; selector: cluster=wikireplicas-a,name=dbproxy1018.eqiad.wmnet
* 09:22 btullis@cumin1001: START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons.
* 09:17 marostegui@cumin1001: dbctl commit (dc=all): 'db1181 (re)pooling @ 5%: After restart', diff saved to https://phabricator.wikimedia.org/P32136 and previous config saved to /var/cache/conftool/dbconfig/20220802-091754-root.json
* 09:17 marostegui@cumin1001: dbctl commit (dc=all): 'db1174 (re)pooling @ 10%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32135 and previous config saved to /var/cache/conftool/dbconfig/20220802-091749-root.json
* 09:15 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2143', diff saved to https://phabricator.wikimedia.org/P32134 and previous config saved to /var/cache/conftool/dbconfig/20220802-091518-root.json
* 09:02 marostegui@cumin1001: dbctl commit (dc=all): 'db1181 (re)pooling @ 2%: After restart', diff saved to https://phabricator.wikimedia.org/P32133 and previous config saved to /var/cache/conftool/dbconfig/20220802-090250-root.json
* 09:02 marostegui@cumin1001: dbctl commit (dc=all): 'db1174 (re)pooling @ 5%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32132 and previous config saved to /var/cache/conftool/dbconfig/20220802-090245-root.json
* 08:47 marostegui@cumin1001: dbctl commit (dc=all): 'db1181 (re)pooling @ 1%: After restart', diff saved to https://phabricator.wikimedia.org/P32131 and previous config saved to /var/cache/conftool/dbconfig/20220802-084745-root.json
* 08:47 marostegui@cumin1001: dbctl commit (dc=all): 'db1174 (re)pooling @ 1%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32130 and previous config saved to /var/cache/conftool/dbconfig/20220802-084740-root.json
* 08:46 marostegui: stop mysql on db2095 db2107 db2109 db2137 db2147 db2159 db2160 pc2012 for pdu maintenance on codfw b5 [[phab:T310070|T310070]]
* 07:49 moritzm: upgrading drmrs ganeti clusters to 3.0.2 [[phab:T312637|T312637]]
* 07:33 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2005.codfw.wmnet with reason: Switch instance to plain disks, [[phab:T311686|T311686]]
* 07:33 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2005.codfw.wmnet with reason: Switch instance to plain disks, [[phab:T311686|T311686]]
* 07:22 godog: bounce icinga on alert2001 - [[phab:T314353|T314353]]
* 07:18 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2005.codfw.wmnet with reason: Switch instance to DRBD, [[phab:T311686|T311686]]
* 07:18 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2005.codfw.wmnet with reason: Switch instance to DRBD, [[phab:T311686|T311686]]
* 06:58 elukey: restart rsyslog on ml-serve2006
* 06:56 ladsgroup@deploy1002: Synchronized php-1.39.0-wmf.22/extensions/FlaggedRevs/maintenance/pruneRevData.php: Backport: [[gerrit:819077{{!}}pruneRevData: Make cleaning in larger batches (T296380)]] (duration: 03m 26s)
* 06:56 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 06:55 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 06:55 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 06:54 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 06:46 godog: bounce icinga on alert1001 - [[phab:T314353|T314353]]
* 05:48 marostegui@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts db2088.codfw.wmnet
* 05:48 marostegui@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 05:44 marostegui@cumin1001: START - Cookbook sre.dns.netbox
* 05:35 marostegui@cumin1001: START - Cookbook sre.hosts.decommission for hosts db2088.codfw.wmnet
* 05:29 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1181', diff saved to https://phabricator.wikimedia.org/P32127 and previous config saved to /var/cache/conftool/dbconfig/20220802-052923-root.json
* 05:24 marostegui: dbmaint x1@eqiad [[phab:T314087|T314087]]
* 04:17 ryankemper: [Elastic] Small amendment to my earlier statement; based off epoch time `be_x_oldwiki_titlesuggest_1659407912` was not an old index hanging around after a reindex operation, but rather the new one that the reindex operation was trying to create, but had not yet finished (therefore didn't switch over the aliases). It presumably got interrupted by the reimage of `elastic2059`.
* 04:15 ryankemper: [Elastic] Blew away red index like so: `ryankemper@cumin1001:~$ curl -XDELETE https://search.svc.codfw.wmnet:9243/be_x_oldwiki_titlesuggest_1659407912`. Cluster is back to `green` status.
* 04:07 ryankemper: [Elastic] Per `curl -s https://search.svc.codfw.wmnet:9243/_cat/aliases {{!}} grep -i be_x` I see the `be_x_oldwiki_titlesuggest` alias points to `be_x_oldwiki_titlesuggest_1658396688`. I think this means the red index is an old index from an in-progress reindex operation. I likely just need to delete `be_x_oldwiki_titlesuggest_1659407912` but doing some quick digging first
* 04:04 ryankemper: [Elastic] Red cluster status in main codfw elasticsearch cluster (`https://search.svc.codfw.wmnet:9243`); culprit appears to be index `be_x_oldwiki_titlesuggest_1659407912`. Confusingly it has 2 replicas set so it's not clear to me how we got into this state starting from green (in the past we've gone into red status from indices that erroneously had 0 replicas in production)
* 03:47 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 03:46 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 03:46 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 03:45 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 03:40 krinkle@deploy1002: Synchronized multiversion/: {{Gerrit|I0802db272695}} (duration: 03m 10s)
* 03:40 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 03:39 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 03:39 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 03:38 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 03:34 krinkle@deploy1002: Synchronized wmf-config/: {{Gerrit|I9b89c0ff5c2}} (duration: 03m 32s)
* 03:33 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 03:32 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 03:32 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 03:31 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 03:27 krinkle@deploy1002: Synchronized multiversion/: {{Gerrit|I6e97d39a3}}, {{Gerrit|Ib843ebced31}} (duration: 03m 30s)
* 03:26 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 03:25 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 03:25 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 03:24 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 03:22 krinkle@mwmaint1002: pull aborted:  (duration: 00m 11s)
* 03:21 krinkle@deploy1002: Synchronized wmf-config/CommonSettings.php: {{Gerrit|I39a2b86065}} (duration: 03m 19s)
* 03:20 ryankemper@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host elastic2059.codfw.wmnet with OS bullseye
* 03:15 krinkle@deploy1002: Synchronized multiversion/: {{Gerrit|Ieaea60a991e5611}} (duration: 03m 03s)
* 03:14 krinkle@mwmaint2002: pull aborted:  (duration: 01m 36s)
* 03:14 krinkle@mwmaint1002: pull aborted:  (duration: 01m 31s)
* 03:13 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 03:12 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 03:12 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 03:11 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 02:58 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2059.codfw.wmnet with reason: host reimage
* 02:54 ryankemper: [WDQS] `ryankemper@wdqs1012:~$ sudo systemctl restart wdqs-blazegraph.service` to clear `Query Service HTTP Port` && `WDQS SPARQL` alerts
* 02:53 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2059.codfw.wmnet with reason: host reimage
* 02:36 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic2059.codfw.wmnet with OS bullseye
* 02:31 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 02:30 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 02:30 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 02:29 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 02:09 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 02:08 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 02:08 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 02:07 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 00:41 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 00:40 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 00:40 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 00:39 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 00:35 krinkle@deploy1002: Synchronized wmf-config/CommonSettings.php: {{Gerrit|Ieaea60a991e5}} (duration: 03m 10s)
* 00:29 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 00:28 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 00:28 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 00:27 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 00:23 krinkle@deploy1002: Synchronized multiversion/: {{Gerrit|Ia3406eba4ab8bb}} (duration: 03m 22s)
* 00:17 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 00:16 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 00:16 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 00:15 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 00:05 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 00:04 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 00:04 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 00:03 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
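
The 04:04 to 04:17 [Elastic] entries above decide which `be_x_oldwiki_titlesuggest_*` index is the newer one by reading the numeric suffix as an epoch time. A minimal sketch of that check follows. It assumes the trailing digits are a Unix timestamp in seconds, and the helper name `index_created_at` is hypothetical (it is not part of any Elasticsearch or CirrusSearch tooling).

```python
from datetime import datetime, timezone


def index_created_at(index_name: str) -> datetime:
    """Read the trailing _<digits> of an index name as a Unix epoch in seconds.

    Assumption: the suffix is the time the index was created, as the
    04:17 log entry infers. This is not a documented guarantee.
    """
    suffix = index_name.rsplit("_", 1)[1]
    return datetime.fromtimestamp(int(suffix), tz=timezone.utc)


# Index names copied from the log entries above.
aliased = "be_x_oldwiki_titlesuggest_1658396688"  # target of the alias
red = "be_x_oldwiki_titlesuggest_1659407912"      # index in red status

for name in (aliased, red):
    print(name, index_created_at(name).isoformat())

# True means the red index is newer than the aliased one, i.e. it is the
# unfinished target of a reindex rather than a leftover old index.
red_is_newer = index_created_at(red) > index_created_at(aliased)
print("red index is newer:", red_is_newer)
```

Under that assumption the red index dates to 02:38 UTC on 2022-08-02, two minutes after the 02:36 reimage of `elastic2059` started, which is consistent with the amended explanation in the 04:17 entry.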


== 2021-11-29 ==
== 2022-08-01 ==
* 22:32 sbassett@deploy1002: Synchronized php-1.38.0-wmf.9/extensions/EntitySchema/src/MediaWiki/Specials/SetEntitySchemaLabelDescriptionAliases.php: Deploy security patch for [[phab:T296578|T296578]] (duration: 00m 55s)
* 23:59 krinkle@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|Id1ce285631f5}}, {{Gerrit|I194d419fbfe}} (duration: 03m 09s)
* 22:21 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 23:58 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 22:20 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 23:57 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 22:20 sbassett@deploy1002: Synchronized php-1.38.0-wmf.9/extensions/FileImporter/src/Remote/MediaWiki/HttpApiLookup.php: Backport: [[gerrit:742263{{!}}SECURITY: Fix special page displaying unescaped user input (T296605)]] (duration: 00m 56s)
* 23:57 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:48 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 23:56 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:47 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 21:08 moritzm: drain ganeti2028 [[phab:T309957|T309957]]
* 20:46 jforrester@deploy1002: Synchronized wmf-config/CommonSettings.php: Fix wgWikiLambdaOrchestratorLocation service pointer typo (duration: 00m 55s)
* 21:03
* 20:27 tgr: UTC evening deploys done
* 20:26 tgr@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:742261{{!}}GrowthExperiments: Start imagerecommendation variant experiment]] (duration: 00m 55s)
* 20:23 tgr@deploy1002: Synchronized php-1.38.0-wmf.9/extensions/GrowthExperiments/includes/NewcomerTasks/AddImage/AddImageSubmissionHandler.php: Backport: [[gerrit:742262{{!}}AddImage: Refresh user's task feed after undecided rejection (T296491)]] (duration: 00m 56s)
* 20:21 tgr@deploy1002: Synchronized php-1.38.0-wmf.9/extensions/GrowthExperiments/includes/HomepageModules/SuggestedEdits.php: Backport: [[gerrit:742260{{!}}SuggestedEdits: Drop isActivated() check in getJsData (T296626)]] (duration: 00m 56s)
* 20:17 ejegg: updated payments-wiki from {{Gerrit|d1d6f024}} -> {{Gerrit|dbc92132}}
* 20:16 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:15 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:10 eileen: civicrm
* 20:09 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:08 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:02 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:01 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 20:00 ebernhardson@deploy1002: Synchronized wmf-config/InitialiseSettings.php: [[phab:T295705|T295705]] Move CirrusSearch traffic back to eqiad (duration: 00m 56s)
* 19:42 legoktm: uploaded php-yaml_2.2.1+2.1.0+2.0.4+1.3.2-2+wmf1~buster1_amd64.changes to apt.wm.o ([[phab:T296331|T296331]])
* 19:34 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 19:33 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 19:16 vgutierrez: pool cp3064 - [[phab:T290005|T290005]]
* 18:55 bblack: repooling esams
* 18:48 bblack: esams: shifting depool method to esams-offline (now that its config is fixed)
* 18:42 legoktm: depooling esams
* 18:17 vgutierrez: depool cp3064 - [[phab:T290005|T290005]]
* 17:58 ladsgroup@deploy1002: Synchronized php-1.38.0-wmf.9/includes/libs/rdbms/: Backport: [[gerrit:742259{{!}}rdbms: Add DB host to TransactionProfiler logging and fix time fields (T295706)]] (duration: 00m 56s)
* 17:56 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 17:55 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 17:49 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 17:48 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 17:42 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 17:41 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 17:40 jforrester@deploy1002: Synchronized wmf-config/CommonSettings.php: Initial Beta Cluster deployment of Wikifunctions: III - CS for [[phab:T289315|T289315]] (duration: 00m 55s)
* 17:38 vgutierrez: pool cp3064 - [[phab:T290005|T290005]]
* 17:29 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 17:28 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 17:25 lucaswerkmeister-wmde@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' .
* 17:25 lucaswerkmeister-wmde@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
* 17:22 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 17:22 jforrester@deploy1002: Synchronized wmf-config/ProductionServices.php: Initial Beta Cluster deployment of Wikifunctions: II - Services for [[phab:T289315|T289315]] (duration: 00m 55s)
* 17:21 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 17:18 jforrester@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Initial Beta Cluster deployment of Wikifunctions: I - IS for [[phab:T289315|T289315]] (duration: 00m 55s)
* 17:00 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 16:59 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|06d8d25f6e89be0b1692d017bdbc2c9524372c0b}}: foundationwiki: Remove explicit wmgUseOAuth (duration: 00m 57s)
* 16:59 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 16:56 urbanecm@deploy1002: Synchronized wmf-config/CommonSettings.php: {{Gerrit|bad34ed8d86b30eb4c240da0498ddfb44af30ea7}}: Make foundationwiki a standard CentralAuth wiki ([[phab:T205347|T205347]]) (duration: 00m 56s)
* 16:53 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 16:52 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|567f2a9d4883c9a98a3251f153ea0ad58d7774c6}}: Revert "foundationwiki: Set wmgLocalAuthLoginOnly=false temporarily" ([[phab:T205347|T205347]]) (duration: 00m 56s)
* 16:51 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 16:25 pt1979@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:21 pt1979@cumin2002: START - Cookbook sre.dns.netbox
* 16:20 pt1979@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2069.codfw.wmnet with OS buster
* 16:04 moritzm: sudo gnt-cluster upgrade --to 2.16 for Ganeti codfw cluster
* 15:55 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 15:54 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 15:52 vgutierrez: depool cp3064 - [[phab:T290005|T290005]]
* 15:51 James_F: Running mwscript extensions/WikimediaMaintenance/addWiki.php --wiki=enwiki en wikimedia wikifunctionswiki wikifunctions.beta.wmflabs.org in Beta Cluster for [[phab:T284162|T284162]]
* 15:51 pt1979@cumin2002: START - Cookbook sre.hosts.reimage for host elastic2069.codfw.wmnet with OS buster
* 15:48 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 15:47 papaul: power down logstash2028 for IDRAC reset
* 15:47 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 15:15 moritzm: gnt-cluster renew-crypto --new-cluster-certificate for codfw Ganeti cluster [[phab:T296622|T296622]]
* 14:40 mbsantos@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
* 14:38 mbsantos@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
* 14:37 mbsantos@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
* 13:55 vgutierrez: repool cp3064 - [[phab:T290005|T290005]]
* 12:51 moritzm: upgrading ganeti codfw cluster to 2.16 backport [[phab:T296622|T296622]]
* 12:49 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 12:48 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 12:36 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 12:35 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 12:32 vgutierrez: depool cp3064 - [[phab:T290005|T290005]]
* 12:32 urbanecm@deploy1002: Synchronized php-1.38.0-wmf.9/extensions/GrowthExperiments/includes/HomepageModules/SuggestedEdits.php: {{Gerrit|05704407395fbf227eec47cf716393dc60a36a35}}: Fix error handling in SuggestedEdits::getActionData ([[phab:T296366|T296366]]) (duration: 05m 37s)
* 12:29 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 12:28 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 12:17 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 12:16 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 12:13 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|7fdea3e71e4fd9e85c30efbc17f94c0711deb252}}:  Add planet4589.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons ([[phab:T296136|T296136]]) (duration: 00m 56s)
* 12:11 vgutierrez: pool cp3064 (text) using HAProxy as TLS terminator - [[phab:T290005|T290005]]
* 12:10 mwdebug-deploy@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 12:09 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3064.esams.wmnet with OS buster
* 12:09 mwdebug-deploy@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
* 12:07 urbanecm@deploy1002: Synchronized docroot/: {{Gerrit|4662224229cb4083b8b01de436ccd65e8c00e7dd}}: Remove search.wikimedia.org files ([[phab:T289224|T289224]]) (duration: 00m 56s)
 
See [[Server Admin Log/Archives]].
<noinclude>
<noinclude>

Revision as of 23:41, 12 August 2022

2022-08-12

  • 23:41 mutante: wikistats-bullseye:~$ /usr/lib/wikistats/update.php wp prefix blk ; /usr/lib/wikistats/update.php wp prefix kcg T315121
  • 23:38 mutante: [mwmaint1002:~] $ sudo systemctl start mediawiki_job_initsitestats.timer T315121
  • 22:14 ryankemper@cumin1001: END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
  • 21:48 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1071.eqiad.wmnet with OS bullseye
  • 21:45 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddb2002-dev.codfw.wmnet with OS bullseye
  • 21:27 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1071.eqiad.wmnet with reason: host reimage
  • 21:25 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1071.eqiad.wmnet with reason: host reimage
  • 21:12 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1071.eqiad.wmnet with OS bullseye
  • 21:10 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb2002-dev.codfw.wmnet with reason: host reimage
  • 21:06 andrew@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb2002-dev.codfw.wmnet with reason: host reimage
  • 21:06 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1053.eqiad.wmnet with OS bullseye
  • 20:50 andrew@cumin1001: START - Cookbook sre.hosts.reimage for host clouddb2002-dev.codfw.wmnet with OS bullseye
  • 20:43 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1053.eqiad.wmnet with reason: host reimage
  • 20:39 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1053.eqiad.wmnet with reason: host reimage
  • 20:24 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1053.eqiad.wmnet with OS bullseye
  • 20:12 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1048.eqiad.wmnet with OS bullseye
  • 19:55 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1048.eqiad.wmnet with reason: host reimage
  • 19:53 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1048.eqiad.wmnet with reason: host reimage
  • 19:42 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1048.eqiad.wmnet with OS bullseye
  • 19:38 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1146:3314 (T312863)', diff saved to https://phabricator.wikimedia.org/P32375 and previous config saved to /var/cache/conftool/dbconfig/20220812-193822-ladsgroup.json
  • 19:38 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
  • 19:38 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
  • 19:38 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1121 (T312863)', diff saved to https://phabricator.wikimedia.org/P32374 and previous config saved to /var/cache/conftool/dbconfig/20220812-193801-ladsgroup.json
  • 19:33 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1054.eqiad.wmnet with OS bullseye
  • 19:22 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P32373 and previous config saved to /var/cache/conftool/dbconfig/20220812-192255-ladsgroup.json
  • 19:12 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1054.eqiad.wmnet with reason: host reimage
  • 19:09 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1054.eqiad.wmnet with reason: host reimage
  • 19:07 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P32372 and previous config saved to /var/cache/conftool/dbconfig/20220812-190749-ladsgroup.json
  • 18:58 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet with reason: Maint
  • 18:58 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet with reason: Maint
  • 18:54 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1054.eqiad.wmnet with OS bullseye
  • 18:52 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1121 (T312863)', diff saved to https://phabricator.wikimedia.org/P32371 and previous config saved to /var/cache/conftool/dbconfig/20220812-185243-ladsgroup.json
  • 18:48 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1066.eqiad.wmnet with OS bullseye
  • 18:25 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1066.eqiad.wmnet with reason: host reimage
  • 18:22 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1066.eqiad.wmnet with reason: host reimage
  • 18:08 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1066.eqiad.wmnet with OS bullseye
  • 18:00 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1064.eqiad.wmnet with OS bullseye
  • 17:42 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1064.eqiad.wmnet with reason: host reimage
  • 17:39 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1064.eqiad.wmnet with reason: host reimage
  • 17:24 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1064.eqiad.wmnet with OS bullseye
  • 17:21 pt1979@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts netmon2002.wikimedia.org
  • 17:21 pt1979@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts netmon2002.wikimedia.org
  • 17:19 pt1979@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netmon2002.wikimedia.org with OS bullseye
  • 17:04 pt1979@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netmon2002.wikimedia.org with reason: host reimage
  • 17:01 pt1979@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on netmon2002.wikimedia.org with reason: host reimage
  • 16:42 pt1979@cumin2002: START - Cookbook sre.hosts.reimage for host netmon2002.wikimedia.org with OS bullseye
  • 16:26 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1067.eqiad.wmnet with OS bullseye
  • 16:21 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol2003-dev.wikimedia.org
  • 16:21 andrew@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 16:16 andrew@cumin1001: START - Cookbook sre.dns.netbox
  • 16:11 andrew@cumin1001: START - Cookbook sre.hosts.decommission for hosts cloudcontrol2003-dev.wikimedia.org
  • 16:08 pt1979@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['netmon2002.wikimedia.org']
  • 16:03 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1067.eqiad.wmnet with reason: host reimage
  • 15:58 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1067.eqiad.wmnet with reason: host reimage
  • 15:43 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1067.eqiad.wmnet with OS bullseye
  • 15:37 pt1979@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['netmon2002.wikimedia.org']
  • 15:31 jbond@cumin2002: END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['netmon2002.wikimedia.org']
  • 15:31 jbond@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['netmon2002.wikimedia.org']
  • 15:07 jbond@cumin2002: END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts netmon1002.wikimedia.org
  • 15:07 jbond@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts netmon1002.wikimedia.org
  • 15:04 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1061.eqiad.wmnet with OS bullseye
  • 14:46 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1061.eqiad.wmnet with reason: host reimage
  • 14:46 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2042.codfw.wmnet,service=varnish-fe
  • 14:46 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2042.codfw.wmnet,service=ats-be
  • 14:46 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2042.codfw.wmnet,service=ats-tls
  • 14:43 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1061.eqiad.wmnet with reason: host reimage
  • 14:29 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maint
  • 14:28 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1061.eqiad.wmnet with OS bullseye
  • 14:28 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maint
  • 14:24 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1063.eqiad.wmnet with OS bullseye
  • 14:05 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1063.eqiad.wmnet with reason: host reimage
  • 14:02 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1063.eqiad.wmnet with reason: host reimage
  • 13:47 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1063.eqiad.wmnet with OS bullseye
  • 13:41 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
  • 06:01 ryankemper@puppetmaster1001: conftool action : set/pooled=yes:weight=10; selector: name=elastic10[8-9][0-9].*
  • 05:54 ryankemper@puppetmaster1001: conftool action : set/pooled=yes:weight=10; selector: name=elastic110.*
  • 01:03 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1121 (T312863)', diff saved to https://phabricator.wikimedia.org/P32369 and previous config saved to /var/cache/conftool/dbconfig/20220812-010312-ladsgroup.json
  • 01:03 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
  • 01:02 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
  • 01:02 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance
  • 01:02 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance
  • 01:02 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1147 (T312863)', diff saved to https://phabricator.wikimedia.org/P32368 and previous config saved to /var/cache/conftool/dbconfig/20220812-010233-ladsgroup.json
  • 00:47 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P32367 and previous config saved to /var/cache/conftool/dbconfig/20220812-004727-ladsgroup.json
  • 00:32 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P32366 and previous config saved to /var/cache/conftool/dbconfig/20220812-003221-ladsgroup.json
  • 00:17 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1147 (T312863)', diff saved to https://phabricator.wikimedia.org/P32365 and previous config saved to /var/cache/conftool/dbconfig/20220812-001715-ladsgroup.json

2022-08-11

  • 21:30 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 21:29 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 21:29 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 21:28 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 21:23 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 21:22 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 21:22 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 21:21 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 21:16 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 21:15 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 21:15 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 21:14 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 21:04 thcipriani@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: revert Define default value for "wmgSiteLogoVariants" (T305692 T308620) (duration: 03m 15s)
  • 20:59 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 20:58 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 20:58 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 20:57 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 20:52 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 20:51 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 20:50 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 20:49 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 20:47 thcipriani@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Define default value for "wmgSiteLogoVariants" (T305692 T308620) (duration: 03m 07s)
  • 20:44 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 20:43 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 20:43 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 20:42 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 20:29 thcipriani@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/VisualEditor/modules/ve-mw/preinit/ve.init.mw.DesktopArticleTarget.init.js: Backport: Do not show incompatible skin warning when page is not editable (T314952) (duration: 03m 16s)
  • 20:27 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 20:26 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 20:26 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 20:25 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 20:23 mutante: merging change on prod phabricator host to allow scap deployment, part 1
  • 19:42 damilare: payments-wiki upgraded from cf5e1848 to 0894d75a
  • 19:41 mutante: disabling puppet on C:profile::phabricator::main
  • 19:20 mvernon@cumin2002: END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-eqiad: upgrade to 3.11.13 T309896 - mvernon@cumin2002
  • 17:58 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 17:58 taavi@deploy1002: Synchronized wmf-config/CommonSettings.php: Config: Fix labtestwiki database name servers (T310795) (duration: 03m 39s)
  • 17:57 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 17:57 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 17:56 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 17:52 sukhe: testing ATS 9.1.3-1wm1 on cp3064: T309651
  • 17:49 pt1979@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host netmon2002.mgmt.codfw.wmnet with reboot policy FORCED
  • 17:46 sukhe: testing ATS 9.1.3-1wm1 on cp3064: T309651
  • 17:41 pt1979@cumin2002: START - Cookbook sre.hosts.provision for host netmon2002.mgmt.codfw.wmnet with reboot policy FORCED
  • 17:40 pt1979@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 17:38 sukhe: testing ATS 9.1.3-1wm1 on cp1090: T309651
  • 17:36 pt1979@cumin2002: START - Cookbook sre.dns.netbox
  • 17:35 pt1979@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host netmon2002
  • 17:34 pt1979@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host netmon2002
  • 17:33 sukhe: testing ATS 9.1.3-1wm1 on cp3065: T309651
  • 17:28 sukhe: testing ATS 9.1.3-1wm1 on cp1089: T309651
  • 17:19 bking@cumin1001: conftool action : set/weight=10:pooled=no; selector: service=elasticsearch-omega-ssl,name=elastic1100.eqiad.wmnet
  • 17:18 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: service=elasticsearch-omega-ssl,name=elastic1100.eqiad.wmnet
  • 17:15 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: service=search-omega-https,name=elastic1100.eqiad.wmnet
  • 16:35 mvernon@cumin2002: START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-eqiad: upgrade to 3.11.13 T309896 - mvernon@cumin2002
  • 16:30 mvernon@cumin2002: END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-codfw: upgrade to 3.11.13 T309896 - mvernon@cumin2002
  • 16:29 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic[1100-1102].eqiad.wmnet with reason: T309810
  • 16:29 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic[1100-1102].eqiad.wmnet with reason: T309810
  • 16:26 inflatador: bking@elastic1054 attempting to ban elastic1100-1102 from cluster due to firewall issues
  • 16:13 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: service=search-omega-https,name=elastic1100.eqiad.wmnet
  • 16:12 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=elastic1100
  • 15:15 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 15:09 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
  • 14:58 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P32364 and previous config saved to /var/cache/conftool/dbconfig/20220811-145823-ladsgroup.json
  • 14:55 inflatador: bking@cumin1001 running puppet agent across eqiad elastic hosts
  • 14:48 ryankemper@cumin1001: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
  • 14:43 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P32362 and previous config saved to /var/cache/conftool/dbconfig/20220811-144318-ladsgroup.json
  • 14:28 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P32361 and previous config saved to /var/cache/conftool/dbconfig/20220811-142813-ladsgroup.json
  • 14:28 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol1003.wikimedia.org
  • 14:28 andrew@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 14:24 andrew@cumin1001: START - Cookbook sre.dns.netbox
  • 14:19 andrew@cumin1001: START - Cookbook sre.hosts.decommission for hosts cloudcontrol1003.wikimedia.org
  • 14:19 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 14:18 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol1004.wikimedia.org
  • 14:18 andrew@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 14:18 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 14:18 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 14:17 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 14:17 ladsgroup@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Stop writing to the old templatelinks fields in s2 (T312865) (duration: 03m 25s)
  • 14:16 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
  • 14:16 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
  • 14:16 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
  • 14:15 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
  • 14:13 andrew@cumin1001: START - Cookbook sre.dns.netbox
  • 14:13 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P32360 and previous config saved to /var/cache/conftool/dbconfig/20220811-141309-ladsgroup.json
  • 14:12 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 14:11 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 14:11 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 14:11 awight: EU backport window complete
  • 14:10 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 14:10 awight@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/DiscussionTools/includes/CommentFormatter.php: Backport: CommentFormatter: Set 'data-mw-comment' even when reply tool disabled (T314707) (duration: 03m 31s)
  • 14:09 andrew@cumin1001: START - Cookbook sre.hosts.decommission for hosts cloudcontrol1004.wikimedia.org
  • 14:05 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 14:04 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 14:04 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 14:03 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 13:52 mvernon@cumin2002: START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: upgrade to 3.11.13 T309896 - mvernon@cumin2002
  • 13:50 awight@deploy1002: Synchronized wmf-config: Config: Revert "Revert "testwiki: Add mediawiki.web_ui.interactions stream"" (duration: 03m 10s)
  • 13:48 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 13:47 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 13:47 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 13:46 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 13:36 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1060.eqiad.wmnet with OS bullseye
  • 13:36 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 13:36 awight@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: trwikiquote: Install WikiLove extension (T314895) (duration: 03m 30s)
  • 13:35 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 13:35 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 13:34 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 13:33 filippo@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host logstash2003.codfw.wmnet
  • 13:25 awight@deploy1002: Synchronized static/images: Config: Revert "trwiki: Change old and new vector logos for 500k articles" (part 3) (duration: 03m 09s)
  • 13:21 awight@deploy1002: Synchronized logos/: Config: Revert "trwiki: Change old and new vector logos for 500k articles" (part 2) (duration: 03m 09s)
  • 13:19 topranks: merging CR821781 to expose additional network info in puppet facts
  • 13:18 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 13:18 awight@deploy1002: Synchronized wmf-config/: Config: Revert "trwiki: Change old and new vector logos for 500k articles" (part 1) (duration: 03m 13s)
  • 13:17 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 13:17 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 13:16 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 13:14 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1060.eqiad.wmnet with reason: host reimage
  • 13:11 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 13:11 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1060.eqiad.wmnet with reason: host reimage
  • 13:10 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 13:10 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 13:09 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 13:08 awight@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Enable editor line numbering on all namespaces, for twwiki (T302852) (duration: 03m 42s)
  • 12:56 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1060.eqiad.wmnet with OS bullseye
  • 12:55 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
  • 12:49 aikochou@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
  • 12:46 aikochou@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
  • 12:26 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=restbase2018.codfw.wmnet
  • 12:26 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=restbase202[367].codfw.wmnet
  • 12:17 elukey@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
  • 12:17 elukey@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
  • 12:17 elukey@deploy1002: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
  • 12:16 elukey@deploy1002: helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
  • 12:13 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host logstash2003.codfw.wmnet
  • 12:11 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
  • 12:10 elukey@deploy1002: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
  • 12:09 elukey@deploy1002: helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
  • 11:20 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
  • 11:20 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
  • 09:58 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
  • 09:57 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
  • 09:56 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
  • 09:56 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
  • 09:49 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance
  • 09:49 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance
  • 09:32 godog: arm keyholder on netmon2001
  • 09:09 jbond: update gnutls28 on bullseye systems
  • 09:00 jbond: update unzip
  • 08:21 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
  • 08:13 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
  • 08:12 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
  • 08:06 ayounsi@cumin1001: END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox interface ID cr3-ulsfo:xe-0/1/1
  • 08:06 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr3-ulsfo:xe-0/1/1
  • 07:58 ayounsi@cumin1001: END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox interface ID cr3-ulsfo:xe-0/1/1
  • 07:57 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr3-ulsfo:xe-0/1/1
  • 07:55 oblivian@cumin1001: conftool action : set/pooled=false; selector: dnsdisc=k8s-ingress-wikikube-rw,name=codfw
  • 07:51 vgutierrez: rolling restart of pybal in eqsin and ulsfo
  • 07:24 oblivian@cumin1001: conftool action : set/pooled=false; selector: dnsdisc=restbase-async,name=eqiad
  • 07:24 oblivian@cumin1001: conftool action : set/pooled=true; selector: dnsdisc=shellbox-timeline
  • 07:23 oblivian@cumin1001: conftool action : set/pooled=true; selector: dnsdisc=inference
  • 07:19 _joe_: pooling all services in codfw
  • 07:03 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1147 (T312863)', diff saved to https://phabricator.wikimedia.org/P32357 and previous config saved to /var/cache/conftool/dbconfig/20220811-070312-ladsgroup.json
  • 07:03 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance
  • 07:02 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance
  • 07:02 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1160 (T312863)', diff saved to https://phabricator.wikimedia.org/P32356 and previous config saved to /var/cache/conftool/dbconfig/20220811-070252-ladsgroup.json
  • 06:47 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P32355 and previous config saved to /var/cache/conftool/dbconfig/20220811-064746-ladsgroup.json
  • 06:32 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P32354 and previous config saved to /var/cache/conftool/dbconfig/20220811-063240-ladsgroup.json
  • 06:28 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
  • 06:28 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
  • 06:17 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1160 (T312863)', diff saved to https://phabricator.wikimedia.org/P32353 and previous config saved to /var/cache/conftool/dbconfig/20220811-061734-ladsgroup.json
  • 06:17 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maint
  • 06:17 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maint
  • 06:06 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool db1162 (T314368 T298555 T312863 T310011 T309311 T60674 T298560 T303603 T310485)', diff saved to https://phabricator.wikimedia.org/P32352 and previous config saved to /var/cache/conftool/dbconfig/20220811-060625-ladsgroup.json
  • 06:01 ladsgroup@cumin1001: dbctl commit (dc=all): 'Promote db1122 to s2 primary and set section read-write T314368', diff saved to https://phabricator.wikimedia.org/P32351 and previous config saved to /var/cache/conftool/dbconfig/20220811-060113-ladsgroup.json
  • 06:00 ladsgroup@cumin1001: dbctl commit (dc=all): 'Set s2 eqiad as read-only for maintenance - T314368', diff saved to https://phabricator.wikimedia.org/P32350 and previous config saved to /var/cache/conftool/dbconfig/20220811-060042-ladsgroup.json
  • 06:00 Amir1: Starting s2 eqiad failover from db1162 to db1122 - T314368
  • 05:19 ladsgroup@cumin1001: dbctl commit (dc=all): 'Set db1122 with weight 0 T314368', diff saved to https://phabricator.wikimedia.org/P32349 and previous config saved to /var/cache/conftool/dbconfig/20220811-051913-ladsgroup.json
  • 05:19 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s2 T314368
  • 05:18 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s2 T314368
  • m: chown -R librenms /srv/librenms/rrd/ on netmon1003 T314972
  • 03:51 cwhite: chown librenms /srv/librenms/rrd/* on netmon1003 T314972
  • 02:55 ejegg: civicrm upgraded from 1f91ac2d to 92467234
  • 02:46 ejegg: updated process-control yaml files with @wmff alias
  • 02:08 ejegg: civicrm rolled back from 92467234 to 1f91ac2d
  • 02:05 ejegg: civicrm upgraded from 1f91ac2d to 92467234
  • 01:40 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 01:39 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 01:39 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 01:38 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 01:38 tstarling@deploy1002: Synchronized wmf-config/logging.php: (no justification provided) (duration: 03m 25s)
  • 01:19 tstarling@puppetmaster1001: conftool action : set/pooled=true; selector: dnsdisc=(appservers|api)-ro,name=codfw
  • 01:19 tstarling@puppetmaster1001: conftool action : set/pooled=true; selector: dnsdisc=sessionstore,name=codfw
  • 00:58 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=varnish-fe
  • 00:58 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=ats-be
  • 00:58 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=ats-tls
  • 00:57 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on cp2042.codfw.wmnet with reason: host down; depooled and will debug tomorrow
  • 00:57 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on cp2042.codfw.wmnet with reason: host down; depooled and will debug tomorrow

2022-08-10

  • 21:25 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1016.eqiad.wmnet
  • 21:23 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1014.eqiad.wmnet
  • 21:10 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 16 hosts with reason: T309810
  • 21:10 bking@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on 16 hosts with reason: T309810
  • 21:09 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on elastic[1101-1102].eqiad.wmnet with reason: T309810
  • 21:09 bking@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on elastic[1101-1102].eqiad.wmnet with reason: T309810
  • 21:00 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 21:00 cjming: end of UTC late backport window
  • 20:59 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 20:59 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 20:59 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Remove unused $wgEnableMWSuggest (duration: 03m 04s)
  • 20:58 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 20:56 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Enable new topic tool on dewiki (T313699) (duration: 03m 01s)
  • 20:34 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: testwiki: set $wgCdnMatchParameterOrder to false (T314868) (duration: 03m 20s)
  • 20:31 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 20:30 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 20:30 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 20:30 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 20:19 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 20:18 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 20:18 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 20:17 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 20:12 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 20:09 bking@cumin1001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
  • 20:08 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 20:08 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 20:08 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Start writing to cuc_actor everywhere except s4 and s8 (T233004) (duration: 03m 15s)
  • 20:07 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 19:51 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2053-2054].codfw.wmnet
  • 19:51 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc[2053-2054].codfw.wmnet
  • 19:35 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2019-2020].codfw.wmnet
  • 19:35 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for parse[2019-2020].codfw.wmnet
  • 19:35 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2016-2018].codfw.wmnet
  • 19:35 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for parse[2016-2018].codfw.wmnet
  • 19:34 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2036.codfw.wmnet
  • 19:34 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc2036.codfw.wmnet
  • 19:28 sukhe: testing ATS 9.1.3-1wm1 on cp4026: T309651
  • 19:09 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1087.eqiad.wmnet with OS bullseye
  • 19:06 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1086.eqiad.wmnet with OS bullseye
  • 18:55 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1087.eqiad.wmnet with reason: host reimage
  • 18:51 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1086.eqiad.wmnet with reason: host reimage
  • 18:50 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1087.eqiad.wmnet with reason: host reimage
  • 18:49 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1086.eqiad.wmnet with reason: host reimage
  • 18:47 bking@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 18:38 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1087.eqiad.wmnet with OS bullseye
  • 18:36 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1086.eqiad.wmnet with OS bullseye
  • 18:22 urandom: truncating Cassandra hints (eqiad datacenter) -- T314941
  • 18:13 urandom: truncating codfw Cassandra hints (eqiad datacenter) -- T314941
  • 18:07 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main2005.codfw.wmnet
  • 18:07 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for kafka-main2005.codfw.wmnet
  • 18:05 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repool D8 DBs after PDU maint (T310146)', diff saved to https://phabricator.wikimedia.org/P32346 and previous config saved to /var/cache/conftool/dbconfig/20220810-180529-ladsgroup.json
  • 17:42 otto@deploy1002: Finished deploy [analytics/refinery@6e47e0e]: Add missing changes to the deletion script - T270433 - [analytics/refinery@6e47e0e] (duration: 05m 28s)
  • 17:39 fnegri@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts labweb1002.wikimedia.org
  • 17:39 fnegri@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 17:36 otto@deploy1002: Started deploy [analytics/refinery@6e47e0e]: Add missing changes to the deletion script - T270433 - [analytics/refinery@6e47e0e]
  • 17:35 fnegri@cumin1001: START - Cookbook sre.dns.netbox
  • 17:34 otto@deploy1002: Finished deploy [analytics/refinery@6e47e0e] (hadoop-test): Add missing changes to the deletion script - T270433 - TEST [analytics/refinery@6e47e0e] (duration: 04m 19s)
  • 17:30 fnegri@cumin1001: START - Cookbook sre.hosts.decommission for hosts labweb1002.wikimedia.org
  • 17:30 otto@deploy1002: Started deploy [analytics/refinery@6e47e0e] (hadoop-test): Add missing changes to the deletion script - T270433 - TEST [analytics/refinery@6e47e0e]
  • 17:09 dzahn@cumin2002: START - Cookbook sre.dns.netbox
  • 17:08 otto@deploy1002: Started deploy [analytics/refinery@d4dd7e4] (hadoop-test): Add safety limits to refinery-drop-older-than - T270433 - TEST [analytics/refinery@d4dd7e4]
  • 17:06 sukhe: testing ATS 9.1.3-1wm1 on cp4032: T309651
  • 17:06 urandom: flushing RESTBase Cassandra tables -row B- to (temporarily) free instance-data space -- T314941
  • 17:05 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on krb2001.codfw.wmnet with reason: btullis codfw maintenance
  • 17:05 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on krb2001.codfw.wmnet with reason: btullis codfw maintenance
  • 17:04 dzahn@cumin2002: START - Cookbook sre.hosts.decommission for hosts gerrit2001.wikimedia.org
  • 17:02 sukhe: testing ATS 9.1.3-1wm1 on cp6008: T309651
  • 16:56 sukhe: testing ATS 9.1.3-1wm1 on cp6016: T309651
  • 16:55 fnegri@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts labweb1001.wikimedia.org
  • 16:55 fnegri@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 16:32 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts gerrit2001.wikimedia.org
  • 16:32 dzahn@cumin2002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
  • 16:32 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes[2013-2014].codfw.wmnet
  • 16:31 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for kubernetes[2013-2014].codfw.wmnet
  • 16:31 jelto: kubectl uncordon kubernetes2014.codfw.wmnet
  • 16:31 fnegri@cumin1001: START - Cookbook sre.dns.netbox
  • 16:30 jelto: kubectl uncordon kubernetes2013.codfw.wmnet
  • 16:29 urandom: restarting Cassandra (RESTBase) -row A- to apply r822110 -- T314941
  • 16:27 dzahn@cumin2002: START - Cookbook sre.dns.netbox
  • 16:25 fnegri@cumin1001: START - Cookbook sre.hosts.decommission for hosts labweb1001.wikimedia.org
  • 16:23 mutante: shutting down gerrit2001
  • 16:23 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2034-2035].codfw.wmnet
  • 16:23 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc[2034-2035].codfw.wmnet
  • 16:22 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2016-2018].codfw.wmnet
  • 16:22 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for parse[2016-2018].codfw.wmnet
  • 16:16 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=maps2010.codfw.wmnet
  • 16:16 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=sessionstore2003.codfw.wmnet
  • 16:13 sukhe: reprepro -C component/trafficserver9 include buster-wikimedia trafficserver_9.1.3-1wm1_amd64.changes: T309651
  • 16:13 dzahn@cumin2002: START - Cookbook sre.hosts.decommission for hosts gerrit2001.wikimedia.org
  • 16:11 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be[2039,2050,2056,2059].codfw.wmnet,thanos-be2004.codfw.wmnet with reason: PDU work
  • 16:10 mvernon@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be[2039,2050,2056,2059].codfw.wmnet,thanos-be2004.codfw.wmnet with reason: PDU work
  • 16:09 urandom: flushing tables in row D (RESTBase Cassandra cluster) -- T314941
  • 15:54 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for gitlab-runner2004.codfw.wmnet
  • 15:54 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for gitlab-runner2004.codfw.wmnet
  • 15:53 sukhe: poweroff cp2041, 42 for PDU upgrade: rack D7
  • 15:51 urandom: flushing tables in row B (RESTBase Cassandra cluster) -- T314941
  • 15:49 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2004.codfw.wmnet with reason: PDU maintenance
  • 15:49 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2004.codfw.wmnet with reason: PDU maintenance
  • 15:46 urandom: flushing tables in row A (RESTBase Cassandra cluster) -- T314941
  • 15:46 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2012.codfw.wmnet with reason: btullis codfw maintenance
  • 15:46 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp[2041-2042].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
  • 15:46 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2012.codfw.wmnet with reason: btullis codfw maintenance
  • 15:46 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 3:00:00 on cp[2041-2042].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
  • 15:46 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2011.codfw.wmnet with reason: btullis codfw maintenance
  • 15:46 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2011.codfw.wmnet with reason: btullis codfw maintenance
  • 15:45 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2010.codfw.wmnet with reason: btullis codfw maintenance
  • 15:45 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2010.codfw.wmnet with reason: btullis codfw maintenance
  • 15:45 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2009.codfw.wmnet with reason: btullis codfw maintenance
  • 15:45 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2009.codfw.wmnet with reason: btullis codfw maintenance
  • 15:37 urandom: (ephemerally) increasing hinted hand-off delivery rate limit to 16KB, RESTBase eqiad nodes -- T314941
  • 15:34 jbond: remove puppetmaster[12]002 from production
  • 15:30 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main2004.codfw.wmnet
  • 15:30 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for kafka-main2004.codfw.wmnet
  • 15:20 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2051-2052].codfw.wmnet
  • 15:20 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc[2051-2052].codfw.wmnet
  • 15:17 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc-gp2003.codfw.wmnet
  • 15:17 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc-gp2003.codfw.wmnet
  • 15:16 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2033.codfw.wmnet
  • 15:16 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc2033.codfw.wmnet
  • 15:14 _joe_: power off krb2002
  • 15:14 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on krb2002.codfw.wmnet with reason: PDU maintenance
  • 15:13 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on krb2002.codfw.wmnet with reason: PDU maintenance
  • 15:13 _joe_: shutting down rdb2010,puppetmaster2002 for d5 maintenance
  • 15:02 jelto: power off mc2035
  • 15:01 jelto: power off mc2034
  • 15:01 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mc2035.codfw.wmnet with reason: PDU swap
  • 15:01 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on mc2035.codfw.wmnet with reason: PDU swap
  • 15:01 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mc2034.codfw.wmnet with reason: PDU swap
  • 15:01 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on mc2034.codfw.wmnet with reason: PDU swap
  • 14:43 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: PDU Maint (T310146)
  • 14:43 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: PDU Maint (T310146)
  • 14:38 urandom: disabling reserved space on eqiad nodes (RESTBase), /dev/md2 (aka /srv/cassandra/instance-data) -- T314941
  • 14:28 jelto: power off kafka-main2004 gracefully
  • 14:28 hnowlan: shutting down sessionstore2003
  • 14:27 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=sessionstore2003.codfw.wmnet
  • 14:27 sukhe: power off cp2039, cp2040 for PDU upgrade: rack D
  • 14:27 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2003.codfw.wmnet with reason: PDU maintenance
  • 14:27 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2003.codfw.wmnet with reason: PDU maintenance
  • 14:25 jelto: power off mc-gp2003
  • 14:25 jelto: power off mc2033
  • 14:24 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:30:00 on kafka-main2004.codfw.wmnet with reason: PDU swap
  • 14:23 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:30:00 on kafka-main2004.codfw.wmnet with reason: PDU swap
  • 14:23 sukhe: depool codfw for PDU upgrade: rack D
  • 14:23 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:30:00 on mc-gp2003.codfw.wmnet with reason: PDU swap
  • 14:23 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:30:00 on mc-gp2003.codfw.wmnet with reason: PDU swap
  • 14:23 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:30:00 on mc2033.codfw.wmnet with reason: PDU swap
  • 14:23 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:30:00 on mc2033.codfw.wmnet with reason: PDU swap
  • 14:15 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp20[39|40]\.codfw\.wmnet,service=ats-tls
  • 14:13 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp[2039-2040].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
  • 14:13 urandom: flushing Cassandra tables, restbase1030
  • 14:13 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 3:00:00 on cp[2039-2040].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
  • 14:13 urandom: flushing Cassandra tables, restbase1019
  • 14:12 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on dns2002.wikimedia.org with reason: shutdown for PDU upgrade: rack D4
  • 14:12 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 3:00:00 on dns2002.wikimedia.org with reason: shutdown for PDU upgrade: rack D4
  • 14:11 urandom: flushing Cassandra tables, restbase1017 1018 1021 1024 1025 1026 1028 1029
  • 14:05 urandom: flushing tables, restbase1016
  • 13:52 hnowlan: powered up restbase2018
  • 13:32 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on netmon1003.wikimedia.org with reason: pdu
  • 13:32 filippo@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on netmon1003.wikimedia.org with reason: pdu
  • 13:32 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on logstash2003.codfw.wmnet with reason: pdu
  • 13:31 filippo@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on logstash2003.codfw.wmnet with reason: pdu
  • 13:31 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on logstash2029.codfw.wmnet with reason: pdu
  • 13:31 root@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on logstash2029.codfw.wmnet with reason: pdu
  • 13:30 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: T310146
  • 13:30 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: T310146
  • 13:17 elukey: powering on restbase2027
  • 13:12 elukey: powering on restbase2026
  • 13:12 _joe_: powering on restbase2023
  • 13:01 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1160 (T312863)', diff saved to https://phabricator.wikimedia.org/P32343 and previous config saved to /var/cache/conftool/dbconfig/20220810-130108-ladsgroup.json
  • 13:01 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
  • 13:00 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
  • 12:37 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic[2072,2084-2085].codfw.wmnet with reason: T310146
  • 12:37 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic[2072,2084-2085].codfw.wmnet with reason: T310146
  • 12:27 jbond: remove confd from servers that shouldn't have it
  • 12:05 ladsgroup@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/Echo/maintenance/removeOrphanedEvents.php: Backport: Run clean ups with removeOrphanedEvents in major batches (T310428) (duration: 03m 32s)
  • 11:47 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 11:46 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 11:46 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 11:43 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 11:15 jbond@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS buster
  • 10:54 jbond@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
  • 10:51 jbond@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
  • 10:37 jbond@cumin1001: START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster
  • 10:31 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2181-2182].codfw.wmnet with reason: D6 PDU maint (T310146)
  • 10:31 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2181-2182].codfw.wmnet with reason: D6 PDU maint (T310146)
  • 10:26 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on maps2010.codfw.wmnet with reason: PDU maintenance
  • 10:26 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on maps2010.codfw.wmnet with reason: PDU maintenance
  • 10:26 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=maps2010.codfw.wmnet
  • 10:25 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=maps2010.codfw.wmnet
  • 10:25 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=maps2008.codfw.wmnet
  • 10:24 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on restbase2018.codfw.wmnet with reason: PDU maintenance
  • 10:24 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase2018.codfw.wmnet
  • 10:24 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on restbase2018.codfw.wmnet with reason: PDU maintenance
  • 10:23 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ml-serve2008.codfw.wmnet with reason: PDU maintenance
  • 10:23 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on ml-serve2008.codfw.wmnet with reason: PDU maintenance
  • 10:20 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ores2009.codfw.wmnet with reason: PDU maintenance
  • 10:20 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on ores2009.codfw.wmnet with reason: PDU maintenance
  • 10:19 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ores2008.codfw.wmnet with reason: PDU maintenance
  • 10:19 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on ores2008.codfw.wmnet with reason: PDU maintenance
  • 10:03 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase202[367].codfw.wmnet
  • 10:02 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on restbase[2023,2026-2027].codfw.wmnet with reason: PDU maintenance
  • 10:02 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on restbase[2023,2026-2027].codfw.wmnet with reason: PDU maintenance
  • 09:53 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: D8 PDU Maint (T310146)
  • 09:52 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: D8 PDU Maint (T310146)
  • 09:51 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool D8 DBs for PDU maint (T310146)', diff saved to https://phabricator.wikimedia.org/P32341 and previous config saved to /var/cache/conftool/dbconfig/20220810-095059-ladsgroup.json
  • 09:36 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2101,2130,2140].codfw.wmnet,dbproxy2004.codfw.wmnet with reason: D6 PDU maint (T310146)
  • 09:36 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2101,2130,2140].codfw.wmnet,dbproxy2004.codfw.wmnet with reason: D6 PDU maint (T310146)
  • 09:34 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool D6 dbs (T310146)', diff saved to https://phabricator.wikimedia.org/P32340 and previous config saved to /var/cache/conftool/dbconfig/20220810-093433-ladsgroup.json
  • 09:31 jelto: depool services in codfw for upcoming PDU replacement - T309956
  • 09:30 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 8 hosts with reason: Maintenance
  • 09:29 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 8 hosts with reason: Maintenance
  • 09:29 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
  • 09:29 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
  • 09:28 jynus: shutdown backup2007 before pdu upgrade T310146
  • 09:16 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 09:15 ladsgroup@deploy1002: Synchronized php-1.39.0-wmf.23/maintenance/namespaceDupes.php: Backport: maintenance: Add support for links migration to namespaceDupes.php (T314711) (duration: 03m 18s)
  • 09:15 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2093,2120,2129,2172].codfw.wmnet with reason: D5 PDU maint (T310146)
  • 09:15 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2093,2120,2129,2172].codfw.wmnet with reason: D5 PDU maint (T310146)
  • 09:14 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 09:14 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 09:13 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 09:10 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool D5 dbs (T310146)', diff saved to https://phabricator.wikimedia.org/P32339 and previous config saved to /var/cache/conftool/dbconfig/20220810-091038-ladsgroup.json
  • 08:49 jynus: shutdown dbprov2003 before pdu upgrade T310146
  • 08:49 ayounsi@cumin1001: END (FAIL) - Cookbook sre.network.debug (exit_code=99) for Netbox interface ID cr1-drmrs:xe-0/1/2
  • 08:48 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
  • 08:48 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be2028.codfw.wmnet
  • 08:48 mvernon@cumin1001: START - Cookbook sre.hosts.remove-downtime for ms-be2028.codfw.wmnet
  • 08:42 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P32337 and previous config saved to /var/cache/conftool/dbconfig/20220810-084222-ladsgroup.json
  • 08:37 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 08:37 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
  • 08:36 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
  • 08:36 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 08:36 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 08:35 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 08:35 ladsgroup@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Stop writing to the old templatelinks fields in s5 (T312865) (duration: 03m 29s)
  • 08:32 jelto: power off gitlab-runner2004
  • 08:31 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:30:00 on gitlab-runner2004.codfw.wmnet with reason: PDU swap
  • 08:31 root@cumin1001: START - Cookbook sre.hosts.downtime for 8:30:00 on gitlab-runner2004.codfw.wmnet with reason: PDU swap
  • 08:29 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be2028.codfw.wmnet with reason: Trying to fix full /
  • 08:28 mvernon@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be2028.codfw.wmnet with reason: Trying to fix full /
  • 08:28 ayounsi@cumin1001: END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox interface ID cr1-drmrs:xe-0/1/2
  • 08:27 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
  • 08:27 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P32336 and previous config saved to /var/cache/conftool/dbconfig/20220810-082718-ladsgroup.json
  • 08:25 ayounsi@cumin1001: END (FAIL) - Cookbook sre.network.debug (exit_code=99) for Netbox interface ID cr1-drmrs:xe-0/1/2
  • 08:25 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
  • 08:24 ayounsi@cumin1001: END (FAIL) - Cookbook sre.network.debug (exit_code=99) for Netbox interface ID cr1-drmrs:xe-0/1/2
  • 08:24 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
  • 08:23 kart_: Run: mwscript namespaceDupes.php arywiki --fix (T291737)
  • 08:13 jynus: restart replication on db1117:m1 T309074
  • 08:12 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P32335 and previous config saved to /var/cache/conftool/dbconfig/20220810-081213-ladsgroup.json
  • 08:09 kartik@deploy1002: Finished scap: Backport: arywiki: change namespace translations, add unchanged namespaces and add old translations as aliases (T291737) (duration: 10m 37s)
  • 07:59 kartik@deploy1002: Started scap: Backport: arywiki: change namespace translations, add unchanged namespaces and add old translations as aliases (T291737)
  • 07:57 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P32334 and previous config saved to /var/cache/conftool/dbconfig/20220810-075708-ladsgroup.json
  • 07:56 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P32333 and previous config saved to /var/cache/conftool/dbconfig/20220810-075636-ladsgroup.json
  • 07:55 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 07:52 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 07:52 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 07:52 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance
  • 07:52 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance
  • 07:51 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 07:51 dcaro@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 07:46 dcaro@cumin1001: START - Cookbook sre.dns.netbox
  • 07:39 dcaro@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 07:34 dcaro@cumin1001: START - Cookbook sre.dns.netbox
  • 07:33 godog: depool thanos-fe2001 for debugging
  • 07:11 kartik@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Enable SectionTranslation on testwiki with new MT support from Google (T313296) (duration: 05m 44s)
  • 07:11 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 07:08 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 07:08 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 07:07 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 05:24 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on kubernetes[2013-2014].codfw.wmnet with reason: PDU maintenance
  • 05:24 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on kubernetes[2013-2014].codfw.wmnet with reason: PDU maintenance
  • 05:19 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on parse[2016-2020].codfw.wmnet with reason: PDU maintenance
  • 05:19 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on parse[2016-2020].codfw.wmnet with reason: PDU maintenance
  • 05:12 _joe_: starting to shut down servers in codfw for the PDU maintenance
  • 05:09 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on 10 hosts with reason: PDU maintenance
  • 05:09 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on 10 hosts with reason: PDU maintenance
  • 05:09 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on mc-gp2003.codfw.wmnet with reason: PDU maintenance
  • 05:09 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on mc-gp2003.codfw.wmnet with reason: PDU maintenance
  • 05:06 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on mc2033.codfw.wmnet with reason: PDU maintenance
  • 05:06 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on mc2033.codfw.wmnet with reason: PDU maintenance
  • 05:05 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on 7 hosts with reason: PDU maintenance
  • 05:05 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on 7 hosts with reason: PDU maintenance
  • 02:34 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 02:33 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 02:33 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 02:32 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 02:07 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 02:06 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 02:06 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 02:05 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply

2022-08-09

  • 23:17 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1011.eqiad.wmnet
  • 23:07 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
  • 23:06 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
  • 22:51 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
  • 22:51 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
  • 22:49 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
  • 22:49 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
  • 22:46 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1015.eqiad.wmnet
  • 22:31 ryankemper@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
  • 22:31 ryankemper@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
  • 22:28 bking@cumin1001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
  • 22:02 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
  • 22:02 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
  • 21:57 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs2006.codfw.wmnet with reason: T310146
  • 21:57 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs2006.codfw.wmnet with reason: T310146
  • 21:53 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
  • 21:52 bking@deploy1002: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
  • 21:50 bking@deploy1002: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
  • 21:49 bking@deploy1002: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
  • 21:43 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop: apply
  • 21:43 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop: apply
  • 21:43 bking@deploy1002: helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
  • 21:43 bking@deploy1002: helmfile [eqiad] START helmfile.d/services/changeprop: apply
  • 21:43 bking@deploy1002: helmfile [staging] DONE helmfile.d/services/changeprop: apply
  • 21:43 bking@deploy1002: helmfile [staging] START helmfile.d/services/changeprop: apply
  • 21:08 bking@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 21:00 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1014.eqiad.wmnet
  • 20:56 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
  • 20:55 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
  • 20:55 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148 (T312863)', diff saved to https://phabricator.wikimedia.org/P32332 and previous config saved to /var/cache/conftool/dbconfig/20220809-205548-ladsgroup.json
  • 20:51 bking@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wdqs1014.eqiad.wmnet
  • 20:51 bking@cumin1001: START - Cookbook sre.hosts.remove-downtime for wdqs1014.eqiad.wmnet
  • 20:46 bking@cumin1001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
  • 20:40 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P32331 and previous config saved to /var/cache/conftool/dbconfig/20220809-204042-ladsgroup.json
  • 20:25 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P32330 and previous config saved to /var/cache/conftool/dbconfig/20220809-202536-ladsgroup.json
  • 20:10 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148 (T312863)', diff saved to https://phabricator.wikimedia.org/P32329 and previous config saved to /var/cache/conftool/dbconfig/20220809-201030-ladsgroup.json
  • 19:57 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on wdqs1014.eqiad.wmnet with reason: T314890
  • 19:57 bking@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on wdqs1014.eqiad.wmnet with reason: T314890
  • 19:56 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on wdqs1016.eqiad.wmnet with reason: T314890
  • 19:56 bking@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on wdqs1016.eqiad.wmnet with reason: T314890
  • 19:55 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on wdqs1015.eqiad.wmnet with reason: T314890
  • 19:55 bking@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on wdqs1015.eqiad.wmnet with reason: T314890
  • 19:38 dcausse@deploy1002: helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply
  • 19:36 dcausse@deploy1002: helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply
  • 19:35 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1130.eqiad.wmnet with reason: Maintenance
  • 19:35 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 10:00:00 on db1130.eqiad.wmnet with reason: Maintenance
  • 19:25 bking@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 18:09 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 18:06 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
  • 17:54 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 17:47 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
  • 17:38 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1072.eqiad.wmnet with OS bullseye
  • 17:29 vgutierrez: test trafficserver 9.1.2-1wm2 in cp6016 - T309651
  • 17:15 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1072.eqiad.wmnet with reason: host reimage
  • 17:13 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1072.eqiad.wmnet with reason: host reimage
  • 17:00 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1072.eqiad.wmnet with OS bullseye
  • 16:54 bking@cumin1001: END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
  • 16:54 bking@cumin1001: START - Cookbook sre.elasticsearch.force-shard-allocation
  • 16:53 bking@cumin1001: END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
  • 16:53 bking@cumin1001: START - Cookbook sre.elasticsearch.force-shard-allocation
  • 16:26 bking@deploy1002: helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
  • 16:26 bking@deploy1002: helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
  • 16:01 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1069.eqiad.wmnet with OS bullseye
  • 15:45 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1069.eqiad.wmnet with reason: host reimage
  • 15:42 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1069.eqiad.wmnet with reason: host reimage
  • 15:32 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 15:31 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 15:31 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 15:30 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1069.eqiad.wmnet with OS bullseye
  • 15:28 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 15:27 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1058.eqiad.wmnet with OS bullseye
  • 15:08 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1058.eqiad.wmnet with reason: host reimage
  • 15:05 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1058.eqiad.wmnet with reason: host reimage
  • 14:59 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
  • 14:59 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
  • m: finished running 'homer "status:active" commit "netmon: Add the netmon1003 host as a syslog destination"' in the cumin1001 host. Homer reported no errors.
  • 14:54 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
  • 14:50 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1058.eqiad.wmnet with OS bullseye
  • 14:28 bking@cumin1001: conftool action : set/pooled=false; selector: dnsdisc=wdqs,name=codfw
  • 13:57 kevinbazira@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
  • 13:57 kevinbazira@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
  • m: Add the new netmon1003 host as a syslog destination in homer templates/common/system.conf https://gerrit.wikimedia.org/r/c/operations/homer/public/+/819124
  • m: Successfully ran '# run-puppet-merge' in the netmon1002 and netmon1003 hosts.
  • m: Running '# run-puppet-agent' in the netmon1003 host
  • m: Running '# run-puppet-agent' in the netmon1002 host
  • 13:47 ryankemper@cumin1001: END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
  • 13:46 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.force-shard-allocation
  • m: puppet-merge on puppetmaster2004.codfw.wmnet for patch 819179 succeeded
  • m: Set netmon1003 as netmon_server and netmon1002 as a netmon_servers_failover in the Puppet repository https://gerrit.wikimedia.org/r/c/operations/puppet/+/819179
  • m: authdns updated successfully
  • m: Had to revert https://gerrit.wikimedia.org/r/c/operations/dns/+/819177 because I rebased my changes incorrectly, sent the new patch in https://gerrit.wikimedia.org/r/c/operations/dns/+/821746
  • m: running '# authdns-update' in ns0.wikimedia.org
  • m: Flip DNS for LibreNMS and Smokeping from netmon1002 to netmon1003 https://gerrit.wikimedia.org/r/c/operations/dns/+/819177
  • 13:23 jynus: stop replication on db1117:m1 T309074
  • m: netmon1002 to netmon1003 failover
  • 13:17 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
  • 13:16 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
  • 10:58 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
  • 09:53 vgutierrez: rolling restart of pybal in eqsin - T310070
  • 09:25 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
  • 09:24 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
  • 09:24 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
  • 09:12 vgutierrez: rolling restart of pybal in codfw - T310070
  • 08:47 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
  • 08:30 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
  • 08:28 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
  • 08:27 elukey@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
  • 08:27 elukey@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
  • 08:27 elukey@deploy1002: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
  • 08:26 elukey@deploy1002: helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
  • 08:24 jynus: starting data check using es1021 and es2021, expect increased read traffic T314559
  • 08:21 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
  • 06:22 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance
  • 06:22 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance
  • 06:19 Amir1: dbmaint s5@eqiad (T312863 T312984 T310011 T310485)
  • 06:11 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1130.eqiad.wmnet with reason: Maint
  • 06:11 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1130.eqiad.wmnet with reason: Maint
  • 06:08 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool db1130 T314370', diff saved to https://phabricator.wikimedia.org/P32323 and previous config saved to /var/cache/conftool/dbconfig/20220809-060836-ladsgroup.json
  • 06:07 oblivian@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
  • 06:02 ladsgroup@cumin1001: dbctl commit (dc=all): 'Promote db1100 to s5 primary and set section read-write T314370', diff saved to https://phabricator.wikimedia.org/P32322 and previous config saved to /var/cache/conftool/dbconfig/20220809-060159-ladsgroup.json
  • 06:01 ladsgroup@cumin1001: dbctl commit (dc=all): 'Set s5 eqiad as read-only for maintenance - T314370', diff saved to https://phabricator.wikimedia.org/P32321 and previous config saved to /var/cache/conftool/dbconfig/20220809-060105-ladsgroup.json
  • 06:00 Amir1: Starting s5 eqiad failover from db1130 to db1100 - T314370
  • 05:12 ladsgroup@cumin1001: dbctl commit (dc=all): 'Set db1100 with weight 0 T314370', diff saved to https://phabricator.wikimedia.org/P32320 and previous config saved to /var/cache/conftool/dbconfig/20220809-051251-ladsgroup.json
  • 05:12 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 22 hosts with reason: Primary switchover s5 T314370
  • 05:11 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on 22 hosts with reason: Primary switchover s5 T314370
  • 02:42 ejegg: SmashPig upgraded from 9b97ea15 to 13e9e9cc
  • 02:31 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1148 (T312863)', diff saved to https://phabricator.wikimedia.org/P32318 and previous config saved to /var/cache/conftool/dbconfig/20220809-023113-ladsgroup.json
  • 02:31 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance
  • 02:30 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance
  • 02:30 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T312863)', diff saved to https://phabricator.wikimedia.org/P32317 and previous config saved to /var/cache/conftool/dbconfig/20220809-023052-ladsgroup.json
  • 02:28 ejegg: payments-wiki upgraded from 6880236d to cf5e1848
  • 02:15 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P32316 and previous config saved to /var/cache/conftool/dbconfig/20220809-021546-ladsgroup.json
  • 02:00 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P32315 and previous config saved to /var/cache/conftool/dbconfig/20220809-020040-ladsgroup.json
  • 01:45 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T312863)', diff saved to https://phabricator.wikimedia.org/P32314 and previous config saved to /var/cache/conftool/dbconfig/20220809-014534-ladsgroup.json
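
Entries in this log follow a loose "HH:MM actor: message" shape, and the cookbook entries carry a START or END marker with an optional status such as PASS, FAIL or ERROR. The sketch below is one way to pull those fields out of a line when auditing a day's activity. It is not part of any Wikimedia tooling: the names `Entry`, `parse_entry` and `cookbook_status` and both regular expressions are assumptions made for illustration, based only on the entry shapes visible on this page. Lines that lack a leading timestamp are deliberately skipped rather than guessed at.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Hypothetical pattern for one bullet entry: "• HH:MM actor: message".
ENTRY_RE = re.compile(
    r"^\s*•\s+(?P<time>\d{2}:\d{2})\s+(?P<actor>[^:\s]+):\s+(?P<message>.*)$"
)

# Hypothetical pattern for cookbook messages such as
# "END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)".
COOKBOOK_RE = re.compile(
    r"^(?P<phase>START|END)(?: \((?P<status>[A-Z]+)\))? - Cookbook (?P<name>\S+)"
)


class Entry(NamedTuple):
    time: str
    actor: str
    message: str


def parse_entry(line: str) -> Optional[Entry]:
    """Return the parsed entry, or None if the line has no HH:MM timestamp."""
    m = ENTRY_RE.match(line)
    if m is None:
        return None
    return Entry(m["time"], m["actor"], m["message"])


def cookbook_status(entry: Entry) -> Optional[Tuple[str, str, Optional[str]]]:
    """Return (cookbook name, phase, status) for cookbook entries, else None."""
    m = COOKBOOK_RE.match(entry.message)
    if m is None:
        return None
    return m["name"], m["phase"], m["status"]


if __name__ == "__main__":
    sample = (
        "  • 20:46 bking@cumin1001: END (FAIL) - "
        "Cookbook sre.wdqs.data-transfer (exit_code=99)"
    )
    entry = parse_entry(sample)
    if entry is not None:
        print(entry.actor, cookbook_status(entry))
```

Filtering the parsed entries for a status other than PASS gives a quick list of cookbook runs that may need a follow-up, such as the failed sre.wdqs.data-transfer run at 20:46.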

2022-08-08

  • 23:52 tstarling@deploy1002: Synchronized wmf-config/InitialiseSettings.php: clean up testwiki experiments T314750 (duration: 03m 19s)
  • 23:47 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 23:46 tstarling@deploy1002: Synchronized wmf-config/CommonSettings.php: clean up testwiki experiments T314750 (duration: 03m 27s)
  • 23:46 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 23:46 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 23:45 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 23:32 eileen___: config revision changed from f5668044 to 787cd0e0
  • 23:32 eileen___: civicrm upgraded from 497bddf7 to 1f91ac2d
  • 22:16 ryankemper@cumin1001: END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
  • 22:16 ryankemper@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host elastic1065.eqiad.wmnet with OS bullseye
  • 21:53 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1065.eqiad.wmnet with reason: host reimage
  • 21:50 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1065.eqiad.wmnet with reason: host reimage
  • 21:36 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1065.eqiad.wmnet with OS bullseye
  • 21:12 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1062.eqiad.wmnet with OS bullseye
  • 20:53 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1062.eqiad.wmnet with reason: host reimage
  • 20:50 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1062.eqiad.wmnet with reason: host reimage
  • 20:36 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1062.eqiad.wmnet with OS bullseye
  • 20:32 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 20:31 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 20:31 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 20:29 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
  • 20:28 cjming: end of UTC late backport window
  • 20:27 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 20:27 cjming@deploy1002: Synchronized php-1.39.0-wmf.23/skins/Vector/resources/skins.vector.styles/layouts/grid.less: Backport: Fix grid blowout bug (T314756) (duration: 03m 26s)
  • 20:12 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 20:11 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 20:11 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 20:11 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Disable sticky header edit A/B test for pilot wikis (T312296) (duration: 03m 35s)
  • 20:08 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 17:34 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1088.eqiad.wmnet with OS bullseye
  • 17:15 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1088.eqiad.wmnet with reason: host reimage
  • 17:12 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1088.eqiad.wmnet with reason: host reimage
  • 17:00 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1088.eqiad.wmnet with OS bullseye
  • 16:54 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1085.eqiad.wmnet with OS bullseye
  • 16:49 ryankemper@cumin1001: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
  • 16:43 pt1979@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 16:41 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
  • 16:39 pt1979@cumin2002: START - Cookbook sre.dns.netbox
  • 16:38 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
  • 16:26 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1085.eqiad.wmnet with OS bullseye
  • 16:24 bking@cumin1001: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host elastic1085.eqiad.wmnet with OS bullseye
  • 16:19 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
  • 16:16 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
  • 16:16 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
  • 16:14 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
  • 16:12 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
  • 16:10 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
  • 16:09 ryankemper@cumin1001: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
  • 16:04 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1085.eqiad.wmnet with OS bullseye
  • 16:00 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1084.eqiad.wmnet with OS bullseye
  • 15:58 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
  • 15:47 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1084.eqiad.wmnet with reason: host reimage
  • 15:46 sukhe: upload reprepro -C main include bullseye-wikimedia python-pynetbox_6.6.0-1+wmf11u1_amd64.changes
  • 15:45 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1084.eqiad.wmnet with reason: host reimage
  • 15:37 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2021.codfw.wmnet with reason: Maint
  • 15:37 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2021.codfw.wmnet with reason: Maint
  • 15:32 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1084.eqiad.wmnet with OS bullseye
  • 14:59 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
  • 14:55 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
  • 14:47 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cp5001.eqsin.wmnet with reason: depooled: faulty DIMM: T314256
  • 14:46 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on cp5001.eqsin.wmnet with reason: depooled: faulty DIMM: T314256
  • 14:34 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
  • 14:11 kevinbazira@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
  • 13:03 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 13:01 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 13:01 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 12:58 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 12:56 urbanecm@deploy1002: Synchronized wmf-config/CommonSettings.php: 77fd5ab: Growth: Add new rights to wgAvailableRights (duration: 03m 24s)
  • 12:30 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1102.eqiad.wmnet
  • 12:09 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 12:09 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 12:06 urbanecm@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/GrowthExperiments/: 3eaf155: MentorTools: Do not use MentorWeightManager (T314362) (duration: 03m 31s)
  • 12:04 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 11:43 btullis@cumin1001: START - Cookbook sre.hosts.reboot-single for host an-worker1102.eqiad.wmnet
  • 11:21 jelto@cumin1001: conftool action : set/pooled=yes; selector: name=kubernetes2022.codfw.wmnet
  • 11:21 jelto: kubectl uncordon kubernetes2022.codfw.wmnet
  • 10:43 Amir1: Removing db2079 from orchestrator (T313885)
  • 10:39 Amir1: Removing db2079 from zarcillo (T313885)
  • 10:35 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2079.codfw.wmnet
  • 10:35 ladsgroup@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 10:30 ladsgroup@cumin1001: START - Cookbook sre.dns.netbox
  • 10:25 ladsgroup@cumin1001: START - Cookbook sre.hosts.decommission for hosts db2079.codfw.wmnet
  • 10:18 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on db2079.codfw.wmnet with reason: Decom
  • 10:17 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on db2079.codfw.wmnet with reason: Decom
  • 08:41 jbond: deploy libtirpc update
  • 07:57 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1144:3314 (T312863)', diff saved to https://phabricator.wikimedia.org/P32310 and previous config saved to /var/cache/conftool/dbconfig/20220808-075723-ladsgroup.json
  • 07:57 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
  • 07:57 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
  • 07:57 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142 (T312863)', diff saved to https://phabricator.wikimedia.org/P32309 and previous config saved to /var/cache/conftool/dbconfig/20220808-075702-ladsgroup.json
  • 07:53 godog: grow sda/sdb 3 by 100G on thanos-be2001 - T314275
  • 07:50 godog: grow sda/sdb 3 by 100G on thanos-be1004 - T314275
  • 07:41 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P32308 and previous config saved to /var/cache/conftool/dbconfig/20220808-074156-ladsgroup.json
  • 07:32 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 07:27 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 07:27 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 07:26 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P32307 and previous config saved to /var/cache/conftool/dbconfig/20220808-072650-ladsgroup.json
  • 07:23 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 07:22 kartik@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: trwikivoyage: Create rollbacker user group (T314678) (duration: 03m 17s)
  • 07:18 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 07:17 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 07:17 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 07:16 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 07:11 elukey: restart rsyslog on ml-serve2007
  • 07:11 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 07:11 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142 (T312863)', diff saved to https://phabricator.wikimedia.org/P32306 and previous config saved to /var/cache/conftool/dbconfig/20220808-071144-ladsgroup.json
  • 07:10 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 07:10 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 07:09 kartik@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Enable SectionTranslation on 10 Wikipedias where ContentTranslation is default (T308829) (duration: 03m 15s)
  • 07:09 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 07:06 XioNoX: add CSP headers to Netbox - T296356
  • 07:05 elukey: restart rsyslog on ml-serve-ctrl2001

2022-08-07

  • 19:58 taavi: taavi@mwmaint1002 ~ $ echo "https://upload.wikimedia.org/wikipedia/commons/1/15/Keep_tidy_ask.svg" | mwscript purgeList.php --wiki enwiki # T314712
  • 13:52 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1142 (T312863)', diff saved to https://phabricator.wikimedia.org/P32305 and previous config saved to /var/cache/conftool/dbconfig/20220807-135204-ladsgroup.json
  • 13:51 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance
  • 13:51 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance
  • 13:51 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141 (T312863)', diff saved to https://phabricator.wikimedia.org/P32304 and previous config saved to /var/cache/conftool/dbconfig/20220807-135143-ladsgroup.json
  • 13:36 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P32303 and previous config saved to /var/cache/conftool/dbconfig/20220807-133637-ladsgroup.json
  • 13:21 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P32302 and previous config saved to /var/cache/conftool/dbconfig/20220807-132131-ladsgroup.json
  • 13:06 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141 (T312863)', diff saved to https://phabricator.wikimedia.org/P32301 and previous config saved to /var/cache/conftool/dbconfig/20220807-130625-ladsgroup.json
  • 12:06 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1141 (T312863)', diff saved to https://phabricator.wikimedia.org/P32300 and previous config saved to /var/cache/conftool/dbconfig/20220807-120610-ladsgroup.json
  • 12:06 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance
  • 12:05 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance
  • 12:05 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149 (T312863)', diff saved to https://phabricator.wikimedia.org/P32299 and previous config saved to /var/cache/conftool/dbconfig/20220807-120549-ladsgroup.json
  • 11:50 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P32298 and previous config saved to /var/cache/conftool/dbconfig/20220807-115043-ladsgroup.json
  • 11:35 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P32297 and previous config saved to /var/cache/conftool/dbconfig/20220807-113537-ladsgroup.json
  • 11:20 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149 (T312863)', diff saved to https://phabricator.wikimedia.org/P32296 and previous config saved to /var/cache/conftool/dbconfig/20220807-112031-ladsgroup.json

2022-08-06