
Server Admin Log: Difference between revisions

imported>Stashbot
(ebernhardson@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Bump cirrus MLR models to latest (duration: 01m 06s))
imported>Stashbot
(sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=varnish-fe)
(676 intermediate revisions by 4 users not shown)
== 2020-07-21 ==
== 2022-08-11 ==
* 23:37 ebernhardson@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Bump cirrus MLR models to latest (duration: 01m 06s)
* 00:58 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=varnish-fe
* 23:13 Urbanecm: Evening backport window done
* 00:58 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=ats-be
* 23:12 urbanecm@deploy1001: Synchronized wmf-config/CommonSettings.php: {{Gerrit|7a50168d54b5e86834606fb8d7880eb3a923ffd5}}: Updating UploadWizard template: PD-old-70-1923->PD-old-70-expired ([[phab:T258523|T258523]]) (duration: 01m 06s)
* 00:58 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=ats-tls
* 23:06 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|7acc9d966a07d589bb6aed5f801c9e1defc75fe1}}: Enable $wgWatchlistExpiry on testwiki ([[phab:T257506|T257506]]) (duration: 01m 08s)
* 00:57 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on cp2042.codfw.wmnet with reason: host down; depooled and will debug tomorrow
* 19:10 jhuneidi@deploy1001: rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.1
* 00:57 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on cp2042.codfw.wmnet with reason: host down; depooled and will debug tomorrow
* 19:02 catrope@deploy1001: Synchronized php-1.36.0-wmf.1/includes/Storage/PageUpdater.php: Fix handling of null edits ([[phab:T257766|T257766]]) (duration: 01m 06s)
* 19:01 catrope@deploy1001: Synchronized php-1.35.0-wmf.41/includes/Storage/PageUpdater.php: Fix handling of null edits ([[phab:T257766|T257766]]) (duration: 01m 11s)
* 18:33 jhuneidi@deploy1001: Finished scap: testwikis wikis to 1.36.0-wmf.1 (duration: 41m 22s)
* 18:27 ejegg: restored new URL for TY page in payments-wiki settings
* 18:22 mforns@deploy1001: Finished deploy [analytics/refinery@0c25de1] (thin): Redeploying to unbreak unique devices per domain monthly THIN [analytics/refinery@0c25de19a3a309276654b4463cca4f574336d8fd] (duration: 00m 07s)
* 18:22 mforns@deploy1001: Started deploy [analytics/refinery@0c25de1] (thin): Redeploying to unbreak unique devices per domain monthly THIN [analytics/refinery@0c25de19a3a309276654b4463cca4f574336d8fd]
* 18:21 mforns@deploy1001: Finished deploy [analytics/refinery@0c25de1]: Redeploying to unbreak unique devices per domain monthly - third try [analytics/refinery@0c25de19a3a309276654b4463cca4f574336d8fd] (duration: 00m 12s)
* 18:21 mforns@deploy1001: Started deploy [analytics/refinery@0c25de1]: Redeploying to unbreak unique devices per domain monthly - third try [analytics/refinery@0c25de19a3a309276654b4463cca4f574336d8fd]
* 18:17 mforns@deploy1001: Finished deploy [analytics/refinery@0c25de1]: Redeploying to unbreak unique devices per domain monthly - second try [analytics/refinery@0c25de19a3a309276654b4463cca4f574336d8fd] (duration: 00m 17s)
* 18:16 mforns@deploy1001: Started deploy [analytics/refinery@0c25de1]: Redeploying to unbreak unique devices per domain monthly - second try [analytics/refinery@0c25de19a3a309276654b4463cca4f574336d8fd]
* 18:13 mforns@deploy1001: Finished deploy [analytics/refinery@0c25de1]: Redeploying to unbreak unique devices per domain monthly [analytics/refinery@0c25de19a3a309276654b4463cca4f574336d8fd] (duration: 05m 32s)
* 18:08 mforns@deploy1001: Started deploy [analytics/refinery@0c25de1]: Redeploying to unbreak unique devices per domain monthly [analytics/refinery@0c25de19a3a309276654b4463cca4f574336d8fd]
* 17:52 jhuneidi@deploy1001: Started scap: testwikis wikis to 1.36.0-wmf.1
* 17:50 volans@cumin1001: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
* 17:45 volans@cumin1001: START - Cookbook sre.dns.netbox
* 17:10 jhuneidi@deploy1001: Pruned MediaWiki: 1.35.0-wmf.39 (duration: 16m 25s)
* 16:32 ppchelko@deploy1001: Finished deploy [restbase/deploy@4f3cb41]: Add new wikis to RESTBase, take 2 (duration: 04m 54s)
* 16:27 ppchelko@deploy1001: Started deploy [restbase/deploy@4f3cb41]: Add new wikis to RESTBase, take 2
* 16:27 ppchelko@deploy1001: Finished deploy [restbase/deploy@4f3cb41]: Add new wikis to RESTBase (duration: 10m 37s)
* 16:21 longma: 1.36.0-wmf.1 was branched at {{Gerrit|3a1faac3764ecae8dde813bd67a5a8e8f4975a85}} for [[phab:T257969|T257969]]
* 16:16 ppchelko@deploy1001: Started deploy [restbase/deploy@4f3cb41]: Add new wikis to RESTBase
* 15:16 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 15:12 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 15:10 moritzm: draining restbase1027 for eventual reboot for kernel security update
* 15:09 godog: poweroff ms-be1024 for bbu replacement - [[phab:T257949|T257949]]
* 15:08 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 15:08 filippo@cumin1001: START - Cookbook sre.hosts.downtime
* 15:04 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 15:01 vgutierrez: show a synthetic warning for traffic using ECDHE-RSA-AES128-SHA - [[phab:T258405|T258405]]
* 15:01 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 15:00 moritzm: draining restbase1026 for eventual reboot for kernel security update
* 14:57 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 14:57 jmm@cumin2001: START - Cookbook sre.hosts.downtime
* 14:56 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 14:52 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 14:51 moritzm: draining restbase1025 for eventual reboot for kernel security update
* 14:48 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 14:44 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 14:35 akosiaris@cumin1001: conftool action : set/weight=10; selector: dc=codfw,service=mobileapps,name=scb.*
* 14:35 akosiaris: decrease codfw mobileapps kubernetes traffic to 72% [[phab:T218733|T218733]]. Weird latency patterns exhibited when 92% was reached. See https://grafana.wikimedia.org/d/5CmeRcnMz/mobileapps?panelId=34&fullscreen&orgId=1&from=1595338489749&to=1595342071227&var-dc=codfw%20prometheus%2Fk8s&var-service=mobileapps&var-container_name=All
* 14:35 moritzm: draining restbase1024 for eventual reboot for kernel security update
* 14:32 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1119', diff saved to https://phabricator.wikimedia.org/P11994 and previous config saved to /var/cache/conftool/dbconfig/20200721-143204-marostegui.json
* 14:26 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1119', diff saved to https://phabricator.wikimedia.org/P11993 and previous config saved to /var/cache/conftool/dbconfig/20200721-142634-marostegui.json
* 14:24 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 14:24 jmm@cumin2001: START - Cookbook sre.hosts.downtime
* 14:23 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 14:19 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 14:18 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1119', diff saved to https://phabricator.wikimedia.org/P11992 and previous config saved to /var/cache/conftool/dbconfig/20200721-141813-marostegui.json
* 14:16 moritzm: draining restbase1023 for eventual reboot for kernel security update
* 14:10 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 14:06 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 14:06 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 14:06 jmm@cumin2001: START - Cookbook sre.hosts.downtime
* 14:03 moritzm: draining restbase1022 for eventual reboot for kernel security update
* 14:01 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 13:57 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 13:55 moritzm: draining restbase1021 for eventual reboot for kernel security update
* 13:51 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 13:50 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1119', diff saved to https://phabricator.wikimedia.org/P11991 and previous config saved to /var/cache/conftool/dbconfig/20200721-135028-marostegui.json
* 13:48 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 13:46 moritzm: draining restbase1020 for eventual reboot for kernel security update
* 13:42 akosiaris@cumin1001: conftool action : set/weight=1; selector: dc=codfw,service=mobileapps,name=scb.*
* 13:41 akosiaris: increase codfw mobileapps kubernetes traffic to 96% [[phab:T218733|T218733]]
* 13:41 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 13:41 jmm@cumin2001: START - Cookbook sre.hosts.downtime
* 13:15 Amir1: end of ladsgroup@mwmaint1002:~$ foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https ([[phab:T258472|T258472]] [[phab:T258473|T258473]])
* 13:13 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 13:11 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
* 13:10 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 13:06 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 13:03 moritzm: draining restbase1019 for eventual reboot for kernel security update
* 13:01 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 13:01 jmm@cumin2001: START - Cookbook sre.hosts.downtime
* 12:55 Amir1: start of ladsgroup@mwmaint1002:~$ foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https ([[phab:T258472|T258472]] [[phab:T258473|T258473]])
* 12:54 marostegui: Stop haproxy on dbproxy1012 - [[phab:T255408|T255408]]
* 12:13 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1087', diff saved to https://phabricator.wikimedia.org/P11988 and previous config saved to /var/cache/conftool/dbconfig/20200721-121302-marostegui.json
* 12:05 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 12:01 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 11:53 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 11:49 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 11:45 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 11:41 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 11:25 Urbanecm: EU B&C window done
* 11:24 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|7b96c7ea35557888c6cec2dd19768c246bff804b}}: Enable botpasswords at checkuserwiki and stewardwiki ([[phab:T258358|T258358]], [[phab:T258355|T258355]]) (duration: 00m 57s)
* 11:11 Urbanecm: Create bot_passwords table at checkuserwiki ([[phab:T258358|T258358]])
* 11:10 Urbanecm: Create bot_passwords table at stewardwiki ([[phab:T258355|T258355]])
* 11:09 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|5d5bb37c342310be5ca0b0e11a8490703867f4fd}}: Enable Vector opt in preference everywhere ([[phab:T254228|T254228]]) (duration: 00m 57s)
* 11:08 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1085 [[phab:T258360|T258360]]', diff saved to https://phabricator.wikimedia.org/P11987 and previous config saved to /var/cache/conftool/dbconfig/20200721-110854-marostegui.json
* 11:00 effie: enable puppet on P:mediawiki::mcrouter_wancache - [[phab:T247956|T247956]]
* 10:58 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1085 [[phab:T258360|T258360]]', diff saved to https://phabricator.wikimedia.org/P11986 and previous config saved to /var/cache/conftool/dbconfig/20200721-105852-marostegui.json
* 10:45 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1085 [[phab:T258360|T258360]]', diff saved to https://phabricator.wikimedia.org/P11985 and previous config saved to /var/cache/conftool/dbconfig/20200721-104546-marostegui.json
* 10:34 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1085', diff saved to https://phabricator.wikimedia.org/P11984 and previous config saved to /var/cache/conftool/dbconfig/20200721-103430-marostegui.json
* 10:20 effie: disable puppet on P:mediawiki::mcrouter_wancache - [[phab:T247956|T247956]]
* 10:13 effie: enable puppet on wtp*
* 10:02 marostegui: Analyze revision table on db1119 [[phab:T258480|T258480]]
* 10:02 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1119 [[phab:T258480|T258480]]', diff saved to https://phabricator.wikimedia.org/P11983 and previous config saved to /var/cache/conftool/dbconfig/20200721-100159-marostegui.json
* 09:59 akosiaris: move all codfw mobileapps nodes (kubernetes and scb) to weight 10. Traffic level remains at 72.727272% flowing to kubernetes, the rest to scb [[phab:T218733|T218733]]
* 09:59 akosiaris: move all codfw mobileapps nodes (kubernetes and scb) to weight 10. Traffic level remains at 72.727272% flowing to kubernetes, the rest to scb
* 09:59 effie: disable puppet on wtp* to merge 613307
* 09:58 akosiaris@cumin1001: conftool action : set/weight=10; selector: dc=codfw,service=mobileapps
* 09:58 akosiaris: increase codfw mobileapps kubernetes traffic to 72.727272% [[phab:T218733|T218733]]
* 09:57 akosiaris@cumin1001: conftool action : set/weight=1; selector: dc=codfw,service=mobileapps,name=scb.*
* 09:44 elukey: add term 'idp' to analytics-in4/6 filters on cr1-eqiad and cr2-eqiad (ref: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/615160)
* 09:21 kormat@cumin1001: dbctl commit (dc=all): 'Re-pool es1020 at 25% in es4 [[phab:T257284|T257284]]', diff saved to https://phabricator.wikimedia.org/P11982 and previous config saved to /var/cache/conftool/dbconfig/20200721-092126-kormat.json
* 08:37 akosiaris: increase codfw mobileapps kubernetes traffic to 47% [[phab:T218733|T218733]]
* 08:34 akosiaris@cumin1001: conftool action : set/weight=3; selector: dc=codfw,service=mobileapps,name=scb.*
* 08:28 kormat@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 08:26 kormat@cumin1001: START - Cookbook sre.hosts.downtime
* 08:08 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1119', diff saved to https://phabricator.wikimedia.org/P11980 and previous config saved to /var/cache/conftool/dbconfig/20200721-080842-marostegui.json
* 07:52 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1119', diff saved to https://phabricator.wikimedia.org/P11979 and previous config saved to /var/cache/conftool/dbconfig/20200721-075233-marostegui.json
* 07:49 marostegui: Deploy schema change on db1087, lag will appear on s8 (wikidata) on labsdb hosts [[phab:T256685|T256685]]
* 07:48 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1087 [[phab:T256685|T256685]]', diff saved to https://phabricator.wikimedia.org/P11978 and previous config saved to /var/cache/conftool/dbconfig/20200721-074843-marostegui.json
* 07:37 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1119', diff saved to https://phabricator.wikimedia.org/P11977 and previous config saved to /var/cache/conftool/dbconfig/20200721-073757-marostegui.json
* 07:29 kormat@deploy1001: Synchronized wmf-config/db-eqiad.php: Re-enable writes to es4 [[phab:T257847|T257847]] (duration: 00m 57s)
* 07:22 kormat@cumin1001: dbctl commit (dc=all): 'Depool es1020 from es4 [[phab:T257847|T257847]]', diff saved to https://phabricator.wikimedia.org/P11976 and previous config saved to /var/cache/conftool/dbconfig/20200721-072251-kormat.json
* 07:21 kormat@cumin1001: dbctl commit (dc=all): 'Promote es1021 to es4 master [[phab:T257847|T257847]]', diff saved to https://phabricator.wikimedia.org/P11975 and previous config saved to /var/cache/conftool/dbconfig/20200721-072127-kormat.json
* 07:13 kormat: killing James_F('s script) on mwmaint1002
* 07:06 _joe_: systemctl reset-failed on deneb, the usual known issue with releng image reporting
* 07:03 kormat@deploy1001: Synchronized wmf-config/db-eqiad.php: Disable writes to es4 [[phab:T257847|T257847]] (duration: 01m 00s)
* 06:59 kormat: Starting es4 failover from es1020 to es1021 [[phab:T257847|T257847]]
* 06:54 kormat@cumin1001: dbctl commit (dc=all): 'Set es1021 to weight 50 [[phab:T257847|T257847]]', diff saved to https://phabricator.wikimedia.org/P11974 and previous config saved to /var/cache/conftool/dbconfig/20200721-065457-kormat.json
* 06:54 marostegui: Pool db1119 into enwiki with MCR schema change done - [[phab:T238966|T238966]]
* 06:54 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1119', diff saved to https://phabricator.wikimedia.org/P11973 and previous config saved to /var/cache/conftool/dbconfig/20200721-065430-marostegui.json
* 06:27 _joe_: systemctl reset-failed on lists1001, a network interface had been failing for 1 month
* 06:26 _joe_: enabling notifications for lists1001
* 06:23 _joe_: systemctl reset-failed on both centrallogs
* 02:43 eileen: civicrm revision changed from {{Gerrit|7f1e7d8e38}} to {{Gerrit|cc5d17fbaf}}, config revision is {{Gerrit|23460676f6}}
* 00:02 ryankemper: Began Elasticsearch reindex job on index `dewiki_content` across [`eqiad`, `codfw`, `cloudelastic`], on `rkemper@mwmaint1002` under tmux session `reindex`. Should complete in <24 hours
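
The akosiaris entries for [[phab:T218733|T218733]] pair each conftool weight change on the scb backends with a resulting percentage of codfw mobileapps traffic on Kubernetes. The arithmetic behind those percentages can be sketched as below. This is only an illustration under two assumptions that the log does not state: that traffic is split in proportion to the summed weights, and that there are 8 Kubernetes backends and 3 scb backends. Those counts are hypothetical and were picked because an 8:3 ratio reproduces the logged figures; the function name is likewise made up for this sketch.

```python
def k8s_share(k8s_nodes: int, k8s_weight: int, scb_nodes: int, scb_weight: int) -> float:
    """Percentage of traffic reaching Kubernetes, assuming the split is
    proportional to the summed backend weights."""
    k8s_total = k8s_nodes * k8s_weight
    scb_total = scb_nodes * scb_weight
    return 100.0 * k8s_total / (k8s_total + scb_total)


# Hypothetical backend counts (not given in the log).
K8S_NODES = 8
SCB_NODES = 3

# (log entry, Kubernetes weight, scb weight) in chronological order.
# The Kubernetes weight of 1 before the "all nodes to weight 10" entry
# is also an assumption.
STEPS = [
    ("07-20 16:27 scb weight=8", 1, 8),
    ("07-21 08:34 scb weight=3", 1, 3),
    ("07-21 09:57 scb weight=1", 1, 1),
    ("07-21 09:58 all weight=10", 10, 10),
    ("07-21 13:42 scb weight=1", 10, 1),
    ("07-21 14:35 scb weight=10", 10, 10),
]

for label, k8s_weight, scb_weight in STEPS:
    share = k8s_share(K8S_NODES, k8s_weight, SCB_NODES, scb_weight)
    print(f"{label}: {share:.6f}% to kubernetes")
```

Under these assumptions the steps come out at roughly 25%, 47%, 72.7%, 72.7%, 96% and 72.7%, in line with the logged values. Raising every backend to weight 10 leaves the ratio unchanged, which is consistent with the 09:59 note that the traffic level "remains at 72.727272%".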


== 2020-07-20 ==
== 2022-08-10 ==
* 23:49 eileen: tools revision changed from {{Gerrit|b915d8efbd}} to {{Gerrit|22550f38c5}}
* 21:25 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1016.eqiad.wmnet
* 23:34 ejegg: updated fundraising CiviCRM from {{Gerrit|8b09c87ce2}} to {{Gerrit|7f1e7d8e38}}
* 21:23 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1014.eqiad.wmnet
* 23:12 urbanecm@deploy1001: Synchronized php-1.35.0-wmf.41/extensions/ProofreadPage/ProofreadPage.namespaces.php: {{Gerrit|03ed74f0b9b8f55d01f9112c31f2f6ea17990f9c}}: Add ProofreadPage namespace translation for lij ([[phab:T257672|T257672]]) (duration: 00m 57s)
* 21:10 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 16 hosts with reason: [[phab:T309810|T309810]]
* 23:06 Urbanecm: run mwscript namespaceDupes.php --wiki=lijwikisource -- fix ([[phab:T257672|T257672]])
* 21:10 bking@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on 16 hosts with reason: [[phab:T309810|T309810]]
* 23:05 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|2147774caaa0819f8b5d71cc16bc021d94677702}}: Add English aliases for WS-specific namespaces to lijwikisource ([[phab:T257672|T257672]]) (duration: 00m 57s)
* 21:09 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on elastic[1101-1102].eqiad.wmnet with reason: [[phab:T309810|T309810]]
* 22:59 ryankemper@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 613669: cirrussearch: Allow 2 dewiki->content shards/node {{!}} https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/613669 (duration: 00m 57s)
* 21:09 bking@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on elastic[1101-1102].eqiad.wmnet with reason: [[phab:T309810|T309810]]
* 21:53 eileen: tools revision changed from {{Gerrit|40d52a0008}} to {{Gerrit|b915d8efbd}}
* 21:00 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 21:15 sbassett: Revised mitigation deployed for [[phab:T257687|T257687]]
* 21:00 cjming: end of UTC late backport window
* 20:07 eileen: tools revision changed from {{Gerrit|711d671600}} to {{Gerrit|40d52a0008}}
* 20:59 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 19:10 mforns@deploy1001: Finished deploy [analytics/refinery@af86a05] (thin): Regular analytics weekly train THIN [analytics/refinery@af86a05be470ed8283f6585afb5cc231b26944a2] (duration: 00m 07s)
* 20:59 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 19:10 mforns@deploy1001: Started deploy [analytics/refinery@af86a05] (thin): Regular analytics weekly train THIN [analytics/refinery@af86a05be470ed8283f6585afb5cc231b26944a2]
* 20:59 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820533{{!}}Remove unused $wgEnableMWSuggest]] (duration: 03m 04s)
* 19:09 mforns@deploy1001: Finished deploy [analytics/refinery@af86a05]: Regular analytics weekly train [analytics/refinery@af86a05be470ed8283f6585afb5cc231b26944a2] (duration: 05m 46s)
* 20:58 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 19:03 mforns@deploy1001: Started deploy [analytics/refinery@af86a05]: Regular analytics weekly train [analytics/refinery@af86a05be470ed8283f6585afb5cc231b26944a2]
* 20:56 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820568{{!}}Enable new topic tool on dewiki (T313699)]] (duration: 03m 01s)
* 18:37 urbanecm@deploy1001: Synchronized wmf-config/CommonSettings.php: {{Gerrit|df2584f181f08da0e1191f97e619e912e587b48d}}: Switch $wgUrlShortenerDomainsWhitelist --> $wgUrlShortenerAllowedDomains ([[phab:T255491|T255491]]) (duration: 00m 57s)
* 20:34 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:822093{{!}}testwiki: set $wgCdnMatchParameterOrder to false (T314868)]] (duration: 03m 20s)
* 18:26 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|dfed4727c6f9e003f9e1949b2995a0cf0ad4f1cc}}: Adding rollbacker group for arzwiki ([[phab:T258100|T258100]]) (duration: 00m 57s)
* 20:31 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 18:24 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|ee7ac95e16f55e850b318f7354842795e08e0270}}: Change of rollbacker group settings at jawiki ([[phab:T258339|T258339]]) (duration: 00m 57s)
* 20:30 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 17:36 ejegg: updated payments-wiki settings to point TY page at new URL
* 20:30 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 16:32 ebernhardson@deploy1001: Finished deploy [wikimedia/discovery/analytics@10afb4b]: airflow: Turn off catchup on cirrus_namespace_map (duration: 00m 25s)
* 20:30 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 16:31 ebernhardson@deploy1001: Started deploy [wikimedia/discovery/analytics@10afb4b]: airflow: Turn off catchup on cirrus_namespace_map
* 20:19 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 16:27 akosiaris: increase codfw mobileapps kubernetes traffic to 25% [[phab:T218733|T218733]]. Take #2
* 20:18 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 16:27 akosiaris@cumin1001: conftool action : set/weight=8; selector: dc=codfw,service=mobileapps,name=scb.*
* 20:18 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 15:59 elukey: restart airflow-webserver/scheduler to pick up TLS to mysql settings
* 20:17 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 15:21 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 20:12 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 15:21 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime
* 20:09 bking@cumin1001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
* 15:17 hnowlan: draining and restarting sessionstore2002
* 20:08 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 15:17 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 20:08 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 15:17 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime
* 20:08 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820646{{!}}Start writing to cuc_actor everywhere except s4 and s8 (T233004)]] (duration: 03m 15s)
* 15:16 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 20:07 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 15:13 jynus: dropping and recreating nagios@localhost users on all m1 servers
* 19:51 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2053-2054].codfw.wmnet
* 15:12 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 19:51 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc[2053-2054].codfw.wmnet
* 15:09 hnowlan: draining and restarting sessionstore2001
* 19:35 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2019-2020].codfw.wmnet
* 15:09 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 19:35 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for parse[2019-2020].codfw.wmnet
* 15:09 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime
* 19:35 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2016-2018].codfw.wmnet
* 15:09 hnowlan@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
* 19:35 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for parse[2016-2018].codfw.wmnet
* 15:09 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime
* 19:34 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2036.codfw.wmnet
* 15:08 moritzm: draining restbase2023 for eventual reboot for kernel security update
* 19:34 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc2036.codfw.wmnet
* 15:04 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 19:28 sukhe: testing ATS 9.1.3-1wm1 on cp4026: [[phab:T309651|T309651]]
* 15:00 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 19:09 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1087.eqiad.wmnet with OS bullseye
* 14:56 moritzm: draining restbase2022 for eventual reboot for kernel security update
* 19:06 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1086.eqiad.wmnet with OS bullseye
* 14:56 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 18:55 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1087.eqiad.wmnet with reason: host reimage
* 14:56 jmm@cumin2001: START - Cookbook sre.hosts.downtime
* 18:51 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1086.eqiad.wmnet with reason: host reimage
* 14:54 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 18:50 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1087.eqiad.wmnet with reason: host reimage
* 14:52 hnowlan: draining and restarting sessionstore1003
* 18:49 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1086.eqiad.wmnet with reason: host reimage
* 14:52 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 18:47 bking@cumin1001: START - Cookbook sre.wdqs.data-transfer
* 14:52 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime
* 18:38 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1087.eqiad.wmnet with OS bullseye
* 14:51 mholloway-shell@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' .
* 18:36 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1086.eqiad.wmnet with OS bullseye
* 14:51 mholloway-shell@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
* 18:22 urandom: truncating Cassandra hints (eqiad datacenter) -- [[phab:T314941|T314941]]
* 14:49 mholloway-shell@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
* 18:13 urandom: truncating codfw Cassandra hints (eqiad datacenter)  -- [[phab:T314941|T314941]]
* 14:49 mholloway-shell@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' .
* 18:07 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main2005.codfw.wmnet
* 14:49 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 18:07 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for kafka-main2005.codfw.wmnet
* 14:47 mholloway-shell@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
* 18:05 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repool D8 DBs after PDU maint ([[phab:T310146|T310146]])', diff saved to https://phabricator.wikimedia.org/P32346 and previous config saved to /var/cache/conftool/dbconfig/20220810-180529-ladsgroup.json
* 14:47 moritzm: draining restbase2021 for eventual reboot for kernel security update
* 17:42 otto@deploy1002: Finished deploy [analytics/refinery@6e47e0e]: Add missing changes to the deletion script - [[phab:T270433|T270433]] -  [analytics/refinery@6e47e0e] (duration: 05m 28s)
* 14:44 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 17:39 fnegri@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts labweb1002.wikimedia.org
* 14:43 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime
* 17:39 fnegri@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:37 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 17:36 otto@deploy1002: Started deploy [analytics/refinery@6e47e0e]: Add missing changes to the deletion script - [[phab:T270433|T270433]] - [analytics/refinery@6e47e0e]
* 14:36 mholloway-shell@deploy1001: Finished deploy [mobileapps/deploy@ff49fdf]: Update mobileapps to {{Gerrit|0bf7bafa}} (duration: 03m 50s)
* 17:35 fnegri@cumin1001: START - Cookbook sre.dns.netbox
* 14:34 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 17:34 otto@deploy1002: Finished deploy [analytics/refinery@6e47e0e] (hadoop-test): Add missing changes to the deletion script - [[phab:T270433|T270433]] - TEST [analytics/refinery@6e47e0e] (duration: 04m 19s)
* 14:34 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime
* 17:30 fnegri@cumin1001: START - Cookbook sre.hosts.decommission for hosts labweb1002.wikimedia.org
* 14:34 hnowlan: starting drain and restart of sessionstore hosts for new kernel
* 17:30 otto@deploy1002: Started deploy [analytics/refinery@6e47e0e] (hadoop-test): Add missing changes to the deletion script - [[phab:T270433|T270433]] - TEST [analytics/refinery@6e47e0e]
* 14:33 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 17:09 dzahn@cumin2002: START - Cookbook sre.dns.netbox
* 14:32 mholloway-shell@deploy1001: Started deploy [mobileapps/deploy@ff49fdf]: Update mobileapps to {{Gerrit|0bf7bafa}}
* 17:08 otto@deploy1002: Started deploy [analytics/refinery@d4dd7e4] (hadoop-test): Add safety limits to refinery-drop-older-than - [[phab:T270433|T270433]] - TEST [analytics/refinery@d4dd7e4]
* 14:26 moritzm: draining restbase2020 for eventual reboot for kernel security update
* 17:06 sukhe: testing ATS 9.1.3-1wm1 on cp4032: [[phab:T309651|T309651]]
* 14:23 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 17:06 urandom: flushing RESTBase Cassandra tables -row B- to (temporarily) free instance-data space -- [[phab:T314941|T314941]]
* 14:23 jmm@cumin2001: START - Cookbook sre.hosts.downtime
* 17:05 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on krb2001.codfw.wmnet with reason: btullis codfw maintenance
* 14:20 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 17:05 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on krb2001.codfw.wmnet with reason: btullis codfw maintenance
* 14:17 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 17:04 dzahn@cumin2002: START - Cookbook sre.hosts.decommission for hosts gerrit2001.wikimedia.org
* 14:14 moritzm: draining restbase2019 for eventual reboot for kernel security update
* 17:02 sukhe: testing ATS 9.1.3-1wm1 on cp6008: [[phab:T309651|T309651]]
* 14:08 ema: lvs101[34] (primaries) - restart pybal to apply varnish healthcheck changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/610047 [[phab:T255015|T255015]]
* 16:56 sukhe: testing ATS 9.1.3-1wm1 on cp6016: [[phab:T309651|T309651]]
* 14:07 ema: lvs1016 (secondary) - restart pybal to apply varnish healthcheck changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/610047 [[phab:T255015|T255015]]
* 16:55 fnegri@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts labweb1001.wikimedia.org
* 14:06 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 16:55 fnegri@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:02 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 16:32 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts gerrit2001.wikimedia.org
* 13:59 ema: lvs300[56] (primaries) - restart pybal to apply varnish healthcheck changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/610047 [[phab:T255015|T255015]]
* 16:32 dzahn@cumin2002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
* 13:57 ema: lvs3007 (secondary) - restart pybal to apply varnish healthcheck changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/610047 [[phab:T255015|T255015]]
* 16:32 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes[2013-2014].codfw.wmnet
* 13:50 ema: lvs500[12] (primaries) - restart pybal to apply varnish healthcheck changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/610047 [[phab:T255015|T255015]]
* 16:31 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for kubernetes[2013-2014].codfw.wmnet
* 13:48 moritzm: draining restbase2018 for eventual reboot for kernel security update
* 16:31 jelto: kubectl uncordon kubernetes2014.codfw.wmnet
* 13:47 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 16:31 fnegri@cumin1001: START - Cookbook sre.dns.netbox
* 13:47 jmm@cumin2001: START - Cookbook sre.hosts.downtime
* 16:30 jelto: kubectl uncordon kubernetes2013.codfw.wmnet
* 13:47 ema: lvs5003 (secondary) - restart pybal to apply varnish healthcheck changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/610047 [[phab:T255015|T255015]]
* 16:29 urandom: restarting Cassandra (RESTBase) -row A- to apply r822110 -- [[phab:T314941|T314941]]
* 13:44 ema: lvs200[78] (primaries) - restart pybal to apply varnish healthcheck changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/610047 [[phab:T255015|T255015]]
* 16:27 dzahn@cumin2002: START - Cookbook sre.dns.netbox
* 13:42 ema: lvs2010 (secondary) - restart pybal to apply varnish healthcheck changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/610047 [[phab:T255015|T255015]]
* 16:25 fnegri@cumin1001: START - Cookbook sre.hosts.decommission for hosts labweb1001.wikimedia.org
* 13:34 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 16:23 mutante: shutting down gerrit2001
* 13:31 ema: lvs400[56] (primaries) - restart pybal to apply varnish healthcheck changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/610047 [[phab:T255015|T255015]]
* 16:23 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2034-2035].codfw.wmnet
* 13:31 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 16:23 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc[2034-2035].codfw.wmnet
* 13:27 moritzm: draining restbase2017 for eventual reboot for kernel security update
* 16:22 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2016-2018].codfw.wmnet
* 13:24 ema: lvs4007 (secondary) - restart pybal to apply varnish healthcheck changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/610047 [[phab:T255015|T255015]]
* 16:22 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for parse[2016-2018].codfw.wmnet
* 13:22 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 16:16 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=maps2010.codfw.wmnet
* 13:16 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 16:16 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=sessionstore2003.codfw.wmnet
* 13:09 moritzm: draining restbase2016 for eventual reboot for kernel security update
* 16:13 sukhe: reprepro -C component/trafficserver9 include buster-wikimedia trafficserver_9.1.3-1wm1_amd64.changes: [[phab:T309651|T309651]]
* 13:08 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 16:13 dzahn@cumin2002: START - Cookbook sre.hosts.decommission for hosts gerrit2001.wikimedia.org
* 13:08 jmm@cumin2001: START - Cookbook sre.hosts.downtime
* 16:11 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be[2039,2050,2056,2059].codfw.wmnet,thanos-be2004.codfw.wmnet with reason: PDU work
* 13:07 moritzm: reset broken ifup systemd states on puppetdb* hosts
* 16:10 mvernon@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be[2039,2050,2056,2059].codfw.wmnet,thanos-be2004.codfw.wmnet with reason: PDU work
* 13:05 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 16:09 urandom: flushing tables in row D (RESTBase Cassandra cluster)  -- [[phab:T314941|T314941]]
* 13:01 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 15:54 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for gitlab-runner2004.codfw.wmnet
* 12:59 Urbanecm: creating arywiki ([[phab:T257674|T257674]]), lijwikisource ([[phab:T257672|T257672]]), sysop_itwiki ([[phab:T256545|T256545]]) done
* 15:54 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for gitlab-runner2004.codfw.wmnet
* 12:59 moritzm: draining restbase2015 for eventual reboot for kernel security update
* 15:53 sukhe: poweroff cp2041, 42 for PDU upgrade: rack D7
* 12:56 Urbanecm: Create Daimona Eaytoy at sysop_itwiki ([[phab:T256545|T256545]])
* 15:51 urandom: flushing tables in row B (RESTBase Cassandra cluster)  -- [[phab:T314941|T314941]]
* 12:55 urbanecm@deploy1001: Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 01m 59s)
* 15:49 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2004.codfw.wmnet with reason: PDU maintenance
* 12:50 urbanecm@deploy1001: Synchronized static/images/project-logos/: Creating sysop_itwiki ([[phab:T256545|T256545]]) (duration: 00m 57s)
* 15:49 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2004.codfw.wmnet with reason: PDU maintenance
* 12:49 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Creating sysop_itwiki ([[phab:T256545|T256545]]) (duration: 00m 57s)
* 15:46 urandom: flushing tables in row A (RESTBase Cassandra cluster)  -- [[phab:T314941|T314941]]
* 12:48 urbanecm@deploy1001: rebuilt and synchronized wikiversions files: Creating sysop_itwiki ([[phab:T256545|T256545]])
* 15:46 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2012.codfw.wmnet with reason: btullis codfw maintenance
* 12:46 urbanecm@deploy1001: Synchronized dblists: Creating sysop_itwiki ([[phab:T256545|T256545]]) (duration: 00m 57s)
* 15:46 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp[2041-2042].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
* 12:46 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 15:46 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2012.codfw.wmnet with reason: btullis codfw maintenance
* 12:43 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 15:46 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 3:00:00 on cp[2041-2042].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
* 12:40 moritzm: draining restbase2014 for eventual reboot for kernel security update
* 15:46 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2011.codfw.wmnet with reason: btullis codfw maintenance
* 12:38 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 15:46 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2011.codfw.wmnet with reason: btullis codfw maintenance
* 12:38 jmm@cumin2001: START - Cookbook sre.hosts.downtime
* 15:45 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2010.codfw.wmnet with reason: btullis codfw maintenance
* 12:35 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 15:45 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2010.codfw.wmnet with reason: btullis codfw maintenance
* 12:34 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Creating lijwikisource ([[phab:T257672|T257672]]) (duration: 00m 57s)
* 15:45 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2009.codfw.wmnet with reason: btullis codfw maintenance
* 12:32 urbanecm@deploy1001: rebuilt and synchronized wikiversions files: Creating lijwikisource ([[phab:T257672|T257672]])
* 15:45 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2009.codfw.wmnet with reason: btullis codfw maintenance
* 12:31 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 15:37 urandom: (ephemerally) increasing hinted hand-off delivery rate limit to 16KB, RESTBase eqiad nodes  -- [[phab:T314941|T314941]]
* 12:30 urbanecm@deploy1001: Synchronized dblists: Creating lijwikisource ([[phab:T257672|T257672]]) (duration: 00m 56s)
* 15:34 jbond: remove puppetmaster[12]002 from production
* 12:28 urbanecm@deploy1001: Synchronized dblists/rtl.dblist: Add arywiki to rtl.dblist ([[phab:T257674|T257674]]) (duration: 00m 57s)
* 15:30 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main2004.codfw.wmnet
* 12:27 moritzm: draining restbase2013 for eventual reboot for kernel security update
* 15:30 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for kafka-main2004.codfw.wmnet
* 12:27 urbanecm@deploy1001: sync-file aborted: (no justification provided) (duration: 00m 00s)
* 15:20 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2051-2052].codfw.wmnet
* 12:21 urbanecm@deploy1001: Synchronized langlist: Creating arywiki ([[phab:T257674|T257674]]) (duration: 00m 56s)
* 15:20 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc[2051-2052].codfw.wmnet
* 12:20 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Creating arywiki ([[phab:T257674|T257674]]) (duration: 00m 56s)
* 15:17 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc-gp2003.codfw.wmnet
* 12:19 urbanecm@deploy1001: Synchronized static/images/project-logos/: Creating arywiki ([[phab:T257674|T257674]]) (duration: 00m 57s)
* 15:17 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc-gp2003.codfw.wmnet
* 12:17 urbanecm@deploy1001: rebuilt and synchronized wikiversions files: Creating arywiki ([[phab:T257674|T257674]])
* 15:16 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2033.codfw.wmnet
* 12:16 urbanecm@deploy1001: Synchronized dblists: Creating arywiki ([[phab:T257674|T257674]]) (duration: 00m 57s)
* 15:16 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc2033.codfw.wmnet
* 12:02 moritzm: installing qemu security updates on buster
* 15:14 _joe_: power off krb2002
* 11:50 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|946bf3d239f278b4e099f5dec676f5e2be61d8ca}}: Update brwikimedia logo and add upscaled versions (config) ([[phab:T257925|T257925]]) (duration: 00m 57s)
* 15:14 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on krb2002.codfw.wmnet with reason: PDU maintenance
* 11:49 urbanecm@deploy1001: sync-file aborted: (no justification provided) (duration: 00m 00s)
* 15:13 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on krb2002.codfw.wmnet with reason: PDU maintenance
* 11:49 Urbanecm: Purge 'https://en.wikipedia.org/static/images/project-logos/bnwikimedia.png'
* 15:13 _joe_: shutting down rdb2010,puppetmaster2002 for d5 maintenance
* 11:46 urbanecm@deploy1001: Synchronized static/images/project-logos/: {{Gerrit|f7560b6061dd3a60ccf56c916ebf70a3f104bea7}}: Update brwikimedia logo and add upscaled versions ([[phab:T257925|T257925]]) (duration: 00m 56s)
* 15:02 jelto: power off mc2035
* 11:44 urbanecm@deploy1001: Synchronized wmf-config/CommonSettings.php: {{Gerrit|5b97a06fa2e9a06c251a9c1fd2ddd9beec01a683}}: Set $wgUrlShortenerAllowedDomains for all wikis ([[phab:T258134|T258134]]) (duration: 00m 57s)
* 15:01 jelto: power off mc2034
* 11:42 urbanecm@deploy1001: sync-file aborted: (no justification provided) (duration: 00m 00s)
* 15:01 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mc2035.codfw.wmnet with reason: PDU swap
* 11:36 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|c12f1dee6b9888849c64312c2a4fd65ecbd4091e}}: Remove wgPopupsPageBlacklist config setting ([[phab:T254676|T254676]]) (duration: 00m 57s)
* 15:01 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on mc2035.codfw.wmnet with reason: PDU swap
* 11:35 Lucas_WMDE: lucaswerkmeister-wmde@mwmaint1002:~$ mwscript createAndPromote.php testwikidatawiki --custom-groups=interface-admin --force 'Lucas Werkmeister (WMDE)'
* 15:01 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mc2034.codfw.wmnet with reason: PDU swap
* 11:34 urbanecm@deploy1001: sync-file aborted: (no justification provided) (duration: 00m 01s)
* 15:01 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on mc2034.codfw.wmnet with reason: PDU swap
* 11:25 Urbanecm: mwscript namespaceDupes.php --wiki=kowikiquote  --fix ([[phab:T255031|T255031]])
* 14:43 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: PDU Maint ([[phab:T310146|T310146]])
* 11:24 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|3719668511231589b4fc6a723ccdfa772068ad5f}}: Add NamespaceAliases for kowikiquote ([[phab:T255031|T255031]]) (duration: 00m 57s)
* 14:43 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: PDU Maint ([[phab:T310146|T310146]])
* 11:22 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|bc5671a90c65b66989e470fc41225986b2ec9fb5}}: Add media.farsnews.ir to the wgCopyUploadsDomains allowlist of Wikimedia Commons ([[phab:T253800|T253800]]) (duration: 00m 57s)
* 14:38 urandom: disabling reserved space on eqiad nodes (RESTBase), /dev/md2 (aka /srv/cassandra/instance-data) -- [[phab:T314941|T314941]]
* 11:18 Urbanecm: Run mwscript updateCollation.php --wiki=bswiktionary --previous-collation=uppercase in a tmux session at mwmaint1002 ([[phab:T258346|T258346]])
* 14:28 jelto: power off kafka-main2004 gracefully
* 11:17 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|0c784784d75c2bbfb570495a6a097d4c44cbe6b3}}: Set $wgCategoryCollation to uca-bs-u-kn on Bosnian Wiktionary ([[phab:T258346|T258346]]) (duration: 00m 58s)
* 14:28 hnowlan: shutting down sessionstore2003
* 11:13 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|6830723b0ad5031e67062ba838f09cd07c2b97a1}}: Convert ukwikisource ns:250 and ns:251 to have subpages ([[phab:T255930|T255930]]) (duration: 00m 57s)
* 14:27 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=sessionstore2003.codfw.wmnet
* 11:10 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|1c7a6215d06aff6cb0a75701292d8147f006d9e4}}: Create closer group at itwikinews ([[phab:T257927|T257927]]) (duration: 00m 57s)
* 14:27 sukhe: power off cp2039, cp2040 for PDU upgrade: rack D
* 10:55 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 14:27 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2003.codfw.wmnet with reason: PDU maintenance
* 10:51 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 14:27 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2003.codfw.wmnet with reason: PDU maintenance
* 10:50 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 14:25 jelto: power off mc-gp2003
* 10:48 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 14:25 jelto: power off mc2033
* 10:48 moritzm: rebooting releases* hosts for kernel security update
* 14:24 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:30:00 on kafka-main2004.codfw.wmnet with reason: PDU swap
* 10:35 jdrewniak@deploy1001: Synchronized portals: Wikimedia Portals Update: [[gerrit:614698{{!}} Bumping portals to master (614698)]] (duration: 00m 56s)
* 14:23 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:30:00 on kafka-main2004.codfw.wmnet with reason: PDU swap
* 10:34 jdrewniak@deploy1001: Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:614698{{!}} Bumping portals to master (614698)]] (duration: 00m 59s)
* 14:23 sukhe: depool codfw for PDU upgrade: rack D
* 10:30 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1114', diff saved to https://phabricator.wikimedia.org/P11962 and previous config saved to /var/cache/conftool/dbconfig/20200720-103058-marostegui.json
* 14:23 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:30:00 on mc-gp2003.codfw.wmnet with reason: PDU swap
* 09:48 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1114', diff saved to https://phabricator.wikimedia.org/P11961 and previous config saved to /var/cache/conftool/dbconfig/20200720-094609-marostegui.json
* 14:23 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:30:00 on mc-gp2003.codfw.wmnet with reason: PDU swap
* 09:31 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1114', diff saved to https://phabricator.wikimedia.org/P11960 and previous config saved to /var/cache/conftool/dbconfig/20200720-093154-marostegui.json
* 14:23 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:30:00 on mc2033.codfw.wmnet with reason: PDU swap
* 09:25 godog: update compiler facts
* 14:23 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:30:00 on mc2033.codfw.wmnet with reason: PDU swap
* 09:17 jayme: updating envoyproxy to 1.14.4-1 on all eqiad hosts
* 14:15 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp20[39{{!}}40]\.codfw\.wmnet,service=ats-tls
* 09:11 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1114', diff saved to https://phabricator.wikimedia.org/P11959 and previous config saved to /var/cache/conftool/dbconfig/20200720-091119-marostegui.json
* 14:13 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp[2039-2040].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
* 09:04 jayme: updating envoyproxy to 1.14.4-1 on all codfw hosts
* 14:13 urandom: flushing Cassandra tables, restbase1030
* 07:54 moritzm: installing libopenmpt security updates
* 14:13 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 3:00:00 on cp[2039-2040].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
* 07:51 jayme: updating envoyproxy to 1.14.4-1 on all non mw and restbase hosts
* 14:13 urandom: flushing Cassandra tables, restbase1019
* 07:29 marostegui: Move m1-master from dbproxy1012 to dbproxy1014 - [[phab:T255408|T255408]]
* 14:12 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on dns2002.wikimedia.org with reason: shutdown for PDU upgrade: rack D4
* 07:19 marostegui: Drop non used reviewdb database - [[phab:T255715|T255715]]
* 14:12 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 3:00:00 on dns2002.wikimedia.org with reason: shutdown for PDU upgrade: rack D4
* 06:55 elukey: restart matomo1002's mariadb to pick up new TLS settings
* 14:11 urandom: flushing Cassandra tables, restbase1017 1018 1021 1024 1025 1026 1028 1029
* 06:54 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1114', diff saved to https://phabricator.wikimedia.org/P11958 and previous config saved to /var/cache/conftool/dbconfig/20200720-065438-marostegui.json
* 14:05 urandom: flushing tables, restbase1016
* 06:15 tstarling@deploy1001: Synchronized php-1.35.0-wmf.41/extensions/Score/includes/Score.php: reverting Reedy's temporary patch for hardcoding the lilypond version (duration: 00m 57s)
* 13:52 hnowlan: powered up restbase2018
* 06:07 tstarling@deploy1001: Finished scap: fixing missing message from previous sync-dir (duration: 29m 57s)
* 13:32 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on netmon1003.wikimedia.org with reason: pdu
* 05:56 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1082 after a crash [[phab:T258336|T258336]]', diff saved to https://phabricator.wikimedia.org/P11957 and previous config saved to /var/cache/conftool/dbconfig/20200720-055614-marostegui.json
* 13:32 filippo@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on netmon1003.wikimedia.org with reason: pdu
* 05:47 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1082 after a crash [[phab:T258336|T258336]]', diff saved to https://phabricator.wikimedia.org/P11956 and previous config saved to /var/cache/conftool/dbconfig/20200720-054747-marostegui.json
* 13:32 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on logstash2003.codfw.wmnet with reason: pdu
* 05:40 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1082 after a crash [[phab:T258336|T258336]]', diff saved to https://phabricator.wikimedia.org/P11955 and previous config saved to /var/cache/conftool/dbconfig/20200720-053816-marostegui.json
* 13:31 filippo@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on logstash2003.codfw.wmnet with reason: pdu
* 05:37 tstarling@deploy1001: Started scap: fixing missing message from previous sync-dir
* 13:31 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on logstash2029.codfw.wmnet with reason: pdu
* 05:30 tstarling@deploy1001: scap sync-l10n completed (1.35.0-wmf.41) (duration: 02m 44s)
* 13:31 root@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on logstash2029.codfw.wmnet with reason: pdu
* 05:25 marostegui: Deploy MCR schema change on enwiki on db1119 - [[phab:T238966|T238966]]
* 13:30 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: [[phab:T310146|T310146]]
* 05:24 tstarling@deploy1001: Synchronized wmf-config/CommonSettings.php: disable lilypond with better error message (duration: 00m 57s)
* 13:30 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: [[phab:T310146|T310146]]
* 05:18 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1082 after a crash [[phab:T258336|T258336]]', diff saved to https://phabricator.wikimedia.org/P11953 and previous config saved to /var/cache/conftool/dbconfig/20200720-051846-marostegui.json
* 13:17 elukey: powering on restbase2027
* 05:18 tstarling@deploy1001: Synchronized php-1.35.0-wmf.41/extensions/Score: better error message for disabling of Score (duration: 01m 10s)
* 13:12 elukey: powering on restbase2026
* 13:12 _joe_: powering on restbase2023
* 13:01 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32343 and previous config saved to /var/cache/conftool/dbconfig/20220810-130108-ladsgroup.json
* 13:01 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
* 13:00 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
* 12:37 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic[2072,2084-2085].codfw.wmnet with reason: [[phab:T310146|T310146]]
* 12:37 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic[2072,2084-2085].codfw.wmnet with reason: [[phab:T310146|T310146]]
* 12:27 jbond: remove confd from servers that shouldn't have it
* 12:05 ladsgroup@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/Echo/maintenance/removeOrphanedEvents.php: Backport: [[gerrit:821735{{!}}Run clean ups with removeOrphanedEvents in major batches (T310428)]] (duration: 03m 32s)
* 11:47 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 11:46 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 11:46 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 11:43 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 11:15 jbond@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS buster
* 10:54 jbond@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
* 10:51 jbond@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
* 10:37 jbond@cumin1001: START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster
* 10:31 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2181-2182].codfw.wmnet with reason: D6 PDU maint ([[phab:T310146|T310146]])
* 10:31 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2181-2182].codfw.wmnet with reason: D6 PDU maint ([[phab:T310146|T310146]])
* 10:26 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on maps2010.codfw.wmnet with reason: PDU maintenance
* 10:26 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on maps2010.codfw.wmnet with reason: PDU maintenance
* 10:26 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=maps2010.codfw.wmnet
* 10:25 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=maps2010.codfw.wmnet
* 10:25 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=maps2008.codfw.wmnet
* 10:24 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on restbase2018.codfw.wmnet with reason: PDU maintenance
* 10:24 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase2018.codfw.wmnet
* 10:24 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on restbase2018.codfw.wmnet with reason: PDU maintenance
* 10:23 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ml-serve2008.codfw.wmnet with reason: PDU maintenance
* 10:23 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on ml-serve2008.codfw.wmnet with reason: PDU maintenance
* 10:20 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ores2009.codfw.wmnet with reason: PDU maintenance
* 10:20 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on ores2009.codfw.wmnet with reason: PDU maintenance
* 10:19 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ores2008.codfw.wmnet with reason: PDU maintenance
* 10:19 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on ores2008.codfw.wmnet with reason: PDU maintenance
* 10:03 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase202[367].codfw.wmnet
* 10:02 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on restbase[2023,2026-2027].codfw.wmnet with reason: PDU maintenance
* 10:02 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on restbase[2023,2026-2027].codfw.wmnet with reason: PDU maintenance
* 09:53 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: D8 PDU Maint ([[phab:T310146|T310146]])
* 09:52 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: D8 PDU Maint ([[phab:T310146|T310146]])
* 09:51 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool D8 DBs for PDU maint ([[phab:T310146|T310146]])', diff saved to https://phabricator.wikimedia.org/P32341 and previous config saved to /var/cache/conftool/dbconfig/20220810-095059-ladsgroup.json
* 09:36 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2101,2130,2140].codfw.wmnet,dbproxy2004.codfw.wmnet with reason: D6 PDU maint ([[phab:T310146|T310146]])
* 09:36 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2101,2130,2140].codfw.wmnet,dbproxy2004.codfw.wmnet with reason: D6 PDU maint ([[phab:T310146|T310146]])
* 09:34 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool D6 dbs ([[phab:T310146|T310146]])', diff saved to https://phabricator.wikimedia.org/P32340 and previous config saved to /var/cache/conftool/dbconfig/20220810-093433-ladsgroup.json
* 09:31 jelto: depool services in codfw for upcoming PDU replacement - [[phab:T309956|T309956]]
* 09:30 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 8 hosts with reason: Maintenance
* 09:29 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 8 hosts with reason: Maintenance
* 09:29 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
* 09:29 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
* 09:28 jynus: shutdown backup2007 before pdu upgrade [[phab:T310146|T310146]]
* 09:16 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 09:15 ladsgroup@deploy1002: Synchronized php-1.39.0-wmf.23/maintenance/namespaceDupes.php: Backport: [[gerrit:821734{{!}}maintenance: Add support for links migration to namespaceDupes.php (T314711)]] (duration: 03m 18s)
* 09:15 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2093,2120,2129,2172].codfw.wmnet with reason: D5 PDU maint ([[phab:T310146|T310146]])
* 09:15 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2093,2120,2129,2172].codfw.wmnet with reason: D5 PDU maint ([[phab:T310146|T310146]])
* 09:14 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 09:14 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 09:13 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 09:10 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool D5 dbs ([[phab:T310146|T310146]])', diff saved to https://phabricator.wikimedia.org/P32339 and previous config saved to /var/cache/conftool/dbconfig/20220810-091038-ladsgroup.json
* 08:49 jynus: shutdown dbprov2003 before pdu upgrade [[phab:T310146|T310146]]
* 08:49 ayounsi@cumin1001: END (FAIL) - Cookbook sre.network.debug (exit_code=99) for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:48 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:48 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be2028.codfw.wmnet
* 08:48 mvernon@cumin1001: START - Cookbook sre.hosts.remove-downtime for ms-be2028.codfw.wmnet
* 08:42 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P32337 and previous config saved to /var/cache/conftool/dbconfig/20220810-084222-ladsgroup.json
* 08:37 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 08:37 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
* 08:36 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
* 08:36 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 08:36 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 08:35 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 08:35 ladsgroup@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:822037{{!}}Stop writing to the old templatelinks fields in s5 (T312865)]] (duration: 03m 29s)
* 08:32 jelto: power off gitlab-runner2004
* 08:31 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:30:00 on gitlab-runner2004.codfw.wmnet with reason: PDU swap
* 08:31 root@cumin1001: START - Cookbook sre.hosts.downtime for 8:30:00 on gitlab-runner2004.codfw.wmnet with reason: PDU swap
* 08:29 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be2028.codfw.wmnet with reason: Trying to fix full /
* 08:28 mvernon@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be2028.codfw.wmnet with reason: Trying to fix full /
* 08:28 ayounsi@cumin1001: END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:27 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:27 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P32336 and previous config saved to /var/cache/conftool/dbconfig/20220810-082718-ladsgroup.json
* 08:25 ayounsi@cumin1001: END (FAIL) - Cookbook sre.network.debug (exit_code=99) for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:25 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:24 ayounsi@cumin1001: END (FAIL) - Cookbook sre.network.debug (exit_code=99) for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:24 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:23 kart_: Run: mwscript namespaceDupes.php arywiki --fix ([[phab:T291737|T291737]])
* 08:13 jynus: restart replication on db1117:m1 [[phab:T309074|T309074]]
* 08:12 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P32335 and previous config saved to /var/cache/conftool/dbconfig/20220810-081213-ladsgroup.json
* 08:09 kartik@deploy1002: Finished scap: Backport: [[gerrit:821732{{!}}arywiki: change namespace translations, add unchanged namespaces and add old translations as aliases (T291737)]] (duration: 10m 37s)
* 07:59 kartik@deploy1002: Started scap: Backport: [[gerrit:821732{{!}}arywiki: change namespace translations, add unchanged namespaces and add old translations as aliases (T291737)]]
* 07:57 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P32334 and previous config saved to /var/cache/conftool/dbconfig/20220810-075708-ladsgroup.json
* 07:56 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P32333 and previous config saved to /var/cache/conftool/dbconfig/20220810-075636-ladsgroup.json
* 07:55 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 07:52 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 07:52 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 07:52 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance
* 07:52 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance
* 07:51 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 07:51 dcaro@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 07:46 dcaro@cumin1001: START - Cookbook sre.dns.netbox
* 07:39 dcaro@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 07:34 dcaro@cumin1001: START - Cookbook sre.dns.netbox
* 07:33 godog: depool thanos-fe2001 for debugging
* 07:11 kartik@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:821170{{!}}Enable SectionTranslation on testwiki with new MT support from Google (T313296)]] (duration: 05m 44s)
* 07:11 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 07:08 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 07:08 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 07:07 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 05:24 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on kubernetes[2013-2014].codfw.wmnet with reason: PDU maintenance
* 05:24 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on kubernetes[2013-2014].codfw.wmnet with reason: PDU maintenance
* 05:19 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on parse[2016-2020].codfw.wmnet with reason: PDU maintenance
* 05:19 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on parse[2016-2020].codfw.wmnet with reason: PDU maintenance
* 05:12 _joe_: starting to shut down servers in codfw for the PDU maintenance
* 05:09 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on 10 hosts with reason: PDU maintenance
* 05:09 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on 10 hosts with reason: PDU maintenance
* 05:09 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on mc-gp2003.codfw.wmnet with reason: PDU maintenance
* 05:09 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on mc-gp2003.codfw.wmnet with reason: PDU maintenance
* 05:06 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on mc2033.codfw.wmnet with reason: PDU maintenance
* 05:06 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on mc2033.codfw.wmnet with reason: PDU maintenance
* 05:05 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on 7 hosts with reason: PDU maintenance
* 05:05 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on 7 hosts with reason: PDU maintenance
* 02:34 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 02:33 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 02:33 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 02:32 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 02:07 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 02:06 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 02:06 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 02:05 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply


== 2020-07-19 ==
== 2022-08-09 ==
* 19:16 marostegui: Upgrade and reboot db1085 [[phab:T258360|T258360]]
* 23:17 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1011.eqiad.wmnet
* 18:57 marostegui: Start mysql on db1082 [[phab:T258336|T258336]]
* 23:07 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 18:51 marostegui: Upgrade and reboot db1082 [[phab:T258336|T258336]]
* 23:06 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 18:45 cdanis@cumin1001: dbctl commit (dc=all): 'db1085 also crashed', diff saved to https://phabricator.wikimedia.org/P11952 and previous config saved to /var/cache/conftool/dbconfig/20200719-184511-cdanis.json
* 22:51 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 18:06 Urbanecm: Run mwscript emptyUserGroup.php --wiki=testwiki contestadmin ([[phab:T256555|T256555]])
* 22:51 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 22:49 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 22:49 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 22:46 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1015.eqiad.wmnet
* 22:31 ryankemper@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 22:31 ryankemper@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 22:28 bking@cumin1001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
* 22:02 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 22:02 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 21:57 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs2006.codfw.wmnet with reason: [[phab:T310146|T310146]]
* 21:57 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs2006.codfw.wmnet with reason: [[phab:T310146|T310146]]
* 21:53 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 21:52 bking@deploy1002: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
* 21:50 bking@deploy1002: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
* 21:49 bking@deploy1002: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
* 21:43 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop: apply
* 21:43 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop: apply
* 21:43 bking@deploy1002: helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
* 21:43 bking@deploy1002: helmfile [eqiad] START helmfile.d/services/changeprop: apply
* 21:43 bking@deploy1002: helmfile [staging] DONE helmfile.d/services/changeprop: apply
* 21:43 bking@deploy1002: helmfile [staging] START helmfile.d/services/changeprop: apply
* 21:08 bking@cumin1001: START - Cookbook sre.wdqs.data-transfer
* 21:00 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1014.eqiad.wmnet
* 20:56 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
* 20:55 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
* 20:55 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32332 and previous config saved to /var/cache/conftool/dbconfig/20220809-205548-ladsgroup.json
* 20:51 bking@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wdqs1014.eqiad.wmnet
* 20:51 bking@cumin1001: START - Cookbook sre.hosts.remove-downtime for wdqs1014.eqiad.wmnet
* 20:46 bking@cumin1001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
* 20:40 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P32331 and previous config saved to /var/cache/conftool/dbconfig/20220809-204042-ladsgroup.json
* 20:25 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P32330 and previous config saved to /var/cache/conftool/dbconfig/20220809-202536-ladsgroup.json
* 20:10 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32329 and previous config saved to /var/cache/conftool/dbconfig/20220809-201030-ladsgroup.json
* 19:57 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on wdqs1014.eqiad.wmnet with reason: [[phab:T314890|T314890]]
* 19:57 bking@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on wdqs1014.eqiad.wmnet with reason: [[phab:T314890|T314890]]
* 19:56 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on wdqs1016.eqiad.wmnet with reason: [[phab:T314890|T314890]]
* 19:56 bking@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on wdqs1016.eqiad.wmnet with reason: [[phab:T314890|T314890]]
* 19:55 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on wdqs1015.eqiad.wmnet with reason: [[phab:T314890|T314890]]
* 19:55 bking@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on wdqs1015.eqiad.wmnet with reason: [[phab:T314890|T314890]]
* 19:38 dcausse@deploy1002: helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply
* 19:36 dcausse@deploy1002: helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply
* 19:35 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1130.eqiad.wmnet with reason: Maintenance
* 19:35 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 10:00:00 on db1130.eqiad.wmnet with reason: Maintenance
* 19:25 bking@cumin1001: START - Cookbook sre.wdqs.data-transfer
* 18:09 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 18:06 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
* 17:54 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 17:47 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
* 17:38 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1072.eqiad.wmnet with OS bullseye
* 17:29 vgutierrez: test trafficserver 9.1.2-1wm2 in cp6016 - [[phab:T309651|T309651]]
* 17:15 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1072.eqiad.wmnet with reason: host reimage
* 17:13 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1072.eqiad.wmnet with reason: host reimage
* 17:00 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1072.eqiad.wmnet with OS bullseye
* 16:54 bking@cumin1001: END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
* 16:54 bking@cumin1001: START - Cookbook sre.elasticsearch.force-shard-allocation
* 16:53 bking@cumin1001: END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
* 16:53 bking@cumin1001: START - Cookbook sre.elasticsearch.force-shard-allocation
* 16:26 bking@deploy1002: helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
* 16:26 bking@deploy1002: helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
* 16:01 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1069.eqiad.wmnet with OS bullseye
* 15:45 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1069.eqiad.wmnet with reason: host reimage
* 15:42 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1069.eqiad.wmnet with reason: host reimage
* 15:32 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 15:31 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 15:31 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 15:30 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1069.eqiad.wmnet with OS bullseye
* 15:28 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 15:27 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1058.eqiad.wmnet with OS bullseye
* 15:08 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1058.eqiad.wmnet with reason: host reimage
* 15:05 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1058.eqiad.wmnet with reason: host reimage
* 14:59 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
* 14:59 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
* m: finished running 'homer "status:active" commit "netmon: Add the netmon1003 host as a syslog destination"' in the cumin1001 host. Homer reported no errors.
* 14:54 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
* 14:50 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1058.eqiad.wmnet with OS bullseye
* 14:28 bking@cumin1001: conftool action : set/pooled=false; selector: dnsdisc=wdqs,name=codfw
* 13:57 kevinbazira@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
* 13:57 kevinbazira@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
* m: Add the new netmon1003 host as a syslog destination in homer templates/common/system.conf https://gerrit.wikimedia.org/r/c/operations/homer/public/+/819124
* m: Successfully ran '# run-puppet-merge' in the netmon1002 and netmon1003 hosts.
* m: Running '# run-puppet-agent' in the netmon1003 host
* m: Running '# run-puppet-agent' in the netmon1002 host
* 13:47 ryankemper@cumin1001: END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
* 13:46 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.force-shard-allocation
* m: puppet-merge on puppetmaster2004.codfw.wmnet for patch 819179 succeeded
* m: Set netmon1003 as netmon_server and netmon1002 as a netmon_servers_failover in the Puppet repository https://gerrit.wikimedia.org/r/c/operations/puppet/+/819179
* m: authdns updated successfully
* m: Had to revert https://gerrit.wikimedia.org/r/c/operations/dns/+/819177 because I rebased my changes incorrectly; sent the new patch in https://gerrit.wikimedia.org/r/c/operations/dns/+/821746
* m: running '# authdns-update' in ns0.wikimedia.org
* m: Flip DNS for LibreNMS and Smokeping from netmon1002 to netmon1003 https://gerrit.wikimedia.org/r/c/operations/dns/+/819177
* 13:23 jynus: stop replication on db1117:m1 [[phab:T309074|T309074]]
* m: netmon1002 to netmon1003 failover
* 13:17 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
* 13:16 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
* 10:58 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
* 09:53 vgutierrez: rolling restart of pybal in eqsin - [[phab:T310070|T310070]]
* 09:25 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
* 09:24 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
* 09:24 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
* 09:12 vgutierrez: rolling restart of pybal in codfw - [[phab:T310070|T310070]]
* 08:47 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
* 08:30 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
* 08:28 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
* 08:27 elukey@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
* 08:27 elukey@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
* 08:27 elukey@deploy1002: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
* 08:26 elukey@deploy1002: helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
* 08:24 jynus: starting data check using es1021 and es2021, expect increased read traffic [[phab:T314559|T314559]]
* 08:21 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
* 06:22 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance
* 06:22 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance
* 06:19 Amir1: dbmaint s5@eqiad ([[phab:T312863|T312863]] [[phab:T312984|T312984]] [[phab:T310011|T310011]] [[phab:T310485|T310485]])
* 06:11 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1130.eqiad.wmnet with reason: Maint
* 06:11 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1130.eqiad.wmnet with reason: Maint
* 06:08 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool db1130 [[phab:T314370|T314370]]', diff saved to https://phabricator.wikimedia.org/P32323 and previous config saved to /var/cache/conftool/dbconfig/20220809-060836-ladsgroup.json
* 06:07 oblivian@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 06:02 ladsgroup@cumin1001: dbctl commit (dc=all): 'Promote db1100 to s5 primary and set section read-write [[phab:T314370|T314370]]', diff saved to https://phabricator.wikimedia.org/P32322 and previous config saved to /var/cache/conftool/dbconfig/20220809-060159-ladsgroup.json
* 06:01 ladsgroup@cumin1001: dbctl commit (dc=all): 'Set s5 eqiad as read-only for maintenance - [[phab:T314370|T314370]]', diff saved to https://phabricator.wikimedia.org/P32321 and previous config saved to /var/cache/conftool/dbconfig/20220809-060105-ladsgroup.json
* 06:00 Amir1: Starting s5 eqiad failover from db1130 to db1100 - [[phab:T314370|T314370]]
* 05:12 ladsgroup@cumin1001: dbctl commit (dc=all): 'Set db1100 with weight 0 [[phab:T314370|T314370]]', diff saved to https://phabricator.wikimedia.org/P32320 and previous config saved to /var/cache/conftool/dbconfig/20220809-051251-ladsgroup.json
* 05:12 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 22 hosts with reason: Primary switchover s5 [[phab:T314370|T314370]]
* 05:11 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on 22 hosts with reason: Primary switchover s5 [[phab:T314370|T314370]]
* 02:42 ejegg: SmashPig upgraded from {{Gerrit|9b97ea15}} to {{Gerrit|13e9e9cc}}
* 02:31 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1148 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32318 and previous config saved to /var/cache/conftool/dbconfig/20220809-023113-ladsgroup.json
* 02:31 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance
* 02:30 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance
* 02:30 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32317 and previous config saved to /var/cache/conftool/dbconfig/20220809-023052-ladsgroup.json
* 02:28 ejegg: payments-wiki upgraded from {{Gerrit|6880236d}} to {{Gerrit|cf5e1848}}
* 02:15 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P32316 and previous config saved to /var/cache/conftool/dbconfig/20220809-021546-ladsgroup.json
* 02:00 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P32315 and previous config saved to /var/cache/conftool/dbconfig/20220809-020040-ladsgroup.json
* 01:45 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32314 and previous config saved to /var/cache/conftool/dbconfig/20220809-014534-ladsgroup.json


== 2020-07-18 ==
== 2022-08-08 ==
* 21:41 shdubsh: restart logstash on logstash200[456]
* 23:52 tstarling@deploy1002: Synchronized wmf-config/InitialiseSettings.php: clean up testwiki experiments [[phab:T314750|T314750]] (duration: 03m 19s)
* 21:14 shdubsh: bounce logstash on logstash1007
* 23:47 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 21:10 shdubsh: bounce logstash on logstash1008
* 23:46 tstarling@deploy1002: Synchronized wmf-config/CommonSettings.php: clean up testwiki experiments [[phab:T314750|T314750]] (duration: 03m 27s)
* 21:06 shdubsh: bounce logstash on logstash1009
* 23:46 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:52 marostegui: Due to db1082 crash there will be replication lag on s5 on labsdb hosts - [[phab:T258336|T258336]]
* 23:46 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:37 cdanis@cumin1001: dbctl commit (dc=all): 'depool db1082, it crashed', diff saved to https://phabricator.wikimedia.org/P11951 and previous config saved to /var/cache/conftool/dbconfig/20200718-203704-cdanis.json
* 23:45 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 00:13 dpifke: Performing one-time expiration of ArcLamp files older than 40 days (normal retention is 45 days), to solve disk space issue until either Ganeti issue is solved or compressed logfile support is merged.
* 23:32 eileen___: config revision changed from {{Gerrit|f5668044}} to {{Gerrit|787cd0e0}}
* 23:32 eileen___: civicrm upgraded from {{Gerrit|497bddf7}} to {{Gerrit|1f91ac2d}}
* 22:16 ryankemper@cumin1001: END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 22:16 ryankemper@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host elastic1065.eqiad.wmnet with OS bullseye
* 21:53 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1065.eqiad.wmnet with reason: host reimage
* 21:50 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1065.eqiad.wmnet with reason: host reimage
* 21:36 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1065.eqiad.wmnet with OS bullseye
* 21:12 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1062.eqiad.wmnet with OS bullseye
* 20:53 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1062.eqiad.wmnet with reason: host reimage
* 20:50 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1062.eqiad.wmnet with reason: host reimage
* 20:36 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1062.eqiad.wmnet with OS bullseye
* 20:32 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:31 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:31 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:29 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 20:28 cjming: end of UTC late backport window
* 20:27 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:27 cjming@deploy1002: Synchronized php-1.39.0-wmf.23/skins/Vector/resources/skins.vector.styles/layouts/grid.less: Backport: [[gerrit:821243{{!}}Fix grid blowout bug (T314756)]] (duration: 03m 26s)
* 20:12 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:11 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:11 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:11 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:817785{{!}}Disable sticky header edit A/B test for pilot wikis (T312296)]] (duration: 03m 35s)
* 20:08 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 17:34 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1088.eqiad.wmnet with OS bullseye
* 17:15 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1088.eqiad.wmnet with reason: host reimage
* 17:12 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1088.eqiad.wmnet with reason: host reimage
* 17:00 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1088.eqiad.wmnet with OS bullseye
* 16:54 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1085.eqiad.wmnet with OS bullseye
* 16:49 ryankemper@cumin1001: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 16:43 pt1979@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:41 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
* 16:39 pt1979@cumin2002: START - Cookbook sre.dns.netbox
* 16:38 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
* 16:26 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1085.eqiad.wmnet with OS bullseye
* 16:24 bking@cumin1001: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host elastic1085.eqiad.wmnet with OS bullseye
* 16:19 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
* 16:16 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
* 16:16 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 16:14 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
* 16:12 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
* 16:10 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
* 16:09 ryankemper@cumin1001: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 16:04 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1085.eqiad.wmnet with OS bullseye
* 16:00 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1084.eqiad.wmnet with OS bullseye
* 15:58 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 15:47 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1084.eqiad.wmnet with reason: host reimage
* 15:46 sukhe: upload reprepro -C main include bullseye-wikimedia python-pynetbox_6.6.0-1+wmf11u1_amd64.changes
* 15:45 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1084.eqiad.wmnet with reason: host reimage
* 15:37 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2021.codfw.wmnet with reason: Maint
* 15:37 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2021.codfw.wmnet with reason: Maint
* 15:32 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1084.eqiad.wmnet with OS bullseye
* 14:59 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
* 14:55 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
* 14:47 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cp5001.eqsin.wmnet with reason: depooled: faulty DIMM: [[phab:T314256|T314256]]
* 14:46 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on cp5001.eqsin.wmnet with reason: depooled: faulty DIMM: [[phab:T314256|T314256]]
* 14:34 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
* 14:11 kevinbazira@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
* 13:03 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:01 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:01 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 12:58 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 12:56 urbanecm@deploy1002: Synchronized wmf-config/CommonSettings.php: {{Gerrit|77fd5abdd7d9462869259e1511bbcf2d7ce62246}}: Growth: Add new rights to wgAvailableRights (duration: 03m 24s)
* 12:30 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1102.eqiad.wmnet
* 12:09 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 12:09 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 12:06 urbanecm@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/GrowthExperiments/: {{Gerrit|3eaf155678b7313c55dcca0cd39ab29f73eead37}}: MentorTools: Do not use MentorWeightManager ([[phab:T314362|T314362]]) (duration: 03m 31s)
* 12:04 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 11:43 btullis@cumin1001: START - Cookbook sre.hosts.reboot-single for host an-worker1102.eqiad.wmnet
* 11:21 jelto@cumin1001: conftool action : set/pooled=yes; selector: name=kubernetes2022.codfw.wmnet
* 11:21 jelto: kubectl uncordon kubernetes2022.codfw.wmnet
* 10:43 Amir1: Removing db2079 from orchestrator ([[phab:T313885|T313885]])
* 10:39 Amir1: Removing db2079 from zarcillo ([[phab:T313885|T313885]])
* 10:35 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2079.codfw.wmnet
* 10:35 ladsgroup@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 10:30 ladsgroup@cumin1001: START - Cookbook sre.dns.netbox
* 10:25 ladsgroup@cumin1001: START - Cookbook sre.hosts.decommission for hosts db2079.codfw.wmnet
* 10:18 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on db2079.codfw.wmnet with reason: Decom
* 10:17 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on db2079.codfw.wmnet with reason: Decom
* 08:41 jbond: deploy libtirpc update
* 07:57 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1144:3314 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32310 and previous config saved to /var/cache/conftool/dbconfig/20220808-075723-ladsgroup.json
* 07:57 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
* 07:57 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
* 07:57 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32309 and previous config saved to /var/cache/conftool/dbconfig/20220808-075702-ladsgroup.json
* 07:53 godog: grow sda/sdb 3 by 100G on thanos-be2001 - [[phab:T314275|T314275]]
* 07:50 godog: grow sda/sdb 3 by 100G on thanos-be1004 - [[phab:T314275|T314275]]
* 07:41 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P32308 and previous config saved to /var/cache/conftool/dbconfig/20220808-074156-ladsgroup.json
* 07:32 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 07:27 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 07:27 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 07:26 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P32307 and previous config saved to /var/cache/conftool/dbconfig/20220808-072650-ladsgroup.json
* 07:23 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 07:22 kartik@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820815{{!}}trwikivoyage: Create rollbacker user group (T314678)]] (duration: 03m 17s)
* 07:18 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 07:17 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 07:17 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 07:16 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 07:11 elukey: restart rsyslog on ml-serve2007
* 07:11 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 07:11 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32306 and previous config saved to /var/cache/conftool/dbconfig/20220808-071144-ladsgroup.json
* 07:10 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 07:10 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 07:09 kartik@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820261{{!}}Enable SectionTranslation on 10 Wikipedias where ContentTranslation is default (T308829)]] (duration: 03m 15s)
* 07:09 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 07:06 XioNoX: add CSP headers to Netbox - [[phab:T296356|T296356]]
* 07:05 elukey: restart rsyslog on ml-serve-ctrl2001


== 2022-08-07 ==
* 19:58 taavi: taavi@mwmaint1002 ~ $ echo "https://upload.wikimedia.org/wikipedia/commons/1/15/Keep_tidy_ask.svg" {{!}} mwscript purgeList.php --wiki enwiki # [[phab:T314712|T314712]]
* 13:52 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1142 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32305 and previous config saved to /var/cache/conftool/dbconfig/20220807-135204-ladsgroup.json
* 13:51 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance
* 13:51 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance
* 13:51 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32304 and previous config saved to /var/cache/conftool/dbconfig/20220807-135143-ladsgroup.json
* 13:36 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P32303 and previous config saved to /var/cache/conftool/dbconfig/20220807-133637-ladsgroup.json
* 13:21 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P32302 and previous config saved to /var/cache/conftool/dbconfig/20220807-132131-ladsgroup.json
* 13:06 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32301 and previous config saved to /var/cache/conftool/dbconfig/20220807-130625-ladsgroup.json
* 12:06 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1141 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32300 and previous config saved to /var/cache/conftool/dbconfig/20220807-120610-ladsgroup.json
* 12:06 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance
* 12:05 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance
* 12:05 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32299 and previous config saved to /var/cache/conftool/dbconfig/20220807-120549-ladsgroup.json
* 11:50 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P32298 and previous config saved to /var/cache/conftool/dbconfig/20220807-115043-ladsgroup.json
* 11:35 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P32297 and previous config saved to /var/cache/conftool/dbconfig/20220807-113537-ladsgroup.json
* 11:20 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32296 and previous config saved to /var/cache/conftool/dbconfig/20220807-112031-ladsgroup.json


== 2020-07-17 ==
* 21:16 dpifke: Removing MongoDB packages and data from webperf1002.
* 17:39 dpifke@deploy1001: Finished deploy [performance/arc-lamp@a5d2fd3]: (no justification provided) (duration: 00m 05s)
* 17:38 dpifke@deploy1001: Started deploy [performance/arc-lamp@a5d2fd3]: (no justification provided)
* 13:53 akosiaris: powercycle kubernetes2002
* 12:24 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1104', diff saved to https://phabricator.wikimedia.org/P11944 and previous config saved to /var/cache/conftool/dbconfig/20200717-122400-marostegui.json
* 12:01 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1104', diff saved to https://phabricator.wikimedia.org/P11941 and previous config saved to /var/cache/conftool/dbconfig/20200717-120126-marostegui.json
* 11:51 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1104', diff saved to https://phabricator.wikimedia.org/P11940 and previous config saved to /var/cache/conftool/dbconfig/20200717-115155-marostegui.json
* 11:38 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1104', diff saved to https://phabricator.wikimedia.org/P11939 and previous config saved to /var/cache/conftool/dbconfig/20200717-113800-marostegui.json
* 11:30 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1104', diff saved to https://phabricator.wikimedia.org/P11938 and previous config saved to /var/cache/conftool/dbconfig/20200717-113050-marostegui.json
* 11:24 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1104', diff saved to https://phabricator.wikimedia.org/P11937 and previous config saved to /var/cache/conftool/dbconfig/20200717-112413-marostegui.json
* 09:15 elukey@puppetmaster1001: conftool action : set/pooled=yes; selector: name=mw1280.eqiad.wmnet
* 09:12 elukey@puppetmaster1001: conftool action : set/pooled=inactive; selector: name=mw1280.eqiad.wmnet
* 08:48 moritzm: imported prometheus-atlas-exporter 1.0+git20191204.ffafab7-2 to buster-wikimedia [[phab:T247967|T247967]]
* 08:29 elukey@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
* 08:05 elukey@cumin1001: START - Cookbook sre.ganeti.makevm
* 07:54 elukey@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
* 07:51 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1104', diff saved to https://phabricator.wikimedia.org/P11936 and previous config saved to /var/cache/conftool/dbconfig/20200717-075124-marostegui.json
* 07:43 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1111', diff saved to https://phabricator.wikimedia.org/P11935 and previous config saved to /var/cache/conftool/dbconfig/20200717-074335-marostegui.json
* 07:34 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' .
* 07:34 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
* 07:33 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
* 07:33 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' .
* 07:32 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
* 07:30 elukey@cumin1001: START - Cookbook sre.ganeti.makevm
* 06:30 XioNoX: rename msw1-codfw interface range
* 06:28 XioNoX: rename msw1-eqiad interface range
* 04:47 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1111', diff saved to https://phabricator.wikimedia.org/P11934 and previous config saved to /var/cache/conftool/dbconfig/20200717-044748-marostegui.json
* 04:46 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1092', diff saved to https://phabricator.wikimedia.org/P11933 and previous config saved to /var/cache/conftool/dbconfig/20200717-044658-marostegui.json


== 2022-08-06 ==
* 17:59 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1149 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32295 and previous config saved to /var/cache/conftool/dbconfig/20220806-175916-ladsgroup.json
* 17:59 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance
* 17:58 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance
* 03:10 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 03:09 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 03:09 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 03:08 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 03:03 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 03:02 krinkle@deploy1002: Synchronized w/: {{Gerrit|I9067d47fab0324}} (duration: 03m 25s)
* 03:02 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 03:02 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 03:01 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 02:41 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 02:39 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 02:39 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 02:38 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 02:38 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance
* 02:37 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance
* 02:33 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 02:32 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 02:32 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 02:31 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply


== 2020-07-16 ==
* 22:15 mutante: testreduce1001 manually git clone 'scandium' branch of integration/visualdiff into /srv/visualdiff ([[phab:T257906|T257906]])
* 21:54 crusnov@deploy1001: Finished deploy [netbox/deploy@39c5cae]: Deploying Netbox 2.8.7 part 3 (duration: 01m 49s)
* 21:52 crusnov@deploy1001: Started deploy [netbox/deploy@39c5cae]: Deploying Netbox 2.8.7 part 3
* 21:42 crusnov@deploy1001: Finished deploy [netbox/deploy@39c5cae]: Deploying Netbox 2.8.7 part 2 (duration: 01m 33s)
* 21:41 crusnov@deploy1001: Started deploy [netbox/deploy@39c5cae]: Deploying Netbox 2.8.7 part 2
* 21:40 crusnov@deploy1001: Finished deploy [netbox/deploy@39c5cae]: Deploying Netbox 2.8.7 (duration: 01m 01s)
* 21:39 crusnov@deploy1001: Started deploy [netbox/deploy@39c5cae]: Deploying Netbox 2.8.7
* 21:08 cstone: payments-wiki revision changed from {{Gerrit|91852dbc9b}} to {{Gerrit|bf91f8adff}}
* 20:32 mholloway-shell@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable client error logging on Catalan Wikipedia ([[phab:T258073|T258073]]) (duration: 00m 57s)
* 19:32 sbassett: Deployed mitigations for [[phab:T257687|T257687]]
* 19:14 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: [[phab:T248418|T248418]] TimedMediaHandler: Make videojs the only player on all group0 (duration: 00m 57s)
* 18:54 herron@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
* 18:53 herron@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
* 18:50 herron@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
* 18:49 addshore: deployment windows finished with
* 18:46 addshore@deploy1001: Synchronized wmf-config/extension-list: [[gerrit:611393]] extension-list: Load WikibaseClient via JSON (duration: 00m 56s)
* 18:36 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: [[gerrit:613226]] Wikibase: Always set wgWBRepoSettings idGeneratorSeparateDbConnection PT 2/2 (duration: 00m 56s)
* 18:35 addshore@deploy1001: Synchronized wmf-config/Wikibase.php: [[gerrit:613226]] Wikibase: Always set wgWBRepoSettings idGeneratorSeparateDbConnection PT 1/2 (duration: 00m 56s)
* 18:25 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: [[gerrit:613165]] [[phab:T138104|T138104]] Wikibase: stop setting wmgWikibaseTmpSerializeEmptyListsAsObjects (duration: 00m 57s)
* 18:23 addshore@deploy1001: Synchronized wmf-config/config/incubatorwiki.yaml: [[gerrit:613199]] [[phab:T256957|T256957]] Move VisualEditor from beta to default on incubatorwiki PT2/2 (duration: 00m 57s)
* 18:22 addshore@deploy1001: Synchronized dblists/visualeditor-nondefault.dblist: [[gerrit:613199]] [[phab:T256957|T256957]] Move VisualEditor from beta to default on incubatorwiki PT1/2 (duration: 00m 56s)
* 18:20 addshore@deploy1001: Synchronized wmf-config/config/nlwikimedia.yaml: [[gerrit:613198]] [[phab:T256142|T256142]] Move VisualEditor from beta to default on nlwikimedia PT2/2 (duration: 00m 57s)
* 18:18 addshore@deploy1001: Synchronized dblists/visualeditor-nondefault.dblist: [[gerrit:613198]] [[phab:T256142|T256142]] Move VisualEditor from beta to default on nlwikimedia PT1/2 (duration: 00m 56s)
* 18:14 addshore@deploy1001: Synchronized wmf-config/Wikibase.php: [[gerrit:613164]] [[phab:T138104|T138104]] Wikibase: stop setting wgWBRepoSettings tmpSerializeEmptyListsAsObjects (duration: 00m 57s)
* 18:12 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: [[gerrit:613192]] [[phab:T246420|T246420]] Enable limited-width layout for Modern Vector (duration: 00m 56s)
* 18:08 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: [[gerrit:612870]] [[phab:T246977|T246977]] Disable affinity quicksurveys for the following wikis (duration: 00m 57s)
* 18:03 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
* 18:03 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
* 17:54 herron@cumin1001: START - Cookbook sre.ganeti.makevm
* 17:53 herron@cumin1001: START - Cookbook sre.ganeti.makevm
* 17:50 herron@cumin1001: START - Cookbook sre.ganeti.makevm
* 17:50 herron@cumin1001: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
* 17:49 herron@cumin1001: START - Cookbook sre.ganeti.makevm
* 17:17 XioNoX: msw1-eqiad delete unused VC-ports
* 17:05 XioNoX: msw1-codfw - replace member-range with list of individual interfaces
* 16:45 lucaswerkmeister-wmde@deploy1001: Synchronized php-1.35.0-wmf.41/extensions/Wikibase/: Backport: [[gerrit:613173{{!}}Re add OtherProjectsSidebarGenerator::buildProjectLinkSidebarFromItemId (T258184)]] (duration: 01m 02s)
* 16:11 effie: reboot rdb1009 - [[phab:T254990|T254990]]
* 16:06 effie: Reboot rdb1010 - [[phab:T254990|T254990]]
* 15:51 lucaswerkmeister-wmde@deploy1001: Synchronized php-1.35.0-wmf.41/extensions/Wikibase/: Backport: [[gerrit:613170{{!}}Revert "Revert "Removes OtherProjectsSidebar hook"" (T258184)]] (duration: 01m 02s)
* 15:40 lucaswerkmeister-wmde@deploy1001: scap failed: average error rate on 7/9 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/e474f13ffac6b8c3bf919c4aeafc8c9b for details)
* 15:15 akosiaris: lower codfw mobileapps kubernetes traffic to 10% [[phab:T218733|T218733]]. Will open up task for it
* 15:15 akosiaris@cumin1001: conftool action : set/weight=24; selector: dc=codfw,service=mobileapps,name=scb.*
* 15:07 XioNoX: repool eqsin - [[phab:T257154|T257154]]
* 15:04 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 15:02 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 15:00 jayme@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'mathoid' for release 'production' .
* 14:54 XioNoX: load config on cr3-eqsin - [[phab:T257154|T257154]]
* 14:54 lucaswerkmeister-wmde@deploy1001: Synchronized php-1.35.0-wmf.41/extensions/Wikibase/: Backport: [[gerrit:613167{{!}}Avoid trying to register wikibase.Site twice (T258065)]] (duration: 01m 03s)
* 14:43 jayme@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'mathoid' for release 'production' .
* 14:31 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'mathoid' for release 'staging' .
* 14:15 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 14:12 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 14:12 moritzm: rebooting webperf hosts in eqiad for kernel update
* 14:09 XioNoX: upgrade junos on cr3-eqsin - [[phab:T257154|T257154]]
* 14:03 jayme: published image docker-registry.discovery.wmnet/envoy:1.14.4-1
* 13:47 XioNoX: remove nonstop-bridging from asw1-eqsin
* 13:36 XioNoX: power-off cr3-eqsin - [[phab:T257154|T257154]]
* 13:36 akosiaris: increase codfw mobileapps kubernetes traffic to 25% [[phab:T218733|T218733]]
* 13:35 akosiaris@cumin1001: conftool action : set/weight=8; selector: dc=codfw,service=mobileapps,name=scb.*
* 13:30 XioNoX: deactivate BGP groups IX/Transit/PyBal on cr3-eqsin - [[phab:T257154|T257154]]
* 13:27 moritzm: installing an-tool1008
* 13:23 XioNoX: depool eqsin for cr3 replacement - [[phab:T257154|T257154]]
* 13:13 volans@deploy1001: Finished deploy [homer/deploy@fcf4332]: Force deploy of the homer plugin (duration: 01m 27s)
* 13:12 volans@deploy1001: Started deploy [homer/deploy@fcf4332]: Force deploy of the homer plugin
* 13:04 kormat: restarting tendril to pick up new mariadb config [[phab:T257816|T257816]]
* 13:02 jforrester@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.41
* 13:02 akosiaris: increase codfw mobileapps kubernetes traffic to 10% [[phab:T218733|T218733]]
* 13:01 akosiaris@cumin1001: conftool action : set/weight=24; selector: dc=codfw,service=mobileapps,name=scb.*
* 12:56 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1092', diff saved to https://phabricator.wikimedia.org/P11926 and previous config saved to /var/cache/conftool/dbconfig/20200716-125643-marostegui.json
* 12:56 ayounsi@deploy1001: Finished deploy [homer/deploy@fcf4332]: CR607011 (duration: 04m 32s)
* 12:52 ayounsi@deploy1001: Started deploy [homer/deploy@fcf4332]: CR607011
* 12:42 ayounsi@deploy1001: Finished deploy [homer/deploy@fcf4332]: CR607011 (duration: 03m 42s)
* 12:38 ayounsi@deploy1001: Started deploy [homer/deploy@fcf4332]: CR607011
* 12:38 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 12:36 akosiaris@cumin1001: conftool action : set/weight=50; selector: dc=codfw,service=mobileapps,name=scb.*
* 12:35 akosiaris: increase codfw mobileapps kubernetes traffic to 5% [[phab:T218733|T218733]]
* 12:35 akosiaris: increase codfw mobileapps kubernetes traffic to 5%
* 12:34 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
* 12:22 jmm@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
* 12:12 jmm@cumin1001: START - Cookbook sre.ganeti.makevm
* 12:12 jmm@cumin1001: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
* 12:12 jmm@cumin1001: START - Cookbook sre.ganeti.makevm
* 12:12 jmm@cumin1001: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
* 12:12 jmm@cumin1001: START - Cookbook sre.ganeti.makevm
* 12:08 jayme: updated envoyproxy to 1.14.4-1 on mw-canary and restbase-canary
* 11:44 XioNoX: remove BGP to AS396253 in eqdfw (peer left the IX)
* 11:26 jforrester@deploy1001: Synchronized php-1.35.0-wmf.41/extensions/UrlShortener/includes/UrlShortenerUtils.php: [[phab:T258134|T258134]] Fix config variables regex concatenation (duration: 01m 05s)
* 11:23 addshore@deploy1001: Synchronized wmf-config/Wikibase.php: [[phab:T254315|T254315]] [[gerrit:612670]] Wikibase: remove wmgWikibaseLocalEntitySourceName (duration: 01m 05s)
* 11:18 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: [[phab:T254315|T254315]] [[phab:T257266|T257266]] [[gerrit:609988]] Wikidata client wikis: Define entity sources configuration (take 3) (duration: 01m 08s)
* 10:17 jbond42: upgrade to hiera5
* 10:08 jbond42: disable puppet for hiera5 deployment
* 09:37 jayme: updated envoyproxy to 1.14.4-1 on mw1325.eqiad.wmnet and restbase1026.eqiad.wmnet
* 09:32 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 09:30 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 09:22 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 09:21 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 09:17 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 09:15 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 09:15 jmm@cumin2001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99)
* 09:15 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 09:15 moritzm: rebooting flowspec1001
* 08:52 jayme: updated envoyproxy to 1.14.4-1 on mwdebug1001.eqiad.wmnet
* 08:41 moritzm: installing sqlite3 security updates
* 08:39 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2081', diff saved to https://phabricator.wikimedia.org/P11924 and previous config saved to /var/cache/conftool/dbconfig/20200716-083954-marostegui.json
* 08:35 XioNoX: Remove PIM/IGMP related CR stanza (acls) - [[phab:T257573|T257573]]
* 08:33 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 08:33 jmm@cumin2001: START - Cookbook sre.hosts.downtime
* 08:33 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 08:33 jmm@cumin2001: START - Cookbook sre.hosts.downtime
* 08:32 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 08:32 jmm@cumin2001: START - Cookbook sre.hosts.downtime
* 08:26 moritzm: installing dbus security updates
* 08:25 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 08:24 XioNoX: remove igmp-snooping from access switches - [[phab:T257573|T257573]]
* 08:23 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
* 08:15 moritzm: installing python-urllib3 security updates
* 08:15 XioNoX: remove PIM config from eqord/eqdfw/knams routers - [[phab:T257573|T257573]]
* 08:14 XioNoX: remove PIM config from eqiad routers - [[phab:T257573|T257573]]
* 08:11 XioNoX: remove PIM config from esams routers - [[phab:T257573|T257573]]
* 08:09 XioNoX: remove PIM config from eqsin routers - [[phab:T257573|T257573]]
* 08:08 jbond42: update mail delivery for phabricator to use phabricator.discovery.wmnet cname
* 08:07 XioNoX: remove PIM config from codfw routers - [[phab:T257573|T257573]]
* 08:06 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2081', diff saved to https://phabricator.wikimedia.org/P11923 and previous config saved to /var/cache/conftool/dbconfig/20200716-080613-marostegui.json
* 08:03 XioNoX: remove PIM config from ulsfo routers - [[phab:T257573|T257573]]
* 07:41 jayme: imported envoyproxy_1.14.4-1 to stretch-wikimedia
* 07:31 jayme: imported envoyproxy_1.14.4-1 to buster-wikimedia
* 07:28 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1131', diff saved to https://phabricator.wikimedia.org/P11922 and previous config saved to /var/cache/conftool/dbconfig/20200716-072838-marostegui.json
* 07:25 marostegui: Drop database reviewdb-test [[phab:T255715|T255715]]
* 07:03 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1131', diff saved to https://phabricator.wikimedia.org/P11921 and previous config saved to /var/cache/conftool/dbconfig/20200716-070331-marostegui.json
* 06:40 XioNoX: remove peering with AS8403 in eqsin (peer left the IX)
* 05:13 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1131', diff saved to https://phabricator.wikimedia.org/P11920 and previous config saved to /var/cache/conftool/dbconfig/20200716-051342-marostegui.json
* 05:11 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1131', diff saved to https://phabricator.wikimedia.org/P11919 and previous config saved to /var/cache/conftool/dbconfig/20200716-051109-marostegui.json


== 2020-07-15 ==
== 2022-08-05 ==
* 23:54 eileen: tools revision changed from {{Gerrit|7b6018a16e}} to {{Gerrit|711d671600}}
* 22:20 dcausse@deploy1002: Finished deploy [wikimedia/discovery/analytics@71fe016]: Fix schedule_interval for image_recommendation_weekly (duration: 02m 01s)
* 23:50 eileen: process-control config revision is {{Gerrit|1fc4a9686d}}
* 22:18 dcausse@deploy1002: Started deploy [wikimedia/discovery/analytics@71fe016]: Fix schedule_interval for image_recommendation_weekly
* 23:21 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
* 17:08 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1195.eqiad.wmnet with OS bullseye
* 23:04 bd808: tools.admin Removed valhallasw from maintainers ([[phab:T255697|T255697]])
* 16:54 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1194.eqiad.wmnet with OS bullseye
* 23:02 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
* 16:53 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1195.eqiad.wmnet with reason: host reimage
* 22:58 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
* 16:49 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1195.eqiad.wmnet with reason: host reimage
* 22:52 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
* 16:41 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1194.eqiad.wmnet with reason: host reimage
* 22:52 dzahn@cumin1001: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
* 16:37 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1194.eqiad.wmnet with reason: host reimage
* 22:52 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
* 16:34 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1195.eqiad.wmnet with OS bullseye
* 22:30 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' .
* 16:27 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp203[56]\.codfw\.wmnet,service=varnish-fe
* 22:30 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
* 16:27 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp203[56]\.codfw\.wmnet,service=ats-be
* 22:29 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' .
* 16:27 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp203[56]\.codfw\.wmnet,service=ats-tls
* 22:29 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
* 16:26 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1194.eqiad.wmnet with OS bullseye
* 22:27 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
* 16:25 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1193.eqiad.wmnet with OS bullseye
* 22:21 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' .
* 16:21 pt1979@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host db1192.eqiad.wmnet with OS bullseye
* 22:21 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
* 16:12 dcausse@deploy1002: Finished deploy [wikimedia/discovery/analytics@8489923]: [[phab:T304954|T304954]]: Automate imagesuggestion imports (duration: 02m 03s)
* 22:10 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
* 16:11 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1193.eqiad.wmnet with reason: host reimage
* 18:16 brennen: restarting jenkins for upgrade
* 16:11 milimetric@deploy1002: Finished deploy [analytics/refinery@fe7bf9e]: Hotfix for webrequest load refine, now with FORCE :) (duration: 06m 09s)
* 18:00 mutante: DNS - new language 'avk' has been added - This language is called Kotava and is "a proposed international auxiliary language (IAL) that focuses especially on the principle of cultural neutrality". Learn more at https://en.wikipedia.org/wiki/Kotava
* 16:10 dcausse@deploy1002: Started deploy [wikimedia/discovery/analytics@8489923]: [[phab:T304954|T304954]]: Automate imagesuggestion imports
* 17:32 mutante: puppetmaster - revoking cert for planet.discovery.wmnet, add planet.wikimedia.org, remove planet.svc records, remove specific and outdated hostnames ([[phab:T257840|T257840]])
* 16:07 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1193.eqiad.wmnet with reason: host reimage
* 16:11 moritzm: uploaded jenkins 2.235.2 to thirdparty/ci for stretch/buster [[phab:T257614|T257614]]
* 16:07 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1192.eqiad.wmnet with reason: host reimage
* 15:29 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 16:05 milimetric@deploy1002: Started deploy [analytics/refinery@fe7bf9e]: Hotfix for webrequest load refine, now with FORCE :)
* 15:24 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 16:04 milimetric@deploy1002: Finished deploy [analytics/refinery@fe7bf9e]: Hotfix for webrequest load refine (duration: 34m 38s)
* 15:24 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 16:03 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1192.eqiad.wmnet with reason: host reimage
* 15:20 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 15:55 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1193.eqiad.wmnet with OS bullseye
* 15:20 moritzm: rebooting webperf* hosts for kernel update
* 15:52 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1191.eqiad.wmnet with OS bullseye
* 14:58 addshore@deploy1001: Synchronized php-1.35.0-wmf.41/extensions/Wikibase/repo: [[gerrit:612723]] Stop checking if WikibaseLib is loaded [[phab:T258062|T258062]] (already on mwmaint1002) (duration: 01m 08s)
* 15:51 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1192.eqiad.wmnet with OS bullseye
* 14:51 addshore: pulled https://gerrit.wikimedia.org/r/612723 onto mwmaint1002 ahead of syncing everywhere (and CI finishing)
* 15:42 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1190.eqiad.wmnet with OS bullseye
* 14:37 ema: A:cp: upgrade purged to 0.17 [[phab:T257573|T257573]]
* 15:38 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1191.eqiad.wmnet with reason: host reimage
* 14:30 ema: upload purged 0.17 to buster-wikimedia [[phab:T257573|T257573]]
* 15:34 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1191.eqiad.wmnet with reason: host reimage
* 14:28 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Add exceptional wikitech VE/Parsoid config [[phab:T241961|T241961]] (duration: 01m 04s)
* 15:30 milimetric@deploy1002: Started deploy [analytics/refinery@fe7bf9e]: Hotfix for webrequest load refine
* 14:26 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: Add exceptional wikitech VE/Parsoid config [[phab:T241961|T241961]] (duration: 01m 05s)
* 15:28 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1190.eqiad.wmnet with reason: host reimage
* 14:25 gehel: repooling wdqs1006 - caught up on lag
* 15:25 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1190.eqiad.wmnet with reason: host reimage
* 14:12 akosiaris: increase codfw mobileapps kubernetes traffic to 2% [[phab:T218733|T218733]]
* 15:24 jbond: upload trapperkeeper-metrics-clojure to puppet7 component
* 14:10 akosiaris@cumin1001: conftool action : set/weight=132; selector: dc=codfw,service=mobileapps,name=scb.*
* 15:22 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1191.eqiad.wmnet with OS bullseye
* 13:58 jforrester@deploy1001: Synchronized php-1.35.0-wmf.41/extensions/UrlShortener/includes/UrlShortenerUtils.php: [[phab:T258056|T258056]] Add temporary fix to ensure array is passed to array_map() (duration: 01m 08s)
* 15:19 jbond: upload puppetlabs-http-client-clojur to puppet7 component
* 13:54 akosiaris: pool kubernetes nodes for mobileapps in codfw
* 15:17 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:53 akosiaris@cumin1001: conftool action : set/pooled=yes; selector: dc=codfw,service=mobileapps,name=kubernetes.*
* 15:16 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:53 akosiaris@cumin1001: conftool action : set/weight=264; selector: dc=codfw,service=mobileapps,name=scb.*
* 15:16 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:51 akosiaris@cumin1001: conftool action : set/weight=1; selector: dc=codfw,service=mobileapps,name=kubernetes.*
* 15:15 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:04 jforrester@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.41 (duration: 01m 05s)
* 15:14 dancy@deploy1002: Finished scap: Backport for [[gerrit:820653]] scap gitignore: ignore all files under the `scap` directory (duration: 04m 41s)
* 13:03 jforrester@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.41
* 15:11 jbond: upload jolokia to puppet7 component
* 11:59 addshore: deploy window closed / done :)
* 15:10 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1185.eqiad.wmnet with OS bullseye
* 11:57 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: [[gerrit:609987]] Commons: Define entity sources configuration (take 2) [[phab:T254315|T254315]] (duration: 01m 03s)
* 15:09 dancy@deploy1002: Started scap: Backport for [[gerrit:820653]] scap gitignore: ignore all files under the `scap` directory
* 11:36 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: [[gerrit:612668]] Wikibase test: Client local entity sources are always testwikidata [[phab:T254315|T254315]] (duration: 01m 05s)
* 15:09 jbond: upload test-chuck-clojure to puppet7 component
* 11:27 addshore@deploy1001: Synchronized wmf-config: [[phab:T254315|T254315]] [[gerrit:612669]] Wikidata test: Split client db lists. PT2/2 (duration: 01m 06s)
* 15:05 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1190.eqiad.wmnet with OS bullseye
* 11:26 addshore@deploy1001: Synchronized dblists/wikidataclient.dblist: [[phab:T254315|T254315]] [[gerrit:612669]] Wikidata test: Split client db lists. PT1/2 (duration: 01m 05s)
* 15:04 jbond: upload test-check-clojure to puppet7 component
* 11:16 XioNoX: remove as-path prepending in esams
* 14:57 jbond: upload nippy-clojure to puppet7 component
* 11:11 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings-labs.php: LABS [[gerrit:612667]] Wikibase labs: All client "local" entity sources are wikidata [[phab:T254315|T254315]] (duration: 01m 04s)
* 14:56 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1185.eqiad.wmnet with reason: host reimage
* 11:08 addshore@deploy1001: Synchronized wmf-config/Wikibase.php: [[gerrit:612666]] Wikibase: Split localEntitySourceName config for repo and client [[phab:T254315|T254315]] (duration: 01m 16s)
* 14:52 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1185.eqiad.wmnet with reason: host reimage
* 11:05 XioNoX: re-enable ping offload in esams
* 14:43 jbond: upload fressian to puppet7 component
* 11:05 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 14:40 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1185.eqiad.wmnet with OS bullseye
* 11:01 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 14:40 jbond: upload test-generative-clojure to puppet7 component
* 10:56 XioNoX: disable ping offload in esams
* 14:35 pt1979@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 10:55 XioNoX: re-enable ping offload in codfw
* 14:34 jbond: upload data-generators-clojure to puppet7 component
* 10:52 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 14:31 pt1979@cumin2002: START - Cookbook sre.dns.netbox
* 10:50 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 14:23 jbond: upload encore-clojure to puppet7 component
* 10:45 XioNoX: disable ping offload in codfw
* 14:17 jbond: upload truss-clojure to puppet7 component
* 10:44 XioNoX: re-enable ping offload in eqiad
* 14:13 jbond: upload structured-logging-clojure to puppet7 component
* 10:43 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 14:06 jbond: upload murphy-clojure to puppet7 component
* 10:41 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 13:57 jbond: upload logstash-logback-encoder-7.2 to puppet7 component
* 10:31 XioNoX: disable ping offload in eqiad
* 13:49 jbond: upload kitchensink-clojure to puppet7 component
* 10:31 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' .
* 13:27 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool hosts with fragile power supply ([[phab:T314559|T314559]] [[phab:T314628|T314628]])', diff saved to https://phabricator.wikimedia.org/P32292 and previous config saved to /var/cache/conftool/dbconfig/20220805-132709-ladsgroup.json
* 10:30 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' .
* 13:12 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on db2095.codfw.wmnet with reason: Maintenance
* 10:30 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
* 13:12 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on db2095.codfw.wmnet with reason: Maintenance
* 10:30 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
* 13:09 sukhe: repool codfw
* 10:26 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1120 after reimage', diff saved to https://phabricator.wikimedia.org/P11916 and previous config saved to /var/cache/conftool/dbconfig/20200715-102605-marostegui.json
* 13:02 jbond: upload honeysql-clojure to puppet7 component
* 10:20 jayme: updating python3-docker-report to 0.0.5-1 on deneb
* 12:53 _joe_: progressive repool of services in codfw
* 10:08 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1120 after reimage', diff saved to https://phabricator.wikimedia.org/P11915 and previous config saved to /var/cache/conftool/dbconfig/20200715-100855-marostegui.json
* 12:24 moritzm: installing nano bugfix updates from bullseye point release
* 10:07 jayme: imported docker-report_0.0.5-1 to buster-wikimedia
* 11:50 hnowlan@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
* 09:48 marostegui: Deploy schema change on s8 codfw master, lag will appear on codfw [[phab:T256685|T256685]]
* 11:40 hnowlan@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
* 09:42 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1120 after reimage', diff saved to https://phabricator.wikimedia.org/P11914 and previous config saved to /var/cache/conftool/dbconfig/20200715-094226-marostegui.json
* 11:37 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repool after PDU maint on D3 ([[phab:T310146|T310146]])', diff saved to https://phabricator.wikimedia.org/P32291 and previous config saved to /var/cache/conftool/dbconfig/20220805-113729-ladsgroup.json
* 09:22 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
* 11:35 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repool after PDU maint on C6 ([[phab:T310145|T310145]])', diff saved to https://phabricator.wikimedia.org/P32290 and previous config saved to /var/cache/conftool/dbconfig/20220805-113555-ladsgroup.json
* 09:21 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
* 11:34 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repool after PDU maint on C5 ([[phab:T310145|T310145]])', diff saved to https://phabricator.wikimedia.org/P32289 and previous config saved to /var/cache/conftool/dbconfig/20220805-113436-ladsgroup.json
* 09:19 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
* 10:46 hnowlan@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
* 09:19 akosiaris: deploy mobileapps in kubernetes to talk HTTPS to the mw API
* 10:36 hnowlan@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
* 09:10 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
* 10:17 hnowlan@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
* 09:10 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
* 10:12 Amir1: dbmaint at s4@codfw ([[phab:T312863|T312863]])
* 09:07 akosiaris: Correction: deploy eventgate-analytics-external in staging, eqiad, codfw for switching to using discovery records and HTTPS for talking to the API
* 10:07 hnowlan@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
* 09:06 akosiaris: deploy eventgate-analytics in staging, eqiad, codfw for switching to using discovery records and HTTPS for talking to the API
* 09:04 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 12 hosts with reason: Maintenance
* 09:06 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
* 09:03 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 12 hosts with reason: Maintenance
* 09:06 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
* 09:03 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance
* 09:05 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1099:3318', diff saved to https://phabricator.wikimedia.org/P11913 and previous config saved to /var/cache/conftool/dbconfig/20200715-090545-marostegui.json
* 09:03 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance
* 09:04 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
* 00:53 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8 days, 0:00:00 on gerrit2001.wikimedia.org with reason: decom, replaced by gerrit2002
* 09:04 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
* 00:53 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 8 days, 0:00:00 on gerrit2001.wikimedia.org with reason: decom, replaced by gerrit2002
* 08:50 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1120 after reimage', diff saved to https://phabricator.wikimedia.org/P11912 and previous config saved to /var/cache/conftool/dbconfig/20200715-085032-marostegui.json
* 00:53 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for gerrit2002.wikimedia.org
* 08:39 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 00:53 dzahn@cumin1001: START - Cookbook sre.hosts.remove-downtime for gerrit2002.wikimedia.org
* 08:36 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
* 00:52 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8 days, 0:00:00 on gerrit2002.wikimedia.org with reason: decom, replaced by gerrit2002
* 08:19 moritzm: piwik.wikimedia.org switched to CAS authentication
* 00:52 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 8 days, 0:00:00 on gerrit2002.wikimedia.org with reason: decom, replaced by gerrit2002
* 08:19 elukey: move piwik.wikimedia.org to CAS (idp.wikimedia.org)
* 00:18 mutante: restarting gerrit for config change - removing old replica [[phab:T313250|T313250]]
* 07:29 XioNoX: delete deprecated AS3209 AMS-IX router
* 06:59 dcausse: depooling wdqs1006 (high lag)
* 06:09 marostegui: Stop replication on db1120 to avoid having 10.4 -> 10.1 replication for long [[phab:T254871|T254871]]
* 06:06 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1120 for reimage [[phab:T254871|T254871]]', diff saved to https://phabricator.wikimedia.org/P11911 and previous config saved to /var/cache/conftool/dbconfig/20200715-060649-marostegui.json
* 06:01 marostegui@cumin1001: dbctl commit (dc=all): 'Promote db1103 to x1 master [[phab:T254871|T254871]]', diff saved to https://phabricator.wikimedia.org/P11910 and previous config saved to /var/cache/conftool/dbconfig/20200715-060145-marostegui.json
* 06:00 marostegui: Starting x1 failover from db1120 to db1103 - [[phab:T254871|T254871]]
* 05:29 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1099:3318 ', diff saved to https://phabricator.wikimedia.org/P11909 and previous config saved to /var/cache/conftool/dbconfig/20200715-052939-marostegui.json
* 04:46 marostegui: Start x1 pre failover steps [[phab:T254871|T254871]]
* 04:44 marostegui@cumin1001: dbctl commit (dc=all): 'Set db1103 weight to 0 before the switchover [[phab:T254871|T254871]]', diff saved to https://phabricator.wikimedia.org/P11908 and previous config saved to /var/cache/conftool/dbconfig/20200715-044432-marostegui.json
* 04:43 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1135', diff saved to https://phabricator.wikimedia.org/P11907 and previous config saved to /var/cache/conftool/dbconfig/20200715-044332-marostegui.json
* 01:45 eileen: tools revision changed from {{Gerrit|a9e7dc1559}} to {{Gerrit|7b6018a16e}}
* 00:26 ryankemper@deploy1001: Finished deploy [wdqs/wdqs@8f6f660]: 0.3.41 (duration: 15m 10s)
* 00:11 ryankemper@deploy1001: Started deploy [wdqs/wdqs@8f6f660]: 0.3.41


== 2020-07-14 ==
== 2022-08-04 ==
* 19:52 jforrester@deploy1001: Synchronized php-1.35.0-wmf.41/vendor/wikimedia/parsoid/: [[phab:T252448|T252448]] [[phab:T255190|T255190]] Bump Parsoid to v0.12.0-a23 (duration: 01m 06s)
* 23:07 mutante: switching gerrit-replica.wikimedia.org to new machine gerrit2002, dropping gerrit-replica-new.wikimedia.org [[phab:T313250|T313250]]
* 18:13 ryankemper: all long-running elasticsearch reindex jobs are complete
* 21:07 ryankemper@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 18:09 jforrester@deploy1001: Synchronized dblists/: [[phab:T32405|T32405]] [[phab:T254287|T254287]] Remove the mobilemainpagelegacy dblist (duration: 01m 04s)
* 20:59 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 18:07 jforrester@deploy1001: Synchronized multiversion/MWConfigCacheGenerator.php: [[phab
* 20:57 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:57 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:56 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:56 thcipriani@deploy1002: Finished scap: Backport for [[gerrit:819774]] tkwiki: Update wordmark (duration: 06m 12s)
* 20:51 ryankemper@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 20:51 ryankemper@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 20:51 ryankemper@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 20:50 thcipriani@deploy1002: Started scap: Backport for [[gerrit:819774]] tkwiki: Update wordmark
* 20:48 thcipriani@deploy1002: Finished scap: Backport for [[gerrit:812391]] [config]: Add click event logging for mobile and desktop (duration: 39m 16s)
* 20:45 ryankemper@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 20:24 ryankemper@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 20:23 ryankemper@deploy1002: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
* 20:22 ryankemper@deploy1002: helmfile [staging] START helmfile.d/


== 2020-07-13 ==
== 2022-08-03 ==
* 23:06 mutante: releases* delete /usr/local/sbin/sync-* scripts created by rsync::quickdatacopy and let puppet recreate the ones still needed
* 23:59 dzahn@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit1001.wikimedia.org with reason: service restart
* 22:27 krinkle@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|I80ca62643f5c}} (duration: 00m 58s)
* 23:50 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32270 and previous config saved to /var/cache/conftool/dbconfig/20220803-235030-marostegui.json
* 20:12 ebernhardson@deploy1001: Finished deploy [wikimedia/discovery/analytics@1edde21]: airflow: ship_to_es: Implement multi-index understanding (duration: 00m 29s)
* 22:50 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1170:3312 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32269 and previous config saved to /var/cache/conftool/dbconfig/20220803-225015-marostegui.json
* 20:12 ebernhardson@deploy1001: Started deploy [wikimedia/discovery/analytics@1edde21]: airflow: ship_to_es: Implement multi-index understanding
* 22:50 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
* 20:03 mutante: rsynced reprepro data from releases1001 to releases1002, releases2002
* 22:49 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
* 19:50 eileen: disable target smart job process-control config revision is {{Gerrit|b00e7680ca}}
* 22:49 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 9 hosts with reason: Maintenance
* 19:48 milimetric@deploy1001: Finished deploy [analytics/refinery@de0a1f1] (thin): Regular analytics weekly train THIN [analytics/refinery@de0a1f1] (duration: 00m 07s)
* 22:49 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on 9 hosts with reason: Maintenance
* 19:47 milimetric@deploy1001: Started deploy [analytics/refinery@de0a1f1] (thin): Regular analytics weekly train THIN [analytics/refinery@de0a1f1]
* 22:49 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance
* 19:47 milimetric@deploy1001: Finished deploy [analytics/refinery@de0a1f1]: Regular analytics weekly train [analytics/refinery@de0a1f1] (duration: 06m 41s)
* 22:49 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance
* 19:41 milimetric@deploy1001: Started deploy [analytics/refinery@de0a1f1]: Regular analytics weekly train [analytics/refinery@de0a1f1]
* 22:48 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
* 19:39 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
* 22:48 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
* 19:33 krinkle@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|I1a12124f1811e9a}} (duration: 00m 57s)
* 22:48 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32268 and previous config saved to /var/cache/conftool/dbconfig/20220803-224827-marostegui.json
* 18:53 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: [[phab:T248343|T248343]] Don't use the 'zeroconf' configuration for VisualEditor (duration: 00m 55s)
* 22:33 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P32267 and previous config saved to /var/cache/conftool/dbconfig/20220803-223321-marostegui.json
* 18:43 dcausse: BACON done
* 22:18 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P32266 and previous config saved to /var/cache/conftool/dbconfig/20220803-221815-marostegui.json
* 18:40 dcausse@deploy1001: Synchronized wmf-config/InitialiseSettings.php: [[phab:T257745|T257745]]: Add rollbacker to elwiki (duration: 00m 56s)
* 22:03 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32265 and previous config saved to /var/cache/conftool/dbconfig/20220803-220309-marostegui.json
* 18:26 dcausse@deploy1001: Synchronized wmf-config/InitialiseSettings.php: [[phab:T250810|T250810]]: Set proper language code for some wikis (duration: 00m 56s)
* 22:00 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32264 and previous config saved to /var/cache/conftool/dbconfig/20220803-220057-marostegui.json
* 18:18 dcausse@deploy1001: Synchronized wmf-config/InitialiseSettings.php: [[phab:T256928|T256928]]: Scale largest shards to be closer to 30GB (duration: 00m 56s)
* 22:00 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
* 16:17 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 22:00 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
* 16:17 aborrero@cumin1001: START - Cookbook sre.hosts.downtime
* 22:00 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance
* 15:56 ladsgroup@deploy1001: Synchronized wmf-config/Wikibase.php: [[gerrit:610265{{!}}Load WikibaseClient using extension registration in beta (T257435)]] (duration: 00m 55s)
* 22:00 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance
* 15:52 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1130', diff saved to https://phabricator.wikimedia.org/P11882 and previous config saved to /var/cache/conftool/dbconfig/20200713-155240-marostegui.json
* 22:00 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32263 and previous config saved to /var/cache/conftool/dbconfig/20220803-220007-marostegui.json
* 15:48 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1144:3315', diff saved to https://phabricator.wikimedia.org/P11881 and previous config saved to /var/cache/conftool/dbconfig/20200713-154847-marostegui.json
* 21:45 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P32262 and previous config saved to /var/cache/conftool/dbconfig/20220803-214501-marostegui.json
* 15:39 mholloway-shell@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'proton' for release 'production' .
* 21:44 damilare: payments-wiki updated from {{Gerrit|e1b6036a}} to {{Gerrit|712df4ce}}
* 15:35 mholloway-shell@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'proton' for release 'production' .
* 21:37 ryankemper@cumin1001: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster plugin upgrade - ryankemper@cumin1001 - [[phab:T314078|T314078]]
* 15:30 mholloway-shell@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'proton' for release 'production' .
* 21:35 ryankemper@deploy1002: helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
* 14:50 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Stop setting DiscussionToolsEnableVisual, default value (duration: 00m 57s)
* 21:35 ryankemper@deploy1002: helmfile [eqiad] START helmfile.d/services/changeprop: apply
* 14:17 moritzm: removing lilypond from production [[phab:T257066|T257066]]
* 21:30 ryankemper@deploy1002: helmfile [staging] DONE helmfile.d/services/changeprop: apply
* 13:36 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1144:3315', diff saved to https://phabricator.wikimedia.org/P11880 and previous config saved to /var/cache/conftool/dbconfig/20200713-133604-marostegui.json
* 21:30 ryankemper@deploy1002: helmfile [staging] START helmfile.d/services/changeprop: apply
* 13:35 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1082', diff saved to https://phabricator.wikimedia.org/P11879 and previous config saved to /var/cache/conftool/dbconfig/20200713-133535-marostegui.json
* 21:29 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P32261 and previous config saved to /var/cache/conftool/dbconfig/20220803-212955-marostegui.json
* 13:05 kormat@cumin1001: dbctl commit (dc=all): 'Fully repool es1022, and set es1020 to zero weight [[phab:T257284|T257284]]', diff saved to https://phabricator.wikimedia.org/P11878 and previous config saved to /var/cache/conftool/dbconfig/20200713-130532-kormat.json
* 21:14 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32260 and previous config saved to /var/cache/conftool/dbconfig/20220803-211449-marostegui.json
* 12:08 kormat@cumin1001: dbctl commit (dc=all): 'Start repooling es1022 after reimaging [[phab:T257284|T257284]]', diff saved to https://phabricator.wikimedia.org/P11873 and previous config saved to /var/cache/conftool/dbconfig/20200713-120818-kormat.json
* 21:12 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1105:3312 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32259 and previous config saved to /var/cache/conftool/dbconfig/20220803-211237-marostegui.json
* 11:49 Urbanecm: Password reset for User:Alert5 ([[phab:T257806|T257806]])
* 21:12 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
* 11:44 akosiaris: repool ganeti1007 [[phab:T244530|T244530]]. Start emptying ganeti1008
* 21:12 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
* 11:08 Urbanecm: EU B&C done
* 21:12 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1122 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32258 and previous config saved to /var/cache/conftool/dbconfig/20220803-211216-marostegui.json
* 11:06 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|896c042296b4e1f5d88f786981537655e5d9fea9}}: Enable SandboxLink extension in trwiki ([[phab:T256782|T256782]]) (duration: 00m 56s)
* 21:03 ejegg: updated standalone SmashPig deployment from {{Gerrit|8e8f0017}} to {{Gerrit|9b97ea15}}
* 10:44 jdrewniak@deploy1001: Synchronized portals: Wikimedia Portals Update: [[gerrit:612175{{!}} Bumping portals to master (612175)]] (duration: 00m 56s)
* 21:02 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 10:43 jdrewniak@deploy1001: Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:612175{{!}} Bumping portals to master (612175)]] (duration: 00m 56s)
* 21:01 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 09:42 kormat@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 21:01 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 09:39 kormat@cumin1001: START - Cookbook sre.hosts.downtime
* 21:00 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 08:58 ema: cp: rolling ats-backend-restart to apply SyslogIdentifier changes -> https://gerrit.wikimedia.org/r/c/operations/puppet/+/611311
* 20:57 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P32257 and previous config saved to /var/cache/conftool/dbconfig/20220803-205710-marostegui.json
* 08:57 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: [[phab:T248343|T248343]] Explicitly set visualeditor-enable to 0 when non-default (duration: 00m 57s)
* 20:55 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 08:44 kormat@cumin1001: dbctl commit (dc=all): 'Depool es1022 for reimaging [[phab:T257284|T257284]]', diff saved to https://phabricator.wikimedia.org/P11871 and previous config saved to /var/cache/conftool/dbconfig/20200713-084449-kormat.json
* 20:55 ebernhardson@deploy1002: Synchronized wmf-config/CirrusSearch-production.php: Config: [[gerrit:820223{{!}}cirrus: Set ElasticaWrite partition count for cloudelastic to 3]] (duration: 03m 29s)
* 08:39 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1093', diff saved to https://phabricator.wikimedia.org/P11870 and previous config saved to /var/cache/conftool/dbconfig/20200713-083902-marostegui.json
* 20:54 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 08:34 kormat@cumin1001: dbctl commit (dc=all): 'Add weight to es1020, reduce weight on es1022 [[phab:T257284|T257284]]', diff saved to https://phabricator.wikimedia.org/P11869 and previous config saved to /var/cache/conftool/dbconfig/20200713-083414-kormat.json
* 20:54 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 08:20 kormat: reimaging es1022 [[phab:T257284|T257284]]
* 20:53 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 06:54 marostegui@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
* 20:48 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 06:53 marostegui@cumin1001: START - Cookbook sre.hosts.decommission
* 20:48 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 06:52 marostegui@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
* 20:48 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 06:52 marostegui@cumin1001: START - Cookbook sre.hosts.decommission
* 20:43 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 06:51 marostegui@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99)
* 20:43 urbanecm@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/VisualEditor/includes/VisualEditorParsoidClient.php: {{Gerrit|a804fe18f1e14795ba7836d3ebf6c361bb1538a7}}: Update call to PageConfigFactory::create to use new signature ([[phab:T314523|T314523]]) (duration: 03m 25s)
* 06:50 marostegui@cumin1001: START - Cookbook sre.hosts.decommission
* 20:42 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P32256 and previous config saved to /var/cache/conftool/dbconfig/20220803-204204-marostegui.json
* 06:16 marostegui: Reverse gerrit password on m2 master - [[phab:T255715|T255715]]
* 20:39 urbanecm@deploy1002: sync-file aborted: {{Gerrit|a804fe18f1e14795ba7836d3ebf6c361bb1538a7}}: Update call to PageConfigFactory::create to use new signature ([[phab:T314523|T314523]]) (duration: 00m 00s)
* 06:04 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1093', diff saved to https://phabricator.wikimedia.org/P11868 and previous config saved to /var/cache/conftool/dbconfig/20200713-060410-marostegui.json
* 20:38 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 05:54 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1093', diff saved to https://phabricator.wikimedia.org/P11867 and previous config saved to /var/cache/conftool/dbconfig/20200713-055422-marostegui.json
* 20:36 urbanecm@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/DiscussionTools/: {{Gerrit|b840eef86837aed3e566885110e93b2ca9ab5f42}}: Fix ReplyLinksController#teardown (duration: 03m 27s)
* 05:48 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1093 for upgrade', diff saved to https://phabricator.wikimedia.org/P11866 and previous config saved to /var/cache/conftool/dbconfig/20200713-054840-marostegui.json
* 20:34 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 05:34 marostegui: Deploy schema change on s3 codfw master, lag will appear on codfw [[phab:T253276|T253276]]
* 20:34 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 05:30 marostegui: Stop replication on db1082 for schema change and triggers removal [[phab:T238966|T238966]]
* 20:33 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 05:29 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1082', diff saved to https://phabricator.wikimedia.org/P11865 and previous config saved to /var/cache/conftool/dbconfig/20200713-052928-marostegui.json
* 20:31 urbanecm@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/CirrusSearch/: {{Gerrit|70a18f5846111a0dfe8ba473daf384cbb8e88804}}: Add explicit partitioning key to ElasticaWrite ([[phab:T314426|T314426]]) (duration: 03m 13s)
* 05:14 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1119 for innodb compression', diff saved to https://phabricator.wikimedia.org/P11864 and previous config saved to /var/cache/conftool/dbconfig/20200713-051428-marostegui.json
* 20:28 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:28 urbanecm@deploy1002: Synchronized php-1.39.0-wmf.22/extensions/CirrusSearch/: {{Gerrit|9961e9bc8f5873f8ddc8a11108de0a7bfcb14ae6}}: Add explicit partitioning key to ElasticaWrite ([[phab:T314426|T314426]]) (duration: 03m 23s)
* 20:28 cwhite@cumin2002: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host logstash2032.codfw.wmnet
* 20:27 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:27 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:27 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1122 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32255 and previous config saved to /var/cache/conftool/dbconfig/20220803-202658-marostegui.json
* 20:23 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:21 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1122 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32254 and previous config saved to /var/cache/conftool/dbconfig/20220803-202146-marostegui.json
* 20:21 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance
* 20:21 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance
* 20:21 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32253 and previous config saved to /var/cache/conftool/dbconfig/20220803-202125-marostegui.json
* 20:14 rzl@deploy1002: helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
* 20:13 rzl@deploy1002: helmfile [codfw] START helmfile.d/services/mobileapps: apply
* 20:13 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:12 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|195f8090b9694be65c937cea108ff4f6400972ec}}: Start writing to cuc_actor on test wikis ([[phab:T233004|T233004]]) (duration: 03m 27s)
* 20:09 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:09 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:09 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:08 cwhite@cumin2002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) logstash2032.codfw.wmnet on all recursors
* 20:08 cwhite@cumin2002: START - Cookbook sre.dns.wipe-cache logstash2032.codfw.wmnet on all recursors
* 20:08 cwhite@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 20:07 mutante: gerrit - adding second replica [[phab:T313250|T313250]]
* 20:06 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P32252 and previous config saved to /var/cache/conftool/dbconfig/20220803-200619-marostegui.json
* 20:04 cwhite@cumin2002: START - Cookbook sre.dns.netbox
* 20:03 cwhite@cumin2002: START - Cookbook sre.ganeti.makevm for new host logstash2032.codfw.wmnet
* 20:00 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes2012.codfw.wmnet
* 20:00 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for kubernetes2012.codfw.wmnet
* 20:00 rzl@deploy1002: conftool action : set/pooled=yes; selector: name=kubernetes2012.codfw.wmnet
* 19:51 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P32251 and previous config saved to /var/cache/conftool/dbconfig/20220803-195113-marostegui.json
* 19:40 ryankemper: [[phab:T314078|T314078]] Forgot to mention, restart is at `ryankemper@cumin1001` tmux session `codfw_restarts`
* 19:39 ryankemper: [[phab:T314078|T314078]] Rolling upgrade of codfw hosts; after this all of eqiad/codfw will have the new plugin version and we can resume the `search-loader` instances: `sudo -E cookbook sre.elasticsearch.rolling-operation search_codfw "codfw cluster plugin upgrade" --upgrade --nodes-per-run 3 --start-datetime 2022-08-03T19:38:10 --task-id [[phab:T314078|T314078]]`
* 19:38 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster plugin upgrade - ryankemper@cumin1001 - [[phab:T314078|T314078]]
* 19:36 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32250 and previous config saved to /var/cache/conftool/dbconfig/20220803-193607-marostegui.json
* 19:33 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32249 and previous config saved to /var/cache/conftool/dbconfig/20220803-193354-marostegui.json
* 19:33 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance
* 19:33 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance
* 19:33 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1129 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32248 and previous config saved to /var/cache/conftool/dbconfig/20220803-193334-marostegui.json
* 19:25 mutante: gerrit1001 - rsyncing /var/lib/gerrit/review_site/ over to gerrit2002 815401
* 19:18 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P32247 and previous config saved to /var/cache/conftool/dbconfig/20220803-191828-marostegui.json
* 19:03 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P32246 and previous config saved to /var/cache/conftool/dbconfig/20220803-190321-marostegui.json
* 18:56 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes2011.codfw.wmnet
* 18:56 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for kubernetes2011.codfw.wmnet
* 18:56 rzl@deploy1002: conftool action : set/pooled=yes; selector: name=kubernetes2011.codfw.wmnet
* 18:33 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2027,2037].codfw.wmnet
* 18:33 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc[2027,2037].codfw.wmnet
* 18:23 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 18:16 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 18:16 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 18:16 dancy@deploy1002: Synchronized php: group1 wikis to 1.39.0-wmf.23  refs [[phab:T308076|T308076]] (duration: 03m 37s)
* 18:15 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 18:12 dancy@deploy1002: rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.23  refs [[phab:T308076|T308076]]
* 17:58 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubestage2002.codfw.wmnet
* 17:58 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for kubestage2002.codfw.wmnet
* 17:57 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2025-2026].codfw.wmnet
* 17:57 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc[2025-2026].codfw.wmnet
* 17:57 bking@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for elastic2044.codfw.wmnet
* 17:57 bking@cumin1001: START - Cookbook sre.hosts.remove-downtime for elastic2044.codfw.wmnet
* 17:56 bking@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for elastic2043.codfw.wmnet
* 17:56 bking@cumin1001: START - Cookbook sre.hosts.remove-downtime for elastic2043.codfw.wmnet
* 17:55 ottomata: increasing partitions from 5 to 6 for *.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite topics in Kafka main-eqiad and main-codfw - [[phab:T314426|T314426]]
* 17:55 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be2055.codfw.wmnet
* 17:55 mvernon@cumin1001: START - Cookbook sre.hosts.remove-downtime for ms-be2055.codfw.wmnet
* 17:50 rzl@cumin1001: conftool action : set/pooled=yes; selector: name=kubestage2002.codfw.wmnet
* 17:38 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2008-2010].codfw.wmnet
* 17:38 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for parse[2008-2010].codfw.wmnet
* 17:23 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=restbase20[12]4.codfw.wmnet
* 17:14 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 6 hosts
* 17:14 mvernon@cumin1001: START - Cookbook sre.hosts.remove-downtime for 6 hosts
* 17:08 ryankemper: [[phab:T310145|T310145]] `elastic2031` and `wcqs2002` powered off in preparation for C1 maintenance
* 17:06 jayme@cumin1001: conftool action : set/pooled=yes; selector: name=(kubernetes2020.codfw.wmnet{{!}}kubernetes2009.codfw.wmnet{{!}}kubernetes2010.codfw.wmnet)
* 17:00 btullis@cumin1001: END (FAIL) - Cookbook sre.aqs.roll-restart (exit_code=99) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
* 16:48 Emperor: shutdown moss-fe2001.codfw.wmnet,ms-fe2011.codfw.wmnet,ms-be20[34,35,42,48,55,68].codfw.wmnet PDU work [[phab:T310145|T310145]]
* 16:47 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 8 hosts with reason: PDU work
* 16:47 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on gerrit2002.wikimedia.org with reason: in setup / flapping
* 16:47 mvernon@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 8 hosts with reason: PDU work
* 16:47 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on gerrit2002.wikimedia.org with reason: in setup / flapping
* 16:46 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be[2033,2047].codfw.wmnet,thanos-be2002.codfw.wmnet
* 16:46 mvernon@cumin1001: START - Cookbook sre.hosts.remove-downtime for ms-be[2033,2047].codfw.wmnet,thanos-be2002.codfw.wmnet
* 16:40 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2046.codfw.wmnet
* 16:40 jayme@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc2046.codfw.wmnet
* 16:39 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 10 hosts
* 16:39 jayme@cumin1001: START - Cookbook sre.hosts.remove-downtime for 10 hosts
* 16:38 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2023.codfw.wmnet
* 16:38 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc2023.codfw.wmnet
* 16:37 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on gitlab-runner2002.codfw.wmnet with reason: PDU swap
* 16:37 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 0:30:00 on gitlab-runner2002.codfw.wmnet with reason: PDU swap
* 16:35 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mc[2025-2026].codfw.wmnet with reason: PDU swap
* 16:35 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 0:30:00 on mc[2025-2026].codfw.wmnet with reason: PDU swap
* 16:32 jelto: power off mc2025-2026
* 16:31 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for rdb2008.codfw.wmnet
* 16:30 jayme@cumin1001: START - Cookbook sre.hosts.remove-downtime for rdb2008.codfw.wmnet
* 16:28 btullis@cumin1001: START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
* 16:28 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes[2009-2010,2020].codfw.wmnet
* 16:27 jayme@cumin1001: START - Cookbook sre.hosts.remove-downtime for kubernetes[2009-2010,2020].codfw.wmnet
* 16:11 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 12 hosts
* 16:11 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for 12 hosts
* 16:08 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 15 hosts
* 16:08 jayme@cumin1001: START - Cookbook sre.hosts.remove-downtime for 15 hosts
* 16:08 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs[2005-2008].codfw.wmnet
* 16:08 mvernon@cumin1001: START - Cookbook sre.hosts.remove-downtime for aqs[2005-2008].codfw.wmnet
* 15:59 Emperor: shutdown ms-be20[33,47],thanos-be2002 prior to PDU work [[phab:T310070|T310070]]
* 15:58 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be[2033,2047].codfw.wmnet,thanos-be2002.codfw.wmnet with reason: PDU work
* 15:58 mvernon@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be[2033,2047].codfw.wmnet,thanos-be2002.codfw.wmnet with reason: PDU work
* 15:52 jelto: pooling mw2259-2270 again
* 15:45 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32242 and previous config saved to /var/cache/conftool/dbconfig/20220803-154515-marostegui.json
* 15:38 vgutierrez: clearing ats-be cache on cp6008 - [[phab:T309651|T309651]]
* 15:38 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 15:38 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 15:37 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 15:37 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 15:36 elukey: powercycle kafka-logging2003 - not responsive to serial console
* 15:36 urbanecm@deploy1002: Synchronized php-1.39.0-wmf.22/extensions/GrowthExperiments/includes/NewcomerTasks/AddImage/ServiceImageRecommendationProvider.php: {{Gerrit|4438957e78e0012aff646e52dc16a4fb796cfd6b}}: ServiceImageRecommendationProvider: Add extra logging when no JSON response received ([[phab:T313973|T313973]]) (duration: 03m 04s)
* 15:35 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on maps2009.codfw.wmnet with reason: PDU maintenance
* 15:35 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 3:00:00 on maps2009.codfw.wmnet with reason: PDU maintenance
* 15:34 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=maps2009.codfw.wmnet
* 15:32 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on restbase2024.codfw.wmnet with reason: PDU maintenance
* 15:32 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on restbase2024.codfw.wmnet with reason: PDU maintenance
* 15:32 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase2024.codfw.wmnet
* 15:30 vgutierrez: clearing ats-be cache on cp6016 - [[phab:T309651|T309651]]
* 15:30 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P32241 and previous config saved to /var/cache/conftool/dbconfig/20220803-153009-marostegui.json
* 15:24 jayme@cumin1001: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) _etcd._tcp.eqsin.wmnet on all recursors
* 15:24 jayme@cumin1001: START - Cookbook sre.dns.wipe-cache _etcd._tcp.eqsin.wmnet on all recursors
* 15:24 jayme@cumin1001: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) _etcd._tcp.ulsfo.wmnet on all recursors
* 15:24 jayme@cumin1001: START - Cookbook sre.dns.wipe-cache _etcd._tcp.ulsfo.wmnet on all recursors
* 15:24 jayme@cumin1001: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) _etcd._tcp.codfw.wmnet on all recursors
* 15:24 jayme@cumin1001: START - Cookbook sre.dns.wipe-cache _etcd._tcp.codfw.wmnet on all recursors
* 15:21 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=restbase2021.codfw.wmnet
* 15:19 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2030.codfw.wmnet with reason: [[phab:T310070|T310070]]
* 15:19 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2030.codfw.wmnet with reason: [[phab:T310070|T310070]]
* 15:15 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P32240 and previous config saved to /var/cache/conftool/dbconfig/20220803-151502-marostegui.json
* 15:10 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for conf2004.codfw.wmnet
* 15:10 jayme@cumin1001: START - Cookbook sre.hosts.remove-downtime for conf2004.codfw.wmnet
* 15:04 jelto: power off mc2023
* 14:59 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32239 and previous config saved to /var/cache/conftool/dbconfig/20220803-145956-marostegui.json
* 14:59 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mc2023.codfw.wmnet with reason: PDU swap
* 14:59 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 0:30:00 on mc2023.codfw.wmnet with reason: PDU swap
* 14:58 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32238 and previous config saved to /var/cache/conftool/dbconfig/20220803-145849-marostegui.json
* 14:58 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1172.eqiad.wmnet with reason: Maintenance
* 14:58 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1172.eqiad.wmnet with reason: Maintenance
* 14:58 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1109 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32237 and previous config saved to /var/cache/conftool/dbconfig/20220803-145828-marostegui.json
* 14:56 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 14:56 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 14:56 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 14:53 dancy@deploy1002: Pruned MediaWiki: 1.39.0-wmf.19 (duration: 05m 37s)
* 14:51 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 14:47 dancy@deploy1002: Pruned MediaWiki: 1.39.0-wmf.21 (duration: 06m 13s)
* 14:46 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 14:46 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 14:46 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 14:45 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 14:43 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1109', diff saved to https://phabricator.wikimedia.org/P32236 and previous config saved to /var/cache/conftool/dbconfig/20220803-144322-marostegui.json
* 14:34 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2029.codfw.wmnet with reason: [[phab:T310070|T310070]]
* 14:33 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2029.codfw.wmnet with reason: [[phab:T310070|T310070]]
* 14:32 Emperor: shutdown aqs200[5-8] prior to PDU work [[phab:T310070|T310070]]
* 14:31 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on aqs[2005-2008].codfw.wmnet with reason: PDU work
* 14:31 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on thumbor[2003-2004].codfw.wmnet with reason: PDU swap
* 14:31 mvernon@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on aqs[2005-2008].codfw.wmnet with reason: PDU work
* 14:31 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 0:30:00 on thumbor[2003-2004].codfw.wmnet with reason: PDU swap
* 14:28 jelto: power off thumbor2003 and thumbor2004
* 14:28 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1109', diff saved to https://phabricator.wikimedia.org/P32235 and previous config saved to /var/cache/conftool/dbconfig/20220803-142816-marostegui.json
* 14:27 moritzm: upgrading ganeti/esams to Ganeti 3.0.2 [[phab:T312637|T312637]]
* 14:13 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1109 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32234 and previous config saved to /var/cache/conftool/dbconfig/20220803-141310-marostegui.json
* 14:11 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1109 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32233 and previous config saved to /var/cache/conftool/dbconfig/20220803-141103-marostegui.json
* 14:10 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1109.eqiad.wmnet with reason: Maintenance
* 14:10 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1109.eqiad.wmnet with reason: Maintenance
* 14:10 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32232 and previous config saved to /var/cache/conftool/dbconfig/20220803-141042-marostegui.json
* 14:06 moritzm: installing freetype security updates on bullseye
* 13:57 cdanis: ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕙☕ sudo cumin 'P<nowiki>{</nowiki>R:Class = Confd<nowiki>}</nowiki>' 'systemctl restart confd'
* 13:55 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P32231 and previous config saved to /var/cache/conftool/dbconfig/20220803-135536-marostegui.json
* 13:46 cdanis: ✔️ cdanis@deploy1002.eqiad.wmnet ~ 🕙☕ sudo systemctl restart confd
* 13:40 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P32230 and previous config saved to /var/cache/conftool/dbconfig/20220803-134030-marostegui.json
* 13:30 moritzm: installing Java 8 security updates for Buster
* 13:25 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32229 and previous config saved to /var/cache/conftool/dbconfig/20220803-132524-marostegui.json
* 13:24 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:23 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:23 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:19 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1099:3318 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32228 and previous config saved to /var/cache/conftool/dbconfig/20220803-131916-marostegui.json
* 13:19 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance
* 13:19 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance
* 13:18 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1114 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32227 and previous config saved to /var/cache/conftool/dbconfig/20220803-131855-marostegui.json
* 13:18 sukhe: depool codfw for PDU upgrade: CR 819798
* 13:16 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:16 urbanecm@deploy1002: Synchronized wmf-config/MetaContactPages.php: {{Gerrit|f89f02e306a1fa580fa41ba56de978f4208ea672}}: Amend license request contact form per Legal ([[phab:T303359|T303359]]) (duration: 09m 27s)
* 13:12 jbond: introduce puppetmaster[12]004 for now as offline
* 13:11 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:09 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on kafka-logging2003.codfw.wmnet with reason: pdu
* 13:09 root@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on kafka-logging2003.codfw.wmnet with reason: pdu
* 13:07 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:07 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:06 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:05 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2044.codfw.wmnet with reason: [[phab:T310070|T310070]]
* 13:05 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2044.codfw.wmnet with reason: [[phab:T310070|T310070]]
* 13:04 pt1979@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:03 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P32226 and previous config saved to /var/cache/conftool/dbconfig/20220803-130348-marostegui.json
* 12:59 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2043.codfw.wmnet with reason: [[phab:T310070|T310070]]
* 12:59 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2043.codfw.wmnet with reason: [[phab:T310070|T310070]]
* 12:56 pt1979@cumin1001: START - Cookbook sre.dns.netbox
* 12:48 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P32224 and previous config saved to /var/cache/conftool/dbconfig/20220803-124842-marostegui.json
* 12:40 moritzm: uploaded openjdk-8 8u342-b07-1~deb10u1 to component/jdk8 for buster-wikimedia (rebuild of latest Java 8 security update)
* 12:36 oblivian@deploy1002: helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
* 12:36 oblivian@deploy1002: helmfile [codfw] START helmfile.d/services/mobileapps: apply
* 12:33 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1114 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32223 and previous config saved to /var/cache/conftool/dbconfig/20220803-123336-marostegui.json
* 12:29 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1114 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32222 and previous config saved to /var/cache/conftool/dbconfig/20220803-122929-marostegui.json
* 12:29 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1114.eqiad.wmnet with reason: Maintenance
* 12:28 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1114.eqiad.wmnet with reason: Maintenance
* 12:28 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1116.eqiad.wmnet with reason: Maintenance
* 12:28 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1116.eqiad.wmnet with reason: Maintenance
* 12:28 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32221 and previous config saved to /var/cache/conftool/dbconfig/20220803-122819-marostegui.json
* 12:16 ebysans@deploy1002: Finished deploy [airflow-dags/analytics@614f7b2]: (no justification provided) (duration: 00m 11s)
* 12:16 ebysans@deploy1002: Started deploy [airflow-dags/analytics@614f7b2]: (no justification provided)
* 12:13 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P32220 and previous config saved to /var/cache/conftool/dbconfig/20220803-121313-marostegui.json
* 11:58 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P32219 and previous config saved to /var/cache/conftool/dbconfig/20220803-115807-marostegui.json
* 11:57 marostegui@cumin1001: dbctl commit (dc=all): 'Add db2176 to s1 [[phab:T311494|T311494]]', diff saved to https://phabricator.wikimedia.org/P32218 and previous config saved to /var/cache/conftool/dbconfig/20220803-115706-marostegui.json
* 11:49 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cumin2002.codfw.wmnet with reason: PDU maintenance, [[phab:T310145|T310145]]
* 11:49 root@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cumin2002.codfw.wmnet with reason: PDU maintenance, [[phab:T310145|T310145]]
* 11:46 jayme@cumin1001: conftool action : set/weight=10; selector: name=(kubernetes2019.codfw.wmnet{{!}}kubernetes2021.codfw.wmnet{{!}}kubernetes2022.codfw.wmnet{{!}}kubernetes2018.codfw.wmnet{{!}}kubernetes2020.codfw.wmnet)
* 11:43 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32217 and previous config saved to /var/cache/conftool/dbconfig/20220803-114301-marostegui.json
* 11:41 jayme@cumin1001: conftool action : set/pooled=inactive; selector: name=(kubernetes2020.codfw.wmnet{{!}}kubernetes2009.codfw.wmnet{{!}}kubernetes2010.codfw.wmnet{{!}}kubernetes2011.codfw.wmnet{{!}}kubernetes2012.codfw.wmnet{{!}}kubestage2002.codfw.wmnet)
* 11:38 hnowlan@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host restbase2022.codfw.wmnet
* 11:37 hnowlan@cumin1001: START - Cookbook sre.hosts.reboot-single for host restbase2022.codfw.wmnet
* 11:35 jbond@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 11:32 jbond@cumin2002: START - Cookbook sre.dns.netbox
* 11:26 oblivian@puppetmaster1001: conftool action : set/pooled=false; selector: name=codfw,dnsdisc=wdqs
* 11:22 oblivian@puppetmaster1001: conftool action : set/pooled=true; selector: name=codfw,dnsdisc=kartotherian
* 11:22 oblivian@puppetmaster1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=restbase-backend
* 11:21 oblivian@puppetmaster1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=restbase-async
* 11:17 _joe_: depooling codfw services from all traffic
* 10:54 jmm@cumin2002: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2011.codfw.wmnet to cluster codfw and group C
* 10:53 jmm@cumin2002: START - Cookbook sre.ganeti.addnode for new host ganeti2011.codfw.wmnet to cluster codfw and group C
* 10:50 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2011.codfw.wmnet
* 10:47 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on kubestage2002.codfw.wmnet with reason: PDU swap
* 10:46 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on kubestage2002.codfw.wmnet with reason: PDU swap
* 10:42 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32216 and previous config saved to /var/cache/conftool/dbconfig/20220803-104246-marostegui.json
* 10:42 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1177.eqiad.wmnet with reason: Maintenance
* 10:42 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1177.eqiad.wmnet with reason: Maintenance
* 10:42 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1126 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32215 and previous config saved to /var/cache/conftool/dbconfig/20220803-104224-marostegui.json
* 10:41 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host ganeti2011.codfw.wmnet
* 10:40 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase201[45].codfw.wmnet
* 10:38 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase2022.codfw.wmnet
* 10:38 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on restbase[2014-2015,2021-2022].codfw.wmnet with reason: PDU maintenance
* 10:38 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on restbase[2014-2015,2021-2022].codfw.wmnet with reason: PDU maintenance
* 10:37 jelto: shutdown kubestage2002 kubernetes2020 kubernetes2009 kubernetes2010 kubernetes2011 kubernetes2012
* 10:30 oblivian@cumin1001: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) proton.discovery.wmnet on all recursors
* 10:30 oblivian@cumin1001: START - Cookbook sre.dns.wipe-cache proton.discovery.wmnet on all recursors
* 10:29 oblivian@cumin1001: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) mathoid.discovery.wmnet on all recursors
* 10:29 oblivian@cumin1001: START - Cookbook sre.dns.wipe-cache mathoid.discovery.wmnet on all recursors
* 10:27 oblivian@cumin1001: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) proton.discovery.wmnet on all recursors
* 10:27 oblivian@cumin1001: START - Cookbook sre.dns.wipe-cache proton.discovery.wmnet on all recursors
* 10:27 oblivian@cumin1001: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) mathoid.discovery.wmnet on all recursors
* 10:27 oblivian@cumin1001: START - Cookbook sre.dns.wipe-cache mathoid.discovery.wmnet on all recursors
* 10:27 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P32213 and previous config saved to /var/cache/conftool/dbconfig/20220803-102718-marostegui.json
* 10:23 jelto@cumin1001: conftool action : set/pooled=no; selector: name=kubernetes2012.codfw.wmnet
* 10:23 jelto@cumin1001: conftool action : set/pooled=no; selector: name=kubernetes2011.codfw.wmnet
* 10:22 jelto@cumin1001: conftool action : set/pooled=no; selector: name=kubernetes2010.codfw.wmnet
* 10:22 jelto@cumin1001: conftool action : set/pooled=no; selector: name=kubernetes2009.codfw.wmnet
* 10:22 jelto@cumin1001: conftool action : set/pooled=no; selector: name=kubernetes2020.codfw.wmnet
* 10:20 jelto@cumin1001: conftool action : set/pooled=no; selector: name=kubestage2002.codfw.wmnet
* 10:14 oblivian@cumin1001: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) proton.discovery.wmnet on all recursors
* 10:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2011.codfw.wmnet with OS bullseye
* 10:14 oblivian@cumin1001: START - Cookbook sre.dns.wipe-cache proton.discovery.wmnet on all recursors
* 10:14 oblivian@cumin1001: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) mathoid.discovery.wmnet on all recursors
* 10:14 oblivian@cumin1001: START - Cookbook sre.dns.wipe-cache mathoid.discovery.wmnet on all recursors
* 10:12 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P32212 and previous config saved to /var/cache/conftool/dbconfig/20220803-101212-marostegui.json
* 09:57 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1126 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32211 and previous config saved to /var/cache/conftool/dbconfig/20220803-095706-marostegui.json
* 09:56 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2011.codfw.wmnet with reason: host reimage
* 09:56 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase2021.codfw.wmnet
* 09:56 jelto: kubectl drain --ignore-daemonsets --delete-local-data kubernetes2012.codfw.wmnet
* 09:56 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1126 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32210 and previous config saved to /var/cache/conftool/dbconfig/20220803-095559-marostegui.json
* 09:55 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1126.eqiad.wmnet with reason: Maintenance
* 09:55 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1126.eqiad.wmnet with reason: Maintenance
* 09:55 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32209 and previous config saved to /var/cache/conftool/dbconfig/20220803-095538-marostegui.json
* 09:55 hnowlan@puppetmaster1001: conftool action : set/weight=10; selector: name=restbase2027.codfw.wmnet
* 09:54 jelto: kubectl drain --ignore-daemonsets --delete-local-data kubernetes2011.codfw.wmnet
* 09:54 oblivian@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
* 09:54 oblivian@cumin1001: START - Cookbook sre.discovery.service-route
* 09:54 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2011.codfw.wmnet with reason: host reimage
* 09:52 jelto: kubectl drain --ignore-daemonsets --delete-local-data kubernetes2010.codfw.wmnet
* 09:50 jelto: kubectl drain --ignore-daemonsets --delete-local-data kubernetes2009.codfw.wmnet
* 09:49 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 49 hosts with reason: PDU swap
* 09:48 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 49 hosts with reason: PDU swap
* 09:47 jelto: kubectl drain --ignore-daemonsets kubernetes2020.codfw.wmnet
* 09:46 jelto: kubectl cordon kubernetes2020.codfw.wmnet kubernetes2009.codfw.wmnet kubernetes2010.codfw.wmnet kubernetes2011.codfw.wmnet kubernetes2012.codfw.wmnet
* 09:43 jelto: kubectl drain --ignore-daemonsets kubestage2002.codfw.wmnet
* 09:43 vgutierrez: rolling restart of pybal in codfw lvs instances - [[phab:T310070|T310070]]
* 09:42 jelto: kubectl cordon kubestage2002
* 09:40 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P32208 and previous config saved to /var/cache/conftool/dbconfig/20220803-094032-marostegui.json
* 09:35 jmm@cumin2002: START - Cookbook sre.hosts.reimage for host ganeti2011.codfw.wmnet with OS bullseye
* 09:34 ebysans@deploy1002: Finished deploy [airflow-dags/analytics@674bb8b]: (no justification provided) (duration: 00m 10s)
* 09:33 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2090.codfw.wmnet
* 09:33 marostegui@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 09:33 ebysans@deploy1002: Started deploy [airflow-dags/analytics@674bb8b]: (no justification provided)
* 09:33 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2011.codfw.wmnet with reason: Remove node for eventual reimage, [[phab:T311686|T311686]]
* 09:32 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2011.codfw.wmnet with reason: Remove node for eventual reimage, [[phab:T311686|T311686]]
* 09:29 marostegui@cumin1001: START - Cookbook sre.dns.netbox
* 09:25 marostegui@cumin1001: START - Cookbook sre.hosts.decommission for hosts db2090.codfw.wmnet
* 09:25 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P32207 and previous config saved to /var/cache/conftool/dbconfig/20220803-092525-marostegui.json
* 09:24 oblivian@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
* 09:24 oblivian@cumin1001: START - Cookbook sre.discovery.service-route
* 09:24 oblivian@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
* 09:24 oblivian@cumin1001: START - Cookbook sre.discovery.service-route
* 09:23 oblivian@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
* 09:23 oblivian@cumin1001: START - Cookbook sre.discovery.service-route
* 09:22 oblivian@cumin1001: END (FAIL) - Cookbook sre.discovery.service-route (exit_code=99)
* 09:22 oblivian@cumin1001: START - Cookbook sre.discovery.service-route
* 09:20 marostegui@cumin1001: dbctl commit (dc=all): 'Remove db2090 from dbctl [[phab:T314109|T314109]]', diff saved to https://phabricator.wikimedia.org/P32206 and previous config saved to /var/cache/conftool/dbconfig/20220803-092053-marostegui.json
* 09:20 jelto@cumin1001: START - Cookbook sre.hosts.reboot-single for host mc2024.codfw.wmnet
* 09:15 jelto: power on mc2024
* 09:10 XioNoX: configure BGP on the esams-drmrs link - [[phab:T307221|T307221]]
* 09:10 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32205 and previous config saved to /var/cache/conftool/dbconfig/20220803-091019-marostegui.json
* 09:09 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1101:3318 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32204 and previous config saved to /var/cache/conftool/dbconfig/20220803-090912-marostegui.json
* 09:09 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
* 09:08 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2031.codfw.wmnet
* 09:08 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
* 09:08 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
* 09:08 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
* 09:08 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32203 and previous config saved to /var/cache/conftool/dbconfig/20220803-090836-marostegui.json
* 09:07 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2032.codfw.wmnet
* 09:06 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2042.codfw.wmnet
* 09:05 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2043.codfw.wmnet
* 09:04 jynus: stop backup2006 backup2009 for [[phab:T310070|T310070]]
* 09:00 jelto@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host mc2024.codfw.wmnet
* 09:00 jelto@cumin1001: START - Cookbook sre.hosts.reboot-single for host mc2024.codfw.wmnet
* 08:59 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp2031.codfw.wmnet
* 08:59 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp2032.codfw.wmnet
* 08:58 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host mc2042.codfw.wmnet
* 08:58 jelto@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host mc2024.codfw.wmnet
* 08:58 jelto@cumin1001: START - Cookbook sre.hosts.reboot-single for host mc2024.codfw.wmnet
* 08:57 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host mc2043.codfw.wmnet
* 08:57 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2041.codfw.wmnet
* 08:54 XioNoX: put the esams-drmrs link in service - [[phab:T307221|T307221]]
* 08:53 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P32202 and previous config saved to /var/cache/conftool/dbconfig/20220803-085330-marostegui.json
* 08:53 ayounsi@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 08:51 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single for host mc2041.codfw.wmnet
* 08:49 ayounsi@cumin1001: START - Cookbook sre.dns.netbox
* 08:47 ayounsi@cumin1001: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
* 08:41 ayounsi@cumin1001: START - Cookbook sre.dns.netbox
* 08:38 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P32201 and previous config saved to /var/cache/conftool/dbconfig/20220803-083824-marostegui.json
* 08:23 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32200 and previous config saved to /var/cache/conftool/dbconfig/20220803-082318-marostegui.json
* 08:19 jynus: stop db2098 for [[phab:T310070|T310070]]
* 08:17 oblivian@puppetmaster1001: conftool action : set/pooled=true; selector: dnsdisc=(appservers{{!}}api)-ro,name=codfw
* 08:15 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2072.codfw.wmnet
* 08:15 marostegui@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 07:54 marostegui@cumin1001: START - Cookbook sre.dns.netbox
* 07:49 marostegui@cumin1001: START - Cookbook sre.hosts.decommission for hosts db2072.codfw.wmnet
* 07:48 marostegui@cumin1001: dbctl commit (dc=all): 'Remove db2072 from dbctl [[phab:T313911|T313911]]', diff saved to https://phabricator.wikimedia.org/P32199 and previous config saved to /var/cache/conftool/dbconfig/20220803-074806-marostegui.json
* 07:23 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 07:22 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32197 and previous config saved to /var/cache/conftool/dbconfig/20220803-072253-marostegui.json
* 07:22 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
* 07:22 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
* 07:22 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1167.eqiad.wmnet with reason: Maintenance
* 07:22 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1167.eqiad.wmnet with reason: Maintenance
* 07:22 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32196 and previous config saved to /var/cache/conftool/dbconfig/20220803-072214-marostegui.json
* 07:19 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 14 hosts with reason: codfw pdu maintenance
* 07:19 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 14 hosts with reason: codfw pdu maintenance
* 07:18 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2134,2160].codfw.wmnet with reason: codfw pdu maintenance
* 07:18 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2134,2160].codfw.wmnet with reason: codfw pdu maintenance
* 07:18 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 14 hosts with reason: codfw pdu maintenance
* 07:17 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 14 hosts with reason: codfw pdu maintenance
* 07:17 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es[2020-2022].codfw.wmnet with reason: codfw pdu maintenance
* 07:17 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es[2020-2022].codfw.wmnet with reason: codfw pdu maintenance
* 07:17 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 10 hosts with reason: codfw pdu maintenance
* 07:16 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 10 hosts with reason: codfw pdu maintenance
* 07:16 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 07:16 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2096,2101,2115,2131].codfw.wmnet with reason: codfw pdu maintenance
* 07:16 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 07:16 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2096,2101,2115,2131].codfw.wmnet with reason: codfw pdu maintenance
* 07:11 kartik@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:819227{{!}}CX: Set MT threshold for publishing in Armenian WP to 80% (T313208)]] (duration: 03m 49s)
* 07:09 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 07:07 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P32195 and previous config saved to /var/cache/conftool/dbconfig/20220803-070708-marostegui.json
* 07:05 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd2002.codfw.wmnet with reason: Switch instance to plain disks, [[phab:T311686|T311686]]
* 07:05 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd2002.codfw.wmnet with reason: Switch instance to plain disks, [[phab:T311686|T311686]]
* 07:00 moritzm: draining ganeti2011 [[phab:T311686|T311686]]
* 06:56 godog: grow sda/sdb 3 by 100G on thanos-be2003 - [[phab:T314275|T314275]]
* 06:56 godog: grow sda/sdb 3 by 100G on thanos-be1002 - [[phab:T314275|T314275]]
* 06:52 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P32194 and previous config saved to /var/cache/conftool/dbconfig/20220803-065202-marostegui.json
* 06:46 godog: power up centrallog2002 and prometheus2005 - [[phab:T310070|T310070]]
* 06:38 jmm@cumin2002: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2013.codfw.wmnet to cluster codfw and group C
* 06:37 jmm@cumin2002: START - Cookbook sre.ganeti.addnode for new host ganeti2013.codfw.wmnet to cluster codfw and group C
* 06:36 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32193 and previous config saved to /var/cache/conftool/dbconfig/20220803-063656-marostegui.json
* 06:31 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32192 and previous config saved to /var/cache/conftool/dbconfig/20220803-063148-marostegui.json
* 06:31 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1178.eqiad.wmnet with reason: Maintenance
* 06:31 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1178.eqiad.wmnet with reason: Maintenance
* 06:31 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 13 hosts with reason: Maintenance
* 06:31 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on 13 hosts with reason: Maintenance
* 06:31 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2161.codfw.wmnet with reason: Maintenance
* 06:30 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db2161.codfw.wmnet with reason: Maintenance
* 06:30 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1111 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32191 and previous config saved to /var/cache/conftool/dbconfig/20220803-063045-marostegui.json
* 06:15 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P32190 and previous config saved to /var/cache/conftool/dbconfig/20220803-061538-marostegui.json
* 06:00 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P32189 and previous config saved to /var/cache/conftool/dbconfig/20220803-060032-marostegui.json
* 05:45 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1111 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32188 and previous config saved to /var/cache/conftool/dbconfig/20220803-054526-marostegui.json
* 05:41 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1111 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32187 and previous config saved to /var/cache/conftool/dbconfig/20220803-054106-marostegui.json
* 05:41 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1111.eqiad.wmnet with reason: Maintenance
* 05:40 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1111.eqiad.wmnet with reason: Maintenance
* 05:40 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
* 05:40 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance


== 2020-07-11 ==
== 2022-08-02 ==
* 19:16 qchris: Restarting Gerrit on gerrit1001 to switch to new gerrit.war and zuul plugin
* 22:39 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 19:16 qchris@deploy1001: Finished deploy [gerrit/gerrit@a71a0df]: Gerrit to v3.2.2-138-g230805407f and zuul plugin to master-12-ge51d7e8 on gerrit1001 (duration: 00m 07s)
* 22:32 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 19:15 qchris@deploy1001: Started deploy [gerrit/gerrit@a71a0df]: Gerrit to v3.2.2-138-g230805407f and zuul plugin to master-12-ge51d7e8 on gerrit1001
* 22:32 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 19:08 qchris: Restarting Gerrit on gerrit2001 to switch to new gerrit.war and zuul plugin
* 22:25 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 18:55 qchris@deploy1001: Finished deploy [gerrit/gerrit@a71a0df]: Gerrit to v3.2.2-138-g230805407f and zuul plugin to master-12-ge51d7e8 on gerrit2001 (duration: 00m 10s)
* 22:15 mutante: gerrit - syncing data (/srv/gerrit /var/lib/gerrit2/review_site /home) again after gerrit2002 was reimaged with buster [[phab:T313250|T313250]] [[phab:T313972|T313972]]
* 18:55 qchris@deploy1001: Started deploy [gerrit/gerrit@a71a0df]: Gerrit to v3.2.2-138-g230805407f and zuul plugin to master-12-ge51d7e8 on gerrit2001
* 22:04 dancy@deploy1002: Finished deploy [gerrit/gerrit@94c5028]: (no justification provided) (duration: 00m 06s)
* 22:04 dancy@deploy1002: Started deploy [gerrit/gerrit@94c5028]: (no justification provided)
* 22:00 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 21:59 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 21:59 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 21:58 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 21:58 dancy@deploy1002: rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.23  refs [[phab:T308076|T308076]]
* 21:53 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 21:47 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 21:46 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 21:40 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 21:35 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 21:29 ladsgroup@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/CirrusSearch/includes/Sanity/Checker.php: Backport: [[gerrit:819621{{!}}Fix appending of join conds (T312421 T314439)]] (duration: 03m 15s)
* 21:28 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 21:28 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 21:27 bking@cumin1001: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: deploy wmf-elasticsearch-search-plugins pkg - bking@cumin1001 - [[phab:T314078|T314078]]
* 21:21 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 21:11 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gerrit2002.wikimedia.org with OS buster
* 21:01 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 21:01 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 21:00 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:59 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:58 dancy@deploy1002: rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.22  refs [[phab:T308076|T308076]]
* 20:54 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:53 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:53 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:53 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gerrit2002.wikimedia.org with reason: host reimage
* 20:52 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:51 dzahn@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on gerrit2002.wikimedia.org with reason: host reimage
* 20:50 dancy@deploy1002: rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.23  refs [[phab:T308076|T308076]]
* 20:38 mutante: re-imaging gerrit2002 with buster because it is on bullseye but needs git-fat, which has not been ported to python3 yet; this otherwise blocks upgrading gerrit machines [[phab:T313250|T313250]] [[phab:T243027|T243027]] [[phab:T279509|T279509]]
* 20:37 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:36 dzahn@cumin2002: START - Cookbook sre.hosts.reimage for host gerrit2002.wikimedia.org with OS buster
* 20:36 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:36 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:36 urbanecm: UTC evening B&C window done
* 20:35 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:33 urbanecm@deploy1002: Synchronized php-1.39.0-wmf.23/includes/Rest/Handler/HTMLTransformInput.php: {{Gerrit|69e91528a5c6f372af520307dc2f4227b9981442}}: ParsoidHandler: fix page bundle input with no orig HTML (duration: 03m 22s)
* 20:30 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:29 urbanecm@deploy1002: Synchronized php-1.39.0-wmf.23/includes/Rest/Handler/ParsoidHandler.php: {{Gerrit|322a960e3777bc01fa8823908340c36e3851a648}}: ParsoidHandler: pass metrics object to HTMLTransformInput (duration: 03m 19s)
* 20:27 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:27 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:22 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:20 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|5fac0aaf8e76a6f8cc3302771eac068e4f866e5f}}: GrowthExperiments: Remove wgGEHomepageTutorialTitle (duration: 03m 26s)
* 20:06 dancy@deploy1002: Finished scap: Backport for [[gerrit:819612]] Revert "Bump wikimedia/parsoid to 0.16.0-a18" (duration: 11m 30s)
* 20:01 dancy@deploy1002: Finished deploy [gerrit/gerrit@94c5028]: (no justification provided) (duration: 00m 05s)
* 20:01 dancy@deploy1002: Started deploy [gerrit/gerrit@94c5028]: (no justification provided)
* 19:59 dancy@deploy1002: Finished deploy [gerrit/gerrit@94c5028]: (no justification provided) (duration: 00m 01s)
* 19:59 dancy@deploy1002: Started deploy [gerrit/gerrit@94c5028]: (no justification provided)
* 19:55 dancy@deploy1002: Started scap: Backport for [[gerrit:819612]] Revert "Bump wikimedia/parsoid to 0.16.0-a18"
* 19:42 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 19:37 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2034.codfw.wmnet,service=ats-tls
* 19:37 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2034.codfw.wmnet,service=varnish-fe
* 19:37 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2034.codfw.wmnet,service=ats-be
* 19:36 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 19:36 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 19:36 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2033.codfw.wmnet,service=ats-tls
* 19:36 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2033.codfw.wmnet,service=varnish-fe
* 19:36 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2033.codfw.wmnet,service=ats-be
* 19:36 mvernon@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be[2041,2046].codfw.wmnet
* 19:35 mvernon@cumin2002: START - Cookbook sre.hosts.remove-downtime for ms-be[2041,2046].codfw.wmnet
* 19:29 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 19:28 mvernon@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for thanos-fe2002.codfw.wmnet
* 19:28 mvernon@cumin2002: START - Cookbook sre.hosts.remove-downtime for thanos-fe2002.codfw.wmnet
* 19:26 mvernon@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-fe2010.codfw.wmnet
* 19:26 mvernon@cumin2002: START - Cookbook sre.hosts.remove-downtime for ms-fe2010.codfw.wmnet
* 19:21 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2032.codfw.wmnet,service=ats-tls
* 19:21 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2032.codfw.wmnet,service=varnish-fe
* 19:21 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2032.codfw.wmnet,service=ats-be
* 19:17 rzl@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on mc2038.codfw.wmnet with reason: install
* 19:17 rzl@cumin2002: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on mc2038.codfw.wmnet with reason: install
* 19:13 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2031.codfw.wmnet,service=ats-tls
* 19:13 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2031.codfw.wmnet,service=varnish-fe
* 19:13 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2031.codfw.wmnet,service=ats-be
* 19:11 mutante: gerrit1001 - rsyncing /home/ to gerrit2002:/srv/home-gerrit1001.wikimedia.org [[phab:T313250|T313250]]
* 19:01 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on gerrit2002.wikimedia.org with reason: new machine
* 19:01 dzahn@cumin2002: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on gerrit2002.wikimedia.org with reason: new machine
* 18:55 dancy@deploy1002: Finished scap: testwikis wikis to 1.39.0-wmf.23  refs [[phab:T308076|T308076]] (duration: 50m 39s)
* 18:54 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 18:52 ejegg: updated payments-wiki from {{Gerrit|589bb64e}} to {{Gerrit|e1b6036a}} (just i18n changes in extensions)
* 18:47 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 18:47 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 18:46 bking@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: deploy wmf-elasticsearch-search-plugins pkg - bking@cumin1001 - [[phab:T314078|T314078]]
* 18:46 rzl@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mc2038.codfw.wmnet with reason: install
* 18:45 rzl@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on mc2038.codfw.wmnet with reason: install
* 18:41 rzl@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2038.codfw.wmnet
* 18:41 rzl@cumin2002: START - Cookbook sre.hosts.remove-downtime for mc2038.codfw.wmnet
* 18:39 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 18:19 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 18:18 rzl@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2038.codfw.wmnet with reason: install
* 18:18 rzl@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on mc2038.codfw.wmnet with reason: install
* 18:17 rzl@cumin2002: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2038.codfw.wmnet with reason: install
* 18:17 rzl@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2038.codfw.wmnet with reason: install
* 18:17 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 18:17 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 18:16 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lvs2008.codfw.wmnet with reason: shutdown for PDU upgrade
* 18:16 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on lvs2008.codfw.wmnet with reason: shutdown for PDU upgrade
* 18:16 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 18:11 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 18:08 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 18:08 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 18:05 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 18:04 dancy@deploy1002: Started scap: testwikis wikis to 1.39.0-wmf.23  refs [[phab:T308076|T308076]]
* 17:52 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1168 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32185 and previous config saved to /var/cache/conftool/dbconfig/20220802-175233-marostegui.json
* 17:43 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool db2159', diff saved to https://phabricator.wikimedia.org/P32184 and previous config saved to /var/cache/conftool/dbconfig/20220802-174311-ladsgroup.json
* 17:37 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P32183 and previous config saved to /var/cache/conftool/dbconfig/20220802-173723-marostegui.json
* 17:35 moritzm: installing node-moment security updates
* 17:32 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic[2041-2042,2057].codfw.wmnet with reason: [[phab:T310070|T310070]]
* 17:32 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic[2041-2042,2057].codfw.wmnet with reason: [[phab:T310070|T310070]]
* 17:27 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2013.codfw.wmnet
* 17:25 moritzm: installing fribidi security updates
* 17:22 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P32182 and previous config saved to /var/cache/conftool/dbconfig/20220802-172217-marostegui.json
* 17:20 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2030.codfw.wmnet,service=ats-tls
* 17:20 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2030.codfw.wmnet,service=varnish-fe
* 17:20 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2030.codfw.wmnet,service=ats-be
* 17:18 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host ganeti2013.codfw.wmnet
* 17:07 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1168 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32181 and previous config saved to /var/cache/conftool/dbconfig/20220802-170711-marostegui.json
* 17:06 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc[2042-2043].codfw.wmnet with reason: shutdown for PDU upgrade
* 17:06 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on mc[2042-2043].codfw.wmnet with reason: shutdown for PDU upgrade
* 17:05 Emperor: ms-be20[31,32,41,46].codfw.wmnet,ms-fe2010.codfw.wmnet,thanos-fe2002.codfw.wmnet downtime for PDU work [[phab:T309957|T309957]]
* 17:05 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1168 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32180 and previous config saved to /var/cache/conftool/dbconfig/20220802-170503-marostegui.json
* 17:04 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance
* 17:04 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance
* 17:04 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
* 17:04 mvernon@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: shutdown for PDU replacement
* 17:04 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
* 17:04 mvernon@cumin2002: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: shutdown for PDU replacement
* 17:04 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance
* 17:03 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance
* 17:03 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance
* 17:03 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance
* 17:03 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1131 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32179 and previous config saved to /var/cache/conftool/dbconfig/20220802-170333-marostegui.json
* 17:01 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2029.codfw.wmnet,service=ats-tls
* 17:01 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2029.codfw.wmnet,service=varnish-fe
* 17:01 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2029.codfw.wmnet,service=ats-be
* 17:00 mvernon@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be[2030,2045,2052].codfw.wmnet
* 17:00 mvernon@cumin2002: START - Cookbook sre.hosts.remove-downtime for ms-be[2030,2045,2052].codfw.wmnet
* 16:57 btullis@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host an-airflow1004.eqiad.wmnet
* 16:54 hnowlan@deploy1002: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync
* 16:53 hnowlan@deploy1002: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync
* 16:51 hnowlan@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
* 16:49 hnowlan@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
* 16:48 hnowlan@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
* 16:48 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P32178 and previous config saved to /var/cache/conftool/dbconfig/20220802-164827-marostegui.json
* 16:38 hnowlan@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
* 16:35 hnowlan@deploy1002: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: sync
* 16:35 hnowlan@deploy1002: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: sync
* 16:33 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P32177 and previous config saved to /var/cache/conftool/dbconfig/20220802-163321-marostegui.json
* 16:29 dancy@mwmaint1002: pull aborted:  (duration: 00m 07s)
* 16:25 rzl: rzl@stat1007:~$ sudo systemctl stop wmde-analytics-daily-early  # wedged, timer will restart it now with max_runtime_seconds
* 16:18 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1131 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32176 and previous config saved to /var/cache/conftool/dbconfig/20220802-161815-marostegui.json
* 16:16 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1131 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32175 and previous config saved to /var/cache/conftool/dbconfig/20220802-161607-marostegui.json
* 16:16 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance
* 16:15 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance
* 16:15 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1165 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32174 and previous config saved to /var/cache/conftool/dbconfig/20220802-161545-marostegui.json
* 16:10 btullis@cumin1001: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-airflow1004.eqiad.wmnet on all recursors
* 16:10 btullis@cumin1001: START - Cookbook sre.dns.wipe-cache an-airflow1004.eqiad.wmnet on all recursors
* 16:10 btullis@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:05 btullis@cumin1001: START - Cookbook sre.dns.netbox
* 16:05 btullis@cumin1001: START - Cookbook sre.ganeti.makevm for new host an-airflow1004.eqiad.wmnet
* 16:00 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P32173 and previous config saved to /var/cache/conftool/dbconfig/20220802-160039-marostegui.json
* 15:51 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2056.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:50 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2056.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:49 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2040.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:49 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2040.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:46 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2039.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:45 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2039.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:45 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P32172 and previous config saved to /var/cache/conftool/dbconfig/20220802-154533-marostegui.json
* 15:37 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc[2040-2041].codfw.wmnet with reason: shutdown for PDU upgrade
* 15:37 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on mc[2040-2041].codfw.wmnet with reason: shutdown for PDU upgrade
* 15:36 bking@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host elastic2037.codfw.wmnet
* 15:36 bking@cumin1001: START - Cookbook sre.hosts.reboot-single for host elastic2037.codfw.wmnet
* 15:30 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1165 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32171 and previous config saved to /var/cache/conftool/dbconfig/20220802-153027-marostegui.json
* 15:28 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1165 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32170 and previous config saved to /var/cache/conftool/dbconfig/20220802-152818-marostegui.json
* 15:28 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
* 15:27 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
* 15:27 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance
* 15:27 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance
* 15:27 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32169 and previous config saved to /var/cache/conftool/dbconfig/20220802-152740-marostegui.json
* 15:24 moritzm: installing gnupg2 security updates
* 15:15 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2024.codfw.wmnet with reason: shutdown for PDU upgrade
* 15:15 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on mc2024.codfw.wmnet with reason: shutdown for PDU upgrade
* 15:13 jbond@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetmaster1004.eqiad.wmnet with OS buster
* 15:12 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P32167 and previous config saved to /var/cache/conftool/dbconfig/20220802-151234-marostegui.json
* 15:10 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ganeti-test[2001-2003].codfw.wmnet with reason: Power down for PDU maintenance, [[phab:T310070|T310070]]
* 15:10 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 4:00:00 on ganeti-test[2001-2003].codfw.wmnet with reason: Power down for PDU maintenance, [[phab:T310070|T310070]]
* 15:08 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on thanos-be2001.codfw.wmnet with reason: pdu
* 15:08 root@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on thanos-be2001.codfw.wmnet with reason: pdu
* 15:07 mvernon@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be[2030,2045,2052].codfw.wmnet with reason: shutdown for PDU replacement
* 15:07 mvernon@cumin2002: START - Cookbook sre.hosts.downtime for 3:00:00 on ms-be[2030,2045,2052].codfw.wmnet with reason: shutdown for PDU replacement
* 15:06 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mc-gp2002.codfw.wmnet with reason: Power down for PDU maintenance, [[phab:T310070|T310070]]
* 15:06 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 4:00:00 on mc-gp2002.codfw.wmnet with reason: Power down for PDU maintenance, [[phab:T310070|T310070]]
* 15:04 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2037.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:04 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2037.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 15:01 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: shutdown for PDU upgrade
* 15:00 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: shutdown for PDU upgrade
* 14:59 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2025.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 14:59 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2025.codfw.wmnet with reason: [[phab:T309957|T309957]]
* 14:58 oblivian@puppetmaster1001: conftool action : set/pooled=false; selector: dnsdisc=(appservers{{!}}api)-ro,name=codfw
* 14:57 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P32166 and previous config saved to /var/cache/conftool/dbconfig/20220802-145728-marostegui.json
* 14:54 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2060.codfw.wmnet with OS bullseye
* 14:53 jbond@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetmaster1004.eqiad.wmnet with reason: host reimage
* 14:50 moritzm: uploaded gnupg2 2.1.18-8~deb9u4+wmf1 to stretch-wikimedia
* 14:50 jbond@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on puppetmaster1004.eqiad.wmnet with reason: host reimage
* 14:42 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32164 and previous config saved to /var/cache/conftool/dbconfig/20220802-144222-marostegui.json
* 14:40 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1113:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32163 and previous config saved to /var/cache/conftool/dbconfig/20220802-144013-marostegui.json
* 14:40 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
* 14:39 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
* 14:39 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32162 and previous config saved to /var/cache/conftool/dbconfig/20220802-143952-marostegui.json
* 14:37 jbond@cumin1001: START - Cookbook sre.hosts.reimage for host puppetmaster1004.eqiad.wmnet with OS buster
* 14:32 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2060.codfw.wmnet with reason: host reimage
* 14:28 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2060.codfw.wmnet with reason: host reimage
* 14:24 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P32161 and previous config saved to /var/cache/conftool/dbconfig/20220802-142446-marostegui.json
* 14:23 Emperor: shutdown ms-be20[30,45,52] for PDU work [[phab:T309957|T309957]]
* 14:22 mvernon@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be[2030,2045,2052].codfw.wmnet with reason: shutdown for PDU replacement
* 14:21 mvernon@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on ms-be[2030,2045,2052].codfw.wmnet with reason: shutdown for PDU replacement
* 14:12 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic2060.codfw.wmnet with OS bullseye
* 14:09 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P32160 and previous config saved to /var/cache/conftool/dbconfig/20220802-140940-marostegui.json
* 14:05 jbond@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetmaster2004.codfw.wmnet with OS buster
* 14:04 godog: grow sda/sdb 3 by 100G on thanos-be1001 - [[phab:T314275|T314275]]
* 14:03 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on centrallog2002.codfw.wmnet with reason: pdu
* 14:03 root@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on centrallog2002.codfw.wmnet with reason: pdu
* 14:01 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on prometheus2005.codfw.wmnet with reason: pdu
* 14:01 root@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on prometheus2005.codfw.wmnet with reason: pdu
* 13:57 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2030.codfw.wmnet,service=ats-tls
* 13:57 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2032.codfw.wmnet,service=ats-be
* 13:57 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2031.codfw.wmnet,service=ats-be
* 13:56 godog: schedule poweroff for centrallog2002 at 16 utc - [[phab:T310070|T310070]]
* 13:54 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[3-4].codfw.wmnet,service=ats-be
* 13:54 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32159 and previous config saved to /var/cache/conftool/dbconfig/20220802-135435-marostegui.json
* 13:53 godog: depool and poweroff prometheus2005 - [[phab:T310070|T310070]]
* 13:53 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[3-4].codfw.wmnet,service=ats-tls
* 13:53 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[3-4].codfw.wmnet,service=ats-tls
* 13:53 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[3-4].codfw.wmnet,service=varnish-fe
* 13:52 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1098:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32158 and previous config saved to /var/cache/conftool/dbconfig/20220802-135226-marostegui.json
* 13:52 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
* 13:52 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[1-2].codfw.wmnet,service=ats-tls
* 13:51 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
* 13:51 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32157 and previous config saved to /var/cache/conftool/dbconfig/20220802-135155-marostegui.json
* 13:51 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[1-2].codfw.wmnet,service=ats-tls
* 13:51 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[1-2].codfw.wmnet,service=varnish-fe
* 13:50 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2030.codfw.wmnet,service=ats-be
* 13:50 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2030.codfw.wmnet,service=varnish-fe
* 13:50 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2030.codfw.wmnet,service=ats-be
* 13:50 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2029.codfw.wmnet,service=ats-tls
* 13:50 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2029.codfw.wmnet,service=varnish-fe
* 13:50 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2029.codfw.wmnet,service=ats-be
* 13:45 jbond@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetmaster2004.codfw.wmnet with reason: host reimage
* 13:42 jbond@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on puppetmaster2004.codfw.wmnet with reason: host reimage
* 13:42 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:42 Lucas_WMDE: UTC afternoon backport+config window done
* 13:41 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2013.codfw.wmnet with OS bullseye
* 13:41 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:41 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:40 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:754933{{!}}Enable usage tracking for statement for cebwiki (T296384)]] – expected to gradually increase number of wbc_entity_usage and probably recentchanges rows on cebwiki, but not too much, see task for details (duration: 03m 06s)
* 13:40 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:39 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2028.codfw.wmnet with OS bullseye
* 13:36 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P32156 and previous config saved to /var/cache/conftool/dbconfig/20220802-133648-marostegui.json
* 13:35 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:34 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/Wikibase.php: Config: [[gerrit:754937{{!}}Introduce $wmgEntityUsageModifierLimitsStatement (T296384)]] (2/2) (duration: 03m 21s)
* 13:34 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:34 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:33 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:31 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:754937{{!}}Introduce $wmgEntityUsageModifierLimitsStatement (T296384)]] (1/2) (duration: 03m 16s)
* 13:30 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ganeti2028.codfw.wmnet with reason: Power down for PDU maintenance, [[phab:T309957|T309957]]
* 13:30 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 4:00:00 on ganeti2028.codfw.wmnet with reason: Power down for PDU maintenance, [[phab:T309957|T309957]]
* 13:27 jbond@cumin2002: START - Cookbook sre.hosts.reimage for host puppetmaster2004.codfw.wmnet with OS buster
* 13:24 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2013.codfw.wmnet with reason: host reimage
* 13:24 vgutierrez: restarting ATS 9.x instances to apply https://gerrit.wikimedia.org/r/819585 - [[phab:T309651|T309651]]
* 13:23 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2028.codfw.wmnet with reason: host reimage
* 13:21 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P32155 and previous config saved to /var/cache/conftool/dbconfig/20220802-132142-marostegui.json
* 13:19 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2013.codfw.wmnet with reason: host reimage
* 13:19 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2028.codfw.wmnet with reason: host reimage
* 13:17 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:17 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:17 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:16 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:15 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|a4499e5ac23a0558bed276e2b74134590afc5c95}}:  Revert "testwiki: Add mediawiki.web_ui.interactions stream" ([[phab:T314151|T314151]], [[phab:T311268|T311268]]) (duration: 03m 19s)
* 13:10 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:09 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:09 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:09 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|c2fb8a58d8f62e29a15ebee26198e79e4597d24c}}: Enable RealtimePreview on Group 0 wikis ([[phab:T314150|T314150]]) (duration: 03m 21s)
* 13:08 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:06 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32154 and previous config saved to /var/cache/conftool/dbconfig/20220802-130636-marostegui.json
* 13:04 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1096:3316 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32153 and previous config saved to /var/cache/conftool/dbconfig/20220802-130428-marostegui.json
* 13:04 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance
* 13:04 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance
* 13:04 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
* 13:03 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
* 13:03 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1180 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32152 and previous config saved to /var/cache/conftool/dbconfig/20220802-130351-marostegui.json
* 13:02 jmm@cumin2002: START - Cookbook sre.hosts.reimage for host ganeti2013.codfw.wmnet with OS bullseye
* 13:00 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic2028.codfw.wmnet with OS bullseye
* 13:00 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2013.codfw.wmnet with reason: Remove node for eventual reimage, [[phab:T311686|T311686]]
* 12:59 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2013.codfw.wmnet with reason: Remove node for eventual reimage, [[phab:T311686|T311686]]
* 12:48 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P32151 and previous config saved to /var/cache/conftool/dbconfig/20220802-124845-marostegui.json
* 12:33 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P32150 and previous config saved to /var/cache/conftool/dbconfig/20220802-123338-marostegui.json
* 12:18 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1180 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32149 and previous config saved to /var/cache/conftool/dbconfig/20220802-121832-marostegui.json
* 12:16 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1180 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32148 and previous config saved to /var/cache/conftool/dbconfig/20220802-121624-marostegui.json
* 12:16 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance
* 12:15 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance
* 12:13 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 12:12 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 12:12 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 12:11 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 12:01 marostegui: dbmaint x1@eqiad [[phab:T314087|T314087]]
* 11:57 marostegui: dbmaint s7@eqiad [[phab:T314377|T314377]]
* 11:57 marostegui: dbmaint s3@eqiad [[phab:T314377|T314377]]
* 11:57 marostegui: dbmaint s8@eqiad [[phab:T314377|T314377]]
* 11:55 marostegui: dbmaint s8@eqiad [[phab:T314377|T314377]]
* 11:54 marostegui: dbmaint s3@eqiad [[phab:T314377|T314377]]
* 11:50 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
* 11:48 marostegui: dbmaint s7@eqiad [[phab:T314377|T314377]]
* 11:46 marostegui: dbmaint s4@eqiad [[phab:T314377|T314377]]
* 11:35 elukey: restart rsyslog on ml-serve1006
* 10:50 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on an-worker1082.eqiad.wmnet with reason: [[phab:T312626|T312626]] btullis
* 10:50 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on an-worker1082.eqiad.wmnet with reason: [[phab:T312626|T312626]] btullis
* 10:49 godog: grow sda3 by 100G on thanos-be2004 - [[phab:T314275|T314275]]
* 10:42 btullis@puppetmaster1001: conftool action : set/pooled=inactive; selector: cluster=wikireplicas-b,name=dbproxy1018.eqiad.wmnet
* 10:42 btullis@puppetmaster1001: conftool action : set/pooled=yes; selector: cluster=wikireplicas-b,name=dbproxy1019.eqiad.wmnet
* 10:35 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 10:34 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 10:34 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 10:34 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 10:33 marostegui@cumin1001: dbctl commit (dc=all): 'db1181 (re)pooling @ 100%: After restart', diff saved to https://phabricator.wikimedia.org/P32147 and previous config saved to /var/cache/conftool/dbconfig/20220802-103318-root.json
* 10:18 marostegui@cumin1001: dbctl commit (dc=all): 'db1181 (re)pooling @ 75%: After restart', diff saved to https://phabricator.wikimedia.org/P32146 and previous config saved to /var/cache/conftool/dbconfig/20220802-101813-root.json
* 10:15 marostegui@cumin1001: dbctl commit (dc=all): 'Add db2175 to s2 [[phab:T311494|T311494]]', diff saved to https://phabricator.wikimedia.org/P32145 and previous config saved to /var/cache/conftool/dbconfig/20220802-101522-marostegui.json
* 10:12 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1019.eqiad.wmnet with OS bullseye
* 10:05 jynus: shutdown dbprov2002 backup2005 backup2008 [[phab:T310070|T310070]]
* 10:03 marostegui@cumin1001: dbctl commit (dc=all): 'db1181 (re)pooling @ 50%: After restart', diff saved to https://phabricator.wikimedia.org/P32144 and previous config saved to /var/cache/conftool/dbconfig/20220802-100308-root.json
* 10:03 marostegui@cumin1001: dbctl commit (dc=all): 'db1174 (re)pooling @ 100%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32143 and previous config saved to /var/cache/conftool/dbconfig/20220802-100304-root.json
* 09:54 marostegui@cumin1001: dbctl commit (dc=all): 'Remove db2079 from dbctl [[phab:T313885|T313885]]', diff saved to https://phabricator.wikimedia.org/P32141 and previous config saved to /var/cache/conftool/dbconfig/20220802-095455-marostegui.json
* 09:52 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1019.eqiad.wmnet with reason: host reimage
* 09:49 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1019.eqiad.wmnet with reason: host reimage
* 09:49 btullis@cumin1001: END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons.
* 09:48 marostegui@cumin1001: dbctl commit (dc=all): 'db1181 (re)pooling @ 25%: After restart', diff saved to https://phabricator.wikimedia.org/P32140 and previous config saved to /var/cache/conftool/dbconfig/20220802-094804-root.json
* 09:47 marostegui@cumin1001: dbctl commit (dc=all): 'db1174 (re)pooling @ 75%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32139 and previous config saved to /var/cache/conftool/dbconfig/20220802-094759-root.json
* 09:44 godog: grow sdb3 by 100G on thanos-be2004 - [[phab:T314275|T314275]]
* 09:43 btullis@cumin1001: START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons.
* 09:42 btullis@cumin1001: END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons.
* 09:37 btullis@cumin1001: START - Cookbook sre.hosts.reimage for host dbproxy1019.eqiad.wmnet with OS bullseye
* 09:36 btullis@cumin1001: START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons.
* 09:33 marostegui@cumin1001: dbctl commit (dc=all): 'db1181 (re)pooling @ 10%: After restart', diff saved to https://phabricator.wikimedia.org/P32138 and previous config saved to /var/cache/conftool/dbconfig/20220802-093259-root.json
* 09:32 marostegui@cumin1001: dbctl commit (dc=all): 'db1174 (re)pooling @ 50%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32137 and previous config saved to /var/cache/conftool/dbconfig/20220802-093254-root.json
* 09:30 btullis@puppetmaster1001: conftool action : set/pooled=no; selector: cluster=wikireplicas-b,name=dbproxy1019.eqiad.wmnet
* 09:30 btullis@puppetmaster1001: conftool action : set/pooled=yes; selector: cluster=wikireplicas-b,name=dbproxy1018.eqiad.wmnet
* 09:28 btullis@cumin1001: END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons.
* 09:26 btullis@puppetmaster1001: conftool action : set/pooled=inactive; selector: cluster=wikireplicas-a,name=dbproxy1019.eqiad.wmnet
* 09:25 btullis@puppetmaster1001: conftool action : set/pooled=yes; selector: cluster=wikireplicas-a,name=dbproxy1018.eqiad.wmnet
* 09:22 btullis@cumin1001: START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons.
* 09:17 marostegui@cumin1001: dbctl commit (dc=all): 'db1181 (re)pooling @ 5%: After restart', diff saved to https://phabricator.wikimedia.org/P32136 and previous config saved to /var/cache/conftool/dbconfig/20220802-091754-root.json
* 09:17 marostegui@cumin1001: dbctl commit (dc=all): 'db1174 (re)pooling @ 10%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32135 and previous config saved to /var/cache/conftool/dbconfig/20220802-091749-root.json
* 09:15 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2143', diff saved to https://phabricator.wikimedia.org/P32134 and previous config saved to /var/cache/conftool/dbconfig/20220802-091518-root.json
* 09:02 marostegui@cumin1001: dbctl commit (dc=all): 'db1181 (re)pooling @ 2%: After restart', diff saved to https://phabricator.wikimedia.org/P32133 and previous config saved to /var/cache/conftool/dbconfig/20220802-090250-root.json
* 09:02 marostegui@cumin1001: dbctl commit (dc=all): 'db1174 (re)pooling @ 5%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32132 and previous config saved to /var/cache/conftool/dbconfig/20220802-090245-root.json
* 08:47 marostegui@cumin1001: dbctl commit (dc=all): 'db1181 (re)pooling @ 1%: After restart', diff saved to https://phabricator.wikimedia.org/P32131 and previous config saved to /var/cache/conftool/dbconfig/20220802-084745-root.json
* 08:47 marostegui@cumin1001: dbctl commit (dc=all): 'db1174 (re)pooling @ 1%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32130 and previous config saved to /var/cache/conftool/dbconfig/20220802-084740-root.json
* 08:46 marostegui: stop mysql on db2095 db2107 db2109 db2137 db2147 db2159 db2160 pc2012 for pdu maintenance on codfw b5 [[phab:T310070|T310070]]
* 07:49 moritzm: upgrading drmrs ganeti clusters to 3.0.2 [[phab:T312637|T312637]]
* 07:33 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2005.codfw.wmnet with reason: Switch instance to plain disks, [[phab:T311686|T311686]]
* 07:33 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2005.codfw.wmnet with reason: Switch instance to plain disks, [[phab:T311686|T311686]]
* 07:22 godog: bounce icinga on alert2001 - [[phab:T314353|T314353]]
* 07:18 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2005.codfw.wmnet with reason: Switch instance to DRBD, [[phab:T311686|T311686]]
* 07:18 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2005.codfw.wmnet with reason: Switch instance to DRBD, [[phab:T311686|T311686]]
* 06:58 elukey: restart rsyslog on ml-serve2006
* 06:56 ladsgroup@deploy1002: Synchronized php-1.39.0-wmf.22/extensions/FlaggedRevs/maintenance/pruneRevData.php: Backport: [[gerrit:819077{{!}}pruneRevData: Make cleaning in larger batches (T296380)]] (duration: 03m 26s)
* 06:56 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 06:55 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 06:55 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 06:54 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 06:46 godog: bounce icinga on alert1001 - [[phab:T314353|T314353]]
* 05:48 marostegui@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts db2088.codfw.wmnet
* 05:48 marostegui@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 05:44 marostegui@cumin1001: START - Cookbook sre.dns.netbox
* 05:35 marostegui@cumin1001: START - Cookbook sre.hosts.decommission for hosts db2088.codfw.wmnet
* 05:29 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1181', diff saved to https://phabricator.wikimedia.org/P32127 and previous config saved to /var/cache/conftool/dbconfig/20220802-052923-root.json
* 05:24 marostegui: dbmaint x1@eqiad [[phab:T314087|T314087]]
* 04:17 ryankemper: [Elastic] Small amendment to my earlier statement; based off epoch time `be_x_oldwiki_titlesuggest_1659407912` was not an old index hanging around after a reindex operation, but rather the new one that the reindex operation was trying to create, but had not yet finished (therefore didn't switch over the aliases). It presumably got interrupted by the reimage of `elastic2059`.
* 04:15 ryankemper: [Elastic] Blew away red index like so: `ryankemper@cumin1001:~$ curl -XDELETE https://search.svc.codfw.wmnet:9243/be_x_oldwiki_titlesuggest_1659407912`. Cluster is back to `green` status.
* 04:07 ryankemper: [Elastic] Per `curl -s https://search.svc.codfw.wmnet:9243/_cat/aliases {{!}} grep -i be_x` I see `be_x_oldwiki_titlesuggest ` alias points to `be_x_oldwiki_titlesuggest_1658396688`. I think this means the red index is an old index from an in-progress reindex operation. I likely just need to delete `be_x_oldwiki_titlesuggest_1659407912` but doing some quick digging first
* 04:04 ryankemper: [Elastic] Red cluster status in main codfw elasticsearch cluster (`https://search.svc.codfw.wmnet:9243`); culprit appears to be index `be_x_oldwiki_titlesuggest_1659407912`. Confusingly it has 2 replicas set so it's not clear to me how we got into this state starting from green (in the past we've gone into red status from indices that erroneously had 0 replicas in production)
* 03:47 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 03:46 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 03:46 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 03:45 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 03:40 krinkle@deploy1002: Synchronized multiversion/: {{Gerrit|I0802db272695}} (duration: 03m 10s)
* 03:40 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 03:39 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 03:39 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 03:38 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 03:34 krinkle@deploy1002: Synchronized wmf-config/: {{Gerrit|I9b89c0ff5c2}} (duration: 03m 32s)
* 03:33 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 03:32 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 03:32 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 03:31 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 03:27 krinkle@deploy1002: Synchronized multiversion/: {{Gerrit|I6e97d39a3}}, {{Gerrit|Ib843ebced31}} (duration: 03m 30s)
* 03:26 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 03:25 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 03:25 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 03:24 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 03:22 krinkle@mwmaint1002: pull aborted:  (duration: 00m 11s)
* 03:21 krinkle@deploy1002: Synchronized wmf-config/CommonSettings.php: {{Gerrit|I39a2b86065}} (duration: 03m 19s)
* 03:20 ryankemper@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host elastic2059.codfw.wmnet with OS bullseye
* 03:15 krinkle@deploy1002: Synchronized multiversion/: {{Gerrit|Ieaea60a991e5611}} (duration: 03m 03s)
* 03:14 krinkle@mwmaint2002: pull aborted:  (duration: 01m 36s)
* 03:14 krinkle@mwmaint1002: pull aborted:  (duration: 01m 31s)
* 03:13 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 03:12 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 03:12 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 03:11 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 02:58 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2059.codfw.wmnet with reason: host reimage
* 02:54 ryankemper: [WDQS] `ryankemper@wdqs1012:~$ sudo systemctl restart wdqs-blazegraph.service` to clear `Query Service HTTP Port` && `WDQS SPARQL` alerts
* 02:53 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2059.codfw.wmnet with reason: host reimage
* 02:36 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic2059.codfw.wmnet with OS bullseye
* 02:31 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 02:30 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 02:30 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 02:29 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 02:09 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 02:08 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 02:08 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 02:07 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 00:41 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 00:40 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 00:40 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 00:39 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 00:35 krinkle@deploy1002: Synchronized wmf-config/CommonSettings.php: {{Gerrit|Ieaea60a991e5}} (duration: 03m 10s)
* 00:29 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 00:28 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 00:28 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 00:27 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 00:23 krinkle@deploy1002: Synchronized multiversion/: {{Gerrit|Ia3406eba4ab8bb}} (duration: 03m 22s)
* 00:17 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 00:16 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 00:16 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 00:15 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 00:05 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 00:04 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 00:04 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 00:03 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
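The 04:04 to 04:17 [Elastic] entries above reason about which `be_x_oldwiki_titlesuggest_*` index is newer by reading the numeric suffix as an epoch time. A minimal sketch of that comparison follows; it assumes the trailing digits are Unix seconds (as the 04:17 entry suggests), and the helper name `index_created_at` is hypothetical, not part of any Elasticsearch or CirrusSearch tooling.

```python
from datetime import datetime, timezone


def index_created_at(index_name: str) -> datetime:
    """Read the trailing `_<digits>` of an index name as epoch seconds (UTC).

    Assumption: the suffix is a Unix timestamp, per the 04:17 log entry.
    """
    suffix = index_name.rsplit("_", 1)[1]
    return datetime.fromtimestamp(int(suffix), tz=timezone.utc)


# Index names copied from the log entries above.
aliased = "be_x_oldwiki_titlesuggest_1658396688"    # index the alias pointed to
red = "be_x_oldwiki_titlesuggest_1659407912"        # index that was red

for name in (aliased, red):
    print(name, index_created_at(name).isoformat())

# The red index carries the later timestamp, so it is the newer,
# unfinished reindex target rather than a stale leftover.
print("red index is newer:", index_created_at(red) > index_created_at(aliased))
```

Under that assumption the red index's suffix falls on 2022-08-02, shortly after the `elastic2059` reimage started, which is consistent with the 04:17 correction.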


== 2022-08-01 ==
* 23:59 krinkle@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|Id1ce285631f5}}, {{Gerrit|I194d419fbfe}} (duration: 03m 09s)
* 23:58 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 23:57 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 23:57 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 23:56 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 21:08 moritzm: drain ganeti2028 [[phab:T309957|T309957]]
* 21:03

== 2020-07-10 ==
* 21:52 ryankemper: Started long-running reindex of Elasticsearch indices in `eqiad`, `codfw`, and `dewiki` on `mwmaint1002` under tmux session `reindex` for user `ryankemper`
* 20:26 jgleeson: updated fundraising-tools from {{Gerrit|08ba1f6177}} to {{Gerrit|f8e424fe32}}
* 19:02 mutante: removing firewall hole for gerrit -> mysql servers on dbproxy servers for misc db's
* 18:44 mutante: kubernetes1004 - started nagios-nrpe-server
* 17:57 ebernhardson: change loginwiki password for Cindy-the-browser-test-bot, no email account was associated to allow for normal reset.
* 17:05 krinkle@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|I63fcea7737}} (duration: 00m 57s)
* 16:16 elukey@cumin1001: END (FAIL) - Cookbook sre.hadoop.change-distro (exit_code=99)
* 15:57 milimetric@deploy1001: Finished deploy [analytics/refinery@4d40145] (thin): Update EventLogging refine whitelist (THIN) (duration: 00m 08s)
* 15:56 milimetric@deploy1001: Started deploy [analytics/refinery@4d40145] (thin): Update EventLogging refine whitelist (THIN)
* 15:44 milimetric@deploy1001: Finished deploy [analytics/refinery@4d40145]: Update EventLogging refine whitelist (duration: 15m 17s)
* 15:30 elukey@cumin1001: START - Cookbook sre.hadoop.change-distro
* 15:29 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0)
* 15:29 milimetric@deploy1001: Started deploy [analytics/refinery@4d40145]: Update EventLogging refine whitelist
* 15:19 elukey@cumin1001: START - Cookbook sre.hadoop.stop-cluster
* 15:03 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.change-distro (exit_code=0)
* 14:39 elukey@cumin1001: START - Cookbook sre.hadoop.change-distro
* 14:37 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0)
* 14:30 elukey@cumin1001: START - Cookbook sre.hadoop.stop-cluster
* 13:41 godog: bounce ms-be1037, not quite responsive
* 12:36 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1110', diff saved to https://phabricator.wikimedia.org/P11860 and previous config saved to /var/cache/conftool/dbconfig/20200710-123604-marostegui.json
* 12:20 reedy@deploy1001: Synchronized php-1.35.0-wmf.40/extensions/Score/: Make Score errors use a specific css class (duration: 00m 58s)
* 10:21 kormat@cumin1001: dbctl commit (dc=all): 'Finish repooling es1021, and remove weight from es1010 [[phab:T257284|T257284]]', diff saved to https://phabricator.wikimedia.org/P11859 and previous config saved to /var/cache/conftool/dbconfig/20200710-102147-kormat.json
* 09:49 kormat@cumin1001: dbctl commit (dc=all): 'Start repooling es1021 after reimage @ 50% [[phab:T257284|T257284]]', diff saved to https://phabricator.wikimedia.org/P11858 and previous config saved to /var/cache/conftool/dbconfig/20200710-094954-kormat.json
* 09:04 kormat@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 09:02 kormat@cumin1001: START - Cookbook sre.hosts.downtime
* 08:51 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1110', diff saved to https://phabricator.wikimedia.org/P11857 and previous config saved to /var/cache/conftool/dbconfig/20200710-085157-marostegui.json
* 08:51 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1106', diff saved to https://phabricator.wikimedia.org/P11856 and previous config saved to /var/cache/conftool/dbconfig/20200710-085112-marostegui.json
* 08:50 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1107', diff saved to https://phabricator.wikimedia.org/P11855 and previous config saved to /var/cache/conftool/dbconfig/20200710-085040-marostegui.json
* 08:23 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1106', diff saved to https://phabricator.wikimedia.org/P11853 and previous config saved to /var/cache/conftool/dbconfig/20200710-082346-marostegui.json
* 08:23 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1107', diff saved to https://phabricator.wikimedia.org/P11852 and previous config saved to /var/cache/conftool/dbconfig/20200710-082329-marostegui.json
* 08:22 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 08:22 jmm@cumin2001: START - Cookbook sre.hosts.downtime
* 08:22 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 08:22 jmm@cumin2001: START - Cookbook sre.hosts.downtime
* 08:09 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1107', diff saved to https://phabricator.wikimedia.org/P11851 and previous config saved to /var/cache/conftool/dbconfig/20200710-080912-marostegui.json
* 08:09 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1119', diff saved to https://phabricator.wikimedia.org/P11850 and previous config saved to /var/cache/conftool/dbconfig/20200710-080854-marostegui.json
* 08:09 kormat@cumin1001: dbctl commit (dc=all): 'Depool es1021 for reimaging [[phab:T257284|T257284]]', diff saved to https://phabricator.wikimedia.org/P11849 and previous config saved to /var/cache/conftool/dbconfig/20200710-080843-kormat.json
* 08:01 kormat@cumin1001: dbctl commit (dc=all): 'Reset es2020/es2021 to correct weights after master switch [[phab:T257284|T257284]]', diff saved to https://phabricator.wikimedia.org/P11848 and previous config saved to /var/cache/conftool/dbconfig/20200710-080133-kormat.json
* 08:00 moritzm: installing cron security updates on jessie (stretch/buster already fixed)
* 07:56 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1119', diff saved to https://phabricator.wikimedia.org/P11847 and previous config saved to /var/cache/conftool/dbconfig/20200710-075608-marostegui.json
* 07:55 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1107', diff saved to https://phabricator.wikimedia.org/P11846 and previous config saved to /var/cache/conftool/dbconfig/20200710-075500-marostegui.json
* 07:54 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1079', diff saved to https://phabricator.wikimedia.org/P11845 and previous config saved to /var/cache/conftool/dbconfig/20200710-075431-marostegui.json
* 07:44 kormat: reimaging es1021 to buster [[phab:T257284|T257284]]
* 07:43 kormat@cumin1001: dbctl commit (dc=all): 'Add weight to es1020, reduce weight on es1021 [[phab:T257284|T257284]]', diff saved to https://phabricator.wikimedia.org/P11844 and previous config saved to /var/cache/conftool/dbconfig/20200710-074326-kormat.json
* 07:41 jbond@deploy1001: Finished deploy [librenms/librenms@0a88d64]: redeploy to [try and] fix php errors (duration: 00m 05s)
* 07:41 jbond@deploy1001: Started deploy [librenms/librenms@0a88d64]: redeploy to [try and] fix php errors
* 07:32 moritzm: installing e2fsprogs security updates on jessie (stretch/buster already fixed)
* 07:15 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'proton' for release 'production' .
* 07:14 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'proton' for release 'production' .
* 07:13 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'proton' for release 'production' .
* 06:57 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1099:3311', diff saved to https://phabricator.wikimedia.org/P11843 and previous config saved to /var/cache/conftool/dbconfig/20200710-065751-marostegui.json
* 06:38 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1099:3311', diff saved to https://phabricator.wikimedia.org/P11841 and previous config saved to /var/cache/conftool/dbconfig/20200710-063818-marostegui.json
* 06:37 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1134', diff saved to https://phabricator.wikimedia.org/P11840 and previous config saved to /var/cache/conftool/dbconfig/20200710-063746-marostegui.json
* 06:35 marostegui: Compress InnoDB on db1124:3311 (Sanitarium - lag will appear on s1 on labsdb) - [[phab:T254462|T254462]]
* 04:44 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1134', diff saved to https://phabricator.wikimedia.org/P11839 and previous config saved to /var/cache/conftool/dbconfig/20200710-044428-marostegui.json
* 01:44 mutante: LDAP - adding coka to wmde and nda ([[phab:T257038|T257038]])
* 00:47 Reedy: truncated labswiki.interwiki table (outdated and unnecessary)
 
== 2020-07-09 ==
* 23:10 krinkle@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|I2c2dea832}} (duration: 00m 56s)
* 21:52 tgr: all sessions have been invalidated due to [[phab:T256395|T256395]]
* 20:58 eileen: https://phabricator.wikimedia.org/T253152
* 19:16 herron: upgraded eqiad elk7 cluster from 7.4.2 to 7.8.0 [[phab:T234854|T234854]]
* 19:05 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.40  refs [[phab:T256668|T256668]]
* 18:51 elukey: update spark2 to 2.4.4-bin-hadoop2.6-3 for buster-wikimedia
* 18:44 mutante: stat1004, stat1006, stat1007 - upgrading git-review package from 1.25 to 1.27 so that it keeps working with new Gerrit 3.2 ([[phab:T257609|T257609]])
* 18:10 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|9f2557f848e99facaa62ca6b3a948cc3e32c32a3}}: Updating config for Readers Web affinity quicksurvey ([[phab:T246977|T246977]]) (duration: 01m 06s)
* 17:42 chaomodus: codfw frack management dns automation deployment complete [[phab:T233183|T233183]]
* 17:37 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 17:36 James_F: Synchronized wmf-config/CommonSettings.php: ExtensionDistribution: Drop REL1_33, EOL'ed [[phab:T256087|T256087]]
* 17:35 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 17:35 moritzm: rebooting moscovium for kernel update
* 17:33 chaomodus: deploying frack codfw management dns automation
* 17:32 crusnov@cumin2001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 17:31 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 17:28 crusnov@cumin2001: START - Cookbook sre.dns.netbox
* 17:27 moritzm: rebooting planet1002 (planet.wikimedia.org) for kernel update
* 17:27 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 17:10 krinkle@deploy1001: Synchronized wmf-config/: {{Gerrit|Ia2f5eddbf2aad2}} (duration: 01m 04s)
* 17:09 krinkle@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|Ia2f5eddbf2aad2}} (duration: 01m 05s)
* 15:32 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 15:29 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 14:29 papaul: replacing msw-b1,b2,b3 and b4
* 14:03 moritzm: installing libtirpc security updates
* 13:45 moritzm: installing gnutls28 security updates
* 13:31 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1089', diff saved to https://phabricator.wikimedia.org/P11831 and previous config saved to /var/cache/conftool/dbconfig/20200709-133134-marostegui.json
* 13:31 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 13:29 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 13:29 moritzm: rebooting puppetboard1001 (puppetboard.wikimedia.org) for kernel update
* 13:15 moritzm: installing ffmpeg security updates
* 13:11 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 13:10 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1089', diff saved to https://phabricator.wikimedia.org/P11830 and previous config saved to /var/cache/conftool/dbconfig/20200709-131039-marostegui.json
* 13:08 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 13:07 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 13:05 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 13:00 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 12:58 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 12:57 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'proton' for release 'production' .
* 12:57 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 12:56 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'proton' for release 'production' .
* 12:56 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'proton' for release 'production' .
* 12:54 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 12:54 moritzm: rebooting install* servers for kernel security update
* 12:43 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 12:40 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 12:40 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 12:38 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 12:38 moritzm: rebooting urldownloader1001/2001 for kernel update (failed over, these are now the inactive ones)
* 12:23 jmm@cumin2001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99)
* 12:22 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 12:22 moritzm: rebooting dbmonitor1001 / tendril.wikimedia.org for kernel update
* 12:11 XioNoX: enable asw2-b-eqiad:ae3 (to cloudsw1-c8) - [[phab:T251632|T251632]]
* 11:56 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 11:54 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 11:52 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 11:50 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 11:50 moritzm: rebooting debmonitor1001 for kernel update
* 11:42 urbanecm@deploy1001: Synchronized php-1.35.0-wmf.40/extensions/Translate/tag/SpecialPageTranslation.php: {{Gerrit|6541d3ff51f52fe8a1bdbfa86022f8d97d6c7680}}: DeprecatablePropertyArray: Use MW_VERSION instead of array_key_exists ([[phab:T257531|T257531]]) (duration: 01m 05s)
* 11:28 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|3a7c1c33e58637437f819edf039008a00dc5be27}}: Rename namespace on kn.wikipedia.org ([[phab:T255337|T255337]]) (duration: 01m 04s)
* 11:24 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|0a3c1f94a702b527842ed4f34d8bf41b26235e64}}: Add *.oireachtas.ie to the wgCopyUploadsDomains whitelist for commonswiki ([[phab:T256543|T256543]]) (duration: 01m 04s)
* 11:19 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 11:17 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
* 11:10 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 11:10 aborrero@cumin1001: START - Cookbook sre.hosts.downtime
* 11:09 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|e6f442c6900524482806aeb1b5162e65bf7c97ac}}: Enable Quicksurveys for Desktop Improvements Project ([[phab:T246977|T246977]]) (duration: 01m 06s)
* 11:01 vgutierrez: restart ats-tls on cp1085
* 10:55 _joe_: restarting php7.2-fpm on mw1282, workers failing with sigill
* 10:54 _joe_: depool mw1282
* 10:54 mvolz@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'citoid' for release 'production' .
* 10:34 mvolz@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'citoid' for release 'production' .
* 10:23 _joe_: rolling restart the remaining restbases in eqiad, and all of codfw
* 10:22 mvolz@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'citoid' for release 'staging' .
* 10:09 _joe_: restarting restbase on rb1020-22
* 09:53 _joe_: restarting restbase on restbase1024,1023
* 09:36 _joe_: restarting restbase on rb1026,1027 to switch to proton on k8s
* 09:34 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 09:31 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
* 09:28 _joe_: restarting restbase on restbase1025 to pick up the switch to k8s of proton
* 09:27 godog: bounce thanos-compact on thanos-fe2001
* 09:07 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.change-distro (exit_code=0)
* 08:52 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1079', diff saved to https://phabricator.wikimedia.org/P11828 and previous config saved to /var/cache/conftool/dbconfig/20200709-085228-marostegui.json
* 08:44 marostegui: Stop haproxy on dbproxy1017 before upgrading to buster - [[phab:T255408|T255408]]
* 08:23 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1136', diff saved to https://phabricator.wikimedia.org/P11827 and previous config saved to /var/cache/conftool/dbconfig/20200709-082355-marostegui.json
* 08:23 moritzm: imported osm2pgsql 0.96.0+ds-1~bpo9+1 to "main" component [[phab:T256877|T256877]]
* 08:22 elukey@cumin1001: START - Cookbook sre.hadoop.change-distro
* 08:20 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0)
* 08:13 elukey@cumin1001: START - Cookbook sre.hadoop.stop-cluster
* 08:11 XioNoX: disable igmp snooping on msw1-codfw
* 07:59 marostegui: Stop db1117:3322 to clone db1084, this will trigger haproxy alerts - [[phab:T257540|T257540]]
* 07:57 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1136', diff saved to https://phabricator.wikimedia.org/P11825 and previous config saved to /var/cache/conftool/dbconfig/20200709-075749-marostegui.json
* 07:19 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 07:15 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
* 06:52 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 06:49 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
* 05:39 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1101:3317', diff saved to https://phabricator.wikimedia.org/P11824 and previous config saved to /var/cache/conftool/dbconfig/20200709-053905-marostegui.json
* 05:32 marostegui@cumin1001: dbctl commit (dc=all): 'Remove db1084 from dbctl', diff saved to https://phabricator.wikimedia.org/P11823 and previous config saved to /var/cache/conftool/dbconfig/20200709-053206-marostegui.json
* 05:18 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1084', diff saved to https://phabricator.wikimedia.org/P11822 and previous config saved to /var/cache/conftool/dbconfig/20200709-051826-marostegui.json
* 05:13 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1101:3317', diff saved to https://phabricator.wikimedia.org/P11821 and previous config saved to /var/cache/conftool/dbconfig/20200709-051355-marostegui.json
* 05:11 marostegui: Remove revision triggers from db2093:3315 [[phab:T238966|T238966]]
* 05:10 marostegui: Deploy schema change on s5 codfw, lag will be generated - [[phab:T238966|T238966]]
* 01:43 tzatziki: reset email for GseSro
* 00:58 cdanis: ✔️
See [[Server Admin Log/Archives]].
<noinclude>
<noinclude>

Revision as of 00:58, 11 August 2022

2022-08-11

  • 00:58 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=varnish-fe
  • 00:58 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=ats-be
  • 00:58 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=ats-tls
  • 00:57 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on cp2042.codfw.wmnet with reason: host down; depooled and will debug tomorrow
  • 00:57 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on cp2042.codfw.wmnet with reason: host down; depooled and will debug tomorrow

2022-08-10

  • 21:25 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1016.eqiad.wmnet
  • 21:23 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1014.eqiad.wmnet
  • 21:10 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 16 hosts with reason: T309810
  • 21:10 bking@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on 16 hosts with reason: T309810
  • 21:09 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on elastic[1101-1102].eqiad.wmnet with reason: T309810
  • 21:09 bking@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on elastic[1101-1102].eqiad.wmnet with reason: T309810
  • 21:00 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 21:00 cjming: end of UTC late backport window
  • 20:59 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 20:59 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 20:59 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Remove unused $wgEnableMWSuggest (duration: 03m 04s)
  • 20:58 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 20:56 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Enable new topic tool on dewiki (T313699) (duration: 03m 01s)
  • 20:34 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: testwiki: set $wgCdnMatchParameterOrder to false (T314868) (duration: 03m 20s)
  • 20:31 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 20:30 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 20:30 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 20:30 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 20:19 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 20:18 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 20:18 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 20:17 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 20:12 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 20:09 bking@cumin1001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
  • 20:08 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 20:08 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 20:08 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Start writing to cuc_actor everywhere except s4 and s8 (T233004) (duration: 03m 15s)
  • 20:07 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 19:51 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2053-2054].codfw.wmnet
  • 19:51 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc[2053-2054].codfw.wmnet
  • 19:35 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2019-2020].codfw.wmnet
  • 19:35 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for parse[2019-2020].codfw.wmnet
  • 19:35 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2016-2018].codfw.wmnet
  • 19:35 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for parse[2016-2018].codfw.wmnet
  • 19:34 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2036.codfw.wmnet
  • 19:34 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc2036.codfw.wmnet
  • 19:28 sukhe: testing ATS 9.1.3-1wm1 on cp4026: T309651
  • 19:09 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1087.eqiad.wmnet with OS bullseye
  • 19:06 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1086.eqiad.wmnet with OS bullseye
  • 18:55 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1087.eqiad.wmnet with reason: host reimage
  • 18:51 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1086.eqiad.wmnet with reason: host reimage
  • 18:50 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1087.eqiad.wmnet with reason: host reimage
  • 18:49 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1086.eqiad.wmnet with reason: host reimage
  • 18:47 bking@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 18:38 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1087.eqiad.wmnet with OS bullseye
  • 18:36 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1086.eqiad.wmnet with OS bullseye
  • 18:22 urandom: truncating Cassandra hints (eqiad datacenter) -- T314941
  • 18:13 urandom: truncating codfw Cassandra hints (eqiad datacenter) -- T314941
  • 18:07 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main2005.codfw.wmnet
  • 18:07 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for kafka-main2005.codfw.wmnet
  • 18:05 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repool D8 DBs after PDU maint (T310146)', diff saved to https://phabricator.wikimedia.org/P32346 and previous config saved to /var/cache/conftool/dbconfig/20220810-180529-ladsgroup.json
  • 17:42 otto@deploy1002: Finished deploy [analytics/refinery@6e47e0e]: Add missing changes to the deletion script - T270433 - [analytics/refinery@6e47e0e] (duration: 05m 28s)
  • 17:39 fnegri@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts labweb1002.wikimedia.org
  • 17:39 fnegri@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 17:36 otto@deploy1002: Started deploy [analytics/refinery@6e47e0e]: Add missing changes to the deletion script - T270433 - [analytics/refinery@6e47e0e]
  • 17:35 fnegri@cumin1001: START - Cookbook sre.dns.netbox
  • 17:34 otto@deploy1002: Finished deploy [analytics/refinery@6e47e0e] (hadoop-test): Add missing changes to the deletion script - T270433 - TEST [analytics/refinery@6e47e0e] (duration: 04m 19s)
  • 17:30 fnegri@cumin1001: START - Cookbook sre.hosts.decommission for hosts labweb1002.wikimedia.org
  • 17:30 otto@deploy1002: Started deploy [analytics/refinery@6e47e0e] (hadoop-test): Add missing changes to the deletion script - T270433 - TEST [analytics/refinery@6e47e0e]
  • 17:09 dzahn@cumin2002: START - Cookbook sre.dns.netbox
  • 17:08 otto@deploy1002: Started deploy [analytics/refinery@d4dd7e4] (hadoop-test): Add safety limits to refinery-drop-older-than - T270433 - TEST [analytics/refinery@d4dd7e4]
  • 17:06 sukhe: testing ATS 9.1.3-1wm1 on cp4032: T309651
  • 17:06 urandom: flushing RESTBase Cassandra tables -row B- to (temporarily) free instance-data space -- T314941
  • 17:05 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on krb2001.codfw.wmnet with reason: btullis codfw maintenance
  • 17:05 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on krb2001.codfw.wmnet with reason: btullis codfw maintenance
  • 17:04 dzahn@cumin2002: START - Cookbook sre.hosts.decommission for hosts gerrit2001.wikimedia.org
  • 17:02 sukhe: testing ATS 9.1.3-1wm1 on cp6008: T309651
  • 16:56 sukhe: testing ATS 9.1.3-1wm1 on cp6016: T309651
  • 16:55 fnegri@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts labweb1001.wikimedia.org
  • 16:55 fnegri@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 16:32 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts gerrit2001.wikimedia.org
  • 16:32 dzahn@cumin2002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
  • 16:32 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes[2013-2014].codfw.wmnet
  • 16:31 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for kubernetes[2013-2014].codfw.wmnet
  • 16:31 jelto: kubectl uncordon kubernetes2014.codfw.wmnet
  • 16:31 fnegri@cumin1001: START - Cookbook sre.dns.netbox
  • 16:30 jelto: kubectl uncordon kubernetes2013.codfw.wmnet
  • 16:29 urandom: restarting Cassandra (RESTBase) -row A- to apply r822110 -- T314941
  • 16:27 dzahn@cumin2002: START - Cookbook sre.dns.netbox
  • 16:25 fnegri@cumin1001: START - Cookbook sre.hosts.decommission for hosts labweb1001.wikimedia.org
  • 16:23 mutante: shutting down gerrit2001
  • 16:23 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2034-2035].codfw.wmnet
  • 16:23 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc[2034-2035].codfw.wmnet
  • 16:22 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2016-2018].codfw.wmnet
  • 16:22 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for parse[2016-2018].codfw.wmnet
  • 16:16 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=maps2010.codfw.wmnet
  • 16:16 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=sessionstore2003.codfw.wmnet
  • 16:13 sukhe: reprepro -C component/trafficserver9 include buster-wikimedia trafficserver_9.1.3-1wm1_amd64.changes: T309651
  • 16:13 dzahn@cumin2002: START - Cookbook sre.hosts.decommission for hosts gerrit2001.wikimedia.org
  • 16:11 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be[2039,2050,2056,2059].codfw.wmnet,thanos-be2004.codfw.wmnet with reason: PDU work
  • 16:10 mvernon@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be[2039,2050,2056,2059].codfw.wmnet,thanos-be2004.codfw.wmnet with reason: PDU work
  • 16:09 urandom: flushing tables in row D (RESTBase Cassandra cluster) -- T314941
  • 15:54 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for gitlab-runner2004.codfw.wmnet
  • 15:54 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for gitlab-runner2004.codfw.wmnet
  • 15:53 sukhe: poweroff cp2041, 42 for PDU upgrade: rack D7
  • 15:51 urandom: flushing tables in row B (RESTBase Cassandra cluster) -- T314941
  • 15:49 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2004.codfw.wmnet with reason: PDU maintenance
  • 15:49 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2004.codfw.wmnet with reason: PDU maintenance
  • 15:46 urandom: flushing tables in row A (RESTBase Cassandra cluster) -- T314941
  • 15:46 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2012.codfw.wmnet with reason: btullis codfw maintenance
  • 15:46 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp[2041-2042].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
  • 15:46 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2012.codfw.wmnet with reason: btullis codfw maintenance
  • 15:46 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 3:00:00 on cp[2041-2042].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
  • 15:46 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2011.codfw.wmnet with reason: btullis codfw maintenance
  • 15:46 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2011.codfw.wmnet with reason: btullis codfw maintenance
  • 15:45 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2010.codfw.wmnet with reason: btullis codfw maintenance
  • 15:45 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2010.codfw.wmnet with reason: btullis codfw maintenance
  • 15:45 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2009.codfw.wmnet with reason: btullis codfw maintenance
  • 15:45 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2009.codfw.wmnet with reason: btullis codfw maintenance
  • 15:37 urandom: (ephemerally) increasing hinted hand-off delivery rate limit to 16KB, RESTBase eqiad nodes -- T314941
  • 15:34 jbond: remove puppetmaster[12]002 from production
  • 15:30 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main2004.codfw.wmnet
  • 15:30 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for kafka-main2004.codfw.wmnet
  • 15:20 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2051-2052].codfw.wmnet
  • 15:20 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc[2051-2052].codfw.wmnet
  • 15:17 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc-gp2003.codfw.wmnet
  • 15:17 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc-gp2003.codfw.wmnet
  • 15:16 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2033.codfw.wmnet
  • 15:16 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc2033.codfw.wmnet
  • 15:14 _joe_: power off krb2002
  • 15:14 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on krb2002.codfw.wmnet with reason: PDU maintenance
  • 15:13 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on krb2002.codfw.wmnet with reason: PDU maintenance
  • 15:13 _joe_: shutting down rdb2010,puppetmaster2002 for d5 maintenance
  • 15:02 jelto: power off mc2035
  • 15:01 jelto: power off mc2034
  • 15:01 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mc2035.codfw.wmnet with reason: PDU swap
  • 15:01 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on mc2035.codfw.wmnet with reason: PDU swap
  • 15:01 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mc2034.codfw.wmnet with reason: PDU swap
  • 15:01 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on mc2034.codfw.wmnet with reason: PDU swap
  • 14:43 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: PDU Maint (T310146)
  • 14:43 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: PDU Maint (T310146)
  • 14:38 urandom: disabling reserved space on eqiad nodes (RESTBase), /dev/md2 (aka /srv/cassandra/instance-data) -- T314941
  • 14:28 jelto: power off kafka-main2004 gracefully
  • 14:28 hnowlan: shutting down sessionstore2003
  • 14:27 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=sessionstore2003.codfw.wmnet
  • 14:27 sukhe: power off cp2039, cp2040 for PDU upgrade: rack D
  • 14:27 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2003.codfw.wmnet with reason: PDU maintenance
  • 14:27 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2003.codfw.wmnet with reason: PDU maintenance
  • 14:25 jelto: power off mc-gp2003
  • 14:25 jelto: power off mc2033
  • 14:24 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:30:00 on kafka-main2004.codfw.wmnet with reason: PDU swap
  • 14:23 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:30:00 on kafka-main2004.codfw.wmnet with reason: PDU swap
  • 14:23 sukhe: depool codfw for PDU upgrade: rack D
  • 14:23 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:30:00 on mc-gp2003.codfw.wmnet with reason: PDU swap
  • 14:23 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:30:00 on mc-gp2003.codfw.wmnet with reason: PDU swap
  • 14:23 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:30:00 on mc2033.codfw.wmnet with reason: PDU swap
  • 14:23 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:30:00 on mc2033.codfw.wmnet with reason: PDU swap
  • 14:15 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp20[39|40]\.codfw\.wmnet,service=ats-tls
  • 14:13 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp[2039-2040].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
  • 14:13 urandom: flushing Cassandra tables, restbase1030
  • 14:13 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 3:00:00 on cp[2039-2040].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
  • 14:13 urandom: flushing Cassandra tables, restbase1019
  • 14:12 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on dns2002.wikimedia.org with reason: shutdown for PDU upgrade: rack D4
  • 14:12 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 3:00:00 on dns2002.wikimedia.org with reason: shutdown for PDU upgrade: rack D4
  • 14:11 urandom: flushing Cassandra tables, restbase1017 1018 1021 1024 1025 1026 1028 1029
  • 14:05 urandom: flushing tables, restbase1016
  • 13:52 hnowlan: powered up restbase2018
  • 13:32 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on netmon1003.wikimedia.org with reason: pdu
  • 13:32 filippo@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on netmon1003.wikimedia.org with reason: pdu
  • 13:32 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on logstash2003.codfw.wmnet with reason: pdu
  • 13:31 filippo@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on logstash2003.codfw.wmnet with reason: pdu
  • 13:31 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on logstash2029.codfw.wmnet with reason: pdu
  • 13:31 root@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on logstash2029.codfw.wmnet with reason: pdu
  • 13:30 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: T310146
  • 13:30 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: T310146
  • 13:17 elukey: powering on restbase2027
  • 13:12 elukey: powering on restbase2026
  • 13:12 _joe_: powering on restbase2023
  • 13:01 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1160 (T312863)', diff saved to https://phabricator.wikimedia.org/P32343 and previous config saved to /var/cache/conftool/dbconfig/20220810-130108-ladsgroup.json
  • 13:01 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
  • 13:00 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
  • 12:37 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic[2072,2084-2085].codfw.wmnet with reason: T310146
  • 12:37 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic[2072,2084-2085].codfw.wmnet with reason: T310146
  • 12:27 jbond: remove confd from servers that shouldn't have it
  • 12:05 ladsgroup@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/Echo/maintenance/removeOrphanedEvents.php: Backport: Run clean ups with removeOrphanedEvents in major batches (T310428) (duration: 03m 32s)
  • 11:47 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 11:46 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 11:46 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 11:43 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 11:15 jbond@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS buster
  • 10:54 jbond@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
  • 10:51 jbond@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
  • 10:37 jbond@cumin1001: START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster
  • 10:31 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2181-2182].codfw.wmnet with reason: D6 PDU maint (T310146)
  • 10:31 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2181-2182].codfw.wmnet with reason: D6 PDU maint (T310146)
  • 10:26 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on maps2010.codfw.wmnet with reason: PDU maintenance
  • 10:26 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on maps2010.codfw.wmnet with reason: PDU maintenance
  • 10:26 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=maps2010.codfw.wmnet
  • 10:25 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=maps2010.codfw.wmnet
  • 10:25 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=maps2008.codfw.wmnet
  • 10:24 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on restbase2018.codfw.wmnet with reason: PDU maintenance
  • 10:24 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase2018.codfw.wmnet
  • 10:24 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on restbase2018.codfw.wmnet with reason: PDU maintenance
  • 10:23 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ml-serve2008.codfw.wmnet with reason: PDU maintenance
  • 10:23 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on ml-serve2008.codfw.wmnet with reason: PDU maintenance
  • 10:20 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ores2009.codfw.wmnet with reason: PDU maintenance
  • 10:20 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on ores2009.codfw.wmnet with reason: PDU maintenance
  • 10:19 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ores2008.codfw.wmnet with reason: PDU maintenance
  • 10:19 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on ores2008.codfw.wmnet with reason: PDU maintenance
  • 10:03 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase202[367].codfw.wmnet
  • 10:02 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on restbase[2023,2026-2027].codfw.wmnet with reason: PDU maintenance
  • 10:02 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on restbase[2023,2026-2027].codfw.wmnet with reason: PDU maintenance
  • 09:53 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: D8 PDU Maint (T310146)
  • 09:52 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: D8 PDU Maint (T310146)
  • 09:51 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool D8 DBs for PDU maint (T310146)', diff saved to https://phabricator.wikimedia.org/P32341 and previous config saved to /var/cache/conftool/dbconfig/20220810-095059-ladsgroup.json
  • 09:36 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2101,2130,2140].codfw.wmnet,dbproxy2004.codfw.wmnet with reason: D6 PDU maint (T310146)
  • 09:36 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2101,2130,2140].codfw.wmnet,dbproxy2004.codfw.wmnet with reason: D6 PDU maint (T310146)
  • 09:34 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool D6 dbs (T310146)', diff saved to https://phabricator.wikimedia.org/P32340 and previous config saved to /var/cache/conftool/dbconfig/20220810-093433-ladsgroup.json
  • 09:31 jelto: depool services in codfw for upcoming PDU replacement - T309956
  • 09:30 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 8 hosts with reason: Maintenance
  • 09:29 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 8 hosts with reason: Maintenance
  • 09:29 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
  • 09:29 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
  • 09:28 jynus: shutdown backup2007 before pdu upgrade T310146
  • 09:16 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 09:15 ladsgroup@deploy1002: Synchronized php-1.39.0-wmf.23/maintenance/namespaceDupes.php: Backport: maintenance: Add support for links migration to namespaceDupes.php (T314711) (duration: 03m 18s)
  • 09:15 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2093,2120,2129,2172].codfw.wmnet with reason: D5 PDU maint (T310146)
  • 09:15 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2093,2120,2129,2172].codfw.wmnet with reason: D5 PDU maint (T310146)
  • 09:14 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 09:14 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 09:13 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 09:10 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool D5 dbs (T310146)', diff saved to https://phabricator.wikimedia.org/P32339 and previous config saved to /var/cache/conftool/dbconfig/20220810-091038-ladsgroup.json
  • 08:49 jynus: shutdown dbprov2003 before pdu upgrade T310146
  • 08:49 ayounsi@cumin1001: END (FAIL) - Cookbook sre.network.debug (exit_code=99) for Netbox interface ID cr1-drmrs:xe-0/1/2
  • 08:48 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
  • 08:48 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be2028.codfw.wmnet
  • 08:48 mvernon@cumin1001: START - Cookbook sre.hosts.remove-downtime for ms-be2028.codfw.wmnet
  • 08:42 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P32337 and previous config saved to /var/cache/conftool/dbconfig/20220810-084222-ladsgroup.json
  • 08:37 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 08:37 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
  • 08:36 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
  • 08:36 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 08:36 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 08:35 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 08:35 ladsgroup@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Stop writing to the old templatelinks fields in s5 (T312865) (duration: 03m 29s)
  • 08:32 jelto: power off gitlab-runner2004
  • 08:31 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:30:00 on gitlab-runner2004.codfw.wmnet with reason: PDU swap
  • 08:31 root@cumin1001: START - Cookbook sre.hosts.downtime for 8:30:00 on gitlab-runner2004.codfw.wmnet with reason: PDU swap
  • 08:29 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be2028.codfw.wmnet with reason: Trying to fix full /
  • 08:28 mvernon@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be2028.codfw.wmnet with reason: Trying to fix full /
  • 08:28 ayounsi@cumin1001: END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox interface ID cr1-drmrs:xe-0/1/2
  • 08:27 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
  • 08:27 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P32336 and previous config saved to /var/cache/conftool/dbconfig/20220810-082718-ladsgroup.json
  • 08:25 ayounsi@cumin1001: END (FAIL) - Cookbook sre.network.debug (exit_code=99) for Netbox interface ID cr1-drmrs:xe-0/1/2
  • 08:25 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
  • 08:24 ayounsi@cumin1001: END (FAIL) - Cookbook sre.network.debug (exit_code=99) for Netbox interface ID cr1-drmrs:xe-0/1/2
  • 08:24 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
  • 08:23 kart_: Run: mwscript namespaceDupes.php arywiki --fix (T291737)
  • 08:13 jynus: restart replication on db1117:m1 T309074
  • 08:12 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P32335 and previous config saved to /var/cache/conftool/dbconfig/20220810-081213-ladsgroup.json
  • 08:09 kartik@deploy1002: Finished scap: Backport: arywiki: change namespace translations, add unchanged namespaces and add old translations as aliases (T291737) (duration: 10m 37s)
  • 07:59 kartik@deploy1002: Started scap: Backport: arywiki: change namespace translations, add unchanged namespaces and add old translations as aliases (T291737)
  • 07:57 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P32334 and previous config saved to /var/cache/conftool/dbconfig/20220810-075708-ladsgroup.json
  • 07:56 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P32333 and previous config saved to /var/cache/conftool/dbconfig/20220810-075636-ladsgroup.json
  • 07:55 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 07:52 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 07:52 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 07:52 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance
  • 07:52 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance
  • 07:51 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 07:51 dcaro@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 07:46 dcaro@cumin1001: START - Cookbook sre.dns.netbox
  • 07:39 dcaro@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 07:34 dcaro@cumin1001: START - Cookbook sre.dns.netbox
  • 07:33 godog: depool thanos-fe2001 for debugging
  • 07:11 kartik@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Enable SectionTranslation on testwiki with new MT support from Google (T313296) (duration: 05m 44s)
  • 07:11 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 07:08 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 07:08 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 07:07 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 05:24 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on kubernetes[2013-2014].codfw.wmnet with reason: PDU maintenance
  • 05:24 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on kubernetes[2013-2014].codfw.wmnet with reason: PDU maintenance
  • 05:19 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on parse[2016-2020].codfw.wmnet with reason: PDU maintenance
  • 05:19 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on parse[2016-2020].codfw.wmnet with reason: PDU maintenance
  • 05:12 _joe_: starting to shut down servers in codfw for the PDU maintenance
  • 05:09 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on 10 hosts with reason: PDU maintenance
  • 05:09 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on 10 hosts with reason: PDU maintenance
  • 05:09 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on mc-gp2003.codfw.wmnet with reason: PDU maintenance
  • 05:09 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on mc-gp2003.codfw.wmnet with reason: PDU maintenance
  • 05:06 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on mc2033.codfw.wmnet with reason: PDU maintenance
  • 05:06 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on mc2033.codfw.wmnet with reason: PDU maintenance
  • 05:05 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on 7 hosts with reason: PDU maintenance
  • 05:05 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on 7 hosts with reason: PDU maintenance
  • 02:34 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 02:33 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 02:33 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 02:32 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 02:07 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 02:06 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 02:06 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 02:05 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply

2022-08-09

  • 23:17 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1011.eqiad.wmnet
  • 23:07 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
  • 23:06 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
  • 22:51 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
  • 22:51 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
  • 22:49 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
  • 22:49 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
  • 22:46 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1015.eqiad.wmnet
  • 22:31 ryankemper@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
  • 22:31 ryankemper@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
  • 22:28 bking@cumin1001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
  • 22:02 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
  • 22:02 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
  • 21:57 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs2006.codfw.wmnet with reason: T310146
  • 21:57 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs2006.codfw.wmnet with reason: T310146
  • 21:53 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
  • 21:52 bking@deploy1002: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
  • 21:50 bking@deploy1002: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
  • 21:49 bking@deploy1002: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
  • 21:43 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop: apply
  • 21:43 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop: apply
  • 21:43 bking@deploy1002: helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
  • 21:43 bking@deploy1002: helmfile [eqiad] START helmfile.d/services/changeprop: apply
  • 21:43 bking@deploy1002: helmfile [staging] DONE helmfile.d/services/changeprop: apply
  • 21:43 bking@deploy1002: helmfile [staging] START helmfile.d/services/changeprop: apply
  • 21:08 bking@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 21:00 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1014.eqiad.wmnet
  • 20:56 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
  • 20:55 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
  • 20:55 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148 (T312863)', diff saved to https://phabricator.wikimedia.org/P32332 and previous config saved to /var/cache/conftool/dbconfig/20220809-205548-ladsgroup.json
  • 20:51 bking@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wdqs1014.eqiad.wmnet
  • 20:51 bking@cumin1001: START - Cookbook sre.hosts.remove-downtime for wdqs1014.eqiad.wmnet
  • 20:46 bking@cumin1001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
  • 20:40 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P32331 and previous config saved to /var/cache/conftool/dbconfig/20220809-204042-ladsgroup.json
  • 20:25 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P32330 and previous config saved to /var/cache/conftool/dbconfig/20220809-202536-ladsgroup.json
  • 20:10 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148 (T312863)', diff saved to https://phabricator.wikimedia.org/P32329 and previous config saved to /var/cache/conftool/dbconfig/20220809-201030-ladsgroup.json
  • 19:57 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on wdqs1014.eqiad.wmnet with reason: T314890
  • 19:57 bking@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on wdqs1014.eqiad.wmnet with reason: T314890
  • 19:56 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on wdqs1016.eqiad.wmnet with reason: T314890
  • 19:56 bking@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on wdqs1016.eqiad.wmnet with reason: T314890
  • 19:55 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on wdqs1015.eqiad.wmnet with reason: T314890
  • 19:55 bking@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on wdqs1015.eqiad.wmnet with reason: T314890
  • 19:38 dcausse@deploy1002: helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply
  • 19:36 dcausse@deploy1002: helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply
  • 19:35 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1130.eqiad.wmnet with reason: Maintenance
  • 19:35 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 10:00:00 on db1130.eqiad.wmnet with reason: Maintenance
  • 19:25 bking@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 18:09 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 18:06 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
  • 17:54 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 17:47 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
  • 17:38 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1072.eqiad.wmnet with OS bullseye
  • 17:29 vgutierrez: test trafficserver 9.1.2-1wm2 in cp6016 - T309651
  • 17:15 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1072.eqiad.wmnet with reason: host reimage
  • 17:13 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1072.eqiad.wmnet with reason: host reimage
  • 17:00 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1072.eqiad.wmnet with OS bullseye
  • 16:54 bking@cumin1001: END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
  • 16:54 bking@cumin1001: START - Cookbook sre.elasticsearch.force-shard-allocation
  • 16:53 bking@cumin1001: END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
  • 16:53 bking@cumin1001: START - Cookbook sre.elasticsearch.force-shard-allocation
  • 16:26 bking@deploy1002: helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
  • 16:26 bking@deploy1002: helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
  • 16:01 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1069.eqiad.wmnet with OS bullseye
  • 15:45 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1069.eqiad.wmnet with reason: host reimage
  • 15:42 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1069.eqiad.wmnet with reason: host reimage
  • 15:32 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 15:31 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 15:31 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 15:30 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1069.eqiad.wmnet with OS bullseye
  • 15:28 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 15:27 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1058.eqiad.wmnet with OS bullseye
  • 15:08 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1058.eqiad.wmnet with reason: host reimage
  • 15:05 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1058.eqiad.wmnet with reason: host reimage
  • 14:59 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
  • 14:59 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
  • m: finished running 'homer "status:active" commit "netmon: Add the netmon1003 host as a syslog destination"' in the cumin1001 host. Homer reported no errors.
  • 14:54 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
  • 14:50 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1058.eqiad.wmnet with OS bullseye
  • 14:28 bking@cumin1001: conftool action : set/pooled=false; selector: dnsdisc=wdqs,name=codfw
  • 13:57 kevinbazira@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
  • 13:57 kevinbazira@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
  • m: Add the new netmon1003 host as a syslog destination in homer templates/common/system.conf https://gerrit.wikimedia.org/r/c/operations/homer/public/+/819124
  • m: Successfully ran '# run-puppet-merge' in the netmon1002 and netmon1003 hosts.
  • m: Running '# run-puppet-agent' in the netmon1003 host
  • m: Running '# run-puppet-agent' in the netmon1002 host
  • 13:47 ryankemper@cumin1001: END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
  • 13:46 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.force-shard-allocation
  • m: puppet-merge on puppetmaster2004.codfw.wmnet for patch 819179 succeeded
  • m: Set netmon1003 as netmon_server and netmon1002 as a netmon_servers_failover in the Puppet repository https://gerrit.wikimedia.org/r/c/operations/puppet/+/819179
  • m: authdns updated successfully
  • m: Had to revert https://gerrit.wikimedia.org/r/c/operations/dns/+/819177 because I rebased my changes incorrectly, sent the new patch in https://gerrit.wikimedia.org/r/c/operations/dns/+/821746
  • m: running '# authdns-update' in ns0.wikimedia.org
  • m: Flip DNS for LibreNMS and Smokeping from netmon1002 to netmon1003 https://gerrit.wikimedia.org/r/c/operations/dns/+/819177
  • 13:23 jynus: stop replication on db1117:m1 T309074
  • m: netmon1002 to netmon1003 failover
  • 13:17 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
  • 13:16 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
  • 10:58 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
  • 09:53 vgutierrez: rolling restart of pybal in eqsin - T310070
  • 09:25 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
  • 09:24 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
  • 09:24 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
  • 09:12 vgutierrez: rolling restart of pybal in codfw - T310070
  • 08:47 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
  • 08:30 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
  • 08:28 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
  • 08:27 elukey@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
  • 08:27 elukey@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
  • 08:27 elukey@deploy1002: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
  • 08:26 elukey@deploy1002: helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
  • 08:24 jynus: starting data check using es1021 and es2021, expect increased read traffic T314559
  • 08:21 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
  • 06:22 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance
  • 06:22 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance
  • 06:19 Amir1: dbmaint s5@eqiad (T312863 T312984 T310011 T310485)
  • 06:11 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1130.eqiad.wmnet with reason: Maint
  • 06:11 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1130.eqiad.wmnet with reason: Maint
  • 06:08 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool db1130 T314370', diff saved to https://phabricator.wikimedia.org/P32323 and previous config saved to /var/cache/conftool/dbconfig/20220809-060836-ladsgroup.json
  • 06:07 oblivian@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
  • 06:02 ladsgroup@cumin1001: dbctl commit (dc=all): 'Promote db1100 to s5 primary and set section read-write T314370', diff saved to https://phabricator.wikimedia.org/P32322 and previous config saved to /var/cache/conftool/dbconfig/20220809-060159-ladsgroup.json
  • 06:01 ladsgroup@cumin1001: dbctl commit (dc=all): 'Set s5 eqiad as read-only for maintenance - T314370', diff saved to https://phabricator.wikimedia.org/P32321 and previous config saved to /var/cache/conftool/dbconfig/20220809-060105-ladsgroup.json
  • 06:00 Amir1: Starting s5 eqiad failover from db1130 to db1100 - T314370
  • 05:12 ladsgroup@cumin1001: dbctl commit (dc=all): 'Set db1100 with weight 0 T314370', diff saved to https://phabricator.wikimedia.org/P32320 and previous config saved to /var/cache/conftool/dbconfig/20220809-051251-ladsgroup.json
  • 05:12 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 22 hosts with reason: Primary switchover s5 T314370
  • 05:11 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on 22 hosts with reason: Primary switchover s5 T314370
  • 02:42 ejegg: SmashPig upgraded from 9b97ea15 to 13e9e9cc
  • 02:31 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1148 (T312863)', diff saved to https://phabricator.wikimedia.org/P32318 and previous config saved to /var/cache/conftool/dbconfig/20220809-023113-ladsgroup.json
  • 02:31 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance
  • 02:30 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance
  • 02:30 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T312863)', diff saved to https://phabricator.wikimedia.org/P32317 and previous config saved to /var/cache/conftool/dbconfig/20220809-023052-ladsgroup.json
  • 02:28 ejegg: payments-wiki upgraded from 6880236d to cf5e1848
  • 02:15 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P32316 and previous config saved to /var/cache/conftool/dbconfig/20220809-021546-ladsgroup.json
  • 02:00 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P32315 and previous config saved to /var/cache/conftool/dbconfig/20220809-020040-ladsgroup.json
  • 01:45 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T312863)', diff saved to https://phabricator.wikimedia.org/P32314 and previous config saved to /var/cache/conftool/dbconfig/20220809-014534-ladsgroup.json

2022-08-08

  • 23:52 tstarling@deploy1002: Synchronized wmf-config/InitialiseSettings.php: clean up testwiki experiments T314750 (duration: 03m 19s)
  • 23:47 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 23:46 tstarling@deploy1002: Synchronized wmf-config/CommonSettings.php: clean up testwiki experiments T314750 (duration: 03m 27s)
  • 23:46 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 23:46 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 23:45 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 23:32 eileen___: config revision changed from f5668044 to 787cd0e0
  • 23:32 eileen___: civicrm upgraded from 497bddf7 to 1f91ac2d
  • 22:16 ryankemper@cumin1001: END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
  • 22:16 ryankemper@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host elastic1065.eqiad.wmnet with OS bullseye
  • 21:53 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1065.eqiad.wmnet with reason: host reimage
  • 21:50 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1065.eqiad.wmnet with reason: host reimage
  • 21:36 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1065.eqiad.wmnet with OS bullseye
  • 21:12 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1062.eqiad.wmnet with OS bullseye
  • 20:53 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1062.eqiad.wmnet with reason: host reimage
  • 20:50 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1062.eqiad.wmnet with reason: host reimage
  • 20:36 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1062.eqiad.wmnet with OS bullseye
  • 20:32 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 20:31 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 20:31 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 20:29 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
  • 20:28 cjming: end of UTC late backport window
  • 20:27 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 20:27 cjming@deploy1002: Synchronized php-1.39.0-wmf.23/skins/Vector/resources/skins.vector.styles/layouts/grid.less: Backport: Fix grid blowout bug (T314756) (duration: 03m 26s)
  • 20:12 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 20:11 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 20:11 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 20:11 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Disable sticky header edit A/B test for pilot wikis (T312296) (duration: 03m 35s)
  • 20:08 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 17:34 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1088.eqiad.wmnet with OS bullseye
  • 17:15 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1088.eqiad.wmnet with reason: host reimage
  • 17:12 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1088.eqiad.wmnet with reason: host reimage
  • 17:00 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1088.eqiad.wmnet with OS bullseye
  • 16:54 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1085.eqiad.wmnet with OS bullseye
  • 16:49 ryankemper@cumin1001: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
  • 16:43 pt1979@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 16:41 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
  • 16:39 pt1979@cumin2002: START - Cookbook sre.dns.netbox
  • 16:38 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
  • 16:26 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1085.eqiad.wmnet with OS bullseye
  • 16:24 bking@cumin1001: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host elastic1085.eqiad.wmnet with OS bullseye
  • 16:19 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
  • 16:16 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
  • 16:16 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
  • 16:14 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
  • 16:12 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
  • 16:10 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
  • 16:09 ryankemper@cumin1001: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
  • 16:04 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1085.eqiad.wmnet with OS bullseye
  • 16:00 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1084.eqiad.wmnet with OS bullseye
  • 15:58 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
  • 15:47 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1084.eqiad.wmnet with reason: host reimage
  • 15:46 sukhe: upload reprepro -C main include bullseye-wikimedia python-pynetbox_6.6.0-1+wmf11u1_amd64.changes
  • 15:45 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1084.eqiad.wmnet with reason: host reimage
  • 15:37 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2021.codfw.wmnet with reason: Maint
  • 15:37 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2021.codfw.wmnet with reason: Maint
  • 15:32 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1084.eqiad.wmnet with OS bullseye
  • 14:59 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
  • 14:55 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
  • 14:47 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cp5001.eqsin.wmnet with reason: depooled: faulty DIMM: T314256
  • 14:46 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on cp5001.eqsin.wmnet with reason: depooled: faulty DIMM: T314256
  • 14:34 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
  • 14:11 kevinbazira@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
  • 13:03 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 13:01 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 13:01 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 12:58 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 12:56 urbanecm@deploy1002: Synchronized wmf-config/CommonSettings.php: 77fd5ab: Growth: Add new rights to wgAvailableRights (duration: 03m 24s)
  • 12:30 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1102.eqiad.wmnet
  • 12:09 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 12:09 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 12:06 urbanecm@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/GrowthExperiments/: 3eaf155: MentorTools: Do not use MentorWeightManager (T314362) (duration: 03m 31s)
  • 12:04 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 11:43 btullis@cumin1001: START - Cookbook sre.hosts.reboot-single for host an-worker1102.eqiad.wmnet
  • 11:21 jelto@cumin1001: conftool action : set/pooled=yes; selector: name=kubernetes2022.codfw.wmnet
  • 11:21 jelto: kubectl uncordon kubernetes2022.codfw.wmnet
  • 10:43 Amir1: Removing db2079 from orchestrator (T313885)
  • 10:39 Amir1: Removing db2079 from zarcillo (T313885)
  • 10:35 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2079.codfw.wmnet
  • 10:35 ladsgroup@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 10:30 ladsgroup@cumin1001: START - Cookbook sre.dns.netbox
  • 10:25 ladsgroup@cumin1001: START - Cookbook sre.hosts.decommission for hosts db2079.codfw.wmnet
  • 10:18 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on db2079.codfw.wmnet with reason: Decom
  • 10:17 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on db2079.codfw.wmnet with reason: Decom
  • 08:41 jbond: deploy libtirpc update
  • 07:57 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1144:3314 (T312863)', diff saved to https://phabricator.wikimedia.org/P32310 and previous config saved to /var/cache/conftool/dbconfig/20220808-075723-ladsgroup.json
  • 07:57 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
  • 07:57 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
  • 07:57 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142 (T312863)', diff saved to https://phabricator.wikimedia.org/P32309 and previous config saved to /var/cache/conftool/dbconfig/20220808-075702-ladsgroup.json
  • 07:53 godog: grow sda/sdb 3 by 100G on thanos-be2001 - T314275
  • 07:50 godog: grow sda/sdb 3 by 100G on thanos-be1004 - T314275
  • 07:41 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P32308 and previous config saved to /var/cache/conftool/dbconfig/20220808-074156-ladsgroup.json
  • 07:32 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 07:27 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 07:27 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 07:26 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P32307 and previous config saved to /var/cache/conftool/dbconfig/20220808-072650-ladsgroup.json
  • 07:23 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 07:22 kartik@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: trwikivoyage: Create rollbacker user group (T314678) (duration: 03m 17s)
  • 07:18 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 07:17 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 07:17 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 07:16 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 07:11 elukey: restart rsyslog on ml-serve2007
  • 07:11 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
  • 07:11 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142 (T312863)', diff saved to https://phabricator.wikimedia.org/P32306 and previous config saved to /var/cache/conftool/dbconfig/20220808-071144-ladsgroup.json
  • 07:10 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
  • 07:10 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 07:09 kartik@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Enable SectionTranslation on 10 Wikipedias where ContentTranslation is default (T308829) (duration: 03m 15s)
  • 07:09 mwdebug-deploy@deploy1002: helmfile [eqiad] STA