
Server Admin Log: Difference between revisions

imported>Stashbot
(eileen: civicrm revision changed from 27d5900f7d to ce28723709, config revision is 706cf3c898)
imported>Stashbot
(sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=varnish-fe)
(648 intermediate revisions by 4 users not shown)
Line 1: Line 1:
== 2020-08-20 ==
== 2022-08-11 ==
* 22:31 eileen: civicrm revision changed from {{Gerrit|27d5900f7d}} to {{Gerrit|ce28723709}}, config revision is {{Gerrit|706cf3c898}}
* 00:58 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=varnish-fe
* 22:20 eileen: civicrm revision is {{Gerrit|27d5900f7d}}, config revision is {{Gerrit|706cf3c898}}
* 00:58 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=ats-be
* 22:20 mutante: permanently shut down tungsten.eqiad.wmnet [[phab:T260395|T260395]] [[phab:T158837|T158837]] [[phab:T180761|T180761]] [[phab:T224549|T224549]]
* 00:58 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=ats-tls
* 22:18 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
* 00:57 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on cp2042.codfw.wmnet with reason: host down; depooled and will debug tomorrow
* 22:17 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
* 00:57 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on cp2042.codfw.wmnet with reason: host down; depooled and will debug tomorrow
* 21:35 ejegg: updated fundraising CiviCRM from {{Gerrit|958a79f660}} to {{Gerrit|27d5900f7d}}
* 20:53 cdanis: repool eqsin
* 20:37 ppchelko@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
* 20:36 ppchelko@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
* 20:34 ppchelko@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 20:25 cdanis: cdanis@cr2-eqsin> request vmhost reboot
* 20:17 ppchelko@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
* 20:16 ppchelko@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
* 20:13 cdanis: cdanis@cr2-eqsin> request vmhost software add /var/tmp/junos-vmhost-install-mx-x86-64-18.2R3-S5.3.tgz
* 20:11 ppchelko@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 20:02 cdanis: depool eqsin for router upgrade
* 19:57 ayounsi@cumin1001: END (ERROR) - Cookbook sre.network.prepare-upgrade (exit_code=97)
* 19:37 ppchelko@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
* 19:34 ppchelko@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
* 19:34 ppchelko@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
* 19:24 ppchelko@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 19:17 bstorm@cumin1001: END (FAIL) - Cookbook wmcs.wikireplicas.add_wiki (exit_code=99)
* 19:17 bstorm@cumin1001: START - Cookbook wmcs.wikireplicas.add_wiki
* 19:17 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.5  refs [[phab:T257973|T257973]]
* 19:08 mutante: restarted apache on cont2001 for integration.wikimedia.org docroot change
* 19:07 mutante: switching document root of integration.wikimedia.org to scap ([[phab:T149924|T149924]])
* 19:02 twentyafterfour: 1.36.0-wmf.5 has no known blockers and logspam is cleaned up, time to roll group2 wikis to wmf.5
* 18:42 bstorm@cumin1001: END (FAIL) - Cookbook wmcs.wikireplicas.add_wiki (exit_code=99)
* 18:42 bstorm@cumin1001: START - Cookbook wmcs.wikireplicas.add_wiki
* 18:19 mutante: ores1004 - starting failed celery-ores-worker
* 18:18 mutante: testreduce1001 - rt_client and vd_client now properly stopped by puppet [[phab:T257906|T257906]]
* 17:29 shdubsh: restart elasticsearch on logstash1012 (not 1020) -- high gc runtimes
* 17:28 shdubsh: restart elasticsearch on logstash1020 -- high gc runtimes
* 17:23 ayounsi@cumin1001: START - Cookbook sre.network.prepare-upgrade
* 17:23 ayounsi@cumin1001: END (ERROR) - Cookbook sre.network.prepare-upgrade (exit_code=97)
* 17:23 ayounsi@cumin1001: START - Cookbook sre.network.prepare-upgrade
* 17:22 ayounsi@cumin1001: END (FAIL) - Cookbook sre.network.prepare-upgrade (exit_code=99)
* 17:22 ayounsi@cumin1001: START - Cookbook sre.network.prepare-upgrade
* 16:48 pt1979@cumin2001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:48 ppchelko@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
* 16:46 ppchelko@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
* 16:45 pt1979@cumin2001: START - Cookbook sre.dns.netbox
* 16:43 ppchelko@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 16:40 _joe_: restarted apache2 on icinga1001
* 16:13 ppchelko@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
* 16:11 shdubsh: restart elasticsearch on logstash1011 -- long gc runs
* 16:10 ppchelko@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
* 16:08 ppchelko@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
* 16:02 ppchelko@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 14:55 ppchelko@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 14:06 oblivian@deploy1001: Finished deploy [ores/deploy@8540eec]: various configuration fixes (duration: 09m 03s)
* 13:57 oblivian@deploy1001: Started deploy [ores/deploy@8540eec]: various configuration fixes
* 13:53 ppchelko@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 13:53 oblivian@deploy1001: Finished deploy [ores/deploy@e860508]: switch everything to use envoy as a service proxy [[phab:T244843|T244843]] (duration: 14m 00s)
* 13:39 oblivian@deploy1001: Started deploy [ores/deploy@e860508]: switch everything to use envoy as a service proxy [[phab:T244843|T244843]]
* 13:26 oblivian@deploy1001: Finished deploy [ores/deploy@74677b6]: switch testwiki to use envoy as a service proxy [[phab:T244843|T244843]] (take 2) (duration: 11m 37s)
* 13:14 oblivian@deploy1001: Started deploy [ores/deploy@74677b6]: switch testwiki to use envoy as a service proxy [[phab:T244843|T244843]] (take 2)
* 13:11 oblivian@deploy1001: Finished deploy [ores/deploy@a208a0e]: switch testwiki to use envoy as a service proxy [[phab:T244843|T244843]] (duration: 11m 19s)
* 13:09 gehel: repool wdqs1007 - caught up on lag
* 13:00 oblivian@deploy1001: Started deploy [ores/deploy@a208a0e]: switch testwiki to use envoy as a service proxy [[phab:T244843|T244843]]
* 12:51 oblivian@deploy1001: Finished deploy [ores/deploy@a208a0e]: switch testwiki to use envoy as a service proxy [[phab:T244843|T244843]] (duration: 07m 03s)
* 12:44 oblivian@deploy1001: Started deploy [ores/deploy@a208a0e]: switch testwiki to use envoy as a service proxy [[phab:T244843|T244843]]
* 11:49 Lucas_WMDE: EU backport window done
* 11:44 urbanecm@deploy1001: Synchronized php-1.36.0-wmf.4/extensions/AbuseFilter/includes/AbuseFilterHooks.php: {{Gerrit|d762e7b5526d91fe21e5980bc5e9f3be06a2f85c}}: Use $user param when filtering edits ([[phab:T258717|T258717]]) (duration: 01m 05s)
* 11:41 eileen: civicrm revision changed from {{Gerrit|6c9441a18e}} to {{Gerrit|958a79f660}}, config revision is {{Gerrit|706cf3c898}}
* 11:38 urbanecm@deploy1001: Synchronized php-1.36.0-wmf.5/extensions/AbuseFilter/includes/AbuseFilterHooks.php: {{Gerrit|00da39b6913ac2eab600bbb61258472b60d2cbcb}}: Use $user param when filtering edits ([[phab:T258717|T258717]]) (duration: 01m 05s)
* 11:32 lucaswerkmeister-wmde@deploy1001: Synchronized php-1.36.0-wmf.5/extensions/Wikibase/client/data-bridge/dist/: Backport: [[gerrit:621488{{!}}Don't try to load source maps in production (T260852)]] (duration: 01m 07s)
* 11:07 mlitn@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Fix testwikidata depicts id & CirrusSearchUserTesting config (duration: 01m 06s)
* 11:07 Urbanecm: [urbanecm@mwmaint1002 ~]$ mwscript emptyUserGroup.php --wiki=trwiki editor # [[phab:T260899|T260899]]
* 10:58 XioNoX: re-pool codfw - [[phab:T259621|T259621]]
* 10:53 XioNoX: un-drain cr1-codfw - [[phab:T259621|T259621]]
* 10:45 XioNoX: cr1-codfw> request chassis routing-engine master switch - [[phab:T259621|T259621]]
* 10:26 hashar: Restarted zuul-merger instances on contint1001 and contint2001
* 10:24 hashar@deploy1001: Finished deploy [zuul/deploy@8a05b4d]: Support Gerrit replication events (duration: 00m 24s)
* 10:24 hashar@deploy1001: Started deploy [zuul/deploy@8a05b4d]: Support Gerrit replication events
* 10:21 XioNoX: cr1-codfw> request chassis routing-engine master switch - [[phab:T259621|T259621]]
* 10:12 XioNoX: reboot cr1-codfw:re1 (backup) for upgrade - [[phab:T259621|T259621]]
* 09:57 XioNoX: bump cr1-codfw OSPF metrics - [[phab:T259621|T259621]]
* 09:51 XioNoX: enable transit/peering and re-set normal OSPF values on cr2-codfw - [[phab:T259621|T259621]]
* 09:41 XioNoX: cr2-codfw> request chassis routing-engine master switch - [[phab:T259621|T259621]]
* 09:36 eileen: civicrm revision changed from {{Gerrit|cf9fadbeed}} to {{Gerrit|6c9441a18e}}, config revision is {{Gerrit|706cf3c898}}
* 09:33 XioNoX: reboot cr2-codfw:re0 (backup) for upgrade - [[phab:T259621|T259621]]
* 09:18 XioNoX: cr2-codfw> request chassis routing-engine master switch - [[phab:T259621|T259621]]
* 09:18 kormat: stress-testing db2125 [[phab:T260670|T260670]]
* 09:08 XioNoX: reboot cr2-codfw:re1 (backup) for upgrade - [[phab:T259621|T259621]]
* 09:03 kormat@cumin1001: dbctl commit (dc=all): 'Repool db2125 after host failure [[phab:T260670|T260670]]', diff saved to https://phabricator.wikimedia.org/P12303 and previous config saved to /var/cache/conftool/dbconfig/20200820-090313-kormat.json
* 08:52 kormat: removing /usr/bin/check_mariadb.py from all db hosts [[phab:T259516|T259516]]
* 08:52 XioNoX: disable transit/peering on cr2-codfw - [[phab:T259621|T259621]]
* 08:48 XioNoX: bump cr2-codfw OSPF metrics - [[phab:T259621|T259621]]
* 08:44 jynus: running analyze table on db1115's tendril.global_status_log, may cause some stalls on tendril/dbtree [[phab:T260876|T260876]]
* 08:41 XioNoX: depool codfw for routers upgrade - [[phab:T259621|T259621]]
* 08:31 XioNoX: enable transit/peering on cr3-knams - [[phab:T259621|T259621]]
* 08:21 XioNoX: reboot cr3-knams for upgrade - [[phab:T259621|T259621]]
* 08:07 XioNoX: disable transit/peering on cr3-knams - [[phab:T259621|T259621]]
* 07:39 hashar: contint2001: restarted zuul
* 07:29 hashar: contint1001: restarted zuul-merger
* 07:29 hashar@deploy1001: Finished deploy [zuul/deploy@5989ed0]: Upgrade gear from 0.7.0 to 1.15.1+wmf1 - [[phab:T258630|T258630]] (duration: 00m 13s)
* 07:28 hashar@deploy1001: Started deploy [zuul/deploy@5989ed0]: Upgrade gear from 0.7.0 to 1.15.1+wmf1 - [[phab:T258630|T258630]]
* 01:54 ejegg: re-enabled fundraising scheduled jobs
* 00:51 mutante: ms-be1039 - started failed ferm service
* 00:35 ejegg: stopped fundraising scheduled jobs
* 00:27 eileen: civicrm revision changed from {{Gerrit|c442a09153}} to {{Gerrit|cf9fadbeed}}, config revision is {{Gerrit|3cdffd4fc2}}


== 2020-08-19 ==
== 2022-08-10 ==
* 23:20 Urbanecm: Evening B&C window closed
* 21:25 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1016.eqiad.wmnet
* 23:08 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|a80899948c26ca36b970b80fbad07600fe4ce92c}}: Enable VisualEditor in namespaces Draft and Wikiproject on hywiki ([[phab:T260825|T260825]]) (duration: 01m 05s)
* 21:23 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1014.eqiad.wmnet
* 22:41 eileen: civicrm revision changed from {{Gerrit|34f95a3311}} to {{Gerrit|c442a09153}}, config revision is {{Gerrit|3cdffd4fc2}}
* 21:10 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 16 hosts with reason: [[phab:T309810|T309810]]
* 21:27 eileen: civicrm revision changed from {{Gerrit|154519cc1f}} to {{Gerrit|34f95a3311}}, config revision is {{Gerrit|3cdffd4fc2}}
* 21:10 bking@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on 16 hosts with reason: [[phab:T309810|T309810]]
* 21:17 cdanis@cumin1001: END (PASS) - Cookbook sre.network.cf (exit_code=0)
* 21:09 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on elastic[1101-1102].eqiad.wmnet with reason: [[phab:T309810|T309810]]
* 21:17 cdanis@cumin1001: START - Cookbook sre.network.cf
* 21:09 bking@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on elastic[1101-1102].eqiad.wmnet with reason: [[phab:T309810|T309810]]
* 20:39 dpifke@deploy1001: Finished deploy [performance/arc-lamp@2ef1af7]: Deploy fixes for notifications and OOM prevention ([[phab:T259167|T259167]]) (duration: 00m 06s)
* 21:00 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:39 dpifke@deploy1001: Started deploy [performance/arc-lamp@2ef1af7]: Deploy fixes for notifications and OOM prevention ([[phab:T259167|T259167]])
* 21:00 cjming: end of UTC late backport window
* 19:43 ebernhardson: restart mjolnir-kafka-bulk-daemon on search-loader2001 with debug logging
* 20:59 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 19:20 mutante: testreduce1001 - re-enabled puppet, confirmed parsoid-rt service was now stopped properly by puppet while it runs as before on scandium, the previous parsoid-testing host. switching it over is now a Hiera one-liner. ([[phab:T257906|T257906]])
* 20:59 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 19:15 twentyafterfour@deploy1001: Synchronized php: group1 wikis to 1.36.0-wmf.5  refs [[phab:T257973|T257973]] (duration: 01m 04s)
* 20:59 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820533{{!}}Remove unused $wgEnableMWSuggest]] (duration: 03m 04s)
* 19:14 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.5  refs [[phab:T257973|T257973]]
* 20:58 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 19:02 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|60af096b80a8ef7bc94ec40ce203fd27b0c97f26}}: Add autopatrolled group at arzwiki ([[phab:T260761|T260761]]) (duration: 01m 04s)
* 20:56 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820568{{!}}Enable new topic tool on dewiki (T313699)]] (duration: 03m 01s)
* 18:52 mutante: testreduce1001 - disable puppet; stop parsoid-rt service
* 20:34 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:822093{{!}}testwiki: set $wgCdnMatchParameterOrder to false (T314868)]] (duration: 03m 20s)
* 18:47 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|924a03bd624d6750a7e776e09713056cc45e5cc5}}: Add clinton.presidentiallibraries.us to the wgCopyUploadsDomains allowlist of Wikimedia Commons ([[phab:T259927|T259927]]) (duration: 01m 04s)
* 20:31 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 18:45 urbanecm@deploy1001: Synchronized wmf-config/CommonSettings.php: {{Gerrit|83b34e1bd1ed804a70f67e089580e082f89e2a0f}}: ClosedWikiProvider: Use testUserForCreation rather than testForAuthentication ([[phab:T258695|T258695]]) (duration: 01m 04s)
* 20:30 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 18:39 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|95d45f6e002df78d4860a711042d77a6b0bdecb9}}: Dont index Draft (118) and Draft talk (119) on hywiki ([[phab:T260804|T260804]]) (duration: 01m 04s)
* 20:30 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 18:32 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|803cb1a0d2c8cc6df8e4e88ab3c4d27eb71d01b3}}: Update taglines for various projects ([[phab:T258552|T258552]]) (duration: 01m 04s)
* 20:30 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 18:30 urbanecm@deploy1001: Synchronized static/images/mobile/copyright/: {{Gerrit|803cb1a0d2c8cc6df8e4e88ab3c4d27eb71d01b3}}: Update taglines for various projects ([[phab:T258552|T258552]]) (duration: 01m 06s)
* 20:19 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 18:25 mutante: rebooting webperf1002 VM on ganeti level (outside OS) to upgrade from 8 to 16GB RAM ([[phab:T260192|T260192]])
* 20:18 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 18:24 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 20:18 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 18:24 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
* 20:17 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 18:24 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|bb4aa44b0bd5b2b33d190d3af81e038e5fc55e3f}}: Configure namespaces on commons to include categories ([[phab:T198716|T198716]]) (duration: 01m 04s)
* 20:12 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 18:21 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|b9043331c1c1b352256cffd471b9ff128806607c}}: Update project wordmarks ([[phab:T254788|T254788]]; sync 2/2) (duration: 01m 04s)
* 20:09 bking@cumin1001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
* 18:19 urbanecm@deploy1001: Synchronized static/images/mobile/copyright/: {{Gerrit|b9043331c1c1b352256cffd471b9ff128806607c}}: Update project wordmarks ([[phab:T254788|T254788]]; sync 1/2) (duration: 01m 06s)
* 20:08 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 18:15 mutante: rebooting webperf2002 VM on ganeti level (outside OS) to upgrade from 8 to 16GB RAM ([[phab:T260192|T260192]])
* 20:08 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 18:15 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|a6f8354e7599a5e92bea060807065f5b42c540e5}}: Enable $wgMFNoindexPages for all wikis ([[phab:T255458|T255458]]) (duration: 01m 07s)
* 20:08 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820646{{!}}Start writing to cuc_actor everywhere except s4 and s8 (T233004)]] (duration: 03m 15s)
* 18:13 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 20:07 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 18:13 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
* 19:51 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2053-2054].codfw.wmnet
* 18:13 ppchelko@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 19:51 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc[2053-2054].codfw.wmnet
* 17:38 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
* 19:35 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2019-2020].codfw.wmnet
* 17:38 mutante: decom'ing releases2001.codfw.wmnet (
* 19:35 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for parse[2019-2020].codfw.wmnet
* 17:37 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
* 19:35 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2016-2018].codfw.wmnet
* 16:39 ppchelko@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 19:35 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for parse[2016-2018].codfw.wmnet
* 16:37 ppchelko@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 19:34 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2036.codfw.wmnet
* 16:32 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 19:34 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc2036.codfw.wmnet
* 16:30 andrew@cumin1001: START - Cookbook sre.hosts.downtime
* 19:28 sukhe: testing ATS 9.1.3-1wm1 on cp4026: [[phab:T309651|T309651]]
* 15:41 rzl: finished exercising the switchdc cookbooks with --live-test for now, all changes reverted including re-enabling puppet on cumin1001
* 19:09 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1087.eqiad.wmnet with OS bullseye
* 15:38 rzl@cumin1001: END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0)
* 19:06 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1086.eqiad.wmnet with OS bullseye
* 15:37 rzl@cumin1001: START - Cookbook sre.switchdc.mediawiki.08-start-maintenance
* 18:55 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1087.eqiad.wmnet with reason: host reimage
* 15:34 rzl@cumin1001: END (PASS) - Cookbook sre.switchdc.mediawiki.08-restore-ttl (exit_code=0)
* 18:51 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1086.eqiad.wmnet with reason: host reimage
* 15:34 rzl@cumin1001: START - Cookbook sre.switchdc.mediawiki.08-restore-ttl
* 18:50 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1087.eqiad.wmnet with reason: host reimage
* 15:33 rzl@cumin1001: END (FAIL) - Cookbook sre.switchdc.mediawiki.08-restore-ttl (exit_code=99)
* 18:49 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1086.eqiad.wmnet with reason: host reimage
* 15:33 rzl@cumin1001: START - Cookbook sre.switchdc.mediawiki.08-restore-ttl
* 18:47 bking@cumin1001: START - Cookbook sre.wdqs.data-transfer
* 15:31 jbond42: update java.security https://gerrit.wikimedia.org/r/c/operations/puppet/+/593467
* 18:38 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1087.eqiad.wmnet with OS bullseye
* 15:30 oblivian@cumin1001: conftool action : set/ttl=300; selector: dnsdisc=api-rw
* 18:36 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1086.eqiad.wmnet with OS bullseye
* 15:26 rzl@cumin1001: END (FAIL) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=99)
* 18:22 urandom: truncating Cassandra hints (eqiad datacenter)  -- [[phab:T314941|T314941]]
* 15:26 rzl@cumin1001: START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl
* 18:13 urandom: truncating codfw Cassandra hints (eqiad datacenter)  -- [[phab:T314941|T314941]]
* 15:22 rzl@cumin1001: END (FAIL) - Cookbook sre.switchdc.mediawiki.08-restore-ttl (exit_code=99)
* 18:07 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main2005.codfw.wmnet
* 15:22 rzl@cumin1001: START - Cookbook sre.switchdc.mediawiki.08-restore-ttl
* 18:07 rzl@cumin1001: START - Cookbook sre.hosts.remove-downtime for kafka-main2005.codfw.wmnet
* 15:18 godog: prometheus codfw lvextend --resizefs --size +80G /dev/mapper/vg--ssd-prometheus--ops
* 18:05 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repool D8 DBs after PDU maint ([[phab:T310146|T310146]])', diff saved to https://phabricator.wikimedia.org/P32346 and previous config saved to /var/cache/conftool/dbconfig/20220810-180529-ladsgroup.json
* 15:17 rzl@cumin1001: END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0)
* 17:42 otto@deploy1002: Finished deploy [analytics/refinery@6e47e0e]: Add missing changes to the deletion script - [[phab:T270433|T270433]] - [analytics/refinery@6e47e0e] (duration: 05m 28s)
* 15:17 rzl@cumin1001: START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl
* 17:39 fnegri@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts labweb1002.wikimedia.org
* 15:16 rzl@cumin1001: END (FAIL) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=99)
* 17:39 fnegri@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 15:16 rzl@cumin1001: START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl
* 17:36 otto@deploy1002: Started deploy [analytics/refinery@6e47e0e]: Add missing changes to the deletion script - [[phab:T270433|T270433]] - [analytics/refinery@6e47e0e]
* 15:14 rzl@cumin1001: END (PASS) - Cookbook sre.switchdc.mediawiki.08-restore-ttl (exit_code=0)
* 17:35 fnegri@cumin1001: START - Cookbook sre.dns.netbox
* 15:14 rzl@cumin1001: START - Cookbook sre.switchdc.mediawiki.08-restore-ttl
* 17:34 otto@deploy1002: Finished deploy [analytics/refinery@6e47e0e] (hadoop-test): Add missing changes to the deletion script - [[phab:T270433|T270433]] - TEST [analytics/refinery@6e47e0e] (duration: 04m 19s)
* 15:08 rzl@cumin1001: END (FAIL) - Cookbook sre.switchdc.mediawiki.08-restore-ttl (exit_code=99)
* 17:30 fnegri@cumin1001: START - Cookbook sre.hosts.decommission for hosts labweb1002.wikimedia.org
* 15:08 rzl@cumin1001: START - Cookbook sre.switchdc.mediawiki.08-restore-ttl
* 17:30 otto@deploy1002: Started deploy [analytics/refinery@6e47e0e] (hadoop-test): Add missing changes to the deletion script - [[phab:T270433|T270433]] - TEST [analytics/refinery@6e47e0e]
* 15:06 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
* 17:09 dzahn@cumin2002: START - Cookbook sre.dns.netbox
* 15:04 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
* 17:08 otto@deploy1002: Started deploy [analytics/refinery@d4dd7e4] (hadoop-test): Add safety limits to refinery-drop-older-than - [[phab:T270433|T270433]] - TEST [analytics/refinery@d4dd7e4]
* 14:50 rzl@cumin1001: END (FAIL) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=99)
* 17:06 sukhe: testing ATS 9.1.3-1wm1 on cp4032: [[phab:T309651|T309651]]
* 14:50 rzl@cumin1001: START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl
* 17:06 urandom: flushing RESTBase Cassandra tables -row B- to (temporarily) free instance-data space -- [[phab:T314941|T314941]]
* 14:50 rzl: running the switchdc cookbooks with --live-test, simulating a switch to eqiad where we're already running, no production impact is expected
* 17:05 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on krb2001.codfw.wmnet with reason: btullis codfw maintenance
* 14:47 rzl@cumin1001: END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0)
* 17:05 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on krb2001.codfw.wmnet with reason: btullis codfw maintenance
* 14:47 rzl@cumin1001: START - Cookbook sre.switchdc.mediawiki.00-disable-puppet
* 17:04 dzahn@cumin2002: START - Cookbook sre.hosts.decommission for hosts gerrit2001.wikimedia.org
* 14:41 rzl: disable puppet on cumin1001 for switchdc testing
* 17:02 sukhe: testing ATS 9.1.3-1wm1 on cp6008: [[phab:T309651|T309651]]
* 14:35 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
* 16:56 sukhe: testing ATS 9.1.3-1wm1 on cp6016: [[phab:T309651|T309651]]
* 14:33 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
* 16:55 fnegri@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts labweb1001.wikimedia.org
* 14:27 ppchelko@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 16:55 fnegri@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:38 ppchelko@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 16:32 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts gerrit2001.wikimedia.org
* 13:34 gehel: depooling wdqs1007 and restarting blazegraph
* 16:32 dzahn@cumin2002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
* 13:29 _joe_: depooling and disabling puppet on restbase1024 for further investigation
* 16:32 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes[2013-2014].codfw.wmnet
* 13:27 ppchelko@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
* 16:31 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for kubernetes[2013-2014].codfw.wmnet
* 13:26 ppchelko@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
* 16:31 jelto: kubectl uncordon kubernetes2014.codfw.wmnet
* 13:25 ppchelko@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 16:31 fnegri@cumin1001: START - Cookbook sre.dns.netbox
* 13:10 pt1979@cumin2001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:30 jelto: kubectl uncordon kubernetes2013.codfw.wmnet
* 13:03 _joe_: building and uploading fluent-bit, ratelimit images
* 16:29 urandom: restarting Cassandra (RESTBase) -row A- to apply r822110 -- [[phab:T314941|T314941]]
* 13:01 pt1979@cumin2001: START - Cookbook sre.dns.netbox
* 16:27 dzahn@cumin2002: START - Cookbook sre.dns.netbox
* 12:57 _joe_: building a new version of the base docker images
* 16:25 fnegri@cumin1001: START - Cookbook sre.hosts.decommission for hosts labweb1001.wikimedia.org
* 11:29 awight: EU bacon finished
* 16:23 mutante: shutting down gerrit2001
* 11:28 effie: restart mwdebug* servers
* 16:23 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2034-2035].codfw.wmnet
* 11:08 awight@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:621227{{!}}Fix typos in flaggedrevs comments ()]] (duration: 01m 19s)
* 16:23 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc[2034-2035].codfw.wmnet
* 09:22 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
* 16:22 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2016-2018].codfw.wmnet
* 08:48 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 16:22 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for parse[2016-2018].codfw.wmnet
* 08:48 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 16:16 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=maps2010.codfw.wmnet
* 08:43 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 16:16 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=sessionstore2003.codfw.wmnet
* 08:43 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 16:13 sukhe: reprepro -C component/trafficserver9 include buster-wikimedia trafficserver_9.1.3-1wm1_amd64.changes: [[phab:T309651|T309651]]
* 08:36 XioNoX: update firewall policies on pfw - [[phab:T260585|T260585]]
* 16:13 dzahn@cumin2002: START - Cookbook sre.hosts.decommission for hosts gerrit2001.wikimedia.org
* 08:35 jayme: running puppet on A:all-mw-eqiad
* 16:11 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be[2039,2050,2056,2059].codfw.wmnet,thanos-be2004.codfw.wmnet with reason: PDU work
* 08:23 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 16:10 mvernon@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be[2039,2050,2056,2059].codfw.wmnet,thanos-be2004.codfw.wmnet with reason: PDU work
* 08:23 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 16:09 urandom: flushing tables in row D (RESTBase Cassandra cluster)  -- [[phab:T314941|T314941]]
* 08:20 godog: switch grafana.w.o to grafana 7 in codfw - [[phab:T259143|T259143]]
* 15:54 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for gitlab-runner2004.codfw.wmnet
* 08:19 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 15:54 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for gitlab-runner2004.codfw.wmnet
* 08:18 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 15:53 sukhe: poweroff cp2041, 42 for PDU upgrade: rack D7
* 08:14 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-cluster
* 15:51 urandom: flushing tables in row B (RESTBase Cassandra cluster)  -- [[phab:T314941|T314941]]
* 08:06 jayme: running puppet on A:all-mw-eqiad
* 15:49 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2004.codfw.wmnet with reason: PDU maintenance
* 07:46 godog: upgrade to grafana 7 on cloudmetrics hosts - [[phab:T259143|T259143]]
* 15:49 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2004.codfw.wmnet with reason: PDU maintenance
* 07:15 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
* 15:46 urandom: flushing tables in row A (RESTBase Cassandra cluster) -- [[phab:T314941|T314941]]
* 07:10 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 15:46 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2012.codfw.wmnet with reason: btullis codfw maintenance
* 06:39 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 15:46 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp[2041-2042].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
* 06:13 eileen: tools revision changed from {{Gerrit|b4ebd1e564}} to {{Gerrit|0b9d971bc4}}
* 15:46 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2012.codfw.wmnet with reason: btullis codfw maintenance
* 06:07 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 15:46 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 3:00:00 on cp[2041-2042].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
* 06:04 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-cluster
* 15:46 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2011.codfw.wmnet with reason: btullis codfw maintenance
* 06:03 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 15:46 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2011.codfw.wmnet with reason: btullis codfw maintenance
* 06:00 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 15:45 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2010.codfw.wmnet with reason: btullis codfw maintenance
* 05:55 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 15:45 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2010.codfw.wmnet with reason: btullis codfw maintenance
* 05:53 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 15:45 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs2009.codfw.wmnet with reason: btullis codfw maintenance
* 05:47 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 15:45 btullis@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs2009.codfw.wmnet with reason: btullis codfw maintenance
* 05:37 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 15:37 urandom: (ephemerally) increasing hinted hand-off delivery rate limit to 16KB, RESTBase eqiad nodes  -- [[phab:T314941|T314941]]
* 05:31 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 15:34 jbond: remove puppetmaster[12]002 from production
* 03:42 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 15:30 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main2004.codfw.wmnet
* 03:40 andrew@cumin1001: START - Cookbook sre.hosts.downtime
* 15:30 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for kafka-main2004.codfw.wmnet
* 02:53 cstone: civicrm revision changed from {{Gerrit|f5469d0a4c}} to {{Gerrit|154519cc1f}}
* 15:20 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2051-2052].codfw.wmnet
* 02:00 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
* 15:20 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc[2051-2052].codfw.wmnet
* 01:05 dpifke@deploy1001: Synchronized wmf-config/profiler.php: Deploying https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/620139 (duration: 01m 18s)
* 15:17 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc-gp2003.codfw.wmnet
* 00:49 dpifke@deploy1001: Synchronized wmf-config/ProductionServices.php: Disabling old XHGui backend ([[phab:T180761|T180761]]) (duration: 05m 13s)
* 15:17 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc-gp2003.codfw.wmnet
* 00:15 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-cluster
* 15:16 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2033.codfw.wmnet
* 15:16 jelto@cumin1001: START - Cookbook sre.hosts.remove-downtime for mc2033.codfw.wmnet
* 15:14 _joe_: power off krb2002
* 15:14 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on krb2002.codfw.wmnet with reason: PDU maintenance
* 15:13 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on krb2002.codfw.wmnet with reason: PDU maintenance
* 15:13 _joe_: shutting down rdb2010,puppetmaster2002 for d5 maintenance
* 15:02 jelto: power off mc2035
* 15:01 jelto: power off mc2034
* 15:01 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mc2035.codfw.wmnet with reason: PDU swap
* 15:01 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on mc2035.codfw.wmnet with reason: PDU swap
* 15:01 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mc2034.codfw.wmnet with reason: PDU swap
* 15:01 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on mc2034.codfw.wmnet with reason: PDU swap
* 14:43 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: PDU Maint ([[phab:T310146|T310146]])
* 14:43 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: PDU Maint ([[phab:T310146|T310146]])
* 14:38 urandom: disabling reserved space on eqiad nodes (RESTBase), /dev/md2 (aka /srv/cassandra/instance-data) -- [[phab:T314941|T314941]]
* 14:28 jelto: power off kafka-main2004 gracefully
* 14:28 hnowlan: shutting down sessionstore2003
* 14:27 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=sessionstore2003.codfw.wmnet
* 14:27 sukhe: power off cp2039, cp2040 for PDU upgrade: rack D
* 14:27 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2003.codfw.wmnet with reason: PDU maintenance
* 14:27 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2003.codfw.wmnet with reason: PDU maintenance
* 14:25 jelto: power off mc-gp2003
* 14:25 jelto: power off mc2033
* 14:24 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:30:00 on kafka-main2004.codfw.wmnet with reason: PDU swap
* 14:23 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:30:00 on kafka-main2004.codfw.wmnet with reason: PDU swap
* 14:23 sukhe: depool codfw for PDU upgrade: rack D
* 14:23 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:30:00 on mc-gp2003.codfw.wmnet with reason: PDU swap
* 14:23 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:30:00 on mc-gp2003.codfw.wmnet with reason: PDU swap
* 14:23 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:30:00 on mc2033.codfw.wmnet with reason: PDU swap
* 14:23 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:30:00 on mc2033.codfw.wmnet with reason: PDU swap
* 14:15 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp20[39{{!}}40]\.codfw\.wmnet,service=ats-tls
* 14:13 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp[2039-2040].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
* 14:13 urandom: flushing Cassandra tables, restbase1030
* 14:13 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 3:00:00 on cp[2039-2040].codfw.wmnet with reason: shutdown for PDU upgrade: rack D4
* 14:13 urandom: flushing Cassandra tables, restbase1019
* 14:12 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on dns2002.wikimedia.org with reason: shutdown for PDU upgrade: rack D4
* 14:12 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 3:00:00 on dns2002.wikimedia.org with reason: shutdown for PDU upgrade: rack D4
* 14:11 urandom: flushing Cassandra tables, restbase1017 1018 1021 1024 1025 1026 1028 1029
* 14:05 urandom: flushing tables, restbase1016
* 13:52 hnowlan: powered up restbase2018
* 13:32 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on netmon1003.wikimedia.org with reason: pdu
* 13:32 filippo@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on netmon1003.wikimedia.org with reason: pdu
* 13:32 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on logstash2003.codfw.wmnet with reason: pdu
* 13:31 filippo@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on logstash2003.codfw.wmnet with reason: pdu
* 13:31 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on logstash2029.codfw.wmnet with reason: pdu
* 13:31 root@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on logstash2029.codfw.wmnet with reason: pdu
* 13:30 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: [[phab:T310146|T310146]]
* 13:30 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: [[phab:T310146|T310146]]
* 13:17 elukey: powering on restbase2027
* 13:12 elukey: powering on restbase2026
* 13:12 _joe_: powering on restbase2023
* 13:01 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32343 and previous config saved to /var/cache/conftool/dbconfig/20220810-130108-ladsgroup.json
* 13:01 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
* 13:00 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
* 12:37 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic[2072,2084-2085].codfw.wmnet with reason: [[phab:T310146|T310146]]
* 12:37 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic[2072,2084-2085].codfw.wmnet with reason: [[phab:T310146|T310146]]
* 12:27 jbond: remove confd from servers that shouldn't have it
* 12:05 ladsgroup@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/Echo/maintenance/removeOrphanedEvents.php: Backport: [[gerrit:821735{{!}}Run clean ups with removeOrphanedEvents in major batches (T310428)]] (duration: 03m 32s)
* 11:47 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 11:46 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 11:46 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 11:43 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 11:15 jbond@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS buster
* 10:54 jbond@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
* 10:51 jbond@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
* 10:37 jbond@cumin1001: START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster
* 10:31 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2181-2182].codfw.wmnet with reason: D6 PDU maint ([[phab:T310146|T310146]])
* 10:31 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2181-2182].codfw.wmnet with reason: D6 PDU maint ([[phab:T310146|T310146]])
* 10:26 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on maps2010.codfw.wmnet with reason: PDU maintenance
* 10:26 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on maps2010.codfw.wmnet with reason: PDU maintenance
* 10:26 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=maps2010.codfw.wmnet
* 10:25 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=maps2010.codfw.wmnet
* 10:25 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=maps2008.codfw.wmnet
* 10:24 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on restbase2018.codfw.wmnet with reason: PDU maintenance
* 10:24 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase2018.codfw.wmnet
* 10:24 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on restbase2018.codfw.wmnet with reason: PDU maintenance
* 10:23 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ml-serve2008.codfw.wmnet with reason: PDU maintenance
* 10:23 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on ml-serve2008.codfw.wmnet with reason: PDU maintenance
* 10:20 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ores2009.codfw.wmnet with reason: PDU maintenance
* 10:20 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on ores2009.codfw.wmnet with reason: PDU maintenance
* 10:19 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ores2008.codfw.wmnet with reason: PDU maintenance
* 10:19 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on ores2008.codfw.wmnet with reason: PDU maintenance
* 10:03 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase202[367].codfw.wmnet
* 10:02 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on restbase[2023,2026-2027].codfw.wmnet with reason: PDU maintenance
* 10:02 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on restbase[2023,2026-2027].codfw.wmnet with reason: PDU maintenance
* 09:53 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: D8 PDU Maint ([[phab:T310146|T310146]])
* 09:52 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: D8 PDU Maint ([[phab:T310146|T310146]])
* 09:51 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool D8 DBs for PDU maint ([[phab:T310146|T310146]])', diff saved to https://phabricator.wikimedia.org/P32341 and previous config saved to /var/cache/conftool/dbconfig/20220810-095059-ladsgroup.json
* 09:36 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2101,2130,2140].codfw.wmnet,dbproxy2004.codfw.wmnet with reason: D6 PDU maint ([[phab:T310146|T310146]])
* 09:36 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2101,2130,2140].codfw.wmnet,dbproxy2004.codfw.wmnet with reason: D6 PDU maint ([[phab:T310146|T310146]])
* 09:34 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool D6 dbs ([[phab:T310146|T310146]])', diff saved to https://phabricator.wikimedia.org/P32340 and previous config saved to /var/cache/conftool/dbconfig/20220810-093433-ladsgroup.json
* 09:31 jelto: depool services in codfw for upcoming PDU replacement - [[phab:T309956|T309956]]
* 09:30 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 8 hosts with reason: Maintenance
* 09:29 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 8 hosts with reason: Maintenance
* 09:29 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
* 09:29 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
* 09:28 jynus: shutdown backup2007 before pdu upgrade [[phab:T310146|T310146]]
* 09:16 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 09:15 ladsgroup@deploy1002: Synchronized php-1.39.0-wmf.23/maintenance/namespaceDupes.php: Backport: [[gerrit:821734{{!}}maintenance: Add support for links migration to namespaceDupes.php (T314711)]] (duration: 03m 18s)
* 09:15 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2093,2120,2129,2172].codfw.wmnet with reason: D5 PDU maint ([[phab:T310146|T310146]])
* 09:15 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2093,2120,2129,2172].codfw.wmnet with reason: D5 PDU maint ([[phab:T310146|T310146]])
* 09:14 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 09:14 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 09:13 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 09:10 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool D5 dbs ([[phab:T310146|T310146]])', diff saved to https://phabricator.wikimedia.org/P32339 and previous config saved to /var/cache/conftool/dbconfig/20220810-091038-ladsgroup.json
* 08:49 jynus: shutdown dbprov2003 before pdu upgrade [[phab:T310146|T310146]]
* 08:49 ayounsi@cumin1001: END (FAIL) - Cookbook sre.network.debug (exit_code=99) for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:48 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:48 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be2028.codfw.wmnet
* 08:48 mvernon@cumin1001: START - Cookbook sre.hosts.remove-downtime for ms-be2028.codfw.wmnet
* 08:42 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P32337 and previous config saved to /var/cache/conftool/dbconfig/20220810-084222-ladsgroup.json
* 08:37 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 08:37 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
* 08:36 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
* 08:36 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 08:36 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 08:35 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 08:35 ladsgroup@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:822037{{!}}Stop writing to the old templatelinks fields in s5 (T312865)]] (duration: 03m 29s)
* 08:32 jelto: power off gitlab-runner2004
* 08:31 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:30:00 on gitlab-runner2004.codfw.wmnet with reason: PDU swap
* 08:31 root@cumin1001: START - Cookbook sre.hosts.downtime for 8:30:00 on gitlab-runner2004.codfw.wmnet with reason: PDU swap
* 08:29 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be2028.codfw.wmnet with reason: Trying to fix full /
* 08:28 mvernon@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be2028.codfw.wmnet with reason: Trying to fix full /
* 08:28 ayounsi@cumin1001: END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:27 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:27 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P32336 and previous config saved to /var/cache/conftool/dbconfig/20220810-082718-ladsgroup.json
* 08:25 ayounsi@cumin1001: END (FAIL) - Cookbook sre.network.debug (exit_code=99) for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:25 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:24 ayounsi@cumin1001: END (FAIL) - Cookbook sre.network.debug (exit_code=99) for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:24 ayounsi@cumin1001: START - Cookbook sre.network.debug for Netbox interface ID cr1-drmrs:xe-0/1/2
* 08:23 kart_: Run: mwscript namespaceDupes.php arywiki --fix ([[phab:T291737|T291737]])
* 08:13 jynus: restart replication on db1117:m1 [[phab:T309074|T309074]]
* 08:12 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P32335 and previous config saved to /var/cache/conftool/dbconfig/20220810-081213-ladsgroup.json
* 08:09 kartik@deploy1002: Finished scap: Backport: [[gerrit:821732{{!}}arywiki: change namespace translations, add unchanged namespaces and add old translations as aliases (T291737)]] (duration: 10m 37s)
* 07:59 kartik@deploy1002: Started scap: Backport: [[gerrit:821732{{!}}arywiki: change namespace translations, add unchanged namespaces and add old translations as aliases (T291737)]]
* 07:57 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P32334 and previous config saved to /var/cache/conftool/dbconfig/20220810-075708-ladsgroup.json
* 07:56 ladsgroup@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P32333 and previous config saved to /var/cache/conftool/dbconfig/20220810-075636-ladsgroup.json
* 07:55 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 07:52 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 07:52 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 07:52 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance
* 07:52 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance
* 07:51 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 07:51 dcaro@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 07:46 dcaro@cumin1001: START - Cookbook sre.dns.netbox
* 07:39 dcaro@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 07:34 dcaro@cumin1001: START - Cookbook sre.dns.netbox
* 07:33 godog: depool thanos-fe2001 for debugging
* 07:11 kartik@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:821170{{!}}Enable SectionTranslation on testwiki with new MT support from Google (T313296)]] (duration: 05m 44s)
* 07:11 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 07:08 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 07:08 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 07:07 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 05:24 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on kubernetes[2013-2014].codfw.wmnet with reason: PDU maintenance
* 05:24 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on kubernetes[2013-2014].codfw.wmnet with reason: PDU maintenance
* 05:19 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on parse[2016-2020].codfw.wmnet with reason: PDU maintenance
* 05:19 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on parse[2016-2020].codfw.wmnet with reason: PDU maintenance
* 05:12 _joe_: starting to shut down servers in codfw for the PDU maintenance
* 05:09 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on 10 hosts with reason: PDU maintenance
* 05:09 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on 10 hosts with reason: PDU maintenance
* 05:09 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on mc-gp2003.codfw.wmnet with reason: PDU maintenance
* 05:09 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on mc-gp2003.codfw.wmnet with reason: PDU maintenance
* 05:06 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on mc2033.codfw.wmnet with reason: PDU maintenance
* 05:06 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on mc2033.codfw.wmnet with reason: PDU maintenance
* 05:05 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on 7 hosts with reason: PDU maintenance
* 05:05 oblivian@cumin1001: START - Cookbook sre.hosts.downtime for 18:00:00 on 7 hosts with reason: PDU maintenance
* 02:34 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 02:33 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 02:33 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 02:32 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 02:07 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 02:06 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 02:06 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 02:05 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply


== 2020-08-18 ==
== 2022-08-09 ==
* 23:45 catrope@deploy1001: Synchronized php-1.36.0-wmf.5/extensions/GrowthExperiments: Only fetch task card data for users in variant C/D ([[phab:T258021|T258021]]) (duration: 01m 05s)
* 23:17 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1011.eqiad.wmnet
* 23:44 catrope@deploy1001: Synchronized php-1.36.0-wmf.4/extensions/GrowthExperiments: Only fetch task card data for users in variant C/D ([[phab:T258021|T258021]]) (duration: 01m 06s)
* 23:07 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 23:38 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1301.eqiad.wmnet
* 23:06 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 23:34 Urbanecm: Run scap pull at mw1301
* 22:51 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 23:33 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable static maps on testwiki, disable them on test2wiki (duration: 03m 22s)
* 22:51 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 23:32 mutante: rebooting mw1301 via mgmt
* 22:49 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 23:22 mutante: killed reboot-cluster on cumin1001
* 22:49 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 23:09 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|ac34f7274823e40d0c79752eb5ffe74c76856d04}}: Enable subpages in NS:0 in techconductwiki ([[phab:T260350|T260350]]) (duration: 05m 14s)
* 22:46 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1015.eqiad.wmnet
* 23:04 wkandek@cumin1001: conftool action : set/pooled=yes; selector: name=mw1300.eqiad.wmnet
* 22:31 ryankemper@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 22:47 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-cluster
* 22:31 ryankemper@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 22:41 wkandek@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=1)
* 22:28 bking@cumin1001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
* 22:09 mholloway-shell@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
* 22:02 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 22:07 mholloway-shell@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
* 22:02 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 22:06 mholloway-shell@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
* 21:57 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs2006.codfw.wmnet with reason: [[phab:T310146|T310146]]
* 21:39 mholloway-shell@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
* 21:57 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs2006.codfw.wmnet with reason: [[phab:T310146|T310146]]
* 21:37 mholloway-shell@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
* 21:53 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 21:34 mholloway-shell@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
* 21:52 bking@deploy1002: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
* 21:24 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-cluster
* 21:50 bking@deploy1002: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
* 21:03 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 21:49 bking@deploy1002: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
* 21:01 andrew@cumin1001: START - Cookbook sre.hosts.downtime
* 21:43 bking@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop: apply
* 20:27 hashar: https://releases-jenkins.wikimedia.org/ changed agent from releases1001 to releases1002
* 21:43 bking@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop: apply
* 20:14 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.5  refs [[phab:T257973|T257973]]
* 21:43 bking@deploy1002: helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
* 20:11 mutante: running puppet on cp-ats-ulsfo and switching releases-jenkins backend
* 21:43 bking@deploy1002: helmfile [eqiad] START helmfile.d/services/changeprop: apply
* 20:07 twentyafterfour@deploy1001: Finished scap: testwikis wikis to 1.36.0-wmf.5  refs [[phab:T257973|T257973]] (duration: 53m 12s)
* 21:43 bking@deploy1002: helmfile [staging] DONE helmfile.d/services/changeprop: apply
* 20:00 mutante: releases1001 rm /etc/rsync.d/frag* & run puppet
* 21:43 bking@deploy1002: helmfile [staging] START helmfile.d/services/changeprop: apply
* 19:54 mutante: rsyncing /var/lib/jenkins from releases1001 to releases1002/2002 with --delete  [[phab:T256164|T256164]]
* 21:08 bking@cumin1001: START - Cookbook sre.wdqs.data-transfer
* 19:47 ejegg: updated payments-wiki from {{Gerrit|a7ee1790e0}} to {{Gerrit|ef7ebd08cb}}
* 21:00 bking@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=wdqs1014.eqiad.wmnet
* 19:44 hashar: Deleting old jobs from https://releases-jenkins.wikimedia.org/  # [[phab:T256164|T256164]]
* 20:56 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
* 19:41 hashar: releases1001: deleting old legacy mediawiki snapshots under /var/lib/jenkins/<nowiki>{</nowiki>REL1_27,REL1_29,REL1_30<nowiki>}</nowiki>  # [[phab:T256164|T256164]]
* 20:55 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
* 19:14 twentyafterfour@deploy1001: Started scap: testwikis wikis to 1.36.0-wmf.5  refs [[phab:T257973|T257973]]
* 20:55 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32332 and previous config saved to /var/cache/conftool/dbconfig/20220809-205548-ladsgroup.json
* 19:13 twentyafterfour: Promote testwikis from 1.36.0-wmf.4 to 1.36.0-wmf.5 refs [[phab:T257973|T257973]]
* 20:51 bking@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wdqs1014.eqiad.wmnet
* 17:51 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
* 20:51 bking@cumin1001: START - Cookbook sre.hosts.remove-downtime for wdqs1014.eqiad.wmnet
* 16:12 oblivian@cumin1001: conftool action : set/pooled=yes; selector: name=mw14(09{{!}}11{{!}}13).*
* 20:46 bking@cumin1001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
* 16:03 oblivian@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=1)
* 20:40 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P32331 and previous config saved to /var/cache/conftool/dbconfig/20220809-204042-ladsgroup.json
* 15:36 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
* 20:25 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P32330 and previous config saved to /var/cache/conftool/dbconfig/20220809-202536-ladsgroup.json
* 15:30 jayme@cumin1001: START - Cookbook sre.hosts.reboot-cluster
* 20:10 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32329 and previous config saved to /var/cache/conftool/dbconfig/20220809-201030-ladsgroup.json
* 15:02 jayme@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99)
* 19:57 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on wdqs1014.eqiad.wmnet with reason: [[phab:T314890|T314890]]
* 14:56 papaul: replacing msw-c1,c2 and c4
* 19:57 bking@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on wdqs1014.eqiad.wmnet with reason: [[phab:T314890|T314890]]
* 14:55 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-cluster
* 19:56 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on wdqs1016.eqiad.wmnet with reason: [[phab:T314890|T314890]]
* 14:53 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1104', diff saved to https://phabricator.wikimedia.org/P12293 and previous config saved to /var/cache/conftool/dbconfig/20200818-145337-marostegui.json
* 19:56 bking@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on wdqs1016.eqiad.wmnet with reason: [[phab:T314890|T314890]]
* 14:48 oblivian@cumin1001: conftool action : set/pooled=yes; selector: name=mw13(55{{!}}64{{!}}65).*
* 19:55 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on wdqs1015.eqiad.wmnet with reason: [[phab:T314890|T314890]]
* 14:46 XioNoX: move v4 HE on cr3-ulsfo from peering to transit bgp group
* 19:55 bking@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on wdqs1015.eqiad.wmnet with reason: [[phab:T314890|T314890]]
* 14:44 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1104', diff saved to https://phabricator.wikimedia.org/P12292 and previous config saved to /var/cache/conftool/dbconfig/20200818-144415-marostegui.json
* 19:38 dcausse@deploy1002: helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply
* 14:37 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1104', diff saved to https://phabricator.wikimedia.org/P12291 and previous config saved to /var/cache/conftool/dbconfig/20200818-143758-marostegui.json
* 19:36 dcausse@deploy1002: helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply
* 14:35 oblivian@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99)
* 19:35 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1130.eqiad.wmnet with reason: Maintenance
* 14:29 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1104', diff saved to https://phabricator.wikimedia.org/P12290 and previous config saved to /var/cache/conftool/dbconfig/20200818-142937-marostegui.json
* 19:35 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 10:00:00 on db1130.eqiad.wmnet with reason: Maintenance
* 14:28 marostegui: Stop MYSQL on db2125 for on-site maintenance - [[phab:T260670|T260670]]
* 19:25 bking@cumin1001: START - Cookbook sre.wdqs.data-transfer
* 13:54 marostegui: Revoke DELETE and CREATE from xhgui user on m2 [[phab:T260640|T260640]]
* 18:09 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:53 XioNoX: bump Zayo v4 BGP session in eqiad
* 18:06 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
* 13:49 XioNoX: move v4 HE on cr2-eqord from peering to transit bgp group
* 13:37 XioNoX: move v4 cr1-eqiad from peering to transit bgp group
* 13:04 kormat: disabling puppet on all db machines [[phab:T259516|T259516]]
* 12:57 _joe_: rebooting appservers in eqiad, 3 at a time
* 12:57 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-cluster
* 12:37 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
* 12:34 kormat: deploying wmfmariadbpy 0.4
* 12:21 jayme@cumin1001: START - Cookbook sre.hosts.reboot-cluster
* 11:53 XioNoX: add new icinga hosts to mr policies - [[phab:T260533|T260533]]
* 11:40 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-cluster
* 11:36 Lucas_WMDE: EU backport&config done
* 11:33 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:620888{{!}}Add Wikisource wordmark for trwikisource (T260658)]], part 2 (duration: 00m 55s)
* 11:32 Lucas_WMDE: lucaswerkmeister-wmde@mwmaint1002:~$ printf '%s\n' 'https://en.wikipedia.org/static/images/mobile/copyright/wikisource-wordmark-tr.svg' {{!}} mwscript purgeList.php # [[phab:T260658|T260658]]
* 11:32 lucaswerkmeister-wmde@deploy1001: Synchronized static/images/mobile/copyright/wikisource-wordmark-tr.svg: Config: [[gerrit:620888{{!}}Add Wikisource wordmark for trwikisource (T260658)]], part 1 (duration: 00m 55s)
* 11:24 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:595543{{!}}Enable Data Bridge on Catalan Wikipedia (T232584)]] (duration: 01m 01s)
* 11:06 jbond42: deploy net-snmp update to buster
* 10:56 oblivian@cumin1001: conftool action : set/pooled=yes; selector: cluster=api_appserver,dc=codfw,name=mw229.*
* 10:55 oblivian@cumin1001: END (ERROR) - Cookbook sre.hosts.reboot-cluster (exit_code=97)
* 10:54 marostegui: Reboot db2125 after running a full upgrade - [[phab:T260670|T260670]]
* 10:46 marostegui: Powercycle db2125 from the idrac [[phab:T260670|T260670]]
* 10:07 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2125 - host down [[phab:T260670|T260670]]', diff saved to https://phabricator.wikimedia.org/P12288 and previous config saved to /var/cache/conftool/dbconfig/20200818-100718-marostegui.json
* 09:45 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-cluster
* 09:43 jiji@cumin1001: conftool action : set/pooled=yes; selector: name=mw2250.codfw.wmnet
* 09:40 oblivian@cumin1001: conftool action : set/pooled=yes; selector: cluster=api_appserver,dc=codfw,name=mw214[234].*
* 09:40 oblivian@cumin1001: END (ERROR) - Cookbook sre.hosts.reboot-cluster (exit_code=97)
* 09:35 kart_: Update cxserver to 2020-08-17-090424-production ([[phab:T259980|T259980]])
* 09:32 kartik@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' .
* 09:29 kartik@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' .
* 09:28 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-cluster
* 09:28 oblivian@cumin1001: conftool action : set/pooled=yes; selector: cluster=api_appserver,dc=codfw,name=mw214[02].*
* 09:26 volans: upgraded spicerack to v0.0.39 on cumin hosts
* 09:25 kartik@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
* 09:21 volans: uploaded spicerack_0.0.39-1+deb10u1 to apt.wikimedia.org buster-wikimedia
* 09:05 hashar: Restarting CI Jenkins
* 08:44 vgutierrez: restart ats-tls on cp5006
* 08:24 oblivian@cumin1001: END (ERROR) - Cookbook sre.hosts.reboot-cluster (exit_code=97)
* 08:17 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-cluster
* 08:16 oblivian@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99)
* 08:10 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-cluster
* 08:02 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1089', diff saved to https://phabricator.wikimedia.org/P12284 and previous config saved to /var/cache/conftool/dbconfig/20200818-080256-marostegui.json
* 07:58 oblivian@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99)
* 07:53 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-cluster
* 07:45 godog: VictorOps ack'd incidents will re-trigger after 24h if not resolved - [[phab:T259465|T259465]]
* 07:44 oblivian@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=1)
* 07:43 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1089', diff saved to https://phabricator.wikimedia.org/P12283 and previous config saved to /var/cache/conftool/dbconfig/20200818-074325-marostegui.json
* 07:42 _joe_: performing rolling reboot of all codfw api servers
* 07:38 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-cluster
* 07:23 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1089', diff saved to https://phabricator.wikimedia.org/P12282 and previous config saved to /var/cache/conftool/dbconfig/20200818-072349-marostegui.json
* 07:19 oblivian@cumin1001: conftool action : set/pooled=yes; selector: name=mw213[5-9].codfw.wmnet
* 07:16 jynus: update rest of phabricator passwords [[phab:T250361|T250361]]
* 07:11 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1089', diff saved to https://phabricator.wikimedia.org/P12281 and previous config saved to /var/cache/conftool/dbconfig/20200818-071121-marostegui.json
* 07:08 oblivian@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99)
* 07:07 godog: prometheus eqiad: add 100G to prometheus/global
* 07:01 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-cluster
* 07:01 oblivian@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99)
* 07:01 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-cluster
* 06:53 twentyafterfour: phabricator maintenance successful
* 06:48 jynus: deploy another password change to phabricator service (potentially disruptive) [[phab:T250361|T250361]]
* 06:41 XioNoX: add cloudflare PNI IPs in eqiad - [[phab:T259036|T259036]]
* 06:21 jynus: deploy password change to phabricator service [[phab:T146055|T146055]]
* 06:06 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 06:01 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 05:52 _joe_: running puppet on mc1020 [[phab:T260622|T260622]]
* 05:02 twentyafterfour: phabricator appears to be fully functional
* 05:01 twentyafterfour: phabricator read-only ended
* 05:00 twentyafterfour: phabricator is now read-only
* 05:00 marostegui: Failover m3 (phabricator) database master from db1128 to db1132 - [[phab:T259589|T259589]]
* 04:32 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1088', diff saved to https://phabricator.wikimedia.org/P12279 and previous config saved to /var/cache/conftool/dbconfig/20200818-043241-marostegui.json
* 01:54 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1376.eqiad.wmnet
* 01:38 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1343.eqiad.wmnet
* 01:37 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1344.eqiad.wmnet
* 01:04 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 00:53 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 00:48 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 00:45 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 00:39 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 00:39 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 00:38 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1341.eqiad.wmnet
* 00:37 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1340.eqiad.wmnet
* 00:37 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1339.eqiad.wmnet
* 00:31 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 00:24 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 00:15 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 00:13 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 00:07 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 00:07 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1315.eqiad.wmnet
* 00:06 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
 
== 2020-08-17 ==
* 23:59 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 23:58 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 23:49 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 23:49 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1313.eqiad.wmnet
* 23:49 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 23:49 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 23:47 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 23:41 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 23:40 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1312.eqiad.wmnet
* 23:40 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 23:30 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 23:28 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1297.eqiad.wmnet
* 23:26 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 23:25 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 23:23 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 23:11 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 23:11 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99)
* 23:11 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 23:10 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 23:10 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1288.eqiad.wmnet
* 23:00 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 22:55 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 22:47 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 22:45 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 22:43 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 22:43 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1286.eqiad.wmnet
* 22:43 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 22:41 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
* 22:37 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 22:37 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1285.eqiad.wmnet
* 22:33 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 22:26 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 22:26 ppchelko@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
* 22:26 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1284.eqiad.wmnet
* 22:25 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 22:23 ppchelko@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
* 22:21 ppchelko@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
* 22:21 ppchelko@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' .
* 22:17 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 22:15 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 22:09 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 22:08 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1282.eqiad.wmnet
* 22:04 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 22:02 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 22:02 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1281.eqiad.wmnet
* 22:00 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 21:57 Amir1: ladsgroup@mwmaint1002:~$ mwscript extensions/Cognate/maintenance/populateCognateSites.php --wiki=aawiktionary --site-group wiktionary ([[phab:T259360|T259360]])
* 21:56 ppchelko@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
* 21:56 ppchelko@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' .
* 21:53 ppchelko@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Add api-gateway.request stream config [[phab:T259736|T259736]], one host timed out (duration: 00m 55s)
* 21:49 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 21:48 ppchelko@deploy1001: sync-file aborted: Add api-gateway.request stream config [[phab:T259736|T259736]] (duration: 05m 01s)
* 21:47 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1278.eqiad.wmnet
* 21:46 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-cluster
* 21:43 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 21:42 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1279.eqiad.wmnet
* 21:42 oblivian@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99)
* 21:38 sbassett@deploy1001: Synchronized private/PrivateSettings.php: Further mitigations for [[phab:T257687|T257687]] (duration: 00m 57s)
* 21:38 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 21:36 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 21:34 effie: blocking temporarily traffic to mc1020
* 21:23 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 21:22 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 21:20 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 21:19 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1276.eqiad.wmnet
* 21:12 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw2240.codfw.wmnet
* 21:08 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 21:04 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 20:54 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 20:47 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 20:38 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 20:33 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 20:27 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 20:24 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-cluster
* 20:20 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 20:19 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 20:19 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 20:02 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 19:53 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 19:39 dzahn@cumin1001: START - Cookbook sre.hosts.reboot-single
* 19:30 ppchelko@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
* 19:28 ppchelko@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
* 19:22 ppchelko@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 19:01 ppchelko@deploy1001: Finished deploy [restbase/deploy@7f16bad]: Add thankyouwiki [[phab:T259002|T259002]], take 3 (duration: 02m 57s)
* 18:58 ppchelko@deploy1001: Started deploy [restbase/deploy@7f16bad]: Add thankyouwiki [[phab:T259002|T259002]], take 3
* 18:58 ppchelko@deploy1001: Finished deploy [restbase/deploy@7f16bad]: Add thankyouwiki [[phab:T259002|T259002]], take 2 (duration: 11m 19s)
* 18:46 ppchelko@deploy1001: Started deploy [restbase/deploy@7f16bad]: Add thankyouwiki [[phab:T259002|T259002]], take 2
* 18:46 ppchelko@deploy1001: Finished deploy [restbase/deploy@7f16bad]: Add thankyouwiki [[phab:T259002|T259002]] (duration: 131m 17s)
* 18:46 ppchelko@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
* 18:43 ppchelko@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
* 18:39 ppchelko@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 18:32 oblivian@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99)
* 18:08 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|808c17d28c5ebf5ed75f70c224d66129eb2edcd8}}: Change logo for lldwiki to match the requested one ([[phab:T259432|T259432]]) (duration: 00m 56s)
* 18:04 urbanecm@deploy1001: Synchronized static/images/project-logos/: {{Gerrit|67e8f886cd1a9cd2b63ed69761bec6c52889a5b6}}: Add logo files for lldwiki ([[phab:T259432|T259432]]) (duration: 00m 56s)
* 17:17 cdanis@cumin1001: conftool action : set/pooled=yes; selector: name=mw1359.*
* 17:06 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-cluster
* 17:04 oblivian@cumin1001: conftool action : set/pooled=yes; selector: cluster=jobrunner,dc=codfw,name=mw2246.codfw.wmnet
* 17:01 oblivian@cumin1001: END (ERROR) - Cookbook sre.hosts.reboot-cluster (exit_code=97)
* 16:36 jynus: restart backup2001, backup1001 one after the other
* 16:35 ppchelko@deploy1001: Started deploy [restbase/deploy@7f16bad]: Add thankyouwiki [[phab:T259002|T259002]]
* 16:31 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-cluster
* 16:27 urbanecm@deploy1001: Synchronized private/PrivateSettings.php: Update [[phab:T250887|T250887]] mitigations (duration: 00m 56s)
* 16:23 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: wgEventLoggingSchemas - remove unneeded override for SearchSatisfaction - [[phab:T259163|T259163]] (duration: 00m 56s)
* 16:22 cdanis@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 16:21 oblivian@cumin1001: conftool action : set/pooled=inactive; selector: cluster=jobrunner,dc=codfw,name=mw2250.codfw.wmnet
* 16:20 oblivian@cumin1001: conftool action : set/pooled=yes; selector: cluster=jobrunner,dc=codfw
* 16:20 cdanis@cumin1001: START - Cookbook sre.hosts.downtime
* 16:14 cdanis@cumin1001: conftool action : set/pooled=inactive; selector: name=mw1359.*
* 16:12 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
* 16:12 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-cluster
* 15:44 ppchelko@deploy1001: Finished deploy [restbase/deploy@ddcecce]: [[phab:T257943|T257943]] [[phab:T260556|T260556]] [[phab:T253478|T253478]] [[phab:T254490|T254490]] [[phab:T259054|T259054]]. take 3. feeds timed out (duration: 01m 31s)
* 15:43 ppchelko@deploy1001: Started deploy [restbase/deploy@ddcecce]: [[phab:T257943|T257943]] [[phab:T260556|T260556]] [[phab:T253478|T253478]] [[phab:T254490|T254490]] [[phab:T259054|T259054]]. take 3. feeds timed out
* 15:43 ppchelko@deploy1001: Finished deploy [restbase/deploy@ddcecce]: [[phab:T257943|T257943]] [[phab:T260556|T260556]] [[phab:T253478|T253478]] [[phab:T254490|T254490]] [[phab:T259054|T259054]]. take 2. feeds timed out (duration: 20m 40s)
* 15:36 cdanis: ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕦☕ homer 'cr*'  commit 'revert skipping RPKI validation for Jio AS55836 {{Gerrit|I0fd4683}} [[phab:T260452|T260452]]'
* 15:30 cdanis: ❌cdanis@cumin1001.eqiad.wmnet ~ 🕦☕ homer 'cr*-codfw*'  commit 'revert skipping RPKI validation for Jio AS55836 {{Gerrit|I0fd4683}} [[phab:T260452|T260452]]'
* 15:22 ppchelko@deploy1001: Started deploy [restbase/deploy@ddcecce]: [[phab:T257943|T257943]] [[phab:T260556|T260556]] [[phab:T253478|T253478]] [[phab:T254490|T254490]] [[phab:T259054|T259054]]. take 2. feeds timed out
* 15:22 ppchelko@deploy1001: Finished deploy [restbase/deploy@ddcecce]: [[phab:T257943|T257943]] [[phab:T260556|T260556]] [[phab:T253478|T253478]] [[phab:T254490|T254490]] [[phab:T259054|T259054]] (duration: 02m 30s)
* 15:19 ppchelko@deploy1001: Started deploy [restbase/deploy@ddcecce]: [[phab:T257943|T257943]] [[phab:T260556|T260556]] [[phab:T253478|T253478]] [[phab:T254490|T254490]] [[phab:T259054|T259054]]
* 15:08 ppchelko@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop' for release 'production' .
* 15:06 ppchelko@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'changeprop' for release 'production' .
* 15:04 ppchelko@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
* 14:57 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: wgEventLoggingSchemas - schema revision version bump for erroring schemas - all wikis (take 2) - [[phab:T254606|T254606]] (duration: 00m 53s)
* 14:57 oblivian@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99)
* 14:57 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-cluster
* 14:44 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: wgEventLoggingSchemas - schema revision version bump for erroring schemas - all wikis - [[phab:T254606|T254606]] (duration: 00m 55s)
* 14:35 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: wgEventLoggingSchemas - schema revision version bump for erroring schemas - group0 - [[phab:T254606|T254606]] (duration: 00m 56s)
* 14:14 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1088 after upgrading its mysql package', diff saved to https://phabricator.wikimedia.org/P12277 and previous config saved to /var/cache/conftool/dbconfig/20200817-141449-marostegui.json
* 14:09 marostegui: Sanitize thankyouwiki on db1124:3315, db2094:3315 - [[phab:T260551|T260551]]
* 14:03 marostegui: Sanitize lldwiki on db1124:3315 and db2094:3315 [[phab:T259436|T259436]]
* 14:02 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1088 after upgrading its mysql package', diff saved to https://phabricator.wikimedia.org/P12276 and previous config saved to /var/cache/conftool/dbconfig/20200817-140229-marostegui.json
* 13:58 Amir1: start of foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https ([[phab:T259432|T259432]])
* 13:54 Urbanecm: Creating thankyouwiki and lldwiki is done
* 13:54 urbanecm@deploy1001: Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 01m 52s)
* 13:54 Urbanecm: Create account Pcoombe (WMF) at thankyouwiki, email set to pcoombe@wikimedia.org ([[phab:T259002|T259002]])
* 13:52 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Creating thankyouwiki ([[phab:T259002|T259002]]) (duration: 00m 55s)
* 13:51 urbanecm@deploy1001: Synchronized static/images/project-logos/: Creating thankyouwiki ([[phab:T259002|T259002]]) (duration: 00m 55s)
* 13:49 urbanecm@deploy1001: rebuilt and synchronized wikiversions files: Creating thankyouwiki ([[phab:T259002|T259002]])
* 13:48 urbanecm@deploy1001: Synchronized dblists: Creating thankyouwiki ([[phab:T259002|T259002]]) (duration: 00m 55s)
* 13:47 marostegui: Deploy MCR change on db1104
* 13:47 urbanecm@deploy1001: Synchronized wmf-config/db-codfw.php: Creating thankyouwiki ([[phab:T259002|T259002]]) (duration: 00m 56s)
* 13:47 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1104 for MCR change', diff saved to https://phabricator.wikimedia.org/P12275 and previous config saved to /var/cache/conftool/dbconfig/20200817-134701-marostegui.json
* 13:46 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1099:3318', diff saved to https://phabricator.wikimedia.org/P12274 and previous config saved to /var/cache/conftool/dbconfig/20200817-134619-marostegui.json
* 13:46 urbanecm@deploy1001: Synchronized wmf-config/db-eqiad.php: Creating thankyouwiki ([[phab:T259002|T259002]]) (duration: 00m 55s)
* 13:46 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1088 after upgrading its mysql package', diff saved to https://phabricator.wikimedia.org/P12273 and previous config saved to /var/cache/conftool/dbconfig/20200817-134604-marostegui.json
* 13:41 jayme: imported td-agent-bit_1.5.3-0 to buster-wikimedia - [[phab:T260536|T260536]]
* 13:40 jayme: imported !log imported to buster-wikimedia
* 13:39 marostegui: Upgrade db1088 (s6) to a newer mysql version (10.4.14)
* 13:39 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1088 for mysql upgrade', diff saved to https://phabricator.wikimedia.org/P12272 and previous config saved to /var/cache/conftool/dbconfig/20200817-133905-marostegui.json
* 13:34 jbond42: deploy json-c security update to buster
* 13:33 marostegui: Restart mysql on db2102 (testing new package)
* 13:30 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1099:3318', diff saved to https://phabricator.wikimedia.org/P12271 and previous config saved to /var/cache/conftool/dbconfig/20200817-133043-marostegui.json
* 13:29 urbanecm@deploy1001: Synchronized langlist: Creating lldwiki ([[phab:T259432|T259432]]) (duration: 00m 54s)
* 13:28 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Creating lldwiki ([[phab:T259432|T259432]]) (duration: 00m 55s)
* 13:27 urbanecm@deploy1001: sync-file aborted: Creating lldwiki ([[phab:T259432|T259432]]) (duration: 00m 00s)
* 13:26 urbanecm@deploy1001: Synchronized static/images/project-logos/: Creating lldwiki ([[phab:T259432|T259432]]) (duration: 00m 53s)
* 13:25 urbanecm@deploy1001: rebuilt and synchronized wikiversions files: Creating lldwiki ([[phab:T259432|T259432]])
* 13:23 urbanecm@deploy1001: Synchronized dblists: Creating lldwiki ([[phab:T259432|T259432]]) (duration: 00m 56s)
* 13:22 urbanecm@deploy1001: Synchronized wmf-config/db-codfw.php: Creating lldwiki ([[phab:T259432|T259432]]) (duration: 00m 56s)
* 13:20 urbanecm@deploy1001: Synchronized wmf-config/db-eqiad.php: Creating lldwiki ([[phab:T259432|T259432]]) (duration: 00m 55s)
* 13:13 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1099:3318', diff saved to https://phabricator.wikimedia.org/P12270 and previous config saved to /var/cache/conftool/dbconfig/20200817-131307-marostegui.json
* 13:10 oblivian@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 13:10 oblivian@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 13:09 oblivian@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 13:01 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1099:3318', diff saved to https://phabricator.wikimedia.org/P12269 and previous config saved to /var/cache/conftool/dbconfig/20200817-130127-marostegui.json
* 12:58 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 12:53 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 12:53 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 12:53 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 12:53 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 12:44 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1089 for MCR change', diff saved to https://phabricator.wikimedia.org/P12268 and previous config saved to /var/cache/conftool/dbconfig/20200817-124458-marostegui.json
* 12:44 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1099:3311', diff saved to https://phabricator.wikimedia.org/P12267 and previous config saved to /var/cache/conftool/dbconfig/20200817-124409-marostegui.json
* 12:44 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 12:38 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 12:35 oblivian@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 12:35 oblivian@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 12:35 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 12:27 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 12:22 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1099:3311', diff saved to https://phabricator.wikimedia.org/P12266 and previous config saved to /var/cache/conftool/dbconfig/20200817-122234-marostegui.json
* 12:21 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 12:20 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 12:19 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 12:19 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 12:16 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1099:3311', diff saved to https://phabricator.wikimedia.org/P12265 and previous config saved to /var/cache/conftool/dbconfig/20200817-121600-marostegui.json
* 12:05 Lucas_WMDE: EU backport window done
* 12:02 Lucas_WMDE: lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes.php bjnwiki --fix {{!}} tee [[phab:T259429|T259429]]-fix
* 12:02 Lucas_WMDE: lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes.php bjnwiki {{!}} tee [[phab:T259429|T259429]]-dryrun
* 12:01 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:620643{{!}}Set Portal and Portal_talk namespaces in bjnwiki as an extra namespace. (T259429)]] (duration: 00m 55s)
* 11:57 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1099:3311', diff saved to https://phabricator.wikimedia.org/P12264 and previous config saved to /var/cache/conftool/dbconfig/20200817-115741-marostegui.json
* 11:54 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:620513{{!}}Add Wiktionary wordmark for eswiktionary (T254059)]], part 2 (duration: 00m 57s)
* 11:53 Lucas_WMDE: lucaswerkmeister-wmde@mwmaint1002:~$ printf 'https://en.wikipedia.org/static/images/mobile/copyright/wiktionary-wordmark-es.svg\n' {{!}} mwscript purgeList.php # [[phab:T254059|T254059]]
* 11:53 lucaswerkmeister-wmde@deploy1001: Synchronized static/images/mobile/copyright/wiktionary-wordmark-es.svg: Config: [[gerrit:620513{{!}}Add Wiktionary wordmark for eswiktionary (T254059)]], part 1 (duration: 00m 56s)
* 11:46 Lucas_WMDE: lucaswerkmeister-wmde@mwmaint1002:~$ printf 'https://en.wikipedia.org/static/images/project-logos/zh_classicalwiki%s.png\n' '' '-1.5x' '-2x' {{!}} mwscript purgeList.php # [[phab:T259006|T259006]]
* 11:45 lucaswerkmeister-wmde@deploy1001: Synchronized static/images/project-logos/: Config: [[gerrit:620510{{!}}Change the logo of lzh Wikipedia (T259006)]] (duration: 00m 55s)
* 11:40 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:620509{{!}}Add Turkish powered by MW and Wikimedia project icons for Turkish Wikiquote, Turkish Wiktionary, Turkish Wikisource and Turkish Wikibooks (T260493)]] (duration: 00m 55s)
* 11:35 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:620507{{!}}Add Turkish powered by MW and Wikimedia project icons (T260492)]] (duration: 00m 57s)
* 11:25 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 11:14 oblivian@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 11:12 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 11:10 oblivian@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 11:09 cparle@deploy1001: Synchronized wmf-config/InitialiseSettings.php: [SDC] configure mediasearch A/B test (duration: 01m 08s)
* 11:08 oblivian@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 10:57 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 10:54 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 10:52 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 10:52 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 10:52 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 10:51 oblivian@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 10:49 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 10:47 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 10:42 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 10:38 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 10:36 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 10:35 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 10:35 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 10:30 jayme@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 10:29 jayme@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 10:29 jayme@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 10:14 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 10:14 jayme@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 10:13 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 10:13 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 10:10 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 10:07 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 10:06 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 09:58 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 09:58 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 09:56 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 09:55 jayme@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 09:52 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 09:48 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 09:45 jayme@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 09:42 jynus: updating compiler facts for cloud puppet compiler project to include new host dbprov2003
* 09:39 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 09:38 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 09:38 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 09:36 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 09:36 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 09:29 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 09:28 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 09:27 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 09:23 jayme@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 09:22 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 09:21 jayme@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 09:18 _joe_: running a full apt-get upgrade on mw1379-1380
* 09:18 _joe_: re-upgrading imagemagick on mw1378
* 09:16 _joe_: upgrading packages on mw1377
* 09:14 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 09:06 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 09:06 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 09:05 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 08:25 jayme: forcing a puppet run on all mw-api servers in eqiad - [[phab:T260329|T260329]]
* 07:52 _joe_: repooling mw1382
* 07:37 _joe_: running the same test on mw1382 [[phab:T260329|T260329]]
* 07:34 _joe_: repooling mw1381
* 07:15 _joe_: running the same test on mw1381 [[phab:T260329|T260329]]
* 07:15 _joe_: repooled mw1281
* 06:26 _joe_: stop testing on mw1281, [[phab:T260329|T260329]]
* 05:45 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 05:43 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
* 05:28 marostegui: Stop mysql on db1099:3311, db1099:3318 for reimage
* 05:28 _joe_: depooling mw1281 for testing for [[phab:T260329|T260329]]
* 05:25 marostegui: Deploy schema change on db1139:3311
* 05:21 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1099:3311, db1099:3318 for reimage and MCR change', diff saved to https://phabricator.wikimedia.org/P12263 and previous config saved to /var/cache/conftool/dbconfig/20200817-052147-marostegui.json
 
== 2020-08-16 ==
* 11:12 gehel: repooling wdqs1004 - caught up on lag
 
== 2020-08-15 ==
* 21:18 gehel: depooling wdqs1004 and restarting services, will wait to catch up on lag before repooling
 
== 2020-08-14 ==
* 19:41 effie: restart mwdebug1002
* 16:58 cdanis: done deploying 'allow nameservers of Jio AS55836 to skip RPKI validation I9fcff8' to all routers [[phab:T260449|T260449]]
* 16:44 cdanis: ❌cdanis@cumin1001.eqiad.wmnet ~ 🕧☕ homer 'cr2-esams*'  commit 'allow nameservers of Jio AS55836 to skip RPKI validation I9fcff8'
* 16:39 cdanis: ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕧☕ homer 'cr1-codfw*'  commit 'allow nameservers of Jio AS55836 to skip RPKI validation I9fcff8'
* 16:36 cdanis: ❌cdanis@cumin1001.eqiad.wmnet ~ 🕧☕ homer 'cr2-codfw*'  commit 'allow nameservers of Jio AS55836 to skip RPKI validation I9fcff8'
* 02:41 eileen: tools revision changed from {{Gerrit|9a89f45974}} to {{Gerrit|b4ebd1e564}}
 
== 2020-08-13 ==
* 23:39 tzatziki: removing 3 files for legal compliance
* 22:03 mutante: switching xhgui from tungsten to xhgui1001 - ran puppet on webperf*001 - [[phab:T180761|T180761]] [[phab:T158837|T158837]]
* 21:54 andrew@deploy1001: Finished deploy [horizon/deploy@f3dcb29]: fix proxy in project-local domain --bug [[phab:T260388|T260388]] (duration: 03m 53s)
* 21:50 andrew@deploy1001: Started deploy [horizon/deploy@f3dcb29]: fix proxy in project-local domain --bug [[phab:T260388|T260388]]
* 21:11 mutante: rsyncing /var/lib/jenkins from releases1001 to releases1002 and then all other releases* servers. 57GB, overwriting existing data from manual config ([[phab:T247652|T247652]])
* 20:53 kormat: dropping xhgui.xhgui on m2
* 19:35 thcipriani@deploy1001: Synchronized php-1.36.0-wmf.4/extensions/DiscussionTools: [[gerrit:620030{{!}}Revert new reply API (again)]] [[phab:T259855|T259855]] (duration: 00m 57s)
* 18:06 herron: restarted ES on logstash1010
* 18:05 dpifke@deploy1001: Synchronized wmf-config/ProductionServices.php: Enabling new XHGui backend ([[phab:T180761|T180761]]) (duration: 00m 56s)
* 17:16 hnowlan: deployed ATS and varnish rules to route api.wikimedia.org
* 16:26 hnowlan: created api.wikimedia.org
* 15:49 hnowlan: moving api-gateway service to state production. critical set to false
* 15:41 herron: restart ES on logstash1012
* 14:56 fdans@deploy1001: Finished deploy [analytics/refinery@ba1a439]: Regular analytics weekly train (duration: 11m 34s)
* 14:45 ema: repool mw1382 with kernel memory accounting disabled [[phab:T260281|T260281]]
* 14:45 fdans@deploy1001: Started deploy [analytics/refinery@ba1a439]: Regular analytics weekly train
* 14:41 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 14:40 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 14:38 ema: reboot mw1382 with kernel memory accounting disabled [[phab:T260281|T260281]]
* 14:34 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 14:34 _joe_: rebooting mw1381 with a newer kernel, mw1383 as control with the old kernel [[phab:T260329|T260329]]
* 14:33 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 14:31 _joe_: installing kernel 4.19.0-0.bpo.9 on mw1381 [[phab:T260329|T260329]]
* 14:05 elukey@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
* 14:00 elukey: create schema[12]00[34] in ganeti - [[phab:T260347|T260347]]
* 13:59 elukey@cumin1001: START - Cookbook sre.ganeti.makevm
* 13:58 elukey@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
* 13:53 elukey@cumin1001: START - Cookbook sre.ganeti.makevm
* 13:51 elukey@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
* 13:46 elukey@cumin1001: START - Cookbook sre.ganeti.makevm
* 13:45 hnowlan: moving api-gateway service to monitoring_setup
* 13:44 elukey@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
* 13:44 hashar: Gracefully restarting Zuul
* 13:39 elukey@cumin1001: START - Cookbook sre.ganeti.makevm
* 13:10 _joe_: forcing a puppet run on the api appservers in eqiad  [[phab:T260329|T260329]]
* 13:07 oblivian@deploy1001: Synchronized wmf-config/CommonSettings.php: revert enabling of lilypond (again) [[phab:T257091|T257091]] [[phab:T260329|T260329]] (duration: 00m 59s)
* 11:24 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 11:20 jayme@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 11:09 hnowlan: restarting pybal on lvs2010 [[phab:T254908|T254908]]
* 11:09 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 11:06 hnowlan: restarting pybal on lvs2009 [[phab:T254908|T254908]]
* 11:05 hnowlan: restarting pybal on lvs1016 [[phab:T254908|T254908]]
* 11:05 jayme: depool mw1380 for downgrade of poppler-utils,libpoppler-glib8,libpoppler64,curl,libcurl3,libcurl3-gnutls,libpython3.5,python3.5,libpython3.5-stdlib,python3.5-minimal,libpython3.5-minimal,imagemagick-6-common,libmagickcore-6.q16-3,libmagickwand-6.q16-3,imagemagick-6.q16,imagemagick,e2fslibs,e2fsprogs,libcomerr2,libss2 and reboot - [[phab:T260329|T260329]]
* 11:05 hnowlan: restarting pybal on lvs1015 [[phab:T254908|T254908]]
* 11:04 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 10:42 hnowlan: Moving api-gateway service from service_setup to lvs_setup and running puppet on LVS servers
* 10:17 jayme: depool mw1379 for downgrade of poppler-utils,libpoppler-glib8,libpoppler64,curl,libcurl3,libcurl3-gnutls,libpython3.5,python3.5,libpython3.5-stdlib,python3.5-minimal,libpython3.5-minimal,imagemagick-6-common,libmagickcore-6.q16-3,libmagickwand-6.q16-3,imagemagick-6.q16,imagemagick,e2fslibs,e2fsprogs,libcomerr2,libss2 and reboot - [[phab:T260329|T260329]]
* 10:04 XioNoX: re-order OSPF interfaces on all routers (now partially netbox driven)
* 09:37 ayounsi@deploy1001: Finished deploy [homer/deploy@89636df]: Homer release v0.2.5 (duration: 03m 03s)
* 09:34 ayounsi@deploy1001: Started deploy [homer/deploy@89636df]: Homer release v0.2.5
* 08:59 ayounsi@cumin1001: END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0)
* 08:58 ayounsi@cumin1001: END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0)
* 08:55 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1082', diff saved to https://phabricator.wikimedia.org/P12247 and previous config saved to /var/cache/conftool/dbconfig/20200813-085547-marostegui.json
* 08:45 _joe_: downgrading imagemagick on mw1378 [[phab:T260329|T260329]]
* 08:43 _joe_: downgrading imagemagick on mw1378 [[phab:T260281|T260281]]
* 08:38 ayounsi@cumin1001: START - Cookbook sre.network.prepare-upgrade
* 08:38 ayounsi@cumin1001: START - Cookbook sre.network.prepare-upgrade
* 07:55 _joe_: downgrading curl/libcurl3/libcurl3-gnutls on mw1377 [[phab:T260329|T260329]]
* 07:45 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1082', diff saved to https://phabricator.wikimedia.org/P12246 and previous config saved to /var/cache/conftool/dbconfig/20200813-074528-marostegui.json
* 07:19 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1082', diff saved to https://phabricator.wikimedia.org/P12244 and previous config saved to /var/cache/conftool/dbconfig/20200813-071943-marostegui.json
* 07:16 marostegui: Stop replication on db1082 to remove triggers on sanitarium for MCR changes
* 07:15 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1082', diff saved to https://phabricator.wikimedia.org/P12243 and previous config saved to /var/cache/conftool/dbconfig/20200813-071545-marostegui.json
* 06:48 marostegui: Deploy MCR change on dbstore1003:3311
* 06:02 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 06:01 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1126', diff saved to https://phabricator.wikimedia.org/P12242 and previous config saved to /var/cache/conftool/dbconfig/20200813-060135-marostegui.json
* 06:00 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
* 05:43 marostegui: Stop MySQL on db2135 (codfw master), haproxy irc alert will fire [[phab:T260324|T260324]]
* 05:28 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1126', diff saved to https://phabricator.wikimedia.org/P12241 and previous config saved to /var/cache/conftool/dbconfig/20200813-052859-marostegui.json
* 05:12 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1126', diff saved to https://phabricator.wikimedia.org/P12240 and previous config saved to /var/cache/conftool/dbconfig/20200813-051222-marostegui.json
* 05:01 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1126', diff saved to https://phabricator.wikimedia.org/P12239 and previous config saved to /var/cache/conftool/dbconfig/20200813-050107-marostegui.json
* 02:56 mutante: testreduce1001 - systemctl reset-failed ; fix parsoid-vd systemd state and icinga alert
* 00:37 mutante: removing jenkins_service_running checks from secondary servers where it's stopped, manually from icinga config, running puppet on icinga
* 00:14 mutante: re-enabling puppet on releases* servers
 
== 2020-08-12 ==
* 23:44 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 23:41 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 23:40 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 23:39 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 23:37 wkandek: reboot mw1372
* 23:36 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 23:36 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 23:32 wkandek: reboot mw1373
* 23:32 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 23:31 wkandek: reboot mw1371
* 23:31 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 23:31 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 23:30 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 23:28 wkandek: reboot mw1384
* 23:27 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 23:27 wkandek: reboot mw1385
* 23:26 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 23:25 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 23:24 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 23:22 wkandek: reboot mw1370
* 23:22 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 23:19 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 23:18 wkandek: reboot mw1369
* 23:18 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 23:17 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 23:17 wkandek: reboot mw1387
* 23:16 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 23:16 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 23:16 wkandek: reboot mw1389
* 23:15 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 23:14 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 23:09 wkandek: reboot mw1368
* 23:09 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 23:08 wkandek: reboot mw1367
* 23:08 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 23:07 wkandek: reboot mw1391
* 23:07 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 23:06 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 23:06 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 23:06 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 23:05 ejegg: updated Fundraising CiviCRM from {{Gerrit|72452e28a9}} to {{Gerrit|f5469d0a4c}}
* 23:05 wkandek: reboot mw1393
* 23:04 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 23:04 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 23:01 wkandek: reboot mw1395
* 23:01 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 23:00 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 22:53 wkandek: reboot mw1397
* 22:53 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 22:52 wkandek: reboot mw1366
* 22:52 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 22:52 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 22:52 wkandek: reboot mw1365
* 22:51 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 22:51 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 22:51 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 22:47 wkandek: reboot mw1399
* 22:47 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 22:46 wkandek: reboot mw1364
* 22:46 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 22:45 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 22:44 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 22:42 wkandek: reboot mw1401
* 22:42 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 22:41 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 22:41 wkandek: reboot mw1355
* 22:40 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 22:40 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 22:38 wkandek: reboot mw1354
* 22:38 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 22:36 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 22:36 wkandek: reboot mw1396
* 22:36 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 22:35 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 22:32 wkandek: reboot mw1353
* 22:32 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 22:31 wkandek: reboot mw1352
* 22:31 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 22:31 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 22:30 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 22:29 wkandek: reboot mw1348
* 22:29 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 22:28 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 22:26 wkandek: reboot mw1347
* 22:26 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 22:25 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 22:23 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 22:22 wkandek: reboot mw1350
* 22:22 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 22:21 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 22:20 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 22:19 wkandek: reboot mw1346
* 22:19 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 22:18 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 22:14 wkandek: reboot mw1345
* 22:13 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 22:12 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 22:12 wkandek: reboot mw1349
* 22:12 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 22:11 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 22:08 wkandek: reboot mw1333
* 22:07 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 22:07 wkandek@cumin1001: conftool action : set/pooled=yes; selector: name=mw1330.eqiad.wmnet
* 22:03 wkandek: reboot mw1344
* 22:03 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 22:02 wkandek: reboot mw1343
* 22:02 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 22:02 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 22:00 wkandek: reboot mw1332
* 22:00 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 21:56 wkandek@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 21:55 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 21:53 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 21:50 wkandek: reboot mw1331
* 21:50 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 21:48 wkandek: reboot mw1342
* 21:47 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 21:46 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 21:46 wkandek@cumin1001: conftool action : set/pooled=yes; selector: name=mw1340.eqiad.wmnet
* 21:41 wkandek@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 21:40 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 21:39 wkandek@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 21:39 wkandek: reboot mw1341
* 21:39 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 21:37 wkandek@cumin1001: END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97)
* 21:37 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 21:36 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 21:33 wkandek: reboot mw1329
* 21:33 wkandek: reboot mw1328
* 21:32 root@cumin1001: START - Cookbook sre.hosts.reboot-single
* 21:32 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 21:29 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 21:28 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 21:28 ejegg: updated payments-wiki from {{Gerrit|77ff5d70fc}} to {{Gerrit|a7ee1790e0}}
* 21:25 wkandek: reboot mw1340
* 21:25 wkandek@cumin1001: START - Cookbook sre.hosts.reboot-single
* 21:23 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 21:21 wkandek: reboot mw1339
* 21:20 root@cumin1001: START - Cookbook sre.hosts.reboot-single
* 21:20 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 21:15 wkandek: reboot mw1327
* 21:15 root@cumin1001: START - Cookbook sre.hosts.reboot-single
* 21:14 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 21:13 wkandek: reboot mw1326
* 21:13 root@cumin1001: START - Cookbook sre.hosts.reboot-single
* 21:11 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 21:11 wkandek: reboot mw1317
* 21:11 root@cumin1001: START - Cookbook sre.hosts.reboot-single
* 21:10 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 21:05 wkandek: reboot mw1316
* 21:04 root@cumin1001: START - Cookbook sre.hosts.reboot-single
* 21:03 wkandek: reboot mw1325
* 21:03 root@cumin1001: START - Cookbook sre.hosts.reboot-single
* 21:02 wkandek: reboot mw1324
* 21:02 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 21:02 root@cumin1001: START - Cookbook sre.hosts.reboot-single
* 21:01 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 21:01 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 21:01 wkandek: reboot mw1315
* 21:01 root@cumin1001: START - Cookbook sre.hosts.reboot-single
* 21:00 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 20:57 wkandek: reboot mw1323
* 20:57 root@cumin1001: START - Cookbook sre.hosts.reboot-single
* 20:54 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 20:52 wkandek: reboot mw1322
* 20:52 root@cumin1001: START - Cookbook sre.hosts.reboot-single
* 20:51 wkandek: reboot mw1314
* 20:51 root@cumin1001: START - Cookbook sre.hosts.reboot-single
* 20:50 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 20:50 wkandek: reboot mw1313
* 20:50 root@cumin1001: START - Cookbook sre.hosts.reboot-single
* 20:49 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 20:48 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 20:44 wkandek: reboot mw1312
* 20:44 root@cumin1001: START - Cookbook sre.hosts.reboot-single
* 20:43 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 20:43 wkandek: reboot mw1321
* 20:42 root@cumin1001: START - Cookbook sre.hosts.reboot-single
* 20:41 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 20:40 wkandek: reboot mw1297
* 20:40 root@cumin1001: START - Cookbook sre.hosts.reboot-single
* 20:39 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 20:39 wkandek: reboot mw1320
* 20:39 root@cumin1001: START - Cookbook sre.hosts.reboot-single
* 20:38 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 20:34 wkandek: reboot mw1290
* 20:34 root@cumin1001: START - Cookbook sre.hosts.reboot-single
* 20:33 wkandek: reboot mw1319
* 20:33 root@cumin1001: START - Cookbook sre.hosts.reboot-single
* 20:32 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 20:31 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 20:29 wkandek: reboot mw1275
* 20:29 root@cumin1001: START - Cookbook sre.hosts.reboot-single
* 20:28 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 20:26 wkandek: reboot mw1289
* 20:25 wkandek: reboot mw1288
* 20:25 root@cumin1001: START - Cookbook sre.hosts.reboot-single
* 20:25 root@cumin1001: START - Cookbook sre.hosts.reboot-single
* 20:24 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 20:23 wkandek: reboot mw1274
* 20:23 root@cumin1001: START - Cookbook sre.hosts.reboot-single
* 20:23 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 20:22 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 20:20 wkandek: reboot mw1273
* 20:20 root@cumin1001: START - Cookbook sre.hosts.reboot-single
* 20:16 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 20:13 wkandek: reboot mw1287
* 20:13 wkandek: reboot mw1286
* 20:13 root@cumin1001: START - Cookbook sre.hosts.reboot-single
* 20:13 root@cumin1001: START - Cookbook sre.hosts.reboot-single
* 20:11 root@cumin1001: START - Cookbook sre.hosts.reboot-single
* 20:11 root@cumin1001: START - Cookbook sre.hosts.reboot-single
* 20:11 wkandek: reboot mw1272
* 20:11 wkandek: reboot mw1271
* 19:41 hashar: Upgrading Jenkins on contint2001 (primary)
* 19:25 hashar: contint1001: sudo systemctl mask jenkins  # spare server
* 19:25 mutante: all releases* servers except 1001 - disable puppet; stop jenkins, mask jenkins
* 19:22 mutante: releases1002 - stopped and masked jenkins service
* 19:22 mutante: releases2001 - stopped and masked jenkins service
* 19:20 mutante: upgrading jenkins on releases*001
* 19:19 hashar: Upgrading Jenkins on contint1001 (spare)
* 19:16 hashar@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.4
* 19:13 mutante: uploaded new jenkins version to APT repo; upgrading jenkins on releases1002/2002
* 19:08 effie: pool mw1396
* 19:06 effie: repool mw1395 mw1397 mw1399
* 18:56 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 18:55 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 18:50 ladsgroup@deploy1001: Synchronized php-1.36.0-wmf.4/extensions/Wikibase/client/includes/Store/Sql/DirectSqlStore.php: [[phab:T255305{{!}}Set caching of CachingEntityRevisionLookup to CACHE_NONE in client]] (duration: 02m 13s)
* 18:47 wkandek: reboot mw1270
* 18:47 root@cumin1001: START - Cookbook sre.hosts.reboot-single
* 18:45 wkandek: reboot mw1269
* 18:41 root@cumin1001: START - Cookbook sre.hosts.reboot-single
* 18:39 jiji@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 18:39 jiji@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 18:38 jiji@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 18:28 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 18:25 wkandek: reboot mw1268
* 18:25 otto@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
* 18:25 otto@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
* 18:23 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single
* 18:22 otto@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
* 18:22 otto@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
* 18:22 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single
* 18:22 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single
* 18:22 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single
* 18:17 jiji@cumin1001: END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97)
* 18:17 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single
* 18:16 jiji@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 18:10 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable GrowthExperiments on hewiki ([[phab:T255020|T255020]]) (duration: 01m 03s)
* 18:08 jiji@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 18:07 otto@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
* 18:07 otto@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
* 18:06 jiji@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 18:04 ladsgroup@deploy1001: Synchronized php-1.36.0-wmf.4/extensions/Wikibase/repo/includes/Store/Sql/SqlStore.php: [[phab:T255305{{!}}Set caching of CachingEntityRevisionLookup to CACHE_NONE in repo]] (duration: 01m 06s)
* 18:02 otto@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
* 18:02 otto@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
* 18:00 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single
* 17:59 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 17:58 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 17:56 otto@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
* 17:52 otto@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
* 17:52 otto@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
* 17:52 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single
* 17:51 otto@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
* 17:51 otto@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
* 17:51 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single
* 17:50 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single
* 17:49 effie: reboot mw1265 mw1282 mw1283
* 17:45 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single
* 17:45 otto@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
* 17:37 jiji@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 17:36 jiji@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 17:30 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 17:25 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 17:21 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single
* 17:21 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single
* 17:20 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single
* 17:20 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single
* 17:19 effie: reboot mw1263 mw1264 mw1279 and mw1281
* 17:17 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
* 17:17 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
* 17:16 cdanis: for posterity: mw1359 has a bunch of special packages installed (previously recorded in SAL) and also has `sudo memleak-bpfcc -o 60000 -z 31 -Z 33 30` running in a tmux in an attempt to understand what's causing the page fragmentation in the appserver fleet
* 17:16 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
* 17:16 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
* 17:15 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
* 17:15 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
* 17:13 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' .
* 17:13 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
* 17:03 jiji@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 17:00 jiji@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 16:57 sbassett@deploy1001: Synchronized private/PrivateSettings.php: Additional mitigations for [[phab:T257687|T257687]] (duration: 01m 03s)
* 16:53 hnowlan@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
* 16:52 hnowlan@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
* 16:52 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 16:50 hnowlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 16:49 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 16:48 oblivian@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 16:45 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single
* 16:45 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single
* 16:44 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single
* 16:35 hnowlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 16:35 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single
* 16:32 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 16:31 effie: reboot mw1277 mw1278 && mw1261 mw1262
* 16:29 oblivian@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99)
* 16:24 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 16:04 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 16:04 krinkle@deploy1001: Synchronized wmf-config/CommonSettings.php: {{Gerrit|I3726a6364d}}, [[phab:T257079|T257079]] (duration: 01m 02s)
* 15:56 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 15:52 jayme@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 15:50 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 15:48 jayme@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 15:48 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 15:42 jayme@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 15:37 jayme@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 15:36 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 15:32 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 15:32 oblivian@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 15:32 jayme@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 15:31 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 15:26 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 15:25 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 15:22 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 15:21 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 15:19 jayme@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 15:19 jayme@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 15:16 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 15:16 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 15:15 jayme@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 15:14 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 15:12 cdanis: ✔️ cdanis@mw1359.eqiad.wmnet ~ 🕚☕ sudo apt install linux-headers-4.9.0-12-amd64
* 15:10 cdanis: ✔️ cdanis@mw1359.eqiad.wmnet ~ 🕚☕ sudo apt install python3-netaddr ieee-data
* 15:09 cdanis: ✔️ cdanis@mw1359.eqiad.wmnet ~ 🕚☕ sudo dpkg -i bpfcc-tools_0.12.0-2_all.deb libbpfcc_0.12.0-2_amd64.deb python3-bpfcc_0.12.0-2_all.deb
* 15:08 jayme@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 15:03 otto@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
* 15:03 otto@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
* 15:03 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 15:03 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 14:59 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 14:54 cdanis: again un-kludging deneb.codfw.wmnet:/var/cache/pbuilder/hooks/stretch/D02backports
* 14:53 jayme@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 14:52 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 14:45 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 14:44 cdanis: temporarily re-kludging deneb.codfw.wmnet:/var/cache/pbuilder/hooks/stretch/D02backports, original in my homedir
* 14:37 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 14:37 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 14:35 cdanis: un-kludging deneb.codfw.wmnet:/var/cache/pbuilder/hooks/stretch/D02backports
* 14:32 otto@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
* 14:31 cdanis: temporarily kludging deneb.codfw.wmnet:/var/cache/pbuilder/hooks/stretch/D02backports, original in my homedir
* 14:24 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
* 14:24 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
* 14:02 kormat: uploaded wmfmariadbpy 0.3 to apt
* 13:57 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 13:56 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 13:48 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
* 13:48 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
* 13:43 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single
* 13:43 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single
* 13:42 effie: restart mw1383 & mw1386
* 13:41 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
* 13:41 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
* 13:27 hashar@deploy1001: Synchronized php: group1 wikis to 1.36.0-wmf.4 (duration: 01m 16s)
* 13:25 hashar@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.4
* 13:20 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 13:19 oblivian@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 13:15 cdanis: ✔️ cdanis@mw1357.eqiad.wmnet ~ 🕘☕ sudo sysctl -w vm/compact_memory=1
* 13:12 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 13:07 oblivian@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 13:04 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 12:59 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 12:52 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 12:50 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 12:33 oblivian@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
* 12:27 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 12:20 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 12:20 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 12:16 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 12:15 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 12:15 oblivian@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99)
* 12:15 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 11:51 ema: pool mw1363 after reboot
* 11:49 jynus: creating artificial low replication lag on db2130 to test icinga alerts [[phab:T253120|T253120]]
* 11:41 ema@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 11:37 ema@cumin1001: START - Cookbook sre.hosts.reboot-single
* 11:30 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 11:28 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 11:25 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 11:24 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 11:21 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 11:17 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 11:17 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 11:13 oblivian@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 11:10 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 11:08 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 11:07 oblivian@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99)
* 11:07 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 11:00 oblivian@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99)
* 11:00 oblivian@cumin1001: START - Cookbook sre.hosts.reboot-single
* 10:55 _joe_: rebooting mw1361
* 10:51 jayme: rebooting mw1356
* 10:49 _joe_: rebooting mw1378
* 09:45 _joe_: repooling mw1377
* 09:40 _joe_: rebooting mw1377
* 09:22 _joe_: depool mw1357 too
* 09:14 _joe_: depooling mw1377 for inspection
* 09:12 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1110', diff saved to https://phabricator.wikimedia.org/P12220 and previous config saved to /var/cache/conftool/dbconfig/20200812-091211-marostegui.json
* 09:08 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1110', diff saved to https://phabricator.wikimedia.org/P12219 and previous config saved to /var/cache/conftool/dbconfig/20200812-090831-marostegui.json
* 08:50 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1110', diff saved to https://phabricator.wikimedia.org/P12218 and previous config saved to /var/cache/conftool/dbconfig/20200812-085021-marostegui.json
* 08:35 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1110', diff saved to https://phabricator.wikimedia.org/P12217 and previous config saved to /var/cache/conftool/dbconfig/20200812-083548-marostegui.json
* 07:49 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 07:46 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
* 07:31 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1110 for reimage', diff saved to https://phabricator.wikimedia.org/P12215 and previous config saved to /var/cache/conftool/dbconfig/20200812-073130-marostegui.json
* 04:51 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1126 for MCR change', diff saved to https://phabricator.wikimedia.org/P12214 and previous config saved to /var/cache/conftool/dbconfig/20200812-045157-marostegui.json
 
== 2020-08-11 ==
* 23:41 Urbanecm: Evening B&C window completed
* 23:39 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|0f238f71c95c7bd7534c28abfac759fbb47f674f}}: Update wgMFRemovableClasses ([[phab:T231160|T231160]]) (duration: 01m 03s)
* 23:36 urbanecm@deploy1001: Synchronized php-1.36.0-wmf.3/extensions/MobileFrontend/extension.json: {{Gerrit|c22d65ff9b2439f484ab8ccffed87b00e78c3ad2}}: Hide vertical nav-boxes on mobile domain ([[phab:T231160|T231160]]) (duration: 01m 03s)
* 23:34 urbanecm@deploy1001: Synchronized php-1.36.0-wmf.4/extensions/MobileFrontend/extension.json: {{Gerrit|81d54b0ec82d0b78f723f9400031e918a4a143aa}}: Hide vertical nav-boxes on mobile domain ([[phab:T231160|T231160]]) (duration: 01m 05s)
* 23:07 urbanecm@deploy1001: Synchronized wmf-config/CommonSettings.php: {{Gerrit|28faa279dacf6a4d6f0a663844e913738c2fa142}}: Switching to updated license definition (duration: 01m 04s)
* 21:52 krinkle@deploy1001: Synchronized php-1.36.0-wmf.3/includes/skins/SkinMustache.php: {{Gerrit|Ibe1f07346}}, [[phab:T259872|T259872]], [[phab:T259858|T259858]] (duration: 01m 04s)
* 19:40 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
* 19:40 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
* 19:35 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: EventStreamConfig - Add streams for eventgate-main - [[phab:T251935|T251935]] (duration: 01m 04s)
* 19:21 ejegg: updated payments-wiki from {{Gerrit|f199c071c3}} to {{Gerrit|77ff5d70fc}}
* 18:55 otto@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
* 18:48 otto@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
* 18:44 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' .
* 18:44 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
* 18:28 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Grant investigate right to checkuser group on frwiki ([[phab:T260171|T260171]]) (duration: 01m 04s)
* 18:18 ppchelko@deploy1001: Synchronized wmf-config/CommonSettings-labs.php: Beta-only: Configured additional settings for API Portal beta wiki gerrit:619339 (duration: 01m 03s)
* 18:05 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Direct GrowthExperiments help panel questions to mentors on cswiki ([[phab:T250235|T250235]]) (duration: 01m 03s)
* 17:56 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: EventStreamConfig - Remove extraneous mediawiki.api-request stream - [[phab:T251935|T251935]] (duration: 01m 01s)
* 17:53 mholloway-shell@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
* 17:53 mholloway-shell@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' .
* 17:43 mholloway-shell@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
* 17:43 mholloway-shell@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' .
* 17:38 mholloway-shell@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
* 17:36 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 17:33 mholloway-shell@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' .
* 17:31 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
* 17:28 mholloway-shell@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' .
* 17:25 mholloway-shell@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' .
* 17:04 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:58 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
* 16:53 hashar@deploy1001: Synchronized php-1.36.0-wmf.4/skins/MinervaNeue/: Revert "ServiceWiring: Avoid usage of deprecated Title::getSubjectPage()" - [[phab:T260155|T260155]] (duration: 01m 06s)
* 16:12 herron: migrating lists.wikimedia.org services from fermium to lists1001 [[phab:T224586|T224586]]
* 15:36 hashar@deploy1001: rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.4
* 15:27 hashar@deploy1001: Finished scap: (no justification provided) (duration: 30m 51s)
* 14:59 marostegui: Deploy MCR change on db1116:3318
* 14:56 hashar@deploy1001: Started scap: (no justification provided)
* 14:56 hashar@deploy1001: Pruned MediaWiki: 1.36.0-wmf.2 (duration: 04m 15s)
* 14:55 jayme: updated helmfile to 0.125.2-1 on contint* and deploy*
* 14:52 otto@deploy1001: Finished deploy [analytics/refinery@35c4430]: Deploying to an-launcher1002 to get camus wrapper script changes - [[phab:T251935|T251935]] (duration: 01m 14s)
* 14:51 otto@deploy1001: Started deploy [analytics/refinery@35c4430]: Deploying to an-launcher1002 to get camus wrapper script changes - [[phab:T251935|T251935]]
* 14:50 hashar@deploy1001: Pruned MediaWiki: 1.36.0-wmf.1 (duration: 02m 07s)
* 14:48 jayme: imported helmfile_0.125.2-1 to buster-wikimedia, jessie-wikimedia, stretch-wikimedia
* 14:47 hashar@deploy1001: Pruned MediaWiki: 1.35.0-wmf.41 (duration: 04m 20s)
* 14:40 hashar@deploy1001: Pruned MediaWiki: 1.35.0-wmf.40 (duration: 10m 24s)
* 14:37 papaul: replacing msw-b5,b6,b7 and b8
* 14:30 hashar: Cleaning old MediaWiki versions that were never removed
* 14:27 hashar@deploy1001: sync aborted: testwikis wikis to 1.36.0-wmf.4 (duration: 72m 36s)
* 14:10 hashar: mw1319: scap pull
* 13:27 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
* 13:27 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
* 13:23 oblivian@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
* 13:16 oblivian@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
* 13:14 hashar@deploy1001: Started scap: testwikis wikis to 1.36.0-wmf.4
* 13:12 hashar: Applied 1.36.0-wmf.4 security patches # [[phab:T257972|T257972]]
* 13:03 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
* 13:03 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
* 12:52 kormat: uploaded wmfmariadbpy 0.2 packages to apt1001
* 12:36 oblivian@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
* 12:36 oblivian@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
* 11:54 marostegui: Install new MariaDB 10.4.14 on db2102
* 11:42 hnowlan@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
* 11:30 hnowlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 11:18 Urbanecm: EU B&C window done
* 11:08 kartik@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit{{!}}619255{{!}}Enable ContentTranslation in Sundanese WP as a default tool (T258502)]] (duration: 00m 59s)
* 10:39 volans: migrating *all* eqiad mgmt DNS records to the autogenerated ones via Netbox - [[phab:T233183|T233183]]
* 10:38 volans@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 10:34 volans@cumin1001: START - Cookbook sre.dns.netbox
* 10:20 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh (exit_code=0)
* 10:01 elukey@cumin1001: START - Cookbook sre.hadoop.change-distro-from-cdh
* 10:00 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0)
* 09:51 elukey@cumin1001: START - Cookbook sre.hadoop.stop-cluster
* 09:29 volans@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 09:25 volans@cumin1001: START - Cookbook sre.dns.netbox
* 09:11 marostegui: Rename tables on muswiki and mhwiktionary on s3 master (db1123) without replication [[phab:T260112|T260112]]
* 09:01 volans: renewed puppet certificate on scb1001.eqiad.wmnet
* 08:52 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|e6ec237b6b6fb67a0a80613909589bc724f5eecf}}: Revert "Turn muswiki and mhwiktionary to read-only" ([[phab:T259004|T259004]]) (duration: 00m 58s)
* 08:45 urbanecm@deploy1001: Synchronized dblists/: {{Gerrit|81f4594b6c583f938821549b3a1800fec5b120bb}}: Point muswiki and mhwiktionary to s5 ([[phab:T259004|T259004]]; 3/3) (duration: 00m 58s)
* 08:44 urbanecm@deploy1001: Synchronized wmf-config/db-eqiad.php: {{Gerrit|81f4594b6c583f938821549b3a1800fec5b120bb}}: Point muswiki and mhwiktionary to s5 ([[phab:T259004|T259004]]; 2/3) (duration: 00m 58s)
* 08:43 urbanecm@deploy1001: Synchronized wmf-config/db-codfw.php: {{Gerrit|81f4594b6c583f938821549b3a1800fec5b120bb}}: Point muswiki and mhwiktionary to s5 ([[phab:T259004|T259004]]; 1/3) (duration: 01m 02s)
* 08:06 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|a04bc1f27e4ef4e38002d546d30bfd2d1dc60d0e}}: Turn muswiki and mhwiktionary to read-only ([[phab:T259004|T259004]]) (duration: 01m 01s)
* 08:04 ayounsi@cumin1001: END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0)
* 06:54 ayounsi@cumin1001: START - Cookbook sre.network.prepare-upgrade
* 06:45 XioNoX: Re-prioritize peering over transit eqiad/esams - [[phab:T259614|T259614]]
* 01:59 tstarling@deploy1001: Synchronized wmf-config/PoolCounterSettings.php: enabling fast stale mode [[phab:T250248|T250248]] (duration: 00m 58s)
* 00:33 dpifke@deploy1001: Finished deploy [performance/arc-lamp@fc5f1c6]: Deploying latest attempt to fix [[phab:T259167|T259167]] (duration: 01m 03s)
* 00:31 dpifke@deploy1001: Started deploy [performance/arc-lamp@fc5f1c6]: Deploying latest attempt to fix [[phab:T259167|T259167]]
* 00:24 mutante: reverting switch of releases.wikimedia.org for today since releases-jenkins.wikimedia.org is tied to it and new jenkins still needs some config and plugins ([[phab:T247652|T247652]])
* 00:08 mutante: releases-jenkins.wikimedia.org currently under maintenance ([[phab:T247652|T247652]])
 
== 2020-08-10 ==
* 23:56 eileen: tools revision changed from {{Gerrit|22550f38c5}} to {{Gerrit|9a89f45974}}
* 23:53 mutante: https://releases.wikimedia.org switched to new backends running Debian buster. files have been synced. httpbb tests have been created and pass. ([[phab:T247652|T247652]])
* 23:52 mutante: https://releases.wikimedia.org switched to new backends running Debian buster. files have been synced of course.
* 20:13 hashar: Updated container for Jenkins job operations-puppet-tests-buster-docker https://gerrit.wikimedia.org/r/c/integration/config/+/619359/
* 20:10 ejegg: updated payments-wiki from {{Gerrit|932aacde54}} to {{Gerrit|f199c071c3}}
* 18:32 ryankemper@deploy1001: Finished deploy [wdqs/wdqs@3e12dbb]: 0.3.44 (duration: 15m 18s)
* 18:20 volans@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 18:17 volans@cumin1001: START - Cookbook sre.dns.netbox
* 18:17 ryankemper@deploy1001: Started deploy [wdqs/wdqs@3e12dbb]: 0.3.44
* 18:13 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable Special:Investigate on frwiki ([[phab:T257891|T257891]]) (duration: 00m 58s)
* 18:07 catrope@deploy1001: Synchronized wmf-config/CommonSettings.php: Explicitly disable nativeGallery in Parsoid settings (no-op) (duration: 00m 58s)
* 18:04 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Bump the weight of near match for search ([[phab:T257922|T257922]]) (duration: 00m 59s)
* 17:56 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
* 17:54 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 17:54 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 17:52 robh@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 17:47 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
* 17:50 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
* 17:38 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1072.eqiad.wmnet with OS bullseye
* 17:49 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: EventStreamConfig - Add eventgate-analytics streams - [[phab:T251935|T251935]] (duration: 01m 02s)
* 17:29 vgutierrez: test trafficserver 9.1.2-1wm2 in cp6016 - [[phab:T309651|T309651]]
* 17:46 robh@cumin1001: START - Cookbook sre.dns.netbox
* 17:15 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1072.eqiad.wmnet with reason: host reimage
* 17:38 otto@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
* 17:13 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1072.eqiad.wmnet with reason: host reimage
* 17:38 otto@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
* 17:00 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1072.eqiad.wmnet with OS bullseye
* 17:37 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:54 bking@cumin1001: END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
* 17:34 otto@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
* 16:54 bking@cumin1001: START - Cookbook sre.elasticsearch.force-shard-allocation
* 17:34 otto@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
* 16:53 bking@cumin1001: END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
* 17:31 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
* 16:53 bking@cumin1001: START - Cookbook sre.elasticsearch.force-shard-allocation
* 17:31 otto@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
* 16:26 bking@deploy1002: helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
* 17:31 otto@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
* 16:26 bking@deploy1002: helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
* 17:16 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:01 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1069.eqiad.wmnet with OS bullseye
* 17:15 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
* 15:45 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1069.eqiad.wmnet with reason: host reimage
* 17:14 volans@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 15:42 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1069.eqiad.wmnet with reason: host reimage
* 17:13 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 15:32 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 17:12 volans@cumin1001: START - Cookbook sre.dns.netbox
* 15:31 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 17:06 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
* 15:31 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 16:08 hnowlan@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
* 15:30 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1069.eqiad.wmnet with OS bullseye
* 16:04 volans@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 15:28 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 16:03 hnowlan@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
* 15:27 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1058.eqiad.wmnet with OS bullseye
* 15:59 volans@cumin1001: START - Cookbook sre.dns.netbox
* 15:08 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1058.eqiad.wmnet with reason: host reimage
* 15:59 volans@cumin1001: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
* 15:05 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1058.eqiad.wmnet with reason: host reimage
* 15:59 volans@cumin1001: START - Cookbook sre.dns.netbox
* 14:59 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
* 15:55 hnowlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 14:59 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
* 15:32 hnowlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* m: finished running 'homer "status:active" commit "netmon: Add the netmon1003 host as a syslog destination"' in the cumin1001 host. Homer reported no errors.
* 15:17 hnowlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 14:54 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
* 15:11 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
* 14:50 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1058.eqiad.wmnet with OS bullseye
* 15:02 hnowlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 14:28 bking@cumin1001: conftool action : set/pooled=false; selector: dnsdisc=wdqs,name=codfw
* 15:01 volans@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:57 kevinbazira@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
* 14:48 volans@cumin1001: START - Cookbook sre.dns.netbox
* 13:57 kevinbazira@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
* 14:16 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* m: Add the new netmon1003 host as a syslog destination in homer templates/common/system.conf https://gerrit.wikimedia.org/r/c/operations/homer/public/+/819124
* 14:15 otto@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
* m: Successfully ran '# run-puppet-merge' in the netmon1002 and netmon1003 hosts.
* 14:15 otto@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
* m: Running '# run-puppet-agent' in the netmon1003 host
* 14:14 andrew@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
* m: Running '# run-puppet-agent' in the netmon1002 host
* 14:14 andrew@cumin1001: START - Cookbook sre.hosts.downtime
* 13:47 ryankemper@cumin1001: END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
* 14:14 andrew@cumin1001: START - Cookbook sre.hosts.downtime
* 13:46 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.force-shard-allocation
* 13:55 XioNoX: Re-prioritize peering over transit - codfw - [[phab:T259614|T259614]]
* m: puppet-merge on puppetmaster2004.codfw.wmnet for patch 819179 succeeded
* 12:34 XioNoX: Re-prioritize peering over transit - eqsin - [[phab:T259614|T259614]]
* m: Set netmon1003 as netmon_server and netmon1002 as a netmon_servers_failover in the Puppet repository https://gerrit.wikimedia.org/r/c/operations/puppet/+/819179
* 12:07 XioNoX: standardize cr1-eqiad interfaces
* m: authdns updated successfully
* 11:56 Urbanecm: EU B&C window done
* m: Had to revert https://gerrit.wikimedia.org/r/c/operations/dns/+/819177 because I rebased my changes incorrectly, sent the new patch in https://gerrit.wikimedia.org/r/c/operations/dns/+/821746
* 11:55 Urbanecm: Run `mwscript namespaceDupes.php --wiki=tiwiki --fix` at mwmaint1002 ([[phab:T259295|T259295]])
* m: running '# authdns-update' in ns0.wikimedia.org
* 11:54 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|14b2897760d2658be70765fbb8e56ad552ce7a81}}: Define Portal namespace for tiwiki ([[phab:T259295|T259295]]) (duration: 00m 59s)
* m: Flip DNS for LibreNMS and Smokeping from netmon1002 to netmon1003 https://gerrit.wikimedia.org/r/c/operations/dns/+/819177
* 11:49 urbanecm@deploy1001: Synchronized static/images/project-logos/: {{Gerrit|bbbf7018b1b4db1a99f4828d1907c77a19158884}}: Regenerate Bengali Wikipedia logo from source SVG ([[phab:T259292|T259292]]) (duration: 00m 59s)
* 13:23 jynus: stop replication on db1117:m1 [[phab:T309074|T309074]]
* 11:41 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|0d8366f6951afae25617bea9402aa52e26f34e5d}}: Search Work NS by default at bnwikisource ([[phab:T258982|T258982]]) (duration: 00m 59s)
* m: netmon1002 to netmon1003 failover
* 11:37 Urbanecm: Run `mwscript namespaceDupes.php --wiki=hywiki --fix` at mwmaint1002 ([[phab:T259987|T259987]])
* 13:17 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
* 11:35 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|177148777e9c11a9f936d5d8b4d1c201ba9bf7fb}}: add two extra namespaces for hywiki ([[phab:T259987|T259987]]) (duration: 00m 59s)
* 13:16 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
* 11:28 Urbanecm: Purge https://en.wikipedia.org/static/images/project-logos/shnwiktionary*.png with purgeList.php ([[phab:T260010|T260010]])
* 10:58 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
* 11:27 XioNoX: standardize cr2-eqiad interfaces
* 09:53 vgutierrez: rolling restart of pybal in eqsin - [[phab:T310070|T310070]]
* 11:27 urbanecm@deploy1001: Synchronized static/images/project-logos/: {{Gerrit|c5c96ca419b3ab13c90cb55646be2aa9a07c8527}}: Regenerate shnwiktionary logo from source svg ([[phab:T260010|T260010]]) (duration: 00m 58s)
* 09:25 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
* 11:21 XioNoX: repool ulsfo
* 09:24 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
* 11:17 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|a15e3a22da13af89e9a2b76fdb24d6b5bebe6ec4}}: Increase autoconfirmed threshold for Chinese Wikinews to 7 days and 20 edits at least ([[phab:T259869|T259869]]) (duration: 00m 58s)
* 09:24 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
* 11:13 XioNoX: Re-prioritize peering over transit - ulsfo - [[phab:T259614|T259614]]
* 09:12 vgutierrez: rolling restart of pybal in codfw - [[phab:T310070|T310070]]
* 11:13 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|ba0b2abc0414940749c28a1f82ffbbfd94cd0fc5}}: Create TemplateEditor group on zhwiki ([[phab:T260012|T260012]]) (duration: 00m 58s)
* 08:47 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
* 11:09 Urbanecm: Run mwscript namespaceDupes.php --wiki=ptwikinews --fix --add-prefix=[[phab:T259959|T259959]] ([[phab:T259959|T259959]])
* 08:30 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
* 11:09 Urbanecm: Run mwscript namespaceDupes.php --wiki=ptwikinews --fix ([[phab:T259959|T259959]])
* 08:28 elukey@deploy1002: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
* 11:07 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|010f63ed64c599712e9ac11ed7fced666cc88ca1}}: Add WN as an alias to project namespace in Portuguese Wikinews ([[phab:T259959|T259959]]) (duration: 00m 58s)
* 08:27 elukey@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
* 11:06 urbanecm@deploy1001: sync-file aborted: {{Gerrit|010f63ed64c599712e9ac11ed7fced666cc88ca1}}: Add WN as an alias to project namespace in Portuguese Wikinews ([[phab:T259959|T259959]]) (duration: 00m 00s)
* 08:27 elukey@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
* 10:44 jdrewniak@deploy1001: Synchronized portals: Wikimedia Portals Update: [[gerrit:619273{{!}} Bumping portals to master (T128546)]] (duration: 00m 58s)
* 08:27 elukey@deploy1002: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
* 10:43 jdrewniak@deploy1001: Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:619273{{!}} Bumping portals to master (T128546)]] (duration: 01m 01s)
* 08:26 elukey@deploy1002: helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
* 10:42 jayme@cumin1001: END (PASS) - Cookbook sre.discovery.pool (exit_code=0)
* 08:24 jynus: starting data check using es1021 and es2021, expect increased read traffic [[phab:T314559|T314559]]
* 10:37 jayme@cumin1001: START - Cookbook sre.discovery.pool
* 08:21 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
* 10:36 jayme@cumin1001: END (FAIL) - Cookbook sre.discovery.pool (exit_code=99)
* 06:22 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance
* 10:36 jayme@cumin1001: START - Cookbook sre.discovery.pool
* 06:22 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance
* 10:33 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 06:19 Amir1: dbmaint s5@eqiad ([[phab:T312863|T312863]] [[phab:T312984|T312984]] [[phab:T310011|T310011]] [[phab:T310485|T310485]])
* 10:32 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 06:11 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1130.eqiad.wmnet with reason: Maint
* 10:29 jayme@cumin1001: END (PASS) - Cookbook sre.discovery.depool (exit_code=0)
* 06:11 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1130.eqiad.wmnet with reason: Maint
* 10:26 hnowlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 06:08 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool db1130 [[phab:T314370|T314370]]', diff saved to https://phabricator.wikimedia.org/P32323 and previous config saved to /var/cache/conftool/dbconfig/20220809-060836-ladsgroup.json
* 10:26 hnowlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 06:07 oblivian@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 10:23 jayme@cumin1001: START - Cookbook sre.discovery.depool
* 06:02 ladsgroup@cumin1001: dbctl commit (dc=all): 'Promote db1100 to s5 primary and set section read-write [[phab:T314370|T314370]]', diff saved to https://phabricator.wikimedia.org/P32322 and previous config saved to /var/cache/conftool/dbconfig/20220809-060159-ladsgroup.json
* 10:19 jayme@cumin1001: END (PASS) - Cookbook sre.discovery.pool (exit_code=0)
* 06:01 ladsgroup@cumin1001: dbctl commit (dc=all): 'Set s5 eqiad as read-only for maintenance - [[phab:T314370|T314370]]', diff saved to https://phabricator.wikimedia.org/P32321 and previous config saved to /var/cache/conftool/dbconfig/20220809-060105-ladsgroup.json
* 10:18 jayme@cumin1001: START - Cookbook sre.discovery.pool
* 06:00 Amir1: Starting s5 eqiad failover from db1130 to db1100 - [[phab:T314370|T314370]]
* 10:14 volans@cumin1001: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
* 05:12 ladsgroup@cumin1001: dbctl commit (dc=all): 'Set db1100 with weight 0 [[phab:T314370|T314370]]', diff saved to https://phabricator.wikimedia.org/P32320 and previous config saved to /var/cache/conftool/dbconfig/20220809-051251-ladsgroup.json
* 10:10 volans@cumin1001: START - Cookbook sre.dns.netbox
* 05:12 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 22 hosts with reason: Primary switchover s5 [[phab:T314370|T314370]]
* 10:07 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
* 05:11 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on 22 hosts with reason: Primary switchover s5 [[phab:T314370|T314370]]
* 10:04 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single
* 02:42 ejegg: SmashPig upgraded from {{Gerrit|9b97ea15}} to {{Gerrit|13e9e9cc}}
* 09:56 hashar: Updated container for Jenkins job operations-dns-lint-docker https://gerrit.wikimedia.org/r/619267
* 02:31 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1148 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32318 and previous config saved to /var/cache/conftool/dbconfig/20220809-023113-ladsgroup.json
* 09:55 hashar: Updated container for Jenkins job operations-puppet-tests-buster-docker https://gerrit.wikimedia.org/r/619266
* 02:31 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance
* 09:54 jayme@cumin1001: END (PASS) - Cookbook sre.discovery.depool (exit_code=0)
* 02:30 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance
* 09:49 jayme@cumin1001: START - Cookbook sre.discovery.depool
* 02:30 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32317 and previous config saved to /var/cache/conftool/dbconfig/20220809-023052-ladsgroup.json
* 09:21 marostegui: Promote dbproxy1019 back [[phab:T255408|T255408]]
* 02:28 ejegg: payments-wiki upgraded from {{Gerrit|6880236d}} to {{Gerrit|cf5e1848}}
* 08:23 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 02:15 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P32316 and previous config saved to /var/cache/conftool/dbconfig/20220809-021546-ladsgroup.json
* 08:21 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
* 02:00 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P32315 and previous config saved to /var/cache/conftool/dbconfig/20220809-020040-ladsgroup.json
* 06:43 marostegui: Remove revision triggers from db2094:3318 [[phab:T238966|T238966]]
* 01:45 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32314 and previous config saved to /var/cache/conftool/dbconfig/20220809-014534-ladsgroup.json
* 06:42 marostegui: Stop replication on s8 codfw master to deploy MCR change, this will generate lag on s8 codfw [[phab:T238966|T238966]]
* 04:46 marostegui: Depool dbproxy1019 for reimage [[phab:T255408|T255408]]
 
== 2020-08-09 ==
* 21:58 ejegg: updated payments-wiki from {{Gerrit|cd012f37f1}} to {{Gerrit|932aacde54}}
* 03:53 ryankemper@cumin1001: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)


== 2020-08-08 ==
== 2022-08-08 ==
* 02:23 ryankemper@cumin1001: START - Cookbook sre.wdqs.data-reload
* 23:52 tstarling@deploy1002: Synchronized wmf-config/InitialiseSettings.php: clean up testwiki experiments [[phab:T314750|T314750]] (duration: 03m 19s)
* 02:21 ryankemper@cumin1001: END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97)
* 23:47 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 02:19 ryankemper@cumin1001: START - Cookbook sre.wdqs.data-reload
* 23:46 tstarling@deploy1002: Synchronized wmf-config/CommonSettings.php: clean up testwiki experiments [[phab:T314750|T314750]] (duration: 03m 27s)
* 23:46 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 23:46 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 23:45 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 23:32 eileen___: config revision changed from {{Gerrit|f5668044}} to 787cd0e0
* 23:32 eileen___: civicrm upgraded from {{Gerrit|497bddf7}} to {{Gerrit|1f91ac2d}}
* 22:16 ryankemper@cumin1001: END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 22:16 ryankemper@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host elastic1065.eqiad.wmnet with OS bullseye
* 21:53 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1065.eqiad.wmnet with reason: host reimage
* 21:50 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1065.eqiad.wmnet with reason: host reimage
* 21:36 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1065.eqiad.wmnet with OS bullseye
* 21:12 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1062.eqiad.wmnet with OS bullseye
* 20:53 ryankemper@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1062.eqiad.wmnet with reason: host reimage
* 20:50 ryankemper@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1062.eqiad.wmnet with reason: host reimage
* 20:36 ryankemper@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1062.eqiad.wmnet with OS bullseye
* 20:32 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:31 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:31 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:29 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 20:28 cjming: end of UTC late backport window
* 20:27 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:27 cjming@deploy1002: Synchronized php-1.39.0-wmf.23/skins/Vector/resources/skins.vector.styles/layouts/grid.less: Backport: [[gerrit:821243{{!}}Fix grid blowout bug (T314756)]] (duration: 03m 26s)
* 20:12 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:11 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:11 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:11 cjming@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:817785{{!}}Disable sticky header edit A/B test for pilot wikis (T312296)]] (duration: 03m 35s)
* 20:08 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 17:34 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1088.eqiad.wmnet with OS bullseye
* 17:15 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1088.eqiad.wmnet with reason: host reimage
* 17:12 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1088.eqiad.wmnet with reason: host reimage
* 17:00 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1088.eqiad.wmnet with OS bullseye
* 16:54 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1085.eqiad.wmnet with OS bullseye
* 16:49 ryankemper@cumin1001: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 16:43 pt1979@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:41 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
* 16:39 pt1979@cumin2002: START - Cookbook sre.dns.netbox
* 16:38 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
* 16:26 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1085.eqiad.wmnet with OS bullseye
* 16:24 bking@cumin1001: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host elastic1085.eqiad.wmnet with OS bullseye
* 16:19 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
* 16:16 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1085.eqiad.wmnet with reason: host reimage
* 16:16 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 16:14 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
* 16:12 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
* 16:10 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
* 16:09 ryankemper@cumin1001: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 16:04 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1085.eqiad.wmnet with OS bullseye
* 16:00 bking@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1084.eqiad.wmnet with OS bullseye
* 15:58 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - [[phab:T289135|T289135]]
* 15:47 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1084.eqiad.wmnet with reason: host reimage
* 15:46 sukhe: upload reprepro -C main include bullseye-wikimedia python-pynetbox_6.6.0-1+wmf11u1_amd64.changes
* 15:45 bking@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1084.eqiad.wmnet with reason: host reimage
* 15:37 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2021.codfw.wmnet with reason: Maint
* 15:37 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2021.codfw.wmnet with reason: Maint
* 15:32 bking@cumin1001: START - Cookbook sre.hosts.reimage for host elastic1084.eqiad.wmnet with OS bullseye
* 14:59 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
* 14:55 elukey@deploy1002: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
* 14:47 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cp5001.eqsin.wmnet with reason: depooled: faulty DIMM: [[phab:T314256|T314256]]
* 14:46 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on cp5001.eqsin.wmnet with reason: depooled: faulty DIMM: [[phab:T314256|T314256]]
* 14:34 elukey@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
* 14:11 kevinbazira@deploy1002: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
* 13:03 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:01 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:01 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 12:58 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 12:56 urbanecm@deploy1002: Synchronized wmf-config/CommonSettings.php: {{Gerrit|77fd5abdd7d9462869259e1511bbcf2d7ce62246}}: Growth: Add new rights to wgAvailableRights (duration: 03m 24s)
* 12:30 btullis@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1102.eqiad.wmnet
* 12:09 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 12:09 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 12:06 urbanecm@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/GrowthExperiments/: {{Gerrit|3eaf155678b7313c55dcca0cd39ab29f73eead37}}: MentorTools: Do not use MentorWeightManager ([[phab:T314362|T314362]]) (duration: 03m 31s)
* 12:04 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 11:43 btullis@cumin1001: START - Cookbook sre.hosts.reboot-single for host an-worker1102.eqiad.wmnet
* 11:21 jelto@cumin1001: conftool action : set/pooled=yes; selector: name=kubernetes2022.codfw.wmnet
* 11:21 jelto: kubectl uncordon kubernetes2022.codfw.wmnet
* 10:43 Amir1: Removing db2079 from orchestrator ([[phab:T313885|T313885]])
* 10:39 Amir1: Removing db2079 from zarcillo ([[phab:T313885|T313885]])
* 10:35 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2079.codfw.wmnet
* 10:35 ladsgroup@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 10:30 ladsgroup@cumin1001: START - Cookbook sre.dns.netbox
* 10:25 ladsgroup@cumin1001: START - Cookbook sre.hosts.decommission for hosts db2079.codfw.wmnet
* 10:18 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on db2079.codfw.wmnet with reason: Decom
* 10:17 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on db2079.codfw.wmnet with reason: Decom
* 08:41 jbond: deploy libtirpc update
* 07:57 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1144:3314 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32310 and previous config saved to /var/cache/conftool/dbconfig/20220808-075723-ladsgroup.json
* 07:57 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
* 07:57 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
* 07:57 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32309 and previous config saved to /var/cache/conftool/dbconfig/20220808-075702-ladsgroup.json
* 07:53 godog: grow sda/sdb 3 by 100G on thanos-be2001 - [[phab:T314275|T314275]]
* 07:50 godog: grow sda/sdb 3 by 100G on thanos-be1004 - [[phab:T314275|T314275]]
* 07:41 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P32308 and previous config saved to /var/cache/conftool/dbconfig/20220808-074156-ladsgroup.json
* 07:32 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 07:27 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 07:27 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 07:26 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P32307 and previous config saved to /var/cache/conftool/dbconfig/20220808-072650-ladsgroup.json
* 07:23 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 07:22 kartik@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820815{{!}}trwikivoyage: Create rollbacker user group (T314678)]] (duration: 03m 17s)
* 07:18 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 07:17 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 07:17 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 07:16 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 07:11 elukey: restart rsyslog on ml-serve2007
* 07:11 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 07:11 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32306 and previous config saved to /var/cache/conftool/dbconfig/20220808-071144-ladsgroup.json
* 07:10 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 07:10 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 07:09 kartik@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820261{{!}}Enable SectionTranslation on 10 Wikipedias where ContentTranslation is default (T308829)]] (duration: 03m 15s)
* 07:09 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 07:06 XioNoX: add CSP headers to Netbox - [[phab:T296356|T296356]]
* 07:05 elukey: restart rsyslog on ml-serve-ctrl2001


== 2020-08-07 ==
== 2022-08-07 ==
* 16:42 jforrester@deploy1001: Synchronized php-1.36.0-wmf.3/extensions/DiscussionTools/: [[phab:T259855|T259855]] Revert new reply API (duration: 01m 06s)
* 19:58 taavi: taavi@mwmaint1002 ~ $ echo "https://upload.wikimedia.org/wikipedia/commons/1/15/Keep_tidy_ask.svg" {{!}} mwscript purgeList.php --wiki enwiki # [[phab:T314712|T314712]]
* 15:01 volans: import DNS names for network devices in Netbox - [[phab:T258729|T258729]]
* 13:52 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1142 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32305 and previous config saved to /var/cache/conftool/dbconfig/20220807-135204-ladsgroup.json
* 13:27 godog: bounce pybal on lvs1016 and then lvs1015 to reset state, logstash1025 reported down but actually up
* 13:51 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance
* 10:27 sukhe@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 13:51 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance
* 10:27 sukhe@cumin1001: START - Cookbook sre.hosts.downtime
* 13:51 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32304 and previous config saved to /var/cache/conftool/dbconfig/20220807-135143-ladsgroup.json
* 10:02 elukey: reboot deneb via ganeti2021 (hostname config pointing to recdns for some reason)
* 13:36 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P32303 and previous config saved to /var/cache/conftool/dbconfig/20220807-133637-ladsgroup.json
* 09:15 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1092', diff saved to https://phabricator.wikimedia.org/P12195 and previous config saved to /var/cache/conftool/dbconfig/20200807-091527-marostegui.json
* 13:21 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P32302 and previous config saved to /var/cache/conftool/dbconfig/20220807-132131-ladsgroup.json
* 08:47 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1092', diff saved to https://phabricator.wikimedia.org/P12194 and previous config saved to /var/cache/conftool/dbconfig/20200807-084747-marostegui.json
* 13:06 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32301 and previous config saved to /var/cache/conftool/dbconfig/20220807-130625-ladsgroup.json
* 08:07 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1092', diff saved to https://phabricator.wikimedia.org/P12193 and previous config saved to /var/cache/conftool/dbconfig/20200807-080719-marostegui.json
* 12:06 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1141 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32300 and previous config saved to /var/cache/conftool/dbconfig/20220807-120610-ladsgroup.json
* 07:50 godog: prometheus codfw lvextend --resize --size +60G /dev/mapper/vg--hdd-prometheus--global
* 12:06 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance
* 07:49 godog: prometheus codfw lvextend --resize --size +30G /dev/mapper/vg--ssd-prometheus--k8s
* 12:05 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance
* 07:46 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1092', diff saved to https://phabricator.wikimedia.org/P12192 and previous config saved to /var/cache/conftool/dbconfig/20200807-074658-marostegui.json
* 12:05 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32299 and previous config saved to /var/cache/conftool/dbconfig/20220807-120549-ladsgroup.json
* 06:53 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 11:50 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P32298 and previous config saved to /var/cache/conftool/dbconfig/20220807-115043-ladsgroup.json
* 06:51 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
* 11:35 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P32297 and previous config saved to /var/cache/conftool/dbconfig/20220807-113537-ladsgroup.json
* 06:34 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1092 for upgrade', diff saved to https://phabricator.wikimedia.org/P12191 and previous config saved to /var/cache/conftool/dbconfig/20200807-063431-marostegui.json
* 11:20 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32296 and previous config saved to /var/cache/conftool/dbconfig/20220807-112031-ladsgroup.json


== 2020-08-06 ==
== 2022-08-06 ==
* 23:21 catrope@deploy1001: Synchronized php-1.36.0-wmf.3/extensions/GrowthExperiments/: Fixes for WelcomeSurvey language question ([[phab:T232410|T232410]]) (duration: 00m 59s)
* 17:59 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1149 ([[phab:T312863|T312863]])', diff saved to https://phabricator.wikimedia.org/P32295 and previous config saved to /var/cache/conftool/dbconfig/20220806-175916-ladsgroup.json
* 23:04 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Change GrowthExperiments mentor list on fawiki ([[phab:T253291|T253291]]) (duration: 00m 59s)
* 17:59 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance
* 21:43 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 17:58 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance
* 21:41 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 03:10 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 21:40 andrew@cumin1001: START - Cookbook sre.hosts.downtime
* 03:09 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 21:39 mholloway-shell@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
* 03:09 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 21:39 andrew@cumin1001: START - Cookbook sre.hosts.downtime
* 03:08 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 21:35 mholloway-shell@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
* 03:03 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 21:33 brennen@deploy1001: Synchronized php-1.36.0-wmf.3/vendor: [[gerrit:618850{{!}}Update git submodules (vendor)]] ([[phab:T259832|T259832]]) (duration: 01m 08s)
* 03:02 krinkle@deploy1002: Synchronized w/: {{Gerrit|I9067d47fab0324}} (duration: 03m 25s)
* 21:32 mholloway-shell@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
* 03:02 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:51 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
* 03:02 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:51 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
* 03:01 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:47 shdubsh: restart logstash -- pipeline appears stuck
* 02:41 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:38 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
* 02:39 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 20:38 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
* 02:39 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 20:19 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
* 02:38 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:19 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
* 02:38 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance
* 20:19 brennen: manually updating the vendor submodule on 1.36.0 for [[phab:T259832|T259832]]
* 02:37 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance
* 20:15 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
* 02:33 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 20:15 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
* 02:32 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 19:48 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
* 02:32 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 19:47 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
* 02:31 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 19:45 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: EventStreamConfig - wgEventStreams - fix another typo in eventgate stream config - [[phab:T251935|T251935]] (duration: 00m 58s)
* 19:40 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: EventStreamConfig - wgEventStreams - fix typo in eventgate stream config - [[phab:T251935|T251935]] (duration: 00m 59s)
* 19:26 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
* 19:26 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
* 19:04 brennen@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.3
* 18:58 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
* 18:57 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
* 18:29 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
* 18:29 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
* 18:21 Urbanecm: Morning B&C window was completed
* 18:20 urbanecm@deploy1001: Synchronized php-1.36.0-wmf.3/extensions/GrowthExperiments/modules/: {{Gerrit|fb4a80830d7d915479e097cc82c681c5fb03d51b}}: Fix "Ask mentor" help panel button styling ([[phab:T250235|T250235]]) (duration: 01m 07s)
* 18:11 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
* 18:11 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
* 18:10 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|9db96595695b5ec1144c078e8961b3c04e8983cf}}: Remove temporary logging for mediamoderation ([[phab:T259742|T259742]]) (duration: 01m 07s)
* 18:07 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|9695811a30de30471a81b6ad05aa5e625f52caf1}}: : Enable DiscussionTools as a beta feature on 8 more wikis ("phase 1") ([[phab:T259574|T259574]]) (duration: 01m 06s)
* 17:42 brennen@deploy1001: Synchronized php: group1 wikis to 1.36.0-wmf.3 (duration: 01m 06s)
* 17:41 brennen@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.3
* 17:37 brennen: train 1.36.0-wmf.3: proceeding to group1
* 17:36 brennen@deploy1001: Synchronized php-1.36.0-wmf.3/extensions/WikibaseMediaInfo/src/View/MediaInfoEntityTermsView.php: Backport: [[gerrit:618582{{!}}Fix array unpacking as argument list]] ([[phab:T259745|T259745]]) (duration: 01m 07s)
* 16:32 chrisalbon@deploy1001: Finished deploy [ores/deploy@f3c44be]: [[phab:T258435|T258435]] (duration: 14m 12s)
* 16:18 dpifke@deploy1001: Finished deploy [performance/arc-lamp@7838c88]: Deploying fixes for [[phab:T259167|T259167]] (duration: 00m 05s)
* 16:18 dpifke@deploy1001: Started deploy [performance/arc-lamp@7838c88]: Deploying fixes for [[phab:T259167|T259167]]
* 16:18 chrisalbon@deploy1001: Started deploy [ores/deploy@f3c44be]: [[phab:T258435|T258435]]
* 15:40 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
* 15:40 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
* 15:10 fdans@deploy1001: Finished deploy [analytics/refinery@97a02a3]: Regular analytics weekly train [analytics/refinery@97a02a3 (duration: 20m 01s)
* 14:50 fdans@deploy1001: Started deploy [analytics/refinery@97a02a3]: Regular analytics weekly train [analytics/refinery@97a02a3
* 14:00 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: EventStreamConfig - Add eventgate-* test.event streams - [[phab:T251935|T251935]] (duration: 01m 08s)
* 13:32 jayme: updated helm to 2.16.9-2 on contint*, deploy* and chartmuseum*
* 13:24 jayme: imported helm_2.16.9-2 and tiller_2.16.9-2 to buster-wikimedia, jessie-wikimedia and stretch-wikimedia
* 12:06 kart_: Updated cxserver to 2020-08-05-070016-production ([[phab:T258919|T258919]], [[phab:T199523|T199523]], [[phab:T257943|T257943]], [[phab:T256194|T256194]])
* 12:03 kartik@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' .
* 11:59 kartik@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' .
* 11:57 kartik@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
* 11:54 Lucas_WMDE: EU backport window done
* 11:54 lucaswerkmeister-wmde@deploy1001: Synchronized php-1.36.0-wmf.3/extensions/Flow/: Backport: [[gerrit:618580{{!}}Pass jQuery objects into jqueryMsg]] (duration: 01m 09s)
* 11:53 XioNoX: reboot cr2-eqord - [[phab:T259621|T259621]]
* 11:37 XioNoX: drain traffic away cr2-eqord - [[phab:T259621|T259621]]
* 11:27 lucaswerkmeister-wmde@deploy1001: Synchronized php-1.36.0-wmf.3/extensions/Wikibase/lib/: Backport: [[gerrit:618579{{!}}Fix CachingFallbackLabelDescriptionLookup failing in edge-cases (T259744)]] (duration: 01m 10s)
* 11:22 XioNoX: reboot cr2-eqdfw - [[phab:T259621|T259621]]
* 11:13 XioNoX: drain traffic away cr2-eqdfw - [[phab:T259621|T259621]]
* 10:52 mvolz@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'zotero' for release 'production' .
* 10:48 mvolz@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'zotero' for release 'production' .
* 10:45 mvolz@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'staging' .
* 10:23 mvolz@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' .
* 10:16 mvolz@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' .
* 10:14 jynus@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 10:12 jynus@cumin2001: START - Cookbook sre.hosts.downtime
* 10:11 mvolz@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' .
* 08:44 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1127', diff saved to https://phabricator.wikimedia.org/P12188 and previous config saved to /var/cache/conftool/dbconfig/20200806-084406-marostegui.json
* 08:37 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1127', diff saved to https://phabricator.wikimedia.org/P12187 and previous config saved to /var/cache/conftool/dbconfig/20200806-083743-marostegui.json
* 08:30 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1127', diff saved to https://phabricator.wikimedia.org/P12186 and previous config saved to /var/cache/conftool/dbconfig/20200806-083033-marostegui.json
* 08:14 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1127', diff saved to https://phabricator.wikimedia.org/P12185 and previous config saved to /var/cache/conftool/dbconfig/20200806-081416-marostegui.json
* 07:03 elukey@cumin1001: END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0)
* 06:57 elukey@cumin1001: START - Cookbook sre.zookeeper.roll-restart-zookeeper
* 06:57 marostegui: Truncate tables on zerowiki [[phab:T227717|T227717]]
* 06:53 elukey@cumin1001: END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0)
* 06:47 elukey@cumin1001: START - Cookbook sre.zookeeper.roll-restart-zookeeper
* 06:43 elukey@cumin1001: END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0)
* 06:37 elukey: roll restart of druid clusters' zookeeper and an-conf* zookeeper for openjdk-11 upgrades
* 06:36 elukey@cumin1001: START - Cookbook sre.zookeeper.roll-restart-zookeeper
* 06:22 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 06:20 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
* 05:07 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1127 for MCR', diff saved to https://phabricator.wikimedia.org/P12184 and previous config saved to /var/cache/conftool/dbconfig/20200806-050743-marostegui.json
* 04:56 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1079', diff saved to https://phabricator.wikimedia.org/P12182 and previous config saved to /var/cache/conftool/dbconfig/20200806-045622-marostegui.json
* 04:51 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1079', diff saved to https://phabricator.wikimedia.org/P12181 and previous config saved to /var/cache/conftool/dbconfig/20200806-045107-marostegui.json
* 04:46 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1079', diff saved to https://phabricator.wikimedia.org/P12180 and previous config saved to /var/cache/conftool/dbconfig/20200806-044608-marostegui.json
* 04:37 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1079', diff saved to https://phabricator.wikimedia.org/P12179 and previous config saved to /var/cache/conftool/dbconfig/20200806-043758-marostegui.json
* 03:04 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=wtp2019.codfw.wmnet
* 02:24 eileen: process-control config revision is {{Gerrit|525eb71235}} turn off delete deleted contacts
* 01:52 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 01:52 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
* 01:19 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
* 01:19 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
* 01:17 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
* 01:17 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
* 00:35 mutante: wtp2019 - reimaging - parsoid service does not work, unlike on all other wtp*, making sure it's clean
* 00:00 mutante: LDAP - removed demon from nda group


== 2020-08-05 ==
== 2022-08-05 ==
* 23:57 eileen: civicrm revision changed from {{Gerrit|150c3476c4}} to {{Gerrit|72452e28a9}}, config revision is {{Gerrit|b6ece03513}}
* 22:20 dcausse@deploy1002: Finished deploy [wikimedia/discovery/analytics@71fe016]: Fix schedule_interval for image_recommendation_weekly (duration: 02m 01s)
* 23:02 shdubsh: logstash in codfw looks stuck -- restarting
* 22:18 dcausse@deploy1002: Started deploy [wikimedia/discovery/analytics@71fe016]: Fix schedule_interval for image_recommendation_weekly
* 19:41 brennen@deploy1001: rebuilt and synchronized wikiversions files: Revert group1 wikis to 1.36.0-wmf.2
* 17:08 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1195.eqiad.wmnet with OS bullseye
* 19:39 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
* 16:54 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1194.eqiad.wmnet with OS bullseye
* 19:37 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
* 16:53 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1195.eqiad.wmnet with reason: host reimage
* 19:13 brennen@deploy1001: Synchronized php: group1 wikis to 1.36.0-wmf.3 (duration: 01m 44s)
* 16:49 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1195.eqiad.wmnet with reason: host reimage
* 19:11 brennen@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.3
* 16:41 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1194.eqiad.wmnet with reason: host reimage
* 18:26 Lucas_WMDE: Morning backport window done
* 16:37 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1194.eqiad.wmnet with reason: host reimage
* 18:25 lucaswerkmeister-wmde@deploy1001: Synchronized php-1.36.0-wmf.3/extensions/ContentTranslation/: Backport: [[gerrit:618566{{!}}Pass jQuery objects into jqueryMsg]] (duration: 01m 11s)
* 16:34 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1195.eqiad.wmnet with OS bullseye
* 18:14 mutante: test !log
* 16:27 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp203[56]\.codfw\.wmnet,service=varnish-fe
* 18:11 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:618343{{!}}Re-enable growth study quick survey (T257015)]] (duration: 01m 12s)
* 16:27 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp203[56]\.codfw\.wmnet,service=ats-be
* 17:30 shdubsh: test prometheus-icinga-exporter upgrade on icinga2001
* 16:27 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp203[56]\.codfw\.wmnet,service=ats-tls
* 16:50 elukey: powercycle stat1005 after GPU issue
* 16:26 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1194.eqiad.wmnet with OS bullseye
* 15:56 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: EventStreamConfig - Add eventgate-logging-external streams and destination_event_service settings - [[phab:T251935|T251935]] (duration: 01m 05s)
* 16:25 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1193.eqiad.wmnet with OS bullseye
* 15:50 hnowlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 16:21 pt1979@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host db1192.eqiad.wmnet with OS bullseye
* 15:43 hnowlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 16:12 dcausse@deploy1002: Finished deploy [wikimedia/discovery/analytics@8489923]: [[phab:T304954|T304954]]: Automate imagesuggestion imports (duration: 02m 03s)
* 15:11 pt1979@cumin2001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:11 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1193.eqiad.wmnet with reason: host reimage
* 15:08 godog: bounce logstash on logstash100[789] - udp loss reported
* 16:11 milimetric@deploy1002: Finished deploy [analytics/refinery@fe7bf9e]: Hotfix for webrequest load refine, now with FORCE :) (duration: 06m 09s)
* 15:05 pt1979@cumin2001: START - Cookbook sre.dns.netbox
* 16:10 dcausse@deploy1002: Started deploy [wikimedia/discovery/analytics@8489923]: [[phab:T304954|T304954]]: Automate imagesuggestion imports
* 14:48 elukey: reboot stat1008 for unexpected maintenance (GPU stuck)
* 16:07 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1193.eqiad.wmnet with reason: host reimage
* 14:33 otto@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
* 16:07 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1192.eqiad.wmnet with reason: host reimage
* 14:32 otto@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
* 16:05 milimetric@deploy1002: Started deploy [analytics/refinery@fe7bf9e]: Hotfix for webrequest load refine, now with FORCE :)
* 14:27 otto@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
* 16:04 milimetric@deploy1002: Finished deploy [analytics/refinery@fe7bf9e]: Hotfix for webrequest load refine (duration: 34m 38s)
* 14:27 otto@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
* 16:03 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1192.eqiad.wmnet with reason: host reimage
* 14:25 moritzm: installing nmap bugfix updates from buster point release
* 15:55 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1193.eqiad.wmnet with OS bullseye
* 14:24 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
* 15:52 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1191.eqiad.wmnet with OS bullseye
* 14:24 otto@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
* 15:51 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1192.eqiad.wmnet with OS bullseye
* 14:20 sukhe@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 15:42 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1190.eqiad.wmnet with OS bullseye
* 14:20 sukhe@cumin1001: START - Cookbook sre.hosts.downtime
* 15:38 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1191.eqiad.wmnet with reason: host reimage
* 14:14 moritzm: installing pillow security updates
* 15:34 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1191.eqiad.wmnet with reason: host reimage
* 14:03 moritzm: installing node-minimist security updates
* 15:30 milimetric@deploy1002: Started deploy [analytics/refinery@fe7bf9e]: Hotfix for webrequest load refine
* 13:51 moritzm: installing Linux update to 4.9.132 from buster point update (no reboots, just the package updates)
* 15:28 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1190.eqiad.wmnet with reason: host reimage
* 13:32 jayme: updated helmfile to 0.125.2-0 and helm-diff to 3.1.2-1 on contint* and deploy*
* 15:25 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1190.eqiad.wmnet with reason: host reimage
* 13:28 volans@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 15:24 jbond: upload trapperkeeper-metrics-clojure to puppet7 component
* 13:24 volans@cumin1001: START - Cookbook sre.dns.netbox
* 15:22 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1191.eqiad.wmnet with OS bullseye
* 13:04 elukey: restart yarn resource managers on an-master100[12] to pick up new Yarn settings - https://gerrit.wikimedia.org/r/c/operations/puppet/+/618529
* 15:19 jbond: upload puppetlabs-http-client-clojur to puppet7 component
* 13:00 moritzm: installing libjpeg-turbo security updates on stretch
* 15:17 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 12:52 XioNoX: netmon1002:/srv/deployment/librenms/librenms$ sudo -u librenms ./lnms migrate
* 15:16 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 12:49 jayme: imported helm-diff_3.1.2-1 to buster-wikimedia, jessie-wikimedia and stretch-wikimedia
* 15:16 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 12:46 moritzm: installing imagemagick security updates on buster
* 15:15 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 12:33 moritzm: installing net-snmp security updates on icinga hosts
* 15:14 dancy@deploy1002: Finished scap: Backport for [[gerrit:820653]] scap gitignore: ignore all files under the `scap` directory (duration: 04m 41s)
* 11:36 awight: EU Bacon reclosed
* 15:11 jbond: upload jolokia to puppet7 component
* 11:36 awight@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:614891{{!}}Switch test wikis to new version of vector by default (3/3) (T254227)]] (duration: 01m 07s)
* 15:10 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1185.eqiad.wmnet with OS bullseye
* 11:29 awight: EU Bacon reopened
* 15:09 dancy@deploy1002: Started scap: Backport for [[gerrit:820653]] scap gitignore: ignore all files under the `scap` directory
* 11:28 awight: EU Bacon complete
* 15:09 jbond: upload test-chuck-clojure to puppet7 component
* 11:26 awight@deploy1001: Synchronized wmf-config: Config: [[gerrit:618481{{!}}FileImporter: full default deployment (T232542)]] (duration: 01m 04s)
* 15:05 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1190.eqiad.wmnet with OS bullseye
* 11:23 jayme: imported helm-diff_3.1.2-0 to jessie-wikimedia and stretch-wikimedia
* 15:04 jbond: upload test-check-clojure to puppet7 component
* 11:22 jayme: imported helm-diff_3.1.2-0 to buster-wikimedia
* 14:57 jbond: upload nippy-clojure to puppet7 component
* 11:19 awight@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:618303{{!}}Add import sources for lijwikisource (T259633)]] (duration: 01m 07s)
* 14:56 pt1979@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1185.eqiad.wmnet with reason: host reimage
* 11:13 awight@deploy1001: sync-file aborted: Config: [[gerrit:618303{{!}}Add import sources for lijwikisource (T259633)]] (duration: 00m 13s)
* 14:52 pt1979@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1185.eqiad.wmnet with reason: host reimage
* 11:10 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:595542{{!}}Enable Data Bridge on Test Wikidata clients (T232584)]] (duration: 01m 20s)
* 14:43 jbond: upload fressian to puppet7 component
* 10:39 XioNoX: reboot cr3-ulsfo - [[phab:T259621|T259621]]
* 14:40 pt1979@cumin1001: START - Cookbook sre.hosts.reimage for host db1185.eqiad.wmnet with OS bullseye
* 10:28 XioNoX: drain traffic away cr3-ulsfo - [[phab:T259621|T259621]]
* 14:40 jbond: upload test-generative-clojure to puppet7 component
* 10:21 moritzm: installing libssh security updates
* 14:35 pt1979@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 10:18 XioNoX: reboot cr4-ulsfo - [[phab:T259621|T259621]]
* 14:34 jbond: upload data-generators-clojure to puppet7 component
* 09:58 XioNoX: drain traffic away cr4-ulsfo
* 14:31 pt1979@cumin2002: START - Cookbook sre.dns.netbox
* 09:53 XioNoX: depool ulsfo - [[phab:T259621|T259621]]
* 14:23 jbond: upload encore-clojure to puppet7 component
* 09:32 elukey: set ticket max renewable lifetime to 7d on all kerberos clients (was zero, the default)
* 14:17 jbond: upload truss-clojure to puppet7 component
* 09:07 jayme: imported helmfile_0.125.2-0 to jessie-wikimedia
* 14:13 jbond: upload structured-logging-clojure to puppet7 component
* 09:07 jayme: imported helmfile_0.125.2-0 to stretch-wikimedia
* 14:06 jbond: upload murphy-clojure to puppet7 component
* 09:05 jayme: imported helmfile_0.125.2-0 to buster-wikimedia
* 13:57 jbond: upload logstash-logback-encoder-7.2 to puppet7 component
* 08:39 marostegui: Remove revision triggers on db1125:3317
* 13:49 jbond: upload kitchensink-clojure to puppet7 component
* 08:39 marostegui: Stop replication on db1079 for MCR, this will generate lag on s7 on labsdb
* 13:27 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool hosts with fragile power supply ([[phab:T314559|T314559]] [[phab:T314628|T314628]])', diff saved to https://phabricator.wikimedia.org/P32292 and previous config saved to /var/cache/conftool/dbconfig/20220805-132709-ladsgroup.json
* 08:39 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1079 for MCR', diff saved to https://phabricator.wikimedia.org/P12173 and previous config saved to /var/cache/conftool/dbconfig/20200805-083916-marostegui.json
* 13:12 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on db2095.codfw.wmnet with reason: Maintenance
* 08:38 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1094', diff saved to https://phabricator.wikimedia.org/P12172 and previous config saved to /var/cache/conftool/dbconfig/20200805-083833-marostegui.json
* 13:12 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on db2095.codfw.wmnet with reason: Maintenance
* 08:29 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1094', diff saved to https://phabricator.wikimedia.org/P12171 and previous config saved to /var/cache/conftool/dbconfig/20200805-082908-marostegui.json
* 13:09 sukhe: repool codfw
* 08:21 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1094', diff saved to https://phabricator.wikimedia.org/P12170 and previous config saved to /var/cache/conftool/dbconfig/20200805-082138-marostegui.json
* 13:02 jbond: upload honeysql-clojure to puppet7 component
* 08:12 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1094', diff saved to https://phabricator.wikimedia.org/P12169 and previous config saved to /var/cache/conftool/dbconfig/20200805-081237-marostegui.json
* 12:53 _joe_: progressive repool of services in codfw
* 07:49 marostegui: Stop mysql on db1117:3323 (this will generate haproxy irc alerts) [[phab:T259589|T259589]]
* 12:24 moritzm: installing nano bugfix updates from bullseye point release
* 07:45 ayounsi@cumin1001: END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0)
* 11:50 hnowlan@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
* 07:39 ayounsi@cumin1001: END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0)
* 11:40 hnowlan@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
* 07:29 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 11:37 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repool after PDU maint on D3 ([[phab:T310146|T310146]])', diff saved to https://phabricator.wikimedia.org/P32291 and previous config saved to /var/cache/conftool/dbconfig/20220805-113729-ladsgroup.json
* 07:27 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
* 11:35 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repool after PDU maint on C6 ([[phab:T310145|T310145]])', diff saved to https://phabricator.wikimedia.org/P32290 and previous config saved to /var/cache/conftool/dbconfig/20220805-113555-ladsgroup.json
* 07:26 moritzm: installing perl security updates on buster
* 11:34 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repool after PDU maint on C5 ([[phab:T310145|T310145]])', diff saved to https://phabricator.wikimedia.org/P32289 and previous config saved to /var/cache/conftool/dbconfig/20220805-113436-ladsgroup.json
* 07:20 moritzm: installing libexif security updates on buster
* 10:46 hnowlan@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
* 07:14 ayounsi@cumin1001: START - Cookbook sre.network.prepare-upgrade
* 10:36 hnowlan@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
* 07:13 ayounsi@cumin1001: START - Cookbook sre.network.prepare-upgrade
* 10:17 hnowlan@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
* 07:04 oblivian@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' .
* 10:12 Amir1: dbmaint at s4@codfw ([[phab:T312863|T312863]])
* 07:04 oblivian@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
* 10:07 hnowlan@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
* 06:59 oblivian@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
* 09:04 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 12 hosts with reason: Maintenance
* 06:59 oblivian@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' .
* 09:03 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 12 hosts with reason: Maintenance
* 06:52 ayounsi@cumin1001: END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0)
* 09:03 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance
* 06:50 oblivian@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
* 09:03 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance
* 06:50 oblivian@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' .
* 00:53 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8 days, 0:00:00 on gerrit2001.wikimedia.org with reason: decom, replaced by gerrit2002
* 06:46 oblivian@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' .
* 00:53 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 8 days, 0:00:00 on gerrit2001.wikimedia.org with reason: decom, replaced by gerrit2002
* 06:46 oblivian@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
* 00:53 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for gerrit2002.wikimedia.org
* 05:53 ayounsi@cumin1001: START - Cookbook sre.network.prepare-upgrade
* 00:53 dzahn@cumin1001: START - Cookbook sre.hosts.remove-downtime for gerrit2002.wikimedia.org
* 05:09 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1094 for MCR', diff saved to https://phabricator.wikimedia.org/P12167 and previous config saved to /var/cache/conftool/dbconfig/20200805-050907-marostegui.json
* 00:52 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8 days, 0:00:00 on gerrit2002.wikimedia.org with reason: decom, replaced by gerrit2002
* 05:08 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1136', diff saved to https://phabricator.wikimedia.org/P12166 and previous config saved to /var/cache/conftool/dbconfig/20200805-050808-marostegui.json
* 00:52 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 8 days, 0:00:00 on gerrit2002.wikimedia.org with reason: decom, replaced by gerrit2002
* 05:03 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1136', diff saved to https://phabricator.wikimedia.org/P12165 and previous config saved to /var/cache/conftool/dbconfig/20200805-050308-marostegui.json
* 00:18 mutante: restarting gerrit for config change - removing old replica [[phab:T313250|T313250]]
* 04:53 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1136', diff saved to https://phabricator.wikimedia.org/P12164 and previous config saved to /var/cache/conftool/dbconfig/20200805-045334-marostegui.json
* 04:33 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1136', diff saved to https://phabricator.wikimedia.org/P12163 and previous config saved to /var/cache/conftool/dbconfig/20200805-043346-marostegui.json


== 2020-08-04 ==
== 2022-08-04 ==
* 22:41 brennen: restarting php7.2-fpm on mw1404 for opcache issues
* 23:07 mutante: switching gerrit-replica.wikimedia.org to new machine gerrit2002, dropping gerrit-replica-new.wikimedia.org [[phab:T313250|T313250]]
* 21:45 mholloway-shell@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' .
* 21:07 ryankemper@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 21:38 mholloway-shell@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' .
* 20:59 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 21:34 mholloway-shell@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' .
* 20:57 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 21:03 mholloway-shell@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' .
* 20:57 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 21:03 mholloway-shell@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
* 20:56 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 20:55 mholloway-shell@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' .
* 20:56 thcipriani@deploy1002: Finished scap: Backport for [[gerrit:819774]] tkwiki: Update wordmark (duration: 06m 12s)
* 20:55 mholloway-shell@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
* 20:51 ryankemper@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 20:52 mholloway-shell@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
* 20:51 ryankemper@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 20:27 ebernhardson@deploy1001: Finished deploy [search/mjolnir/deploy@c80e2e7]: use provided ca certs for elasticsearch (duration: 02m 22s)
* 20:51 ryankemper@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 20:25 ebernhardson@deploy1001: Started deploy [search/mjolnir/deploy@c80e2e7]: use provided ca certs for elasticsearch
* 20:50 thcipriani@deploy1002: Started scap: Backport for [[gerrit:819774]] tkwiki: Update wordmark
* 20:15 ebernhardson@deploy1001: Finished deploy [search/mjolnir/deploy@b17bfd4]: Move mjolnir daemons from cirrus hosts to dedicated instances (duration: 02m 07s)
* 20:48 thcipriani@deploy1002: Finished scap: Backport for [[gerrit:812391]] [config]: Add click event logging for mobile and desktop (duration: 39m 16s)
* 20:12 ebernhardson@deploy1001: Started deploy [search/mjolnir/deploy@b17bfd4]: Move mjolnir daemons from cirrus hosts to dedicated instances
* 20:45 ryankemper@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 19:19 brennen@deploy1001: rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.3
* 20:24 ryankemper@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 19:11 brennen@deploy1001: Finished scap: testwikis wikis to 1.36.0-wmf.3 (duration: 91m 03s)
* 20:23 ryankemper@deploy1002: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
* 19:03 brennen: current 1.36.0-wmf.3 train status ([[phab:T257971|T257971]]): mid scap-cdb-rebuild for testwiki sync; will proceed with group0 when finished.
* 20:22 ryankemper@deploy1002: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
* 18:55 sukhe: upload pdns-recursor_4.3.3-1~deb10u1 to apt.wm.o (buster) - [[phab:T252132|T252132]]
* 20:16 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 18:49 mutante: letting puppet install envoy on all ores1* hosts
* 20:15 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 18:46 mutante: letting puppet install envoy on all ores2* hosts
* 20:15 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 18:37 hnowlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 20:14 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 18:26 hnowlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 20:13 ryankemper@deploy1002: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
* 18:19 mutante: temp disabling puppet on all ores hosts to add envoy
* 20:13 ryankemper@deploy1002: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
* 17:50 hnowlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 20:10 ryankemper@deploy1002: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
* 17:40 hnowlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 20:09 ryankemper@deploy1002: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
* 17:40 brennen@deploy1001: Started scap: testwikis wikis to 1.36.0-wmf.3
* 20:08 thcipriani@deploy1002: Started scap: Backport for [[gerrit:812391]] [config]: Add click event logging for mobile and desktop
* 17:36 hnowlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 19:59 ryankemper@deploy1002: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
* 17:25 hnowlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 19:55 dancy@deploy1002: rebuilt and synchronized wikiversions files: resync
* 17:21 hnowlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 19:49 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for thanos-be2001.codfw.wmnet
* 17:17 hnowlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 19:49 mvernon@cumin1001: START - Cookbook sre.hosts.remove-downtime for thanos-be2001.codfw.wmnet
* 17:05 brennen: 1.36.0-wmf.3 was branched at {{Gerrit|2d0cf09cdf}} for [[phab:T257971|T257971]]
* 19:44 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 8 hosts
* 16:51 hnowlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 19:44 mvernon@cumin1001: START - Cookbook sre.hosts.remove-downtime for 8 hosts
* 16:49 hnowlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 19:42 Emperor: rebooting thanos-be2001 to fix drive ordering
* 16:43 hnowlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 19:37 bking@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for elastic2071.codfw.wmnet
* 16:31 hnowlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 19:37 bking@cumin1001: START - Cookbook sre.hosts.remove-downtime for elastic2071.codfw.wmnet
* 16:24 hnowlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 19:31 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2071.codfw.wmnet with reason: [[phab:T310146|T310146]]
* 16:18 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 19:31 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2071.codfw.wmnet with reason: [[phab:T310146|T310146]]
* 16:15 oblivian@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'termbox' for release 'production' .
* 19:13 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 16:08 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
* 19:12 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 16:05 oblivian@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'production' .
* 19:12 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 16:02 oblivian@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
* 19:12 ryankemper@deploy1002: helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
* 16:02 oblivian@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' .
* 19:11 ryankemper@deploy1002: helmfile [eqiad] START helmfile.d/services/changeprop: apply
* 15:53 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: EventStreamConfig - Set default topic_prefixes - [[phab:T255888|T255888]] (duration: 00m 58s)
* 19:11 dancy: There were many errors during php-fpm restart due to failure to contact http://lvs2009:9090/pools/appservers-https_443/mw2361.codfw.wmnet and the like.
* 15:51 ayounsi@cumin1001: END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0)
* 19:11 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 15:39 oblivian@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
* 19:10 dancy@deploy1002: rebuilt and synchronized wikiversions files: group2 wikis to 1.39.0-wmf.23 refs [[phab:T308076|T308076]]
* 15:39 oblivian@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' .
* 19:09 ryankemper@deploy1002: helmfile [codfw] DONE helmfile.d/services/changeprop: apply
* 15:38 hnowlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 19:09 ryankemper@deploy1002: helmfile [codfw] START helmfile.d/services/changeprop: apply
* 15:31 hnowlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
* 19:05 otto@deploy1002: helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: sync
* 15:18 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Remove now unused wgEventServiceStreamConfig - [[phab:T229863|T229863]] (duration: 00m 58s)
* 19:04 otto@deploy1002: helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: sync
* 15:18 moritzm: installing jackson-databind security updates
* 19:04 otto@deploy1002: helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: sync
* 15:08 moritzm: installing qemu security updates on cloudvirt* Stretch hosts
* 19:03 otto@deploy1002: helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: sync
* 14:54 cmjohnson1: swapping kubernetes1010 network cable [[phab:T257542|T257542]]
* 19:03 otto@deploy1002: helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: sync
* 14:48 ayounsi@cumin1001: START - Cookbook sre.network.prepare-upgrade
* 19:02 otto@deploy1002: helmfile [staging] START helmfile.d/services/eventgate-analytics-external: sync
* 14:41 cmjohnson1: powercycling analytics1050 [[phab:T258370|T258370]]
* 19:02 ottomata: roll-restarting eventgate-analytics-external to pick up backwards incompatible schema change - [[phab:T314151|T314151]]
* 14:35 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1136 for MCR', diff saved to https://phabricator.wikimedia.org/P12161 and previous config saved to /var/cache/conftool/dbconfig/20200804-143524-marostegui.json
* 18:47 ryankemper@deploy1002: helmfile [staging] DONE helmfile.d/services/changeprop: apply
* 14:27 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1101:3317', diff saved to https://phabricator.wikimedia.org/P12160 and previous config saved to /var/cache/conftool/dbconfig/20200804-142710-marostegui.json
* 18:46 ryankemper@deploy1002: helmfile [staging] START helmfile.d/services/changeprop: apply
* 14:22 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1101:3317', diff saved to https://phabricator.wikimedia.org/P12159 and previous config saved to /var/cache/conftool/dbconfig/20200804-142220-marostegui.json
* 18:41 cwhite: poweroff kafka-logging2003 - [[phab:T310145|T310145]]
* 14:15 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1101:3317', diff saved to https://phabricator.wikimedia.org/P12158 and previous config saved to /var/cache/conftool/dbconfig/20200804-141556-marostegui.json
* 18:39 dzahn@cumin2002: conftool action : set/pooled=yes; selector: dc=codfw,name=mw237[0-6].codfw.wmnet
* 14:10 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1101:3317', diff saved to https://phabricator.wikimedia.org/P12157 and previous config saved to /var/cache/conftool/dbconfig/20200804-141004-marostegui.json
* 18:39 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 7 hosts
* 13:56 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
* 18:39 dzahn@cumin2002: START - Cookbook sre.hosts.remove-downtime for 7 hosts
* 13:56 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'coredns' .
* 18:35 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2369.codfw.wmnet
* 13:56 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' .
* 18:35 dzahn@cumin2002: START - Cookbook sre.hosts.remove-downtime for mw2369.codfw.wmnet
* 13:51 hashar: Install newer openjdk on contint2001 and restarting CI Jenkins
* 18:35 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2368.codfw.wmnet
* 12:00 jayme: helm was updated: 2.16.7-2 -> 2.16.9-1 on chartmuseum*, contint*, deploy*
* 18:35 dzahn@cumin2002: START - Cookbook sre.hosts.remove-downtime for mw2368.codfw.wmnet
* 11:43 Lucas_WMDE: EU backport window done
* 18:35 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2367.codfw.wmnet
* 11:41 marostegui: Deploy schema change on s3 codfw master, lag might show up on codfw s3 [[phab:T259238|T259238]]
* 18:35 dzahn@cumin2002: START - Cookbook sre.hosts.remove-downtime for mw2367.codfw.wmnet
* 11:37 moritzm: installing openjdk-11 security updates
* 18:35 dzahn@cumin2002: conftool action : set/pooled=yes; selector: dc=codfw,name=mw2369.codfw.wmnet
* 11:36 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/Wikibase.php: Config: [[gerrit:618266{{!}}Load WikibaseRepo using extension registration in production (T257433)]] (duration: 00m 58s)
* 18:35 dzahn@cumin2002: conftool action : set/pooled=yes; selector: dc=codfw,name=mw2368.codfw.wmnet
* 11:12 Lucas_WMDE: Deployed patch for [[phab:T86738|T86738]] / [[phab:T259565|T259565]]
* 18:35 dzahn@cumin2002: conftool action : set/pooled=yes; selector: dc=codfw,name=mw2367.codfw.wmnet
* 11:03 moritzm: installing e2fsprogs security updates for stretch
* 18:35 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2366.codfw.wmnet
* 10:47 moritzm: installing tomcat8 security updates
* 18:35 dzahn@cumin2002: START - Cookbook sre.hosts.remove-downtime for mw2366.codfw.wmnet
* 10:47 vgutierrez: upgrade acme-chief to version 0.28
* 18:34 dzahn@cumin2002: conftool action : set/pooled=yes; selector: dc=codfw,name=mw2366.codfw.wmnet
* 10:33 vgutierrez: upload acme-chief 0.28 to apt.wm.o (buster) - [[phab:T259338|T259338]]
* 18:30 dzahn@cumin2002: conftool action : set/pooled=yes; selector: dc=codfw,name=mw2279.codfw.wmnet
* 10:18 moritzm: installing imagemagick security updates on stretch
* 18:30 dzahn@cumin2002: conftool action : set/pooled=yes; selector: dc=codfw,name=mw2278.codfw.wmnet
* 10:00 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1101:3317 for MCR and PK change [[phab:T259524|T259524]]', diff saved to https://phabricator.wikimedia.org/P12156 and previous config saved to /var/cache/conftool/dbconfig/20200804-100035-marostegui.json
* 18:29 dzahn@cumin2002: conftool action : set/pooled=yes; selector: dc=codfw,name=mw2277.codfw.wmnet
* 09:56 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1098:3317', diff saved to https://phabricator.wikimedia.org/P12155 and previous config saved to /var/cache/conftool/dbconfig/20200804-095608-marostegui.json
* 18:29 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2276.codfw.wmnet
* 09:49 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1098:3317', diff saved to https://phabricator.wikimedia.org/P12154 and previous config saved to /var/cache/conftool/dbconfig/20200804-094909-marostegui.json
* 18:29 dzahn@cumin2002: START - Cookbook sre.hosts.remove-downtime for mw2276.codfw.wmnet
* 09:06 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 18:29 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2275.codfw.wmnet
* 09:04 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
* 18:29 dzahn@cumin2002: START - Cookbook sre.hosts.remove-downtime for mw2275.codfw.wmnet
* 08:58 moritzm: installing python3.5 security updates
* 18:29 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2274.codfw.wmnet
* 08:15 moritzm: installing remaining cups security updates
* 18:29 dzahn@cumin2002: START - Cookbook sre.hosts.remove-downtime for mw2274.codfw.wmnet
* 08:13 XioNoX: cleaning up a bunch of prefix limit reached issues
* 18:29 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2273.codfw.wmnet
* 08:00 marostegui: Failover m2 from db1132 to db1107 - [[phab:T257540|T257540]]
* 18:29 dzahn@cumin2002: START - Cookbook sre.hosts.remove-downtime for mw2273.codfw.wmnet
* 07:54 moritzm: installing poppler security updates on stretch
* 18:26 milimetric@deploy1002: Finished deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288] (duration: 02m 39s)
* 07:43 jayme: imported helm_2.16.9-1 to jessie-wikimedia
* 18:24 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2272.codfw.wmnet
* 07:43 jayme: imported helm_2.16.9-1 to stretch-wikimedia
* 18:24 dzahn@cumin2002: START - Cookbook sre.hosts.remove-downtime for mw2272.codfw.wmnet
* 07:38 jayme: imported helm_2.16.9-1 to buster-wikimedia
* 18:24 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2271.codfw.wmnet
* 07:34 elukey: upgrade druid analytics (backend for Turnilo/Superset/etc..) to 0.19
* 18:24 dzahn@cumin2002: START - Cookbook sre.hosts.remove-downtime for mw2271.codfw.wmnet
* 07:32 XioNoX: remove nonstop-bridging from fasw-c-eqiad switches - [[phab:T191667|T191667]]
* 18:23 milimetric@deploy1002: Started deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288]
* 07:29 XioNoX: remove nonstop-bridging from eqiad asw2 switches - [[phab:T191667|T191667]]
* 18:23 milimetric@deploy1002: Finished deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288] (duration: 01m 32s)
* 07:28 XioNoX: remove nonstop-bridging from asw2-esams - [[phab:T191667|T191667]]
* 18:23 dzahn@cumin2002: conftool action : set/pooled=yes; selector: dc=codfw,name=mw2276.codfw.wmnet
* 07:27 marostegui: Start topology changes on m2 - [[phab:T257540|T257540]]
* 18:23 dzahn@cumin2002: conftool action : set/pooled=yes; selector: dc=codfw,name=mw2275.codfw.wmnet
* 07:25 moritzm: installing rails security updates
* 18:23 dzahn@cumin2002: conftool action : set/pooled=yes; selector: dc=codfw,name=mw2274.codfw.wmnet
* 06:42 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1119', diff saved to https://phabricator.wikimedia.org/P12153 and previous config saved to /var/cache/conftool/dbconfig/20200804-064223-marostegui.json
* 18:22 dzahn@cumin2002: conftool action : set/pooled=yes; selector: dc=codfw,name=mw2273.codfw.wmnet
* 06:30 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1119', diff saved to https://phabricator.wikimedia.org/P12152 and previous config saved to /var/cache/conftool/dbconfig/20200804-063026-marostegui.json
* 18:22 dzahn@cumin2002: conftool action : set/pooled=yes; selector: dc=codfw,name=mw2272.codfw.wmnet
* 06:27 _joe_: restarting docker daemon on kubestage1002, seems like a case of https://github.com/moby/moby/issues/29635
* 18:22 Emperor: shutdown moss-fe2001.codfw.wmnet,ms-fe2011.codfw.wmnet,ms-be20[34,35,42,48,68].codfw.wmnet for PDU work [[phab:T310145|T310145]]
* 06:23 marostegui@cumin1001: dbctl commit (dc=all): 'Restore original weight to db1089 on main traffic', diff saved to https://phabricator.wikimedia.org/P12151 and previous config saved to /var/cache/conftool/dbconfig/20200804-062358-marostegui.json
* 18:22 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 8 hosts with reason: PDU work
* 06:22 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1119', diff saved to https://phabricator.wikimedia.org/P12150 and previous config saved to /var/cache/conftool/dbconfig/20200804-062256-marostegui.json
* 18:21 milimetric@deploy1002: Started deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288]
* 06:19 oblivian@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
* 18:21 mvernon@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 8 hosts with reason: PDU work
* 06:13 tstarling@deploy1001: Synchronized wmf-config/CommonSettings.php: re-enabling lilypond execution in safe mode 3rd attempt (duration: 00m 58s)
* 18:21 milimetric@deploy1002: Finished deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288] (duration: 00m 03s)
* 06:12 marostegui@cumin1001: dbctl commit (dc=all): 'More weight to db1089 on main traffic', diff saved to https://phabricator.wikimedia.org/P12149 and previous config saved to /var/cache/conftool/dbconfig/20200804-061255-marostegui.json
* 18:21 milimetric@deploy1002: Started deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288]
* 06:12 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1119', diff saved to https://phabricator.wikimedia.org/P12148 and previous config saved to /var/cache/conftool/dbconfig/20200804-061209-marostegui.json
* 18:21 milimetric@deploy1002: Finished deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288] (duration: 00m 03s)
* 06:10 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1098:3317 for MCR', diff saved to https://phabricator.wikimedia.org/P12147 and previous config saved to /var/cache/conftool/dbconfig/20200804-061003-marostegui.json
* 18:21 milimetric@deploy1002: Started deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288]
* 05:37 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
* 18:20 milimetric@deploy1002: Finished deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288] (duration: 01m 49s)
* 05:35 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
* 18:20 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 9 hosts
* 05:18 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1119 for reimage', diff saved to https://phabricator.wikimedia.org/P12146 and previous config saved to /var/cache/conftool/dbconfig/20200804-051843-marostegui.json
* 18:20 mvernon@cumin1001: START - Cookbook sre.hosts.remove-downtime for 9 hosts
* 05:04 marostegui: Reboot db1107 to pick up the last kernel
* 18:19 milimetric@deploy1002: Started deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288]
* 05:01 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1089 into API', diff saved to https://phabricator.wikimedia.org/P12145 and previous config saved to /var/cache/conftool/dbconfig/20200804-050150-marostegui.json
* 18:14 mutante: mw2272 and upwards: scap pull, checking monitoring, repooling.. one by one
* 03:56 legoktm: added Arlo to wmf-deployment Gerrit group
* 18:13 dzahn@cumin2002: conftool action : set/pooled=yes; selector: dc=codfw,name=mw2271.codfw.wmnet
* 03:53 legoktm: added subbu to wmf-deployment Gerrit group
* 18:12 btullis@deploy1002: Finished deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288] (duration: 00m 51s)
* 18:11 btullis@deploy1002: Started deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288]
* 18:06 btullis@deploy1002: Finished deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288] (duration: 01m 54s)
* 18:04 btullis@deploy1002: Started deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288]
* 17:55 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2009.codfw.wmnet with reason: shutdown for PDU upgrade
* 17:55 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2009.codfw.wmnet with reason: shutdown for PDU upgrade
* 17:43 mutante: maps2008 - downtime and shutdown for D3 maintenance
* 17:42 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on maps2008.codfw.wmnet with reason: codfw reboots
* 17:42 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on maps2008.codfw.wmnet with reason: codfw reboots
* 17:42 mutante: thumbor2006 - downtime and shutdown for D3 maintenance
* 17:42 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on thumbor2006.codfw.wmnet with reason: codfw reboots
* 17:41 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on thumbor2006.codfw.wmnet with reason: codfw reboots
* 17:39 mutante: mw2386 - systemctl reset-failed
* 17:31 mutante: phab2001 - systemctl restart ssh-phab, attempting to clear Icinga pybal alerts, related to reboots
* 17:30 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on dns2001.wikimedia.org with reason: shutdown for PDU upgrade
* 17:30 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 3:00:00 on dns2001.wikimedia.org with reason: shutdown for PDU upgrade
* 17:29 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dns2001.wikimedia.org with reason: shutdown for PDU upgrade
* 17:29 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on dns2001.wikimedia.org with reason: shutdown for PDU upgrade
* 17:28 Amir1: dbmaint at s4@eqiad ([[phab:T312863|T312863]])
* 17:26 bd808@deploy1002: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
* 17:26 bd808@deploy1002: helmfile [eqiad] START helmfile.d/services/developer-portal: apply
* 17:24 bd808@deploy1002: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
* 17:23 bd808@deploy1002: helmfile [codfw] START helmfile.d/services/developer-portal: apply
* 17:23 bd808@deploy1002: helmfile [staging] DONE helmfile.d/services/developer-portal: apply
* 17:23 bd808@deploy1002: helmfile [staging] START helmfile.d/services/developer-portal: apply
* 17:20 mutante: [an-launcher1002:~] $ sudo systemctl reset-failed
* 17:20 mvernon@cumin1001: conftool action : set/pooled=no; selector: name=ms-fe2012.codfw.wmnet
* 17:18 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
* 17:18 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
* 17:18 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2038.codfw.wmnet,service=varnish-fe
* 17:18 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2038.codfw.wmnet,service=ats-be
* 17:18 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2038.codfw.wmnet,service=ats-tls
* 17:16 Emperor: shutdown of moss-fe2002.codfw.wmnet,ms-be20[37,38,43,61,65,69].codfw.wmnet,ms-fe2012.codfw.wmnet,thanos-fe2003.codfw.wmnet for power work [[phab:T310146|T310146]]
* 17:16 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cp[2035-2036].codfw.wmnet with reason: shutdown for PDU upgrade
* 17:15 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 4:00:00 on cp[2035-2036].codfw.wmnet with reason: shutdown for PDU upgrade
* 17:15 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 9 hosts with reason: PDU work
* 17:15 mvernon@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 9 hosts with reason: PDU work
* 17:15 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[56]\.codfw\.wmnet,service=varnish-fe
* 17:15 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[56]\.codfw\.wmnet,service=ats-be
* 17:15 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[56]\.codfw\.wmnet,service=ats-tls
* 17:13 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be[2036,2049,2054].codfw.wmnet,thanos-be2003.codfw.wmnet
* 17:13 mvernon@cumin1001: START - Cookbook sre.hosts.remove-downtime for ms-be[2036,2049,2054].codfw.wmnet,thanos-be2003.codfw.wmnet
* 17:12 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2037.codfw.wmnet,service=varnish-fe
* 17:12 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2037.codfw.wmnet,service=ats-be
* 17:12 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp2037.codfw.wmnet,service=ats-tls
* 17:12 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2050.codfw.wmnet with reason: [[phab:T310146|T310146]]
* 17:12 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2050.codfw.wmnet with reason: [[phab:T310146|T310146]]
* 17:11 ebysans@deploy1002: Finished deploy [analytics/refinery@2553288] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@2553288] (duration: 00m 04s)
* 17:11 ebysans@deploy1002: Started deploy [analytics/refinery@2553288] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@2553288]
* 17:11 ebysans@deploy1002: Finished deploy [analytics/refinery@2553288] (thin): Regular analytics weekly train THIN [analytics/refinery@2553288] (duration: 00m 07s)
* 17:10 ebysans@deploy1002: Started deploy [analytics/refinery@2553288] (thin): Regular analytics weekly train THIN [analytics/refinery@2553288]
* 17:10 ebysans@deploy1002: Finished deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288] (duration: 00m 15s)
* 17:09 ebysans@deploy1002: Started deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288]
* 17:07 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lvs2010.codfw.wmnet with reason: shutdown for PDU upgrade
* 17:07 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on lvs2010.codfw.wmnet with reason: shutdown for PDU upgrade
* 16:55 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=maps2008.codfw.wmnet
* 16:51 ebysans@deploy1002: Finished deploy [analytics/refinery@2553288] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@2553288] (duration: 07m 14s)
* 16:45 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=restbase2016.codfw.wmnet
* 16:45 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=restbase202[05].codfw.wmnet
* 16:45 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase202[05].codfw.wmnet
* 16:45 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=maps2007.codfw.wmnet
* 16:43 ebysans@deploy1002: Started deploy [analytics/refinery@2553288] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@2553288]
* 16:43 ebysans@deploy1002: Finished deploy [analytics/refinery@2553288] (thin): Regular analytics weekly train THIN [analytics/refinery@2553288] (duration: 00m 07s)
* 16:43 ebysans@deploy1002: Started deploy [analytics/refinery@2553288] (thin): Regular analytics weekly train THIN [analytics/refinery@2553288]
* 16:37 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 18 hosts
* 16:37 jayme@cumin1001: START - Cookbook sre.hosts.remove-downtime for 18 hosts
* 16:35 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2059.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 16:35 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2059.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 16:34 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kafka-main2003.codfw.wmnet with reason: PDU swap
* 16:34 ebysans@deploy1002: Finished deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288] (duration: 00m 20s)
* 16:34 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on kafka-main2003.codfw.wmnet with reason: PDU swap
* 16:34 ebysans@deploy1002: Started deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288]
* 16:32 ebysans@deploy1002: Finished deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288] (duration: 29m 59s)
* 16:30 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool D3 for PDU maint', diff saved to https://phabricator.wikimedia.org/P32286 and previous config saved to /var/cache/conftool/dbconfig/20220804-163037-ladsgroup.json
* 16:28 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 16:28 ladsgroup@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820376{{!}}Start reading from new templatelinks columns in commons (T306673)]] (duration: 03m 00s)
* 16:27 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 16:27 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 16:26 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 16:17 brett: deploying authdns - geodns: Map out African countries by DC latency ([[phab:T311472|T311472]])
* 16:12 cwhite: poweroff logstash2028 - [[phab:T310145|T310145]]
* 16:06 Emperor: shutdown ms-be20[39,49,54].codfw.wmnet,thanos-be2003 for PDU swap [[phab:T310145|T310145]]
* 16:03 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be[2036,2049,2054].codfw.wmnet,thanos-be2003.codfw.wmnet with reason: PDU work
* 16:02 mvernon@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be[2036,2049,2054].codfw.wmnet,thanos-be2003.codfw.wmnet with reason: PDU work
* 16:02 ebysans@deploy1002: Started deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288]
* 15:50 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2048.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 15:50 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2048.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 15:43 damilare: payments-wiki upgraded from {{Gerrit|0e4a5b3b}} to {{Gerrit|6880236d}}
* 15:37 _joe_: uncordoning ml-serve200<nowiki>{</nowiki>1,6<nowiki>}</nowiki>
* 15:27 sukhe: power off cp2037,cp2038: PDU upgrade
* 15:25 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:30:00 on phab2001.codfw.wmnet with reason: PDU swap
* 15:25 jelto: power off phab2001
* 15:25 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 3:30:00 on phab2001.codfw.wmnet with reason: PDU swap
* 15:25 sukhe@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cp[2037-2038].codfw.wmnet with reason: shutdown for PDU upgrade
* 15:24 sukhe@cumin2002: START - Cookbook sre.hosts.downtime for 4:00:00 on cp[2037-2038].codfw.wmnet with reason: shutdown for PDU upgrade
* 15:24 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[78]\.codfw\.wmnet,service=varnish-fe
* 15:23 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[78]\.codfw\.wmnet,service=ats-be
* 15:23 sukhe@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp203[78]\.codfw\.wmnet,service=ats-tls
* 15:21 XioNoX: un-drain codfw-ulsfo link - [[phab:T310310|T310310]]
* 15:21 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db[2116,2127,2167-2168].codfw.wmnet,es2022.codfw.wmnet with reason: Maintenance ([[phab:T310145|T310145]])
* 15:20 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 10:00:00 on db[2116,2127,2167-2168].codfw.wmnet,es2022.codfw.wmnet with reason: Maintenance ([[phab:T310145|T310145]])
* 15:20 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool C6 for PDU maint ([[phab:T310145|T310145]])', diff saved to https://phabricator.wikimedia.org/P32285 and previous config saved to /var/cache/conftool/dbconfig/20220804-151958-ladsgroup.json
* 15:16 btullis@cumin1001: END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
* 15:16 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on restbase[2016,2020,2025].codfw.wmnet with reason: PDU maintenance
* 15:16 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 3:00:00 on restbase[2016,2020,2025].codfw.wmnet with reason: PDU maintenance
* 15:13 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db[2114,2126,2166].codfw.wmnet with reason: Maintenance ([[phab:T310145|T310145]])
* 15:13 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 10:00:00 on db[2114,2126,2166].codfw.wmnet with reason: Maintenance ([[phab:T310145|T310145]])
* 15:13 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp203[12]\.codfw\.wmnet,service=varnish-fe
* 15:13 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp203[12]\.codfw\.wmnet,service=ats-be
* 15:13 sukhe@puppetmaster1001: conftool action : set/pooled=yes; selector: name=cp203[12]\.codfw\.wmnet,service=ats-tls
* 15:12 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be[2058,2064].codfw.wmnet
* 15:12 mvernon@cumin1001: START - Cookbook sre.hosts.remove-downtime for ms-be[2058,2064].codfw.wmnet
* 15:11 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depool hosts for PDU maint ([[phab:T310145|T310145]])', diff saved to https://phabricator.wikimedia.org/P32284 and previous config saved to /var/cache/conftool/dbconfig/20220804-151121-ladsgroup.json
* 15:09 godog: poweroff logstash2002 - [[phab:T310145|T310145]]
* 15:07 _joe_: powering down mc203<nowiki>{</nowiki>0,1<nowiki>}</nowiki>
* 15:07 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on logstash2002.codfw.wmnet with reason: pdu
* 15:06 root@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on logstash2002.codfw.wmnet with reason: pdu
* 15:05 btullis@cumin1001: START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
* 14:58 jelto: power off mc20[30-31]
* 14:56 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mc[2030-2031].codfw.wmnet with reason: PDU swap
* 14:56 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on mc[2030-2031].codfw.wmnet with reason: PDU swap
* 14:56 XioNoX: draining codfw-ulsfo link - [[phab:T310310|T310310]]
* 14:36 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=maps2009.codfw.wmnet
* 14:35 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=maps2007.codfw.wmnet
* 14:35 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase2025.codfw.wmnet
* 14:35 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase2020.codfw.wmnet
* 14:35 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase2016.codfw.wmnet
* 14:32 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs2011.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 14:31 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs2011.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 14:25 jelto: power off gitlab-runner2003
* 14:25 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:30:00 on gitlab-runner2003.codfw.wmnet with reason: PDU swap
* 14:25 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs2001.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 14:24 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs2001.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 14:24 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 4:30:00 on gitlab-runner2003.codfw.wmnet with reason: PDU swap
* 14:23 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2032.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 14:22 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2032.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 14:22 root@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on logstash2035.codfw.wmnet with reason: pdu
* 14:22 godog: poweroff logstash2035 - [[phab:T310145|T310145]]
* 14:22 root@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on logstash2035.codfw.wmnet with reason: pdu
* 14:21 Emperor: shutdown ms-be20[58,64].codfw.wmnet for PDU swap [[phab:T310145|T310145]]
* 14:20 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 14:19 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 14:19 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 14:18 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 14:14 Lucas_WMDE: UTC afternoon backport+config window done
* 14:13 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/CommonSettings-labs.php: Config: [[gerrit:820454{{!}}Remove unused $wgMathUseRestBase (T274436)]] (duration: 03m 01s)
* 14:13 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 14:12 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 14:12 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 14:11 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 14:06 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 14:05 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/CommonSettings-labs.php: Config: [[gerrit:820254{{!}}CommonSettings-labs: Fix usage of $wgSFSValidateIPListLocationMD5]] (duration: 02m 51s)
* 14:05 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 14:05 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 14:05 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2033.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 14:04 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2033.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 14:04 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:59 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/wikitech.php: Config: [[gerrit:820255{{!}}wikitech: Remove old LDAP config vars]] (duration: 02m 54s)
* 13:59 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:58 mvernon@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be[2058,2064].codfw.wmnet with reason: PDU work
* 13:58 mvernon@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be[2058,2064].codfw.wmnet with reason: PDU work
* 13:58 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:58 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:57 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:52 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:51 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:51 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:50 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:48 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:820404{{!}}Remove unused $wgIncludejQueryMigrate (T280944)]] (2/2) (duration: 03m 03s)
* 13:45 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:45 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820404{{!}}Remove unused $wgIncludejQueryMigrate (T280944)]] (1/2) (duration: 02m 58s)
* 13:44 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:44 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:43 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:40 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2066.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 13:39 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2066.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 13:38 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:37 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:820402{{!}}Remove unused $wgLegacyJavaScriptGlobals (T72470)]] (2/2) (duration: 02m 59s)
* 13:37 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:37 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:36 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:34 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820402{{!}}Remove unused $wgLegacyJavaScriptGlobals (T72470)]] (1/2) (duration: 02m 58s)
* 13:31 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:30 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:30 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:29 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:26 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/SearchSettingsForSDC.php: Config: [[gerrit:820397{{!}}Remove unused $wgWBCSEnableDispatchingQueryBuilder]] (duration: 03m 01s)
* 13:24 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:23 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:23 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:19 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:17 taavi@deploy1002: Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:820441{{!}}Remove unused CA P3P config]] (duration: 03m 09s)
* 13:14 jbond: introduce new puppetmaster backends puppetmaster[12]004
* 13:14 bking@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2065.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 13:14 bking@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2065.codfw.wmnet with reason: [[phab:T310145|T310145]]
* 13:14 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 13:13 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 13:13 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 13:12 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 13:11 taavi@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:819175{{!}}QuickSurveys: Deploy research incentive survey to Bengali wiki (T314333)]] (duration: 03m 26s)
* 13:07 moritzm: installing jetty9 security updates
* 12:48 moritzm: installing Linux 4.19.249 kernels on Buster hosts
* 12:03 jbond: send sretest100[12] and idp-test2001 to the new puppetmaster[12]004 servers to test
* 11:46 moritzm: installing Linux 5.10.127-2 kernels on Bullseye hosts
* 11:41 jmm@cumin2002: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2017.codfw.wmnet to cluster codfw and group D
* 11:41 moritzm: installing libpgjava security updates
* 11:37 jmm@cumin2002: START - Cookbook sre.ganeti.addnode for new host ganeti2017.codfw.wmnet to cluster codfw and group D
* 11:34 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2017.codfw.wmnet
* 11:26 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host ganeti2017.codfw.wmnet
* 11:09 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2017.codfw.wmnet with OS bullseye
* 10:53 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=restbase2015.codfw.wmnet
* 10:53 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=restbase2022.codfw.wmnet
* 10:52 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2017.codfw.wmnet with reason: host reimage
* 10:49 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2017.codfw.wmnet with reason: host reimage
* 10:30 jmm@cumin2002: START - Cookbook sre.hosts.reimage for host ganeti2017.codfw.wmnet with OS bullseye
* 10:27 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2017.codfw.wmnet with reason: Remove node for eventual reimage, [[phab:T311686|T311686]]
* 10:27 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2017.codfw.wmnet with reason: Remove node for eventual reimage, [[phab:T311686|T311686]]
* 10:19 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 9:00:00 on 32 hosts with reason: PDU swap
* 10:19 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 9:00:00 on 32 hosts with reason: PDU swap
* 10:03 Lucas_WMDE: stashbot temporarily parted and lost several logs between 9:42 UTC and 9:49 UTC; mainly mwdebug helmfile start/done, also ayounsi sre.deploy.python-code cookbook to cumin1001, cumin2002; see IRC logs
* 10:02 ayounsi@cumin1001: END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: update requirements + wmf-netbox - ayounsi@cumin1001
* 10:00 jynus: stop db2099 [[phab:T310145|T310145]]
* 10:00 ayounsi@cumin1001: START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: update requirements + wmf-netbox - ayounsi@cumin1001
* 09:39 jelto: power off mw22[71-79].codfw.wmnet
* 09:38 urbanecm@deploy1002: Synchronized php-1.39.0-wmf.23/extensions/GrowthExperiments/includes/EventLogging/SpecialEditGrowthConfigLogger.php: {{Gerrit|ba67dd940217e9f786f4349b4da0fe088475fde9}}: SpecialEditGrowthConfigLogger: Update schema version ([[phab:T314173|T314173]], [[phab:T312148|T312148]]) (duration: 03m 18s)
* 09:37 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 09:37 marostegui@cumin1001: dbctl commit (dc=all): 'Add db2177 to s3 [[phab:T311494|T311494]]', diff saved to https://phabricator.wikimedia.org/P32282 and previous config saved to /var/cache/conftool/dbconfig/20220804-093704-marostegui.json
* 09:36 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 09:36 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 09:35 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|ddcd333015bb58a98709a5005a5db7e8519dd0a5}}: testwiki: Growth: Assign enrollasmentor to * ([[phab:T310905|T310905]]) (duration: 03m 41s)
* 09:35 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 09:32 jelto: set/pooled=inactive mw22[71-79].codfw.wmnet
* 09:31 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 9:30:00 on 9 hosts with reason: PDU swap
* 09:31 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 9:30:00 on 9 hosts with reason: PDU swap
* 09:30 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 09:29 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 09:29 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 09:28 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 09:27 ayounsi@cumin1001: END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: wmf-netbox.py update - ayounsi@cumin1001
* 09:26 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2089.codfw.wmnet
* 09:26 marostegui@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 09:26 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: {{Gerrit|0614a39bf15252c95a96565dd7c986237f3d3323}}: testwiki: Growth: Switch to structured mentor list ([[phab:T310905|T310905]]) (duration: 03m 38s)
* 09:25 ayounsi@cumin1001: START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: wmf-netbox.py update - ayounsi@cumin1001
* 09:23 mwdebug-deploy@deploy1002: helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
* 09:23 marostegui@cumin1001: START - Cookbook sre.dns.netbox
* 09:22 mwdebug-deploy@deploy1002: helmfile [codfw] START helmfile.d/services/mwdebug: apply
* 09:22 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
* 09:21 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
* 09:18 marostegui@cumin1001: START - Cookbook sre.hosts.decommission for hosts db2089.codfw.wmnet
* 09:12 jelto@cumin1001: conftool action : set/pooled=inactive; selector: name=kubernetes2022.codfw.wmnet
* 09:03 oblivian@mwmaint1002: pull aborted:  (duration: 00m 06s)
* 08:58 moritzm: installing gsasl security updates
* 08:57 oblivian@mwmaint1002: pull aborted:  (duration: 00m 18s)
* 08:48 moritzm: draining ganeti2017 [[phab:T311686|T311686]]
* 08:45 jelto: power off kubernetes2022
* 08:43 oblivian@deploy1002: Synchronized README: testing new scap configuration (duration: 03m 18s)
* 08:39 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:22:00 on kubernetes2022.codfw.wmnet with reason: PDU swap
* 08:38 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 10:22:00 on kubernetes2022.codfw.wmnet with reason: PDU swap
* 08:37 jelto@cumin1001: conftool action : set/pooled=no; selector: name=kubernetes2022.codfw.wmnet
* 08:35 jelto: kubectl drain kubernetes2022.codfw.wmnet
* 08:32 jelto: kubectl cordon kubernetes2022.codfw.wmnet
* 08:28 moritzm: imported gsasl 1.8.0-8+wmf1 to stretch-wikimedia
* 08:26 jelto: power off mc2049 and mc2050
* 08:24 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:36:00 on mc[2049-2050].codfw.wmnet with reason: PDU swap
* 08:24 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 10:36:00 on mc[2049-2050].codfw.wmnet with reason: PDU swap
* 08:22 oblivian@mwmaint1002: pull aborted:  (duration: 00m 11s)
* 08:19 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1132, db111, db1127, db1143', diff saved to https://phabricator.wikimedia.org/P32281 and previous config saved to /var/cache/conftool/dbconfig/20220804-081958-root.json
* 08:19 jelto: power off mc2047 and mc2048
* 08:16 jelto@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:45:00 on mc[2047-2048].codfw.wmnet with reason: PDU swap
* 08:16 jelto@cumin1001: START - Cookbook sre.hosts.downtime for 10:45:00 on mc[2047-2048].codfw.wmnet with reason: PDU swap
* 08:04 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd2002.codfw.wmnet with reason: Switch instance to plain disks, [[phab:T311686|T311686]]
* 08:04 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd2002.codfw.wmnet with reason: Switch instance to plain disks, [[phab:T311686|T311686]]
* 07:55 marostegui: Remove grants for 208.80.154.160/208.80.155.109 [[phab:T314528|T314528]]
* 07:49 marostegui@cumin1001: dbctl commit (dc=all): 'Remove db2089 from dbctl [[phab:T313799|T313799]]', diff saved to https://phabricator.wikimedia.org/P32280 and previous config saved to /var/cache/conftool/dbconfig/20220804-074957-marostegui.json
* 07:47 godog: grow sda/sdb 3 by 100G on thanos-be2002 - [[phab:T314275|T314275]]
* 07:46 godog: grow sda/sdb 3 by 100G on thanos-be1003 - [[phab:T314275|T314275]]
* 07:30 jmm@cumin2002: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2030.codfw.wmnet to cluster codfw and group A
* 07:29 jmm@cumin2002: START - Cookbook sre.ganeti.addnode for new host ganeti2030.codfw.wmnet to cluster codfw and group A
* 07:18 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2030.codfw.wmnet
* 07:16 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2135,2160].codfw.wmnet with reason: codfw pdu maintenance
* 07:16 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2135,2160].codfw.wmnet with reason: codfw pdu maintenance
* 07:11 ayounsi@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 07:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host ganeti2030.codfw.wmnet
* 07:09 jmm@cumin2002: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2030.codfw.wmnet to cluster codfw and group A
* 07:09 jmm@cumin2002: START - Cookbook sre.ganeti.addnode for new host ganeti2030.codfw.wmnet to cluster codfw and group A
* 07:09 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es[2023-2025].codfw.wmnet with reason: codfw pdu maintenance
* 07:08 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es[2023-2025].codfw.wmnet with reason: codfw pdu maintenance
* 07:05 ayounsi@cumin1001: START - Cookbook sre.dns.netbox
* 07:03 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2030.codfw.wmnet
* 07:02 ayounsi@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 06:58 ayounsi@cumin1001: START - Cookbook sre.dns.netbox
* 06:58 ayounsi@cumin1001: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
* 06:58 ayounsi@cumin1001: START - Cookbook sre.dns.netbox
* 06:54 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host ganeti2030.codfw.wmnet
* 06:06 _joe_: restarted memcached on mc2038 to pick up the actual production configuration
* 05:59 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2030.codfw.wmnet with OS bullseye
* 05:49 kart_: Updated cxserver to 2022-08-04-022612-production ([[phab:T313296|T313296]], [[phab:T308248|T308248]])
* 05:44 kartik@deploy1002: helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
* 05:43 kartik@deploy1002: helmfile [eqiad] START helmfile.d/services/cxserver: apply
* 05:42 kartik@deploy1002: helmfile [codfw] DONE helmfile.d/services/cxserver: apply
* 05:41 kartik@deploy1002: helmfile [codfw] START helmfile.d/services/cxserver: apply
* 05:40 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2030.codfw.wmnet with reason: host reimage
* 05:39 kartik@deploy1002: helmfile [staging] DONE helmfile.d/services/cxserver: apply
* 05:38 kartik@deploy1002: helmfile [staging] START helmfile.d/services/cxserver: apply
* 05:36 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2030.codfw.wmnet with reason: host reimage
* 05:22 jmm@cumin2002: START - Cookbook sre.hosts.reimage for host ganeti2030.codfw.wmnet with OS bullseye
* 05:17 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2030.codfw.wmnet with reason: Remove node for eventual reimage, [[phab:T311686|T311686]]
* 05:16 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2030.codfw.wmnet with reason: Remove node for eventual reimage, [[phab:T311686|T311686]]
* 04:38 ejegg: payments-wiki upgraded from {{Gerrit|712df4ce}} to {{Gerrit|0e4a5b3b}}
* 04:29 TimStarling: on mw2377 fiddling with CPU frequency control and doing benchmarks
* 04:09 krinkle@mwmaint1002: pull aborted:  (duration: 00m 05s)
* 01:23 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32278 and previous config saved to /var/cache/conftool/dbconfig/20220804-012341-marostegui.json
* 01:08 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P32277 and previous config saved to /var/cache/conftool/dbconfig/20220804-010834-marostegui.json
* 00:53 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P32276 and previous config saved to /var/cache/conftool/dbconfig/20220804-005328-marostegui.json
* 00:38 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32275 and previous config saved to /var/cache/conftool/dbconfig/20220804-003822-marostegui.json
* 00:36 marostegui@cumin1001: dbctl commit (dc=all): 'Depooling db1146:3312 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32274 and previous config saved to /var/cache/conftool/dbconfig/20220804-003611-marostegui.json
* 00:36 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance
* 00:35 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance
* 00:35 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 ([[phab:T312972|T312972]])', diff saved to https://phabricator.wikimedia.org/P32273 and previous config saved to /var/cache/conftool/dbconfig/20220804-003549-marostegui.json
* 00:20 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P32272 and previous config saved to /var/cache/conftool/dbconfig/20220804-002043-marostegui.json
* 00:06 mutante: gerrit - [2022-08-04 00:05:33,173] Replication to gerrit2@gerrit2002.wikimedia.org:/srv/gerrit/git/analytics/geowiki.git started.. [[phab:T313250|T313250]]
* 00:06 mutante: gerrit - [2022-08-04 00:05:33,173] Replication to gerrit2@gerrit2002.wikimedia.org:/srv/gerrit/git/analytics/geowiki.git started... [CONTEXT pushOneId="83ad5008" ]
* 00:05 marostegui@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P32271 and previous config saved to /var/cache/conftool/dbconfig/20220804-000536-marostegui.json
* 00:03 mutante: gerrit - service restart to deploy config change to add second replica [[phab:T313250|T313250]]
* 00:01 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit.wikimedia.org with reason: service restart
* 00:00 dzahn@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit.wikimedia.org with reason: service restart
* 00:00 dzahn@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit1001.wikimedia.org with reason: service restart


== 2020-08-03 ==
== 2022-08-03 ==
* 23:43 mutante: mwdebug1001 - temp installing apt-file for debugging an issue on mwmaint
* 23:59 dzahn@cumin2002: START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit1001.wikimedia.org with reason: service restart
* 23:14 catrope@deploy1001