You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Difference between revisions of "Server Admin Log"

From Wikitech-static
imported>Stashbot
(thcipriani@deploy1002: Synchronized php-1.36.0-wmf.37/extensions/WikimediaEvents/modules/ext.wikimediaEvents/searchSatisfaction.js: Backport: Revert "Turn on glent m1 AB test" T262612 (duration: 00m 58s))
imported>Stashbot
(bstorm@cumin1001: END (PASS) - Cookbook wmcs.wikireplicas.add_wiki (exit_code=0))

Revision as of 22:31, 2 April 2021

2021-04-02

  • 22:31 bstorm@cumin1001: END (PASS) - Cookbook wmcs.wikireplicas.add_wiki (exit_code=0)
  • 22:31 bstorm@cumin1001: Added views for new wiki: trvwiki T276246
  • 22:08 bstorm@cumin1001: START - Cookbook wmcs.wikireplicas.add_wiki
  • 22:08 mutante: pooled mw2395,mw2396 as API appservers running on new hardware
  • 22:08 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw239[5-6].codfw.wmnet
  • 21:58 legoktm: legoktm@lists1002:~$ time sudo mailman-web rebuild_index
  • 21:56 dzahn@cumin1001: conftool action : set/weight=30; selector: name=mw239[5-6].codfw.wmnet
  • 21:55 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw239[5-6].codfw.wmnet
  • 21:48 mutante: mw2395, mw2396 - reboot - becoming API servers
  • 21:43 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw239[0-4].codfw.wmnet
  • 21:42 mutante: pooled 12 brand-new codfw appservers running on new hardware generation
  • 21:41 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw238[5-9].codfw.wmnet
  • 21:40 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw2384.codfw.wmnet
  • 21:40 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw2383.codfw.wmnet
  • 21:38 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw[2395-2396].codfw.wmnet with reason: new_install
  • 21:38 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw[2395-2396].codfw.wmnet with reason: new_install
  • 21:37 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe1002.eqiad.wmnet with reason: REIMAGE
  • 21:35 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe1001.eqiad.wmnet with reason: REIMAGE
  • 21:35 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe1002.eqiad.wmnet with reason: REIMAGE
  • 21:35 dzahn@cumin1001: conftool action : set/weight=30; selector: name=mw239[0-4].codfw.wmnet
  • 21:34 dzahn@cumin1001: conftool action : set/weight=30; selector: name=mw238[3-9].codfw.wmnet
  • 21:33 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe1001.eqiad.wmnet with reason: REIMAGE
  • 21:28 legoktm: imported python-xapian-haystack 2.1.0-6~wmf1 on apt1001 (T278717)
  • 21:24 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2394.codfw.wmnet
  • 21:24 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2393.codfw.wmnet
  • 21:24 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2392.codfw.wmnet
  • 21:22 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2391.codfw.wmnet
  • 21:22 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2390.codfw.wmnet
  • 21:22 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2389.codfw.wmnet
  • 21:21 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2388.codfw.wmnet
  • 21:21 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2387.codfw.wmnet
  • 21:21 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2386.codfw.wmnet
  • 21:21 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2385.codfw.wmnet
  • 21:21 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2384.codfw.wmnet
  • 21:21 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2383.codfw.wmnet
  • 21:19 mutante: generating mcrouter certs for mw2395 through mw2404 (T278396)
  • 21:07 mutante: mw2383 through mw2394 - 'uptime && scap pull' via ssh -C (not cumin because it needs to run as non-root)
  • 20:58 mutante: mw238* - scap pull via cumin not possible because it doesn't work as root
  • 20:50 andrew@deploy1002: Finished deploy [horizon/deploy@86c7cdc]: tweak to affinity group options (duration: 03m 39s)
  • 20:46 andrew@deploy1002: Started deploy [horizon/deploy@86c7cdc]: tweak to affinity group options
  • 20:44 mutante: mw2385 through mw2394 - serial rebooting
  • 20:43 mutante: mw2384 reboot
  • 20:43 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw[2390-2394].codfw.wmnet with reason: new_install
  • 20:43 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw[2390-2394].codfw.wmnet with reason: new_install
  • 20:43 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 10 hosts with reason: new_install
  • 20:43 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on 10 hosts with reason: new_install
  • 20:40 andrew@deploy1002: Finished deploy [horizon/deploy@86c7cdc]: update horizon for codfw1dev (duration: 01m 47s)
  • 20:39 andrew@deploy1002: Started deploy [horizon/deploy@86c7cdc]: update horizon for codfw1dev
  • 20:09 bstorm@cumin1001: END (PASS) - Cookbook wmcs.wikireplicas.add_wiki (exit_code=0)
  • 20:09 bstorm@cumin1001: Added views for new wiki: taywiki T275836
  • 19:47 bstorm@cumin1001: START - Cookbook wmcs.wikireplicas.add_wiki
  • 19:29 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2383.codfw.wmnet with reason: new_install
  • 19:29 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2383.codfw.wmnet with reason: new_install
  • 19:07 bstorm@cumin1001: END (PASS) - Cookbook wmcs.wikireplicas.add_wiki (exit_code=0)
  • 19:07 bstorm@cumin1001: Added views for new wiki: mnwwiktionary T276126
  • 18:44 bstorm@cumin1001: START - Cookbook wmcs.wikireplicas.add_wiki
  • 18:44 mutante: [puppetmaster1001:~] $ sudo puppet node deactivate mw2247.codfw.wmnet
  • 18:28 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mw2247.codfw.wmnet
  • 18:20 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2247.codfw.wmnet
  • 17:57 legoktm: upgraded mailman3 python3-django-postorius on lists1002
  • 15:48 kharlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 15:48 kharlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 15:45 kharlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 15:45 kharlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 15:41 kharlan@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 14:35 jiji@cumin1001: conftool action : set/weight=20; selector: cluster=jobrunner,name=mw133[7-8].eqiad.wmnet
  • 14:34 jiji@cumin1001: conftool action : set/weight=20; selector: cluster=videoscaler,name=mw133[5-6].eqiad.wmnet
  • 14:32 jiji@cumin1001: conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw133[5-6].eqiad.wmnet
  • 14:31 jiji@cumin1001: conftool action : set/pooled=no; selector: cluster=videoscaler,name=mw133[7-8].eqiad.wmnet
  • 14:30 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-coord1001.eqiad.wmnet with reason: REIMAGE
  • 14:29 jiji@cumin1001: conftool action : set/pooled=no; selector: cluster=videoscaler,name=mw1111.eqiad.wmnet
  • 14:28 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-coord1001.eqiad.wmnet with reason: REIMAGE
  • 14:20 Urbanecm: Start server-side upload for 3 video files (T279060, T279061, T279062)
  • 14:09 Urbanecm: Start server-side upload for 3 video files (T279138, T279137, T279136)
  • 13:42 hashar@deploy1002: rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.37
  • 13:14 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-master1001.eqiad.wmnet with reason: REIMAGE
  • 13:12 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-master1001.eqiad.wmnet with reason: REIMAGE
  • 13:11 reedy@deploy1002: Synchronized php-1.36.0-wmf.37/load.php: T278579 (duration: 00m 58s)
  • 13:10 reedy@deploy1002: Synchronized php-1.36.0-wmf.37/includes/OutputHandler.php: T278579 (duration: 00m 57s)
  • 13:08 reedy@deploy1002: Synchronized php-1.36.0-wmf.37/includes/MediaWiki.php: T278579 (duration: 00m 58s)
  • 11:46 Urbanecm: correction: Start server-side upload for 3 video files (T279079, T279080, T279104)
  • 11:45 Urbanecm: Start server-side upload for 3 images (T279079, T279080, T279104)
  • 10:54 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-master1002.eqiad.wmnet with reason: REIMAGE
  • 10:52 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-master1002.eqiad.wmnet with reason: REIMAGE
  • 10:14 akosiaris@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
  • 10:14 akosiaris@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 10:12 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 10:12 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
  • 10:11 akosiaris@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
  • 10:11 akosiaris@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 10:07 hashar@deploy1002: rebuilt and synchronized wikiversions files: Rollback group0 wikis to 1.36.0-wmf.36 - T278343
  • 09:45 hashar@deploy1002: rebuilt and synchronized wikiversions files: Revert group1 and group2 wikis to 1.36.0-wmf.36 - T278343
  • 09:44 hashar@deploy1002: sync-wikiversions aborted: Revert group1 and group2 wikis to 1.36.0-wmf.36 (duration: 00m 01s)
  • 09:06 dcausse: remove dumps from wdqs1009 to free disk space
  • 07:33 effie: powercycle an-worker1080
  • 07:28 elukey: manual fix for an-worker1080's interface in netbox (xe-4/0/11), moved by mistake to public-1b
  • 03:54 dwisehaupt: replication user on fundraising db set to require ssl for connections at the mysql user level. db updated on frdb1004 and verified on a set of hosts
  • 03:16 dwisehaupt: replication user on payments db set to require ssl for connections at the mysql user level. db updated on payments1001 and verified on a set of hosts
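
The conftool entries above follow a regular shape (time, user@host, a `set/key=value` action, and a selector that may contain a bracketed range such as `mw239[5-6]`). The following is a minimal sketch, not part of conftool or any Wikimedia tooling, of how such entries could be parsed for later analysis. The function names `expand_hosts` and `parse_conftool_entry` are hypothetical, and the sketch assumes only single-digit ranges like those in this log; the `mw[2395-2396]` notation used by the cookbook entries is not handled.

```python
import re

# Shape of a conftool entry as it appears in this log, for example:
#   "21:55 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw239[5-6].codfw.wmnet"
ENTRY_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}) (?P<who>[^:]+): conftool action : "
    r"set/(?P<key>\w+)=(?P<value>\w+); selector: (?P<selector>.+)$"
)

# Assumption: ranges are single digits, e.g. "[5-6]" in "mw239[5-6]".
RANGE_RE = re.compile(r"\[(\d)-(\d)\]")


def expand_hosts(name):
    """Expand one single-digit "[a-b]" range in a host name into a list of names."""
    match = RANGE_RE.search(name)
    if match is None:
        return [name]
    low, high = int(match.group(1)), int(match.group(2))
    prefix, suffix = name[:match.start()], name[match.end():]
    return [f"{prefix}{digit}{suffix}" for digit in range(low, high + 1)]


def parse_conftool_entry(line):
    """Return a dict describing a conftool log entry, or None for any other line."""
    # Drop surrounding whitespace and a leading list bullet ("•" or "*").
    text = line.strip().lstrip("•* ").strip()
    match = ENTRY_RE.match(text)
    if match is None:
        return None
    # A selector is a comma-separated list of key=value pairs.
    selector = dict(part.split("=", 1) for part in match.group("selector").split(","))
    return {
        "time": match.group("time"),
        "who": match.group("who"),
        "key": match.group("key"),
        "value": match.group("value"),
        "selector": selector,
        "hosts": expand_hosts(selector["name"]) if "name" in selector else [],
    }


if __name__ == "__main__":
    sample = (
        "  • 21:55 dzahn@cumin1001: conftool action : "
        "set/pooled=no; selector: name=mw239[5-6].codfw.wmnet"
    )
    print(parse_conftool_entry(sample))
```

Grouping the parsed entries by host would make it easier to follow the sequence applied to each new server in this log: `pooled=no`, then `weight=30`, then `pooled=yes`.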

2021-04-01

  • 23:32 thcipriani@deploy1002: Synchronized php-1.36.0-wmf.37/extensions/WikimediaEvents/modules/ext.wikimediaEvents/searchSatisfaction.js: Backport: Revert "Turn on glent m1 AB test" T262612 (duration: 00m 58s)
  • 23:28 thcipriani: reset /srv/mediawiki-staging/php-1.36.0-wmf.37/extensions/TimedMediaHandler to 1be781d (HEAD of wmf/1.36.0-wmf.37 -- from HEAD of master 49f417)
  • 23:12 thcipriani@deploy1002: Synchronized wmf-config/logos.php: Backport: Part III Add hi-res version of mediawiki.org logos T268230 (duration: 00m 57s)
  • 23:10 thcipriani@deploy1002: Synchronized logos: Backport: Part II Add hi-res version of mediawiki.org logos T268230 (duration: 00m 57s)
  • 23:08 thcipriani@deploy1002: Synchronized static: Backport: Part I Add hi-res version of mediawiki.org logos T268230 (duration: 00m 59s)
  • 22:53 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2248.codfw.wmnet
  • 22:50 twentyafterfour@deploy1002: Finished deploy [releng/phatality@27ddd0b]: deploy phatality (duration: 00m 13s)
  • 22:50 twentyafterfour@deploy1002: Started deploy [releng/phatality@27ddd0b]: deploy phatality
  • 22:49 twentyafterfour: deploying phatality
  • 22:34 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2248.codfw.wmnet
  • 22:33 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mw2247.codfw.wmnet
  • 22:17 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2247.codfw.wmnet
  • 22:12 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2246.codfw.wmnet
  • 22:00 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2246.codfw.wmnet
  • 21:58 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2243.codfw.wmnet
  • 21:36 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2243.codfw.wmnet
  • 20:42 mutante: mw2243, mw2246, mw2247, mw2248 - depooled - replaced by mw2379, mw2380, mw2381, mw2382 (T277780)
  • 20:41 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2248.codfw.wmnet
  • 20:41 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2247.codfw.wmnet
  • 20:41 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2246.codfw.wmnet
  • 20:41 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2243.codfw.wmnet
  • 20:23 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw2382.codfw.wmnet
  • 20:23 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw2381.codfw.wmnet
  • 20:21 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw2380.codfw.wmnet
  • 20:20 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw2379.codfw.wmnet
  • 20:06 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2379.codfw.wmnet with reason: new_install
  • 20:06 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2379.codfw.wmnet with reason: new_install
  • 20:06 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2380.codfw.wmnet with reason: new_install
  • 20:05 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2380.codfw.wmnet with reason: new_install
  • 20:05 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2382.codfw.wmnet with reason: new_install
  • 20:05 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2382.codfw.wmnet with reason: new_install
  • 20:05 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2381.codfw.wmnet with reason: new_install
  • 20:05 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2381.codfw.wmnet with reason: new_install
  • 20:01 razzi@deploy1002: Finished deploy [analytics/superset/deploy@5b8de4c]: Deployment of superset fd7c9eb71e193, released after 1.0.1 (duration: 00m 04s)
  • 20:01 razzi@deploy1002: Started deploy [analytics/superset/deploy@5b8de4c]: Deployment of superset fd7c9eb71e193, released after 1.0.1
  • 20:01 razzi@deploy1002: deploy aborted: Deployment of superset fd7c9eb71e193, released after 1.0.1hv (duration: 00m 00s)
  • 20:01 mutante: mw2379, mw2380, mw2381, mw2382 - scap pull
  • 19:59 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2382.codfw.wmnet
  • 19:59 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2381.codfw.wmnet
  • 19:59 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2380.codfw.wmnet
  • 19:59 razzi@deploy1002: Finished deploy [analytics/superset/deploy@5b8de4c]: Deployment of superset fd7c9eb71e193, released after 1.0.1 (duration: 00m 21s)
  • 19:59 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2379.codfw.wmnet
  • 19:58 razzi@deploy1002: Started deploy [analytics/superset/deploy@5b8de4c]: Deployment of superset fd7c9eb71e193, released after 1.0.1
  • 19:57 kharlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 19:57 kharlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 19:56 razzi@deploy1002: Finished deploy [analytics/superset/deploy@5b8de4c]: Deployment of superset fd7c9eb71e193, released after 1.0.1 (duration: 00m 12s)
  • 19:56 razzi@deploy1002: Started deploy [analytics/superset/deploy@5b8de4c]: Deployment of superset fd7c9eb71e193, released after 1.0.1
  • 19:54 kharlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 19:54 kharlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 19:51 kharlan@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 19:37 mutante: pooled parse2001 again after twentyafterfour rebuilt the l10n cache for wmf.37 which fixed it and made Apache alert recover (T268524)
  • 19:34 mutante: mw2379, mw2380, mw2381, mw2382 - rebooting
  • 19:34 twentyafterfour@deploy1002: scap sync-l10n completed (1.36.0-wmf.37) (duration: 02m 38s)
  • 19:30 mutante: depooled parse2001 because on train deployment it caused "MWException: No localisation cache found for English" and then "HTTP CRITICAL: HTTP/1.1 500 Internal Server Error" (T268524)
  • 19:29 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=parse2001.codfw.wmnet
  • 19:28 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=parse2001.codfw.wmnet
  • 19:27 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=parse2001.codfw.wmnet
  • 19:21 twentyafterfour@deploy1002: rebuilt and synchronized wikiversions files: group2 wikis to 1.36.0-wmf.37 refs T278343
  • 18:59 mutante: creating mcrouter certs for mw2379 through mw2382
  • 18:35 Urbanecm: Morning B&C window done
  • 18:33 urbanecm@deploy1002: Synchronized php-1.36.0-wmf.37/extensions/WikibaseMediaInfo/resources/mediasearch-vue/components/base/Dialog.vue: e77f2b9: Use appendChild() instead of append() (T278448) (duration: 01m 09s)
  • 18:31 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: b485d1c: Enable SandboxLink extension in ptwikinews (T278634) (duration: 01m 12s)
  • 17:49 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast1003.wikimedia.org with reason: REIMAGE
  • 17:47 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on bast1003.wikimedia.org with reason: REIMAGE
  • 17:24 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 17:21 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
  • 16:59 Urbanecm: Start server-side upload of two files (T279082, T279081)
  • 16:44 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=aqs1007.eqiad.wmnet
  • 16:39 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: a7acf33: hrwiki: Fix help panel links (T275684) (duration: 01m 10s)
  • 16:25 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2396.codfw.wmnet with reason: REIMAGE
  • 16:23 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2396.codfw.wmnet with reason: REIMAGE
  • 16:02 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2395.codfw.wmnet with reason: REIMAGE
  • 16:00 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2395.codfw.wmnet with reason: REIMAGE
  • 15:58 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2394.codfw.wmnet with reason: REIMAGE
  • 15:56 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2394.codfw.wmnet with reason: REIMAGE
  • 15:39 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2393.codfw.wmnet with reason: REIMAGE
  • 15:37 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2393.codfw.wmnet with reason: REIMAGE
  • 15:32 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2391.codfw.wmnet with reason: REIMAGE
  • 15:30 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2391.codfw.wmnet with reason: REIMAGE
  • 15:05 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2392.codfw.wmnet with reason: REIMAGE
  • 15:03 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2392.codfw.wmnet with reason: REIMAGE
  • 14:52 volans: uploaded python3-wmflib_0.0.7 to bullseye-wikimedia
  • 14:41 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2390.codfw.wmnet with reason: REIMAGE
  • 14:39 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2390.codfw.wmnet with reason: REIMAGE
  • 14:22 effie: disable puppet on mw* canaries, rolling depool and pooling of canaries
  • 14:06 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-worker1001.eqiad.wmnet with reason: REIMAGE
  • 14:04 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-worker1001.eqiad.wmnet with reason: REIMAGE
  • 14:02 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2389.codfw.wmnet with reason: REIMAGE
  • 13:59 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2389.codfw.wmnet with reason: REIMAGE
  • 13:53 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2388.codfw.wmnet with reason: REIMAGE
  • 13:51 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2388.codfw.wmnet with reason: REIMAGE
  • 13:24 ema: cp3054: reboot with Linux 4.19.181+1 -- the kernel was not upgraded earlier during T273278 reboots due to broken dpkg status
  • 13:16 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1022.eqiad.wmnet
  • 13:07 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti1022.eqiad.wmnet
  • 12:59 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
  • 12:53 dcaro@cumin1001: START - Cookbook sre.hosts.upgrade-and-reboot
  • 12:51 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
  • 12:47 moritzm: drain ganeti1022
  • 12:46 dcaro@cumin1001: START - Cookbook sre.hosts.upgrade-and-reboot
  • 12:45 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
  • 12:45 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1021.eqiad.wmnet
  • 12:40 dcaro@cumin1001: START - Cookbook sre.hosts.upgrade-and-reboot
  • 12:38 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd2003-dev.codfw.wmnet
  • 12:38 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti1021.eqiad.wmnet
  • 12:34 dcaro@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudcephosd2003-dev.codfw.wmnet
  • 12:23 moritzm: drain ganeti1021
  • 12:21 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd2003-dev.codfw.wmnet
  • 12:19 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1020.eqiad.wmnet
  • 12:15 dcaro@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudcephosd2003-dev.codfw.wmnet
  • 12:12 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti1020.eqiad.wmnet
  • 11:59 Urbanecm: Start server upload of two video files (~4 GB in total) # T278856
  • 11:55 moritzm: drain ganeti1020
  • 11:55 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1019.eqiad.wmnet
  • 11:48 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti1019.eqiad.wmnet
  • 11:43 ladsgroup@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Disable RelatedArticles on Timeless skin on German Wikipedia (T278611) (duration: 01m 08s)
  • 11:41 moritzm: drain ganeti1019
  • 11:38 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1018.eqiad.wmnet
  • 11:31 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti1018.eqiad.wmnet
  • 11:23 ladsgroup@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Enable MediaSearch by default for anonymous users (gerrit:674820) (duration: 01m 10s)
  • 11:20 moritzm: drain ganeti1018
  • 11:17 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1017.eqiad.wmnet
  • 11:10 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti1017.eqiad.wmnet
  • 11:00 moritzm: drain ganeti1017
  • 10:56 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1011.eqiad.wmnet
  • 10:49 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti1011.eqiad.wmnet
  • 10:39 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd2002-dev.codfw.wmnet
  • 10:33 dcaro@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudcephosd2002-dev.codfw.wmnet
  • 10:33 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd2001-dev.codfw.wmnet
  • 10:26 dcaro@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudcephosd2001-dev.codfw.wmnet
  • 09:07 hashar: contint2001: compressing files with 4 parallel executions: sudo -u jenkins find /srv/jenkins/builds/mediawiki-fresnel-patch-docker -name "*trace.json" -print0|xargs -0 -P4 gzip
  • 09:01 hashar: contint2001: compressing all fresnel trace--trace.json files: sudo -u jenkins find /srv/jenkins/builds/mediawiki-fresnel-patch-docker -name "*trace.json" -exec gzip {} \+ # T249268
  • 08:52 moritzm: drain ganeti1011
  • 08:35 moritzm: failover Ganeti master in eqiad to ganeti1009
  • 08:25 moritzm: installing ldb security updates
  • 08:12 kharlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 08:12 kharlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 08:10 kharlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 08:10 kharlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 08:09 kharlan@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 07:58 kharlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 07:58 kharlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 07:55 kharlan@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 06:37 elukey: powercycle cp1087 (no ssh, no tty via serial console) - T278729
  • 06:35 elukey@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp1087.eqiad.wmnet
  • 02:50 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2386.codfw.wmnet with reason: REIMAGE
  • 02:48 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2386.codfw.wmnet with reason: REIMAGE
  • 02:21 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2387.codfw.wmnet with reason: REIMAGE
  • 02:18 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2387.codfw.wmnet with reason: REIMAGE
  • 02:16 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2384.codfw.wmnet with reason: REIMAGE
  • 02:13 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2384.codfw.wmnet with reason: REIMAGE
  • 01:53 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2385.codfw.wmnet with reason: REIMAGE
  • 01:52 Reedy: `echo "https://www.mediawiki.org/static/images/footer/poweredby_mediawiki_176x62.png" | mwscript purgeList.php --wiki=enwiki` T268230
  • 01:51 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2385.codfw.wmnet with reason: REIMAGE
  • 01:51 Reedy: `echo "https://www.mediawiki.org/favicon.ico" | mwscript purgeList.php --wiki=enwiki` T268230
  • 01:37 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2384.codfw.wmnet with reason: REIMAGE
  • 01:35 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2384.codfw.wmnet with reason: REIMAGE
  • 01:24 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2383.codfw.wmnet with reason: REIMAGE
  • 01:22 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2383.codfw.wmnet with reason: REIMAGE
  • 01:12 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2382.codfw.wmnet with reason: REIMAGE
  • 01:10 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2382.codfw.wmnet with reason: REIMAGE
  • 00:56 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2381.codfw.wmnet with reason: REIMAGE
  • 00:54 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2381.codfw.wmnet with reason: REIMAGE
  • 00:46 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2380.codfw.wmnet with reason: REIMAGE
  • 00:44 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2380.codfw.wmnet with reason: REIMAGE
  • 00:32 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2379.codfw.wmnet with reason: REIMAGE
  • 00:30 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2379.codfw.wmnet with reason: REIMAGE
  • 00:12 pt1979@cumin2001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 00:08 legoktm: uploaded mailman3 3.2.1-1+wmf1, postorius 1.2.4-1+wmf1 to apt.wikimedia.org
  • 00:07 pt1979@cumin2001: START - Cookbook sre.dns.netbox

2021-03-31

  • 23:34 urbanecm@deploy1002: Synchronized php-1.36.0-wmf.37/extensions/Wikibase/client/includes/DataAccess/Scribunto/: bfc8f55: Eliminate another php.getSetting() from Lua code (duration: 01m 09s)
  • 23:32 urbanecm@deploy1002: Synchronized php-1.36.0-wmf.36/extensions/Wikibase/client/includes/DataAccess/Scribunto/: ad564a0: Eliminate another php.getSetting() from Lua code (duration: 01m 10s)
  • 23:12 jhuneidi@deploy1002: Synchronized .pipeline/config.yaml: Config: Include private folder in restricted image (T276145) (duration: 01m 08s)
  • 23:05 ladsgroup@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Use the new mediawiki logos, part II (T268230) (duration: 01m 11s)
  • 23:03 ladsgroup@deploy1002: Synchronized static: Use the new mediawiki logos, part I (T268230) (duration: 01m 09s)
  • 22:58 Urbanecm: Start server side upload for 3 files
  • 22:01 Urbanecm: Server side upload of three video files (T279011, T278956, T278955)
  • 22:01 eileen: civicrm revision changed from 2fcea570bd to 740e49d868, config revision is 6779e3829a
  • 20:16 robh@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 20:00 dwisehaupt: shifted payments2003 to use gtid for mysql replication.
  • 19:55 robh@cumin1001: START - Cookbook sre.dns.netbox
  • 19:21 twentyafterfour@deploy1002: Synchronized php: group1 wikis to 1.36.0-wmf.37 refs T278343 (duration: 01m 08s)
  • 19:20 twentyafterfour@deploy1002: rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.37 refs T278343
  • 19:18 robh@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 19:13 twentyafterfour@deploy1002: rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.37 refs T278343
  • 19:06 robh@cumin1001: START - Cookbook sre.dns.netbox
  • 19:03 twentyafterfour@deploy1002: Synchronized php-1.36.0-wmf.37/includes/Revision/RevisionRecord.php: sync https://gerrit.wikimedia.org/r/c/mediawiki/core/+/675875 to unblock train refs T278376 T278343 (duration: 00m 58s)
  • 17:56 twentyafterfour@deploy1002: rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.36 refs T278343
  • 17:49 twentyafterfour@deploy1002: rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.37 refs T278343
  • 17:41 twentyafterfour: The train is now unblocked, promoting to group0 refs T278343
  • 17:01 Urbanecm: Server side upload of three video files (T278959, T278958, T278957)
  • 15:09 hnowlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
  • 15:09 hnowlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
  • 14:57 papaul: disconnecting ps1-d8-codfw for replacement
  • 14:17 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=aqs1007.eqiad.wmnet
  • 14:02 Urbanecm: Server side upload of two video files (T278961, T278960)
  • 13:48 jynus: retrying s3 snapshot on codfw
  • 13:39 akosiaris: revert mw1412, mw1413, wtp1032, mw2305 to the previous state for T278220
  • 13:34 akosiaris: disabling puppet on role::mediawiki::appserver, role::mediawiki::appserver::api, role::mediawiki::maintenance, role::mediawiki::jobrunner, role::parsoid, role::parsoid::testing T278220
  • 13:00 akosiaris: repool all jobrunners/videoscalers in the respective conftool clusters. The video transcoding backlog has been served, so we can return to "normal"
  • 12:59 akosiaris: repool all jobrunners/videoscalers in the respective conftool clusters
  • 12:59 akosiaris@cumin1001: conftool action : set/pooled=yes; selector: dc=eqiad,cluster=videoscaler
  • 12:59 akosiaris@cumin1001: conftool action : set/pooled=yes; selector: dc=eqiad,cluster=jobrunner
  • 11:38 awight: EU deployment complete
  • 11:38 awight@deploy1002: Synchronized php-1.36.0-wmf.37/extensions/WikibaseMediaInfo: Backport: Style change to mediasearch logged-in notice close (T274927) Suppress user notice on mobile (T274927) Reset namespace filter on cancel (T276261) (duration: 01m 08s)
  • 11:26 awight@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: vector: Disable WVUI search widget treatment A/B test (T276917) (duration: 01m 08s)
  • 10:48 effie: enable puppet on all mw* servers
  • 10:10 effie: disable puppet on all mw* hosts
  • 09:03 hashar: contint2001: enable puppet again
  • 08:38 hashar: contint2001: stopping Puppet for an Apache config live hack
  • 04:35 eileen: civicrm revision changed from 7040b68c11 to 2fcea570bd, config revision is 6779e3829a
  • 02:37 pt1979@cumin2001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 02:33 pt1979@cumin2001: START - Cookbook sre.dns.netbox
  • 02:22 pt1979@cumin2001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 02:17 pt1979@cumin2001: START - Cookbook sre.dns.netbox
  • 02:06 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on wcqs2003.codfw.wmnet with reason: REIMAGE
  • 02:05 pt1979@cumin2001: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
  • 02:04 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on wcqs2003.codfw.wmnet with reason: REIMAGE
  • 02:00 pt1979@cumin2001: START - Cookbook sre.dns.netbox
  • 01:37 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on wcqs2002.codfw.wmnet with reason: REIMAGE
  • 01:35 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on wcqs2002.codfw.wmnet with reason: REIMAGE
  • 01:15 urbanecm@deploy1002: Synchronized wmf-config/config/gawiki.yaml: 3283ae5: Enable local uploads on Irish Wikipedia (T277723) (duration: 01m 08s)
  • 01:13 urbanecm@deploy1002: Synchronized dblists/commonsuploads.dblist: 3283ae5: Enable local uploads on Irish Wikipedia (T277723) (duration: 01m 08s)
  • 01:07 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on wcqs2001.codfw.wmnet with reason: REIMAGE
  • 01:05 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on wcqs2001.codfw.wmnet with reason: REIMAGE

2021-03-30

  • 23:59 Trey314159: reindexing English wikis on elastic@eqiad, elastic@codfw, and cloudelastic (T274200)
  • 23:56 legoktm@deploy1002: Synchronized php-1.36.0-wmf.37/extensions/TimedMediaHandler/extension.json: Allow autoconfirmed users to see Special:TranscodeStatistics by default (T278867) (duration: 01m 08s)
  • 23:53 legoktm@deploy1002: Synchronized php-1.36.0-wmf.36/extensions/TimedMediaHandler/extension.json: Allow autoconfirmed users to see Special:TranscodeStatistics by default (T278867) (duration: 01m 08s)
  • 23:29 Amir1: sudo django-admin hyperkitty_import -l discovery-alerts@lists-next.wikimedia.org discovery-alerts.mbox/discovery-alerts.mbox --pythonpath /usr/share/mailman3-web --settings settings (T278609)
  • 23:27 pt1979@cumin2001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 23:23 pt1979@cumin2001: START - Cookbook sre.dns.netbox
  • 23:15 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: ef306a3: Growth features: bnwiki: Enable impact module (T274793) (duration: 01m 07s)
  • 22:52 cstone: civicrm revision changed from ad430721f6 to 7040b68c11
  • 21:11 twentyafterfour@deploy1002: Finished deploy [releng/phatality@fbca60c]: rollback (duration: 00m 12s)
  • 21:11 twentyafterfour@deploy1002: Started deploy [releng/phatality@fbca60c]: rollback
  • 21:05 twentyafterfour@deploy1002: Finished deploy [releng/phatality@fbca60c]: trying again with newly built zip (duration: 00m 12s)
  • 21:05 twentyafterfour@deploy1002: Started deploy [releng/phatality@fbca60c]: trying again with newly built zip
  • 21:02 legoktm: scap pulling on mw1298
  • 20:59 twentyafterfour@deploy1002: Finished deploy [releng/phatality@715d809]: (no justification provided) (duration: 00m 15s)
  • 20:58 twentyafterfour@deploy1002: Started deploy [releng/phatality@715d809]: (no justification provided)
  • 20:58 legoktm: killed remaining ffmpeg on mw1298
  • 20:56 twentyafterfour@deploy1002: Finished deploy [releng/phatality@715d809]: (no justification provided) (duration: 00m 12s)
  • 20:56 twentyafterfour@deploy1002: Started deploy [releng/phatality@715d809]: (no justification provided)
  • 20:53 kharlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 20:52 kharlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 20:41 twentyafterfour@deploy1002: Finished deploy [releng/phatality@715d809]: (no justification provided) (duration: 00m 20s)
  • 20:41 twentyafterfour@deploy1002: Started deploy [releng/phatality@715d809]: (no justification provided)
  • 20:41 kharlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 20:40 kharlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 20:38 kharlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 20:37 kharlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 20:37 twentyafterfour@deploy1002: Finished deploy [releng/phatality@715d809]: (no justification provided) (duration: 00m 31s)
  • 20:36 twentyafterfour@deploy1002: Started deploy [releng/phatality@715d809]: (no justification provided)
  • 20:35 twentyafterfour@deploy1002: Finished deploy [releng/phatality@715d809]: (no justification provided) (duration: 00m 05s)
  • 20:35 twentyafterfour@deploy1002: Started deploy [releng/phatality@715d809]: (no justification provided)
  • 20:34 kharlan@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 20:34 twentyafterfour@deploy1002: Started restart [releng/phatality@715d809]: (no justification provided)
  • 20:33 twentyafterfour@deploy1002: Finished scap: testwikis wikis to 1.36.0-wmf.37 refs T278343 (duration: 80m 32s)
  • 20:29 twentyafterfour@deploy1002: Finished deploy [releng/phatality@715d809]: (no justification provided) (duration: 00m 49s)
  • 20:29 twentyafterfour@deploy1002: Started deploy [releng/phatality@715d809]: (no justification provided)
  • 20:28 legoktm@deploy1002: conftool action : set/pooled=yes; selector: cluster=videoscaler,name=mw1307.eqiad.wmnet
  • 20:28 legoktm@deploy1002: conftool action : set/pooled=yes; selector: cluster=videoscaler,name=mw1306.eqiad.wmnet
  • 20:28 legoktm@deploy1002: conftool action : set/pooled=yes; selector: cluster=videoscaler,name=mw1305.eqiad.wmnet
  • 20:28 legoktm@deploy1002: conftool action : set/pooled=yes; selector: cluster=videoscaler,name=mw1304.eqiad.wmnet
  • 20:28 legoktm@deploy1002: conftool action : set/pooled=yes; selector: cluster=videoscaler,name=mw1303.eqiad.wmnet
  • 20:28 legoktm@deploy1002: conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1307.eqiad.wmnet
  • 20:28 legoktm@deploy1002: conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1306.eqiad.wmnet
  • 20:27 legoktm@deploy1002: conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1305.eqiad.wmnet
  • 20:27 legoktm@deploy1002: conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1304.eqiad.wmnet
  • 20:27 legoktm@deploy1002: conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1303.eqiad.wmnet
  • 20:26 twentyafterfour: preparing to deploy phatality upgrade to kibana cluster
  • 20:25 legoktm@deploy1002: conftool action : set/pooled=yes; selector: cluster=jobrunner,name=mw1296.eqiad.wmnet
  • 20:25 legoktm@deploy1002: conftool action : set/pooled=yes; selector: cluster=jobrunner,name=mw1298.eqiad.wmnet
  • 20:25 legoktm@deploy1002: conftool action : set/pooled=yes; selector: cluster=jobrunner,name=mw1299.eqiad.wmnet
  • 20:21 joal@deploy1002: Finished deploy [analytics/refinery@1a53e9a] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@1a53e9a] (duration: 04m 29s)
  • 20:20 legoktm@deploy1002: conftool action : set/pooled=no; selector: cluster=videoscaler,name=mw1299.eqiad.wmnet
  • 20:20 legoktm@deploy1002: conftool action : set/pooled=no; selector: cluster=videoscaler,name=mw1298.eqiad.wmnet
  • 20:20 legoktm@deploy1002: conftool action : set/pooled=no; selector: cluster=videoscaler,name=mw1296.eqiad.wmnet
  • 20:16 joal@deploy1002: Started deploy [analytics/refinery@1a53e9a] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@1a53e9a]
  • 20:16 joal@deploy1002: Finished deploy [analytics/refinery@1a53e9a] (thin): Regular analytics weekly train THIN [analytics/refinery@1a53e9a] (duration: 00m 07s)
  • 20:16 joal@deploy1002: Started deploy [analytics/refinery@1a53e9a] (thin): Regular analytics weekly train THIN [analytics/refinery@1a53e9a]
  • 20:15 joal@deploy1002: Finished deploy [analytics/refinery@1a53e9a]: Regular analytics weekly train [analytics/refinery@1a53e9a] (duration: 17m 11s)
  • 20:02 twentyafterfour: while syncing the 1.36.0-wmf.37 promotion to testwikis, one server failed (mw1298.eqiad.wmnet) and two more appear to be hung: scap has been stuck at 99% with 2 left, making no progress for a long time. refs T278343
  • 19:58 bblack@cumin1001: conftool action : set/pooled=yes; selector: name=cp1087.eqiad.wmnet
  • 19:58 joal@deploy1002: Started deploy [analytics/refinery@1a53e9a]: Regular analytics weekly train [analytics/refinery@1a53e9a]
  • 19:58 bblack: repool cp1087 - T278729
  • 19:13 twentyafterfour@deploy1002: Started scap: testwikis wikis to 1.36.0-wmf.37 refs T278343
  • 18:15 pt1979@cumin2001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 18:09 pt1979@cumin2001: START - Cookbook sre.dns.netbox
  • 17:22 legoktm: moved mw[1293-1295] to jobrunners and mw[1300-1302] to videoscalers
  • 17:22 legoktm@deploy1002: conftool action : set/pooled=yes; selector: cluster=videoscaler,name=mw1302.eqiad.wmnet
  • 17:22 legoktm@deploy1002: conftool action : set/pooled=yes; selector: cluster=videoscaler,name=mw1301.eqiad.wmnet
  • 17:21 legoktm@deploy1002: conftool action : set/pooled=yes; selector: cluster=videoscaler,name=mw1300.eqiad.wmnet
  • 17:21 legoktm@deploy1002: conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1302.eqiad.wmnet
  • 17:21 legoktm@deploy1002: conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1301.eqiad.wmnet
  • 17:21 legoktm@deploy1002: conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1300.eqiad.wmnet
  • 17:21 legoktm@deploy1002: conftool action : set/pooled=yes; selector: cluster=jobrunner,name=mw1295.eqiad.wmnet
  • 17:21 legoktm@deploy1002: conftool action : set/pooled=yes; selector: cluster=jobrunner,name=mw1294.eqiad.wmnet
  • 17:21 legoktm@deploy1002: conftool action : set/pooled=yes; selector: cluster=jobrunner,name=mw1293.eqiad.wmnet
  • 17:19 legoktm: killed all ffmpeg on mw1294
  • 17:17 legoktm@deploy1002: conftool action : set/pooled=no; selector: cluster=videoscaler,name=mw1295.eqiad.wmnet
  • 17:17 legoktm@deploy1002: conftool action : set/pooled=no; selector: cluster=videoscaler,name=mw1293.eqiad.wmnet
  • 17:17 legoktm@deploy1002: conftool action : set/pooled=no; selector: cluster=videoscaler,name=mw1294.eqiad.wmnet
  • 17:13 mbsantos@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
  • 17:12 mbsantos@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
  • 17:10 mbsantos@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
  • 17:08 mbsantos@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' .
  • 17:05 mbsantos@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' .
  • 17:02 mbsantos@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' .
  • 16:40 effie: enable puppet on mw* hosts
  • 16:10 mutante: mw1296 - started ferm
  • 16:10 mutante: mw1308 - started ferm
  • 16:07 akosiaris: split jobrunners/videoscalers clusters in conftool. mw12* become videoscalers, mw13* become jobrunners, killing ffmpeg on mw13*
  • 16:07 mutante: mw1309 - systemctl start ferm
  • 16:07 akosiaris@cumin1001: conftool action : set/pooled=no; selector: dc=eqiad,cluster=jobrunner,name=mw12.*
  • 16:06 akosiaris@cumin1001: conftool action : set/pooled=no; selector: dc=eqiad,cluster=videoscaler,name=mw13.*
  • 16:06 akosiaris@cumin1001: conftool action : set/pooled=yes; selector: dc=eqiad,cluster=videoscaler,name=mw12.*
  • 15:59 akosiaris: depool a number of hosts from videoscalers
  • 15:59 akosiaris@cumin1001: conftool action : set/pooled=no; selector: dc=eqiad,cluster=videoscaler,name=mw12.*
  • 15:55 legoktm@deploy1002: conftool action : set/pooled=no; selector: name=mw1308.eqiad.wmnet,service=jobrunner
  • 15:55 legoktm@deploy1002: conftool action : set/pooled=no; selector: name=mw1307.eqiad.wmnet,service=jobrunner
  • 15:42 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=aqs1004.eqiad.wmnet
  • 15:29 hnowlan: moving all test tables out of cassandra directories on aqs hosts
  • 14:59 effie: disable puppet on mediawiki servers to deploy 663565
  • 14:58 Urbanecm: Move Help talk:Help talk:Getting started --> Help talk:Getting started via moveBatch.php on enwiki (T278350)
  • 14:32 arturo: manually start update-openstack-mirror.service on sodium (T278505)
  • 13:02 jbond42: rollout lxml update T278822
  • 12:55 jbond42: update spamassassin on lists, otrs and mx T278820
  • 12:39 Amir1: ssh -p 29418 gerrit.wikimedia.org replication start wikidata/query-builder --wait (T277060)
  • 12:38 jbond42: update python(3)-pygments
  • 12:36 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=aqs1004.eqiad.wmnet
  • 12:14 Urbanecm: mwmaint1002: Downloading multiple big files (total filesize estimated 150 GB, downloaded and processed in batches) for server-side uploads
  • 11:21 ladsgroup@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Disable legacy javascript global variables in group1, Some increase in client errors is expected (T72470) (duration: 01m 11s)
  • 09:58 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet1003.eqiad.wmnet
  • 09:52 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudnet1003.eqiad.wmnet
  • 09:42 hnowlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
  • 09:41 hnowlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
  • 09:35 hnowlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
  • 09:35 hnowlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
  • 09:05 hnowlan@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
  • 09:04 hnowlan@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
  • 08:36 jynus: mariadb upgrade of all buster source backup hosts to 10.4.18 T250666
  • 08:05 dcausse: refreshing wdqs entities (T278693)
  • 07:37 elukey: restart-php7.2-fpm on mw1304, jobrunner completely overwhelmed by ffmpeg/transcode jobs (not publishing metrics, erroring out for memcached timeouts) - T278734
  • 07:28 hashar@deploy1002: rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.36 - T274940
  • 06:06 elukey: powercycle cp1087 (no ssh, no mgmt console tty)
  • 06:04 elukey@puppetmaster1001: conftool action : set/pooled=no; selector: name=cp1087.eqiad.wmnet

2021-03-29

  • 19:06 hnowlan@puppetmaster1001: conftool action : set/pooled=yes; selector: name=aqs1004.eqiad.wmnet
  • 17:47 volans@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 17:37 volans@cumin1001: START - Cookbook sre.dns.netbox
  • 16:15 hnowlan@puppetmaster1001: conftool action : set/pooled=no; selector: name=aqs1004.eqiad.wmnet
  • 16:11 hnowlan: depooled aqs1004 for transfer of large tables to aqs1010
  • 15:54 jbond@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 15:47 jbond@cumin1001: START - Cookbook sre.dns.netbox
  • 15:45 jbond@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 15:39 jbond@cumin1001: START - Cookbook sre.dns.netbox
  • 13:26 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2001.codfw.wmnet with reason: REIMAGE
  • 13:24 jiji@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on parse2001.codfw.wmnet with reason: REIMAGE
  • 13:03 ema: cp4027: rollback luajit experiment https://github.com/apache/trafficserver/issues/7423#issuecomment-809354214
  • 12:36 ema: cp4027: re-enable JIT compilation in all ats-be lua scripts -- https://github.com/apache/trafficserver/issues/7423
  • 11:57 ema: cp4027: re-enable JIT compilation in normalize-path.lua -- https://github.com/apache/trafficserver/issues/7423
  • 11:32 ema: cp4027: install libluajit 2.1.0~beta3+dfsg-6wm1 with P15083 applied -- https://github.com/apache/trafficserver/issues/7423
  • 09:59 jbond@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pki2001.codfw.wmnet with reason: REIMAGE
  • 09:57 jbond@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on pki2001.codfw.wmnet with reason: REIMAGE
  • 09:16 ryankemper: T267927 `sudo -i cookbook sre.wdqs.data-reload wdqs2008.codfw.wmnet --task-id T267927 --reload-data wikidata --reason 'T267927: Reload wikidata jnl from fresh dumps' --reuse-downloaded-dump --depool`
  • 09:15 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-reload
  • 08:47 filippo@deploy1002: Finished deploy [librenms/librenms@df69efe]: deploy I156f32925f693 (duration: 00m 08s)
  • 08:47 filippo@deploy1002: Started deploy [librenms/librenms@df69efe]: deploy I156f32925f693
  • 07:59 hashar@deploy1002: Synchronized php: group1 wikis to 1.36.0-wmf.36 (duration: 01m 06s)
  • 07:58 hashar@deploy1002: rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.36
  • 07:54 hashar@deploy1002: Synchronized php-1.36.0-wmf.36/extensions/FlaggedRevs: Wrap most of functionalities depending on protect mode in a condition - T278478 (duration: 01m 08s)
  • 07:49 ladsgroup@deploy1002: Synchronized php-1.36.0-wmf.36/extensions/FlaggedRevs: Wrap most of functionalities depending on protect mode in a condition (T278478) (duration: 01m 08s)
  • 07:42 godog: swift eqiad-prod: less weight for ms-be[1019-1026] / more weight to ms-be106[0-3] - T272836 T268435

2021-03-27

  • 19:25 elukey: powercycle elastic1060 - T278630
  • 06:10 ryankemper: T267927 `sudo https_proxy=webproxy.codfw.wmnet:8080 wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 -O /srv/wdqs/latest-all.ttl.bz2 && sudo https_proxy=webproxy.codfw.wmnet:8080 wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.ttl.bz2 -O /srv/wdqs/latest-lexemes.ttl.bz2` on `ryankemper@wdqs2008` tmux session `download_dumps_2020-03-26`
  • 05:44 ryankemper@cumin2001: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
  • 05:44 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-reload
  • 05:42 ryankemper@cumin2001: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
  • 05:42 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-reload
  • 05:40 ryankemper@cumin2001: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
  • 05:40 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-reload
  • 05:40 ryankemper@cumin2001: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
  • 05:40 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-reload
  • 05:38 ryankemper@cumin2001: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
  • 05:38 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-reload

2021-03-26

  • 22:27 tzatziki: reset password for Philroc
  • 20:10 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw1001.eqiad.wmnet with reason: REIMAGE
  • 20:08 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw1001.eqiad.wmnet with reason: REIMAGE
  • 17:44 hashar@deploy1002: Synchronized php-1.36.0-wmf.36/includes/changes/RecentChange.php: RecentChange: directly build the user identity if we have the data - T277795 (duration: 01m 06s)
  • 17:42 hashar@deploy1002: Finished scap: Revert "Add change tags for media additions/removals" - T266067 T278429 (duration: 31m 43s)
  • 17:10 hashar@deploy1002: Started scap: Revert "Add change tags for media additions/removals" - T266067 T278429
  • 15:40 Urbanecm: Delete `commonswiki:ip-autoblock:whitelist` cache key from memcached (wmf.36 moves the autoblock whitelist source, and it was deployed on commonswiki for a while, resulting in the cache key being empty)
  • 15:37 hnowlan: importing imposm3_0.11.0+git20201104.4758cf4-1_amd64.changes on apt1001
  • 14:40 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1016.eqiad.wmnet
  • 14:33 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti1016.eqiad.wmnet
  • 14:05 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1015.eqiad.wmnet
  • 13:58 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti1015.eqiad.wmnet
  • 13:10 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1014.eqiad.wmnet
  • 13:02 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti1014.eqiad.wmnet
  • 13:02 moritzm: reimaging theemin T275873
  • 12:56 moritzm: drain ganeti1014
  • 12:49 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1013.eqiad.wmnet
  • 12:42 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti1013.eqiad.wmnet
  • 12:37 moritzm: drain ganeti1013
  • 12:35 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1012.eqiad.wmnet
  • 12:27 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti1012.eqiad.wmnet
  • 10:55 Urbanecm: Move `Help talk:Getting Started --> Help talk:Getting started` on enwiki with `[urbanecm@mwmaint1002 ~]$ mwscript moveBatch.php --wiki=enwiki -r 'sysadmin action: fixing phab:T278350' -u 'Martin Urbanec' batch.txt` (T278350)
  • 10:49 Urbanecm: Move `User talk:TheAafi/Help talk` to `Help talk:Getting Started` via `[urbanecm@mwmaint1002 ~]$ mwscript moveBatch.php --wiki=enwiki -r 'sysadmin action: fixing phab:T278350' -u 'Martin Urbanec' batch.txt` to fix an UBN task (T278350)
  • 10:10 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts chlorine.eqiad.wmnet
  • 10:02 akosiaris@cumin1001: START - Cookbook sre.hosts.decommission for hosts chlorine.eqiad.wmnet
  • 10:00 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts argon.eqiad.wmnet
  • 09:49 filippo@deploy1002: Finished deploy [librenms/librenms@63e862a]: deploy I955cbfc244 (duration: 00m 08s)
  • 09:49 filippo@deploy1002: Started deploy [librenms/librenms@63e862a]: deploy I955cbfc244
  • 09:46 akosiaris@cumin1001: START - Cookbook sre.hosts.decommission for hosts argon.eqiad.wmnet
  • 09:45 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts acrab.codfw.wmnet
  • 09:43 moritzm: delete fermium in Ganeti (was still around, but powered down) T224586
  • 09:38 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts acrux.codfw.wmnet
  • 09:36 akosiaris@cumin1001: START - Cookbook sre.hosts.decommission for hosts acrab.codfw.wmnet
  • 09:32 akosiaris@cumin1001: START - Cookbook sre.hosts.decommission for hosts acrux.codfw.wmnet
  • 09:31 filippo@deploy1002: Finished deploy [librenms/librenms@e7727e3]: deploy I12ac21d877c (duration: 00m 12s)
  • 09:31 filippo@deploy1002: Started deploy [librenms/librenms@e7727e3]: deploy I12ac21d877c
  • 09:28 moritzm: drain ganeti1012
  • 09:27 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1010.eqiad.wmnet
  • 09:20 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti1010.eqiad.wmnet
  • 08:38 moritzm: drain ganeti1010
  • 08:38 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1009.eqiad.wmnet
  • 08:30 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti1009.eqiad.wmnet
  • 06:11 ryankemper: [WDQS Deploy] Deploy complete. Successful test query placed on query.wikidata.org, there are no relevant criticals in Icinga, and Grafana looks good
  • 06:09 ryankemper: [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'`
  • 06:09 ryankemper: [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'`
  • 06:09 ryankemper: [WDQS Deploy] Restarted `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'`
  • 05:06 ryankemper@deploy1002: Finished deploy [wdqs/wdqs@bb5a072]: 0.3.68 (duration: 07m 31s)
  • 05:00 ryankemper: [WDQS Deploy] Tests passing following deploy of `0.3.68` on canary `wdqs1003`; proceeding to rest of fleet
  • 04:58 ryankemper@deploy1002: Started deploy [wdqs/wdqs@bb5a072]: 0.3.68
  • 04:58 ryankemper: [WDQS Deploy] Gearing up for deploy of wdqs `0.3.68`. Pre-deploy tests passing on canary `wdqs1003`

2021-03-25

  • 23:47 thcipriani@deploy1002: Synchronized php-1.36.0-wmf.36/extensions/3D/package.json: No-op demo sync (duration: 01m 07s)
  • 23:37 stran@deploy1002: Synchronized README: (no justification provided) (duration: 01m 06s)
  • 23:20 jhuneidi@deploy1002: Synchronized README: DEMO: README (duration: 01m 07s)
  • 22:59 brennen: no patches for upcoming deploy window, but we'll be conducting a deployment training using DEMO patches to READMEs.
  • 22:16 Urbanecm: [urbanecm@mwmaint1002 ~]$ mwscript deleteEqualMessages.php --wiki=hrwiki --delete
  • 21:35 kharlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 21:35 kharlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 21:31 kharlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 21:31 kharlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 21:27 kharlan@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 19:48 hashar@deploy1002: rebuilt and synchronized wikiversions files: Revert group 1 and 2 wikis to 1.36.0-wmf.35 - T274940
  • 19:37 hashar@deploy1002: rebuilt and synchronized wikiversions files: Revert group2 wikis to 1.36.0-wmf.35 - T274940
  • 19:36 hashar@deploy1002: sync-wikiversions aborted: (no justification provided) (duration: 00m 03s)
  • 19:11 hashar@deploy1002: rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.36
  • 19:04 urbanecm@deploy1002: Synchronized wmf-config/flaggedrevs.php: ce7d2d7: ruwiki: flaggedrevs: Delete autoeditor group (T275337) (duration: 01m 08s)
  • 19:01 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: ce7d2d7: ruwiki: flaggedrevs: Delete autoeditor group (T275337) (duration: 01m 06s)
  • 18:59 Urbanecm: `mwscript migrateUserGroup.php --wiki=ruwiki 'autoeditor' 'autoreview' ` finished (T275337)
  • 18:53 Urbanecm: [urbanecm@mwmaint1002 ~/uploads]$ mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user=Sturm . # T278391
  • 18:50 Urbanecm: mwscript migrateUserGroup.php --wiki=ruwiki 'autoeditor' 'autoreview' # T275337
  • 18:49 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 39cd4f1: ruwiki: flaggedrevs: Do not allow sysops to modify users in autoeditor group (T275337) (duration: 01m 09s)
  • 18:45 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: dcfb7fe: ruwiki: flaggedrevs: Do not remove autoreview group (T275337) (duration: 01m 14s)
  • 18:39 urbanecm@deploy1002: Synchronized wmf-config/flaggedrevs.php: 3fb6646: ruwiki: flaggedrevs: Revoke review from sysop group (T275811) (duration: 01m 06s)
  • 18:29 urbanecm@deploy1002: Synchronized logos/config.yaml: 29660f9: Update altwiki logo (3/3; T275819) (duration: 01m 06s)
  • 18:28 urbanecm@deploy1002: Synchronized wmf-config/logos.php: 29660f9: Update altwiki logo (2/3; T275819) (duration: 01m 06s)
  • 18:26 urbanecm@deploy1002: Synchronized static/images/project-logos/: 29660f9: Update altwiki logo (1/3; T275819) (duration: 01m 10s)
  • 18:21 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 62be4e7: Disable magic links on enwiki (T275951) (duration: 01m 20s)
  • 18:14 mutante: alert1001 - sudo systemctl restart tcpircbot-logmsgbot
  • 18:09 marxarelli: scap sync-file .pipeline Config: Include patches in restricted image (T271274)
  • 18:06 hnowlan: draining and restarting aqs1004-b cassandra
  • 17:45 hnowlan: draining and restarting aqs1004-a cassandra
  • 17:16 mbsantos@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' .
  • 17:14 mbsantos@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' .
  • 17:08 mbsantos@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' .
  • 16:39 hashar: Restarted Apache 2 on contint2001 / contint1001
  • 16:35 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-upgrade
  • 16:32 moritzm: restarting apache on an-tool1007/turnilo
  • 16:27 moritzm: restarting dnsdist/rdns-recursor on malmok
  • 16:24 jbond42: restart slapd on ldap-replica
  • 16:22 jbond42: restart slapd on ldap-corp
  • 16:20 jbond42: restart apache on lists1002
  • 16:18 jbond42: restart apache on netbox
  • 16:13 hashar@deploy1002: Synchronized php-1.36.0-wmf.36/extensions/ProofreadPage: Disallow negative or decimal values in pages tag - T278400 (duration: 01m 32s)
  • 16:12 jbond42: restart routinator on rpki*
  • 16:12 moritzm: restarting nginx on apt*
  • 16:10 moritzm: restarting apache on dbmonitor
  • 16:08 moritzm: restart Apache on matomo/piwik
  • 16:03 jbond42: restart apache service on gerrit
  • 16:02 jbond42: restart idp service
  • 16:01 ema: A:cp rolling ats-{tls,backend}-restart for openssl upgrades -- https://www.openssl.org/news/secadv/20210325.txt
  • 15:45 moritzm: installing openssl updates on buster
  • 14:48 herron@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 14:45 herron@cumin1001: START - Cookbook sre.dns.netbox
  • 14:13 twentyafterfour: update phabricator again (last night's update undid a hotfix that is now fixed properly)
  • 13:45 moritzm: drain ganeti1009
  • 13:27 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on webperf1001.eqiad.wmnet with reason: adapt RAM
  • 13:27 jmm@cumin2001: START - Cookbook sre.hosts.downtime for 1:00:00 on webperf1001.eqiad.wmnet with reason: adapt RAM
  • 13:27 moritzm: reduce webperf1001/webperf2001 to 4G RAM (xhgui has been split off to separate VMs)
  • 13:24 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1008.eqiad.wmnet
  • 13:18 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti1008.eqiad.wmnet
  • 12:52 hnowlan: aqs1004 nodetool-a cleanup finished
  • 12:14 moritzm: drain ganeti1008
  • 12:12 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1007.eqiad.wmnet
  • 12:08 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti1007.eqiad.wmnet
  • 11:52 ladsgroup@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Disable Legacy javascript in fawikiquote (T72470) (duration: 01m 07s)
  • 11:46 moritzm: drain ganeti1007
  • 11:44 ladsgroup@deploy1002: Synchronized php-1.36.0-wmf.36/skins/Vector/resources: Inform anonymous A/B test by tracking time from navigationStart (T275807) (duration: 01m 09s)
  • 11:43 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1006.eqiad.wmnet
  • 11:39 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti1006.eqiad.wmnet
  • 11:33 ladsgroup@deploy1002: Synchronized dblists/: tawiki: Enable Growth features in dark mode, Part II (T278369) (duration: 01m 07s)
  • 11:32 ladsgroup@deploy1002: Synchronized wmf-config: tawiki: Enable Growth features in dark mode (T278369) (duration: 01m 30s)
  • 11:29 jbond@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pki2001.codfw.wmnet with reason: REIMAGE
  • 11:27 jbond@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on pki2001.codfw.wmnet with reason: REIMAGE
  • 11:24 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns4001.wikimedia.org
  • 11:19 jbond@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pki1001.eqiad.wmnet with reason: REIMAGE
  • 11:18 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host dns4001.wikimedia.org
  • 11:17 jbond@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on pki1001.eqiad.wmnet with reason: REIMAGE
  • 11:10 moritzm: drain ganeti1006
  • 11:03 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1005.eqiad.wmnet
  • 10:58 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti1005.eqiad.wmnet
  • 10:54 kharlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 10:54 kharlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 10:51 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flerovium.eqiad.wmnet
  • 10:48 kharlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 10:48 kharlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 10:45 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host flerovium.eqiad.wmnet
  • 10:44 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host furud.codfw.wmnet
  • 10:42 kharlan@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 10:40 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host furud.codfw.wmnet
  • 10:36 hnowlan: running general nodetool cleanup on aqs1004-a
  • 10:35 hnowlan: running cleanup on aqs1004-a: nodetool-a cleanup "local_group_default_T_pageviews_per_project_v2" data
  • 10:34 moritzm: drain ganeti1005
  • 10:29 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet
  • 10:28 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
  • 10:24 dcaro@cumin1001: START - Cookbook sre.hosts.upgrade-and-reboot
  • 10:23 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
  • 10:22 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet
  • 10:18 dcaro@cumin1001: START - Cookbook sre.hosts.upgrade-and-reboot
  • 10:17 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
  • 10:13 dcaro@cumin1001: START - Cookbook sre.hosts.upgrade-and-reboot
  • 10:13 dcaro@cumin1001: END (FAIL) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=99)
  • 10:13 dcaro@cumin1001: START - Cookbook sre.hosts.upgrade-and-reboot
  • 09:26 moritzm: drain ganeti2024
  • 09:25 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet
  • 09:17 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet
  • 08:45 moritzm: drain ganeti2023
  • 08:43 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet
  • 08:35 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet
  • 08:12 elukey: upgrade hive packages in thirdparty/bigtop15 to 2.3.6-2 for buster-wikimedia
  • 08:11 elukey: upgrade hive packages in thirdparty/bigtop15 to 2.3.6-2
  • 07:41 legoktm: upgraded lists1002 to hyperkitty 1.2.2-1+wmf1 (T276687)
  • 07:36 legoktm: uploaded hyperkitty 1.2.2-1+wmf1 to buster-wikimedia (T276687)
  • 07:35 jynus: restart db2135 T278408 T273281
  • 07:05 effie: enable puppet on all mediawiki servers
  • 06:57 XioNoX: Option 82: use-vlan-id
  • 06:53 effie: enable puppet on jobrunners
  • 06:47 effie: enable puppet on parsoid
  • 06:40 effie: disable puppet on all mediawiki servers to merge 673061 (service proxy to listen on ::1)
  • 06:23 ryankemper@cumin1001: END (FAIL) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=99)
  • 05:19 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-upgrade
  • 04:44 legoktm: restarted exim4 on lists1002 so it listens on 0.0.0.0 instead of 127.0.0.1
  • 04:16 ryankemper@cumin1001: END (FAIL) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=99)
  • 03:10 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-upgrade
  • 01:33 ryankemper@cumin1001: END (FAIL) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=99)
  • 01:10 legoktm: mailman3: added lists-next.wikimedia.org domain
  • 01:08 legoktm: mailman3: renamed default site from "example.com" to "lists-next.wikimedia.org"
  • 00:50 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw2378.codfw.wmnet
  • 00:35 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw2377.codfw.wmnet
  • 00:35 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw2777.codfw.wmnet
  • 00:34 mutante: mw2377, mw2378 - first scap pull
  • 00:33 dzahn@cumin1001: conftool action : set/weight=10; selector: name=mw2378.codfw.wmnet
  • 00:33 dzahn@cumin1001: conftool action : set/weight=10; selector: name=mw2377.codfw.wmnet
  • 00:32 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2378.codfw.wmnet
  • 00:32 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2377.codfw.wmnet
  • 00:29 legoktm: syncing facts for puppet-compiler
  • 00:23 mutante: mw2377, mw2378 - reboot
  • 00:14 twentyafterfour: phabricator update complete
  • 00:10 twentyafterfour: deploying phabricator
  • 00:05 ryankemper: T274204 `sudo -i cookbook sre.elasticsearch.rolling-upgrade search_eqiad "eqiad cluster reboot" --task-id T274204 --nodes-per-run 3 --start-datetime 2021-03-24T23:55:35` on `ryankemper@cumin1001` tmux session `elasticsearch_rolling_upgrade_reboots`

2021-03-24

  • 23:57 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2378.codfw.wmnet with reason: new_install
  • 23:57 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2378.codfw.wmnet with reason: new_install
  • 23:56 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2377.codfw.wmnet with reason: new_install
  • 23:56 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2377.codfw.wmnet with reason: new_install
  • 23:56 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-upgrade
  • 23:48 mutante: generating new mcrouter certs for mw2377, mw2378
  • 22:07 ryankemper@cumin1001: END (PASS) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=0)
  • 22:07 legoktm: disabled puppet on lists1002 while mailman3-web is broken
  • 21:49 ryankemper@cumin2001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 21:19 mutante: webperf2001 - restarted apache
  • 21:11 hashar@deploy1002: Synchronized php: group1 wikis to 1.36.0-wmf.36 (duration: 01m 07s)
  • 21:10 hashar@deploy1002: rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.36
  • 21:08 kharlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 21:08 kharlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 21:07 hashar@deploy1002: Synchronized php-1.36.0-wmf.36/extensions/GrowthExperiments: LinkRecommendation: Modify path args for calls to API - T277865 (duration: 01m 07s)
  • 21:05 hashar@deploy1002: Synchronized php-1.36.0-wmf.36/extensions/ProofreadPage: Revert "Add default TemplateStyles for an Index" - T278379 (duration: 01m 07s)
  • 21:04 kharlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 21:04 kharlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 21:02 hashar@deploy1002: Synchronized php-1.36.0-wmf.36/extensions/GlobalUsage: Fix hook registration after class was namespaced - T278375 (duration: 01m 07s)
  • 20:59 hashar@deploy1002: Synchronized wmf-config/env.php: multiversion: Move '@' operator in env.php closer to relevant statement (duration: 01m 07s)
  • 20:56 kharlan@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 20:30 kharlan@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 20:26 kharlan@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 20:13 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-transfer
  • 20:13 ryankemper@cumin2001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 20:10 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-transfer
  • 20:09 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-transfer
  • 20:07 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cumin2002.codfw.wmnet with reason: REIMAGE
  • 20:05 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on cumin2002.codfw.wmnet with reason: REIMAGE
  • 19:59 ryankemper@cumin2001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
  • 19:59 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-transfer
  • 19:57 ryankemper: T267927 Host key is missing for `wdqs2008` leading to `data-transfer` cookbook failing, looking into resolving
  • 19:55 ryankemper@cumin2001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
  • 19:55 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-transfer
  • 19:50 ryankemper@cumin2001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
  • 19:50 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-transfer
  • 19:49 ryankemper@cumin2001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
  • 19:49 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-transfer
  • 19:45 ryankemper@cumin2001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
  • 19:45 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-transfer
  • 19:42 ryankemper: T267927 Re-enabled puppet on `wdqs2008` and ran puppet agent
  • 19:21 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-upgrade
  • 19:14 hashar@deploy1002: rebuilt and synchronized wikiversions files: Revert group 1 to 1.36.0-wmf.35
  • 19:07 hashar@deploy1002: Synchronized php: group1 wikis to 1.36.0-wmf.36 (duration: 01m 21s)
  • 19:05 hashar@deploy1002: rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.36
  • 19:03 urbanecm@deploy1002: Synchronized wmf-config/config/shwiki.yaml: 0f3aa72: shwiki: Enable Growth features in dark mode (T278240; 3/3) (duration: 01m 08s)
  • 19:02 urbanecm@deploy1002: Synchronized dblists/growthexperiments.dblist: 0f3aa72: shwiki: Enable Growth features in dark mode (T278240; 2/3) (duration: 01m 06s)
  • 19:00 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 0f3aa72: shwiki: Enable Growth features in dark mode (T278240; 1/3) (duration: 01m 07s)
  • 18:54 urbanecm@deploy1002: Synchronized wmf-config/config/eswiki.yaml: ced0920: Enable Growth features on eswiki in dark mode (T278235; 3/3) (duration: 01m 06s)
  • 18:53 urbanecm@deploy1002: Synchronized dblists/growthexperiments.dblist: ced0920: Enable Growth features on eswiki in dark mode (T278235; 2/3) (duration: 01m 07s)
  • 18:52 urbanecm@deploy1002: sync-file aborted: ced0920: Enable Growth features on eswiki in dark mode (2/3) (duration: 00m 01s)
  • 18:51 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: ced0920: Enable Growth features on eswiki in dark mode (T278235; 1/3) (duration: 01m 08s)
  • 18:49 legoktm@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 18:45 legoktm@cumin1001: START - Cookbook sre.dns.netbox
  • 18:42 legoktm@cumin1001: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
  • 18:40 legoktm@cumin1001: START - Cookbook sre.dns.netbox
  • 18:31 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 5aa0506: Promote several Growth target wikis out of dark mode (T277491; T276830; T276123; T276816; T275550; T276450) (duration: 01m 08s)
  • 18:20 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 333393d: Add autopatrol to autoreviewers in en.wikibooks (T278300) (duration: 01m 09s)
  • 18:08 pt1979@cumin2001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 18:02 pt1979@cumin2001: START - Cookbook sre.dns.netbox
  • 17:25 effie: upgrade memcached on mc-gp* hosts
  • 15:45 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on irc2001.wikimedia.org with reason: adapt RAM
  • 15:45 jmm@cumin2001: START - Cookbook sre.hosts.downtime for 1:00:00 on irc2001.wikimedia.org with reason: adapt RAM
  • 15:42 moritzm: reduce RAM for irc2001 to 2G, was originally created with 8G T224579
  • 15:35 effie: enable puppet on all mediawiki + memcached hosts
  • 15:20 moritzm: drain ganeti2022
  • 15:20 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2021.codfw.wmnet
  • 15:10 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti2021.codfw.wmnet
  • 14:35 moritzm: drain ganeti2021
  • 14:31 effie: disable puppet on all mediawiki servers + memcached for 674290
  • 14:05 moritzm: failover Ganeti master in codfw to ganeti2019
  • 13:59 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet
  • 13:51 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet
  • 13:29 moritzm: installing irc1001
  • 13:15 moritzm: drain ganeti2020
  • 12:38 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet
  • 12:30 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
  • 12:28 effie: enabling puppet on mediawiki and memcached servers
  • 12:10 jynus: restart dbprov200[12] T271913
  • 11:59 marostegui@cumin1001: dbctl commit (dc=all): 'db1160 (re)pooling @ 100%: Slowly repool db1160 after schema change', diff saved to https://phabricator.wikimedia.org/P15076 and previous config saved to /var/cache/conftool/dbconfig/20210324-115940-root.json
  • 11:57 Andrew-WMDE_: EU deploys done
  • 11:53 jynus: restart dbprov100[12] T271913
  • 11:51 andrew-wmde@deploy1002: Synchronized php-1.36.0-wmf.35/extensions/MassMessage/: Backport: MassMessage: Unbreak remote content fetching (T276936) (duration: 01m 08s)
  • 11:49 effie: disable puppet on all hosts running mediawiki+memcached to merge 674282
  • 11:45 andrew-wmde@deploy1002: Synchronized php-1.36.0-wmf.36/extensions/MassMessage/: Backport: MassMessage: Unbreak remote content fetching (T276936) (duration: 01m 07s)
  • 11:44 marostegui@cumin1001: dbctl commit (dc=all): 'db1160 (re)pooling @ 75%: Slowly repool db1160 after schema change', diff saved to https://phabricator.wikimedia.org/P15075 and previous config saved to /var/cache/conftool/dbconfig/20210324-114436-root.json
  • 11:29 marostegui@cumin1001: dbctl commit (dc=all): 'db1160 (re)pooling @ 50%: Slowly repool db1160 after schema change', diff saved to https://phabricator.wikimedia.org/P15074 and previous config saved to /var/cache/conftool/dbconfig/20210324-112932-root.json
  • 11:22 andrew-wmde@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Enable CodeMirror accessibility colors on initial wikis (T276346) (duration: 01m 08s)
  • 11:15 jynus: restart serially db2097 db2098 db2099 db2100 T271913
  • 11:14 andrew-wmde@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Enable bracket matching on group0 and wikitech (T273591) (duration: 01m 25s)
  • 11:14 marostegui@cumin1001: dbctl commit (dc=all): 'db1160 (re)pooling @ 25%: Slowly repool db1160 after schema change', diff saved to https://phabricator.wikimedia.org/P15073 and previous config saved to /var/cache/conftool/dbconfig/20210324-111429-root.json
  • 10:50 jmm@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host irc1001.wikimedia.org
  • 10:48 mbsantos@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
  • 10:45 mbsantos@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
  • 10:44 mbsantos@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
  • 10:36 jmm@cumin1001: START - Cookbook sre.ganeti.makevm for new host irc1001.wikimedia.org
  • 10:31 jynus: restart db1171 T271913
  • 10:15 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 10:14 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 10:14 jynus: restart db1145 T271913
  • 10:06 akosiaris@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 10:06 akosiaris@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 10:03 jynus: restart db1139 T271913
  • 09:56 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1160 for schema change', diff saved to https://phabricator.wikimedia.org/P15072 and previous config saved to /var/cache/conftool/dbconfig/20210324-095655-marostegui.json
  • 09:56 marostegui@cumin1001: dbctl commit (dc=all): 'db1149 (re)pooling @ 100%: Slowly repool db1149 after schema change', diff saved to https://phabricator.wikimedia.org/P15071 and previous config saved to /var/cache/conftool/dbconfig/20210324-095606-root.json
  • 09:51 jynus: restart db1116 T271913
  • 09:41 marostegui@cumin1001: dbctl commit (dc=all): 'db1149 (re)pooling @ 75%: Slowly repool db1149 after schema change', diff saved to https://phabricator.wikimedia.org/P15070 and previous config saved to /var/cache/conftool/dbconfig/20210324-094102-root.json
  • 09:28 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 09:28 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 09:25 marostegui@cumin1001: dbctl commit (dc=all): 'db1149 (re)pooling @ 50%: Slowly repool db1149 after schema change', diff saved to https://phabricator.wikimedia.org/P15069 and previous config saved to /var/cache/conftool/dbconfig/20210324-092558-root.json
  • 09:10 marostegui@cumin1001: dbctl commit (dc=all): 'db1149 (re)pooling @ 25%: Slowly repool db1149 after schema change', diff saved to https://phabricator.wikimedia.org/P15068 and previous config saved to /var/cache/conftool/dbconfig/20210324-091055-root.json
  • 08:29 jayme@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=sessionstore
  • 08:16 gehel: restarting wdqs updater on all nodes for config change
  • 08:14 jayme@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=eventgate-analytics
  • 08:14 jayme@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=eventgate-analytics-external
  • 08:10 marostegui@cumin1001: dbctl commit (dc=all): 'db1086 (re)pooling @ 75%: Slowly repool db1086 after schema change', diff saved to https://phabricator.wikimedia.org/P15066 and previous config saved to /var/cache/conftool/dbconfig/20210324-081057-root.json
  • 08:07 marostegui@cumin1001: dbctl commit (dc=all): 'db1141 (re)pooling @ 100%: Slowly repool db1141', diff saved to https://phabricator.wikimedia.org/P15065 and previous config saved to /var/cache/conftool/dbconfig/20210324-080725-root.json
  • 08:02 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1149 for schema change', diff saved to https://phabricator.wikimedia.org/P15064 and previous config saved to /var/cache/conftool/dbconfig/20210324-080223-marostegui.json
  • 08:01 jayme@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=eventgate-main
  • 08:01 jayme@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=eventgate-logging-external
  • 08:01 jayme@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=zotero
  • 07:55 marostegui@cumin1001: dbctl commit (dc=all): 'db1086 (re)pooling @ 50%: Slowly repool db1086 after schema change', diff saved to https://phabricator.wikimedia.org/P15063 and previous config saved to /var/cache/conftool/dbconfig/20210324-075553-root.json
  • 07:52 marostegui@cumin1001: dbctl commit (dc=all): 'db1141 (re)pooling @ 75%: Slowly repool db1141', diff saved to https://phabricator.wikimedia.org/P15062 and previous config saved to /var/cache/conftool/dbconfig/20210324-075221-root.json
  • 07:50 jayme@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=dnsdisc=eventgate-main
  • 07:50 jayme@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=dnsdisc=eventgate-logging-external
  • 07:50 jayme@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=dnsdisc=zotero
  • 07:41 elukey@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-etcd2002.codfw.wmnet
  • 07:40 marostegui@cumin1001: dbctl commit (dc=all): 'db1086 (re)pooling @ 25%: Slowly repool db1086 after schema change', diff saved to https://phabricator.wikimedia.org/P15061 and previous config saved to /var/cache/conftool/dbconfig/20210324-074050-root.json
  • 07:37 marostegui@cumin1001: dbctl commit (dc=all): 'db1141 (re)pooling @ 50%: Slowly repool db1141', diff saved to https://phabricator.wikimedia.org/P15060 and previous config saved to /var/cache/conftool/dbconfig/20210324-073718-root.json
  • 07:27 elukey@cumin1001: START - Cookbook sre.ganeti.makevm for new host ml-etcd2002.codfw.wmnet
  • 07:23 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1086 for schema change', diff saved to https://phabricator.wikimedia.org/P15059 and previous config saved to /var/cache/conftool/dbconfig/20210324-072319-marostegui.json
  • 07:22 marostegui@cumin1001: dbctl commit (dc=all): 'db1141 (re)pooling @ 25%: Slowly repool db1141', diff saved to https://phabricator.wikimedia.org/P15058 and previous config saved to /var/cache/conftool/dbconfig/20210324-072214-root.json
  • 07:20 elukey@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ml-etcd2002.codfw.wmnet
  • 07:10 elukey@cumin1001: START - Cookbook sre.hosts.decommission for hosts ml-etcd2002.codfw.wmnet
  • 07:09 moritzm: installing squid security updates
  • 06:35 marostegui@cumin1001: dbctl commit (dc=all): 'Add db1181 to dbctl, depooled T275633', diff saved to https://phabricator.wikimedia.org/P15057 and previous config saved to /var/cache/conftool/dbconfig/20210324-063459-marostegui.json
  • 06:24 root@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1084.eqiad.wmnet
  • 06:14 root@cumin1001: START - Cookbook sre.hosts.decommission for hosts db1084.eqiad.wmnet
  • 05:52 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1141', diff saved to https://phabricator.wikimedia.org/P15056 and previous config saved to /var/cache/conftool/dbconfig/20210324-055246-marostegui.json
  • 04:44 ryankemper@cumin1001: END (FAIL) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=99)
  • 03:41 ryankemper: T274204 `sudo -i cookbook sre.elasticsearch.rolling-upgrade search_codfw "codfw cluster reboot" --task-id T274204 --nodes-per-run 3 --start-datetime 2021-03-24T02:29:39` on `ryankemper@cumin1001` tmux session `elasticsearch_rolling_upgrade_reboots`
  • 03:41 ryankemper: T274204 Re-running the `codfw` restart; the timestamp argument should prevent it from wasting time on nodes that have already been rebooted
  • 03:40 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-upgrade
  • 03:39 ryankemper: T274204 Timed out waiting for write queues to empty: `[59/60, retrying in 60.00s] Attempt to run 'spicerack.elasticsearch_cluster.ElasticsearchClusters.wait_for_all_write_queues_empty' raised: Write queue not empty (had value of 241631) for partition 0 of topic codfw.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.`
  • 03:38 ryankemper@cumin1001: END (FAIL) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=99)
  • 02:38 ryankemper: T274204 `sudo -i cookbook sre.elasticsearch.rolling-upgrade search_codfw "codfw cluster reboot" --task-id T274204 --nodes-per-run 3 --start-datetime 2021-03-24T02:29:39` on `ryankemper@cumin1001` tmux session `elasticsearch_rolling_upgrade_reboots`
  • 02:31 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-upgrade
  • 01:59 ryankemper: T274204 For now I'll proceed to the reboots of `codfw`
  • 01:59 ryankemper: T274204 `ctrl+c`'d out of run; relforge is relying on outdated config that is trying to talk to `relforge1002` which no longer exists. Need to refactor so that config no longer lives in spicerack
  • 01:58 ryankemper@cumin1001: END (ERROR) - Cookbook sre.elasticsearch.rolling-upgrade-reboot (exit_code=97)
  • 01:49 ryankemper: T274204 `sudo -i cookbook sre.elasticsearch.rolling-upgrade-reboot relforge "relforge cluster restarts" --task-id T274204 --nodes-per-run 3 --start-datetime 2021-03-24T01:45:59+00:00` on `ryankemper@cumin1001` tmux session `elasticsearch_rolling_upgrade_reboots`
  • 01:48 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-upgrade-reboot
  • 01:36 eileen: civicrm revision changed from f36a0b08f0 to ad430721f6, config revision is 26b02db7ba
  • 00:22 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2378.codfw.wmnet with reason: REIMAGE
  • 00:18 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2378.codfw.wmnet with reason: REIMAGE
  • 00:18 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2377.codfw.wmnet with reason: REIMAGE
  • 00:16 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2377.codfw.wmnet with reason: REIMAGE

2021-03-23

  • 22:59 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pki-root1001.eqiad.wmnet with reason: REIMAGE
  • 22:57 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on pki-root1001.eqiad.wmnet with reason: REIMAGE
  • 22:33 dwisehaupt: pushing 60f9baaf50b to fundraising hosts which will enable ssl by default for mysql client connections that use the host my.cnf file - T170321
  • 22:19 ebernhardson@deploy1002: Finished deploy [wikimedia/discovery/analytics@3fd7d7b]: partition ores dumps by namespace (duration: 02m 07s)
  • 22:17 ebernhardson@deploy1002: Started deploy [wikimedia/discovery/analytics@3fd7d7b]: partition ores dumps by namespace
  • 22:09 dzahn@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 22:05 dzahn@cumin1001: START - Cookbook sre.dns.netbox
  • 21:27 ppchelko@deploy1002: Finished deploy [restbase/deploy@531c474]: Add pageviews top-per-country endpoint (duration: 17m 58s)
  • 21:09 ppchelko@deploy1002: Started deploy [restbase/deploy@531c474]: Add pageviews top-per-country endpoint
  • 21:04 robh@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 21:00 robh@cumin1001: START - Cookbook sre.dns.netbox
  • 21:00 robh@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 20:41 eileen: civicrm revision changed from 39d24e8b0a to f36a0b08f0, config revision is 26b02db7ba
  • 20:24 robh@cumin1001: START - Cookbook sre.dns.netbox
  • 20:24 robh@cumin1001: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
  • 20:21 robh@cumin1001: START - Cookbook sre.dns.netbox
  • 20:13 robh@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts auth1002.eqiad.wmnet
  • 20:03 robh@cumin1001: START - Cookbook sre.hosts.decommission for hosts auth1002.eqiad.wmnet
  • 20:02 robh@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts auth1002.eqiad.wmnet
  • 20:01 robh@cumin1001: START - Cookbook sre.hosts.decommission for hosts auth1002.eqiad.wmnet
  • 19:51 jforrester@deploy1002: Finished deploy [integration/docroot@9de8c9d]: Add homer-public listing, added by volans (duration: 00m 08s)
  • 19:51 jforrester@deploy1002: Started deploy [integration/docroot@9de8c9d]: Add homer-public listing, added by volans
  • 18:45 otto@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Remove schema overrides for 6 finished EL migrations - T267347 T271164 T267351 T267348 T267343 T267353 (duration: 01m 07s)
  • 18:40 legoktm@deploy1002: Synchronized php-1.36.0-wmf.36/vendor/: Bump wikimedia/parsoid to 0.13.0-a29 (duration: 01m 16s)
  • 18:20 mbsantos@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
  • 18:18 mbsantos@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
  • 18:16 mbsantos@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
  • 18:10 legoktm@deploy1002: Synchronized wmf-config/ProductionServices.php: Add irc2001.wikimedia.org (running buster) as second irc server (T224579) (duration: 01m 08s)
  • 15:39 akosiaris@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
  • 15:39 akosiaris@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 15:38 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 15:38 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
  • 15:36 akosiaris@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
  • 15:36 akosiaris@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 15:32 moritzm: installing libsdl2 security updates
  • 15:31 akosiaris: pool echostore for eqiad (the first of the larger services traffic-wise)
  • 15:31 akosiaris@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=echostore
  • 15:25 Trey314159: reindexing Italian wikis on elastic@eqiad, elastic@codfw, and cloudelastic complete (T274200)
  • 15:10 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 15:10 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 15:10 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 14:53 akosiaris@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 14:53 akosiaris@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 14:53 akosiaris@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 14:46 akosiaris@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 14:46 akosiaris@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 14:46 akosiaris@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 14:43 akosiaris: pool more services in eqiad k8s. T277741. Only the very large ones traffic-wise are still on codfw
  • 14:43 akosiaris@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=recommendation-api
  • 14:43 akosiaris@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=push-notifications
  • 14:43 akosiaris@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=proton
  • 14:42 akosiaris@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=mobileapps
  • 14:42 akosiaris@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=mathoid
  • 14:42 akosiaris@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=linkrecommendation
  • 14:42 akosiaris@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=eventstreams-internal
  • 14:42 akosiaris@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=eventstreams
  • 14:20 akosiaris: pool a few more services in eqiad k8s. T277741
  • 14:19 akosiaris@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=wikifeeds
  • 14:19 akosiaris@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=termbox
  • 14:19 akosiaris@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=similar-users
  • 14:07 hashar@deploy1002: rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.36
  • 14:06 akosiaris: pool a few services in eqiad k8s. T277741
  • 14:05 akosiaris@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=cxserver
  • 14:05 akosiaris@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=citoid
  • 14:05 akosiaris@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=blubberoid
  • 14:05 akosiaris@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=api-gateway
  • 14:05 akosiaris@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=apertium
  • 14:05 moritzm: installing pygments security updates on stretch
  • 14:04 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2008.codfw.wmnet
  • 13:59 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti2008.codfw.wmnet
  • 13:55 hashar@deploy1002: Finished scap: Promote testwikis from 1.36.0-wmf.35 to 1.36.0-wmf.36 - T274940 (duration: 31m 57s)
  • 13:54 elukey: sudo systemctl reload apache2 on prometheus[12]00[34] to pick up new k8s-mlserve instance settings
  • 13:28 moritzm: drain ganeti2008
  • 13:25 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2018.codfw.wmnet
  • 13:23 hashar@deploy1002: Started scap: Promote testwikis from 1.36.0-wmf.35 to 1.36.0-wmf.36 - T274940
  • 13:17 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti2018.codfw.wmnet
  • 13:15 ema: cp3054: install varnishkafka built explicitly against varnish 6.0.1-1wm2 to fix broken dpkg status T264398
  • 13:05 marostegui@cumin1001: dbctl commit (dc=all): 'db1086 (re)pooling @ 100%: Slowly repool db1086 after cloning db1181', diff saved to https://phabricator.wikimedia.org/P15054 and previous config saved to /var/cache/conftool/dbconfig/20210323-130543-root.json
  • 13:01 marostegui@cumin1001: dbctl commit (dc=all): 'db1148 (re)pooling @ 100%: Slowly repool db1148 after schema change', diff saved to https://phabricator.wikimedia.org/P15053 and previous config saved to /var/cache/conftool/dbconfig/20210323-130153-root.json
  • 12:58 moritzm: drain ganeti2018
  • 12:58 akosiaris: remove and decommission argon, chroline, acrab, acrux T277741, T277191
  • 12:51 marostegui@cumin1001: dbctl commit (dc=all): 'db1165 (re)pooling @ 100%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15052 and previous config saved to /var/cache/conftool/dbconfig/20210323-125155-root.json
  • 12:50 marostegui@cumin1001: dbctl commit (dc=all): 'db1086 (re)pooling @ 75%: Slowly repool db1086 after cloning db1181', diff saved to https://phabricator.wikimedia.org/P15051 and previous config saved to /var/cache/conftool/dbconfig/20210323-125039-root.json
  • 12:50 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2017.codfw.wmnet
  • 12:46 marostegui@cumin1001: dbctl commit (dc=all): 'db1148 (re)pooling @ 75%: Slowly repool db1148 after schema change', diff saved to https://phabricator.wikimedia.org/P15050 and previous config saved to /var/cache/conftool/dbconfig/20210323-124650-root.json
  • 12:42 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti2017.codfw.wmnet
  • 12:36 marostegui@cumin1001: dbctl commit (dc=all): 'db1165 (re)pooling @ 85%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15049 and previous config saved to /var/cache/conftool/dbconfig/20210323-123651-root.json
  • 12:35 marostegui@cumin1001: dbctl commit (dc=all): 'db1086 (re)pooling @ 50%: Slowly repool db1086 after cloning db1181', diff saved to https://phabricator.wikimedia.org/P15048 and previous config saved to /var/cache/conftool/dbconfig/20210323-123535-root.json
  • 12:31 marostegui@cumin1001: dbctl commit (dc=all): 'db1148 (re)pooling @ 50%: Slowly repool db1148 after schema change', diff saved to https://phabricator.wikimedia.org/P15047 and previous config saved to /var/cache/conftool/dbconfig/20210323-123146-root.json
  • 12:27 moritzm: drain ganeti2017
  • 12:25 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2016.codfw.wmnet
  • 12:21 marostegui@cumin1001: dbctl commit (dc=all): 'db1165 (re)pooling @ 75%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15046 and previous config saved to /var/cache/conftool/dbconfig/20210323-122148-root.json
  • 12:20 marostegui@cumin1001: dbctl commit (dc=all): 'db1086 (re)pooling @ 25%: Slowly repool db1086 after cloning db1181', diff saved to https://phabricator.wikimedia.org/P15045 and previous config saved to /var/cache/conftool/dbconfig/20210323-122032-root.json
  • 12:17 akosiaris: remove all scheduled downtimes for the k8s cluster. T277741
  • 12:17 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti2016.codfw.wmnet
  • 12:16 marostegui@cumin1001: dbctl commit (dc=all): 'db1148 (re)pooling @ 25%: Slowly repool db1148 after schema change', diff saved to https://phabricator.wikimedia.org/P15044 and previous config saved to /var/cache/conftool/dbconfig/20210323-121642-root.json
  • 12:09 moritzm: drain ganeti2016
  • 12:06 marostegui@cumin1001: dbctl commit (dc=all): 'db1165 (re)pooling @ 60%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15043 and previous config saved to /var/cache/conftool/dbconfig/20210323-120644-root.json
  • 11:55 moritzm: installing libcaca security updates
  • 11:51 marostegui@cumin1001: dbctl commit (dc=all): 'db1165 (re)pooling @ 50%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15042 and previous config saved to /var/cache/conftool/dbconfig/20210323-115141-root.json
  • 11:38 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on aqs[1012-1015].eqiad.wmnet with reason: New buster hosts, not in use
  • 11:38 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on aqs[1012-1015].eqiad.wmnet with reason: New buster hosts, not in use
  • 11:36 marostegui@cumin1001: dbctl commit (dc=all): 'db1165 (re)pooling @ 35%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15041 and previous config saved to /var/cache/conftool/dbconfig/20210323-113637-root.json
  • 11:31 Lucas_WMDE: EU backport&config window done
  • 11:30 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Enable DiscussionTools' beta features on dewiki (T276494) (duration: 00m 58s)
  • 11:21 marostegui@cumin1001: dbctl commit (dc=all): 'db1165 (re)pooling @ 25%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15040 and previous config saved to /var/cache/conftool/dbconfig/20210323-112133-root.json
  • 11:06 marostegui@cumin1001: dbctl commit (dc=all): 'db1165 (re)pooling @ 20%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15039 and previous config saved to /var/cache/conftool/dbconfig/20210323-110630-root.json
  • 11:05 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1148', diff saved to https://phabricator.wikimedia.org/P15038 and previous config saved to /var/cache/conftool/dbconfig/20210323-110553-marostegui.json
  • 11:03 marostegui@cumin1001: dbctl commit (dc=all): 'db1147 (re)pooling @ 100%: Slowly repool db1147 after schema change', diff saved to https://phabricator.wikimedia.org/P15037 and previous config saved to /var/cache/conftool/dbconfig/20210323-110347-root.json
  • 11:01 moritzm: installing tomcat8 security updates
  • 10:56 jayme: all services re-deployed to k8s eqiad - T277741
  • 10:51 marostegui@cumin1001: dbctl commit (dc=all): 'db1165 (re)pooling @ 15%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15036 and previous config saved to /var/cache/conftool/dbconfig/20210323-105126-root.json
  • 10:48 marostegui@cumin1001: dbctl commit (dc=all): 'db1147 (re)pooling @ 75%: Slowly repool db1147 after schema change', diff saved to https://phabricator.wikimedia.org/P15035 and previous config saved to /var/cache/conftool/dbconfig/20210323-104843-root.json
  • 10:46 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'zotero' for release 'staging' .
  • 10:46 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'zotero' for release 'production' .
  • 10:45 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
  • 10:45 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
  • 10:45 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'test' .
  • 10:45 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'production' .
  • 10:45 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'staging' .
  • 10:44 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'similar-users' for release 'main' .
  • 10:44 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'sessionstore' for release 'staging' .
  • 10:44 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'sessionstore' for release 'production' .
  • 10:43 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'recommendation-api' for release 'production' .
  • 10:42 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'push-notifications' for release 'canary' .
  • 10:42 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'push-notifications' for release 'main' .
  • 10:41 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' .
  • 10:39 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
  • 10:39 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
  • 10:37 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mathoid' for release 'staging' .
  • 10:37 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mathoid' for release 'production' .
  • 10:36 marostegui@cumin1001: dbctl commit (dc=all): 'db1165 (re)pooling @ 10%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15034 and previous config saved to /var/cache/conftool/dbconfig/20210323-103623-root.json
  • 10:33 marostegui@cumin1001: dbctl commit (dc=all): 'db1147 (re)pooling @ 50%: Slowly repool db1147 after schema change', diff saved to https://phabricator.wikimedia.org/P15033 and previous config saved to /var/cache/conftool/dbconfig/20210323-103340-root.json
  • 10:32 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 10:32 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 10:32 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 10:32 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams-internal' for release 'canary' .
  • 10:32 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' .
  • 10:31 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
  • 10:31 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
  • 10:29 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 10:29 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
  • 10:29 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 10:28 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
  • 10:28 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 10:28 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 10:27 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
  • 10:27 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' .
  • 10:26 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'echostore' for release 'staging' .
  • 10:26 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'echostore' for release 'production' .
  • 10:25 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 10:25 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
  • 10:24 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' .
  • 10:23 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'staging' .
  • 10:23 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 10:23 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 10:22 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 10:22 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 10:22 jayme@cumin1001: conftool action : set/pooled=yes; selector: dc=eqiad,service=kubesvc
  • 10:22 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
  • 10:22 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
  • 10:21 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
  • 10:21 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
  • 10:21 marostegui@cumin1001: dbctl commit (dc=all): 'db1165 (re)pooling @ 5%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15031 and previous config saved to /var/cache/conftool/dbconfig/20210323-102119-root.json
  • 10:20 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'apertium' for release 'staging' .
  • 10:20 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'apertium' for release 'production' .
  • 10:20 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'similar-users' for release 'main' .
  • 10:20 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'recommendation-api' for release 'production' .
  • 10:20 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' .
  • 10:20 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
  • 10:20 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
  • 10:20 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
  • 10:19 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 10:19 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 10:19 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
  • 10:19 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 10:19 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 10:19 hashar@deploy1002: Pruned MediaWiki: 1.36.0-wmf.33 (duration: 01m 48s)
  • 10:18 marostegui@cumin1001: dbctl commit (dc=all): 'db1147 (re)pooling @ 25%: Slowly repool db1147 after schema change', diff saved to https://phabricator.wikimedia.org/P15030 and previous config saved to /var/cache/conftool/dbconfig/20210323-101836-root.json
  • 10:16 hashar@deploy1002: Pruned MediaWiki: 1.36.0-wmf.32 (duration: 14m 47s)
  • 10:10 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes1005.eqiad.wmnet
  • 10:02 hashar: scap clean --delete 1.36.0-wmf.32 # T274940
  • 10:01 hashar: Applied security patches for 1.36.0-wmf.36 # T274940
  • 09:57 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes1006.eqiad.wmnet
  • 09:56 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes1015.eqiad.wmnet
  • 09:54 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host kubernetes1006.eqiad.wmnet
  • 09:54 marostegui@cumin1001: dbctl commit (dc=all): 'Pool db1165 into s6 with minimal weight T258361', diff saved to https://phabricator.wikimedia.org/P15029 and previous config saved to /var/cache/conftool/dbconfig/20210323-095437-marostegui.json
  • 09:54 akosiaris@deploy1002: helmfile [eqiad] DONE helmfile.d/admin 'sync'.
  • 09:54 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes1016.eqiad.wmnet
  • 09:53 akosiaris: deploy helmfile.d/admin_ng for eqiad T277741
  • 09:53 hashar: scap prep 1.36.0-wmf.36 # T274940
  • 09:53 akosiaris@deploy1002: helmfile [eqiad] START helmfile.d/admin 'sync'.
  • 09:53 jayme@cumin1001: conftool action : set/pooled=yes; selector: dc=codfw,service=kubesvc,name=kubernetes2017.codfw.wmnet
  • 09:53 jayme@cumin1001: conftool action : set/weight=10; selector: dc=codfw,service=kubesvc,name=kubernetes2017.codfw.wmnet
  • 09:51 akosiaris@deploy1002: helmfile [eqiad] DONE helmfile.d/admin 'sync'.
  • 09:50 jayme@cumin1001: conftool action : set/pooled=yes; selector: dc=eqiad,service=kubesvc,name=kubernetes1017.eqiad.wmnet
  • 09:50 jayme@cumin1001: conftool action : set/weight=10; selector: dc=eqiad,service=kubesvc,name=kubernetes1017.eqiad.wmnet
  • 09:49 akosiaris@deploy1002: helmfile [eqiad] START helmfile.d/admin 'sync'.
  • 09:46 akosiaris@deploy1002: helmfile [eqiad] DONE helmfile.d/admin 'sync'.
  • 09:46 akosiaris@deploy1002: helmfile [eqiad] START helmfile.d/admin 'sync'.
  • 09:45 akosiaris@deploy1002: helmfile [eqiad] DONE helmfile.d/admin 'sync'.
  • 09:45 akosiaris@deploy1002: helmfile [eqiad] START helmfile.d/admin 'sync'.
  • 09:45 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1014.eqiad.wmnet with reason: REIMAGE
  • 09:44 akosiaris@deploy1002: helmfile [eqiad] DONE helmfile.d/admin 'sync'.
  • 09:44 akosiaris@deploy1002: helmfile [eqiad] START helmfile.d/admin 'sync'.
  • 09:43 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1014.eqiad.wmnet with reason: REIMAGE
  • 09:43 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1013.eqiad.wmnet with reason: REIMAGE
  • 09:42 marostegui@cumin1001: dbctl commit (dc=all): 'Pool db1165 into s6 with minimal weight T258361', diff saved to https://phabricator.wikimedia.org/P15028 and previous config saved to /var/cache/conftool/dbconfig/20210323-094257-marostegui.json
  • 09:41 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1012.eqiad.wmnet with reason: REIMAGE
  • 09:41 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host kubernetes1016.eqiad.wmnet
  • 09:41 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1013.eqiad.wmnet with reason: REIMAGE
  • 09:40 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host kubernetes1015.eqiad.wmnet
  • 09:40 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host kubernetes1005.eqiad.wmnet
  • 09:39 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1011.eqiad.wmnet with reason: REIMAGE
  • 09:38 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1012.eqiad.wmnet with reason: REIMAGE
  • 09:37 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1010.eqiad.wmnet with reason: REIMAGE
  • 09:36 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1011.eqiad.wmnet with reason: REIMAGE
  • 09:36 jayme@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1004.eqiad.wmnet with reason: REIMAGE
  • 09:35 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1009.eqiad.wmnet with reason: REIMAGE
  • 09:34 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1010.eqiad.wmnet with reason: REIMAGE
  • 09:33 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes1017.eqiad.wmnet
  • 09:33 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1008.eqiad.wmnet with reason: REIMAGE
  • 09:32 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1009.eqiad.wmnet with reason: REIMAGE
  • 09:32 marostegui@cumin1001: dbctl commit (dc=all): 'Add db1165 to dbctl, depooled - T258361', diff saved to https://phabricator.wikimedia.org/P15027 and previous config saved to /var/cache/conftool/dbconfig/20210323-093246-marostegui.json
  • 09:31 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1007.eqiad.wmnet with reason: REIMAGE
  • 09:31 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1004.eqiad.wmnet with reason: REIMAGE
  • 09:30 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1008.eqiad.wmnet with reason: REIMAGE
  • 09:29 jayme@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1003.eqiad.wmnet with reason: REIMAGE
  • 09:29 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1002.eqiad.wmnet with reason: REIMAGE
  • 09:28 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1007.eqiad.wmnet with reason: REIMAGE
  • 09:28 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1003.eqiad.wmnet with reason: REIMAGE
  • 09:27 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1002.eqiad.wmnet with reason: REIMAGE
  • 09:26 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1001.eqiad.wmnet with reason: REIMAGE
  • 09:26 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1086 to clone db1181 T275633', diff saved to https://phabricator.wikimedia.org/P15025 and previous config saved to /var/cache/conftool/dbconfig/20210323-092600-marostegui.json
  • 09:24 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1001.eqiad.wmnet with reason: REIMAGE
  • 09:18 akosiaris@cumin1001: conftool action : set/pooled=true; selector: dc=eqiad,cluster=kubernetes,name=kubernetes1017.eqiad.wmnet
  • 09:17 akosiaris@cumin1001: conftool action : set/pooled=yes; selector: dc=eqiad,service=kubemaster,cluster=kubernetes
  • 09:17 akosiaris@cumin1001: conftool action : set/weight=10; selector: dc=eqiad,service=kubemaster,cluster=kubernetes
  • 09:16 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host kubernetes1017.eqiad.wmnet
  • 09:14 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1147', diff saved to https://phabricator.wikimedia.org/P15024 and previous config saved to /var/cache/conftool/dbconfig/20210323-091432-marostegui.json
  • 09:05 akosiaris: reboot kubetcd100[456] for kernel upgrades. T277741 T273278
  • 09:04 akosiaris: empty etcd T277741
  • 08:43 akosiaris: poweroff argon and chlorine T277741
  • 08:41 marostegui@cumin1001: dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 100%: Slowly repool db1098:3317', diff saved to https://phabricator.wikimedia.org/P15023 and previous config saved to /var/cache/conftool/dbconfig/20210323-083957-root.json
  • 08:41 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=zotero
  • 08:41 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=wikifeeds
  • 08:41 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=termbox
  • 08:41 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=similar-users
  • 08:41 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=sessionstore
  • 08:40 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=recommendation-api
  • 08:40 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=push-notifications
  • 08:40 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=proton
  • 08:40 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=mobileapps
  • 08:40 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=mathoid
  • 08:40 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=linkrecommendation
  • 08:40 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventstreams-internal
  • 08:40 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventstreams
  • 08:40 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventgate-main
  • 08:39 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventgate-logging-external
  • 08:39 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventgate-analytics-external
  • 08:39 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventgate-analytics
  • 08:39 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=echostore
  • 08:39 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=cxserver
  • 08:39 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=citoid
  • 08:39 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=blubberoid
  • 08:39 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=api-gateway
  • 08:39 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=apertium
  • 08:33 akosiaris: eqiad services in k8s depooled. T277741
  • 08:33 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=wikifeeds
  • 08:33 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=termbox
  • 08:33 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=similar-users
  • 08:33 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=sessionstore
  • 08:33 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=recommendation-api
  • 08:33 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=push-notifications
  • 08:32 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=proton
  • 08:32 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=mobileapps
  • 08:32 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=mathoid
  • 08:32 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=linkrecommendation
  • 08:32 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventstreams-internal
  • 08:32 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventstreams
  • 08:32 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventgate-main
  • 08:32 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventgate-logging-external
  • 08:32 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventgate-analytics-external
  • 08:32 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventgate-analytics
  • 08:32 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=echostore
  • 08:31 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=cxserver
  • 08:31 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=citoid
  • 08:31 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=blubberoid
  • 08:31 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=api-gateway
  • 08:31 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=apertium
  • 08:28 akosiaris: downtime all services in T277741 for 24H
  • 08:25 akosiaris: beginning the k8s upgrade/reinit process. T277741
  • 08:24 marostegui@cumin1001: dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 75%: Slowly repool db1098:3317', diff saved to https://phabricator.wikimedia.org/P15022 and previous config saved to /var/cache/conftool/dbconfig/20210323-082454-root.json
  • 08:24 moritzm: installing mariadb-10.3 updates on buster (just client-side libs/tools, unrelated to the main wmf-mariadb packages)
  • 08:24 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 18 hosts with reason: Reinitialize eqiad k8s cluster with new etcd
  • 08:24 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 18 hosts with reason: Reinitialize eqiad k8s cluster with new etcd
  • 08:22 marostegui@cumin1001: dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 100%: Slowly repool db1146:3314 after schema change', diff saved to https://phabricator.wikimedia.org/P15021 and previous config saved to /var/cache/conftool/dbconfig/20210323-082213-root.json
  • 08:09 marostegui@cumin1001: dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 50%: Slowly repool db1098:3317', diff saved to https://phabricator.wikimedia.org/P15020 and previous config saved to /var/cache/conftool/dbconfig/20210323-080949-root.json
  • 08:07 marostegui@cumin1001: dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 75%: Slowly repool db1146:3314 after schema change', diff saved to https://phabricator.wikimedia.org/P15019 and previous config saved to /var/cache/conftool/dbconfig/20210323-080709-root.json
  • 07:54 marostegui@cumin1001: dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 25%: Slowly repool db1098:3317', diff saved to https://phabricator.wikimedia.org/P15017 and previous config saved to /var/cache/conftool/dbconfig/20210323-075445-root.json
  • 07:52 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1098:3317 to enable report_host T266483', diff saved to https://phabricator.wikimedia.org/P15016 and previous config saved to /var/cache/conftool/dbconfig/20210323-075253-marostegui.json
  • 07:52 marostegui@cumin1001: dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 100%: Slowly repool db1101:3318', diff saved to https://phabricator.wikimedia.org/P15015 and previous config saved to /var/cache/conftool/dbconfig/20210323-075230-root.json
  • 07:52 marostegui@cumin1001: dbctl commit (dc=all): 'db1101:3317 (re)pooling @ 100%: Slowly repool db1101:3317', diff saved to https://phabricator.wikimedia.org/P15014 and previous config saved to /var/cache/conftool/dbconfig/20210323-075216-root.json
  • 07:52 marostegui@cumin1001: dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 50%: Slowly repool db1146:3314 after schema change', diff saved to https://phabricator.wikimedia.org/P15013 and previous config saved to /var/cache/conftool/dbconfig/20210323-075206-root.json
  • 07:37 marostegui@cumin1001: dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 75%: Slowly repool db1101:3318', diff saved to https://phabricator.wikimedia.org/P15012 and previous config saved to /var/cache/conftool/dbconfig/20210323-073726-root.json
  • 07:37 marostegui@cumin1001: dbctl commit (dc=all): 'db1101:3317 (re)pooling @ 75%: Slowly repool db1101:3317', diff saved to https://phabricator.wikimedia.org/P15011 and previous config saved to /var/cache/conftool/dbconfig/20210323-073713-root.json
  • 07:37 marostegui@cumin1001: dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 25%: Slowly repool db1146:3314 after schema change', diff saved to https://phabricator.wikimedia.org/P15010 and previous config saved to /var/cache/conftool/dbconfig/20210323-073702-root.json
  • 07:36 elukey: create a 50G LVM volume on prometheus[12]00[34] for the k8s-mlserve cluster - T272918
  • 07:34 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1181.eqiad.wmnet with reason: REIMAGE
  • 07:32 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1181.eqiad.wmnet with reason: REIMAGE
  • 07:23 marostegui@cumin1001: dbctl commit (dc=all): 'db1086 (re)pooling @ 100%: Slowly repool db1086 after removing it from master', diff saved to https://phabricator.wikimedia.org/P15009 and previous config saved to /var/cache/conftool/dbconfig/20210323-072352-root.json
  • 07:22 marostegui@cumin1001: dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 50%: Slowly repool db1101:3318', diff saved to https://phabricator.wikimedia.org/P15008 and previous config saved to /var/cache/conftool/dbconfig/20210323-072223-root.json
  • 07:22 marostegui@cumin1001: dbctl commit (dc=all): 'db1101:3317 (re)pooling @ 50%: Slowly repool db1101:3317', diff saved to https://phabricator.wikimedia.org/P15007 and previous config saved to /var/cache/conftool/dbconfig/20210323-072209-root.json
  • 07:08 marostegui@cumin1001: dbctl commit (dc=all): 'db1086 (re)pooling @ 75%: Slowly repool db1086 after removing it from master', diff saved to https://phabricator.wikimedia.org/P15006 and previous config saved to /var/cache/conftool/dbconfig/20210323-070849-root.json
  • 07:07 marostegui@cumin1001: dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 25%: Slowly repool db1101:3318', diff saved to https://phabricator.wikimedia.org/P15005 and previous config saved to /var/cache/conftool/dbconfig/20210323-070719-root.json
  • 07:07 marostegui@cumin1001: dbctl commit (dc=all): 'db1101:3317 (re)pooling @ 25%: Slowly repool db1101:3317', diff saved to https://phabricator.wikimedia.org/P15004 and previous config saved to /var/cache/conftool/dbconfig/20210323-070705-root.json
  • 07:02 marostegui: Upgrade kernel on db1101
  • 06:59 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1101:3318 to enable report_host T266483', diff saved to https://phabricator.wikimedia.org/P15003 and previous config saved to /var/cache/conftool/dbconfig/20210323-065947-marostegui.json
  • 06:58 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1101:3317 to enable report_host T266483', diff saved to https://phabricator.wikimedia.org/P15002 and previous config saved to /var/cache/conftool/dbconfig/20210323-065836-marostegui.json
  • 06:53 marostegui@cumin1001: dbctl commit (dc=all): 'db1086 (re)pooling @ 50%: Slowly repool db1086 after removing it from master', diff saved to https://phabricator.wikimedia.org/P15001 and previous config saved to /var/cache/conftool/dbconfig/20210323-065345-root.json
  • 06:38 marostegui@cumin1001: dbctl commit (dc=all): 'db1086 (re)pooling @ 25%: Slowly repool db1086 after removing it from master', diff saved to https://phabricator.wikimedia.org/P15000 and previous config saved to /var/cache/conftool/dbconfig/20210323-063842-root.json
  • 06:29 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1146:3314', diff saved to https://phabricator.wikimedia.org/P14999 and previous config saved to /var/cache/conftool/dbconfig/20210323-062942-marostegui.json
  • 06:23 marostegui@cumin1001: dbctl commit (dc=all): 'db1086 (re)pooling @ 10%: Slowly repool db1086 after removing it from master', diff saved to https://phabricator.wikimedia.org/P14998 and previous config saved to /var/cache/conftool/dbconfig/20210323-062338-root.json
  • 06:20 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1086', diff saved to https://phabricator.wikimedia.org/P14997 and previous config saved to /var/cache/conftool/dbconfig/20210323-062059-marostegui.json
  • 06:20 marostegui: Upgrade kernel on db1086
  • 06:07 marostegui@cumin1001: dbctl commit (dc=all): 'db1086 (re)pooling @ 25%: Slowly repool db1086 after removing it from master', diff saved to https://phabricator.wikimedia.org/P14996 and previous config saved to /var/cache/conftool/dbconfig/20210323-060701-root.json
  • 06:02 marostegui@cumin1001: dbctl commit (dc=all): 'Promote db1136 to s7 master and remove read-only from s7 T274336', diff saved to https://phabricator.wikimedia.org/P14995 and previous config saved to /var/cache/conftool/dbconfig/20210323-060216-marostegui.json
  • 06:01 marostegui@cumin1001: dbctl commit (dc=all): 'Set s7 as read-only for maintenance T274336', diff saved to https://phabricator.wikimedia.org/P14994 and previous config saved to /var/cache/conftool/dbconfig/20210323-060104-marostegui.json
  • 06:00 marostegui: Starting s7 eqiad failover from db1086 to db1136 - T274336
  • 05:13 marostegui@cumin1001: dbctl commit (dc=all): 'Add db1174 to api T274336', diff saved to https://phabricator.wikimedia.org/P14993 and previous config saved to /var/cache/conftool/dbconfig/20210323-051346-marostegui.json
  • 05:12 marostegui@cumin1001: dbctl commit (dc=all): 'Set weight 0 to db1136 before failover T274336', diff saved to https://phabricator.wikimedia.org/P14992 and previous config saved to /var/cache/conftool/dbconfig/20210323-051210-marostegui.json
  • 00:07 tstarling@deploy1002: Synchronized wmf-config: use RequestTimeout library step 3: clean up (duration: 00m 58s)
  • 00:06 tstarling@deploy1002: Synchronized wmf-config/CommonSettings.php: use RequestTimeout library step 2: enable new system (duration: 00m 57s)
  • 00:04 tstarling@deploy1002: Synchronized wmf-config/PhpAutoPrepend.php: use RequestTimeout library step 1: disable old request timeout system (duration: 00m 58s)

2021-03-22

  • 23:52 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw1002.eqiad.wmnet with reason: REIMAGE
  • 23:49 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw1002.eqiad.wmnet with reason: REIMAGE
  • 23:34 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2250.codfw.wmnet
  • 23:21 pt1979@cumin2001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 23:18 ebernhardson@deploy1002: Synchronized php-1.36.0-wmf.35/extensions/WikimediaEvents/modules/ext.wikimediaEvents/searchSatisfaction.js: T262612: Start glent m1 ab test (duration: 01m 53s)
  • 23:18 pt1979@cumin2001: START - Cookbook sre.dns.netbox
  • 23:08 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2250.codfw.wmnet
  • 23:01 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2249.codfw.wmnet
  • 22:52 mutante: decom mw2249
  • 22:44 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2249.codfw.wmnet
  • 21:08 sbassett: Deployed security patch for T272244
  • 20:02 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw2279.codfw.wmnet,service=canary
  • 20:02 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw2278.codfw.wmnet,service=canary
  • 20:02 dzahn@cumin1001: conftool action : set/weight=1; selector: name=mw2279.codfw.wmnet,service=canary
  • 20:02 dzahn@cumin1001: conftool action : set/weight=1; selector: name=mw2278.codfw.wmnet,service=canary
  • 19:50 mutante: gerrit2001 - restarted apache2 as well for consistency
  • 19:47 mutante: gerrit - restarting apache2 after we dropped the MaxClients config line. This should make us fall back to the Debian default MaxRequestWorkers (since we use the event MPM, we should not be using MaxClients in the first place, says #httpd) (T277127)
  • 18:20 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 25247c9: hrwiki: Configure mentorship for Growth team features (T275684) (duration: 01m 00s)
  • 18:13 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 951601f: Grant enwiki pagemovers the delete-redirect right (T278131) (duration: 00m 59s)
  • 17:30 Trey314159: reindexing Italian wikis on elastic@eqiad, elastic@codfw, and cloudelastic (T274200)
  • 16:49 jayme@deploy1002: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
  • 16:48 jayme@deploy1002: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
  • 16:47 jayme@deploy1002: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
  • 16:46 jayme@deploy1002: helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
  • 16:37 jayme@deploy1002: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
  • 16:37 jayme@deploy1002: helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
  • 16:12 pt1979@cumin2001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 16:07 pt1979@cumin2001: START - Cookbook sre.dns.netbox
  • 15:58 marostegui@cumin1001: dbctl commit (dc=all): 'db1144:3314 (re)pooling @ 100%: Slowly repool db1144:3314', diff saved to https://phabricator.wikimedia.org/P14990 and previous config saved to /var/cache/conftool/dbconfig/20210322-155808-root.json
  • 15:43 marostegui@cumin1001: dbctl commit (dc=all): 'db1144:3314 (re)pooling @ 75%: Slowly repool db1144:3314', diff saved to https://phabricator.wikimedia.org/P14989 and previous config saved to /var/cache/conftool/dbconfig/20210322-154304-root.json
  • 15:38 pt1979@cumin2001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 15:33 pt1979@cumin2001: START - Cookbook sre.dns.netbox
  • 15:28 marostegui@cumin1001: dbctl commit (dc=all): 'db1144:3314 (re)pooling @ 50%: Slowly repool db1144:3314', diff saved to https://phabricator.wikimedia.org/P14988 and previous config saved to /var/cache/conftool/dbconfig/20210322-152800-root.json
  • 15:12 marostegui@cumin1001: dbctl commit (dc=all): 'db1144:3314 (re)pooling @ 25%: Slowly repool db1144:3314', diff saved to https://phabricator.wikimedia.org/P14987 and previous config saved to /var/cache/conftool/dbconfig/20210322-151257-root.json
  • 14:26 jayme@deploy1002: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
  • 14:25 jayme@deploy1002: helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
  • 14:23 jayme@deploy1002: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 14:22 jayme@deploy1002: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 14:14 elukey@deploy1002: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
  • 14:14 elukey@deploy1002: helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
  • 14:11 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1144:3314 for schema change', diff saved to https://phabricator.wikimedia.org/P14986 and previous config saved to /var/cache/conftool/dbconfig/20210322-141146-marostegui.json
  • 14:08 marostegui@cumin1001: dbctl commit (dc=all): 'db1143 (re)pooling @ 100%: Slowly repool db1143', diff saved to https://phabricator.wikimedia.org/P14985 and previous config saved to /var/cache/conftool/dbconfig/20210322-140800-root.json
  • 14:07 XioNoX: rename cloud-hosts1-b-eqiad to cloud-hosts1-eqiad - T277771
  • 13:52 marostegui@cumin1001: dbctl commit (dc=all): 'db1143 (re)pooling @ 75%: Slowly repool db1143', diff saved to https://phabricator.wikimedia.org/P14984 and previous config saved to /var/cache/conftool/dbconfig/20210322-135256-root.json
  • 13:37 marostegui@cumin1001: dbctl commit (dc=all): 'db1143 (re)pooling @ 50%: Slowly repool db1143', diff saved to https://phabricator.wikimedia.org/P14983 and previous config saved to /var/cache/conftool/dbconfig/20210322-133753-root.json
  • 13:26 kharlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 13:26 kharlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 13:22 marostegui@cumin1001: dbctl commit (dc=all): 'db1143 (re)pooling @ 25%: Slowly repool db1143', diff saved to https://phabricator.wikimedia.org/P14982 and previous config saved to /var/cache/conftool/dbconfig/20210322-132249-root.json
  • 13:20 kharlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 13:20 kharlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 13:16 kharlan@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 12:28 jayme@deploy1002: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 12:27 jayme@deploy1002: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 12:20 jayme@deploy1002: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 12:19 jayme@deploy1002: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 12:19 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1143 for schema change', diff saved to https://phabricator.wikimedia.org/P14981 and previous config saved to /var/cache/conftool/dbconfig/20210322-121924-marostegui.json
  • 11:29 marostegui@cumin1001: dbctl commit (dc=all): 'db1085 (re)pooling @ 100%: Slowly repool db1085', diff saved to https://phabricator.wikimedia.org/P14980 and previous config saved to /var/cache/conftool/dbconfig/20210322-112954-root.json
  • 11:27 marostegui@cumin1001: dbctl commit (dc=all): 'db1142 (re)pooling @ 100%: Slowly repool db1142', diff saved to https://phabricator.wikimedia.org/P14979 and previous config saved to /var/cache/conftool/dbconfig/20210322-112707-root.json
  • 11:15 elukey@deploy1002: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
  • 11:15 elukey@deploy1002: helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
  • 11:15 elukey@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
  • 11:14 marostegui@cumin1001: dbctl commit (dc=all): 'db1085 (re)pooling @ 75%: Slowly repool db1085', diff saved to https://phabricator.wikimedia.org/P14978 and previous config saved to /var/cache/conftool/dbconfig/20210322-111451-root.json
  • 11:14 elukey@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
  • 11:12 marostegui@cumin1001: dbctl commit (dc=all): 'db1142 (re)pooling @ 75%: Slowly repool db1142', diff saved to https://phabricator.wikimedia.org/P14977 and previous config saved to /var/cache/conftool/dbconfig/20210322-111203-root.json
  • 11:09 hnowlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
  • 11:09 hnowlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
  • 10:59 marostegui@cumin1001: dbctl commit (dc=all): 'db1085 (re)pooling @ 50%: Slowly repool db1085', diff saved to https://phabricator.wikimedia.org/P14976 and previous config saved to /var/cache/conftool/dbconfig/20210322-105947-root.json
  • 10:57 marostegui@cumin1001: dbctl commit (dc=all): 'db1142 (re)pooling @ 50%: Slowly repool db1142', diff saved to https://phabricator.wikimedia.org/P14975 and previous config saved to /var/cache/conftool/dbconfig/20210322-105700-root.json
  • 10:53 akosiaris@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
  • 10:53 akosiaris@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 10:51 moritzm: installing libdbi-perl security updates
  • 10:48 hnowlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
  • 10:48 hnowlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
  • 10:48 elukey@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
  • 10:48 elukey@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
  • 10:47 elukey@deploy1002: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
  • 10:47 elukey@deploy1002: helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
  • 10:44 marostegui@cumin1001: dbctl commit (dc=all): 'db1085 (re)pooling @ 25%: Slowly repool db1085', diff saved to https://phabricator.wikimedia.org/P14974 and previous config saved to /var/cache/conftool/dbconfig/20210322-104443-root.json
  • 10:42 marostegui@cumin1001: dbctl commit (dc=all): 'db1142 (re)pooling @ 25%: Slowly repool db1142', diff saved to https://phabricator.wikimedia.org/P14973 and previous config saved to /var/cache/conftool/dbconfig/20210322-104156-root.json
  • 10:42 hnowlan@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
  • 10:41 hnowlan@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
  • 10:41 jdrewniak@deploy1002: Synchronized portals: Wikimedia Portals Update: Bumping portals to master (T128546) (duration: 00m 58s)
  • 10:40 jdrewniak@deploy1002: Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: Bumping portals to master (T128546) (duration: 00m 58s)
  • 10:34 elukey@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
  • 10:33 elukey@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
  • 10:32 elukey@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
  • 10:32 elukey@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
  • 10:27 elukey@deploy1002: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
  • 10:26 elukey@deploy1002: helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
  • 10:26 elukey@deploy1002: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
  • 10:25 elukey@deploy1002: helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
  • 10:21 jayme@deploy1002: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
  • 10:21 jayme@deploy1002: helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
  • 10:17 elukey@deploy1002: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
  • 10:17 elukey@deploy1002: helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
  • 10:15 elukey@deploy1002: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
  • 10:15 elukey@deploy1002: helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
  • 10:12 elukey: run homer for cr1/cr2 eqiad and codfw to add new iBGP session for the k8s ML clusters - https://gerrit.wikimedia.org/r/c/operations/homer/public/+/661055
  • 09:50 reedy@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config cleanup (duration: 00m 57s)
  • 09:49 reedy@deploy1002: Synchronized wmf-config/InitialiseSettings-labs.php: Config cleanup (duration: 00m 59s)
  • 09:48 reedy@deploy1002: Synchronized wmf-config/CommonSettings-labs.php: Config cleanup (duration: 01m 20s)
  • 09:36 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1142 for schema change', diff saved to https://phabricator.wikimedia.org/P14971 and previous config saved to /var/cache/conftool/dbconfig/20210322-093558-marostegui.json
  • 09:15 marostegui@cumin1001: dbctl commit (dc=all): 'db1141 (re)pooling @ 100%: Slowly repool db1141', diff saved to https://phabricator.wikimedia.org/P14970 and previous config saved to /var/cache/conftool/dbconfig/20210322-091534-root.json
  • 09:00 marostegui@cumin1001: dbctl commit (dc=all): 'db1141 (re)pooling @ 75%: Slowly repool db1141', diff saved to https://phabricator.wikimedia.org/P14969 and previous config saved to /var/cache/conftool/dbconfig/20210322-090030-root.json
  • 08:45 marostegui@cumin1001: dbctl commit (dc=all): 'db1141 (re)pooling @ 50%: Slowly repool db1141', diff saved to https://phabricator.wikimedia.org/P14968 and previous config saved to /var/cache/conftool/dbconfig/20210322-084527-root.json
  • 08:30 marostegui@cumin1001: dbctl commit (dc=all): 'db1141 (re)pooling @ 25%: Slowly repool db1141', diff saved to https://phabricator.wikimedia.org/P14967 and previous config saved to /var/cache/conftool/dbconfig/20210322-083023-root.json
  • 08:13 godog: swift eqiad-prod: less weight for ms-be[1019-1026] / more weight to ms-be106[0-3] - T272836 T268435
  • 08:13 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1158.eqiad.wmnet with reason: REIMAGE
  • 08:11 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1158.eqiad.wmnet with reason: REIMAGE
  • 08:02 jayme: build and release docker-registry.discovery.wmnet/eventrouter:0.3.0-6, docker-registry.discovery.wmnet/fluent-bit:1.5.3-3, docker-registry.discovery.wmnet/ratelimit:1.5.1-s3
  • 08:00 marostegui: Stop MySQL on db1085 to clone db1165 (lag will appear on s6 on wiki replicas) T258361
  • 08:00 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1085 to clone db1165', diff saved to https://phabricator.wikimedia.org/P14965 and previous config saved to /var/cache/conftool/dbconfig/20210322-080020-marostegui.json
  • 07:51 elukey: stop/start mariadb instances on dbstore1004 to reduce buffer pool memory settings - T273865
  • 07:37 marostegui@cumin1001: dbctl commit (dc=all): 'db1161 (re)pooling @ 100%: Slowly repool db1161', diff saved to https://phabricator.wikimedia.org/P14964 and previous config saved to /var/cache/conftool/dbconfig/20210322-073747-root.json
  • 07:22 marostegui@cumin1001: dbctl commit (dc=all): 'db1161 (re)pooling @ 75%: Slowly repool db1161', diff saved to https://phabricator.wikimedia.org/P14963 and previous config saved to /var/cache/conftool/dbconfig/20210322-072243-root.json
  • 07:14 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1141 for schema change', diff saved to https://phabricator.wikimedia.org/P14962 and previous config saved to /var/cache/conftool/dbconfig/20210322-071430-marostegui.json
  • 07:07 marostegui@cumin1001: dbctl commit (dc=all): 'db1161 (re)pooling @ 50%: Slowly repool db1161', diff saved to https://phabricator.wikimedia.org/P14961 and previous config saved to /var/cache/conftool/dbconfig/20210322-070740-root.json
  • 06:52 marostegui@cumin1001: dbctl commit (dc=all): 'db1161 (re)pooling @ 25%: Slowly repool db1161', diff saved to https://phabricator.wikimedia.org/P14960 and previous config saved to /var/cache/conftool/dbconfig/20210322-065236-root.json
  • 06:37 marostegui@cumin1001: dbctl commit (dc=all): 'Remove db1084 from dbctl T276302', diff saved to https://phabricator.wikimedia.org/P14959 and previous config saved to /var/cache/conftool/dbconfig/20210322-063732-marostegui.json
  • 06:11 marostegui: Sanitize db1124 db2094 db1154: taywiki trvwiki mnwwiktionary
  • 04:28 kartik@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
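
The dbctl entries above repool each host in 25% steps. A minimal sketch of how those steps could be pulled out of the log follows. The regex and the `repool_steps` helper are hypothetical, inferred from the entry format shown here rather than taken from dbctl or Stashbot, and the sample lines are abbreviated copies of the db1141 entries.

```python
import re

# Hypothetical pattern inferred from the log entries above; not part of dbctl.
REPOOL_RE = re.compile(
    r"(?P<time>\d{2}:\d{2}) (?P<who>\S+): dbctl commit \(dc=all\): "
    r"'(?P<host>db\d+) \(re\)pooling @ (?P<pct>\d+)%"
)


def repool_steps(lines):
    """Return {host: [(time, percent), ...]} sorted by time of day."""
    steps = {}
    for line in lines:
        m = REPOOL_RE.search(line)
        if m:
            steps.setdefault(m["host"], []).append((m["time"], int(m["pct"])))
    # The log is newest-first; sorting HH:MM strings restores chronological
    # order as long as all entries belong to the same day.
    for host in steps:
        steps[host].sort()
    return steps


# Abbreviated copies of entries from this log (the trailing paste/backup
# details are omitted here).
sample = [
    "• 09:36 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1142 for schema change'",
    "• 09:15 marostegui@cumin1001: dbctl commit (dc=all): 'db1141 (re)pooling @ 100%: Slowly repool db1141'",
    "• 09:00 marostegui@cumin1001: dbctl commit (dc=all): 'db1141 (re)pooling @ 75%: Slowly repool db1141'",
    "• 08:45 marostegui@cumin1001: dbctl commit (dc=all): 'db1141 (re)pooling @ 50%: Slowly repool db1141'",
    "• 08:30 marostegui@cumin1001: dbctl commit (dc=all): 'db1141 (re)pooling @ 25%: Slowly repool db1141'",
]

steps = repool_steps(sample)
for host, seq in steps.items():
    print(host, seq)
```

Depool entries are skipped on purpose, since they do not carry a percentage.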

2021-03-21

  • 10:25 _joe_: restarting gerrit on gerrit1001, using 45G of reserved memory
  • 09:22 elukey: install apache2-bin-dbgsym on gerrit1001 - T277127
  • 08:50 qchris: Restarting apache on gerrit1001 again (all apache workers busy again) see T277127
  • 08:18 qchris: Restarting apache on gerrit1001 (all apache workers busy)

2021-03-20

  • 00:22 tzatziki: altering emails for STei (WMF) and SGrabarczuk (WMF)

2021-03-19

  • 21:11 mutante: scandium - stop apache and rerun puppet, which fails after reimaging because it tries to run nginx on port 80, which is already in use by apache (T268248)
  • 20:31 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on scandium.eqiad.wmnet with reason: REIMAGE
  • 20:29 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on scandium.eqiad.wmnet with reason: REIMAGE
  • 20:15 mutante: scandium - reimaging with buster
  • 20:14 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on scandium.eqiad.wmnet with reason: reimage
  • 20:14 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on scandium.eqiad.wmnet with reason: reimage
  • 20:11 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2245.codfw.wmnet
  • 19:55 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2245.codfw.wmnet
  • 19:53 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2244.codfw.wmnet
  • 19:53 legoktm@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host lists1002.wikimedia.org
  • 19:50 mutante: testreduce1001 - confirmed MariaDB @@datadir is /srv/data/mysql and deleting /var/lib/mysql (T277580)
  • 19:40 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2244.codfw.wmnet
  • 19:39 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw2245.codfw.wmnet
  • 19:39 legoktm@cumin1001: START - Cookbook sre.ganeti.makevm for new host lists1002.wikimedia.org
  • 19:39 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw2244.codfw.wmnet
  • 19:37 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw2252.codfw.wmnet,service=canary
  • 19:37 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw2251.codfw.wmnet,service=canary
  • 19:33 dzahn@cumin1001: conftool action : set/weight=1; selector: name=mw2252.codfw.wmnet,service=canary
  • 19:33 dzahn@cumin1001: conftool action : set/weight=1; selector: name=mw2251.codfw.wmnet,service=canary
  • 19:24 mutante: deploy2002 - re-enabled puppet, reverted patch of scap-master-sync
  • 18:46 mutante: deploy2002 - disable puppet, copy modified version of scap-master-sync over it that does not --exclude="**/cache/l10n/*.cdb" (for T275826)
  • 16:01 effie: upgrade memcached on mc-gp200*
  • 12:36 klausman@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2002.codfw.wmnet with reason: REIMAGE
  • 12:34 klausman@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2002.codfw.wmnet with reason: REIMAGE
  • 12:10 effie: upgrade memcached on mc1026,mc2026
  • 11:37 klausman@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
  • 11:37 klausman@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
  • 11:36 klausman@deploy1002: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
  • 11:36 klausman@deploy1002: helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
  • 11:30 klausman@deploy1002: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
  • 11:29 klausman@deploy1002: helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
  • 11:29 klausman@deploy1002: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
  • 11:29 klausman@deploy1002: helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
  • 11:29 klausman@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
  • 11:29 klausman@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
  • 11:27 akosiaris@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
  • 11:27 akosiaris@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
  • 11:20 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2002.codfw.wmnet with reason: REIMAGE
  • 11:18 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2002.codfw.wmnet with reason: REIMAGE
  • 10:45 klausman@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
  • 10:45 klausman@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
  • 10:42 moritzm: installing dbmonitor1002 T224589
  • 10:42 klausman@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
  • 10:42 klausman@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
  • 10:41 klausman@deploy1002: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
  • 10:41 klausman@deploy1002: helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
  • 10:11 kharlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 10:10 kharlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 10:05 kharlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 10:04 kharlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 09:40 kharlan@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 09:36 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' .
  • 08:22 elukey: upload alluxio 2.4.1 to thirdparty/bigtop15 on stretch/buster-wikimedia
  • 07:16 ryankemper: T275885 `ryankemper@cumin1001:~$ sudo cumin 'P{relforge*}' 'sudo run-puppet-agent'` (change hadn't been merged when I ran the agent earlier)
  • 04:04 eileen: civicrm revision changed from 99bf1c9210 to 39d24e8b0a, config revision is 26b02db7ba
  • 03:27 ryankemper: [wdqs] `ryankemper@wdqs1013:~$ sudo systemctl restart wdqs-blazegraph`
  • 03:26 ryankemper: T275885 `ryankemper@cumin1001:~$ sudo cumin 'P{relforge*}' 'sudo run-puppet-agent'`
  • 02:43 ryankemper: T275885 Revoking current `relforge` TLS cert in advance of generation of new cert: `ryankemper@puppetmaster1001:/srv/private$ sudo puppet cert clean relforge.svc.eqiad.wmnet`
  • 00:51 dancy@deploy1002: Synchronized php-1.36.0-wmf.35/extensions/LiquidThreads/classes/Thread.php: T277772 (duration: 00m 58s)
  • 00:45 mutante: testreduce1001 - stop mysql; rsyncing /var/lib/mysql to /srv/data/mysql (T277580)
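
Cookbook runs are logged as a START entry followed later by an END entry. The sketch below pairs them to estimate how long each run took. The regex and the `cookbook_durations` helper are assumptions based on the entry format in this log, not part of Spicerack or Stashbot. Entries have minute resolution only, and the sketch assumes a run does not cross midnight.

```python
import re
from datetime import datetime

# Hypothetical pattern inferred from the START/END entries in this log.
COOKBOOK_RE = re.compile(
    r"(?P<time>\d{2}:\d{2}) \S+: (?P<state>START|END \(PASS\)) - "
    r"Cookbook (?P<name>\S+)(?: \(exit_code=\d+\))?(?P<target>.*)$"
)


def cookbook_durations(lines):
    """Return {(cookbook, target): minutes} for each START/END pair."""
    started = {}
    durations = {}
    for line in reversed(lines):  # the log is newest-first
        m = COOKBOOK_RE.search(line)
        if not m:
            continue
        key = (m["name"], m["target"].strip())
        when = datetime.strptime(m["time"], "%H:%M")
        if m["state"] == "START":
            started[key] = when
        elif key in started:
            delta = when - started.pop(key)
            durations[key] = int(delta.total_seconds() // 60)
    return durations


# Entries copied from the 2021-03-19 section above.
sample = [
    "• 20:11 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2245.codfw.wmnet",
    "• 19:55 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2245.codfw.wmnet",
    "• 19:53 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2244.codfw.wmnet",
    "• 19:40 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2244.codfw.wmnet",
]

durations = cookbook_durations(sample)
for key, minutes in durations.items():
    print(key, minutes)
```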

2021-03-18

  • 23:56 legoktm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Don't define a default icon (T274199) (duration: 00m 57s)
  • 23:38 brennen@deploy1002: Synchronized php-1.36.0-wmf.35/includes/user/ActorStore.php: Backport: ActorStore::getActorById - fall back to master. (T277795) (duration: 00m 57s)
  • 23:35 brennen@deploy1002: Synchronized php-1.36.0-wmf.35/includes/user/ActorStore.php: Backport: ActorStore::getActorById - fall back to master. (T277795) (duration: 00m 58s)
  • 23:25 dduvall@deploy1002: Synchronized .pipeline: config: Use build environment HTTP proxy for APT sources (T277109) (duration: 01m 02s)
  • 23:06 brennen: train status: 1.36.0-wmf.35 (T274939) stable on all wikis after deploy of hotfix for T277795
  • 22:53 brennen@deploy1002: Synchronized php-1.36.0-wmf.35/includes/specials/SpecialContributions.php: Backport: ActorStore::getActorById - fall back to master. (T277795) (duration: 01m 07s)
  • 22:30 dduvall@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
  • 22:29 dduvall@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
  • 22:25 dduvall@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
  • 20:37 dancy@deploy1002: Synchronized php-1.36.0-wmf.35/extensions/LiquidThreads/classes/Thread.php: (no justification provided) (duration: 01m 05s)
  • 19:04 dancy@deploy1002: rebuilt and synchronized wikiversions files: group2 wikis to 1.36.0-wmf.35
  • 18:28 legoktm: re-enabled puppet on registry*
  • 18:17 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 44eddcc: hrwiki: Deploy Growth features to newcomers (T275684) (duration: 01m 08s)
  • 18:12 urbanecm@deploy1002: Synchronized dblists/growthexperiments.dblist: 179d9e5: mswiki: Enable Growth features in stealth mode (T277562; 2/2) (duration: 01m 08s)
  • 18:10 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 179d9e5: mswiki: Enable Growth features in stealth mode (T277562; 1/2) (duration: 01m 11s)
  • 17:58 legoktm: disabled puppet on registry* for rolling out https://gerrit.wikimedia.org/r/672537
  • 17:50 urbanecm@deploy1002: Synchronized dblists/growthexperiments.dblist: 55aa6cb: tewiki: Enable Growth features in stealth mode (T277491; 2/2) (duration: 01m 08s)
  • 17:50 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2242.codfw.wmnet
  • 17:48 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 55aa6cb: tewiki: Enable Growth features in stealth mode (T277491; 1/2) (duration: 01m 10s)
  • 17:45 urbanecm@deploy1002: Synchronized dblists/growthexperiments.dblist: 04342e9: simplewiki: Enable Growth team features in stealth mode (T277550) (duration: 01m 09s)
  • 17:42 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 04342e9: simplewiki: Enable Growth team features in stealth mode (T277550) (duration: 01m 10s)
  • 17:40 dduvall@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
  • 17:31 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2242.codfw.wmnet
  • 17:28 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2241.codfw.wmnet
  • 17:09 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2241.codfw.wmnet
  • 17:07 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2240.codfw.wmnet
  • 16:54 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2240.codfw.wmnet
  • 16:51 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2239.codfw.wmnet
  • 16:38 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2239.codfw.wmnet
  • 16:37 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw2242.codfw.wmnet
  • 16:37 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw2241.codfw.wmnet
  • 16:37 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw2240.codfw.wmnet
  • 16:37 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw2239.codfw.wmnet
  • 15:33 shdubsh: clean up dead letter queue and restart all logstashes
  • 14:50 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 14:43 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
  • 14:37 dcausse: repooling wdqs1005
  • 14:29 hashar: Restarting CI Jenkins for plugin upgrade
  • 13:49 elukey: reboot analytics1066
  • 13:23 ladsgroup@deploy1002: Synchronized php-1.36.0-wmf.35/extensions/Wikibase/repo: languageLabelDescriptionAliases: use getLanguageNameByCode (T275611 T277722) (duration: 01m 14s)
  • 12:58 jbond42: upload cas_6.3.2 to apt buster-wikimedia
  • 11:37 mvolz@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'zotero' for release 'production' .
  • 11:34 mvolz@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'zotero' for release 'production' .
  • 11:25 mvolz@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'staging' .
  • 11:24 urbanecm@deploy1002: Synchronized wmf-config/flaggedrevs.php: 896c9f0: flaggedrevs: Disable multiple dimensions in hewikisource (duration: 01m 09s)
  • 11:20 urbanecm@deploy1002: Synchronized php-1.36.0-wmf.35/extensions/GrowthExperiments/includes/HomepageHooks.php: 3b2aa1a: Remove variant C from list of valid variants (T277727) (duration: 01m 09s)
  • 11:16 mvolz@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' .
  • 11:14 mvolz@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' .
  • 11:11 mvolz@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' .
  • 11:11 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 0005676: GrowthExperiments: set $wgGEHomepageNewAccountVariants to D only (T277727) (duration: 01m 10s)
  • 11:08 urbanecm@deploy1002: Synchronized wmf-config/CommonSettings.php: NOOP: e7f5eac: Enable CentralAuth IRC feed in beta cluster (T277432) (duration: 01m 12s)
  • 09:13 _joe_: hard reboot of snapshot1005
  • 09:04 _joe_: attempted reboot of snapshot1005; the filesystem is read-only and the disks are probably broken beyond repair
  • 08:27 godog: swift eqiad-prod: less weight for ms-be[1019-1026] - T272836
  • 08:18 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1004.eqiad.wmnet with reason: REIMAGE
  • 08:16 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1004.eqiad.wmnet with reason: REIMAGE
  • 08:03 marostegui@cumin1001: dbctl commit (dc=all): 'db1126 (re)pooling @ 100%: Slowly repool db1126', diff saved to https://phabricator.wikimedia.org/P14946 and previous config saved to /var/cache/conftool/dbconfig/20210318-080258-root.json
  • 08:02 akosiaris: reimage ml-serve1004 to debug a docker volume_group issue
  • 07:47 marostegui@cumin1001: dbctl commit (dc=all): 'db1126 (re)pooling @ 75%: Slowly repool db1126', diff saved to https://phabricator.wikimedia.org/P14945 and previous config saved to /var/cache/conftool/dbconfig/20210318-074754-root.json
  • 07:32 marostegui@cumin1001: dbctl commit (dc=all): 'db1126 (re)pooling @ 50%: Slowly repool db1126', diff saved to https://phabricator.wikimedia.org/P14944 and previous config saved to /var/cache/conftool/dbconfig/20210318-073250-root.json
  • 07:20 dcausse: depooling & restarting blazegraph on wdqs1005
  • 07:19 marostegui: Deploy schema change on s4 codfw master, lag will appear - T276150 T276156
  • 07:17 marostegui@cumin1001: dbctl commit (dc=all): 'db1126 (re)pooling @ 25%: Slowly repool db1126', diff saved to https://phabricator.wikimedia.org/P14943 and previous config saved to /var/cache/conftool/dbconfig/20210318-071747-root.json
  • 07:15 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1156.eqiad.wmnet with reason: REIMAGE
  • 07:13 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1156.eqiad.wmnet with reason: REIMAGE
  • 06:32 marostegui@cumin1001: dbctl commit (dc=all): 'Add db1161 to dbctl, depooled T258361', diff saved to https://phabricator.wikimedia.org/P14942 and previous config saved to /var/cache/conftool/dbconfig/20210318-063241-marostegui.json
  • 06:22 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2120', diff saved to https://phabricator.wikimedia.org/P14941 and previous config saved to /var/cache/conftool/dbconfig/20210318-062201-marostegui.json
  • 06:04 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1126 for schema change', diff saved to https://phabricator.wikimedia.org/P14940 and previous config saved to /var/cache/conftool/dbconfig/20210318-060445-marostegui.json
  • 03:46 andrewbogott: restarting slapd on seaborgium, serpens, and r-o ldap replicas (we're getting irregular connection failures)
  • 00:05 eileen: tools revision changed from b7b4060c30 to ef54260b0d
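
The "Synchronized" entries above end with a duration such as `(duration: 01m 08s)`. A small sketch for converting that suffix to seconds follows. The `sync_seconds` helper is hypothetical and assumes the `NNm NNs` format seen in this log.

```python
import re

# Hypothetical pattern for the "(duration: MMm SSs)" suffix seen in this log.
DURATION_RE = re.compile(r"\(duration: (\d+)m (\d+)s\)")


def sync_seconds(line):
    """Return the logged duration in seconds, or None if the line has none."""
    m = DURATION_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)) * 60 + int(m.group(2))


# Entries copied from the 2021-03-18 section above.
with_duration = (
    "• 18:17 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: "
    "44eddcc: hrwiki: Deploy Growth features to newcomers (T275684) (duration: 01m 08s)"
)
without_duration = "• 14:37 dcausse: repooling wdqs1005"

print(sync_seconds(with_duration))
print(sync_seconds(without_duration))
```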

2021-03-17

  • 23:42 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: c730dd5: idwiki: Deploy Growth features to newcomers (T259024) (duration: 01m 08s)
  • 23:40 urbanecm@deploy1002: Synchronized wmf-config/CommonSettings.php: 5c14e7d: Define confirmed group in MediaWikiServices hook (T275334, T277704, T275310, T275333) (duration: 01m 08s)
  • 23:30 ebernhardson@deploy1002: Synchronized php-1.36.0-wmf.35/extensions/CirrusSearch/profiles/FallbackProfiles.config.php: Add fallback profile including glent m1 (duration: 01m 42s)
  • 22:27 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1038.eqiad.wmnet with reason: REIMAGE
  • 22:25 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1037.eqiad.wmnet with reason: REIMAGE
  • 22:25 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mc1038.eqiad.wmnet with reason: REIMAGE
  • 22:23 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mc1037.eqiad.wmnet with reason: REIMAGE
  • 20:52 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1184.eqiad.wmnet with reason: REIMAGE
  • 20:50 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1183.eqiad.wmnet with reason: REIMAGE
  • 20:49 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1184.eqiad.wmnet with reason: REIMAGE
  • 20:48 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1182.eqiad.wmnet with reason: REIMAGE
  • 20:47 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1183.eqiad.wmnet with reason: REIMAGE
  • 20:46 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1181.eqiad.wmnet with reason: REIMAGE
  • 20:45 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1182.eqiad.wmnet with reason: REIMAGE
  • 20:44 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1180.eqiad.wmnet with reason: REIMAGE
  • 20:43 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1181.eqiad.wmnet with reason: REIMAGE
  • 20:42 andrew@deploy1002: Finished deploy [horizon/deploy@17ea780]: display volume usage summaries (duration: 03m 34s)
  • 20:42 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1179.eqiad.wmnet with reason: REIMAGE
  • 20:41 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1180.eqiad.wmnet with reason: REIMAGE
  • 20:40 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1178.eqiad.wmnet with reason: REIMAGE
  • 20:39 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1179.eqiad.wmnet with reason: REIMAGE
  • 20:39 andrew@deploy1002: Started deploy [horizon/deploy@17ea780]: display volume usage summaries
  • 20:38 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1177.eqiad.wmnet with reason: REIMAGE
  • 20:37 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1178.eqiad.wmnet with reason: REIMAGE
  • 20:35 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1177.eqiad.wmnet with reason: REIMAGE
  • 20:19 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2238.codfw.wmnet
  • 20:08 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2238.codfw.wmnet
  • 20:07 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1176.eqiad.wmnet with reason: REIMAGE
  • 20:05 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1176.eqiad.wmnet with reason: REIMAGE
  • 20:05 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2237.codfw.wmnet
  • 19:54 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2237.codfw.wmnet
  • 19:51 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2236.codfw.wmnet
  • 19:48 andrew@deploy1002: Finished deploy [horizon/deploy@3c2d1ee]: support VM resizing (duration: 03m 42s)
  • 19:44 andrew@deploy1002: Started deploy [horizon/deploy@3c2d1ee]: support VM resizing
  • 19:42 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2236.codfw.wmnet
  • 19:42 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw2238.codfw.wmnet
  • 19:42 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw2237.codfw.wmnet
  • 19:42 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw2236.codfw.wmnet
  • 19:38 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2235.codfw.wmnet
  • 19:29 mutante: testreduce1001 - rebooted, fdisk /dev/vdb, create partition table, create primary partition, mkfs.ext4 /dev/vdb1
  • 19:23 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2235.codfw.wmnet
  • 19:18 andrew@deploy1002: Finished deploy [horizon/deploy@8967660]: clean up a reverted hack (duration: 03m 25s)
  • 19:17 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2234.codfw.wmnet
  • 19:14 andrew@deploy1002: Started deploy [horizon/deploy@8967660]: clean up a reverted hack
  • 19:06 dancy@deploy1002: Synchronized php: group1 wikis to 1.36.0-wmf.35 (duration: 01m 26s)
  • 19:05 mutante: ganeti1011 - rebooting VM testreduce1001 on ganeti level for T277580
  • 19:04 dancy@deploy1002: rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.35
  • 19:02 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2234.codfw.wmnet
  • 19:01 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2233.codfw.wmnet
  • 18:58 catrope@deploy1002: Synchronized php-1.36.0-wmf.35/extensions/WikimediaEvents/: sessionTick: Tick right away on sessionReset (T277515) (duration: 01m 10s)
  • 18:52 catrope@deploy1002: Synchronized php-1.36.0-wmf.35/vendor/: Bump wikimedia/parsoid to 0.13.0-a28 (T276649) (duration: 01m 18s)
  • 18:43 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2233.codfw.wmnet
  • 18:43 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw2235.codfw.wmnet
  • 18:43 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw2234.codfw.wmnet
  • 18:43 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw2233.codfw.wmnet
  • 18:33 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2232.codfw.wmnet
  • 18:31 catrope@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Define Portal and Portal talk namespace for niawiki (T277671) (duration: 01m 11s)
  • 18:10 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 18:10 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2232.codfw.wmnet
  • 18:09 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2231.codfw.wmnet
  • 18:08 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
  • 17:56 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2231.codfw.wmnet
  • 17:52 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2230.codfw.wmnet
  • 17:50 razzi: update firewall rules to allow mysql-sqoop in analytics-in4 to access clouddb1021 - https://gerrit.wikimedia.org/r/c/operations/homer/public/+/672797
  • 17:47 ejegg: updated payments-wiki from 0405ea1723 to b06009c099
  • 17:41 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2230.codfw.wmnet
  • 17:35 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 17:28 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
  • 17:09 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 17:04 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
  • 16:50 andrew@deploy1002: Finished deploy [horizon/deploy@8c50f27]: more support for disabled flavors (duration: 02m 32s)
  • 16:48 andrew@deploy1002: Started deploy [horizon/deploy@8c50f27]: more support for disabled flavors
  • 16:45 andrew@deploy1002: Finished deploy [horizon/deploy@8c50f27]: more support for disabled flavors (duration: 00m 07s)
  • 16:45 andrew@deploy1002: Started deploy [horizon/deploy@8c50f27]: more support for disabled flavors
  • 16:44 andrew@deploy1002: Finished deploy [horizon/deploy@e4fd934]: more support for disabled flavors (duration: 00m 07s)
  • 16:44 andrew@deploy1002: Started deploy [horizon/deploy@e4fd934]: more support for disabled flavors
  • 16:38 effie: upgrade memcached on mc1025, mc2025
  • 16:06 dancy@deploy1002: rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.35
  • 16:04 dancy@deploy1002: Synchronized php-1.36.0-wmf.35/includes/Revision/RevisionRecord.php: (no justification provided) (duration: 00m 58s)
  • 15:54 ejegg: updated standalone SmashPig deployment from 58b070db1a to 250a8570d1
  • 15:23 jmm@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host dbmonitor1002.wikimedia.org
  • 14:56 jmm@cumin1001: START - Cookbook sre.ganeti.makevm for new host dbmonitor1002.wikimedia.org
  • 14:45 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testreduce1001.eqiad.wmnet
  • 14:25 marostegui@cumin1001: dbctl commit (dc=all): 'db1087 (re)pooling @ 100%: Slowly repool db1087', diff saved to https://phabricator.wikimedia.org/P14935 and previous config saved to /var/cache/conftool/dbconfig/20210317-142532-root.json
  • 14:18 jayme: rebooting testreduce1001 for T277580
  • 14:17 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host testreduce1001.eqiad.wmnet
  • 14:10 marostegui@cumin1001: dbctl commit (dc=all): 'db1087 (re)pooling @ 75%: Slowly repool db1087', diff saved to https://phabricator.wikimedia.org/P14934 and previous config saved to /var/cache/conftool/dbconfig/20210317-141028-root.json
  • 14:02 jayme@cumin1001: conftool action : set/pooled=true; selector: name=codfw,dnsdisc=sessionstore
  • 14:02 jayme@cumin1001: conftool action : set/pooled=true; selector: name=codfw,dnsdisc=eventgate-analytics
  • 14:01 otto@deploy1002: Finished deploy [analytics/refinery@d2f1b28] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d2f1b28] (duration: 04m 19s)
  • 13:59 razzi@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1001.eqiad.wmnet
  • 13:58 moritzm: added bullseye tftpboot environment T275873
  • 13:56 otto@deploy1002: Started deploy [analytics/refinery@d2f1b28] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d2f1b28]
  • 13:56 otto@deploy1002: Finished deploy [analytics/refinery@d2f1b28] (thin): Regular analytics weekly train THIN [analytics/refinery@d2f1b28] (duration: 00m 06s)
  • 13:56 otto@deploy1002: Started deploy [analytics/refinery@d2f1b28] (thin): Regular analytics weekly train THIN [analytics/refinery@d2f1b28]
  • 13:55 marostegui@cumin1001: dbctl commit (dc=all): 'db1087 (re)pooling @ 50%: Slowly repool db1087', diff saved to https://phabricator.wikimedia.org/P14933 and previous config saved to /var/cache/conftool/dbconfig/20210317-135522-root.json
  • 13:54 razzi@cumin1001: START - Cookbook sre.hosts.reboot-single for host an-conf1001.eqiad.wmnet
  • 13:52 razzi@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1003.eqiad.wmnet
  • 13:52 otto@deploy1002: Finished deploy [analytics/refinery@d2f1b28]: Regular analytics weekly train [analytics/refinery@d2f1b28] (duration: 11m 36s)
  • 13:47 jayme@cumin1001: conftool action : set/pooled=true; selector: name=codfw,dnsdisc=eventgate-analytics-external
  • 13:47 jayme@cumin1001: conftool action : set/pooled=true; selector: name=codfw,dnsdisc=eventgate-logging-external
  • 13:47 jayme@cumin1001: conftool action : set/pooled=true; selector: name=codfw,dnsdisc=api-gateway
  • 13:47 jayme@cumin1001: conftool action : set/pooled=true; selector: name=codfw,dnsdisc=echostore
  • 13:47 razzi@cumin1001: START - Cookbook sre.hosts.reboot-single for host an-conf1003.eqiad.wmnet
  • 13:46 razzi@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1002.eqiad.wmnet
  • 13:41 razzi@cumin1001: START - Cookbook sre.hosts.reboot-single for host an-conf1002.eqiad.wmnet
  • 13:40 otto@deploy1002: Started deploy [analytics/refinery@d2f1b28]: Regular analytics weekly train [analytics/refinery@d2f1b28]
  • 13:40 marostegui@cumin1001: dbctl commit (dc=all): 'db1087 (re)pooling @ 25%: Slowly repool db1087', diff saved to https://phabricator.wikimedia.org/P14932 and previous config saved to /var/cache/conftool/dbconfig/20210317-134018-root.json
  • 13:38 kormat: stopping db2137:s5 T277632
  • 13:33 kormat: stopping db2089:s5 T277632
  • 13:31 otto@deploy1002: Finished deploy [analytics/aqs/deploy@3e92346]: deploy aqs as part of train - T207171, T263697 (duration: 03m 24s)
  • 13:27 otto@deploy1002: Started deploy [analytics/aqs/deploy@3e92346]: deploy aqs as part of train - T207171, T263697
  • 13:23 jynus: stopping s5 instance on db2099 and restoring from backup T277632
  • 13:17 jayme@cumin1001: conftool action : set/pooled=true; selector: name=codfw,dnsdisc=eventstreams
  • 13:14 jayme@cumin1001: conftool action : set/pooled=true; selector: name=codfw,dnsdisc=eventstreams-internal
  • 13:13 jayme@cumin1001: conftool action : set/pooled=true; selector: name=codfw,dnsdisc=mobileapps
  • 13:13 jayme@cumin1001: conftool action : set/pooled=true; selector: name=codfw,dnsdisc=wikifeeds
  • 13:13 jayme@cumin1001: conftool action : set/pooled=true; selector: name=codfw,dnsdisc=termbox
  • 13:12 moritzm: installing tiff security updates
  • 12:45 jayme@cumin1001: conftool action : set/pooled=true; selector: name=codfw,dnsdisc=similar-users
  • 12:45 jayme@cumin1001: conftool action : set/pooled=true; selector: name=codfw,dnsdisc=push-notifications
  • 12:45 jayme@cumin1001: conftool action : set/pooled=true; selector: name=codfw,dnsdisc=proton
  • 12:45 jayme@cumin1001: conftool action : set/pooled=true; selector: name=codfw,dnsdisc=linkrecommendation
  • 12:44 jayme@cumin1001: conftool action : set/pooled=true; selector: name=codfw,dnsdisc=blubberoid
  • 12:44 jayme@cumin1001: conftool action : set/pooled=true; selector: name=codfw,dnsdisc=apertium
  • 12:11 jayme@cumin1001: conftool action : set/pooled=true; selector: name=codfw,dnsdisc=mathoid
  • 12:10 jayme@cumin1001: conftool action : set/pooled=true; selector: name=codfw,dnsdisc=eventgate-main
  • 11:49 marostegui: Deploy schema change on s8, lag will appear on wiki replicas T276150 T276156
  • 11:47 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1087 for schema change', diff saved to https://phabricator.wikimedia.org/P14931 and previous config saved to /var/cache/conftool/dbconfig/20210317-114746-marostegui.json
  • 11:46 marostegui@cumin1001: dbctl commit (dc=all): 'db1109 (re)pooling @ 100%: Slowly repool db1109', diff saved to https://phabricator.wikimedia.org/P14930 and previous config saved to /var/cache/conftool/dbconfig/20210317-114601-root.json
  • 11:30 marostegui@cumin1001: dbctl commit (dc=all): 'db1109 (re)pooling @ 75%: Slowly repool db1109', diff saved to https://phabricator.wikimedia.org/P14929 and previous config saved to /var/cache/conftool/dbconfig/20210317-113057-root.json
  • 11:20 jayme: switch restbase-async back to codfw (the newly initialized cluster)
  • 11:17 jayme@cumin1001: conftool action : set/pooled=true; selector: dnsdisc=restbase-async,name=codfw
  • 11:17 jayme@cumin1001: conftool action : set/pooled=false; selector: dnsdisc=restbase-async,name=eqiad
  • 11:15 marostegui@cumin1001: dbctl commit (dc=all): 'db1109 (re)pooling @ 50%: Slowly repool db1109', diff saved to https://phabricator.wikimedia.org/P14928 and previous config saved to /var/cache/conftool/dbconfig/20210317-111553-root.json
  • 11:09 moritzm: restarting tomcat on idp.wikimedia.org
  • 11:00 marostegui@cumin1001: dbctl commit (dc=all): 'db1109 (re)pooling @ 25%: Slowly repool db1109', diff saved to https://phabricator.wikimedia.org/P14927 and previous config saved to /var/cache/conftool/dbconfig/20210317-110050-root.json
  • 09:59 moritzm: imported PHP 5.6.40 to thirdparty/php56 T224589
  • 09:47 vgutierrez: restart varnish-fe on cp5011
  • 09:24 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1109 for schema change', diff saved to https://phabricator.wikimedia.org/P14926 and previous config saved to /var/cache/conftool/dbconfig/20210317-092443-marostegui.json
  • 09:23 marostegui@cumin1001: dbctl commit (dc=all): 'db1114 (re)pooling @ 100%: Slowly repool db1114', diff saved to https://phabricator.wikimedia.org/P14925 and previous config saved to /var/cache/conftool/dbconfig/20210317-092357-root.json
  • 09:08 marostegui@cumin1001: dbctl commit (dc=all): 'db1114 (re)pooling @ 75%: Slowly repool db1114', diff saved to https://phabricator.wikimedia.org/P14924 and previous config saved to /var/cache/conftool/dbconfig/20210317-090853-root.json
  • 09:04 jayme@cumin1001: conftool action : set/pooled=true; selector: name=codfw,dnsdisc=recommendation-api
  • 09:04 jayme@cumin1001: conftool action : set/pooled=true; selector: name=codfw,dnsdisc=cxserver
  • 09:04 jayme@cumin1001: conftool action : set/pooled=true; selector: name=codfw,dnsdisc=citoid
  • 09:01 marostegui@cumin1001: dbctl commit (dc=all): 'db1082 (re)pooling @ 100%: Slowly repool db1082', diff saved to https://phabricator.wikimedia.org/P14923 and previous config saved to /var/cache/conftool/dbconfig/20210317-090108-root.json
  • 08:58 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1084 T276302', diff saved to https://phabricator.wikimedia.org/P14922 and previous config saved to /var/cache/conftool/dbconfig/20210317-085852-marostegui.json
  • 08:53 marostegui@cumin1001: dbctl commit (dc=all): 'db1114 (re)pooling @ 50%: Slowly repool db1114', diff saved to https://phabricator.wikimedia.org/P14921 and previous config saved to /var/cache/conftool/dbconfig/20210317-085350-root.json
  • 08:46 marostegui@cumin1001: dbctl commit (dc=all): 'db1082 (re)pooling @ 75%: Slowly repool db1082', diff saved to https://phabricator.wikimedia.org/P14920 and previous config saved to /var/cache/conftool/dbconfig/20210317-084605-root.json
  • 08:38 marostegui@cumin1001: dbctl commit (dc=all): 'db1114 (re)pooling @ 25%: Slowly repool db1114', diff saved to https://phabricator.wikimedia.org/P14919 and previous config saved to /var/cache/conftool/dbconfig/20210317-083846-root.json
  • 08:31 marostegui@cumin1001: dbctl commit (dc=all): 'db1082 (re)pooling @ 50%: Slowly repool db1082', diff saved to https://phabricator.wikimedia.org/P14918 and previous config saved to /var/cache/conftool/dbconfig/20210317-083101-root.json
  • 08:15 marostegui@cumin1001: dbctl commit (dc=all): 'db1082 (re)pooling @ 25%: Slowly repool db1082', diff saved to https://phabricator.wikimedia.org/P14917 and previous config saved to /var/cache/conftool/dbconfig/20210317-081557-root.json
  • 07:50 godog: swift eqiad-prod: less weight for ms-be[1019-1026] - T272836
  • 07:34 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1114 for schema change', diff saved to https://phabricator.wikimedia.org/P14916 and previous config saved to /var/cache/conftool/dbconfig/20210317-073403-marostegui.json
  • 07:30 marostegui@cumin1001: dbctl commit (dc=all): 'db1111 (re)pooling @ 100%: Slowly repool db1111', diff saved to https://phabricator.wikimedia.org/P14915 and previous config saved to /var/cache/conftool/dbconfig/20210317-073024-root.json
  • 07:15 marostegui@cumin1001: dbctl commit (dc=all): 'db1111 (re)pooling @ 75%: Slowly repool db1111', diff saved to https://phabricator.wikimedia.org/P14914 and previous config saved to /var/cache/conftool/dbconfig/20210317-071520-root.json
  • 07:00 marostegui@cumin1001: dbctl commit (dc=all): 'db1111 (re)pooling @ 50%: Slowly repool db1111', diff saved to https://phabricator.wikimedia.org/P14913 and previous config saved to /var/cache/conftool/dbconfig/20210317-070017-root.json
  • 06:52 marostegui: Stop MySQL on db1082 to clone db1161 (lag will appear on s5 on wikireplicas) - T258361
  • 06:51 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1082 to clone db1161 T258361', diff saved to https://phabricator.wikimedia.org/P14911 and previous config saved to /var/cache/conftool/dbconfig/20210317-065146-marostegui.json
  • 06:46 marostegui@cumin1001: dbctl commit (dc=all): 'Pool db2150 into s7 T275633', diff saved to https://phabricator.wikimedia.org/P14910 and previous config saved to /var/cache/conftool/dbconfig/20210317-064606-marostegui.json
  • 06:45 marostegui@cumin1001: dbctl commit (dc=all): 'db1111 (re)pooling @ 25%: Slowly repool db1111', diff saved to https://phabricator.wikimedia.org/P14909 and previous config saved to /var/cache/conftool/dbconfig/20210317-064513-root.json
  • 06:03 marostegui@cumin1001: dbctl commit (dc=all): 'Add db2150 to s7, depooled T275633', diff saved to https://phabricator.wikimedia.org/P14908 and previous config saved to /var/cache/conftool/dbconfig/20210317-060358-marostegui.json
  • 05:42 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1111 for schema change', diff saved to https://phabricator.wikimedia.org/P14907 and previous config saved to /var/cache/conftool/dbconfig/20210317-054206-marostegui.json
  • 02:25 eileen: civicrm revision changed from 8c137b94f0 to 99bf1c9210, config revision is ef2767ab91
  • 01:55 eileen: civicrm revision changed from 550be50105 to 8c137b94f0, config revision is ef2767ab91

2021-03-16

  • 23:56 krinkle@deploy1002: Synchronized php-1.36.0-wmf.35/includes/Revision/: I8619ab9e92b, T277362, T275531 (duration: 00m 58s)
  • 23:51 krinkle@deploy1002: Synchronized php-1.36.0-wmf.34/extensions/Scribunto/: I84e8732d8d - tmp logging (duration: 00m 58s)
  • 23:47 Krinkle: There is an uncommitted dirty diff in /srv/mediawiki-staging/php-1.36.0-wmf.34/extensions/WikimediaMaintenance/createExtensionTables.php
  • 23:31 krinkle@deploy1002: Synchronized wmf-config/InitialiseSettings.php: I1ca4f30c2, T262612 (duration: 00m 57s)
  • 23:22 krinkle@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Icd6635cb302cc, T277332 (duration: 00m 58s)
  • 23:07 krinkle@deploy1002: Synchronized wmf-config/InitialiseSettings.php: I8d8c94d95c6 (duration: 00m 59s)
  • 23:03 twentyafterfour: applied hotfix to phabricator/src/infrastructure/customfield/storage/PhabricatorCustomFieldStorage.php and restarted php-fpm
  • 23:02 krinkle@deploy1002: Synchronized wmf-config/CommonSettings.php: I4097cbcb1d5 (duration: 00m 59s)
  • 22:59 krinkle@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Ie24eb2077 (duration: 00m 58s)
  • 20:59 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw2232.codfw.wmnet
  • 20:59 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw2231.codfw.wmnet
  • 20:59 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw2230.codfw.wmnet
  • 20:49 andrew@deploy1002: Finished deploy [horizon/deploy@e4fd934]: tiny horizon patch to support flavor deprecation (duration: 03m 44s)
  • 20:45 andrew@deploy1002: Started deploy [horizon/deploy@e4fd934]: tiny horizon patch to support flavor deprecation
  • 20:15 XioNoX: remove DMZ zone from pfw3-eqiad - T174203
  • 20:00 brennen: 1.36.0-wmf.35 train status (T274939): blocked at group0 on T277362
  • 19:52 dancy@deploy1002: rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.34
  • 19:52 XioNoX: commit changes to pfw3-eqiad - T274422
  • 19:44 dancy@deploy1002: rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.35
  • 19:31 dancy@deploy1002: Finished scap: testwikis wikis to 1.36.0-wmf.35 (duration: 33m 41s)
  • 19:20 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2229.codfw.wmnet
  • 19:11 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw2229.codfw.wmnet
  • 19:10 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2229.codfw.wmnet
  • 19:08 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2228.codfw.wmnet
  • 19:07 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw2228.codfw.wmnet
  • 19:06 XioNoX: commit changes to pfw3-codfw - T274422
  • 18:58 dancy@deploy1002: Started scap: testwikis wikis to 1.36.0-wmf.35
  • 18:55 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2228.codfw.wmnet
  • 18:48 mbsantos@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
  • 18:43 mbsantos@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
  • 18:41 mbsantos@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
  • 18:03 ppchelko@deploy1002: Finished deploy [restbase/deploy@f99ddaa]: Add new wikis T275837 T271983 T273466 T276127 T273460 T276249 (duration: 31m 31s)
  • 17:44 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on aqs1011.eqiad.wmnet with reason: New buster hosts, not in use
  • 17:44 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 6 days, 0:00:00 on aqs1011.eqiad.wmnet with reason: New buster hosts, not in use
  • 17:37 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mw2227.codfw.wmnet
  • 17:32 ppchelko@deploy1002: Started deploy [restbase/deploy@f99ddaa]: Add new wikis T275837 T271983 T273466 T276127 T273460 T276249
  • 17:09 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2227.codfw.wmnet
  • 17:04 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mw2226.codfw.wmnet
  • 16:47 eevans@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'sessionstore' for release 'production' .
  • 16:44 eevans@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'sessionstore' for release 'production' .
  • 16:25 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2242.codfw.wmnet
  • 16:25 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2241.codfw.wmnet
  • 16:24 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2240.codfw.wmnet
  • 16:21 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2226.codfw.wmnet
  • 16:20 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw2227.codfw.wmnet
  • 16:20 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw2226.codfw.wmnet
  • 16:17 mutante: testreduce1001 - gzip /var/log/daemon.log.1 ; apt-get clean .. free some disk space
  • 15:47 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16 days, 16:00:00 on acrux.codfw.wmnet with reason: Extend downtime for like a month until we remove the VMs
  • 15:47 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime for 16 days, 16:00:00 on acrux.codfw.wmnet with reason: Extend downtime for like a month until we remove the VMs
  • 15:47 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16 days, 16:00:00 on acrab.codfw.wmnet with reason: Extend downtime for like a month until we remove the VMs
  • 15:46 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime for 16 days, 16:00:00 on acrab.codfw.wmnet with reason: Extend downtime for like a month until we remove the VMs
  • 15:34 marostegui@cumin1001: dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 100%: Slowly repool db1101:3318', diff saved to https://phabricator.wikimedia.org/P14905 and previous config saved to /var/cache/conftool/dbconfig/20210316-153446-root.json
  • 15:32 ayounsi@deploy1002: Finished deploy [homer/deploy@759f82c]: T277006 (duration: 04m 56s)
  • 15:27 ayounsi@deploy1002: Started deploy [homer/deploy@759f82c]: T277006
  • 15:19 marostegui@cumin1001: dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 75%: Slowly repool db1101:3318', diff saved to https://phabricator.wikimedia.org/P14904 and previous config saved to /var/cache/conftool/dbconfig/20210316-151943-root.json
  • 15:07 hashar@deploy1002: Finished deploy [integration/docroot@cf787a5]: (no justification provided) (duration: 00m 30s)
  • 15:06 hashar@deploy1002: Started deploy [integration/docroot@cf787a5]: (no justification provided)
  • 15:04 marostegui@cumin1001: dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 50%: Slowly repool db1101:3318', diff saved to https://phabricator.wikimedia.org/P14903 and previous config saved to /var/cache/conftool/dbconfig/20210316-150439-root.json
  • 15:03 hashar@deploy1002: Finished deploy [integration/docroot@44d5685]: Verify check can restart php-fpm # T275468 (duration: 00m 07s)
  • 15:03 hashar@deploy1002: Started deploy [integration/docroot@44d5685]: Verify check can restart php-fpm # T275468
  • 14:58 Amir1: end of foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https (T276251 T276129 T275839)
  • 14:53 jmm@cumin2001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ganeti2015.codfw.wmnet
  • 14:49 marostegui@cumin1001: dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 25%: Slowly repool db1101:3318', diff saved to https://phabricator.wikimedia.org/P14902 and previous config saved to /var/cache/conftool/dbconfig/20210316-144935-root.json
  • 14:37 Amir1: start of foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https (T276251 T276129 T275839)
  • 13:45 moritzm: powercycling ganeti2015, stuck on reboot
  • 13:35 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti2015.codfw.wmnet
  • 13:35 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'zotero' for release 'production' .
  • 13:35 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'zotero' for release 'staging' .
  • 13:34 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
  • 13:33 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
  • 13:32 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'termbox' for release 'test' .
  • 13:32 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'termbox' for release 'production' .
  • 13:32 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'termbox' for release 'staging' .
  • 13:32 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'similar-users' for release 'main' .
  • 13:31 moritzm: drain ganeti2015
  • 13:31 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'sessionstore' for release 'staging' .
  • 13:31 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'sessionstore' for release 'production' .
  • 13:30 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'recommendation-api' for release 'production' .
  • 13:30 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2014.codfw.wmnet
  • 13:28 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1101:3318 for schema change', diff saved to https://phabricator.wikimedia.org/P14901 and previous config saved to /var/cache/conftool/dbconfig/20210316-132844-marostegui.json
  • 13:28 marostegui@cumin1001: dbctl commit (dc=all): 'db1172 (re)pooling @ 100%: Slowly repool db1172', diff saved to https://phabricator.wikimedia.org/P14900 and previous config saved to /var/cache/conftool/dbconfig/20210316-132814-root.json
  • 13:28 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'push-notifications' for release 'main' .
  • 13:27 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'push-notifications' for release 'canary' .
  • 13:26 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' .
  • 13:24 urbanecm@deploy1002: Synchronized wmf-config/logos.php: 7fb50c3: trvwiki: set logo to File:Wikipedia-logo-v2-trv.svg (T276246; 2/2) (duration: 00m 57s)
  • 13:24 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
  • 13:24 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
  • 13:23 urbanecm@deploy1002: Synchronized static/images/project-logos/: 7fb50c3: trvwiki: set logo to File:Wikipedia-logo-v2-trv.svg (T276246; 1/2) (duration: 01m 01s)
  • 13:22 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti2014.codfw.wmnet
  • 13:22 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mathoid' for release 'production' .
  • 13:22 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mathoid' for release 'staging' .
  • 13:22 urbanecm@deploy1002: sync-file aborted: 7fb50c3: trvwiki: set logo to File:Wikipedia-logo-v2-trv.svg (T276246) (duration: 00m 00s)
  • 13:20 moritzm: drain ganeti2014
  • 13:19 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2013.codfw.wmnet
  • 13:19 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 13:19 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 13:19 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 13:18 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'eventstreams-internal' for release 'canary' .
  • 13:18 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' .
  • 13:17 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
  • 13:16 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
  • 13:15 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
  • 13:15 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 13:13 marostegui@cumin1001: dbctl commit (dc=all): 'db1172 (re)pooling @ 75%: Slowly repool db1172', diff saved to https://phabricator.wikimedia.org/P14899 and previous config saved to /var/cache/conftool/dbconfig/20210316-131310-root.json
  • 13:13 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 13:13 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
  • 13:12 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 13:12 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 13:12 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti2013.codfw.wmnet
  • 13:10 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' .
  • 13:10 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
  • 13:09 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'echostore' for release 'production' .
  • 13:09 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'echostore' for release 'staging' .
  • 13:08 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 13:08 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
  • 13:07 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'staging' .
  • 13:07 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' .
  • 13:07 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 13:07 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 13:05 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 13:05 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 13:05 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
  • 13:05 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
  • 13:04 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
  • 13:04 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
  • 13:03 akosiaris: sync all services on the new codfw kubernetes cluster T277191
  • 13:02 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'apertium' for release 'production' .
  • 13:02 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'apertium' for release 'staging' .
  • 12:59 moritzm: drain ganeti2013
  • 12:58 marostegui@cumin1001: dbctl commit (dc=all): 'db1172 (re)pooling @ 50%: Slowly repool db1172', diff saved to https://phabricator.wikimedia.org/P14898 and previous config saved to /var/cache/conftool/dbconfig/20210316-125807-root.json
  • 12:55 volans@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 12:53 Urbanecm: New wiki creation is done
  • 12:51 volans@cumin1001: START - Cookbook sre.dns.netbox
  • 12:50 urbanecm@deploy1002: Synchronized wmf-config/flaggedrevs.php: 1426d04: flaggedrevs: Simplify the config a bit (duration: 00m 58s)
  • 12:46 urbanecm@deploy1002: Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 06s)
  • 12:43 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Creating mnwwiktionary (T276125) (duration: 00m 57s)
  • 12:43 marostegui@cumin1001: dbctl commit (dc=all): 'db1172 (re)pooling @ 25%: Slowly repool db1172', diff saved to https://phabricator.wikimedia.org/P14897 and previous config saved to /var/cache/conftool/dbconfig/20210316-124303-root.json
  • 12:42 urbanecm@deploy1002: Synchronized wmf-config/logos.php: Creating mnwwiktionary (T276125) (duration: 01m 00s)
  • 12:41 urbanecm@deploy1002: Synchronized static/images/project-logos/: Creating mnwwiktionary (T276125) (duration: 01m 01s)
  • 12:40 urbanecm@deploy1002: rebuilt and synchronized wikiversions files: Creating mnwwiktionary (T276125)
  • 12:39 urbanecm@deploy1002: Synchronized dblists: Creating mnwwiktionary (T276125) (duration: 00m 57s)
  • 12:39 jayme@deploy1002: helmfile [codfw] DONE helmfile.d/admin 'sync'.
  • 12:37 urbanecm@deploy1002: Synchronized wmf-config/db-codfw.php: Creating mnwwiktionary (T276125) (duration: 00m 58s)
  • 12:36 urbanecm@deploy1002: Synchronized wmf-config/db-eqiad.php: Creating mnwwiktionary (T276125) (duration: 00m 58s)
  • 12:34 urbanecm@deploy1002: Synchronized langlist: Creating trvwiki (T276246) (duration: 00m 58s)
  • 12:33 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Creating trvwiki (T276246) (duration: 00m 57s)
  • 12:32 urbanecm@deploy1002: Synchronized static/images/project-logos/: Creating trvwiki (T276246) (duration: 00m 58s)
  • 12:31 urbanecm@deploy1002: rebuilt and synchronized wikiversions files: Creating trvwiki (T276246)
  • 12:29 urbanecm@deploy1002: Synchronized dblists: Creating trvwiki (T276246) (duration: 00m 57s)
  • 12:28 jayme@deploy1002: helmfile [codfw] START helmfile.d/admin 'sync'.
  • 12:28 urbanecm@deploy1002: Synchronized wmf-config/db-codfw.php: Creating trvwiki (T276246) (duration: 01m 02s)
  • 12:27 urbanecm@deploy1002: Synchronized wmf-config/db-eqiad.php: Creating trvwiki (T276246) (duration: 00m 57s)
  • 12:20 urbanecm@deploy1002: Synchronized langlist: Creating taywiki (T275803) (duration: 00m 57s)
  • 12:19 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Creating taywiki (T275803) (duration: 00m 58s)
  • 12:17 urbanecm@deploy1002: Synchronized wmf-config/logos.php: Creating taywiki (T275803) (duration: 00m 57s)
  • 12:17 jayme@deploy1002: helmfile [codfw] DONE helmfile.d/admin 'sync'.
  • 12:16 urbanecm@deploy1002: Synchronized static/images/project-logos/: Creating taywiki (T275803) (duration: 00m 58s)
  • 12:14 urbanecm@deploy1002: rebuilt and synchronized wikiversions files: Creating taywiki (T275803)
  • 12:12 urbanecm@deploy1002: Synchronized dblists: Creating taywiki (T275803) (duration: 00m 58s)
  • 12:11 urbanecm@deploy1002: Synchronized wmf-config/db-codfw.php: Creating taywiki (T275803) (duration: 01m 02s)
  • 12:10 urbanecm@deploy1002: Synchronized wmf-config/db-eqiad.php: Creating taywiki (T275803) (duration: 00m 59s)
  • 12:10 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs1011.eqiad.wmnet with reason: New buster host
  • 12:10 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs1011.eqiad.wmnet with reason: New buster host
  • 12:07 jayme@deploy1002: helmfile [codfw] START helmfile.d/admin 'sync'.
  • 11:54 jayme@cumin1001: conftool action : set/pooled=yes; selector: dc=codfw,cluster=kubernetes,service=kubesvc
  • 11:43 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1172 for schema change', diff saved to https://phabricator.wikimedia.org/P14896 and previous config saved to /var/cache/conftool/dbconfig/20210316-114310-marostegui.json
  • 11:33 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes2015.codfw.wmnet
  • 11:32 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes2016.codfw.wmnet
  • 11:32 effie: upgrade memcached in mc1023, mc2023
  • 11:31 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes2006.codfw.wmnet
  • 11:30 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host kubernetes2016.codfw.wmnet
  • 11:29 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host kubernetes2015.codfw.wmnet
  • 11:29 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host kubernetes2006.codfw.wmnet
  • 11:29 marostegui@cumin1001: dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 100%: Slowly repool db1099:3318', diff saved to https://phabricator.wikimedia.org/P14895 and previous config saved to /var/cache/conftool/dbconfig/20210316-112931-root.json
  • 11:28 jayme@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host kubernetes2006.codfw.wmnet
  • 11:28 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host kubernetes2006.codfw.wmnet
  • 11:22 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: c444517: 4e66529: dff200b: Enable DiscussionTools features on several projects (T276493; T276498; T277103) (duration: 00m 57s)
  • 11:17 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes2005.codfw.wmnet
  • 11:17 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes2017.codfw.wmnet
  • 11:16 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: f0d5465: Enable DiscussionTools beta features on enwiki (T273146) (duration: 00m 58s)
  • 11:15 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host kubernetes2005.codfw.wmnet
  • 11:14 marostegui@cumin1001: dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 75%: Slowly repool db1099:3318', diff saved to https://phabricator.wikimedia.org/P14893 and previous config saved to /var/cache/conftool/dbconfig/20210316-111427-root.json
  • 11:13 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 835f9ab: Enable ContentTranslation as a default tool in Amharic, Maltese and Uzbek Wikipedias (T276765) (duration: 01m 00s)
  • 11:10 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2014.codfw.wmnet with reason: REIMAGE
  • 11:08 akosiaris@cumin1001: conftool action : set/pooled=yes; selector: dc=codfw,service=kubemaster,name=.*,cluster=kubernetes
  • 11:08 akosiaris@cumin1001: conftool action : set/weight=10; selector: dc=codfw,service=kubemaster,name=.*,cluster=kubernetes
  • 11:07 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2013.codfw.wmnet with reason: REIMAGE
  • 11:06 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2014.codfw.wmnet with reason: REIMAGE
  • 11:05 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2012.codfw.wmnet with reason: REIMAGE
  • 11:04 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2013.codfw.wmnet with reason: REIMAGE
  • 11:03 jayme@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes2010.codfw.wmnet with reason: REIMAGE
  • 11:02 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2012.codfw.wmnet with reason: REIMAGE
  • 11:01 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2011.codfw.wmnet with reason: REIMAGE
  • 11:00 jayme@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes2004.codfw.wmnet with reason: REIMAGE
  • 10:59 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host kubernetes2017.codfw.wmnet
  • 10:59 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2009.codfw.wmnet with reason: REIMAGE
  • 10:59 marostegui@cumin1001: dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 50%: Slowly repool db1099:3318', diff saved to https://phabricator.wikimedia.org/P14892 and previous config saved to /var/cache/conftool/dbconfig/20210316-105924-root.json
  • 10:59 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2011.codfw.wmnet with reason: REIMAGE
  • 10:58 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2010.codfw.wmnet with reason: REIMAGE
  • 10:57 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2008.codfw.wmnet with reason: REIMAGE
  • 10:55 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2009.codfw.wmnet with reason: REIMAGE
  • 10:55 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2007.codfw.wmnet with reason: REIMAGE
  • 10:55 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2004.codfw.wmnet with reason: REIMAGE
  • 10:54 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2008.codfw.wmnet with reason: REIMAGE
  • 10:53 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2003.codfw.wmnet with reason: REIMAGE
  • 10:52 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2007.codfw.wmnet with reason: REIMAGE
  • 10:51 marostegui@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 100%: Slowly repool db1162', diff saved to https://phabricator.wikimedia.org/P14891 and previous config saved to /var/cache/conftool/dbconfig/20210316-105128-root.json
  • 10:51 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2002.codfw.wmnet with reason: REIMAGE
  • 10:51 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2003.codfw.wmnet with reason: REIMAGE
  • 10:49 jayme@cumin1001: conftool action : set/pooled=inactive; selector: dc=codfw,service=kubesvc,name=kubernetes2006.codfw.wmnet
  • 10:49 jayme@cumin1001: conftool action : set/pooled=inactive; selector: dc=codfw,service=kubesvc,name=kubernetes2005.codfw.wmnet
  • 10:49 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2001.codfw.wmnet with reason: REIMAGE
  • 10:49 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2002.codfw.wmnet with reason: REIMAGE
  • 10:47 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2001.codfw.wmnet with reason: REIMAGE
  • 10:47 jayme@cumin1001: conftool action : set/pooled=inactive; selector: dc=codfw,service=kubesvc,name=kubernetes2015.codfw.wmnet
  • 10:46 jayme@cumin1001: conftool action : set/pooled=inactive; selector: dc=codfw,service=kubesvc,name=kubernetes2016.codfw.wmnet
  • 10:44 marostegui@cumin1001: dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 25%: Slowly repool db1099:3318', diff saved to https://phabricator.wikimedia.org/P14890 and previous config saved to /var/cache/conftool/dbconfig/20210316-104420-root.json
  • 10:36 marostegui@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 75%: Slowly repool db1162', diff saved to https://phabricator.wikimedia.org/P14889 and previous config saved to /var/cache/conftool/dbconfig/20210316-103625-root.json
  • 10:21 marostegui@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 60%: Slowly repool db1162', diff saved to https://phabricator.wikimedia.org/P14887 and previous config saved to /var/cache/conftool/dbconfig/20210316-102121-root.json
  • 10:15 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2012.codfw.wmnet
  • 10:07 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti2012.codfw.wmnet
  • 10:07 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1161.eqiad.wmnet with reason: REIMAGE
  • 10:06 marostegui@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 50%: Slowly repool db1162', diff saved to https://phabricator.wikimedia.org/P14886 and previous config saved to /var/cache/conftool/dbconfig/20210316-100617-root.json
  • 10:05 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1161.eqiad.wmnet with reason: REIMAGE
  • 10:03 moritzm: drain ganeti2012
  • 10:00 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2011.codfw.wmnet
  • 09:59 akosiaris: Push new certs for kubemaster.svc.codfw.wmnet - T277191
  • 09:51 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti2011.codfw.wmnet
  • 09:51 marostegui@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 49%: Slowly repool db1162', diff saved to https://phabricator.wikimedia.org/P14885 and previous config saved to /var/cache/conftool/dbconfig/20210316-095113-root.json
  • 09:50 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubetcd2006.codfw.wmnet
  • 09:48 moritzm: drain ganeti2011
  • 09:46 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubetcd2005.codfw.wmnet
  • 09:46 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host kubetcd2006.codfw.wmnet
  • 09:44 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host kubetcd2005.codfw.wmnet
  • 09:44 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubetcd2004.codfw.wmnet
  • 09:41 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1099:3318 for schema change', diff saved to https://phabricator.wikimedia.org/P14884 and previous config saved to /var/cache/conftool/dbconfig/20210316-094117-marostegui.json
  • 09:40 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host kubetcd2004.codfw.wmnet
  • 09:36 marostegui@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 25%: Slowly repool db1162', diff saved to https://phabricator.wikimedia.org/P14883 and previous config saved to /var/cache/conftool/dbconfig/20210316-093609-root.json
  • 09:34 akosiaris: poweroff acrux and acrab T277191
  • 09:22 marostegui@cumin1001: dbctl commit (dc=all): 'db1076 (re)pooling @ 100%: Slowly repool db1076', diff saved to https://phabricator.wikimedia.org/P14881 and previous config saved to /var/cache/conftool/dbconfig/20210316-092204-root.json
  • 09:21 marostegui@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 20%: Slowly repool db1162', diff saved to https://phabricator.wikimedia.org/P14880 and previous config saved to /var/cache/conftool/dbconfig/20210316-092106-root.json
  • 09:18 akosiaris: switch restbase-async to eqiad since the kubernetes codfw cluster is being reinitialized and it makes little sense to have it there while the callers will run in eqiad only
  • 09:15 akosiaris@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=restbase-async
  • 09:12 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=codfw,dnsdisc=restbase-async
  • 09:12 jayme@cumin1001: conftool action : set/pooled=false; selector: name=codfw,dnsdisc=wikifeeds
  • 09:12 jayme@cumin1001: conftool action : set/pooled=false; selector: name=codfw,dnsdisc=termbox
  • 09:12 jayme@cumin1001: conftool action : set/pooled=false; selector: name=codfw,dnsdisc=similar-users
  • 09:12 jayme@cumin1001: conftool action : set/pooled=false; selector: name=codfw,dnsdisc=sessionstore
  • 09:11 jayme@cumin1001: conftool action : set/pooled=false; selector: name=codfw,dnsdisc=recommendation-api
  • 09:11 jayme@cumin1001: conftool action : set/pooled=false; selector: name=codfw,dnsdisc=push-notifications
  • 09:11 jayme@cumin1001: conftool action : set/pooled=false; selector: name=codfw,dnsdisc=proton
  • 09:11 jayme@cumin1001: conftool action : set/pooled=false; selector: name=codfw,dnsdisc=mobileapps
  • 09:11 jayme@cumin1001: conftool action : set/pooled=false; selector: name=codfw,dnsdisc=mathoid
  • 09:11 jayme@cumin1001: conftool action : set/pooled=false; selector: name=codfw,dnsdisc=linkrecommendation
  • 09:11 jayme@cumin1001: conftool action : set/pooled=false; selector: name=codfw,dnsdisc=eventstreams-internal
  • 09:11 jayme@cumin1001: conftool action : set/pooled=false; selector: name=codfw,dnsdisc=eventstreams
  • 09:11 jayme@cumin1001: conftool action : set/pooled=false; selector: name=codfw,dnsdisc=eventgate-main
  • 09:11 jayme@cumin1001: conftool action : set/pooled=false; selector: name=codfw,dnsdisc=eventgate-logging-external
  • 09:10 jayme@cumin1001: conftool action : set/pooled=false; selector: name=codfw,dnsdisc=eventgate-analytics-external
  • 09:10 jayme@cumin1001: conftool action : set/pooled=false; selector: name=codfw,dnsdisc=eventgate-analytics
  • 09:10 jayme@cumin1001: conftool action : set/pooled=false; selector: name=codfw,dnsdisc=echostore
  • 09:10 jayme@cumin1001: conftool action : set/pooled=false; selector: name=codfw,dnsdisc=cxserver
  • 09:10 jayme@cumin1001: conftool action : set/pooled=false; selector: name=codfw,dnsdisc=citoid
  • 09:10 jayme@cumin1001: conftool action : set/pooled=false; selector: name=codfw,dnsdisc=blubberoid
  • 09:10 jayme@cumin1001: conftool action : set/pooled=false; selector: name=codfw,dnsdisc=api-gateway
  • 09:10 jayme@cumin1001: conftool action : set/pooled=false; selector: name=codfw,dnsdisc=apertium
  • 09:07 marostegui@cumin1001: dbctl commit (dc=all): 'db1076 (re)pooling @ 75%: Slowly repool db1076', diff saved to https://phabricator.wikimedia.org/P14879 and previous config saved to /var/cache/conftool/dbconfig/20210316-090701-root.json
  • 09:06 marostegui@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 15%: Slowly repool db1162', diff saved to https://phabricator.wikimedia.org/P14878 and previous config saved to /var/cache/conftool/dbconfig/20210316-090602-root.json
  • 09:05 jayme@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
  • 09:05 jayme@cumin1001: START - Cookbook sre.discovery.service-route
  • 09:05 jayme@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
  • 09:05 jayme@cumin1001: START - Cookbook sre.discovery.service-route
  • 09:05 jayme@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
  • 09:04 jayme@cumin1001: START - Cookbook sre.discovery.service-route
  • 09:04 jayme@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
  • 09:04 jayme@cumin1001: START - Cookbook sre.discovery.service-route
  • 09:04 jayme@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
  • 09:04 jayme@cumin1001: START - Cookbook sre.discovery.service-route
  • 09:04 jayme@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
  • 09:04 jayme@cumin1001: START - Cookbook sre.discovery.service-route
  • 09:04 jayme@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
  • 09:04 jayme@cumin1001: START - Cookbook sre.discovery.service-route
  • 09:04 jayme@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
  • 09:04 jayme@cumin1001: START - Cookbook sre.discovery.service-route
  • 09:03 jayme@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
  • 09:03 jayme@cumin1001: START - Cookbook sre.discovery.service-route
  • 09:03 jayme@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
  • 09:03 jayme@cumin1001: START - Cookbook sre.discovery.service-route
  • 09:03 jayme@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
  • 09:03 jayme@cumin1001: START - Cookbook sre.discovery.service-route
  • 09:03 jayme@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
  • 09:02 jayme@cumin1001: START - Cookbook sre.discovery.service-route
  • 09:02 jayme@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
  • 09:02 jayme@cumin1001: START - Cookbook sre.discovery.service-route
  • 09:02 jayme@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
  • 09:02 jayme@cumin1001: START - Cookbook sre.discovery.service-route
  • 09:02 jayme@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
  • 09:02 jayme@cumin1001: START - Cookbook sre.discovery.service-route
  • 09:02 jayme@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
  • 09:02 jayme@cumin1001: START - Cookbook sre.discovery.service-route
  • 09:02 jayme@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
  • 09:02 jayme@cumin1001: START - Cookbook sre.discovery.service-route
  • 09:01 jayme@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
  • 09:01 jayme@cumin1001: START - Cookbook sre.discovery.service-route
  • 09:01 jayme@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
  • 09:01 jayme@cumin1001: START - Cookbook sre.discovery.service-route
  • 09:01 jayme@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
  • 09:01 jayme@cumin1001: START - Cookbook sre.discovery.service-route
  • 09:01 jayme@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
  • 09:01 jayme@cumin1001: START - Cookbook sre.discovery.service-route
  • 09:01 jayme@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
  • 09:01 jayme@cumin1001: START - Cookbook sre.discovery.service-route
  • 09:00 jayme@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
  • 09:00 jayme@cumin1001: START - Cookbook sre.discovery.service-route
  • 09:00 jayme@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
  • 09:00 jayme@cumin1001: START - Cookbook sre.discovery.service-route
  • 09:00 jayme@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
  • 09:00 jayme@cumin1001: START - Cookbook sre.discovery.service-route
  • 09:00 jayme@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
  • 09:00 jayme@cumin1001: START - Cookbook sre.discovery.service-route
  • 08:59 akosiaris: starting the k8s codfw cluster reinitialization process
  • 08:59 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 18 hosts with reason: Reinitialize codfw k8s cluster with new etcd
  • 08:59 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 18 hosts with reason: Reinitialize codfw k8s cluster with new etcd
  • 08:57 jayme@cumin1001: END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
  • 08:56 jayme@cumin1001: START - Cookbook sre.discovery.service-route
  • 08:51 marostegui@cumin1001: dbctl commit (dc=all): 'db1076 (re)pooling @ 50%: Slowly repool db1076', diff saved to https://phabricator.wikimedia.org/P14877 and previous config saved to /var/cache/conftool/dbconfig/20210316-085157-root.json
  • 08:50 marostegui@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 10%: Slowly repool db1162', diff saved to https://phabricator.wikimedia.org/P14876 and previous config saved to /var/cache/conftool/dbconfig/20210316-085058-root.json
  • 08:47 marostegui: Check tables on db2150 db2120 T276742
  • 08:42 moritzm: remove Java 8 from contint/releases T269354
  • 08:36 marostegui@cumin1001: dbctl commit (dc=all): 'db1076 (re)pooling @ 25%: Slowly repool db1076', diff saved to https://phabricator.wikimedia.org/P14875 and previous config saved to /var/cache/conftool/dbconfig/20210316-083653-root.json
  • 08:35 marostegui@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 5%: Slowly repool db1162', diff saved to https://phabricator.wikimedia.org/P14874 and previous config saved to /var/cache/conftool/dbconfig/20210316-083555-root.json
  • 08:20 marostegui@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 2%: Slowly repool db1162', diff saved to https://phabricator.wikimedia.org/P14873 and previous config saved to /var/cache/conftool/dbconfig/20210316-082051-root.json
  • 08:18 godog: enable nick enforcing for logmsgbot - T276303
  • 08:05 marostegui@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 1%: Slowly repool db1162', diff saved to https://phabricator.wikimedia.org/P14872 and previous config saved to /var/cache/conftool/dbconfig/20210316-080547-root.json
  • 07:51 godog: swift eqiad-prod: less weight for ms-be[1019-1026] - T272836
  • 07:29 marostegui@cumin1001: dbctl commit (dc=all): 'db1136 (re)pooling @ 100%: Repool db1136', diff saved to https://phabricator.wikimedia.org/P14871 and previous config saved to /var/cache/conftool/dbconfig/20210316-072910-root.json
  • 07:14 marostegui@cumin1001: dbctl commit (dc=all): 'db1136 (re)pooling @ 75%: Repool db1136', diff saved to https://phabricator.wikimedia.org/P14870 and previous config saved to /var/cache/conftool/dbconfig/20210316-071407-root.json
  • 06:59 marostegui@cumin1001: dbctl commit (dc=all): 'db1136 (re)pooling @ 50%: Repool db1136', diff saved to https://phabricator.wikimedia.org/P14869 and previous config saved to /var/cache/conftool/dbconfig/20210316-065903-root.json
  • 06:58 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2148', diff saved to https://phabricator.wikimedia.org/P14868 and previous config saved to /var/cache/conftool/dbconfig/20210316-065840-marostegui.json
  • 06:58 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2108', diff saved to https://phabricator.wikimedia.org/P14867 and previous config saved to /var/cache/conftool/dbconfig/20210316-065814-marostegui.json
  • 06:52 marostegui: Stop MySQL on db2120 to clone db2150 - T275633
  • 06:51 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2120 T275633', diff saved to https://phabricator.wikimedia.org/P14865 and previous config saved to /var/cache/conftool/dbconfig/20210316-065148-marostegui.json
  • 06:44 marostegui@cumin1001: dbctl commit (dc=all): 'db1136 (re)pooling @ 25%: Repool db1136', diff saved to https://phabricator.wikimedia.org/P14864 and previous config saved to /var/cache/conftool/dbconfig/20210316-064358-root.json
  • 05:53 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1136.eqiad.wmnet with reason: REIMAGE
  • 05:51 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1136.eqiad.wmnet with reason: REIMAGE
  • 05:35 marostegui: Stop MySQL on db1076 to clone db1162 T258361
  • 05:35 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1076', diff saved to https://phabricator.wikimedia.org/P14862 and previous config saved to /var/cache/conftool/dbconfig/20210316-053516-marostegui.json

2021-03-15

  • 23:31 legoktm@deploy1002: Synchronized wmf-config/CommonSettings.php: Remove back-compat from when IRC feed servers was a string (T224579) (duration: 00m 59s)
  • 23:24 legoktm@deploy1002: Synchronized wmf-config/: Define IRC feed servers as an array in {Production,Labs}Services.php (T224579) (duration: 00m 59s)
  • 23:23 legoktm@deploy1002: Synchronized wmf-config/CommonSettings.php: Support having multiple IRC feed servers (T224579) (duration: 00m 58s)
  • 23:13 legoktm@deploy1002: conftool action : set/pooled=inactive; selector: name=mw2225.codfw.wmnet
  • 23:11 legoktm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: GlobalWatchlist: allow watching up to 50 sites (T276195) (duration: 01m 04s)
  • 21:36 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2239.codfw.wmnet
  • 21:36 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2238.codfw.wmnet
  • 21:36 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2237.codfw.wmnet
  • 21:35 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2236.codfw.wmnet
  • 21:02 ebernhardson@deploy1002: Finished deploy [wikimedia/discovery/analytics@4300929]: convert_to_esbulk: Accept partial hour timestamps (duration: 03m 02s)
  • 20:59 ebernhardson@deploy1002: Started deploy [wikimedia/discovery/analytics@4300929]: convert_to_esbulk: Accept partial hour timestamps
  • 20:55 legoktm: re-enabled puppet on kubestage2001, uncordoned kubestage2002
  • 20:23 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2225.codfw.wmnet
  • 19:57 ebernhardson@deploy1002: Finished deploy [wikimedia/discovery/analytics@82e0654]: prepare_mw_rev_score: Correct scores_export to bulk_ingest (duration: 01m 49s)
  • 19:55 ebernhardson@deploy1002: Started deploy [wikimedia/discovery/analytics@82e0654]: prepare_mw_rev_score: Correct scores_export to bulk_ingest
  • 19:53 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2225.codfw.wmnet
  • 19:53 dzahn@cumin1001: END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97) for hosts mw2224.codfw.wmnet
  • 19:53 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2224.codfw.wmnet
  • 19:43 eevans@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'echostore' for release 'production' .
  • 19:37 eevans@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'echostore' for release 'production' .
  • 19:27 eevans@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'staging' .
  • 18:56 dduvall@deploy1002: Synchronized .pipeline: config: Initial multiversion pipeline configuration pipeline: add building the webserver image (T274182) (duration: 00m 59s)
  • 18:55 dduvall@deploy1002: Synchronized multiversion/: config: Initial multiversion pipeline configuration pipeline: add building the webserver image (T274182) (duration: 00m 59s)
  • 18:46 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: e5a7284: Enable DiscussionTools for enwikibooks (T276851) (duration: 00m 59s)
  • 18:41 legoktm: puppet disabled on kubestage1001 for debugging docker-registry credentials
  • 18:38 urbanecm@deploy1002: Synchronized wmf-config/config/enwikibooks.yaml: b6a8df0: Enable visualeditor on enwikibooks by default (T276851; 2/2) (duration: 01m 00s)
  • 18:37 foks: removing 1 file from eowiki, for legal compliance
  • 18:35 urbanecm@deploy1002: Synchronized dblists/visualeditor-nondefault.dblist: b6a8df0: Enable visualeditor on enwikibooks by default (T276851; 1/2) (duration: 00m 58s)
  • 18:33 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: b70a75c: Configure default search namespaces for thwikisource (T275280) (duration: 00m 59s)
  • 18:18 hoo: Updated the Wikidata property suggester with data from the 2021-03-08 JSON dump (with pre-applied T132839 workarounds)
  • 18:17 urbanecm@deploy1002: Synchronized php-1.36.0-wmf.34/extensions/WikimediaEvents/modules/ext.wikimediaEvents/clientError.js: a7eb550: Use master version of clientError.js (duration: 00m 58s)
  • 18:13 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: a8234a9: Add deleterevision right to botadmin group on fawiki (T277358) (duration: 00m 59s)
  • 18:07 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2223.codfw.wmnet
  • 18:07 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2235.codfw.wmnet
  • 18:06 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2234.codfw.wmnet
  • 17:56 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2223.codfw.wmnet
  • 17:55 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2222.codfw.wmnet
  • 17:30 hnowlan: disabling puppet on aqs100[4-9].eqiad.wmnet to test change to password logic in puppet
  • 17:30 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2222.codfw.wmnet
  • 17:29 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw2223.codfw.wmnet
  • 17:29 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw2222.codfw.wmnet
  • 17:29 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw2221.codfw.wmnet
  • 17:28 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2221.codfw.wmnet
  • 17:03 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs1010.eqiad.wmnet with reason: New buster host
  • 17:03 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs1010.eqiad.wmnet with reason: New buster host
  • 16:59 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2221.codfw.wmnet
  • 16:58 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw2224.codfw.wmnet
  • 16:58 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw2220.codfw.wmnet
  • 16:56 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2220.codfw.wmnet
  • 16:55 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mw2224.codfw.wmnet
  • 16:48 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2224.codfw.wmnet
  • 16:42 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2220.codfw.wmnet
  • 16:38 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2233.codfw.wmnet
  • 16:38 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2232.codfw.wmnet
  • 16:38 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2231.codfw.wmnet
  • 16:29 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt-wdqs1002.eqiad.wmnet
  • 16:28 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt-wdqs1003.eqiad.wmnet
  • 16:27 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt-wdqs1001.eqiad.wmnet
  • 16:23 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudvirt-wdqs1003.eqiad.wmnet
  • 16:23 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudvirt-wdqs1002.eqiad.wmnet
  • 16:14 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2010.codfw.wmnet
  • 16:08 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti2010.codfw.wmnet
  • 16:06 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudvirt-wdqs1001.eqiad.wmnet
  • 16:05 moritzm: draining ganeti2010
  • 16:04 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2009.codfw.wmnet
  • 15:58 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti2009.codfw.wmnet
  • 15:48 moritzm: draining ganeti2009
  • 15:47 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2007.codfw.wmnet
  • 15:41 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ganeti2007.codfw.wmnet
  • 15:33 moritzm: draining ganeti2007
  • 15:27 aborrero@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw2001-dev.codfw.wmnet with reason: REIMAGE
  • 15:24 aborrero@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2001-dev.codfw.wmnet with reason: REIMAGE
  • 15:16 marostegui@cumin1001: dbctl commit (dc=all): 'db1074 (re)pooling @ 100%: Repool db1074', diff saved to https://phabricator.wikimedia.org/P14858 and previous config saved to /var/cache/conftool/dbconfig/20210315-151648-root.json
  • 15:16 aborrero@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw2002-dev.codfw.wmnet with reason: REIMAGE
  • 15:14 aborrero@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2002-dev.codfw.wmnet with reason: REIMAGE
  • 15:08 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1162.eqiad.wmnet with reason: REIMAGE
  • 15:06 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1162.eqiad.wmnet with reason: REIMAGE
  • 15:01 marostegui@cumin1001: dbctl commit (dc=all): 'db1074 (re)pooling @ 75%: Repool db1074', diff saved to https://phabricator.wikimedia.org/P14857 and previous config saved to /var/cache/conftool/dbconfig/20210315-150144-root.json
  • 14:46 marostegui@cumin1001: dbctl commit (dc=all): 'db1074 (re)pooling @ 50%: Repool db1074', diff saved to https://phabricator.wikimedia.org/P14856 and previous config saved to /var/cache/conftool/dbconfig/20210315-144641-root.json
  • 14:36 kharlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 14:36 kharlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 14:32 kharlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 14:32 kharlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 14:31 marostegui@cumin1001: dbctl commit (dc=all): 'db1074 (re)pooling @ 25%: Repool db1074', diff saved to https://phabricator.wikimedia.org/P14855 and previous config saved to /var/cache/conftool/dbconfig/20210315-143137-root.json
  • 14:28 kharlan@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 14:08 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1074', diff saved to https://phabricator.wikimedia.org/P14854 and previous config saved to /var/cache/conftool/dbconfig/20210315-140809-marostegui.json
  • 14:04 dcausse: re-pooling wdqs1005
  • 13:54 marostegui@cumin1001: dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 100%: Repool db1105:3312', diff saved to https://phabricator.wikimedia.org/P14853 and previous config saved to /var/cache/conftool/dbconfig/20210315-135426-root.json
  • 13:39 marostegui@cumin1001: dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 75%: Repool db1105:3312', diff saved to https://phabricator.wikimedia.org/P14852 and previous config saved to /var/cache/conftool/dbconfig/20210315-133921-root.json
  • 13:25 Urbanecm: Deploy security patch for T152394
  • 13:24 marostegui@cumin1001: dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 50%: Repool db1105:3312', diff saved to https://phabricator.wikimedia.org/P14851 and previous config saved to /var/cache/conftool/dbconfig/20210315-132418-root.json
  • 13:09 marostegui@cumin1001: dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 25%: Repool db1105:3312', diff saved to https://phabricator.wikimedia.org/P14849 and previous config saved to /var/cache/conftool/dbconfig/20210315-130914-root.json
  • 12:39 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1105:3312', diff saved to https://phabricator.wikimedia.org/P14848 and previous config saved to /var/cache/conftool/dbconfig/20210315-123930-marostegui.json
  • 12:32 urbanecm@deploy1002: Synchronized php-1.36.0-wmf.34/extensions/MobileFrontend/: 41a2aaa: Revert "Rewite MoveLeadParagraphTransform based on mobile apps approach" (T277302) (duration: 00m 58s)
  • 12:31 Lucas_WMDE: maintenance scripts for T270249 completed successfully, no more terms for deleted items found on stat1007
  • 12:30 urbanecm@deploy1002: Synchronized php-1.36.0-wmf.34/extensions/GrowthExperiments/: fa2abfa: Manual submodule update of GrowthExperiments repository (T276966) (duration: 00m 59s)
  • 12:29 Lucas_WMDE: RemoveDeletedItemsFromTermStore.php finished in 5m39s
  • 12:23 Lucas_WMDE: lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/Wikibase/repo/maintenance/RemoveDeletedItemsFromTermStore.php wikidatawiki --itemIds "$(sed -n 5555,9593p T270249.ids | tr '\n' ',' | sed 's/,$//')" # T270249, remaining 4039 items
  • 12:22 Lucas_WMDE: RemoveDeletedItemsFromTermStore.php finished in 8min
  • 12:19 _joe_: depooled mw1347 for testing
  • 12:13 Lucas_WMDE: lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/Wikibase/repo/maintenance/RemoveDeletedItemsFromTermStore.php wikidatawiki --itemIds "$(sed -n 555,5554p T270249.ids | tr '\n' ',' | sed 's/,$//')" # T270249, 5000 items
  • 12:12 Lucas_WMDE: finished in 43s
  • 12:11 Lucas_WMDE: lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/Wikibase/repo/maintenance/RemoveDeletedItemsFromTermStore.php wikidatawiki --itemIds "$(sed -n 55,554p T270249.ids | tr '\n' ',' | sed 's/,$//')" # T270249, 500 items
  • 12:10 Lucas_WMDE: finished in 5.1s
  • 12:10 Lucas_WMDE: lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/Wikibase/repo/maintenance/RemoveDeletedItemsFromTermStore.php wikidatawiki --itemIds "$(sed -n 5,54p T270249.ids | tr '\n' ',' | sed 's/,$//')" # T270249, 50 items
  • 11:58 marostegui@cumin1001: dbctl commit (dc=all): 'db1129 (re)pooling @ 100%: Repool db1129', diff saved to https://phabricator.wikimedia.org/P14847 and previous config saved to /var/cache/conftool/dbconfig/20210315-115826-root.json
  • 11:51 jdrewniak@deploy1002: Synchronized portals: Wikimedia Portals Update: Bumping portals to master (T128546) (duration: 00m 58s)
  • 11:50 jdrewniak@deploy1002: Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: Bumping portals to master (T128546) (duration: 00m 59s)
  • 11:43 marostegui@cumin1001: dbctl commit (dc=all): 'db1129 (re)pooling @ 75%: Repool db1129', diff saved to https://phabricator.wikimedia.org/P14846 and previous config saved to /var/cache/conftool/dbconfig/20210315-114323-root.json
  • 11:34 moritzm: restarting FPM on mw canaries to pick up new libtiff
  • 11:30 aborrero@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw2002-dev.codfw.wmnet with reason: REIMAGE
  • 11:28 aborrero@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2002-dev.codfw.wmnet with reason: REIMAGE
  • 11:28 marostegui@cumin1001: dbctl commit (dc=all): 'db1129 (re)pooling @ 50%: Repool db1129', diff saved to https://phabricator.wikimedia.org/P14844 and previous config saved to /var/cache/conftool/dbconfig/20210315-112819-root.json
  • 11:22 moritzm: installing tiff security updates
  • 11:17 moritzm: installing golang-1.7 security updates
  • 11:13 marostegui@cumin1001: dbctl commit (dc=all): 'db1129 (re)pooling @ 25%: Repool db1129', diff saved to https://phabricator.wikimedia.org/P14843 and previous config saved to /var/cache/conftool/dbconfig/20210315-111315-root.json
  • 11:00 volans: upgraded spicerack on cumin1001 to 0.0.49-1+deb10u1
  • 10:58 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1129', diff saved to https://phabricator.wikimedia.org/P14842 and previous config saved to /var/cache/conftool/dbconfig/20210315-105855-marostegui.json
  • 10:58 marostegui@cumin1001: dbctl commit (dc=all): 'db1076 (re)pooling @ 100%: Repool db1076', diff saved to https://phabricator.wikimedia.org/P14841 and previous config saved to /var/cache/conftool/dbconfig/20210315-105820-root.json
  • 10:56 volans@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on cumin2001.codfw.wmnet with reason: test
  • 10:55 volans@cumin2001: START - Cookbook sre.hosts.downtime for 0:05:00 on cumin2001.codfw.wmnet with reason: test
  • 10:43 marostegui@cumin1001: dbctl commit (dc=all): 'db1076 (re)pooling @ 75%: Repool db1076', diff saved to https://phabricator.wikimedia.org/P14840 and previous config saved to /var/cache/conftool/dbconfig/20210315-104316-root.json
  • 10:42 moritzm: installing pygments security updates on buster
  • 10:33 volans: upgraded spicerack on cumin2001 to 0.0.49-1+deb10u1
  • 10:28 marostegui@cumin1001: dbctl commit (dc=all): 'db1076 (re)pooling @ 50%: Repool db1076', diff saved to https://phabricator.wikimedia.org/P14839 and previous config saved to /var/cache/conftool/dbconfig/20210315-102813-root.json
  • 10:26 kormat@cumin1001: dbctl commit (dc=all): 'db1114 (re)pooling @ 100%: schema change T267767', diff saved to https://phabricator.wikimedia.org/P14838 and previous config saved to /var/cache/conftool/dbconfig/20210315-102648-kormat.json
  • 10:13 marostegui@cumin1001: dbctl commit (dc=all): 'db1076 (re)pooling @ 25%: Repool db1076', diff saved to https://phabricator.wikimedia.org/P14837 and previous config saved to /var/cache/conftool/dbconfig/20210315-101309-root.json
  • 10:11 kormat@cumin1001: dbctl commit (dc=all): 'db1114 (re)pooling @ 50%: schema change T267767', diff saved to https://phabricator.wikimedia.org/P14836 and previous config saved to /var/cache/conftool/dbconfig/20210315-101143-kormat.json
  • 10:03 kormat@cumin1001: dbctl commit (dc=all): 'db1114 depooling: schema change T267767', diff saved to https://phabricator.wikimedia.org/P14835 and previous config saved to /var/cache/conftool/dbconfig/20210315-100337-kormat.json
  • 10:03 kormat@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1114.eqiad.wmnet with reason: schema change T267767
  • 10:02 kormat@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on db1114.eqiad.wmnet with reason: schema change T267767
  • 09:56 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1076', diff saved to https://phabricator.wikimedia.org/P14834 and previous config saved to /var/cache/conftool/dbconfig/20210315-095607-marostegui.json
  • 09:49 marostegui@cumin1001: dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 100%: Repool db1146:3312', diff saved to https://phabricator.wikimedia.org/P14833 and previous config saved to /var/cache/conftool/dbconfig/20210315-094920-root.json
  • 09:34 marostegui@cumin1001: dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 75%: Repool db1146:3312', diff saved to https://phabricator.wikimedia.org/P14832 and previous config saved to /var/cache/conftool/dbconfig/20210315-093416-root.json
  • 09:23 vgutierrez: rolling restart of LVS cluster to bump depool_threshold to 0.8 on text & upload clusters - T274888
  • 09:19 marostegui@cumin1001: dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 50%: Repool db1146:3312', diff saved to https://phabricator.wikimedia.org/P14831 and previous config saved to /var/cache/conftool/dbconfig/20210315-091912-root.json
  • 09:04 marostegui@cumin1001: dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 25%: Repool db1146:3312', diff saved to https://phabricator.wikimedia.org/P14830 and previous config saved to /var/cache/conftool/dbconfig/20210315-090409-root.json
  • 08:54 marostegui: Stop MySQL on db1136 T277007
  • 08:54 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1136 T277007', diff saved to https://phabricator.wikimedia.org/P14829 and previous config saved to /var/cache/conftool/dbconfig/20210315-085409-marostegui.json
  • 08:35 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1146:3312', diff saved to https://phabricator.wikimedia.org/P14828 and previous config saved to /var/cache/conftool/dbconfig/20210315-083555-marostegui.json
  • 08:33 godog: swift eqiad-prod remove decom hosts from account/container rings - T272836 T276193
  • 08:33 marostegui: Repool labsdb1009 T276980
  • 07:22 elukey: powercycle ms-be1038 - no ssh, no tty available in mgmt serial console, irrecoverable error saved in ilo's system logs
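
Note on the 12:10 to 12:23 `RemoveDeletedItemsFromTermStore.php` entries above: each run builds its `--itemIds` argument by selecting a 1-indexed, inclusive line range from `T270249.ids` with `sed -n`, joining the lines with commas, and stripping the trailing comma. A minimal Python sketch of the same selection follows. The function name `join_id_range` and the sample IDs are hypothetical; only the line ranges come from the log.

```python
def join_id_range(ids, first, last):
    """Join lines first..last (1-indexed, inclusive) with commas.

    Mirrors: sed -n 'first,lastp' FILE | tr '\n' ',' | sed 's/,$//'
    """
    return ",".join(line.rstrip("\n") for line in ids[first - 1:last])


# Line ranges used by the four logged runs.
batches = [(5, 54), (55, 554), (555, 5554), (5555, 9593)]

# Each inclusive range holds last - first + 1 IDs, which matches the
# batch sizes noted in the log comments (50, 500, 5000, 4039).
sizes = [last - first + 1 for first, last in batches]

# Hypothetical stand-in for the contents of T270249.ids.
sample_ids = [f"Q{n}\n" for n in range(1, 11)]
batch = join_id_range(sample_ids, 5, 7)
print(batch)
```

The ranges do not overlap, and each batch is roughly ten times larger than the one before it until the final run takes the remainder.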

2021-03-14

  • 17:57 marostegui@cumin1001: dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 100%: Repool db1146:3314', diff saved to https://phabricator.wikimedia.org/P14827 and previous config saved to /var/cache/conftool/dbconfig/20210314-175751-root.json
  • 17:42 marostegui@cumin1001: dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 75%: Repool db1146:3314', diff saved to https://phabricator.wikimedia.org/P14826 and previous config saved to /var/cache/conftool/dbconfig/20210314-174248-root.json
  • 17:27 marostegui@cumin1001: dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 50%: Repool db1146:3314', diff saved to https://phabricator.wikimedia.org/P14825 and previous config saved to /var/cache/conftool/dbconfig/20210314-172744-root.json
  • 17:12 marostegui@cumin1001: dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 25%: Repool db1146:3314', diff saved to https://phabricator.wikimedia.org/P14824 and previous config saved to /var/cache/conftool/dbconfig/20210314-171240-root.json
  • 14:43 gehel: depool wdqs1005 and restart blazegraph - will keep depooled until this server has caught up on lag

2021-03-13

  • 19:02 Amir1: change default charset of all core tables in labstestwiki to binary (T269348)
  • 18:53 Amir1: run schema changes for varbinary on wikitech (T269348)
  • 17:38 twentyafterfour: restarted apache on gerrit1001 to resolve apache worker exhaustion see T277127
  • 16:57 Reedy: gerrit web interface is slow/timing out
  • 01:18 ryankemper: T266470 Re-enabled icinga service notifications for `Check no envoy runtime configuration is left persistent` on `wdqs100[9,10]`
  • 01:04 ryankemper: T266470 merged https://gerrit.wikimedia.org/r/c/operations/dns/+/668255 && `ryankemper@authdns1001:~$ sudo authdns-update`
  • 00:55 mutante: [wdqs1009:/etc/envoy] $ sudo /usr/local/sbin/build-envoy-config -c /etc/envoy/

2021-03-12

  • 22:53 ryankemper: T266470 Manually disabled service notifications for `Check no envoy runtime configuration is left persistent`, will need to circle back on Monday to restore notifications
  • 22:10 legoktm: imported mailman-puppetmaster.mailman.eqiad1.wikimedia.cloud facts to puppet-compiler
  • 21:52 mutante: puppetmaster1001 sudo puppet cert clean testreduce.discovery.wmnet (T266509)
  • 21:15 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2219.codfw.wmnet
  • 20:49 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2219.codfw.wmnet
  • 20:48 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2218.codfw.wmnet
  • 20:32 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2218.codfw.wmnet
  • 20:32 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2217.codfw.wmnet
  • 20:22 eevans@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'sessionstore' for release 'staging' .
  • 20:15 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2217.codfw.wmnet
  • 20:14 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw2219.codfw.wmnet
  • 20:14 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw2218.codfw.wmnet
  • 20:14 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw2217.codfw.wmnet
  • 19:47 dzahn@cumin1001: conftool action : set/weight=1; selector: name=mw2376.codfw.wmnet,service=canary
  • 19:47 dzahn@cumin1001: conftool action : set/weight=1; selector: name=mw2374.codfw.wmnet,service=canary
  • 19:47 ebernhardson: start in-place reindex testwiki in eqiad, codfw, cloudelastic cirrus clusters for T269493
  • 19:45 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw2374.codfw.wmnet
  • 19:41 mutante: mw2374, mw2376 - depooling to turn them into canaries
  • 19:41 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2376.codfw.wmnet
  • 19:41 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2374.codfw.wmnet
  • 19:09 cstone: tools revision changed from 532f8ecb33 to b7b4060c30
  • 18:28 bblack: authdns1001.wikimedia.org,dns2001.wikimedia.org - upgrade gdnsd to 3.6.0 (half the servers have been on this for a couple weeks now, just finishing up the rollout)
  • 18:24 bblack: dns[15]001.wikimedia.org - upgrade gdnsd to 3.6.0 (half the servers have been on this for a couple weeks now, just finishing up the rollout)
  • 18:21 bblack: dns[34]001.wikimedia.org - upgrade gdnsd to 3.6.0 (half the servers have been on this for a couple weeks now, just finishing up the rollout)
  • 18:03 mutante: depooling mw2244,mw2245 (API on old hardware), mw2229,mw2230 (app on old hardware) - T277119
  • 18:02 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2245.codfw.wmnet
  • 18:01 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2244.codfw.wmnet
  • 18:00 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2230.codfw.wmnet
  • 18:00 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2229.codfw.wmnet
  • 17:00 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs1010.eqiad.wmnet with reason: New buster host
  • 17:00 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs1010.eqiad.wmnet with reason: New buster host
  • 16:05 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 16:01 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
  • 14:56 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 14:50 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
  • 14:34 marostegui@cumin1001: dbctl commit (dc=all): 'db1170:3312 (re)pooling @ 100%: Repool db1170:3312 after schema change', diff saved to https://phabricator.wikimedia.org/P14818 and previous config saved to /var/cache/conftool/dbconfig/20210312-143450-root.json
  • 14:19 marostegui@cumin1001: dbctl commit (dc=all): 'db1170:3312 (re)pooling @ 75%: Repool db1170:3312 after schema change', diff saved to https://phabricator.wikimedia.org/P14817 and previous config saved to /var/cache/conftool/dbconfig/20210312-141947-root.json
  • 14:04 marostegui@cumin1001: dbctl commit (dc=all): 'db1170:3312 (re)pooling @ 50%: Repool db1170:3312 after schema change', diff saved to https://phabricator.wikimedia.org/P14816 and previous config saved to /var/cache/conftool/dbconfig/20210312-140443-root.json
  • 13:49 marostegui@cumin1001: dbctl commit (dc=all): 'db1170:3312 (re)pooling @ 25%: Repool db1170:3312 after schema change', diff saved to https://phabricator.wikimedia.org/P14815 and previous config saved to /var/cache/conftool/dbconfig/20210312-134940-root.json
  • 13:24 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1088.eqiad.wmnet
  • 13:14 marostegui@cumin1001: START - Cookbook sre.hosts.decommission for hosts db1088.eqiad.wmnet
  • 13:10 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1170:3312', diff saved to https://phabricator.wikimedia.org/P14814 and previous config saved to /var/cache/conftool/dbconfig/20210312-131033-marostegui.json
  • 12:12 vgutierrez: restart ats-tls on cp3051
  • 11:55 effie: upgrade memcached on mc1022, mc2022
  • 11:22 hnowlan: corrected git_server for logstash-logback-encoder, cassandra/twcs and cassandra/metrics-collector on deploy1002
  • 09:45 jayme@deploy1002: helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'.
  • 09:45 jayme@deploy1002: helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
  • 09:44 jayme@deploy1002: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 09:43 jayme@deploy1002: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 09:28 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mx1001.wikimedia.org
  • 09:25 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host mx1001.wikimedia.org
  • 09:16 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mx2001.wikimedia.org
  • 09:11 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host mx2001.wikimedia.org
  • 09:07 zpapierski@deploy1002: Finished deploy [wikimedia/discovery/analytics@9a408b2]: T273847 export queries to relforge dag deployment - elastic-template handling (duration: 01m 35s)
  • 09:05 zpapierski@deploy1002: Started deploy [wikimedia/discovery/analytics@9a408b2]: T273847 export queries to relforge dag deployment - elastic-template handling
  • 09:00 zpapierski@deploy1002: Finished deploy [wikimedia/discovery/analytics@9a408b2]: T273847 export queries to relforge dag deployment - elastic-template handling (duration: 00m 09s)
  • 09:00 zpapierski@deploy1002: Started deploy [wikimedia/discovery/analytics@9a408b2]: T273847 export queries to relforge dag deployment - elastic-template handling
  • 08:59 zpapierski@deploy1002: Finished deploy [wikimedia/discovery/analytics@9a408b2]: T273847 export queries to relforge dag deployment - elastic-template handling (duration: 00m 10s)
  • 08:59 zpapierski@deploy1002: Started deploy [wikimedia/discovery/analytics@9a408b2]: T273847 export queries to relforge dag deployment - elastic-template handling
  • 08:52 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pybal-test2003.codfw.wmnet
  • 08:49 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host pybal-test2003.codfw.wmnet
  • 08:47 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pybal-test2002.codfw.wmnet
  • 08:44 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host pybal-test2002.codfw.wmnet
  • 08:35 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host scandium.eqiad.wmnet
  • 08:20 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host scandium.eqiad.wmnet
  • 08:01 moritzm: installing openjpeg2 security updates
  • 07:16 marostegui: Stop mysql on db2108 to clone db2148
  • 07:16 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2108 T276742', diff saved to https://phabricator.wikimedia.org/P14811 and previous config saved to /var/cache/conftool/dbconfig/20210312-071628-marostegui.json
  • 07:14 marostegui@cumin1001: dbctl commit (dc=all): 'db1082 (re)pooling @ 100%: Repool db1082 after schema change', diff saved to https://phabricator.wikimedia.org/P14810 and previous config saved to /var/cache/conftool/dbconfig/20210312-071400-root.json
  • 07:02 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2148 T276742', diff saved to https://phabricator.wikimedia.org/P14809 and previous config saved to /var/cache/conftool/dbconfig/20210312-070219-marostegui.json
  • 06:58 marostegui@cumin1001: dbctl commit (dc=all): 'db1082 (re)pooling @ 60%: Repool db1082 after schema change', diff saved to https://phabricator.wikimedia.org/P14808 and previous config saved to /var/cache/conftool/dbconfig/20210312-065857-root.json
  • 06:50 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1146:3314 for table checking T276742', diff saved to https://phabricator.wikimedia.org/P14807 and previous config saved to /var/cache/conftool/dbconfig/20210312-065008-marostegui.json
  • 06:43 marostegui@cumin1001: dbctl commit (dc=all): 'db1082 (re)pooling @ 30%: Repool db1082 after schema change', diff saved to https://phabricator.wikimedia.org/P14806 and previous config saved to /var/cache/conftool/dbconfig/20210312-064353-root.json
  • 06:30 marostegui: Deploy schema change on s2 codfw master, lag will appear - T276150 T276156
  • 06:28 marostegui@cumin1001: dbctl commit (dc=all): 'db1082 (re)pooling @ 10%: Repool db1082 after schema change', diff saved to https://phabricator.wikimedia.org/P14805 and previous config saved to /var/cache/conftool/dbconfig/20210312-062850-root.json
  • 06:13 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1082 for schema change', diff saved to https://phabricator.wikimedia.org/P14804 and previous config saved to /var/cache/conftool/dbconfig/20210312-061306-marostegui.json
  • 06:11 marostegui@cumin1001: dbctl commit (dc=all): 'Remove db1088 from dbctl T276025', diff saved to https://phabricator.wikimedia.org/P14803 and previous config saved to /var/cache/conftool/dbconfig/20210312-061118-marostegui.json
  • 04:14 eileen: tools revision changed from d64b2f8cee to 532f8ecb33
  • 01:30 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2215.codfw.wmnet
  • 00:58 mutante: shutting down mw2215
  • 00:57 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2215.codfw.wmnet

2021-03-11

  • 22:55 mutante: depooled mw2224 through mw2228 but not removing from DSH groups yet (T277119)
  • 22:55 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2228.codfw.wmnet
  • 22:55 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2227.codfw.wmnet
  • 22:55 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2226.codfw.wmnet
  • 22:54 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2225.codfw.wmnet
  • 22:54 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2224.codfw.wmnet
  • 22:50 dzahn@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 22:48 dzahn@cumin1001: START - Cookbook sre.dns.netbox
  • 22:47 mutante: running DNS cookbook in an attempt to remove mw2216
  • 22:47 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mw2216.codfw.wmnet
  • 22:41 brennen@deploy1002: rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.34
  • 22:36 brennen: train status: 1.36.0-wmf.34 (T274938): T277229 and T266517 related issues hopefully resolved, rolling forward to all wikis
  • 22:34 brennen@deploy1002: Synchronized php-1.36.0-wmf.34/extensions/WikimediaEvents/modules/ext.wikimediaEvents/clientError.js: Backport: Do not log script errors without file uri (T266517) (duration: 01m 07s)
  • 22:33 dzahn@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 22:30 brennen@deploy1002: Synchronized php-1.36.0-wmf.34/extensions/MobileFrontend/includes/: Backport: Revert "Fix: Save user options only once when Advanced Mode is toggled" (T277229) (duration: 01m 09s)
  • 22:28 dzahn@cumin1001: START - Cookbook sre.dns.netbox
  • 21:57 Amir1: run populate pages in cognate (T259360)
  • 21:28 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2222.codfw.wmnet
  • 21:27 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2223.codfw.wmnet
  • 21:27 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2221.codfw.wmnet
  • 21:27 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2220.codfw.wmnet
  • 21:21 brennen@deploy1002: rebuilt and synchronized wikiversions files: Revert "all wikis to 1.36.0-wmf.34"
  • 21:20 brennen: train status: 1.36.0-wmf.34 (T274938): rolling back to group1 and marking T277229 a train blocker
  • 21:17 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1003.eqiad.wmnet with reason: REIMAGE
  • 21:15 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on backup1003.eqiad.wmnet with reason: REIMAGE
  • 21:14 tgr@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Enable GrowthExperiments link recommendations on testwiki (T277173) (gerrit:670858) (duration: 00m 59s)
  • 21:13 zpapierski@deploy1002: Finished deploy [wikimedia/discovery/analytics@3810277]: T273847 export queries to relforge dag deployment - correct start date (duration: 01m 53s)
  • 21:12 zpapierski@deploy1002: Started deploy [wikimedia/discovery/analytics@3810277]: T273847 export queries to relforge dag deployment - correct start date
  • 21:05 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2216.codfw.wmnet
  • 21:04 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts mw2215.codfw.wmnet
  • 21:03 otto@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
  • 21:03 otto@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
  • 21:03 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mw2215.codfw.wmnet
  • 21:00 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on mw2216.codfw.wmnet with reason: decom
  • 21:00 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on mw2216.codfw.wmnet with reason: decom
  • 21:00 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on mw2215.codfw.wmnet with reason: decom
  • 21:00 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on mw2215.codfw.wmnet with reason: decom
  • 21:00 otto@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
  • 21:00 otto@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
  • 20:58 mutante: deactivating codfw API canaries on old hardware (T277119)
  • 20:57 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw2216.codfw.wmnet
  • 20:57 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw2215.codfw.wmnet
  • 20:50 otto@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
  • 20:46 zpapierski@deploy1002: Finished deploy [wikimedia/discovery/analytics@cc478d4]: T273847 export queries to relforge dag deployment (duration: 02m 09s)
  • 20:44 zpapierski@deploy1002: Started deploy [wikimedia/discovery/analytics@cc478d4]: T273847 export queries to relforge dag deployment
  • 20:35 otto@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' .
  • 20:33 otto@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' .
  • 20:28 otto@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' .
  • 20:20 mutante: phab1001 - systemctl start phabricator_clean_tmp_files - now Succeeded
  • 20:17 razzi@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host matomo1002.eqiad.wmnet
  • 20:13 razzi@cumin1001: START - Cookbook sre.hosts.reboot-single for host matomo1002.eqiad.wmnet
  • 20:04 brennen@deploy1002: rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.34
  • 19:59 mutante: phab1001 - sudo systemctl start phabricator_clean_tmp_files (manually run after conversion from cron to timer, and it fails with permission issues)
  • 19:55 tgr_: T277173 running mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=testwiki GrowthExperiments
  • 19:54 tgr@deploy1002: Synchronized wmf-config/: Config: Configure GrowthExperiments Add Link settings, step 2 (T277173) (duration: 01m 08s)
  • 19:43 robh@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 19:30 tgr@deploy1002: Synchronized wmf-config/: Config: Configure GrowthExperiments Add Link settings, step 1 (T277173) (duration: 01m 08s)
  • 19:18 tgr@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: wikitech: enable BetaFeatures (T125941) (duration: 01m 08s)
  • 19:13 hnowlan@deploy1002: Finished deploy [restbase/deploy@6f0fe23]: Remove internal ratelimits that were causing service proxy issues (duration: 16m 25s)
  • 18:56 hnowlan@deploy1002: Started deploy [restbase/deploy@6f0fe23]: Remove internal ratelimits that were causing service proxy issues
  • 18:47 tgr_: running mwscript extensions/GrowthExperiments/maintenance/importOresTopics.php testwiki --count 1000 --verbose --wikiId enwiki --apiUrl 'https://en.wikipedia.org/w/api.php'
  • 17:31 effie: install memcached 1.6.6-1 on mwdebug1001
  • 16:26 effie: upgrade memcached on mc1021, mc2021
  • 16:20 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 16:14 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
  • 16:11 marostegui@cumin1001: dbctl commit (dc=all): 'db1110 (re)pooling @ 100%: Repool db1110 after schema change', diff saved to https://phabricator.wikimedia.org/P14802 and previous config saved to /var/cache/conftool/dbconfig/20210311-161138-root.json
  • 15:56 marostegui@cumin1001: dbctl commit (dc=all): 'db1110 (re)pooling @ 60%: Repool db1110 after schema change', diff saved to https://phabricator.wikimedia.org/P14801 and previous config saved to /var/cache/conftool/dbconfig/20210311-155635-root.json
  • 15:53 cmjohnson1: updating firmware wdqs1009 T274751
  • 15:41 marostegui@cumin1001: dbctl commit (dc=all): 'db1110 (re)pooling @ 30%: Repool db1110 after schema change', diff saved to https://phabricator.wikimedia.org/P14800 and previous config saved to /var/cache/conftool/dbconfig/20210311-154131-root.json
  • 15:35 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 15:30 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
  • 15:26 marostegui@cumin1001: dbctl commit (dc=all): 'db1110 (re)pooling @ 10%: Repool db1110 after schema change', diff saved to https://phabricator.wikimedia.org/P14799 and previous config saved to /var/cache/conftool/dbconfig/20210311-152627-root.json
  • 15:14 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1110 for schema change', diff saved to https://phabricator.wikimedia.org/P14798 and previous config saved to /var/cache/conftool/dbconfig/20210311-151435-marostegui.json
  • 15:07 marostegui@cumin1001: dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 100%: Repool db1113:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P14797 and previous config saved to /var/cache/conftool/dbconfig/20210311-150707-root.json
  • 14:55 klausman: restarting pybal on lvs2009 T272918
  • 14:52 marostegui@cumin1001: dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 60%: Repool db1113:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P14796 and previous config saved to /var/cache/conftool/dbconfig/20210311-145204-root.json
  • 14:50 klausman: restarting pybal on lvs1016 T272918
  • 14:49 klausman: restarting pybal on lvs2010 T272918
  • 14:46 moritzm: installing openssl (1.1) security updates for stretch
  • 14:37 marostegui@cumin1001: dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 30%: Repool db1113:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P14795 and previous config saved to /var/cache/conftool/dbconfig/20210311-143700-root.json
  • 14:21 marostegui@cumin1001: dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 10%: Repool db1113:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P14794 and previous config saved to /var/cache/conftool/dbconfig/20210311-142157-root.json
  • 14:07 godog: swift eqiad-prod: decrease weight for SSDs on ms-be[1019-1026] - T272836
  • 14:05 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1113:3315 for schema change', diff saved to https://phabricator.wikimedia.org/P14793 and previous config saved to /var/cache/conftool/dbconfig/20210311-140526-marostegui.json
  • 14:03 marostegui@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 100%: Repool db1130 after schema change', diff saved to https://phabricator.wikimedia.org/P14792 and previous config saved to /var/cache/conftool/dbconfig/20210311-140328-root.json
  • 14:01 marostegui@cumin1001: dbctl commit (dc=all): 'Pool db2149 into s3', diff saved to https://phabricator.wikimedia.org/P14791 and previous config saved to /var/cache/conftool/dbconfig/20210311-140119-marostegui.json
  • 13:56 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1063.eqiad.wmnet
  • 13:48 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1063.eqiad.wmnet
  • 13:48 marostegui@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 60%: Repool db1130 after schema change', diff saved to https://phabricator.wikimedia.org/P14790 and previous config saved to /var/cache/conftool/dbconfig/20210311-134825-root.json
  • 13:47 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1062.eqiad.wmnet
  • 13:39 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 13:39 jayme@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 13:39 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1062.eqiad.wmnet
  • 13:36 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1061.eqiad.wmnet
  • 13:33 jayme@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 13:33 jayme@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 13:33 marostegui@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 30%: Repool db1130 after schema change', diff saved to https://phabricator.wikimedia.org/P14789 and previous config saved to /var/cache/conftool/dbconfig/20210311-133321-root.json
  • 13:18 marostegui@cumin1001: dbctl commit (dc=all): 'db1130 (re)pooling @ 10%: Repool db1130 after schema change', diff saved to https://phabricator.wikimedia.org/P14788 and previous config saved to /var/cache/conftool/dbconfig/20210311-131818-root.json
  • 13:04 moritzm: installing openssl1.0 security updates on stretch
  • 13:03 arturo: copy python-mwclient 0.8.4-1 from stretch-wikimedia to buster-wikimedia for T275865
  • 13:02 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1130 for schema change', diff saved to https://phabricator.wikimedia.org/P14787 and previous config saved to /var/cache/conftool/dbconfig/20210311-130208-marostegui.json
  • 13:01 marostegui@cumin1001: dbctl commit (dc=all): 'db1144:3315 (re)pooling @ 100%: Repool db1144:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P14786 and previous config saved to /var/cache/conftool/dbconfig/20210311-130103-root.json
  • 13:00 hnowlan: imported cassandra_2.2.6-wmf5 to buster-wikimedia
  • 12:49 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1061.eqiad.wmnet
  • 12:46 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1060.eqiad.wmnet
  • 12:46 marostegui@cumin1001: dbctl commit (dc=all): 'db1144:3315 (re)pooling @ 60%: Repool db1144:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P14785 and previous config saved to /var/cache/conftool/dbconfig/20210311-124559-root.json
  • 12:39 hnowlan: imported cassandra_2.2.6-wmf1 to buster-wikimedia
  • 12:37 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1060.eqiad.wmnet
  • 12:33 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1059.eqiad.wmnet
  • 12:30 marostegui@cumin1001: dbctl commit (dc=all): 'db1144:3315 (re)pooling @ 30%: Repool db1144:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P14783 and previous config saved to /var/cache/conftool/dbconfig/20210311-123056-root.json
  • 12:30 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1059.eqiad.wmnet
  • 12:27 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1058.eqiad.wmnet
  • 12:22 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1058.eqiad.wmnet
  • 12:22 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1057.eqiad.wmnet
  • 12:17 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1057.eqiad.wmnet
  • 12:17 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1056.eqiad.wmnet
  • 12:16 Lucas_WMDE: EU backport&config window done
  • 12:15 marostegui@cumin1001: dbctl commit (dc=all): 'db1144:3315 (re)pooling @ 10%: Repool db1144:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P14782 and previous config saved to /var/cache/conftool/dbconfig/20210311-121552-root.json
  • 12:13 Lucas_WMDE: lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/RemoveDeletedItemsFromTermStore.php wikidatawiki --itemIds 581768,739279,774383,852302 # T270249, finished in 1.124s
  • 12:12 Lucas_WMDE: finished in 1.124s real time
  • 12:12 Lucas_WMDE: start of lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/Wikibase/repo/maintenance/RemoveDeletedItemsFromTermStore.php wikidatawiki --itemIds 581768,739279,774383,852302
  • 12:10 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/LabsServices.php: Config: Update comment for irc.beta.wmflabs.org (T277081) (comment-only beta-only change) (duration: 01m 13s)
  • 12:08 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1056.eqiad.wmnet
  • 12:07 lucaswerkmeister-wmde@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Fix obsolete comments on wgCheckUserLogLogins (T253802) (duration: 01m 08s)
  • 12:07 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1055.eqiad.wmnet
  • 12:05 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1144:3315 for schema change', diff saved to https://phabricator.wikimedia.org/P14781 and previous config saved to /var/cache/conftool/dbconfig/20210311-120554-marostegui.json
  • 12:00 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1055.eqiad.wmnet
  • 11:59 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1054.eqiad.wmnet
  • 11:53 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1054.eqiad.wmnet
  • 11:53 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1053.eqiad.wmnet
  • 11:49 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1053.eqiad.wmnet
  • 11:49 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1052.eqiad.wmnet
  • 11:43 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1052.eqiad.wmnet
  • 11:42 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1051.eqiad.wmnet
  • 11:37 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1051.eqiad.wmnet
  • 11:37 klausman@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 11:37 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1050.eqiad.wmnet
  • 11:35 klausman@cumin1001: START - Cookbook sre.dns.netbox
  • 11:34 klausman@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 11:31 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1050.eqiad.wmnet
  • 11:31 klausman@cumin1001: START - Cookbook sre.dns.netbox
  • 11:31 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1049.eqiad.wmnet
  • 11:27 marostegui@cumin1001: dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 100%: Repool db1096:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P14778 and previous config saved to /var/cache/conftool/dbconfig/20210311-112747-root.json
  • 11:26 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1049.eqiad.wmnet
  • 11:26 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1048.eqiad.wmnet
  • 11:21 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1048.eqiad.wmnet
  • 11:16 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1047.eqiad.wmnet
  • 11:12 marostegui@cumin1001: dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 60%: Repool db1096:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P14777 and previous config saved to /var/cache/conftool/dbconfig/20210311-111243-root.json
  • 11:11 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1047.eqiad.wmnet
  • 11:04 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1046.eqiad.wmnet
  • 10:59 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1046.eqiad.wmnet
  • 10:58 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1045.eqiad.wmnet
  • 10:57 marostegui@cumin1001: dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 30%: Repool db1096:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P14776 and previous config saved to /var/cache/conftool/dbconfig/20210311-105740-root.json
  • 10:54 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1045.eqiad.wmnet
  • 10:54 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1044.eqiad.wmnet
  • 10:49 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1044.eqiad.wmnet
  • 10:42 marostegui@cumin1001: dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 10%: Repool db1096:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P14775 and previous config saved to /var/cache/conftool/dbconfig/20210311-104236-root.json
  • 10:42 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1043.eqiad.wmnet
  • 10:35 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1043.eqiad.wmnet
  • 10:35 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1042.eqiad.wmnet
  • 10:30 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1042.eqiad.wmnet
  • 10:29 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1041.eqiad.wmnet
  • 10:22 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1041.eqiad.wmnet
  • 10:22 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1040.eqiad.wmnet
  • 10:17 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1096:3315 for schema change', diff saved to https://phabricator.wikimedia.org/P14774 and previous config saved to /var/cache/conftool/dbconfig/20210311-101714-marostegui.json
  • 10:16 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1040.eqiad.wmnet
  • 10:16 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1039.eqiad.wmnet
  • 10:16 marostegui@cumin1001: dbctl commit (dc=all): 'Add db2149 to dbctl, depooled, T275633', diff saved to https://phabricator.wikimedia.org/P14773 and previous config saved to /var/cache/conftool/dbconfig/20210311-101604-marostegui.json
  • 10:10 marostegui@cumin1001: dbctl commit (dc=all): 'db1174 (re)pooling @ 100%: Repool db1174 after schema change', diff saved to https://phabricator.wikimedia.org/P14772 and previous config saved to /var/cache/conftool/dbconfig/20210311-101008-root.json
  • 10:09 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1039.eqiad.wmnet
  • 10:07 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2109', diff saved to https://phabricator.wikimedia.org/P14771 and previous config saved to /var/cache/conftool/dbconfig/20210311-100705-marostegui.json
  • 10:00 mbsantos@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' .
  • 09:55 marostegui@cumin1001: dbctl commit (dc=all): 'db1174 (re)pooling @ 60%: Repool db1174 after schema change', diff saved to https://phabricator.wikimedia.org/P14770 and previous config saved to /var/cache/conftool/dbconfig/20210311-095504-root.json
  • 09:51 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1038.eqiad.wmnet
  • 09:45 marostegui: Deploy schema change on s5 codfw master, lag will appear - T276150 T276156
  • 09:43 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1038.eqiad.wmnet
  • 09:40 marostegui@cumin1001: dbctl commit (dc=all): 'db1174 (re)pooling @ 30%: Repool db1174 after schema change', diff saved to https://phabricator.wikimedia.org/P14769 and previous config saved to /var/cache/conftool/dbconfig/20210311-094000-root.json
  • 09:35 akosiaris@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'recommendation-api' for release 'production' .
  • 09:31 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1037.eqiad.wmnet
  • 09:31 hashar: Restarting CI Jenkins
  • 09:29 akosiaris@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'recommendation-api' for release 'production' .
  • 09:24 marostegui@cumin1001: dbctl commit (dc=all): 'db1174 (re)pooling @ 10%: Repool db1174 after schema change', diff saved to https://phabricator.wikimedia.org/P14768 and previous config saved to /var/cache/conftool/dbconfig/20210311-092457-root.json
  • 09:24 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1037.eqiad.wmnet
  • 09:23 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1036.eqiad.wmnet
  • 09:19 effie: upgrade memcached on mc1020, mc2020
  • 09:16 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1036.eqiad.wmnet
  • 09:11 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1035.eqiad.wmnet
  • 09:08 akosiaris@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'recommendation-api' for release 'production' .
  • 09:04 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1035.eqiad.wmnet
  • 09:03 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1174', diff saved to https://phabricator.wikimedia.org/P14767 and previous config saved to /var/cache/conftool/dbconfig/20210311-090342-marostegui.json
  • 09:03 marostegui@cumin1001: dbctl commit (dc=all): 'db1079 (re)pooling @ 100%: Repool db1079 after schema change', diff saved to https://phabricator.wikimedia.org/P14766 and previous config saved to /var/cache/conftool/dbconfig/20210311-090312-root.json
  • 09:03 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1033.eqiad.wmnet
  • 08:58 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1033.eqiad.wmnet
  • 08:58 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1032.eqiad.wmnet
  • 08:51 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1032.eqiad.wmnet
  • 08:50 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1031.eqiad.wmnet
  • 08:48 marostegui@cumin1001: dbctl commit (dc=all): 'db1079 (re)pooling @ 60%: Repool db1079 after schema change', diff saved to https://phabricator.wikimedia.org/P14765 and previous config saved to /var/cache/conftool/dbconfig/20210311-084809-root.json
  • 08:43 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1031.eqiad.wmnet
  • 08:41 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1030.eqiad.wmnet
  • 08:34 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1030.eqiad.wmnet
  • 08:33 marostegui@cumin1001: dbctl commit (dc=all): 'db1079 (re)pooling @ 30%: Repool db1079 after schema change', diff saved to https://phabricator.wikimedia.org/P14764 and previous config saved to /var/cache/conftool/dbconfig/20210311-083305-root.json
  • 08:30 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1029.eqiad.wmnet
  • 08:25 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2109', diff saved to https://phabricator.wikimedia.org/P14762 and previous config saved to /var/cache/conftool/dbconfig/20210311-082546-marostegui.json
  • 08:25 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2074', diff saved to https://phabricator.wikimedia.org/P14761 and previous config saved to /var/cache/conftool/dbconfig/20210311-082528-marostegui.json
  • 08:24 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2074', diff saved to https://phabricator.wikimedia.org/P14760 and previous config saved to /var/cache/conftool/dbconfig/20210311-082445-marostegui.json
  • 08:24 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1029.eqiad.wmnet
  • 08:22 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1028.eqiad.wmnet
  • 08:18 marostegui@cumin1001: dbctl commit (dc=all): 'db1079 (re)pooling @ 10%: Repool db1079 after schema change', diff saved to https://phabricator.wikimedia.org/P14759 and previous config saved to /var/cache/conftool/dbconfig/20210311-081801-root.json
  • 08:16 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be1028.eqiad.wmnet
  • 08:10 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2108 T275633', diff saved to https://phabricator.wikimedia.org/P14758 and previous config saved to /var/cache/conftool/dbconfig/20210311-081010-marostegui.json
  • 08:09 marostegui@cumin1001: dbctl commit (dc=all): 'Add db2148 to s2 T275633', diff saved to https://phabricator.wikimedia.org/P14757 and previous config saved to /var/cache/conftool/dbconfig/20210311-080944-marostegui.json
  • 07:43 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1079', diff saved to https://phabricator.wikimedia.org/P14756 and previous config saved to /var/cache/conftool/dbconfig/20210311-074352-marostegui.json
  • 07:37 marostegui@cumin1001: dbctl commit (dc=all): 'db1136 (re)pooling @ 100%: Repool db1136 after schema change', diff saved to https://phabricator.wikimedia.org/P14755 and previous config saved to /var/cache/conftool/dbconfig/20210311-073741-root.json
  • 07:22 marostegui@cumin1001: dbctl commit (dc=all): 'db1136 (re)pooling @ 60%: Repool db1136 after schema change', diff saved to https://phabricator.wikimedia.org/P14754 and previous config saved to /var/cache/conftool/dbconfig/20210311-072237-root.json
  • 07:07 marostegui@cumin1001: dbctl commit (dc=all): 'db1136 (re)pooling @ 30%: Repool db1136 after schema change', diff saved to https://phabricator.wikimedia.org/P14753 and previous config saved to /var/cache/conftool/dbconfig/20210311-070734-root.json
  • 06:52 marostegui@cumin1001: dbctl commit (dc=all): 'db1136 (re)pooling @ 10%: Repool db1136 after schema change', diff saved to https://phabricator.wikimedia.org/P14752 and previous config saved to /var/cache/conftool/dbconfig/20210311-065230-root.json
  • 06:48 marostegui: Stop mysql on db2108 to clone db2148 T275633
  • 06:48 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2108 T275633', diff saved to https://phabricator.wikimedia.org/P14750 and previous config saved to /var/cache/conftool/dbconfig/20210311-064821-marostegui.json
  • 06:38 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1136', diff saved to https://phabricator.wikimedia.org/P14749 and previous config saved to /var/cache/conftool/dbconfig/20210311-063814-marostegui.json
  • 06:36 marostegui: Drop testreduce from m5 - T276787
  • 05:34 thcipriani: restarted apache2 on gerrit1001
  • 00:50 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2219.codfw.wmnet
  • 00:49 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2218.codfw.wmnet
  • 00:49 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2217.codfw.wmnet
  • 00:48 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2216.codfw.wmnet
  • 00:48 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2215.codfw.wmnet

2021-03-10

  • 23:49 mholloway-shell@deploy1002: Synchronized php-1.36.0-wmf.34/extensions/EventLogging: EventLogging: Stream always in sample if the user is in debugMode (T276515) (duration: 01m 23s)
  • 23:41 dwisehaupt: disabled silverpop daily run in process-control until utf8mb4 conversion completes on frdev1001
  • 23:12 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-druid1004.eqiad.wmnet with reason: REIMAGE
  • 23:10 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-druid1004.eqiad.wmnet with reason: REIMAGE
  • 23:10 legoktm@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts registry1002.eqiad.wmnet
  • 23:01 legoktm@cumin1001: START - Cookbook sre.hosts.decommission for hosts registry1002.eqiad.wmnet
  • 22:55 legoktm@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts registry[2001-2002].codfw.wmnet
  • 22:51 andrewbogott: updating puppet compiler facts to catch up with a new custom fact
  • 22:44 legoktm@cumin1001: START - Cookbook sre.hosts.decommission for hosts registry[2001-2002].codfw.wmnet
  • 22:40 legoktm@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts registry1001.eqiad.wmnet
  • 22:32 brennen@deploy1002: Synchronized php: group1 wikis to 1.36.0-wmf.34 (duration: 01m 30s)
  • 22:30 brennen@deploy1002: rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.34
  • 22:27 legoktm@cumin1001: START - Cookbook sre.hosts.decommission for hosts registry1001.eqiad.wmnet
  • 22:26 brennen: train status: 1.36.0-wmf.34 (T274938): T277094 believed resolved, promoting to group1.
  • 22:25 brennen@deploy1002: Synchronized php-1.36.0-wmf.34/extensions/WikimediaEvents/modules/ext.wikimediaEvents/clientError.js: Backport: Fix client error logging (T277094) (duration: 01m 09s)
  • 21:53 mutante: ferm/iptables docker NAT rules applied by puppet on releases servers after breaking out rules into their own profile class (T276869)
  • 21:51 dwisehaupt: upgraded mariadb and keeping replication stopped on frdb1002 to start the utf8mb4 table alters under a root screen session
  • 21:43 brennen: train status: 1.36.0-wmf.34 (T274938): client errors may still be missing for group0; continuing to hold for T277094 until we know what's broken.
  • 21:40 brennen@deploy1002: Synchronized php-1.36.0-wmf.34/extensions/WikimediaEvents/modules/ext.wikimediaEvents/clientError.js: Backport: Revert "Error in shouldLog logic drops most errors" (T277094) (duration: 01m 08s)
  • 21:38 dwisehaupt: stopping mysql replication on frdev1001 and starting utf8mb4 table alters under a root screen session
  • 21:38 dwisehaupt: stopping mysql replication on frdb1003 and starting utf8mb4 table alters under a root screen session
  • 21:30 brennen: train status: 1.36.0-wmf.34 (T274938): logstash client error board was set up incorrectly; reverting earlier patch for T277094 and will proceed to group1.
  • 21:19 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: cdc47f3: jawiki: Growth features: Add help panel links (T276830) (duration: 01m 08s)
  • 21:16 eileen: civicrm revision changed from b13e70d968 to 550be50105, config revision is 970b10b0b3
  • 21:13 cdanis@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
  • 21:00 cdanis@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 20:57 cdanis@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
  • 20:56 Urbanecm: Fixing wrong sync message: urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist f72c3d6: jawiki: Enable Growth features in stealth mode (T276830) (duration: 01m 08s)
  • 20:56 Urbanecm: Fixing wrong sync message: urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: f72c3d6: jawiki: Enable Growth features in stealth mode (T276830) (duration: 01m 07s)
  • 20:54 urbanecm@deploy1002: Synchronized dblists/growthexperiments.dblist: 92ae985: thwiki: Make Growth features available to newcomers (T274646) (duration: 01m 08s)
  • 20:53 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 92ae985: thwiki: Make Growth features available to newcomers (T274646) (duration: 01m 07s)
  • 20:50 cdanis@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 20:48 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 92ae985: thwiki: Make Growth features available to newcomers (T274646) (duration: 01m 08s)
  • 20:41 brennen@deploy1002: Synchronized php-1.36.0-wmf.34/extensions/WikimediaEvents/modules/ext.wikimediaEvents/clientError.js: Backport: Error in shouldLog logic drops most errors (T277094) (duration: 01m 14s)
  • 20:36 cdanis@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
  • 19:58 brennen: train status: 1.36.0-wmf.34 (T274938): currently blocked at group0 as client error logging is broken (UBN ticket incoming), will hold for patch.
  • 19:37 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: a130e9f: Enable Growth features on eowiki in stealth mode (T276123) (duration: 01m 08s)
  • 19:35 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging1003.eqiad.wmnet with reason: REIMAGE
  • 19:33 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging1003.eqiad.wmnet with reason: REIMAGE
  • 19:32 ryankemper: T266470 `ryankemper@cumin1001:~$ sudo -E cumin 'A:wdqs-all' 'sudo enable-puppet "revoking old cert and generating new one with new alt_names - T266470 - root"'` && `ryankemper@cumin1001:~$ sudo -E cumin 'A:wdqs-all' 'sudo run-puppet-agent'`
  • 19:29 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 84271f6: Enable DiscussionTools beta features on frwiktionary (T276189) (duration: 01m 09s)
  • 19:28 ryankemper: T266470 `ryankemper@wdqs1004:~$ sudo enable-puppet "revoking old cert and generating new one with new alt_names - T266470 - root"` && `sudo run-puppet-agent`
  • 19:27 ryankemper: T266470 `/srv/private` commit SHA for this change is `45852086679616bccb5bba3dd6396082b0f25a3d`
  • 19:26 ryankemper: T266470 `sudo chown -Rv gitpuppet:gitpuppet /srv/private/modules/secret/secrets/certificates/wdqs.discovery.wmnet/` && `sudo chown -v gitpuppet:gitpuppet /srv/private/modules/secret/secrets/ssl/wdqs.discovery.wmnet.key`
  • 19:25 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 5093618: Enable DiscussionTools beta feature for newtopictool on most wikis (T275827) (duration: 01m 08s)
  • 19:23 ryankemper: T266470 Deployed https://gerrit.wikimedia.org/r/c/operations/puppet/+/670562 (copies over new pubkey)
  • 19:23 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 4824679: Disable DiscussionTools Reply Tool A/B test (T276967) (duration: 01m 07s)
  • 19:22 urbanecm@deploy1002: Synchronized php-1.36.0-wmf.34/extensions/DiscussionTools/includes/Hooks/HookUtils.php: 9cb48f0: Allow users to continue using reply tool after disabling A/B test (T276967) (duration: 01m 07s)
  • 19:20 urbanecm@deploy1002: Synchronized php-1.36.0-wmf.33/extensions/DiscussionTools/includes/Hooks/HookUtils.php: 4193ff7: Allow users to continue using reply tool after disabling A/B test (T276967) (duration: 01m 09s)
  • 19:18 urbanecm@deploy1002: Synchronized php-1.36.0-wmf.34/extensions/WikimediaEvents/modules/ext.wikimediaEvents/searchSatisfaction.js: e998086: searchSatisfaction: Allow for async initialisation (T274869) (duration: 01m 08s)
  • 19:18 ryankemper: T266470 `sudo cergen -c 'wdqs.*' --generate --base-path /srv/private/modules/secret/secrets/certificates /srv/private/modules/secret/secrets/certificates/certificate.manifests.d`
  • 19:17 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging1002.eqiad.wmnet with reason: REIMAGE
  • 19:16 urbanecm@deploy1002: Synchronized php-1.36.0-wmf.33/extensions/WikimediaEvents/modules/ext.wikimediaEvents/searchSatisfaction.js: d9bad12: searchSatisfaction: Allow for async initialisation (T274869) (duration: 01m 08s)
  • 19:16 ryankemper: T266470 `sudo rm -fv certificates/wdqs.discovery.wmnet/wdqs.discovery.wmnet.crt.pem certificates/wdqs.discovery.wmnet/wdqs.discovery.wmnet.csr.pem certificates/wdqs.discovery.wmnet/wdqs.discovery.wmnet.keystore.jks certificates/wdqs.discovery.wmnet/wdqs.discovery.wmnet.keystore.p12 certificates/wdqs.discovery.wmnet/truststore.jks` (full paths not provided to fit the IRC line)
  • 19:15 ryankemper: T266470 `sudo puppet cert clean wdqs.discovery.wmnet`
  • 19:15 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging1002.eqiad.wmnet with reason: REIMAGE
  • 19:14 ryankemper: T266470 on `ryankemper@cumin1001`: `sudo -E cumin 'A:wdqs-all' 'sudo disable-puppet "revoking old cert and generating new one with new alt_names - T266470"'`
  • 19:14 ryankemper: T266470 Temporarily disabling puppet on all `wdqs*` hosts in preparation for `wdqs.discovery.wmnet` certificate revocation
  • 19:06 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: fe99c31: Remove unused config for InukaPageView (T265921) (duration: 01m 26s)
  • 18:56 dwisehaupt: all fundraising servers are now running buster - T254198
  • 18:37 mforns@deploy1002: Finished deploy [analytics/refinery@7fbc3c7] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@7fbc3c700ccb3c598690da9a38990ef7cb187656] (duration: 04m 12s)
  • 18:33 mforns@deploy1002: Started deploy [analytics/refinery@7fbc3c7] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@7fbc3c700ccb3c598690da9a38990ef7cb187656]
  • 18:33 mforns@deploy1002: Finished deploy [analytics/refinery@7fbc3c7] (thin): Regular analytics weekly train THIN [analytics/refinery@7fbc3c700ccb3c598690da9a38990ef7cb187656] (duration: 00m 07s)
  • 18:33 mforns@deploy1002: Started deploy [analytics/refinery@7fbc3c7] (thin): Regular analytics weekly train THIN [analytics/refinery@7fbc3c700ccb3c598690da9a38990ef7cb187656]
  • 18:32 mforns@deploy1002: Finished deploy [analytics/refinery@7fbc3c7]: Regular analytics weekly train [analytics/refinery@7fbc3c700ccb3c598690da9a38990ef7cb187656] (duration: 14m 30s)
  • 18:18 mforns@deploy1002: Started deploy [analytics/refinery@7fbc3c7]: Regular analytics weekly train [analytics/refinery@7fbc3c700ccb3c598690da9a38990ef7cb187656]
  • 17:48 mutante: new Wikimedia project language "trv" added - Seediq is an Atayalic language spoken in the mountains of Northern Taiwan by the Seediq and Taroko people.
  • 17:45 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kafka-logging2003.codfw.wmnet with reason: REIMAGE
  • 17:42 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging2003.codfw.wmnet with reason: REIMAGE
  • 17:19 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kafka-logging2002.codfw.wmnet with reason: REIMAGE
  • 17:17 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging2002.codfw.wmnet with reason: REIMAGE
  • 16:56 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1030.eqiad.wmnet
  • 16:52 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kafka-logging2001.codfw.wmnet with reason: REIMAGE
  • 16:50 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudvirt1030.eqiad.wmnet
  • 16:50 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging2001.codfw.wmnet with reason: REIMAGE
  • 16:47 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging1001.eqiad.wmnet with reason: REIMAGE
  • 16:45 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging1001.eqiad.wmnet with reason: REIMAGE
  • 16:20 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging1001.eqiad.wmnet with reason: REIMAGE
  • 16:18 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging1001.eqiad.wmnet with reason: REIMAGE
  • 15:33 marostegui@cumin1001: dbctl commit (dc=all): 'db1127 (re)pooling @ 100%: Repool db1127 after schema change', diff saved to https://phabricator.wikimedia.org/P14744 and previous config saved to /var/cache/conftool/dbconfig/20210310-153324-root.json
  • 15:22 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sodium.wikimedia.org
  • 15:18 marostegui@cumin1001: dbctl commit (dc=all): 'db1127 (re)pooling @ 60%: Repool db1127 after schema change', diff saved to https://phabricator.wikimedia.org/P14743 and previous config saved to /var/cache/conftool/dbconfig/20210310-151820-root.json
  • 15:16 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host sodium.wikimedia.org
  • 15:03 marostegui@cumin1001: dbctl commit (dc=all): 'db1127 (re)pooling @ 30%: Repool db1127 after schema change', diff saved to https://phabricator.wikimedia.org/P14742 and previous config saved to /var/cache/conftool/dbconfig/20210310-150316-root.json
  • 14:53 klausman@puppetmaster1001: conftool action : set/pooled=yes:weight=1; selector: cluster=ml_serve,service=kubemaster
  • 14:52 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2061.codfw.wmnet
  • 14:48 marostegui@cumin1001: dbctl commit (dc=all): 'db1127 (re)pooling @ 10%: Repool db1127 after schema change', diff saved to https://phabricator.wikimedia.org/P14741 and previous config saved to /var/cache/conftool/dbconfig/20210310-144813-root.json
  • 14:44 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2061.codfw.wmnet
  • 14:43 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2060.codfw.wmnet
  • 14:35 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1127', diff saved to https://phabricator.wikimedia.org/P14740 and previous config saved to /var/cache/conftool/dbconfig/20210310-143547-marostegui.json
  • 14:35 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2060.codfw.wmnet
  • 14:34 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2059.codfw.wmnet
  • 14:26 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2059.codfw.wmnet
  • 14:24 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2058.codfw.wmnet
  • 14:23 marostegui@cumin1001: dbctl commit (dc=all): 'db1170:3317 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P14739 and previous config saved to /var/cache/conftool/dbconfig/20210310-142316-root.json
  • 14:19 akosiaris@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 14:19 akosiaris@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 14:19 akosiaris@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 14:15 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2058.codfw.wmnet
  • 14:14 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2057.codfw.wmnet
  • 14:08 marostegui@cumin1001: dbctl commit (dc=all): 'db1170:3317 (re)pooling @ 60%: 10', diff saved to https://phabricator.wikimedia.org/P14738 and previous config saved to /var/cache/conftool/dbconfig/20210310-140812-root.json
  • 14:06 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2057.codfw.wmnet
  • 14:06 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2056.codfw.wmnet
  • 14:01 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2056.codfw.wmnet
  • 14:01 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2055.codfw.wmnet
  • 13:55 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2055.codfw.wmnet
  • 13:54 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2054.codfw.wmnet
  • 13:53 marostegui@cumin1001: dbctl commit (dc=all): 'db1170:3317 (re)pooling @ 30%: 10', diff saved to https://phabricator.wikimedia.org/P14736 and previous config saved to /var/cache/conftool/dbconfig/20210310-135309-root.json
  • 13:49 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2054.codfw.wmnet
  • 13:43 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2053.codfw.wmnet
  • 13:37 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2053.codfw.wmnet
  • 13:37 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2052.codfw.wmnet
  • 13:31 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2052.codfw.wmnet
  • 13:30 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2051.codfw.wmnet
  • 13:24 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2051.codfw.wmnet
  • 13:24 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2050.codfw.wmnet
  • 13:18 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2050.codfw.wmnet
  • 13:17 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2049.codfw.wmnet
  • 13:11 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2049.codfw.wmnet
  • 13:10 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2048.codfw.wmnet
  • 13:07 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1029.eqiad.wmnet
  • 13:03 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2048.codfw.wmnet
  • 12:59 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2047.codfw.wmnet
  • 12:54 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2047.codfw.wmnet
  • 12:54 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2046.codfw.wmnet
  • 12:52 ariel@cumin1001: END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0)
  • 12:47 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2046.codfw.wmnet
  • 12:47 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2045.codfw.wmnet
  • 12:47 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudvirt1029.eqiad.wmnet
  • 12:45 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 623ed48: nowiki: Enable Growth features in stealth mode (T276816) (duration: 01m 07s)
  • 12:41 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1170:3317 for schema change', diff saved to https://phabricator.wikimedia.org/P14734 and previous config saved to /var/cache/conftool/dbconfig/20210310-124140-marostegui.json
  • 12:41 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2045.codfw.wmnet
  • 12:40 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2044.codfw.wmnet
  • 12:36 marostegui@cumin1001: dbctl commit (dc=all): 'db1101:3317 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P14733 and previous config saved to /var/cache/conftool/dbconfig/20210310-123654-root.json
  • 12:36 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host theemin.codfw.wmnet
  • 12:34 ladsgroup@deploy1002: Synchronized php-1.36.0-wmf.34/languages: Add shy name (same as shy-latn) (T259360) (duration: 01m 10s)
  • 12:34 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2044.codfw.wmnet
  • 12:32 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2043.codfw.wmnet
  • 12:32 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host theemin.codfw.wmnet
  • 12:32 ariel@cumin1001: START - Cookbook sre.cassandra.roll-restart
  • 12:31 ariel@cumin1001: END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99)
  • 12:31 ariel@cumin1001: START - Cookbook sre.cassandra.roll-restart
  • 12:28 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2043.codfw.wmnet
  • 12:23 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2042.codfw.wmnet
  • 12:22 ladsgroup@deploy1002: Synchronized php-1.36.0-wmf.33/languages: Add shy name (same as shy-latn) (T259360) (duration: 01m 10s)
  • 12:21 marostegui@cumin1001: dbctl commit (dc=all): 'db1101:3317 (re)pooling @ 60%: 10', diff saved to https://phabricator.wikimedia.org/P14732 and previous config saved to /var/cache/conftool/dbconfig/20210310-122150-root.json
  • 12:17 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2042.codfw.wmnet
  • 12:15 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2041.codfw.wmnet
  • 12:12 ladsgroup@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Update several Wikidata-related configs (duration: 01m 32s)
  • 12:09 klausman@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 12:08 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2041.codfw.wmnet
  • 12:07 klausman@cumin1001: START - Cookbook sre.dns.netbox
  • 12:06 marostegui@cumin1001: dbctl commit (dc=all): 'db1101:3317 (re)pooling @ 30%: 10', diff saved to https://phabricator.wikimedia.org/P14731 and previous config saved to /var/cache/conftool/dbconfig/20210310-120647-root.json
  • 11:57 jiji@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mc1024.eqiad.wmnet
  • 11:42 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on aqs1010.eqiad.wmnet with reason: New buster host
  • 11:42 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on aqs1010.eqiad.wmnet with reason: New buster host
  • 11:34 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1013.eqiad.wmnet
  • 11:29 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudvirt1013.eqiad.wmnet
  • 11:27 jiji@cumin1001: START - Cookbook sre.hosts.decommission for hosts mc1024.eqiad.wmnet
  • 11:25 kormat@cumin1001: dbctl commit (dc=all): 'db1168 (re)pooling @ 100%: schema change T267767', diff saved to https://phabricator.wikimedia.org/P14730 and previous config saved to /var/cache/conftool/dbconfig/20210310-112553-kormat.json
  • 11:24 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1101:3317 for schema change', diff saved to https://phabricator.wikimedia.org/P14729 and previous config saved to /var/cache/conftool/dbconfig/20210310-112427-marostegui.json
  • 11:22 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2040.codfw.wmnet
  • 11:19 marostegui@cumin1001: dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P14728 and previous config saved to /var/cache/conftool/dbconfig/20210310-111903-root.json
  • 11:16 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2040.codfw.wmnet
  • 11:15 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2039.codfw.wmnet
  • 11:10 kormat@cumin1001: dbctl commit (dc=all): 'db1168 (re)pooling @ 75%: schema change T267767', diff saved to https://phabricator.wikimedia.org/P14727 and previous config saved to /var/cache/conftool/dbconfig/20210310-111049-kormat.json
  • 11:07 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2039.codfw.wmnet
  • 11:06 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2038.codfw.wmnet
  • 11:05 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1028.eqiad.wmnet
  • 11:04 marostegui@cumin1001: dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 60%: 10', diff saved to https://phabricator.wikimedia.org/P14726 and previous config saved to /var/cache/conftool/dbconfig/20210310-110359-root.json
  • 11:00 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2038.codfw.wmnet
  • 11:00 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudvirt1028.eqiad.wmnet
  • 10:55 kormat@cumin1001: dbctl commit (dc=all): 'db1168 (re)pooling @ 50%: schema change T267767', diff saved to https://phabricator.wikimedia.org/P14725 and previous config saved to /var/cache/conftool/dbconfig/20210310-105545-kormat.json
  • 10:54 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1023.eqiad.wmnet
  • 10:51 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2037.codfw.wmnet
  • 10:48 marostegui@cumin1001: dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 30%: 10', diff saved to https://phabricator.wikimedia.org/P14724 and previous config saved to /var/cache/conftool/dbconfig/20210310-104856-root.json
  • 10:47 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2037.codfw.wmnet
  • 10:45 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2036.codfw.wmnet
  • 10:40 kormat@cumin1001: dbctl commit (dc=all): 'db1168 (re)pooling @ 25%: schema change T267767', diff saved to https://phabricator.wikimedia.org/P14723 and previous config saved to /var/cache/conftool/dbconfig/20210310-104042-kormat.json
  • 10:40 effie: upgrade memcached on mc2019, mc1019
  • 10:39 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2036.codfw.wmnet
  • 10:38 kormat@cumin1001: dbctl commit (dc=all): 'db1168 depooling: schema change T267767', diff saved to https://phabricator.wikimedia.org/P14722 and previous config saved to /var/cache/conftool/dbconfig/20210310-103836-kormat.json
  • 10:38 kormat@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1168.eqiad.wmnet with reason: schema change T267767
  • 10:38 kormat@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on db1168.eqiad.wmnet with reason: schema change T267767
  • 10:37 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2035.codfw.wmnet
  • 10:32 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2035.codfw.wmnet
  • 10:29 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudvirt1023.eqiad.wmnet
  • 10:19 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1098:3317 for schema change', diff saved to https://phabricator.wikimedia.org/P14721 and previous config saved to /var/cache/conftool/dbconfig/20210310-101922-marostegui.json
  • 10:18 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2034.codfw.wmnet
  • 10:12 marostegui: Drop testreduce_vd from m5 master - T276787
  • 10:11 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2034.codfw.wmnet
  • 10:03 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2033.codfw.wmnet
  • 09:58 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2033.codfw.wmnet
  • 09:58 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2032.codfw.wmnet
  • 09:52 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2032.codfw.wmnet
  • 09:49 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2031.codfw.wmnet
  • 09:40 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2031.codfw.wmnet
  • 09:35 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2030.codfw.wmnet
  • 09:30 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2030.codfw.wmnet
  • 09:27 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2029.codfw.wmnet
  • 09:25 aborrero@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2003-dev.codfw.wmnet with reason: REIMAGE
  • 09:23 aborrero@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2003-dev.codfw.wmnet with reason: REIMAGE
  • 09:21 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2029.codfw.wmnet
  • 09:18 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2028.codfw.wmnet
  • 09:12 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-be2028.codfw.wmnet
  • 08:39 marostegui: Upgrade mysql and kernel on db2132
  • 08:25 marostegui: Upgrade mysql and kernel on db2078
  • 08:21 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thorium.eqiad.wmnet
  • 08:20 moritzm: pruning obsolete kernels from ganeti hosts in eqiad/codfw
  • 08:17 moritzm: powercycling thorium, stuck on reboot
  • 08:16 marostegui@cumin1001: dbctl commit (dc=all): 'db1085 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P14719 and previous config saved to /var/cache/conftool/dbconfig/20210310-081627-root.json
  • 08:11 marostegui: Check tables on db1150:3315 - T276742
  • 08:09 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host thorium.eqiad.wmnet
  • 08:05 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host analytics-tool1001.eqiad.wmnet
  • 08:03 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host analytics-tool1001.eqiad.wmnet
  • 08:01 marostegui@cumin1001: dbctl commit (dc=all): 'db1085 (re)pooling @ 60%: 10', diff saved to https://phabricator.wikimedia.org/P14718 and previous config saved to /var/cache/conftool/dbconfig/20210310-080123-root.json
  • 07:52 marostegui: Deploy schema change on s7 codfw (lag will appear) T276150 T276156
  • 07:46 marostegui@cumin1001: dbctl commit (dc=all): 'db1085 (re)pooling @ 30%: 10', diff saved to https://phabricator.wikimedia.org/P14717 and previous config saved to /var/cache/conftool/dbconfig/20210310-074618-root.json
  • 07:33 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite1004.eqiad.wmnet
  • 07:29 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host graphite1004.eqiad.wmnet
  • 07:26 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1085 for schema change', diff saved to https://phabricator.wikimedia.org/P14716 and previous config saved to /var/cache/conftool/dbconfig/20210310-072642-marostegui.json
  • 07:25 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1113:3316', diff saved to https://phabricator.wikimedia.org/P14715 and previous config saved to /var/cache/conftool/dbconfig/20210310-072508-marostegui.json
  • 07:07 elukey: sudo apt-get remove linux-image-4.9.0-9-amd64 on sodium to free space for /boot
  • 07:06 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2145', diff saved to https://phabricator.wikimedia.org/P14714 and previous config saved to /var/cache/conftool/dbconfig/20210310-070642-marostegui.json
  • 07:03 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1113:3316 for schema change', diff saved to https://phabricator.wikimedia.org/P14713 and previous config saved to /var/cache/conftool/dbconfig/20210310-070312-marostegui.json
  • 07:01 elukey: remove the oldest kernel on ganeti nodes to free space for /boot
  • 07:00 marostegui: Depool clouddb1016
  • 06:45 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1111.eqiad.wmnet with reason: REIMAGE
  • 06:43 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1111.eqiad.wmnet with reason: REIMAGE
  • 06:17 elukey: reimage an-worker1111 to buster
  • 05:27 ryankemper: T266470 Rollout of updated certificate complete. We're now ready to implement envoy for `wdqs-test` which will allow `wdqs1009` to be reachable via port 443 and thereby allow us to go live with `query-preview.wikidata.org` when the time comes
  • 05:26 ryankemper: T266470 `ryankemper@cumin1001:~$ sudo -E cumin 'A:wdqs-all' 'sudo enable-puppet "revoking old cert and generating new one with new alt_names - T266470 - root"'` and `ryankemper@cumin1001:~$ sudo -E cumin 'A:wdqs-all' 'sudo run-puppet-agent'`
  • 05:24 ryankemper: T266470 Test queries passing on `wdqs1004`, and `https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&var-cluster_name=wdqs&from=now-1h&to=now` looks as expected. Proceeding to rest of fleet
  • 05:20 ryankemper: T266470 Enabled puppet on single public wdqs host to verify certificate update is without issue: `ryankemper@wdqs1004:~$ sudo enable-puppet "revoking old cert and generating new one with new alt_names - T266470 - root"` followed by `ryankemper@wdqs1004:~$ sudo run-puppet-agent`
  • 05:15 ryankemper: T266470 [`/srv/private`] All changes committed to private git repo, commit SHA `ec1d6cfae8c72e4f807b343cdb9f25c27817d98d`
  • 05:13 ryankemper: T266470 [`/srv/private`] `chown gitpuppet:gitpuppet` on all modified files (were owned by root, probably because I sudo'd - may be that a git commit hook would have caught that but explicitly chowning just to be safe)
  • 05:06 ryankemper: T266470 New `wdqs.discovery.wmnet.crt` added to public `operations/puppet` repo: https://gerrit.wikimedia.org/r/c/operations/puppet/+/670337/
  • 04:58 ryankemper: T266470 The above two actions mean that we're ready to generate the new certificate files. Proceeding: `sudo cergen -c 'wdqs.*' --generate --base-path /srv/private/modules/secret/secrets/certificates /srv/private/modules/secret/secrets/certificates/certificate.manifests.d` on `ryankemper@puppetmaster1001:/srv/private`
  • 04:57 ryankemper: T266470 `sudo rm -fv certificates/wdqs.discovery.wmnet/wdqs.discovery.wmnet.crt.pem certificates/wdqs.discovery.wmnet/wdqs.discovery.wmnet.csr.pem certificates/wdqs.discovery.wmnet/wdqs.discovery.wmnet.keystore.jks certificates/wdqs.discovery.wmnet/wdqs.discovery.wmnet.keystore.p12 certificates/wdqs.discovery.wmnet/truststore.jks` (full paths not provided to fit the IRC line)
  • 04:56 ryankemper: T266470 In the `/srv/private` repo, `/srv/private/modules/secret/secrets/certificates/certificate.manifests.d/wdqs.certs.yaml` has been edited to add the relevant `alt_names`
  • 04:55 ryankemper: T266470 Certificate revoked: `ryankemper@puppetmaster1001:/srv/private$ sudo puppet cert clean wdqs.discovery.wmnet`
  • 04:53 ryankemper: T266470 `ryankemper@cumin1001:~$ sudo -E cumin 'A:wdqs-all' 'sudo disable-puppet "revoking old cert and generating new one with new alt_names - T266470"'`
  • 04:52 ryankemper: T266470 Temporarily disabling puppet on all `wdqs*` hosts in preparation for `wdqs.discovery.wmnet` certificate revocation
  • 01:08 krinkle@deploy1002: Synchronized php-1.36.0-wmf.34/extensions/NavigationTiming/modules/ext.navigationTiming.js: T276826 Ibd9ddf14d64 (duration: 01m 14s)
  • 00:02 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-backup1002.eqiad.wmnet with reason: REIMAGE
  • 00:00 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-backup1001.eqiad.wmnet with reason: REIMAGE
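
The staged `dbctl` repools logged above (for example db1127 at 10%, 30%, 60%, 100% and db1168 at 25%, 50%, 75%, 100%) land roughly 15 minutes apart. The sketch below only reproduces that timing pattern so the entries are easier to cross-check. The function name `repool_schedule`, its parameters, and the 15-minute interval are assumptions inferred from the timestamps in this log, not taken from the repooling tool itself.

```python
from datetime import datetime, timedelta


def repool_schedule(first_step, percentages=(10, 30, 60, 100), interval_minutes=15):
    """Return (time, percentage) pairs for a staged repool.

    Hypothetical helper: the default percentages and the 15-minute
    interval are read off the log entries, not from any tool's docs.
    """
    return [
        (first_step + timedelta(minutes=i * interval_minutes), pct)
        for i, pct in enumerate(percentages)
    ]


if __name__ == "__main__":
    # db1127 was first repooled at 10% at 14:48 on 2021-03-10.
    start = datetime(2021, 3, 10, 14, 48)
    for when, pct in repool_schedule(start):
        print(f"{when:%H:%M} db1127 (re)pooling @ {pct}%")
```

Passing `percentages=(25, 50, 75, 100)` with a first step of 10:40 gives the four db1168 timestamps from the same day.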

2021-03-09

  • 23:59 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-backup1002.eqiad.wmnet with reason: REIMAGE
  • 23:58 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-backup1001.eqiad.wmnet with reason: REIMAGE
  • 22:04 mutante: phab1001 - manually running phab public task dump script after making changes to redirect stdout
  • 20:42 elukey: reimaged an-worker1091 to buster
  • 20:41 bstorm: depooled labsdb1009 T276980
  • 20:25 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1091.eqiad.wmnet with reason: REIMAGE
  • 20:25 bstorm: downtimed labsdb1009 so it doesn't keep paging T276980
  • 20:23 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1091.eqiad.wmnet with reason: REIMAGE
  • 20:09 brennen: train status: 1.36.0-wmf.34 (T274938) on group0 at 20:06:32 UTC; logs initially quiet.
  • 20:06 brennen@deploy1002: rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.34
  • 19:05 brennen@deploy1002: Pruned MediaWiki: 1.36.0-wmf.31 (duration: 03m 34s)
  • 19:04 pt1979@cumin2001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 18:59 pt1979@cumin2001: START - Cookbook sre.dns.netbox
  • 18:54 brennen@deploy1002: Finished scap: testwikis wikis to 1.36.0-wmf.34 (duration: 47m 25s)
  • 18:52 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1087.eqiad.wmnet with reason: REIMAGE
  • 18:49 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1087.eqiad.wmnet with reason: REIMAGE
  • 18:47 dcausse: re-pool wdqs1004
  • 18:37 mbsantos@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
  • 18:35 mbsantos@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
  • 18:34 pt1979@cumin2001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 18:29 pt1979@cumin2001: START - Cookbook sre.dns.netbox
  • 18:26 elukey: reimage an-worker1087 to buster
  • 18:16 mbsantos@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' .
  • 18:13 mbsantos@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' .
  • 18:12 brennen@deploy1002: Started scap: testwikis wikis to 1.36.0-wmf.34
  • 18:10 mbsantos@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
  • 18:05 mbsantos@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
  • 18:03 mbsantos@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
  • 18:02 marxarelli: deleting shut down memc* deployment-prep instances to free up quota for replacement db instances (T276968)
  • 18:02 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1085.eqiad.wmnet with reason: REIMAGE
  • 18:00 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1085.eqiad.wmnet with reason: REIMAGE
  • 17:50 papaul: rebooting db2073 for firmware upgrade
  • 17:01 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1077.eqiad.wmnet with reason: REIMAGE
  • 17:00 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 3119d7a: sqwiki: Fix deployment of Growth features (duration: 01m 00s)
  • 16:59 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1077.eqiad.wmnet with reason: REIMAGE
  • 16:46 pt1979@cumin2001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 16:41 pt1979@cumin2001: START - Cookbook sre.dns.netbox
  • 16:40 elukey: reimage analytics1077 to buster
  • 16:33 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1027.eqiad.wmnet
  • 16:32 jayme@deploy1002: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 16:31 jayme@deploy1002: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 16:31 brennen: 1.36.0-wmf.34 was branched at e175899 for T274938
  • 16:27 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudvirt1027.eqiad.wmnet
  • 16:21 marostegui@cumin1001: dbctl commit (dc=all): 'db1175 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P14708 and previous config saved to /var/cache/conftool/dbconfig/20210309-162116-root.json
  • 16:06 marostegui@cumin1001: dbctl commit (dc=all): 'db1175 (re)pooling @ 80%: 10', diff saved to https://phabricator.wikimedia.org/P14707 and previous config saved to /var/cache/conftool/dbconfig/20210309-160613-root.json
  • 15:56 moritzm: imported prometheus-ircd-exporter 0.2 to apt.wikimedia.org T224579
  • 15:51 marostegui@cumin1001: dbctl commit (dc=all): 'db1175 (re)pooling @ 60%: 10', diff saved to https://phabricator.wikimedia.org/P14706 and previous config saved to /var/cache/conftool/dbconfig/20210309-155109-root.json
  • 15:45 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1072.eqiad.wmnet with reason: REIMAGE
  • 15:43 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1072.eqiad.wmnet with reason: REIMAGE
  • 15:37 marostegui@cumin1001: dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 100%: Repooling db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P14705 and previous config saved to /var/cache/conftool/dbconfig/20210309-153715-root.json
  • 15:36 marostegui@cumin1001: dbctl commit (dc=all): 'db1175 (re)pooling @ 40%: 10', diff saved to https://phabricator.wikimedia.org/P14704 and previous config saved to /var/cache/conftool/dbconfig/20210309-153605-root.json
  • 15:35 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1008.eqiad.wmnet
  • 15:29 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-fe1008.eqiad.wmnet
  • 15:28 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1007.eqiad.wmnet
  • 15:28 otto@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Declare KaiOS / Inuka event streams - T267344 T267345 T267346 (duration: 00m 58s)
  • 15:22 marostegui@cumin1001: dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 60%: Repooling db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P14703 and previous config saved to /var/cache/conftool/dbconfig/20210309-152212-root.json
  • 15:21 marostegui@cumin1001: dbctl commit (dc=all): 'db1175 (re)pooling @ 30%: 10', diff saved to https://phabricator.wikimedia.org/P14702 and previous config saved to /var/cache/conftool/dbconfig/20210309-152102-root.json
  • 15:20 otto@deploy1002: Synchronized wmf-config/InitialiseSettings.php: WikimediaEvents: Bump session_tick sampling rate to 10% (duration: 00m 58s)
  • 15:18 elukey: reimage analytics1072 (hadoop hdfs journal node) to buster
  • 15:15 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-fe1007.eqiad.wmnet
  • 15:15 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1006.eqiad.wmnet
  • 15:11 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-fe1006.eqiad.wmnet
  • 15:07 marostegui@cumin1001: dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 30%: Repooling db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P14701 and previous config saved to /var/cache/conftool/dbconfig/20210309-150708-root.json
  • 15:05 marostegui@cumin1001: dbctl commit (dc=all): 'db1175 (re)pooling @ 20%: 10', diff saved to https://phabricator.wikimedia.org/P14700 and previous config saved to /var/cache/conftool/dbconfig/20210309-150558-root.json
  • 15:00 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1005.eqiad.wmnet
  • 14:56 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-fe1005.eqiad.wmnet
  • 14:56 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1089.eqiad.wmnet with reason: REIMAGE
  • 14:54 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1090.eqiad.wmnet with reason: REIMAGE
  • 14:53 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1089.eqiad.wmnet with reason: REIMAGE
  • 14:52 marostegui@cumin1001: dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 10%: Repooling db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P14699 and previous config saved to /var/cache/conftool/dbconfig/20210309-145205-root.json
  • 14:52 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1090.eqiad.wmnet with reason: REIMAGE
  • 14:41 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2008.codfw.wmnet
  • 14:38 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-fe2008.codfw.wmnet
  • 14:34 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1096:3316 for schema change', diff saved to https://phabricator.wikimedia.org/P14698 and previous config saved to /var/cache/conftool/dbconfig/20210309-143453-marostegui.json
  • 14:32 volker-e@deploy1002: Finished deploy [design/style-guide@deee49c]: Deploy design/style-guide: deee49c index: Add links to our design process and work guides (#446) (duration: 00m 06s)
  • 14:32 volker-e@deploy1002: Started deploy [design/style-guide@deee49c]: Deploy design/style-guide: deee49c index: Add links to our design process and work guides (#446)
  • 14:32 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1015.eqiad.wmnet with reason: REIMAGE
  • 14:31 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2007.codfw.wmnet
  • 14:30 marostegui@cumin1001: dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 100%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P14697 and previous config saved to /var/cache/conftool/dbconfig/20210309-143033-root.json
  • 14:30 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1014.eqiad.wmnet with reason: REIMAGE
  • 14:29 elukey: drain + reimage an-worker1090/89 to Buster
  • 14:29 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1015.eqiad.wmnet with reason: REIMAGE
  • 14:28 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1012.eqiad.wmnet with reason: REIMAGE
  • 14:27 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-fe2007.codfw.wmnet
  • 14:27 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1014.eqiad.wmnet with reason: REIMAGE
  • 14:26 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1012.eqiad.wmnet with reason: REIMAGE
  • 14:25 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2006.codfw.wmnet
  • 14:21 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-fe2006.codfw.wmnet
  • 14:17 jakob@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'production' .
  • 14:15 marostegui@cumin1001: dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 60%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P14696 and previous config saved to /var/cache/conftool/dbconfig/20210309-141529-root.json
  • 14:14 jakob@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'termbox' for release 'production' .
  • 14:12 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2005.codfw.wmnet
  • 14:12 jakob@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' .
  • 14:10 moritzm: installing intel-microcode updates on stretch
  • 14:09 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host ms-fe2005.codfw.wmnet
  • 14:08 jakob@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
  • 14:07 jgleeson: updated smashpig from 5a69abd40f to 58b070db1a
  • 14:00 marostegui@cumin1001: dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 30%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P14694 and previous config saved to /var/cache/conftool/dbconfig/20210309-140025-root.json
  • 13:52 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1004.eqiad.wmnet
  • 13:52 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1102.eqiad.wmnet with reason: REIMAGE
  • 13:50 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1080.eqiad.wmnet with reason: REIMAGE
  • 13:49 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1102.eqiad.wmnet with reason: REIMAGE
  • 13:49 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1080.eqiad.wmnet with reason: REIMAGE
  • 13:45 marostegui@cumin1001: dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 10%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P14693 and previous config saved to /var/cache/conftool/dbconfig/20210309-134522-root.json
  • 13:37 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host prometheus1004.eqiad.wmnet
  • 13:34 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on cloudvirt1038.eqiad.wmnet with reason: HW issue
  • 13:34 aborrero@cumin1001: START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on cloudvirt1038.eqiad.wmnet with reason: HW issue
  • 13:31 marostegui@cumin1001: dbctl commit (dc=all): 'db1168 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P14692 and previous config saved to /var/cache/conftool/dbconfig/20210309-133124-root.json
  • 13:28 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1003.eqiad.wmnet
  • 13:27 elukey: reimage an-worker1102 and an-worker1080 (hdfs journal node) to Buster
  • 13:21 jgleeson: updated payments-wiki from 65dbf0ed9d to 0e7800027a
  • 13:16 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1198:3316 for schema change', diff saved to https://phabricator.wikimedia.org/P14691 and previous config saved to /var/cache/conftool/dbconfig/20210309-131652-marostegui.json
  • 13:16 marostegui@cumin1001: dbctl commit (dc=all): 'db1168 (re)pooling @ 60%: 10', diff saved to https://phabricator.wikimedia.org/P14690 and previous config saved to /var/cache/conftool/dbconfig/20210309-131620-root.json
  • 13:10 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host prometheus1003.eqiad.wmnet
  • 13:08 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1103.eqiad.wmnet with reason: REIMAGE
  • 13:06 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1103.eqiad.wmnet with reason: REIMAGE
  • 13:03 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1013.eqiad.wmnet with reason: REIMAGE
  • 13:01 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1013.eqiad.wmnet with reason: REIMAGE
  • 13:01 marostegui@cumin1001: dbctl commit (dc=all): 'db1168 (re)pooling @ 30%: 10', diff saved to https://phabricator.wikimedia.org/P14689 and previous config saved to /var/cache/conftool/dbconfig/20210309-130116-root.json
  • 12:59 elukey: drain + reimage an-worker1103 to Buster
  • 12:59 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1011.eqiad.wmnet with reason: REIMAGE
  • 12:57 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1011.eqiad.wmnet with reason: REIMAGE
  • 12:56 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mw1403.eqiad.wmnet
  • 12:56 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mw1402.eqiad.wmnet
  • 12:50 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1168 for schema change', diff saved to https://phabricator.wikimedia.org/P14688 and previous config saved to /var/cache/conftool/dbconfig/20210309-125007-marostegui.json
  • 12:49 marostegui@cumin1001: dbctl commit (dc=all): 'db1173 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P14687 and previous config saved to /var/cache/conftool/dbconfig/20210309-124931-root.json
  • 12:41 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host mw1403.eqiad.wmnet
  • 12:41 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host mw1402.eqiad.wmnet
  • 12:38 hnowlan@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 12:34 marostegui@cumin1001: dbctl commit (dc=all): 'db1173 (re)pooling @ 60%: 10', diff saved to https://phabricator.wikimedia.org/P14686 and previous config saved to /var/cache/conftool/dbconfig/20210309-123427-root.json
  • 12:33 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudvirt1038.eqiad.wmnet
  • 12:31 hnowlan@cumin1001: START - Cookbook sre.dns.netbox
  • 12:30 hnowlan: regenerating interfaces and reimaging aqs101[1-5]
  • 12:29 marostegui: Upgrade db2084 kernel
  • 12:26 marostegui: Upgrade db2094 kernel
  • 12:19 marostegui@cumin1001: dbctl commit (dc=all): 'db1173 (re)pooling @ 30%: 10', diff saved to https://phabricator.wikimedia.org/P14685 and previous config saved to /var/cache/conftool/dbconfig/20210309-121924-root.json
  • 12:19 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1166 entirely', diff saved to https://phabricator.wikimedia.org/P14684 and previous config saved to /var/cache/conftool/dbconfig/20210309-121913-marostegui.json
  • 12:18 marostegui@cumin1001: dbctl commit (dc=all): 'db1166 (re)pooling @ 30%: 10', diff saved to https://phabricator.wikimedia.org/P14683 and previous config saved to /var/cache/conftool/dbconfig/20210309-121849-root.json
  • 12:16 urbanecm@deploy1002: Synchronized php-1.36.0-wmf.33/extensions/GrowthExperiments/: dbd6f0c: Make help panel fallback to help desk if no mentor is available (T275908; T273782) (duration: 01m 01s)
  • 12:13 marostegui: Upgrade db2080 kernel
  • 12:06 marostegui: Upgrade db2077 kernel
  • 12:03 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1173 for schema change', diff saved to https://phabricator.wikimedia.org/P14682 and previous config saved to /var/cache/conftool/dbconfig/20210309-120326-marostegui.json
  • 12:00 marostegui: Upgrade db2076 kernel
  • 11:56 effie: restart envoy on mw1276
  • 11:56 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1010.eqiad.wmnet with reason: REIMAGE
  • 11:53 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1010.eqiad.wmnet with reason: REIMAGE
  • 11:52 jayme@deploy1002: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
  • 11:52 jayme@deploy1002: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
  • 11:45 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mw1307.eqiad.wmnet
  • 11:42 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2004.codfw.wmnet
  • 11:42 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwdebug1003.eqiad.wmnet
  • 11:30 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host mw1307.eqiad.wmnet
  • 11:30 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host mwdebug1003.eqiad.wmnet
  • 11:29 jayme@deploy1002: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
  • 11:29 jayme@deploy1002: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
  • 11:28 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host doc1001.eqiad.wmnet
  • 11:26 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host doc1001.eqiad.wmnet
  • 11:25 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host prometheus2004.codfw.wmnet
  • 11:20 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf1001.eqiad.wmnet
  • 11:18 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host webperf1001.eqiad.wmnet
  • 11:15 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf1002.eqiad.wmnet
  • 11:11 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host webperf1002.eqiad.wmnet
  • 11:11 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf2002.codfw.wmnet
  • 11:10 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host webperf2002.codfw.wmnet
  • 11:09 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf2001.codfw.wmnet
  • 11:04 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host webperf2001.codfw.wmnet
  • 11:01 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1037.eqiad.wmnet
  • 10:56 moritzm: installing mariadb-10.1 updates for stretch (distro version with libs/tools only, not wmf-mariadb)
  • 10:54 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudvirt1037.eqiad.wmnet
  • 10:53 dcausse: started to import lexemes on wdqs1009 (T276784)
  • 10:52 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2003.codfw.wmnet
  • 10:50 hnowlan@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 10:45 hnowlan@cumin1001: START - Cookbook sre.dns.netbox
  • 10:43 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ms-be[2020-2027].codfw.wmnet
  • 10:36 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host prometheus2003.codfw.wmnet
  • 10:31 moritzm: upgrading perf on stretch hosts
  • 10:30 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite2003.codfw.wmnet
  • 10:23 moritzm: installing gdisk security updates
  • 10:15 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host graphite2003.codfw.wmnet
  • 10:14 moritzm: installing libbsd security updates
  • 10:07 filippo@cumin1001: START - Cookbook sre.hosts.decommission for hosts ms-be[2020-2027].codfw.wmnet
  • 10:00 moritzm: installing busybox security updates
  • 09:54 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1001.eqiad.wmnet
  • 09:52 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host puppetboard1001.eqiad.wmnet
  • 09:50 marostegui: Reboot db2073 for kernel upgrade (stretch)
  • 09:50 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2001.codfw.wmnet
  • 09:49 jayme@deploy1002: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
  • 09:49 jayme@deploy1002: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
  • 09:46 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host puppetboard2001.codfw.wmnet
  • 09:44 marostegui: Reboot db2072 for kernel upgrade (stretch)
  • 09:42 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1112.eqiad.wmnet with reason: REIMAGE
  • 09:40 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1112.eqiad.wmnet with reason: REIMAGE
  • 09:39 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1076.eqiad.wmnet with reason: REIMAGE
  • 09:36 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1076.eqiad.wmnet with reason: REIMAGE
  • 09:14 elukey: drain + reimage analytics1076 and an-worker1112 to Buster
  • 09:01 moritzm: installing Linux 4.9.258 updates on Stretch hosts
  • 08:59 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ms-be[2017-2019].codfw.wmnet
  • 08:54 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1093.eqiad.wmnet with reason: REIMAGE
  • 08:52 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1092.eqiad.wmnet with reason: REIMAGE
  • 08:52 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1093.eqiad.wmnet with reason: REIMAGE
  • 08:50 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1092.eqiad.wmnet with reason: REIMAGE
  • 08:46 filippo@cumin1001: START - Cookbook sre.hosts.decommission for hosts ms-be[2017-2019].codfw.wmnet
  • 08:46 jayme@deploy1002: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
  • 08:46 jayme@deploy1002: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
  • 08:33 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
  • 08:27 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
  • 08:26 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ms-be2016.codfw.wmnet
  • 08:12 marostegui: Stop mysql on clouddb1015:3314, 3316
  • 07:59 filippo@cumin1001: START - Cookbook sre.hosts.decommission for hosts ms-be2016.codfw.wmnet
  • 07:50 dcausse: restarted blazegraph on wdqs1004 and depooled it to catch up on lag
  • 07:40 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1095.eqiad.wmnet with reason: REIMAGE
  • 07:38 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1094.eqiad.wmnet with reason: REIMAGE
  • 07:38 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1095.eqiad.wmnet with reason: REIMAGE
  • 07:36 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1094.eqiad.wmnet with reason: REIMAGE
  • 07:24 godog: swift eqiad-prod: add weight to ms-be106[0-3] - T268435
  • 07:03 elukey@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 07:01 elukey: drain + reimage an-worker109[4,5] to Buster
  • 06:58 elukey@cumin1001: START - Cookbook sre.dns.netbox
  • 06:30 _joe_: restarting gerrit on gerrit1001, using 48 GB of heap
  • 06:19 marostegui: Deploy schema change on s6 codfw (there will be lag on codfw) T276150 T276156
  • 05:37 marostegui: Stop mysql on clouddb1014:3312, 3317 to transfer its data to clouddb1021
  • 05:16 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1175 for table check T276742', diff saved to https://phabricator.wikimedia.org/P14675 and previous config saved to /var/cache/conftool/dbconfig/20210309-051646-marostegui.json
  • 00:58 Krinkle: krinkle@mwmaint1002 Ran invalidateUserSesssions.php for one user
  • 00:13 urbanecm@deploy1002: Synchronized wmf-config/config/incubatorwiki.yaml: 0d260ed: Enable modern Vector on incubator (T275479; 2/2) (duration: 00m 57s)
  • 00:11 urbanecm@deploy1002: Synchronized dblists/desktop-improvements.dblist: 0d260ed: Enable modern Vector on incubator (T275479; 1/2) (duration: 01m 01s)
  • 00:09 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: ce82e0c: Logo updates (T273085) (duration: 00m 58s)
  • 00:08 urbanecm@deploy1002: Synchronized static/images/mobile/copyright/: ce82e0c: Logo updates (T273085) (duration: 00m 58s)

2021-03-08

  • 22:36 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-druid1005.eqiad.wmnet with reason: REIMAGE
  • 22:34 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-druid1005.eqiad.wmnet with reason: REIMAGE
  • 21:42 mholloway-shell@deploy1002: Synchronized wmf-config/CommonSettings.php: WikimediaEvents: Create data QA group/right on testwiki (T276515) (duration: 00m 57s)
  • 21:18 otto@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Migrate Editing schemas to Event Platform on all wikis - T267343, T267353 (duration: 00m 58s)
  • 21:04 otto@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Migrate Editing schemas to Event Platform on testwiki, take 2 - T267343, T267353 (duration: 00m 58s)
  • 20:53 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 1227e2a: idwiki: Growth features: Add mentorlist (T259024) (duration: 00m 58s)
  • 20:44 legoktm: legoktm@registry1004:~$ sudo systemctl reset-failed # to fix icinga warning
  • 20:43 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-druid1003.eqiad.wmnet with reason: REIMAGE
  • 20:41 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-druid1003.eqiad.wmnet with reason: REIMAGE
  • 20:38 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 5ce7b46: Set wgGEHelpPanelAskMentor to true by default (T275908) (duration: 01m 07s)
  • 20:32 bblack: miscweb[12]002 - re-enabled puppet and deployed new cert
  • 20:23 bblack: miscweb[12]002 - disabling puppet to remake cergen cert...
  • 19:55 otto@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Migrate Editing schemas to Event Platform on testwiki - T267343, T267353 (duration: 00m 57s)
  • 19:47 dduvall@deploy1002: Synchronized php-1.36.0-wmf.33/maintenance/: maintenance: aa6f291: 4893ddb: fa97162: 380c448: DB_NONE offline maintenance improvements (duration: 00m 58s)
  • 19:37 dduvall@deploy1002: Synchronized wmf-config/: wmf-config/env.php,CommonSettings.php: f70049b: e53dc3a: f9b9ea1: WMF_DATACENTER, WMF_MAINTENANCE_OFFLINE handling (duration: 01m 00s)
  • 19:37 bblack: cp-text: banning varnish-fe for req.http.host == ( 7 wikis from T274784 )
  • 19:21 urbanecm@deploy1002: Synchronized wmf-config/config/: 1c46d0b: 1aad60b: vector: Expand Desktop Improvements pilot wiki group (T273090) (duration: 00m 58s)
  • 19:20 urbanecm@deploy1002: Synchronized dblists/desktop-improvements.dblist: 1c46d0b: 1aad60b: vector: Expand Desktop Improvements pilot wiki group (T273090) (duration: 00m 57s)
  • 19:14 bblack: cp-text: disabling puppet ahead of T274784 changes - https://gerrit.wikimedia.org/r/c/operations/puppet/+/669840
  • 19:10 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: e1cb988: Enable flood flag on hrwiki (T276560) (duration: 00m 58s)
  • 18:58 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: a855800: Fix sqwiki help panel links description (T275550) (duration: 00m 58s)
  • 18:47 robh@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 18:40 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: dfd9588: hiwiki: Add missing help panel link descriptions (T276450) (duration: 00m 58s)
  • 18:37 robh@cumin1001: START - Cookbook sre.dns.netbox
  • 18:36 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1116.eqiad.wmnet with reason: REIMAGE
  • 18:35 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1116.eqiad.wmnet with reason: REIMAGE
  • 18:33 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1115.eqiad.wmnet with reason: REIMAGE
  • 18:33 robh@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 18:31 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1115.eqiad.wmnet with reason: REIMAGE
  • 18:29 robh@cumin1001: START - Cookbook sre.dns.netbox
  • 18:11 elukey: drain + reimage an-worker11[15,16] to Buster
  • 17:40 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1114.eqiad.wmnet with reason: REIMAGE
  • 17:38 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1113.eqiad.wmnet with reason: REIMAGE
  • 17:37 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1114.eqiad.wmnet with reason: REIMAGE
  • 17:36 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1113.eqiad.wmnet with reason: REIMAGE
  • 17:12 elukey: drain + reimage an-worker11[13,14] to Buster
  • 16:45 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1110.eqiad.wmnet with reason: REIMAGE
  • 16:43 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1109.eqiad.wmnet with reason: REIMAGE
  • 16:43 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1110.eqiad.wmnet with reason: REIMAGE
  • 16:41 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1109.eqiad.wmnet with reason: REIMAGE
  • 16:17 elukey: drain + reimage an-worker1109/1110 to Buster
  • 15:55 marostegui: Restart db1115 (tendril host)
  • 15:47 marostegui@cumin1001: dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P14669 and previous config saved to /var/cache/conftool/dbconfig/20210308-154710-root.json
  • 15:32 marostegui@cumin1001: dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P14666 and previous config saved to /var/cache/conftool/dbconfig/20210308-153207-root.json
  • 15:18 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1108.eqiad.wmnet with reason: REIMAGE
  • 15:17 marostegui@cumin1001: dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P14665 and previous config saved to /var/cache/conftool/dbconfig/20210308-151703-root.json
  • 15:16 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1108.eqiad.wmnet with reason: REIMAGE
  • 15:16 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1107.eqiad.wmnet with reason: REIMAGE
  • 15:14 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1107.eqiad.wmnet with reason: REIMAGE
  • 15:07 otto@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Migrate PrefUpdate to EventGate on all wikis - T267348 (duration: 00m 59s)
  • 15:02 otto@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Remove wgEventLoggingSchemas overrides for Growth and WMDE Tech wishes schemas - T267333, etc. (duration: 00m 59s)
  • 15:02 marostegui@cumin1001: dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P14664 and previous config saved to /var/cache/conftool/dbconfig/20210308-150159-root.json
  • 14:54 elukey: drain + reimage an-worker110[7,8] to Buster
  • 14:51 otto@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 14:51 otto@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 14:48 otto@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 14:48 otto@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 14:18 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1106.eqiad.wmnet with reason: REIMAGE
  • 14:16 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1105.eqiad.wmnet with reason: REIMAGE
  • 14:16 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1106.eqiad.wmnet with reason: REIMAGE
  • 14:14 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1105.eqiad.wmnet with reason: REIMAGE
  • 13:51 elukey: drain + reimage an-worker110[4,5] to Buster
  • 13:07 marostegui@cumin1001: dbctl commit (dc=all): 'db1168 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P14663 and previous config saved to /var/cache/conftool/dbconfig/20210308-130712-root.json
  • 12:52 marostegui@cumin1001: dbctl commit (dc=all): 'db1168 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P14662 and previous config saved to /var/cache/conftool/dbconfig/20210308-125208-root.json
  • 12:37 marostegui@cumin1001: dbctl commit (dc=all): 'db1168 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P14661 and previous config saved to /var/cache/conftool/dbconfig/20210308-123704-root.json
  • 12:22 marostegui@cumin1001: dbctl commit (dc=all): 'db1168 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P14660 and previous config saved to /var/cache/conftool/dbconfig/20210308-122201-root.json
  • 12:20 urbanecm@deploy1002: Synchronized php-1.36.0-wmf.33/extensions/GrowthExperiments/includes/Mentorship/MentorHooks.php: 48d6c55: MentorHooks: Make mentor assignment follow same rules as HomepageHooks (T276720) (duration: 00m 58s)
  • 11:30 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1088.eqiad.wmnet with reason: REIMAGE
  • 11:28 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1088.eqiad.wmnet with reason: REIMAGE
  • 10:41 elukey: drain + reimage an-worker1104/1089 to Debian Buster
  • 10:19 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1084.eqiad.wmnet with reason: REIMAGE
  • 10:17 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1083.eqiad.wmnet with reason: REIMAGE
  • 10:15 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1084.eqiad.wmnet with reason: REIMAGE
  • 10:15 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1083.eqiad.wmnet with reason: REIMAGE
  • 10:01 marostegui: Repool clouddb1013:3311, clouddb1013:3313
  • 09:55 _joe_: uploading new versions of docker images: php7.{2,3}-{cli,fpm}, httpd, httpd-fcgi, mediawiki-httpd, memcached T276097 T265327
  • 09:34 _joe_: manually removed the old graphoid IP from scb server's interfaces (long-standing bug in wikimedia-lvs-realserver when removing the last managed IP)
  • 09:19 elukey: drain + reimage an-worker108[3,4] to Buster
  • 09:17 _joe_: regenerating puppet certs for scb200{1,2}
  • 08:56 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1082.eqiad.wmnet with reason: REIMAGE
  • 08:54 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1081.eqiad.wmnet with reason: REIMAGE
  • 08:53 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1082.eqiad.wmnet with reason: REIMAGE
  • 08:52 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1081.eqiad.wmnet with reason: REIMAGE
  • 08:21 godog: swift eqiad-prod: add weight to ms-be106[0-3] - T268435
  • 08:20 elukey: drain + reimage an-worker108[1,2] to Buster
  • 07:49 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1075.eqiad.wmnet with reason: REIMAGE
  • 07:47 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1075.eqiad.wmnet with reason: REIMAGE
  • 07:46 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1074.eqiad.wmnet with reason: REIMAGE
  • 07:44 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1074.eqiad.wmnet with reason: REIMAGE
  • 07:32 marostegui: Depool clouddb1013:3311, clouddb1013:3313 - T269211
  • 07:23 elukey: drain + reimage analytics107[4,5] to Buster
  • 07:14 marostegui@cumin1001: dbctl commit (dc=all): 'db1134 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P14657 and previous config saved to /var/cache/conftool/dbconfig/20210308-071443-root.json
  • 06:59 marostegui@cumin1001: dbctl commit (dc=all): 'db1134 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P14656 and previous config saved to /var/cache/conftool/dbconfig/20210308-065939-root.json
  • 06:53 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2116 T275633', diff saved to https://phabricator.wikimedia.org/P14655 and previous config saved to /var/cache/conftool/dbconfig/20210308-065300-marostegui.json
  • 06:52 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2092 T275633', diff saved to https://phabricator.wikimedia.org/P14654 and previous config saved to /var/cache/conftool/dbconfig/20210308-065220-marostegui.json
  • 06:49 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2146 T275633', diff saved to https://phabricator.wikimedia.org/P14653 and previous config saved to /var/cache/conftool/dbconfig/20210308-064953-marostegui.json
  • 06:44 marostegui: Set innodb_change_buffering = none on all parsercache hosts T263443
  • 06:44 marostegui@cumin1001: dbctl commit (dc=all): 'db1134 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P14652 and previous config saved to /var/cache/conftool/dbconfig/20210308-064436-root.json
  • 06:37 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1168 T276742', diff saved to https://phabricator.wikimedia.org/P14651 and previous config saved to /var/cache/conftool/dbconfig/20210308-063700-marostegui.json
  • 06:29 marostegui@cumin1001: dbctl commit (dc=all): 'db1134 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P14650 and previous config saved to /var/cache/conftool/dbconfig/20210308-062932-root.json
  • 06:23 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1166 T276742', diff saved to https://phabricator.wikimedia.org/P14649 and previous config saved to /var/cache/conftool/dbconfig/20210308-062350-marostegui.json
  • 06:21 ryankemper@cumin2001: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)

2021-03-07

  • 08:01 elukey: "megacli -LDSetProp -ForcedWB -Immediate -Lall -aAll" on analytics1066 - BBU looks fine, but the raid controller was using WriteThrough

2021-03-05

  • 23:16 legoktm: imported pygments 2.8.0+dfsg-1 to apt.wm.o buster-wikimedia component/pygments (T276298)
  • 21:36 pt1979@cumin2001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 21:32 pt1979@cumin2001: START - Cookbook sre.dns.netbox
  • 21:01 legoktm: updated udplog to 1.9 on mwlog1002.eqiad.wmnet and mwlog2002.codfw.wmnet
  • 20:48 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts deploy1001.eqiad.wmnet
  • 20:34 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts deploy1001.eqiad.wmnet
  • 20:15 legoktm@deploy1002: conftool action : set/pooled=no; selector: name=registry2002.codfw.wmnet
  • 20:15 legoktm@deploy1002: conftool action : set/pooled=no; selector: name=registry2001.codfw.wmnet
  • 20:12 legoktm@deploy1002: conftool action : set/pooled=yes; selector: name=registry2004.codfw.wmnet
  • 20:04 legoktm@deploy1002: conftool action : set/weight=10; selector: name=registry2004.codfw.wmnet
  • 20:04 legoktm@deploy1002: conftool action : set/pooled=no; selector: name=registry2004.codfw.wmnet
  • 20:02 legoktm@deploy1002: conftool action : set/pooled=no; selector: name=registry2004.codfw.wmnet
  • 19:30 legoktm@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host registry2004.codfw.wmnet
  • 19:14 legoktm@cumin1001: START - Cookbook sre.ganeti.makevm for new host registry2004.codfw.wmnet
  • 19:04 mutante: phab1001 - running public_task_dump.py (from cron job) manually
  • 18:50 legoktm@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts registry2004.eqiad.wmnet
  • 18:45 legoktm@cumin1001: START - Cookbook sre.hosts.decommission for hosts registry2004.eqiad.wmnet
  • 18:45 razzi@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb1021.eqiad.wmnet with reason: REIMAGE
  • 18:43 razzi@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb1021.eqiad.wmnet with reason: REIMAGE
  • 18:23 razzi@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 18:18 razzi@cumin1001: START - Cookbook sre.dns.netbox
  • 16:58 razzi@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 16:54 effie: depool mw1276 and pool back
  • 16:53 razzi@cumin1001: START - Cookbook sre.dns.netbox
  • 16:48 razzi: edit https://netbox.wikimedia.org/dcim/devices/2078/ device name from labsdb1012 to clouddb1021
  • 16:36 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1036.eqiad.wmnet
  • 16:30 razzi: delete non-mgmt interfaces for labsdb1012 at https://netbox.wikimedia.org/dcim/devices/2078/interfaces/
  • 16:28 razzi: rename https://netbox.wikimedia.org/ipam/ip-addresses/734/ DNS name from labsdb1012.mgmt.eqiad.wmnet to clouddb1021.mgmt.eqiad.wmnet
  • 16:22 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudvirt1036.eqiad.wmnet
  • 16:17 razzi@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts labsdb1012.eqiad.wmnet
  • 16:11 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1086.eqiad.wmnet with reason: REIMAGE
  • 16:09 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1086.eqiad.wmnet with reason: REIMAGE
  • 16:07 razzi@cumin1001: START - Cookbook sre.hosts.decommission for hosts labsdb1012.eqiad.wmnet
  • 15:56 razzi: stop mariadb on labsdb1012 to reimage and rename to clouddb1021: T269211
  • 15:39 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1073.eqiad.wmnet with reason: REIMAGE
  • 15:38 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 15:37 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1073.eqiad.wmnet with reason: REIMAGE
  • 15:29 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
  • 15:07 elukey: drain + reimage analytics1073 and an-worker1086 to Debian Buster
  • 14:24 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 14:20 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
  • 13:59 elukey@cumin1001: END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99)
  • 13:52 marostegui: Rebuild some indexes on db2102
  • 13:38 elukey@cumin1001: START - Cookbook sre.hadoop.roll-restart-masters
  • 13:38 marostegui@cumin1001: dbctl commit (dc=all): 'DEpool db1134', diff saved to https://phabricator.wikimedia.org/P14644 and previous config saved to /var/cache/conftool/dbconfig/20210305-133833-marostegui.json
  • 13:24 marostegui: Check tables on db1134
  • 12:31 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1035.eqiad.wmnet
  • 12:24 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudvirt1035.eqiad.wmnet
  • 11:28 marostegui: Temporarily set innodb_change_buffering = none on db1134 (s1) - T263443
  • 11:09 marostegui: Run check table on db2092, db2116, db2145, db2146 (there will be lag)
  • 10:54 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1034.eqiad.wmnet
  • 10:47 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudvirt1034.eqiad.wmnet
  • 10:43 jakob@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'production' .
  • 10:38 jakob@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'termbox' for release 'production' .
  • 10:32 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1033.eqiad.wmnet
  • 10:25 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudvirt1033.eqiad.wmnet
  • 09:54 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' .
  • 09:52 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 09:50 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 09:45 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 09:31 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' .
  • 09:31 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
  • 09:31 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 09:28 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1078.eqiad.wmnet with reason: REIMAGE
  • 09:28 jayme: switched back active kubernetes staging cluster to eqiad
  • 09:28 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'staging' .
  • 09:26 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1078.eqiad.wmnet with reason: REIMAGE
  • 09:26 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1079.eqiad.wmnet with reason: REIMAGE
  • 09:24 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1079.eqiad.wmnet with reason: REIMAGE
  • 09:21 filippo@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ms-be1034.eqiad.wmnet
  • 09:19 jakob@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
  • 09:12 filippo@cumin1001: START - Cookbook sre.hosts.decommission for hosts ms-be1034.eqiad.wmnet
  • 08:44 jynus@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup2003.codfw.wmnet with reason: REIMAGE
  • 08:42 jynus@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on backup2003.codfw.wmnet with reason: REIMAGE
  • 08:32 elukey: drain + reimage an-worker107[8,9] to Debian Buster
  • 08:01 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1071.eqiad.wmnet with reason: REIMAGE
  • 07:59 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1070.eqiad.wmnet with reason: REIMAGE
  • 07:59 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1071.eqiad.wmnet with reason: REIMAGE
  • 07:57 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1070.eqiad.wmnet with reason: REIMAGE
  • 07:33 elukey: drain + reimage analytics107[0-1] to Debian Buster
  • 06:51 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2092', diff saved to https://phabricator.wikimedia.org/P14640 and previous config saved to /var/cache/conftool/dbconfig/20210305-065137-marostegui.json
  • 06:17 legoktm: uploaded udplog 1.9 (buster-wikimedia) to apt.wikimedia.org (T276421)
  • 00:59 legoktm: depooled registry1001/registry1002 (old stretch VMs) - T272550
  • 00:59 legoktm@deploy1002: conftool action : set/pooled=no; selector: name=registry1002.eqiad.wmnet
  • 00:58 legoktm@deploy1002: conftool action : set/pooled=no; selector: name=registry1001.eqiad.wmnet
  • 00:58 legoktm@deploy1002: conftool action : set/pooled=yes; selector: name=registry1004.eqiad.wmnet
  • 00:57 legoktm@deploy1002: conftool action : set/pooled=no; selector: name=registry1004.eqiad.wmnet
  • 00:57 legoktm@deploy1002: conftool action : set/pooled=inactive; selector: name=registry1004.eqiad.codfw
  • 00:56 legoktm@deploy1002: conftool action : set/weight=10; selector: name=registry1004.eqiad.wmnet
  • 00:55 legoktm@deploy1002: conftool action : set/pooled=no; selector: name=registry1004.eqiad.codfw
  • 00:50 ryankemper: T266470 [ats] `sudo cumin 'A:cp-ats' 'sudo run-puppet-agent'`
  • 00:47 ryankemper: T266470 [ats] Deploying new mappings for `query-preview.wikidata.org` microsite: https://gerrit.wikimedia.org/r/c/operations/puppet/+/668173/
  • 00:41 ebernhardson@deploy1002: Finished deploy [wikimedia/discovery/analytics@4cc913e]: correct refinery-drop-older-than checksum (duration: 01m 34s)
  • 00:39 ryankemper: T266470 Ran `sudo run-puppet-agent` on `miscweb1002` without issue; `/var/log/apache2/query*.log` looks as expected
  • 00:39 ebernhardson@deploy1002: Started deploy [wikimedia/discovery/analytics@4cc913e]: correct refinery-drop-older-than checksum
  • 00:36 ryankemper: T266470 Deploying new `query-preview` microsite: https://gerrit.wikimedia.org/r/c/operations/puppet/+/668543
  • 00:23 legoktm@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host registry2004.eqiad.wmnet
  • 00:06 legoktm@cumin1001: START - Cookbook sre.ganeti.makevm for new host registry2004.eqiad.wmnet

2021-03-04

  • 23:55 legoktm@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host registry1004.eqiad.wmnet
  • 23:39 legoktm@cumin1001: START - Cookbook sre.ganeti.makevm for new host registry1004.eqiad.wmnet
  • 20:12 urbanecm@deploy1002: Synchronized wmf-config/config/hiwiki.yaml: c6b04cb: Enable Growth features on hiwiki in stealth mode (T276450; 3/3) (duration: 00m 58s)
  • 20:11 urbanecm@deploy1002: Synchronized dblists/growthexperiments.dblist: c6b04cb: Enable Growth features on hiwiki in stealth mode (T276450; 2/3) (duration: 00m 57s)
  • 20:10 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: c6b04cb: Enable Growth features on hiwiki in stealth mode (T276450; 1/3) (duration: 00m 57s)
  • 20:08 urbanecm@deploy1002: Synchronized php-1.36.0-wmf.33/extensions/GrowthExperiments/includes/HomepageModules/Help.php: 8cc65e3: cleanup: Remove help panel URL from Help homepage module (T276450; T273118) (duration: 00m 58s)
  • 19:33 rzl: restarted apache and php7.0-fpm on doc1001 due to staleness
  • 19:21 urbanecm@deploy1002: Synchronized wmf-config/config/sqwiki.yaml: 377bc4f: Enable Growth features on sqwiki in stealth mode (T275550; 3/3) (duration: 00m 57s)
  • 19:20 urbanecm@deploy1002: Synchronized dblists/growthexperiments.dblist: 377bc4f: Enable Growth features on sqwiki in stealth mode (T275550; 2/3) (duration: 00m 57s)
  • 19:19 dwisehaupt: replication restarted on frdb1004 after utf8mb4 conversion completed.
  • 19:18 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 377bc4f: Enable Growth features on sqwiki in stealth mode (T275550; 1/3) (duration: 00m 57s)
  • 19:11 jforrester@deploy1002: Synchronized php-1.36.0-wmf.33/extensions/FlaggedRevs/frontend/specialpages/reports/ProblemChanges.php: T276386 Fix fatal calls to getConfig (duration: 01m 12s)
  • 19:06 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 18:59 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
  • 18:26 jynus@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on backup2003.codfw.wmnet with reason: REIMAGE
  • 18:25 jynus@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on backup2003.codfw.wmnet with reason: REIMAGE
  • 17:39 mutante: [deneb:~] $ sudo systemctl start cowbuilder_update_jessie-amd64
  • 17:25 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 17:20 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
  • 17:11 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on deploy1001.eqiad.wmnet with reason: decom
  • 17:11 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on deploy1001.eqiad.wmnet with reason: decom
  • 17:05 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1032.eqiad.wmnet
  • 16:59 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudvirt1032.eqiad.wmnet
  • 16:56 tarrow@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' .
  • 16:56 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1069.eqiad.wmnet with reason: REIMAGE
  • 16:54 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1068.eqiad.wmnet with reason: REIMAGE
  • 16:54 pt1979@cumin2001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 16:54 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1069.eqiad.wmnet with reason: REIMAGE
  • 16:53 tarrow@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
  • 16:52 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1068.eqiad.wmnet with reason: REIMAGE
  • 16:47 pt1979@cumin2001: START - Cookbook sre.dns.netbox
  • 16:39 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1031.eqiad.wmnet
  • 16:33 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudvirt1031.eqiad.wmnet
  • 16:23 jakob@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
  • 16:20 jakob@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' .
  • 16:13 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1026.eqiad.wmnet
  • 16:12 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2145', diff saved to https://phabricator.wikimedia.org/P14635 and previous config saved to /var/cache/conftool/dbconfig/20210304-161226-marostegui.json
  • 16:08 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudvirt1026.eqiad.wmnet
  • 16:02 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1025.eqiad.wmnet
  • 15:55 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudvirt1025.eqiad.wmnet
  • 15:52 jakob@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' .
  • 15:42 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1024.eqiad.wmnet
  • 15:28 jakob@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
  • 15:28 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1067.eqiad.wmnet with reason: REIMAGE
  • 15:26 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1066.eqiad.wmnet with reason: REIMAGE
  • 15:26 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1067.eqiad.wmnet with reason: REIMAGE
  • 15:24 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1066.eqiad.wmnet with reason: REIMAGE
  • 15:21 jakob@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
  • 15:12 elukey: drain + reimage analytics106[6,7] to Debian Buster
  • 15:11 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudvirt1024.eqiad.wmnet
  • 14:40 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1065.eqiad.wmnet with reason: REIMAGE
  • 14:38 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1065.eqiad.wmnet with reason: REIMAGE
  • 14:38 pt1979@cumin2001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 14:35 jayme@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 14:34 pt1979@cumin2001: START - Cookbook sre.dns.netbox
  • 14:30 jayme@cumin1001: START - Cookbook sre.dns.netbox
  • 14:23 jayme@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts neon.eqiad.wmnet
  • 14:18 jayme@cumin1001: START - Cookbook sre.hosts.decommission for hosts neon.eqiad.wmnet
  • 14:15 jayme@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts neon.eqiad.wmnet
  • 14:15 jayme@cumin1001: START - Cookbook sre.hosts.decommission for hosts neon.eqiad.wmnet
  • 14:04 liw@deploy1002: rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.33
  • 13:55 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1064.eqiad.wmnet with reason: REIMAGE
  • 13:53 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1063.eqiad.wmnet with reason: REIMAGE
  • 13:52 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1064.eqiad.wmnet with reason: REIMAGE
  • 13:50 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1063.eqiad.wmnet with reason: REIMAGE
  • 13:48 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1001.eqiad.wmnet
  • 13:45 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2116', diff saved to https://phabricator.wikimedia.org/P14632 and previous config saved to /var/cache/conftool/dbconfig/20210304-134521-marostegui.json
  • 13:44 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet
  • 13:44 volans: uploaded spicerack_0.0.49 to apt.wikimedia.org buster-wikimedia
  • 13:35 moritzm: restarting mw canaries for libzstd update
  • 13:32 elukey: drain + reimage analytics10[63,64] to Debian Buster
  • 13:29 moritzm: installing libzstd security updates on Buster
  • 13:13 marostegui@cumin1001: dbctl commit (dc=all): 'Add db2146 to dbctl T275633', diff saved to https://phabricator.wikimedia.org/P14631 and previous config saved to /var/cache/conftool/dbconfig/20210304-131301-marostegui.json
  • 13:10 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1062.eqiad.wmnet with reason: REIMAGE
  • 13:08 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1061.eqiad.wmnet with reason: REIMAGE
  • 13:07 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1062.eqiad.wmnet with reason: REIMAGE
  • 13:06 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1061.eqiad.wmnet with reason: REIMAGE
  • 12:48 elukey: drain + reimage analytics10[61,62] to Debian Buster
  • 12:45 jakob@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
  • 12:40 mbsantos@deploy1002: Finished deploy [tilerator/deploy@6fcbb9f]: (no justification provided) (duration: 00m 14s)
  • 12:40 wmde-fisch@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: Remove conflicting gadget configuration for hewiki (T276330) (duration: 01m 12s)
  • 12:40 mbsantos@deploy1002: Started deploy [tilerator/deploy@6fcbb9f]: (no justification provided)
  • 12:34 jakob@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
  • 12:11 kormat@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db1115.eqiad.wmnet,dbmonitor1001.wikimedia.org with reason: Restart db1115 to fix memory leak
  • 12:11 kormat@cumin1001: START - Cookbook sre.hosts.downtime for 0:30:00 on db1115.eqiad.wmnet,dbmonitor1001.wikimedia.org with reason: Restart db1115 to fix memory leak
  • 12:10 jakob@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
  • 12:00 marostegui: Stop mysql on db1117:3321 to clone db1159
  • 11:42 jakob@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
  • 11:40 marostegui@cumin1001: dbctl commit (dc=all): 'Add db2145 to s1 (and repool db2116) - T275633', diff saved to https://phabricator.wikimedia.org/P14625 and previous config saved to /var/cache/conftool/dbconfig/20210304-114052-marostegui.json
  • 11:28 marostegui@cumin1001: dbctl commit (dc=all): 'Add db2145 into dbctl depooled - T275633', diff saved to https://phabricator.wikimedia.org/P14624 and previous config saved to /var/cache/conftool/dbconfig/20210304-112848-marostegui.json
  • 11:27 _joe_: restarted redis on mc2027 to pick up the replication change
  • 11:14 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1059.eqiad.wmnet with reason: REIMAGE
  • 11:11 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1059.eqiad.wmnet with reason: REIMAGE
  • 11:10 kormat@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Needs fixing after T274472
  • 11:10 kormat@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Needs fixing after T274472
  • 11:08 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1022.eqiad.wmnet
  • 11:04 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1060.eqiad.wmnet with reason: REIMAGE
  • 11:02 dcaro@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudvirt1022.eqiad.wmnet
  • 11:02 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1060.eqiad.wmnet with reason: REIMAGE
  • 10:40 elukey: drain + reimage analytics1059/1060 to Debian Buster
  • 10:32 moritzm: uploaded screen 4.2.1-3+deb8u1+wmf1 to jessie-wikimedia
  • 09:32 elukey: install linux 5.10 on an-worker[1097-1101] (GPU workers) and reboot them
  • 09:30 kormat: disabling puppet on all db hosts while deploying a puppet monitoring change T275497
  • 09:19 moritzm: uploaded udplog 1.8.5+deb10u1 to buster-wikimedia
  • 08:45 elukey@deploy1002: Finished deploy [analytics/refinery@605f8b8]: Fix for geoeditors monthly job (duration: 11m 03s)
  • 08:33 elukey@deploy1002: Started deploy [analytics/refinery@605f8b8]: Fix for geoeditors monthly job
  • 07:38 elukey: reboot an-worker1096 to pick up 5.10 kernel
  • 06:25 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1088 T276025', diff saved to https://phabricator.wikimedia.org/P14622 and previous config saved to /var/cache/conftool/dbconfig/20210304-062503-marostegui.json
  • 06:11 marostegui: Stop MySQL on db2116 to clone db2145 T275633
  • 06:11 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2116 T275633', diff saved to https://phabricator.wikimedia.org/P14621 and previous config saved to /var/cache/conftool/dbconfig/20210304-061134-marostegui.json
  • 05:20 kart_: Updated apertium to 2021-03-03-170806-production (T274262)
  • 05:15 kartik@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'apertium' for release 'production' .
  • 05:11 kartik@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'apertium' for release 'production' .
  • 05:10 kartik@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'staging' .
  • 01:24 twentyafterfour: phabricator upgrade complete
  • 01:22 twentyafterfour: restarting php7.3-fpm on phab1001 to complete phabricator upgrade
  • 00:02 ebernhardson@deploy1002: Finished deploy [wikimedia/discovery/analytics@e47f735]: search_satisfaction_daily: make files readable by druid ingestion (duration: 25m 35s)

2021-03-03

  • 23:36 ebernhardson@deploy1002: Started deploy [wikimedia/discovery/analytics@e47f735]: search_satisfaction_daily: make files readable by druid ingestion
  • 23:08 legoktm@deploy1002: conftool action : set/pooled=yes; selector: name=registry2003.codfw.wmnet
  • 22:56 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mwmaint2001.codfw.wmnet
  • 22:51 legoktm@deploy1002: conftool action : set/weight=10; selector: name=registry2003.codfw.wmnet
  • 22:50 legoktm@deploy1002: conftool action : set/pooled=no; selector: name=registry2003.codfw.wmnet
  • 22:47 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts mwmaint2001.codfw.wmnet
  • 22:05 legoktm@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host registry2003.codfw.wmnet
  • 21:58 mutante: puppetmaster1001 - signing puppet cert for gitlab1001.wikimedia.org (T274459)
  • 21:53 ebernhardson@deploy1002: Finished deploy [wikimedia/discovery/analytics@7f37d40]: replace refinery-drop-hive-partitions with refinery-drop-older-than (duration: 01m 37s)
  • 21:51 ebernhardson@deploy1002: Started deploy [wikimedia/discovery/analytics@7f37d40]: replace refinery-drop-hive-partitions with refinery-drop-older-than
  • 21:50 legoktm@cumin1001: START - Cookbook sre.ganeti.makevm for new host registry2003.codfw.wmnet
  • 21:30 legoktm@deploy1002: conftool action : set/pooled=yes; selector: name=registry1003.eqiad.wmnet
  • 21:25 legoktm@deploy1002: conftool action : set/weight=10; selector: name=registry1003.eqiad.wmnet
  • 21:21 legoktm@deploy1002: conftool action : set/pooled=no; selector: name=registry1003.eqiad.wmnet
  • 21:16 legoktm@deploy1002: conftool action : set/pooled=yes; selector: name=registry1002.eqiad.wmnet
  • 20:35 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host gitlab1001.wikimedia.org
  • 20:17 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Revert: Enable Growth features on sqwiki in stealth mode (T275550) (duration: 01m 10s)
  • 20:05 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 0120778: Enable Growth features on sqwiki in stealth mode (T275550) (duration: 01m 09s)
  • 20:02 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ms-backup2002.codfw.wmnet with reason: REIMAGE
  • 20:00 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-backup2002.codfw.wmnet with reason: REIMAGE
  • 19:57 urbanecm@deploy1002: Synchronized php-1.36.0-wmf.33/extensions/GrowthExperiments/: 4cba184: Help panel: Do not require help desk to be configured (T273118) (duration: 01m 10s)
  • 19:53 urbanecm@deploy1002: Synchronized php-1.36.0-wmf.32/extensions/GrowthExperiments/: a036d9f: Help panel: Do not require help desk to be configured (T273118) (duration: 01m 10s)
  • 19:48 legoktm@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host registry1003.eqiad.wmnet
  • 19:41 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm for new host gitlab1001.wikimedia.org
  • 19:40 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 7acb37c: dawiki: Deploy Growth features to newcomers (T256126) (duration: 01m 09s)
  • 19:38 urbanecm@deploy1002: sync-file aborted: 7acb37c: dawiki: Deploy Growth features to newcomers (duration: 00m 03s)
  • 19:33 legoktm@cumin1001: START - Cookbook sre.ganeti.makevm for new host registry1003.eqiad.wmnet
  • 19:30 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 19:28 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 7221371: rowiki: Make Growth features available to ro newcomers (T275130) (duration: 01m 10s)
  • 19:21 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
  • 19:14 urbanecm@deploy1002: Synchronized php-1.36.0-wmf.33/extensions/WikibaseMediaInfo/src/Special/SpecialMediaSearch.php: b741dc3: Also requet timestamp|snippet from non-page results (T271174; T276353) (duration: 01m 09s)
  • 19:08 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 19:00 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 18:58 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ms-backup2001.codfw.wmnet with reason: REIMAGE
  • 18:56 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-backup2001.codfw.wmnet with reason: REIMAGE
  • 18:53 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
  • 18:52 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'staging' .
  • 18:51 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
  • 18:49 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' .
  • 18:49 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
  • 18:47 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
  • 18:46 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'sessionstore' for release 'staging' .
  • 18:45 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'recommendation-api' for release 'production' .
  • 18:43 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts deploy2001.codfw.wmnet
  • 18:43 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'main' .
  • 18:42 legoktm: uploaded python3-docker-report 0.0.11 to buster-wikimedia
  • 18:40 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' .
  • 18:39 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
  • 18:37 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'staging' .
  • 18:37 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 18:36 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' .
  • 18:36 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts deploy2001.codfw.wmnet
  • 18:32 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
  • 18:30 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 18:30 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
  • 18:29 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 18:27 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
  • 18:27 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' .
  • 18:26 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'staging' .
  • 18:26 otto@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
  • 18:26 otto@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 18:25 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
  • 18:24 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'staging' .
  • 18:24 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'production' .
  • 18:23 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' .
  • 18:22 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 18:21 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 18:20 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
  • 18:17 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
  • 18:17 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
  • 18:17 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'recommendation-api' for release 'production' .
  • 18:17 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' .
  • 18:16 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'production' .
  • 18:16 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'staging' .
  • 18:15 otto@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 18:15 otto@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
  • 18:12 jayme@deploy1002: helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'.
  • 18:09 jayme@deploy1002: helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
  • 17:56 otto@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
  • 17:56 otto@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 17:49 otto@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
  • 17:49 otto@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 17:46 otto@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 17:31 jayme@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubestage1002.eqiad.wmnet with reason: REIMAGE
  • 17:31 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage1002.eqiad.wmnet with reason: REIMAGE
  • 17:29 jayme@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubestage1001.eqiad.wmnet with reason: REIMAGE
  • 17:29 jayme@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage1001.eqiad.wmnet with reason: REIMAGE
  • 17:16 dwisehaupt: correction for last log with correct host - stopping mysql replication on frdb1004 and starting utf8mb4 table alters under a root screen session
  • 17:15 dwisehaupt: stopping mysql replication on frdb2001 and starting utf8mb4 table alters under a root screen session
  • 17:14 otto@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Set destination_event_serivce: eventgate-main for rdf-streaming-updater streams - T273901 (duration: 01m 08s)
  • 17:13 elukey@cumin1001: END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0)
  • 17:09 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on maps1009.eqiad.wmnet with reason: Resyncing database from scratch
  • 17:09 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on maps1009.eqiad.wmnet with reason: Resyncing database from scratch
  • 17:09 elukey@cumin1001: START - Cookbook sre.aqs.roll-restart
  • 16:40 dzahn@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 16:36 dzahn@cumin1001: START - Cookbook sre.dns.netbox
  • 16:33 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts gitlab1001.eqiad.wmnet
  • 16:30 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts gitlab1001.eqiad.wmnet
  • 16:29 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts gitlab1002.eqiad.wmnet
  • 16:28 otto@deploy1002: Synchronized wmf-config/InitialiseSettings.php: canary_events_enabled: true for rdf-streaming-updater streams - T273901 (duration: 01m 49s)
  • 16:26 mutante: deleting gitlab VMs - we have to start over and decom old VMs, then create new VMs with public IPs (T274459)
  • 16:25 dzahn@cumin1001: START - Cookbook sre.hosts.decommission for hosts gitlab1002.eqiad.wmnet
  • 16:23 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on gitlab1002.eqiad.wmnet with reason: decom
  • 16:23 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on gitlab1002.eqiad.wmnet with reason: decom
  • 16:23 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on gitlab1001.eqiad.wmnet with reason: decom
  • 16:23 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on gitlab1001.eqiad.wmnet with reason: decom
  • 16:18 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagetcd1006.eqiad.wmnet
  • 16:15 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagetcd1005.eqiad.wmnet
  • 16:14 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host kubestagetcd1006.eqiad.wmnet
  • 16:12 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host kubestagetcd1005.eqiad.wmnet
  • 16:11 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagetcd1004.eqiad.wmnet
  • 16:09 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host kubestagetcd1004.eqiad.wmnet
  • 16:07 jayme@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts neon.eqiad.wmnet
  • 16:05 jayme@cumin1001: START - Cookbook sre.hosts.decommission for hosts neon.eqiad.wmnet
  • 15:55 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1021.eqiad.wmnet
  • 15:34 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudvirt1021.eqiad.wmnet
  • 15:27 jayme: staging.svc.eqiad.wmnet now (temporarily) points to the staging-codfw kubernetes cluster (during upgrade in eqiad)
  • 15:27 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'staging' .
  • 15:26 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 15:25 jayme@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
  • 15:23 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1027.eqiad.wmnet
  • 15:19 liw@deploy1002: Synchronized php: group1 wikis to 1.36.0-wmf.33 (duration: 01m 08s)
  • 15:18 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single for host mc1027.eqiad.wmnet
  • 15:18 liw@deploy1002: rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.33
  • 15:13 urbanecm@deploy1002: Synchronized php-1.36.0-wmf.33/extensions/CentralAuth/: af899b6: Transform the first parameter to string (T276316) (duration: 01m 11s)
  • 14:48 effie: upgrade memcached on mc1027,mc2027
  • 14:41 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1101.eqiad.wmnet with reason: REIMAGE
  • 14:39 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1100.eqiad.wmnet with reason: REIMAGE
  • 14:39 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1101.eqiad.wmnet with reason: REIMAGE
  • 14:37 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1099.eqiad.wmnet with reason: REIMAGE
  • 14:37 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1100.eqiad.wmnet with reason: REIMAGE
  • 14:35 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1099.eqiad.wmnet with reason: REIMAGE
  • 14:06 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1018.eqiad.wmnet
  • 13:58 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudvirt1018.eqiad.wmnet
  • 13:09 godog: swift eqiad-prod: remove ssd weight for ms-be1034 - T276193
  • 12:54 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1017.eqiad.wmnet
  • 12:48 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudvirt1017.eqiad.wmnet
  • 12:42 Urbanecm: Deploy a security patch for T276306
  • 12:29 urbanecm@deploy1002: Synchronized php-1.36.0-wmf.33/extensions/GrowthExperiments/ServiceWiring.php: cf635b4: Do not open DB connections during service initialization (T276307) (duration: 01m 11s)
  • 12:26 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1016.eqiad.wmnet
  • 12:26 Urbanecm: urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 90a205f: Add ReferenceTooltips and other gadget names for ReferencePreviews (T274353) (duration: 01m 10s)
  • 12:20 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudvirt1016.eqiad.wmnet
  • 12:04 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1014.eqiad.wmnet
  • 11:58 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudvirt1014.eqiad.wmnet
  • 11:33 marostegui@cumin1001: dbctl commit (dc=all): 'db1175 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P14616 and previous config saved to /var/cache/conftool/dbconfig/20210303-113349-root.json
  • 11:18 marostegui@cumin1001: dbctl commit (dc=all): 'db1175 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P14615 and previous config saved to /var/cache/conftool/dbconfig/20210303-111843-root.json
  • 11:07 filippo@cumin1001: conftool action : set/pooled=true; selector: dnsdisc=swift,name=codfw
  • 11:03 marostegui@cumin1001: dbctl commit (dc=all): 'db1175 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P14614 and previous config saved to /var/cache/conftool/dbconfig/20210303-110339-root.json
  • 10:57 marostegui@cumin1001: dbctl commit (dc=all): 'db1164 (re)pooling @ 100%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14613 and previous config saved to /var/cache/conftool/dbconfig/20210303-105726-root.json
  • 10:48 marostegui@cumin1001: dbctl commit (dc=all): 'db1175 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P14612 and previous config saved to /var/cache/conftool/dbconfig/20210303-104836-root.json
  • 10:45 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1175 for schema change', diff saved to https://phabricator.wikimedia.org/P14611 and previous config saved to /var/cache/conftool/dbconfig/20210303-104522-marostegui.json
  • 10:43 marostegui@cumin1001: dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P14610 and previous config saved to /var/cache/conftool/dbconfig/20210303-104302-root.json
  • 10:42 marostegui@cumin1001: dbctl commit (dc=all): 'db1164 (re)pooling @ 90%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14609 and previous config saved to /var/cache/conftool/dbconfig/20210303-104223-root.json
  • 10:38 jbond42: upload new wmf-laptop 0.5.0 package
  • 10:37 vgutierrez: rolling restart of ats-tls on eqiad
  • 10:34 akosiaris@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 10:34 akosiaris@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 10:34 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 10:27 marostegui@cumin1001: dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P14608 and previous config saved to /var/cache/conftool/dbconfig/20210303-102758-root.json
  • 10:27 marostegui@cumin1001: dbctl commit (dc=all): 'db1164 (re)pooling @ 75%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14607 and previous config saved to /var/cache/conftool/dbconfig/20210303-102719-root.json
  • 10:25 jayme@deploy1002: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 10:25 jayme@deploy1002: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 10:12 marostegui@cumin1001: dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P14605 and previous config saved to /var/cache/conftool/dbconfig/20210303-101255-root.json
  • 10:12 marostegui@cumin1001: dbctl commit (dc=all): 'db1164 (re)pooling @ 60%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14604 and previous config saved to /var/cache/conftool/dbconfig/20210303-101215-root.json
  • 10:05 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet1003.eqiad.wmnet
  • 10:00 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudnet1003.eqiad.wmnet
  • 10:00 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet1003.eqiad.wmnet
  • 09:57 marostegui@cumin1001: dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P14602 and previous config saved to /var/cache/conftool/dbconfig/20210303-095751-root.json
  • 09:57 marostegui@cumin1001: dbctl commit (dc=all): 'db1164 (re)pooling @ 50%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14601 and previous config saved to /var/cache/conftool/dbconfig/20210303-095712-root.json
  • 09:55 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on cloudnet1003.eqiad.wmnet with reason: HW issue
  • 09:54 aborrero@cumin1001: START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on cloudnet1003.eqiad.wmnet with reason: HW issue
  • 09:54 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1166 for schema change', diff saved to https://phabricator.wikimedia.org/P14600 and previous config saved to /var/cache/conftool/dbconfig/20210303-095417-marostegui.json
  • 09:53 marostegui@cumin1001: dbctl commit (dc=all): 'db1157 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P14599 and previous config saved to /var/cache/conftool/dbconfig/20210303-095351-root.json
  • 09:42 marostegui@cumin1001: dbctl commit (dc=all): 'db1164 (re)pooling @ 30%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14598 and previous config saved to /var/cache/conftool/dbconfig/20210303-094208-root.json
  • 09:41 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker[1132,1135-1138].eqiad.wmnet
  • 09:39 elukey@cumin1001: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1132,1135-1138].eqiad.wmnet
  • 09:38 marostegui@cumin1001: dbctl commit (dc=all): 'db1157 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P14597 and previous config saved to /var/cache/conftool/dbconfig/20210303-093847-root.json
  • 09:31 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudnet1003.eqiad.wmnet
  • 09:30 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1138.eqiad.wmnet with reason: REIMAGE
  • 09:28 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1138.eqiad.wmnet with reason: REIMAGE
  • 09:28 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1137.eqiad.wmnet with reason: REIMAGE
  • 09:27 marostegui@cumin1001: dbctl commit (dc=all): 'db1164 (re)pooling @ 25%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14596 and previous config saved to /var/cache/conftool/dbconfig/20210303-092705-root.json
  • 09:25 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1137.eqiad.wmnet with reason: REIMAGE
  • 09:23 marostegui@cumin1001: dbctl commit (dc=all): 'db1157 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P14595 and previous config saved to /var/cache/conftool/dbconfig/20210303-092343-root.json
  • 09:16 jayme@deploy1002: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 09:16 jayme@deploy1002: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 09:12 marostegui@cumin1001: dbctl commit (dc=all): 'db1164 (re)pooling @ 15%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14594 and previous config saved to /var/cache/conftool/dbconfig/20210303-091201-root.json
  • 09:08 marostegui@cumin1001: dbctl commit (dc=all): 'db1157 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P14593 and previous config saved to /var/cache/conftool/dbconfig/20210303-090840-root.json
  • 09:02 zpapierski@deploy1002: Finished deploy [wdqs/wdqs@dbfd1f6]: Deploying emergency fix - WDQS 0.3.66 (duration: 08m 17s)
  • 09:00 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1157 for schema change', diff saved to https://phabricator.wikimedia.org/P14592 and previous config saved to /var/cache/conftool/dbconfig/20210303-090030-marostegui.json
  • 08:56 marostegui@cumin1001: dbctl commit (dc=all): 'db1164 (re)pooling @ 10%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14591 and previous config saved to /var/cache/conftool/dbconfig/20210303-085658-root.json
  • 08:54 zpapierski@deploy1002: Started deploy [wdqs/wdqs@dbfd1f6]: Deploying emergency fix - WDQS 0.3.66
  • 08:50 marostegui@cumin1001: dbctl commit (dc=all): 'Increase weight for db1164 in s1 T258361', diff saved to https://phabricator.wikimedia.org/P14590 and previous config saved to /var/cache/conftool/dbconfig/20210303-085014-marostegui.json
  • 08:48 test: tcpircbot --joe
  • 08:40 elukey@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on an-worker1136.eqiad.wmnet with reason: REIMAGE
  • 08:40 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1136.eqiad.wmnet with reason: REIMAGE
  • 08:32 godog: stop/mask tcpircbot-logmsgbot on pontoon-icinga-01 - T276299
  • 07:30 _joe_: test
  • 07:17 _joe_: test log
  • 06:41 marostegui: Testing log
  • 06:27 ryankemper: T275345 T274555 `sudo confctl select 'name=elastic2054.codfw.wmnet' set/pooled=yes` on `ryankemper@puppetmaster1001`
  • 06:26 ryankemper: T275345 T274555 `sudo confctl select 'name=elastic2045.codfw.wmnet' set/pooled=yes` on `ryankemper@puppetmaster1001`
  • 06:21 ryankemper: T275345 T274555 Re-pooling `elastic2045` and `elastic2054` (commands follow)
  • 06:20 ryankemper: T275345 T274555 `curl -H 'Content-Type: application/json' -XPUT http://localhost:9400/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_name": null,"_ip": null}}}'` => `{"acknowledged":true,"persistent":{},"transient":{}}`
  • 06:18 ryankemper: T275345 T274555 `curl -H 'Content-Type: application/json' -XPUT http://localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_name": null,"_ip": null}}}'` => `{"acknowledged":true,"persistent":{},"transient":{}}`
  • 06:17 ryankemper: T275345 T274555 Unbanning `elastic2045` and `elastic2054` from our cluster now that both hosts have been re-imaged and are running without errors (commands follow)
  • 06:15 ryankemper: T274555 Removed downtime for `elastic2054`
  • 05:32 ryankemper: T274555 `sudo -i wmf-auto-reimage-host --conftool -p T274555 elastic2054.codfw.wmnet` on `ryankemper@cumin2001` tmux session `elastic_reimage_elastic2054`
  • 05:27 ryankemper: Downtime `wdqs1012` until `2021-03-03 19:25:40` (~14 hours from now). Its `wdqs-updater` is failing; ultimately its blazegraph journal is probably in a bad state, meaning we'd have to copy one over from a healthy node, but not kicking that off right now so that we can investigate a little bit first
  • 05:16 ryankemper: T275345 `ryankemper@elastic2045:~$ sudo apt-get upgrade wmf-elasticsearch-search-plugins`
  • 03:50 ryankemper: Depooled `wdqs1012` until I've got its updater back online
  • 03:24 ryankemper: `ryankemper@wdqs1012:~$ sudo systemctl restart wdqs-blazegraph` ~2 mins ago
  • 02:45 ejegg: updated fundraising CiviCRM from e1dacbe348 to b13e70d968
  • 02:09 ejegg: updated payments-wiki from 365bf54393 to 65dbf0ed9d
  • 00:42 Urbanecm: Finished deployment in Evening B&C window; logmsgbot is currently down, and a simple restart did not bring it back up
  • 00:41 Urbanecm: 00:40:16 Synchronized wmf-config/config/idwiki.yaml: 80edca8: Enable Growth features in idwiki in stealth mode (T259024; 3/3) (duration: 01m 09s)
  • 00:38 Urbanecm: 00:38:12 Synchronized dblists/growthexperiments.dblist: 80edca8: Enable Growth features in idwiki in stealth mode (T259024; 2/3) (duration: 01m 10s)
  • 00:31 Urbanecm: 00:31:26 Synchronized wmf-config/InitialiseSettings.php: 80edca8: Enable Growth features in idwiki in stealth mode (T259024; 1/3) (duration: 01m 11s)
  • 00:21 dwisehaupt: replication restarted on frdb2001 after utf8mb4 conversion completed.
  • 00:21 mutante: alert1001 systemctl restart tcpircbot-logmsgbot
  • 00:08 urbanecm@deploy1002: sync-file aborted: 80edca8: Enable Growth features in idwiki in stealth mode (T259024; 1/3) (duration: 06m 45s)

2021-03-02

  • 23:52 mutante: mwmaint2002 - find /home -nouser -delete
  • 23:42 shdubsh: restart kibana to finalize phatality 7.10 deployment
  • 23:38 twentyafterfour@deploy1002: Finished deploy [releng/phatality@4d0f053]: sudoer rules fixed, trying again: deploy phatality (duration: 00m 06s)
  • 23:38 twentyafterfour@deploy1002: Started deploy [releng/phatality@4d0f053]: sudoer rules fixed, trying again: deploy phatality
  • 23:27 twentyafterfour@deploy1002: Finished deploy [releng/phatality@4d0f053]: trying again: deploy phatality 7.10 (duration: 00m 37s)
  • 23:27 twentyafterfour@deploy1002: Started deploy [releng/phatality@4d0f053]: trying again: deploy phatality 7.10
  • 23:22 twentyafterfour@deploy1002: Finished deploy [releng/phatality@4d0f053]: deploy phatality 7.10 (duration: 00m 05s)
  • 23:22 twentyafterfour@deploy1002: Started deploy [releng/phatality@4d0f053]: deploy phatality 7.10
  • 23:20 twentyafterfour@deploy1002: Finished deploy [releng/phatality@4d0f053]: deploy phatality 7.10 (duration: 01m 01s)
  • 23:19 twentyafterfour@deploy1002: Started deploy [releng/phatality@4d0f053]: deploy phatality 7.10
  • 23:11 mutante: mwmaint2002 - rsyncing home dirs from mwmaint1002 (T275905)
  • 23:09 ebernhardson: restart wedged prometheus-wmf-elasticsearch-exporter-9200 on elastic2042
  • 23:03 mforns@deploy1002: Finished deploy [analytics/refinery@3bd0858] (hadoop-test): Regular analytics weekly train TEST- forgot version bump [analytics/refinery@3bd0858d0c3b524e6d170099d1e2f3d12fad495d] (duration: 04m 56s)
  • 22:58 mforns@deploy1002: Started deploy [analytics/refinery@3bd0858] (hadoop-test): Regular analytics weekly train TEST- forgot version bump [analytics/refinery@3bd0858d0c3b524e6d170099d1e2f3d12fad495d]
  • 22:53 mforns@deploy1002: Finished deploy [analytics/refinery@3bd0858] (thin): Regular analytics weekly train THIN- forgot bump up [analytics/refinery@3bd0858d0c3b524e6d170099d1e2f3d12fad495d] (duration: 00m 06s)
  • 22:53 mforns@deploy1002: Started deploy [analytics/refinery@3bd0858] (thin): Regular analytics weekly train THIN- forgot bump up [analytics/refinery@3bd0858d0c3b524e6d170099d1e2f3d12fad495d]
  • 22:53 mforns@deploy1002: Finished deploy [analytics/refinery@3bd0858]: Regular analytics weekly train- forgot bump up [analytics/refinery@3bd0858d0c3b524e6d170099d1e2f3d12fad495d] (duration: 18m 41s)
  • 22:34 mforns@deploy1002: Started deploy [analytics/refinery@3bd0858]: Regular analytics weekly train- forgot bump up [analytics/refinery@3bd0858d0c3b524e6d170099d1e2f3d12fad495d]
  • 22:23 mforns@deploy1002: Finished deploy [analytics/refinery@af99602] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@af99602101018664670a76d28cd755caf07dcde7] (duration: 07m 30s)
  • 22:16 mforns@deploy1002: Started deploy [analytics/refinery@af99602] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@af99602101018664670a76d28cd755caf07dcde7]
  • 22:14 mforns@deploy1002: Finished deploy [analytics/refinery@af99602] (thin): Regular analytics weekly train THIN [analytics/refinery@af99602101018664670a76d28cd755caf07dcde7] (duration: 00m 07s)
  • 22:14 mforns@deploy1002: Started deploy [analytics/refinery@af99602] (thin): Regular analytics weekly train THIN [analytics/refinery@af99602101018664670a76d28cd755caf07dcde7]
  • 22:12 mforns@deploy1002: Finished deploy [analytics/refinery@af99602]: Regular analytics weekly train [analytics/refinery@af99602101018664670a76d28cd755caf07dcde7] (duration: 13m 09s)
  • 21:59 mforns@deploy1002: Started deploy [analytics/refinery@af99602]: Regular analytics weekly train [analytics/refinery@af99602101018664670a76d28cd755caf07dcde7]
  • 21:58 mforns@deploy1002: deploy aborted: Regular analytics weekly train [analytics/refinery@COMMIT_HASH] (duration: 00m 01s)
  • 21:57 mforns@deploy1002: Started deploy [analytics/refinery@af99602]: Regular analytics weekly train [analytics/refinery@COMMIT_HASH]
  • 21:51 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mwmaint2001.codfw.wmnet with reason: decom
  • 21:51 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on mwmaint2001.codfw.wmnet with reason: decom
  • 21:51 legoktm: copied docker-registry package from stretch-wikimedia to buster-wikimedia (T272550)
  • 20:47 krinkle@deploy1002: Synchronized wmf-config/InitialiseSettings.php: I7f387bf19e5f prep wgChronologyProtectorStash ahead of wmf.33 roll out to ensure cross-wiki consistency (duration: 01m 18s)
  • 20:04 kharlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 20:03 kharlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 20:00 kharlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 20:00 kharlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 19:56 papaul: codfw mgmt is going down for 5 minutes for maintenance, thank you
  • 19:53 kharlan@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 19:43 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: c48e40a: Enable babel categorize on thwikisource (T275283) (duration: 01m 09s)
  • 19:42 pt1979@cumin2001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 19:39 pt1979@cumin2001: START - Cookbook sre.dns.netbox
  • 19:28 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: f6fa5b3: Set local timezone for trwikivoyage to UTC (T275598) (duration: 01m 09s)
  • 19:15 volans@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: REIMAGE
  • 19:13 volans@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: REIMAGE
  • 18:59 ebernhardson: apply merge.policy.deletes_pct_allowed=20 to production-search-codfw commonswiki_file to encourage merging away deleted docs from T271493
  • 18:53 mholloway-shell@deploy1002: Synchronized php-1.36.0-wmf.32/extensions/EventLogging: Fix timestamp format for migrated events (T276235) (duration: 01m 10s)
  • 18:42 volans@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: REIMAGE
  • 18:40 volans@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: REIMAGE
  • 18:28 dduvall@deploy1002: Synchronized private/readme.php: Config: Extend wmfSwiftConfig placeholder keys (duration: 01m 09s)
  • 18:21 mholloway-shell@deploy1002: Synchronized php-1.36.0-wmf.33/extensions/EventLogging: Fix timestamp format for migrated events (T276235) (duration: 01m 09s)
  • 18:12 vgutierrez: rolling restart of ats-tls on esams
  • 17:46 ebernhardson@deploy1002: Finished deploy [wikimedia/discovery/analytics@869a29b]: ores_bulk_ingest: Increase drafttopic error_threshold to 1 per 500 (duration: 02m 55s)
  • 17:43 ebernhardson@deploy1002: Started deploy [wikimedia/discovery/analytics@869a29b]: ores_bulk_ingest: Increase drafttopic error_threshold to 1 per 500
  • 17:39 legoktm@deploy1002: Synchronized php-1.36.0-wmf.32/extensions/Graph/: Do not log graph errors to WMF servers (duration: 01m 08s)
  • 17:21 legoktm@deploy1002: Synchronized php-1.36.0-wmf.32/extensions/ContentTranslation/: Re-apply: CX3 Build 0.1.0+20210223 (duration: 01m 10s)
  • 16:37 jayme@cumin1001: conftool action : set/pooled=true; selector: dnsdisc=helm-charts,name=eqiad
  • 16:37 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host chartmuseum1001.eqiad.wmnet
  • 16:33 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host chartmuseum1001.eqiad.wmnet
  • 16:14 kharlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 16:14 kharlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 16:10 kharlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 16:10 kharlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 15:59 marostegui@cumin1001: dbctl commit (dc=all): 'db1084 (re)pooling @ 100%: Repool db1084 after cloning db1164', diff saved to https://phabricator.wikimedia.org/P14563 and previous config saved to /var/cache/conftool/dbconfig/20210302-155932-root.json
  • 15:56 jayme@cumin1001: conftool action : set/pooled=false; selector: dnsdisc=helm-charts,name=eqiad
  • 15:56 jayme@cumin1001: conftool action : set/pooled=true; selector: dnsdisc=helm-charts,name=codfw
  • 15:55 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host chartmuseum2001.codfw.wmnet
  • 15:52 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host chartmuseum2001.codfw.wmnet
  • 15:44 marostegui@cumin1001: dbctl commit (dc=all): 'db1084 (re)pooling @ 85%: Repool db1084 after cloning db1164', diff saved to https://phabricator.wikimedia.org/P14562 and previous config saved to /var/cache/conftool/dbconfig/20210302-154429-root.json
  • 15:35 tgr@deploy1002: Synchronized php-1.36.0-wmf.33/extensions/GrowthExperiments/: Backport: HomepageHooks: Block search data hook if link recommendations are off (T276224) (duration: 01m 13s)
  • 15:29 marostegui@cumin1001: dbctl commit (dc=all): 'db1084 (re)pooling @ 75%: Repool db1084 after cloning db1164', diff saved to https://phabricator.wikimedia.org/P14561 and previous config saved to /var/cache/conftool/dbconfig/20210302-152925-root.json
  • 15:14 marostegui@cumin1001: dbctl commit (dc=all): 'db1084 (re)pooling @ 50%: Repool db1084 after cloning db1164', diff saved to https://phabricator.wikimedia.org/P14560 and previous config saved to /var/cache/conftool/dbconfig/20210302-151422-root.json
  • 15:00 hnowlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
  • 15:00 hnowlan@deploy1002: helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
  • 14:59 marostegui@cumin1001: dbctl commit (dc=all): 'db1084 (re)pooling @ 25%: Repool db1084 after cloning db1164', diff saved to https://phabricator.wikimedia.org/P14559 and previous config saved to /var/cache/conftool/dbconfig/20210302-145918-root.json
  • 14:57 vgutierrez: rolling restart of ats-tls on codfw
  • 14:53 hnowlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
  • 14:53 hnowlan@deploy1002: helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
  • 14:44 marostegui@cumin1001: dbctl commit (dc=all): 'db1084 (re)pooling @ 10%: Repool db1084 after cloning db1164', diff saved to https://phabricator.wikimedia.org/P14558 and previous config saved to /var/cache/conftool/dbconfig/20210302-144415-root.json
  • 14:43 akosiaris@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 14:43 akosiaris@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 14:42 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 14:29 jynus: dropping db grants for bacula from m1 T274809
  • 14:29 marostegui@cumin1001: dbctl commit (dc=all): 'db1084 (re)pooling @ 5%: Repool db1084 after cloning db1164', diff saved to https://phabricator.wikimedia.org/P14557 and previous config saved to /var/cache/conftool/dbconfig/20210302-142911-root.json
  • 14:07 jynus: dropping database bacula from m1 (with replication) T274809
  • 14:04 liw@deploy1002: rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.33
  • 13:57 moritzm: installing bind9 security updates on stretch (client-side tools/libs only)
  • 13:56 akosiaris@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host kubestagemaster1001.eqiad.wmnet
  • 13:44 akosiaris@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host kubemaster1002.eqiad.wmnet
  • 13:42 akosiaris@cumin1001: START - Cookbook sre.ganeti.makevm for new host kubestagemaster1001.eqiad.wmnet
  • 13:25 akosiaris@cumin1001: START - Cookbook sre.ganeti.makevm for new host kubemaster1002.eqiad.wmnet
  • 13:24 akosiaris@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host kubemaster1001.eqiad.wmnet
  • 13:16 akosiaris@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host kubemaster2001.codfw.wmnet
  • 13:13 akosiaris@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 13:10 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 13:08 akosiaris@cumin1001: START - Cookbook sre.ganeti.makevm for new host kubemaster1001.eqiad.wmnet
  • 12:53 akosiaris@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host kubemaster2002.codfw.wmnet
  • 12:53 akosiaris@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 12:46 akosiaris@cumin1001: START - Cookbook sre.ganeti.makevm for new host kubemaster2001.codfw.wmnet
  • 12:44 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1012.eqiad.wmnet
  • 12:43 akosiaris@cumin1001: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host kubemaster2001.codfw.wmnet
  • 12:39 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudvirt1012.eqiad.wmnet
  • 12:32 jayme@cumin1001: END (FAIL) - Cookbook sre.discovery.service-route (exit_code=99)
  • 12:32 jayme@cumin1001: START - Cookbook sre.discovery.service-route
  • 12:28 jayme@cumin1001: conftool action : set/pooled=false; selector: dnsdisc=helm-charts,name=codfw
  • 12:23 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 29952b4: vector: Stage 3 of WVUI search treatment A/B test (T249297) (duration: 01m 08s)
  • 12:21 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 5674d2a: Enable SectionTranslation in testwiki (T275596) (duration: 01m 09s)
  • 12:13 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagetcd2003.codfw.wmnet
  • 12:12 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host kubestagetcd2003.codfw.wmnet
  • 12:12 mbsantos@deploy1002: Finished deploy [tilerator/deploy@8d3d81c]: (no justification provided) (duration: 00m 15s)
  • 12:11 mbsantos@deploy1002: Started deploy [tilerator/deploy@8d3d81c]: (no justification provided)
  • 12:07 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagetcd2002.codfw.wmnet
  • 12:06 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: af89965: Remove test2wiki from wgContentTranslationAsBetaFeature (duration: 01m 38s)
  • 12:02 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host kubestagetcd2002.codfw.wmnet
  • 12:00 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1084 to clone db1164 T258361', diff saved to https://phabricator.wikimedia.org/P14554 and previous config saved to /var/cache/conftool/dbconfig/20210302-115959-marostegui.json
  • 11:53 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagetcd2001.codfw.wmnet
  • 11:53 mbsantos@deploy1002: Finished deploy [tilerator/deploy@937deb5]: (no justification provided) (duration: 00m 03s)
  • 11:53 mbsantos@deploy1002: Started deploy [tilerator/deploy@937deb5]: (no justification provided)
  • 11:49 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host kubestagetcd2001.codfw.wmnet
  • 11:47 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagemaster2001.codfw.wmnet
  • 11:40 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host kubestagemaster2001.codfw.wmnet
  • 11:16 akosiaris@cumin1001: START - Cookbook sre.ganeti.makevm for new host kubemaster2002.codfw.wmnet
  • 11:16 akosiaris@cumin1001: START - Cookbook sre.ganeti.makevm for new host kubemaster2001.codfw.wmnet
  • 11:12 hnowlan@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
  • 11:11 hnowlan@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
  • 10:37 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1028.eqiad.wmnet
  • 10:31 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single for host mc1028.eqiad.wmnet
  • 10:30 effie: upgrade memcached on mc2024, mc1028
  • 10:21 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1119.eqiad.wmnet
  • 10:18 elukey@cumin1001: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1119.eqiad.wmnet
  • 10:12 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1119.eqiad.wmnet with reason: REIMAGE
  • 10:09 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1119.eqiad.wmnet with reason: REIMAGE
  • 10:05 volans@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: REIMAGE
  • 10:03 volans@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: REIMAGE
  • 09:54 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker[1130-1131].eqiad.wmnet
  • 09:52 elukey@cumin1001: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1130-1131].eqiad.wmnet
  • 09:46 liw@deploy1002: Finished scap: testwikis wikis to 1.36.0-wmf.33 (duration: 36m 20s)
  • 09:43 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker[1124-1128].eqiad.wmnet
  • 09:41 elukey@cumin1001: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1124-1128].eqiad.wmnet
  • 09:39 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker[1120-1123].eqiad.wmnet
  • 09:37 elukey@cumin1001: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1120-1123].eqiad.wmnet
  • 09:36 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1119.eqiad.wmnet
  • 09:33 elukey@cumin1001: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1119.eqiad.wmnet
  • 09:12 liw@deploy1002: Started scap: testwikis wikis to 1.36.0-wmf.33
  • 08:58 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1131.eqiad.wmnet with reason: REIMAGE
  • 08:56 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1130.eqiad.wmnet with reason: REIMAGE
  • 08:56 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1131.eqiad.wmnet with reason: REIMAGE
  • 08:54 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1130.eqiad.wmnet with reason: REIMAGE
  • 08:54 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1128.eqiad.wmnet with reason: REIMAGE
  • 08:54 vgutierrez: rolling restart of ats-tls on ulsfo
  • 08:52 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1128.eqiad.wmnet with reason: REIMAGE
  • 08:39 kharlan@deploy1002: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 08:30 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1127.eqiad.wmnet with reason: REIMAGE
  • 08:28 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1126.eqiad.wmnet with reason: REIMAGE
  • 08:27 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1127.eqiad.wmnet with reason: REIMAGE
  • 08:25 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1126.eqiad.wmnet with reason: REIMAGE
  • 08:25 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1125.eqiad.wmnet with reason: REIMAGE
  • 08:23 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1125.eqiad.wmnet with reason: REIMAGE
  • 08:00 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1124.eqiad.wmnet with reason: REIMAGE
  • 07:59 liw: 1.36.0-wmf.33 was branched at 800e1f8 for T274937
  • 07:58 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1123.eqiad.wmnet with reason: REIMAGE
  • 07:58 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1124.eqiad.wmnet with reason: REIMAGE
  • 07:56 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1122.eqiad.wmnet with reason: REIMAGE
  • 07:56 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1123.eqiad.wmnet with reason: REIMAGE
  • 07:54 godog: swift eqiad-prod: add weight to ms-be106[0-3] - T268435
  • 07:54 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1122.eqiad.wmnet with reason: REIMAGE
  • 07:28 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1121.eqiad.wmnet with reason: REIMAGE
  • 07:27 ryankemper: Pooled `elastic106[0,4]` (Noticed I never re-pooled these hosts after resolving an incident last week)
  • 07:26 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1121.eqiad.wmnet with reason: REIMAGE
  • 07:26 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1120.eqiad.wmnet with reason: REIMAGE
  • 07:24 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1120.eqiad.wmnet with reason: REIMAGE
  • 05:40 Amir1: apply gerrit:667757 on mwdebug1002 to test T259360
  • 05:38 marostegui@cumin1001: dbctl commit (dc=all): 'Pool db2152 into s8 as vslow - T275633', diff saved to https://phabricator.wikimedia.org/P14551 and previous config saved to /var/cache/conftool/dbconfig/20210302-053814-marostegui.json
  • 00:59 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 0f08e8b: Update the Persian Wikipedia logos (T261033; 2/2) (duration: 00m 56s)
  • 00:58 urbanecm@deploy1002: Synchronized static/images/mobile/copyright/: 0f08e8b: Update the Persian Wikipedia logos (T261033; 1/2) (duration: 00m 56s)
  • 00:54 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 97ebf75: Separate Wikivoyage wordmark and icon (T261033; T273477) (duration: 00m 56s)
  • 00:53 urbanecm@deploy1002: Synchronized static/images/mobile/copyright/: 97ebf75: Separate Wikivoyage wordmark and icon (T261033; T273477) (duration: 00m 56s)
  • 00:46 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 61647cd: Fixes max-width configuration for new Vector (T260091) (duration: 00m 56s)
  • 00:43 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: 6cc8521: Enable og tags on non-wikidata wikis (T157145) (duration: 00m 56s)
  • 00:37 urbanecm@deploy1002: Synchronized wmf-config/config/hrwiki.yaml: REDEPLOY: d53834e: Enable Growth features on hrwiki in stealth mode (3/3; T275684) (duration: 00m 56s)
  • 00:36 urbanecm@deploy1002: Synchronized dblists/growthexperiments.dblist: REDEPLOY: d53834e: Enable Growth features on hrwiki in stealth mode (2/3; T275684) (duration: 00m 56s)
  • 00:35 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: REDEPLOY: d53834e: Enable Growth features on hrwiki in stealth mode (1/3; T275684) (duration: 00m 55s)
  • 00:33 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: REDEPLOY: Config: EventLoggingSchemas: Bump HomepageVisit version (T275615) (duration: 00m 56s)
  • 00:31 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: REDEPLOY: 21cb6f5: Revert "Revert "vector: Stage 2 of WVUI search treatment A/B test"" (T249297) (duration: 00m 56s)
  • 00:29 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: REDEPLOY: 599b739: Simplify deployment of Growth team features (3/3; T276091) (duration: 00m 56s)
  • 00:27 urbanecm@deploy1002: Synchronized wmf-config/CommonSettings.php: REDEPLOY: de0f741: Simplify deployment of Growth team features (2/3; T276091) (duration: 00m 57s)
  • 00:26 urbanecm@deploy1002: sync-file aborted: REDEPLOY: de0f741: Simplify deployment of Growth team features (2/3; T276091) (duration: 00m 25s)
  • 00:25 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: REDEPLOY: e991806: Simplify deployment of Growth team features (1/3; T276091) (duration: 00m 56s)
  • 00:22 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: REDEPLOY: Revert: vector: Stage 2 of WVUI search treatment A/B test (T249297) (duration: 00m 56s)
  • 00:21 pt1979@cumin2001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 00:21 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: REDEPLOY: 1edcbb5: vector: Stage 2 of WVUI search treatment A/B test (T249297) (duration: 00m 56s)
  • 00:19 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: REDEPLOYING: 2a8ece1: GrowthExperiments: set GELinkRecommendationsUseEventGate (duration: 00m 57s)
  • 00:18 urbanecm@deploy1002: sync-file aborted: 2a8ece1: GrowthExperiments: set GELinkRecommendationsUseEventGate (duration: 00m 05s)
  • 00:17 urbanecm@deploy1002: Synchronized wmf-config/InitialiseSettings.php: REDEPLOYING: 92f6597: rowiki: Update help panel links (T275130) (duration: 00m 59s)
  • 00:16 pt1979@cumin2001: START - Cookbook sre.dns.netbox
  • 00:11 mutante: deploy2002 - ran 'git fetch' in /srv/mediawiki-staging

2021-03-01

  • 23:05 eileen: civicrm revision changed from 04a029958c to e1dacbe348, config revision is 643477b35d
  • 23:01 ebernhardson@deploy1002: Finished deploy [wikimedia/discovery/analytics@61e7533]: ores_bulk_ingest: Handle unexpected api response (duration: 01m 33s)
  • 23:00 ebernhardson@deploy1002: Started deploy [wikimedia/discovery/analytics@61e7533]: ores_bulk_ingest: Handle unexpected api response
  • 22:57 mholloway-shell@deploy1002: Synchronized php-1.36.0-wmf.32/extensions/WikimediaEvents: Fix: Restore exporting wgWMESchemaEditAttemptStepSamplingRate to JS (duration: 00m 57s)
  • 22:41 mstyles@deploy1002: Finished deploy [wikimedia/discovery/analytics@ca2c5b5]: import commons ttl dag fix (T270103) (duration: 02m 04s)
  • 22:39 mstyles@deploy1002: Started deploy [wikimedia/discovery/analytics@ca2c5b5]: import commons ttl dag fix (T270103)
  • 22:22 dwisehaupt: ran the following on frdb2001 to allow replication to continue after conversion to utf8mb4 charset: set global slave_type_conversions = ALL_NON_LOSSY;
  • 22:21 dwisehaupt: stopping mysql replication on frdb2001 and starting utf8mb4 table alters under a root screen session
  • 22:16 eileen: civicrm revision changed from f07390ff87 to 04a029958c, config revision is 643477b35d
  • 22:12 twentyafterfour@deploy1002: Finished scap: (no justification provided) (duration: 16m 24s)
  • 21:57 twentyafterfour: running scap sync from the new server deploy1002
  • 21:56 twentyafterfour@deploy1002: Started scap: (no justification provided)
  • 21:54 mstyles@deploy1002: Finished deploy [wikimedia/discovery/analytics@ca2c5b5]: import commons ttl dag fix (T270103) (duration: 02m 34s)
  • 21:52 mstyles@deploy1002: Started deploy [wikimedia/discovery/analytics@ca2c5b5]: import commons ttl dag fix (T270103)
  • 21:49 mutante: deploy1002 - removed scap-global-lock, unlocked scap
  • 21:43 phamhi: rebooted clouddb1013 for maintenance
  • 21:38 mutante: cumin 'mw*' 'grep master_rsync /etc/scap.cfg' showed all mw servers are now using deploy1002 (T265963)
  • 21:30 shdubsh: completed removal of kafka logging inputs to legacy logstash cluster - T234854
  • 21:18 mutante: mw1262 - running puppet to switch to new deployment server, scap pull
  • 21:16 effie: pooling mw1262 back
  • 21:08 mutante: [mwdebug1001:~] $ /usr/local/lib/nagios/plugins/check_mw_versions --deployhost deploy1002.eqiad.wmnet - OKAY: wikiversions in sync (T265963)
  • 21:05 mutante: re-enabling puppet on deploy1001 - running puppet on deploy*, switching eqiad scap master and deployment_server globally (T265963)
  • 20:37 mutante: deploy1001 - disable puppet and manually create scap-global-lock - NO DEPLOYMENTS
  • 20:35 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1029.eqiad.wmnet
  • 20:28 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single for host mc1029.eqiad.wmnet
  • 20:28 effie: upgrade mc1029, mc2029 to memcached 1.6
  • 19:55 urbanecm@deploy1001: Synchronized wmf-config/config/hrwiki.yaml: d53834e: Enable Growth features on hrwiki in stealth mode (3/3; T275684) (duration: 00m 54s)
  • 19:54 urbanecm@deploy1001: sync-file aborted: d53834e: Enable Growth features on hrwiki in stealth mode (3/3; T275684) (duration: 00m 03s)
  • 19:53 urbanecm@deploy1001: Synchronized dblists/growthexperiments.dblist: d53834e: Enable Growth features on hrwiki in stealth mode (2/3; T275684) (duration: 00m 56s)
  • 19:52 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: d53834e: Enable Growth features on hrwiki in stealth mode (1/3; T275684) (duration: 00m 55s)
  • 19:41 tgr@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Config: EventLoggingSchemas: Bump HomepageVisit version (T275615) (duration: 00m 56s)
  • 19:34 phuedx@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Config: Revert "Revert "vector: Stage 2 of WVUI search treatment A/B test"" (T249297) (duration: 00m 54s)
  • 19:20 pt1979@cumin2001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 19:02 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 599b739: Simplify deployment of Growth team features (3/3; T276091) (duration: 01m 00s)
  • 19:01 urbanecm@deploy1001: Synchronized wmf-config/CommonSettings.php: de0f741: Simplify deployment of Growth team features (2/3; T276091) (duration: 00m 57s)
  • 18:56 pt1979@cumin2001: START - Cookbook sre.dns.netbox
  • 18:54 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: e991806: Simplify deployment of Growth team features (1/3; T276091) (duration: 00m 57s)
  • 18:42 mutante: mwmaint2002.mgmt - racadm serveraction powerup
  • 18:26 ryankemper: [Relforge] Lifting downtime on `relforge1004` now that T275658 is done
  • 18:24 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1307.eqiad.wmnet
  • 18:24 mutante: mw1307 - back to stretch now
  • 18:22 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1307.eqiad.wmnet
  • 18:20 mutante: mwmaint2002 - shutting down for maintenance
  • 18:12 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1098.eqiad.wmnet with reason: REIMAGE
  • 18:10 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1098.eqiad.wmnet with reason: REIMAGE
  • 18:03 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mwmaint2002.codfw.wmnet with reason: new install
  • 18:03 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on mwmaint2002.codfw.wmnet with reason: new install
  • 18:00 volans@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: REIMAGE
  • 18:00 mutante: puppetmaster1001 - generating mcrouter cert for mwmaint2002 T275905
  • 17:58 volans@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: REIMAGE
  • 17:43 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1307.eqiad.wmnet with reason: REIMAGE
  • 17:41 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1307.eqiad.wmnet with reason: REIMAGE
  • 17:17 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1307.eqiad.wmnet
  • 17:07 mutante: our latest Wikipedia language edition ready to move on from the incubator https://tay.wikipedia.org
  • 17:05 mutante: new Wikimedia project language - tay - Atayal is spoken by the Atayal people of Taiwan
  • 17:03 jayme@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'mathoid' for release 'production' .
  • 16:38 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1097.eqiad.wmnet with reason: REIMAGE
  • 16:35 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1097.eqiad.wmnet with reason: REIMAGE
  • 16:20 jayme@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'mathoid' for release 'production' .
  • 15:57 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'staging' .
  • 15:11 vgutierrez: rolling restart of ats-tls on cp[5007-5011]
  • 14:49 marostegui: Failover m3 proxy back to dbproxy1020
  • 14:32 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1030.eqiad.wmnet
  • 14:25 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single for host mc1030.eqiad.wmnet
  • 14:18 effie: upgrade mc1030 mc2030 to memcached 1.6
  • 14:07 marostegui: Upgrade dbproxy1020 kernel
  • 14:05 moritzm: installing openldap security updates on stretch (client-side tools/libs only, slapd instances all on Buster and fixed)
  • 13:22 moritzm: installing docker.io security updates for Buster
  • 12:26 awight: EU config deployments complete
  • 12:10 awight@deploy1001: Synchronized wmf-config: Config: GrowthExperiments: set GELinkRecommendationsUseEventGate (T274198) (duration: 01m 05s)
  • 11:49 jdrewniak@deploy1001: Synchronized portals: Wikimedia Portals Update: Bumping portals to master (T128546) (duration: 00m 55s)
  • 11:48 jdrewniak@deploy1001: Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: Bumping portals to master (T128546) (duration: 00m 55s)
  • 10:48 marostegui@cumin1001: dbctl commit (dc=all): 'db1168 (re)pooling @ 100%: Slowly pool db1168 for the first time', diff saved to https://phabricator.wikimedia.org/P14547 and previous config saved to /var/cache/conftool/dbconfig/20210301-104842-root.json
  • 10:33 marostegui@cumin1001: dbctl commit (dc=all): 'db1168 (re)pooling @ 85%: Slowly pool db1168 for the first time', diff saved to https://phabricator.wikimedia.org/P14546 and previous config saved to /var/cache/conftool/dbconfig/20210301-103338-root.json
  • 10:18 marostegui@cumin1001: dbctl commit (dc=all): 'db1168 (re)pooling @ 75%: Slowly pool db1168 for the first time', diff saved to https://phabricator.wikimedia.org/P14545 and previous config saved to /var/cache/conftool/dbconfig/20210301-101835-root.json
  • 10:15 vgutierrez: restart ats-tls on cp5012
  • 10:03 marostegui@cumin1001: dbctl commit (dc=all): 'db1168 (re)pooling @ 65%: Slowly pool db1168 for the first time', diff saved to https://phabricator.wikimedia.org/P14544 and previous config saved to /var/cache/conftool/dbconfig/20210301-100331-root.json
  • 09:48 marostegui@cumin1001: dbctl commit (dc=all): 'db1168 (re)pooling @ 50%: Slowly pool db1168 for the first time', diff saved to https://phabricator.wikimedia.org/P14543 and previous config saved to /var/cache/conftool/dbconfig/20210301-094828-root.json
  • 09:33 marostegui@cumin1001: dbctl commit (dc=all): 'db1168 (re)pooling @ 40%: Slowly pool db1168 for the first time', diff saved to https://phabricator.wikimedia.org/P14542 and previous config saved to /var/cache/conftool/dbconfig/20210301-093324-root.json
  • 09:25 marostegui@cumin1001: dbctl commit (dc=all): 'db1134 (re)pooling @ 100%: Repool db1134 after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P14541 and previous config saved to /var/cache/conftool/dbconfig/20210301-092536-root.json
  • 09:18 marostegui@cumin1001: dbctl commit (dc=all): 'db1168 (re)pooling @ 30%: Slowly pool db1168 for the first time', diff saved to https://phabricator.wikimedia.org/P14540 and previous config saved to /var/cache/conftool/dbconfig/20210301-091820-root.json
  • 09:10 marostegui@cumin1001: dbctl commit (dc=all): 'db1134 (re)pooling @ 85%: Repool db1134 after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P14539 and previous config saved to /var/cache/conftool/dbconfig/20210301-091032-root.json
  • 09:03 marostegui@cumin1001: dbctl commit (dc=all): 'db1168 (re)pooling @ 25%: Slowly pool db1168 for the first time', diff saved to https://phabricator.wikimedia.org/P14538 and previous config saved to /var/cache/conftool/dbconfig/20210301-090317-root.json
  • 08:55 marostegui@cumin1001: dbctl commit (dc=all): 'db1134 (re)pooling @ 75%: Repool db1134 after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P14537 and previous config saved to /var/cache/conftool/dbconfig/20210301-085529-root.json
  • 08:48 marostegui@cumin1001: dbctl commit (dc=all): 'db1168 (re)pooling @ 20%: Slowly pool db1168 for the first time', diff saved to https://phabricator.wikimedia.org/P14536 and previous config saved to /var/cache/conftool/dbconfig/20210301-084813-root.json
  • 08:45 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 92f6597: rowiki: Update help panel links (T275130) (duration: 01m 08s)
  • 08:40 marostegui@cumin1001: dbctl commit (dc=all): 'db1134 (re)pooling @ 65%: Repool db1134 after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P14535 and previous config saved to /var/cache/conftool/dbconfig/20210301-084025-root.json
  • 08:38 elukey: reboot an-worker1112
  • 08:33 marostegui@cumin1001: dbctl commit (dc=all): 'db1168 (re)pooling @ 15%: Slowly pool db1168 for the first time', diff saved to https://phabricator.wikimedia.org/P14534 and previous config saved to /var/cache/conftool/dbconfig/20210301-083310-root.json
  • 08:25 marostegui@cumin1001: dbctl commit (dc=all): 'db1134 (re)pooling @ 50%: Repool db1134 after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P14533 and previous config saved to /var/cache/conftool/dbconfig/20210301-082521-root.json
  • 08:18 marostegui@cumin1001: dbctl commit (dc=all): 'db1168 (re)pooling @ 10%: Slowly pool db1168 for the first time', diff saved to https://phabricator.wikimedia.org/P14532 and previous config saved to /var/cache/conftool/dbconfig/20210301-081806-root.json
  • 08:10 marostegui@cumin1001: dbctl commit (dc=all): 'db1134 (re)pooling @ 40%: Repool db1134 after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P14531 and previous config saved to /var/cache/conftool/dbconfig/20210301-081018-root.json
  • 08:03 marostegui@cumin1001: dbctl commit (dc=all): 'db1168 (re)pooling @ 5%: Slowly pool db1168 for the first time', diff saved to https://phabricator.wikimedia.org/P14530 and previous config saved to /var/cache/conftool/dbconfig/20210301-080303-root.json
  • 07:55 marostegui@cumin1001: dbctl commit (dc=all): 'db1134 (re)pooling @ 25%: Repool db1134 after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P14529 and previous config saved to /var/cache/conftool/dbconfig/20210301-075514-root.json
  • 07:53 marostegui: Upgrade pc1010 pc2008 pc200 to 10.4.18
  • 07:53 elukey: clean up old logs + apt-get clean + puppet clientbucket on an-coord1001 to free space
  • 07:48 marostegui@cumin1001: dbctl commit (dc=all): 'db1168 (re)pooling @ 4%: Slowly pool db1168 for the first time', diff saved to https://phabricator.wikimedia.org/P14528 and previous config saved to /var/cache/conftool/dbconfig/20210301-074759-root.json
  • 07:40 marostegui@cumin1001: dbctl commit (dc=all): 'db1134 (re)pooling @ 15%: Repool db1134 after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P14527 and previous config saved to /var/cache/conftool/dbconfig/20210301-074011-root.json
  • 07:29 marostegui@cumin1001: dbctl commit (dc=all): 'Give some more weight to db1168', diff saved to https://phabricator.wikimedia.org/P14526 and previous config saved to /var/cache/conftool/dbconfig/20210301-072957-marostegui.json
  • 07:25 marostegui@cumin1001: dbctl commit (dc=all): 'db1134 (re)pooling @ 10%: Repool db1134 after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P14525 and previous config saved to /var/cache/conftool/dbconfig/20210301-072507-root.json
  • 07:10 marostegui@cumin1001: dbctl commit (dc=all): 'Give some more weight to db1168', diff saved to https://phabricator.wikimedia.org/P14524 and previous config saved to /var/cache/conftool/dbconfig/20210301-071047-marostegui.json
  • 07:10 marostegui@cumin1001: dbctl commit (dc=all): 'db1134 (re)pooling @ 5%: Repool db1134 after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P14523 and previous config saved to /var/cache/conftool/dbconfig/20210301-071004-root.json
  • 07:05 marostegui: Stop MySQL on db2082 to clone db2152 - T275633
  • 06:55 marostegui@cumin1001: dbctl commit (dc=all): 'db1134 (re)pooling @ 1%: Repool db1134 after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P14521 and previous config saved to /var/cache/conftool/dbconfig/20210301-065500-root.json
  • 06:47 marostegui@cumin1001: dbctl commit (dc=all): 'Pool db1168 with minimal weight T258361', diff saved to https://phabricator.wikimedia.org/P14520 and previous config saved to /var/cache/conftool/dbconfig/20210301-064704-marostegui.json
  • 06:46 marostegui@cumin1001: dbctl commit (dc=all): 'Add db1168 to dbctl T258361!', diff saved to https://phabricator.wikimedia.org/P14519 and previous config saved to /var/cache/conftool/dbconfig/20210301-064603-marostegui.json
  • 06:32 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1092.eqiad.wmnet
  • 06:25 marostegui@cumin1001: START - Cookbook sre.hosts.decommission for hosts db1092.eqiad.wmnet
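
The staged `(re)pooling @ N%` commits above follow a regular pattern, so the ramp for one host can be pulled back out of the log. The following is a minimal sketch, assuming the entry layout seen in this log; the function name `repool_steps` and the shortened sample lines (cut after the commit message) are illustrative and not part of dbctl or any Wikimedia tooling.

```python
import re

# Matches entries such as:
#   07:25 marostegui@cumin1001: dbctl commit (dc=all): 'db1134 (re)pooling @ 10%: ...'
# The layout is an assumption based on the entries in this log.
REPOOL_RE = re.compile(
    r"(?P<time>\d{2}:\d{2}) \S+: dbctl commit \(dc=all\): "
    r"'(?P<host>db\d+) \(re\)pooling @ (?P<pct>\d+)%"
)


def repool_steps(lines, host):
    """Return (time, percent) pairs for one host in chronological order."""
    steps = []
    for line in lines:
        match = REPOOL_RE.search(line)
        if match and match.group("host") == host:
            steps.append((match.group("time"), int(match.group("pct"))))
    # The log lists the newest entry first, so reverse it.
    return steps[::-1]


# Sample entries copied from the log, shortened after the commit message.
sample = [
    "• 07:29 marostegui@cumin1001: dbctl commit (dc=all): 'Give some more weight to db1168'",
    "• 07:25 marostegui@cumin1001: dbctl commit (dc=all): 'db1134 (re)pooling @ 10%: Repool db1134 after on-site maintenance'",
    "• 07:10 marostegui@cumin1001: dbctl commit (dc=all): 'db1134 (re)pooling @ 5%: Repool db1134 after on-site maintenance'",
    "• 06:55 marostegui@cumin1001: dbctl commit (dc=all): 'db1134 (re)pooling @ 1%: Repool db1134 after on-site maintenance'",
]

for time, pct in repool_steps(sample, "db1134"):
    print(time, f"{pct}%")
```

Manual weight changes such as 'Give some more weight to db1168' carry no percentage, so this sketch skips them.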

2021-02-28

  • 14:17 gehel: repooled wdqs1011 - caught up on lag

2021-02-27

  • 21:19 dwisehaupt: ran the following on frdb2002 to allow replication to continue after conversion to utf8mb4 charset: set global slave_type_conversions = ALL_NON_LOSSY;
  • 18:44 gehel: depooled wdqs1011 to catch up on lag
  • 18:37 gehel: powercycling wdqs1011
  • 00:08 mutante: deploy1002 - rsyncing home dirs from deploy1001

2021-02-26

  • 20:29 mutante: deploy2001 - /srv/mediawiki-staging sudo find . -name *.cdb delete - deleted 190 GB of old cdb files (T275826 T265963)
  • 18:31 dwisehaupt: starting the utf8mb4 table alters on frdb2002 under a root screen session
  • 17:59 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mwmaint2002.codfw.wmnet with reason: REIMAGE
  • 17:57 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on mwmaint2002.codfw.wmnet with reason: REIMAGE
  • 15:04 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
  • 14:59 dcaro@cumin1001: START - Cookbook sre.hosts.upgrade-and-reboot
  • 14:57 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
  • 14:51 dcaro@cumin1001: START - Cookbook sre.hosts.upgrade-and-reboot
  • 14:49 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
  • 14:44 dcaro@cumin1001: START - Cookbook sre.hosts.upgrade-and-reboot
  • 14:43 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
  • 14:38 dcaro@cumin1001: START - Cookbook sre.hosts.upgrade-and-reboot
  • 14:37 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
  • 14:31 dcaro@cumin1001: START - Cookbook sre.hosts.upgrade-and-reboot
  • 14:30 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
  • 14:25 dcaro@cumin1001: START - Cookbook sre.hosts.upgrade-and-reboot
  • 14:22 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
  • 14:17 dcaro@cumin1001: START - Cookbook sre.hosts.upgrade-and-reboot
  • 13:56 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
  • 13:51 dcaro@cumin1001: START - Cookbook sre.hosts.upgrade-and-reboot
  • 13:51 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
  • 13:45 dcaro@cumin1001: START - Cookbook sre.hosts.upgrade-and-reboot
  • 13:44 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
  • 13:38 dcaro@cumin1001: START - Cookbook sre.hosts.upgrade-and-reboot
  • 13:05 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1031.eqiad.wmnet
  • 13:00 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single for host mc1031.eqiad.wmnet
  • 12:59 effie: upgrade memcached on mc1031, mc2031
  • 12:40 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 12:40 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 12:40 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
  • 12:40 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' .
  • 12:23 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 12:23 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 12:22 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' .
  • 12:22 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
  • 12:19 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 12:19 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 12:18 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' .
  • 12:18 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
  • 12:14 marostegui@cumin1001: dbctl commit (dc=all): 'Add new vslow,dump host to codfw s4 - T275633', diff saved to https://phabricator.wikimedia.org/P14508 and previous config saved to /var/cache/conftool/dbconfig/20210226-121438-marostegui.json
  • 12:12 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
  • 12:11 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol1003.wikimedia.org
  • 12:10 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 12:10 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 12:10 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' .
  • 12:10 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
  • 12:07 dcaro@cumin1001: START - Cookbook sre.hosts.upgrade-and-reboot
  • 12:00 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudcontrol1003.wikimedia.org
  • 12:00 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol1004.wikimedia.org
  • 11:59 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 11:59 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 11:59 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' .
  • 11:58 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
  • 11:55 jbond42: delete exim messages in the queue to root@wikimedia.org older than 7200 seconds and younger than 10800 seconds on mx1001
  • 11:54 jbond42: delete exim messages in the queue to root@wikimedia.org older than 7200 seconds and younger than 10800 seconds on mx2001
  • 11:47 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 11:47 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 11:47 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
  • 11:42 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudcontrol1004.wikimedia.org
  • 11:42 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol1005.wikimedia.org
  • 11:41 dcaro@cumin1001: START - Cookbook sre.hosts.upgrade-and-reboot
  • 11:38 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
  • 11:38 vgutierrez: rolling restart of ats-tls on cp500[1-5]
  • 11:37 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
  • 11:37 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' .
  • 11:33 dcaro@cumin1001: START - Cookbook sre.hosts.upgrade-and-reboot
  • 11:32 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
  • 11:30 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudcontrol1005.wikimedia.org
  • 11:27 dcaro@cumin1001: START - Cookbook sre.hosts.upgrade-and-reboot
  • 11:22 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1096.eqiad.wmnet with reason: REIMAGE
  • 11:21 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
  • 11:21 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'staging' .
  • 11:21 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'production' .
  • 11:21 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
  • 11:21 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
  • 11:20 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' .
  • 11:20 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
  • 11:20 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'production' .
  • 11:20 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
  • 11:20 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'sessionstore' for release 'staging' .
  • 11:20 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'sessionstore' for release 'production' .
  • 11:19 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'recommendation-api' for release 'production' .
  • 11:19 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1096.eqiad.wmnet with reason: REIMAGE
  • 11:19 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'canary' .
  • 11:19 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'main' .
  • 11:18 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' .
  • 11:18 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
  • 11:18 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
  • 11:18 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'production' .
  • 11:18 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'staging' .
  • 11:18 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 11:17 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 11:17 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 11:17 dcaro@cumin1001: START - Cookbook sre.hosts.upgrade-and-reboot
  • 11:17 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventstreams-internal' for release 'canary' .
  • 11:17 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' .
  • 11:17 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
  • 11:17 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
  • 11:16 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 11:16 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
  • 11:16 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
  • 11:16 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
  • 11:16 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 11:15 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 11:15 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 11:12 dcaro@cumin1001: START - Cookbook sre.hosts.upgrade-and-reboot
  • 11:10 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
  • 11:05 dcaro@cumin1001: START - Cookbook sre.hosts.upgrade-and-reboot
  • 11:05 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
  • 11:05 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' .
  • 11:05 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
  • 11:04 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'production' .
  • 11:04 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'staging' .
  • 11:04 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 11:04 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
  • 11:03 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'production' .
  • 11:03 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' .
  • 11:03 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 11:03 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 11:03 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 11:03 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 11:02 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
  • 11:02 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
  • 11:02 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
  • 11:02 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
  • 11:02 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'staging' .
  • 11:02 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'production' .
  • 11:00 dcaro@cumin1001: START - Cookbook sre.hosts.upgrade-and-reboot
  • 10:59 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephmon2002-dev.codfw.wmnet
  • 10:55 dcaro@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudcephmon2002-dev.codfw.wmnet
  • 10:54 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephmon2003-dev.codfw.wmnet
  • 10:50 dcaro@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudcephmon2003-dev.codfw.wmnet
  • 10:50 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephmon2003-dev.codfw.wmnet
  • 10:46 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' .
  • 10:44 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1039.eqiad.wmnet
  • 10:44 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 10:44 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 10:43 dcaro@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudcephmon2003-dev.codfw.wmnet
  • 10:41 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephmon2002-dev.codfw.wmnet
  • 10:38 dcaro@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudcephmon2002-dev.codfw.wmnet
  • 10:38 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephmon2001-dev.codfw.wmnet
  • 10:38 aborrero@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudvirt1039.eqiad.wmnet
  • 10:34 dcaro@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudcephmon2001-dev.codfw.wmnet
  • 10:33 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
  • 10:33 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' .
  • 10:33 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'staging' .
  • 10:33 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'production' .
  • 10:33 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 10:33 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
  • 10:33 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' .
  • 10:33 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'production' .
  • 10:32 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 10:32 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 10:32 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 10:32 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 10:32 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
  • 10:31 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
  • 10:31 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
  • 10:31 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
  • 10:31 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'staging' .
  • 10:31 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'production' .
  • 10:28 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd2003-dev.codfw.wmnet
  • 10:22 marostegui@cumin1001: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repool db1169 after cloning db1134', diff saved to https://phabricator.wikimedia.org/P14505 and previous config saved to /var/cache/conftool/dbconfig/20210226-102254-root.json
  • 10:18 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1096.eqiad.wmnet with reason: REIMAGE
  • 10:16 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1096.eqiad.wmnet with reason: REIMAGE
  • 10:14 dcaro@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudcephosd2003-dev.codfw.wmnet
  • 10:09 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd2002-dev.codfw.wmnet
  • 10:07 marostegui@cumin1001: dbctl commit (dc=all): 'db1169 (re)pooling @ 85%: Repool db1169 after cloning db1134', diff saved to https://phabricator.wikimedia.org/P14504 and previous config saved to /var/cache/conftool/dbconfig/20210226-100750-root.json
  • 10:06 aborrero@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudweb2001-dev.wikimedia.org
  • 10:05 dcaro@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudcephosd2002-dev.codfw.wmnet
  • 09:59 aborrero@cumin2001: START - Cookbook sre.hosts.reboot-single for host cloudweb2001-dev.wikimedia.org
  • 09:59 aborrero@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices2003-dev.wikimedia.org
  • 09:55 aborrero@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices2002-dev.wikimedia.org
  • 09:54 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd2001-dev.codfw.wmnet
  • 09:52 aborrero@cumin2001: START - Cookbook sre.hosts.reboot-single for host cloudservices2003-dev.wikimedia.org
  • 09:52 marostegui@cumin1001: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repool db1169 after cloning db1134', diff saved to https://phabricator.wikimedia.org/P14503 and previous config saved to /var/cache/conftool/dbconfig/20210226-095247-root.json
  • 09:50 aborrero@cumin2001: START - Cookbook sre.hosts.reboot-single for host cloudservices2002-dev.wikimedia.org
  • 09:50 dcaro@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudcephosd2001-dev.codfw.wmnet
  • 09:48 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd2001-dev.codfw.wmnet
  • 09:43 dcaro@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudcephosd2001-dev.codfw.wmnet
  • 09:41 aborrero@cumin2001: END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host cloudcontrol2001-dev.wikimedia.org
  • 09:37 marostegui@cumin1001: dbctl commit (dc=all): 'db1169 (re)pooling @ 65%: Repool db1169 after cloning db1134', diff saved to https://phabricator.wikimedia.org/P14502 and previous config saved to /var/cache/conftool/dbconfig/20210226-093743-root.json
  • 09:33 aborrero@cumin2001: START - Cookbook sre.hosts.reboot-single for host cloudcontrol2001-dev.wikimedia.org
  • 09:28 root@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 09:24 root@cumin1001: START - Cookbook sre.dns.netbox
  • 09:22 marostegui@cumin1001: dbctl commit (dc=all): 'db1169 (re)pooling @ 50%: Repool db1169 after cloning db1134', diff saved to https://phabricator.wikimedia.org/P14501 and previous config saved to /var/cache/conftool/dbconfig/20210226-092240-root.json
  • 09:13 jbond42: puppet enabled post sudoers fix, running puppet fleet wide with cumin -b 15 '*' 'run-puppet-agent '
  • 09:07 marostegui@cumin1001: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repool db1169 after cloning db1134', diff saved to https://phabricator.wikimedia.org/P14500 and previous config saved to /var/cache/conftool/dbconfig/20210226-090736-root.json
  • 08:55 jbond42: disabled puppet pending rollback of https://gerrit.wikimedia.org/r/666899
  • 08:52 marostegui@cumin1001: dbctl commit (dc=all): 'db1169 (re)pooling @ 25%: Repool db1169 after cloning db1134', diff saved to https://phabricator.wikimedia.org/P14498 and previous config saved to /var/cache/conftool/dbconfig/20210226-085233-root.json
  • 08:37 marostegui@cumin1001: dbctl commit (dc=all): 'db1169 (re)pooling @ 15%: Repool db1169 after cloning db1134', diff saved to https://phabricator.wikimedia.org/P14497 and previous config saved to /var/cache/conftool/dbconfig/20210226-083729-root.json
  • 08:22 marostegui@cumin1001: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repool db1169 after cloning db1134', diff saved to https://phabricator.wikimedia.org/P14496 and previous config saved to /var/cache/conftool/dbconfig/20210226-082226-root.json
  • 08:19 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1058.eqiad.wmnet with reason: REIMAGE
  • 08:17 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1058.eqiad.wmnet with reason: REIMAGE
  • 08:07 marostegui@cumin1001: dbctl commit (dc=all): 'db1169 (re)pooling @ 5%: Repool db1169 after cloning db1134', diff saved to https://phabricator.wikimedia.org/P14495 and previous config saved to /var/cache/conftool/dbconfig/20210226-080722-root.json
  • 08:04 elukey: run ipmi mc reset cold for analytics1058 - mgmt responding to pings and ipmi, but not to ssh
  • 07:52 marostegui@cumin1001: dbctl commit (dc=all): 'db1169 (re)pooling @ 1%: Repool db1169 after cloning db1134', diff saved to https://phabricator.wikimedia.org/P14494 and previous config saved to /var/cache/conftool/dbconfig/20210226-075219-root.json
  • 07:02 marostegui: Stop MySQL on db2106 to clone db2147 T275633
  • 07:01 elukey: reboot an-worker1099 to clear out kernel soft lockup errors
  • 06:59 elukey: restart datanode on an-worker1099 - soft lockup kernel errors
  • 06:53 kartik@deploy1001: Synchronized php-1.36.0-wmf.32/extensions/ContentTranslation: Bump ContentTranslation to e6b1a7c to include lost {{gerrit|666327}} backport (duration: 00m 58s)
  • 06:39 marostegui@cumin1001: dbctl commit (dc=all): 'Remove db1092 from dbctl T275019', diff saved to https://phabricator.wikimedia.org/P14492 and previous config saved to /var/cache/conftool/dbconfig/20210226-063914-marostegui.json
  • 06:32 kartik@deploy1001: Synchronized php-1.36.0-wmf.32/extensions/ContentTranslation: Resync ContentTranslation for gerrit:666327 (duration: 01m 16s)
  • 06:17 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1169 to clone db1134 T275343', diff saved to https://phabricator.wikimedia.org/P14490 and previous config saved to /var/cache/conftool/dbconfig/20210226-061705-marostegui.json
  • 05:29 ryankemper@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2045.codfw.wmnet with reason: REIMAGE
  • 05:27 ryankemper@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2045.codfw.wmnet with reason: REIMAGE
  • 05:25 ryankemper: [relforge] Downtimed `relforge1004` until `2021-03-02 07:23:36` (https://phabricator.wikimedia.org/T275658 is in flight to fix broken `kibana.service`)
  • 05:07 ryankemper: T275345 `sudo -i wmf-auto-reimage-host --conftool -p T275345 elastic2045.codfw.wmnet` on `ryankemper@cumin2001` tmux session `elastic_reimage_elastic1065`
  • 04:23 ryankemper: T267927 [WDQS Data Reload] `sudo -i cookbook sre.wdqs.data-reload wdqs2008.codfw.wmnet --task-id T267927 --reload-data wikidata --reason 'T267927: Reload wikidata jnl from fresh dumps' --reuse-downloaded-dump --depool` on `ryankemper@cumin2001` tmux session `wdqs_data_reload_2008`
  • 04:21 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-reload
  • 00:14 urbanecm@deploy1001: Synchronized php-1.36.0-wmf.32/extensions/Graph/: 9d5cf34: Do not log graph errors to WMF servers (T274557) (duration: 01m 36s)

2021-02-25

  • 23:55 mutante: deploy1002, deploy2002 - scap-master-sync deploy1001.eqiad.wmnet (T265963)
  • 23:41 mutante: deploy2001 2/2 - because rsync runs with --delete but also with --exclude="**/cache/l10n/*.cdb" --exclude="*.swp", you can't expect /srv/mediawiki-staging to be the same size on the 2 servers
  • 23:39 mutante: deploy2001 - scap-master-sync from deploy1001 runs and attempts to --delete files to stay in sync, but fails because *.cdb files are in cache dirs and rsync will not delete non-empty directories; this lets /srv/mediawiki-staging build up to 10 times its size in eqiad
  • 23:34 mutante: deploy2001 - scap-master-sync from deploy1001
  • 23:13 mutante: deploy1002 - /usr/local/bin/scap-master-sync deploy1001.eqiad.wmnet
  • 23:01 dduvall@deploy1001: Pruned MediaWiki: 1.36.0-wmf.30 (duration: 04m 20s)
  • 21:38 legoktm: pushed new version of docker-registry.discovery.wmnet/wikimedia-buster image
  • 21:20 mutante: deploy2001 - rsynced /srv/deployment from deploy1001 after gerrit:666757
  • 20:57 eileen: civicrm revision changed from 604d07c859 to f07390ff87, config revision is 643477b35d
  • 20:35 jhuneidi@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.32 refs T274936
  • 20:17 tgr@deploy1001: Synchronized php-1.36.0-wmf.31/extensions/GrowthExperiments/: Backport: Impact module: Add "not rendered" state (T270294, T275615) (duration: 01m 08s)
  • 19:40 tgr@deploy1001: Synchronized php-1.36.0-wmf.32/extensions/GrowthExperiments/: Backport: Impact module: Add "not rendered" state (T270294, T275615) (duration: 01m 26s)
  • 19:16 ryankemper: T267927 Downloading dumps: `sudo https_proxy=webproxy.codfw.wmnet:8080 wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 -O /srv/wdqs/latest-all.ttl.bz2 && sudo https_proxy=webproxy.codfw.wmnet:8080 wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.ttl.bz2 -O /srv/wdqs/latest-lexemes.ttl.bz2` on `ryankemper@wdqs2008` tmux session `download_latest_dumps`
  • 18:59 ryankemper@cumin2001: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
  • 18:59 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-reload
  • 18:59 ryankemper: T267927 Manual puppet run got `wdqs2008` present in puppetdb again. Now blocked by the lack of a host key for `wdqs2008` on `cumin2001`, so I'm running puppet on `cumin2001` to get the latest state of `/etc/ssh/ssh_known_hosts`
  • 18:57 ryankemper@cumin2001: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
  • 18:57 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-reload
  • 18:56 ryankemper@cumin2001: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
  • 18:56 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-reload
  • 18:50 ryankemper: T267927 Trying to kick off data reload on `wdqs2008` from `cumin2001` fails because of `spicerack.remote.RemoteError: No hosts provided`. Some spelunking through IRC history suggests this happens when a host is not present in puppetDB. I've confirmed `wdqs2008` is absent on puppetboard, so running puppet agent to get it re-registered (hopefully)
  • 18:38 ryankemper@cumin2001: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
  • 18:38 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-reload
  • 18:37 pt1979@cumin2001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 18:37 ryankemper@cumin2001: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
  • 18:37 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-reload
  • 18:36 ryankemper@cumin2001: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
  • 18:36 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-reload
  • 18:31 pt1979@cumin2001: START - Cookbook sre.dns.netbox
  • 18:30 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'staging' .
  • 18:30 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'production' .
  • 18:30 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
  • 18:30 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
  • 18:30 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'production' .
  • 18:30 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
  • 18:30 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' .
  • 18:30 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
  • 18:27 oblivian@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 18:25 oblivian@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 18:23 bblack: dns[1235]002 - upgrade gdnsd to 3.6.0 (dns4002 and authdns2001 already running it for some time!)
  • 18:21 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'sessionstore' for release 'staging' .
  • 18:20 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'sessionstore' for release 'production' .
  • 18:19 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'recommendation-api' for release 'production' .
  • 18:19 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'canary' .
  • 18:19 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'main' .
  • 18:19 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' .
  • 18:19 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
  • 18:19 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
  • 18:18 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'production' .
  • 18:18 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'staging' .
  • 18:18 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 18:18 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 18:18 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 18:18 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventstreams-internal' for release 'canary' .
  • 18:18 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' .
  • 18:08 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
  • 18:08 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
  • 17:58 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 17:58 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
  • 17:48 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 17:47 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
  • 17:37 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 17:37 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 17:27 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' .
  • 17:27 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
  • 17:27 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'production' .
  • 17:27 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'staging' .
  • 17:27 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 17:27 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
  • 17:26 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'production' .
  • 17:26 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' .
  • 17:26 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 17:26 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 17:26 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 17:26 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 17:26 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
  • 17:26 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
  • 17:25 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
  • 17:25 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
  • 17:25 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'production' .
  • 17:25 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'staging' .
  • 17:19 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'staging' .
  • 17:19 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'production' .
  • 17:18 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
  • 17:18 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
  • 17:18 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'production' .
  • 17:18 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
  • 17:18 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' .
  • 17:17 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
  • 17:07 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'sessionstore' for release 'staging' .
  • 17:07 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'sessionstore' for release 'production' .
  • 17:06 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'recommendation-api' for release 'production' .
  • 17:06 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'main' .
  • 17:06 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'canary' .
  • 17:05 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' .
  • 17:04 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
  • 17:04 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
  • 17:04 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'staging' .
  • 17:04 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'production' .
  • 17:03 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 17:03 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 17:03 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 17:02 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventstreams-internal' for release 'canary' .
  • 17:02 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' .
  • 17:01 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
  • 17:01 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
  • 17:00 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 17:00 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
  • 16:54 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 16:54 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
  • 16:50 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 16:50 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 16:39 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' .
  • 16:39 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
  • 16:39 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'production' .
  • 16:39 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'staging' .
  • 16:38 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
  • 16:38 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 16:38 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' .
  • 16:38 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'production' .
  • 16:38 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 16:38 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 16:37 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 16:37 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 16:37 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
  • 16:37 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
  • 16:36 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
  • 16:36 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
  • 16:34 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'staging' .
  • 16:34 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'production' .
  • 16:28 akosiaris@deploy1001: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 16:23 akosiaris@deploy1001: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 16:17 akosiaris@deploy1001: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 16:16 akosiaris@deploy1001: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 16:16 akosiaris@deploy1001: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 16:16 akosiaris@deploy1001: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 16:16 akosiaris@deploy1001: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 15:38 elukey@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-serve-ctrl2002.codfw.wmnet
  • 15:26 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2002.codfw.wmnet with reason: REIMAGE
  • 15:24 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2002.codfw.wmnet with reason: REIMAGE
  • 15:23 elukey@cumin1001: START - Cookbook sre.ganeti.makevm for new host ml-serve-ctrl2002.codfw.wmnet
  • 15:23 elukey@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-serve-ctrl2001.codfw.wmnet
  • 15:05 elukey@cumin1001: START - Cookbook sre.ganeti.makevm for new host ml-serve-ctrl2001.codfw.wmnet
  • 15:00 moritzm: installing libmaxminddb updates from buster 10.8 point release
  • 14:59 vgutierrez: pool cp4032
  • 14:42 vgutierrez: depool cp4032 for ats-tls/NUMA tests
  • 14:35 klausman@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-serve-ctrl1002.eqiad.wmnet
  • 14:27 moritzm: installing postgresql security updates on buster
  • 14:24 kormat@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-serve-ctrl1001.eqiad.wmnet
  • 14:22 klausman@cumin1001: START - Cookbook sre.ganeti.makevm for new host ml-serve-ctrl1002.eqiad.wmnet
  • 14:20 klausman@cumin1001: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ml-serve-ctrl1002.eqiad.wmnet
  • 14:17 klausman@cumin1001: START - Cookbook sre.ganeti.makevm for new host ml-serve-ctrl1002.eqiad.wmnet
  • 14:16 moritzm: installing cairo security updates on buster
  • 14:14 klausman@cumin1001: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ml-serve-ctrl1002.eqiad.wmnet
  • 14:10 klausman@cumin1001: START - Cookbook sre.ganeti.makevm for new host ml-serve-ctrl1002.eqiad.wmnet
  • 14:09 kormat@cumin1001: START - Cookbook sre.ganeti.makevm for new host ml-serve-ctrl1001.eqiad.wmnet
  • 13:57 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2002.codfw.wmnet with reason: REIMAGE
  • 13:55 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2001.codfw.wmnet with reason: REIMAGE
  • 13:55 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2002.codfw.wmnet with reason: REIMAGE
  • 13:53 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2001.codfw.wmnet with reason: REIMAGE
  • 13:15 akosiaris: reinitialize all of staging-codfw. kubestage2* and kubestagemaster* have had downtime scheduled in icinga.
  • 12:32 moritzm: installing openssl security updates on buster
  • 12:20 Lucas_WMDE: EU backport&config window done
  • 12:16 phuedx@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Config: [stage 1] Enable WVUI search by default to logged-in modern Vector users except on pilot wikis (T249297) (duration: 01m 31s)
  • 11:56 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1134.eqiad.wmnet with reason: REIMAGE
  • 11:54 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1134.eqiad.wmnet with reason: REIMAGE
  • 11:47 jbond42: upload new wmf-laptop package
  • 11:40 marostegui: Stop MySQL on db1134 to reimage it to buster T275343
  • 11:29 kormat@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on dborch1001.wikimedia.org with reason: Restart for new kernel
  • 11:29 kormat@cumin1001: START - Cookbook sre.hosts.downtime for 0:15:00 on dborch1001.wikimedia.org with reason: Restart for new kernel
  • 11:28 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host otrs1001.eqiad.wmnet
  • 11:22 moritzm: reset-failed ifup@ens5.service on otrs1001 T273026
  • 11:15 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host otrs1001.eqiad.wmnet
  • 11:15 moritzm: rebooting otrs1001 (ticket.wikimedia.org) for a kernel update
  • 10:59 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker[1117-1118].eqiad.wmnet
  • 10:57 elukey@cumin1001: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1117-1118].eqiad.wmnet
  • 10:42 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1118.eqiad.wmnet with reason: REIMAGE
  • 10:40 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1117.eqiad.wmnet with reason: REIMAGE
  • 10:40 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1118.eqiad.wmnet with reason: REIMAGE
  • 10:38 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1117.eqiad.wmnet with reason: REIMAGE
  • 10:37 marostegui@cumin1001: dbctl commit (dc=all): 'db1088 (re)pooling @ 100%: After cloning db1168', diff saved to https://phabricator.wikimedia.org/P14481 and previous config saved to /var/cache/conftool/dbconfig/20210225-103719-root.json
  • 10:34 klausman@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2002.codfw.wmnet with reason: REIMAGE
  • 10:32 klausman@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2002.codfw.wmnet with reason: REIMAGE
  • 10:22 marostegui@cumin1001: dbctl commit (dc=all): 'db1088 (re)pooling @ 75%: After cloning db1168', diff saved to https://phabricator.wikimedia.org/P14480 and previous config saved to /var/cache/conftool/dbconfig/20210225-102215-root.json
  • 10:07 marostegui@cumin1001: dbctl commit (dc=all): 'db1088 (re)pooling @ 50%: After cloning db1168', diff saved to https://phabricator.wikimedia.org/P14479 and previous config saved to /var/cache/conftool/dbconfig/20210225-100712-root.json
  • 10:05 klausman@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2003.codfw.wmnet with reason: REIMAGE
  • 10:03 klausman@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2004.codfw.wmnet with reason: REIMAGE
  • 10:01 klausman@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2003.codfw.wmnet with reason: REIMAGE
  • 10:01 klausman@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2004.codfw.wmnet with reason: REIMAGE
  • 09:52 marostegui@cumin1001: dbctl commit (dc=all): 'db1088 (re)pooling @ 25%: After cloning db1168', diff saved to https://phabricator.wikimedia.org/P14477 and previous config saved to /var/cache/conftool/dbconfig/20210225-095208-root.json
  • 09:37 marostegui@cumin1001: dbctl commit (dc=all): 'db1088 (re)pooling @ 10%: After cloning db1168', diff saved to https://phabricator.wikimedia.org/P14476 and previous config saved to /var/cache/conftool/dbconfig/20210225-093705-root.json
  • 09:32 klausman@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2001.codfw.wmnet with reason: REIMAGE
  • 09:32 klausman@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2001.codfw.wmnet with reason: REIMAGE
  • 09:21 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1032.eqiad.wmnet
  • 09:14 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single for host mc1032.eqiad.wmnet
  • 09:10 effie: upgrade memcached on mc1032, mc2032, mc2036
  • 08:32 volans@cumin2001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 08:29 volans@cumin2001: START - Cookbook sre.dns.netbox
  • 08:15 vgutierrez: restart ats-tls on cp5006 to enable parent proxies support - T274888
  • 08:15 XioNoX: un-drain lumen eqiad-codfw link for BW testing
  • 08:07 XioNoX: drain lumen eqiad-codfw link for BW testing
  • 06:50 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1088 to clone db1168 T258361', diff saved to https://phabricator.wikimedia.org/P14474 and previous config saved to /var/cache/conftool/dbconfig/20210225-065018-marostegui.json
  • 06:32 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1092 T275019', diff saved to https://phabricator.wikimedia.org/P14473 and previous config saved to /var/cache/conftool/dbconfig/20210225-063243-marostegui.json
  • 00:29 ryankemper: T274204 Restored service health on `elastic106[0,4,5]` via `sudo apt-get remove --purge wmf-elasticsearch-search-plugins --yes && sudo dpkg -i /var/cache/apt/archives/wmf-elasticsearch-search-plugins_6.5.4-4~stretch_all.deb && sudo puppet agent -tv`. There's some sort of issue with `6.5.4-5~stretch` that we will need to circle back and investigate; for now the fleet is staying on `6.5.4-4~stretch`
  • 00:05 ryankemper: T274204 `Ctrl+C`'d out of the current rolling-upgrade; the 3 hosts that have their elasticsearch systemd units in a failing state are running the latest plugin version, meaning the new version is likely the cause of the failures
  • 00:01 mutante: mwlog1001 - temp disabling puppet to deploy gerrit:661200 - because this is a jessie host
  • 00:01 ryankemper@cumin1001: END (ERROR) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=97)

2021-02-24

  • 23:42 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-upgrade
  • 23:30 ryankemper@cumin1001: END (FAIL) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=99)
  • 23:18 ryankemper: T274204 `sudo -i cookbook sre.elasticsearch.rolling-upgrade search_eqiad "eqiad cluster restarts" --task-id T274204 --nodes-per-run 3`
  • 23:18 ryankemper@cumin1001: START - Cookbook sre.elasticsearch.rolling-upgrade
  • 23:17 ryankemper: T274204 Beginning rolling-upgrade of `eqiad` CirrusSearch cluster to upgrade to `wmf-elasticsearch-search-plugins/stretch-wikimedia 6.5.4-5~stretch`, see tmux session `elastic_rolling_upgrade` on `ryankemper@cumin1001`
  • 23:13 eileen: civicrm revision is 5e042e6e57, config revision is 8572611a32
  • 22:09 ryankemper: T265113 Unbanned `elastic1063` from both Elasticsearch clusters (`production-search-eqiad` and `production-search-omega-eqiad`)
  • 22:03 Urbanecm: Deploy security patches for T275669
  • 20:59 andrew@cumin1001: END (PASS) - Cookbook wmcs.wikireplicas.add_wiki (exit_code=0)
  • 20:59 andrew@cumin1001: Added views for new wiki: mniwiki T273465
  • 20:43 mstyles@deploy1001: Finished deploy [wikimedia/discovery/analytics@44fba51]: add import ttl dags - T270103 (duration: 02m 33s)
  • 20:40 mstyles@deploy1001: Started deploy [wikimedia/discovery/analytics@44fba51]: add import ttl dags - T270103
  • 20:36 andrew@cumin1001: START - Cookbook wmcs.wikireplicas.add_wiki
  • 20:35 andrew@cumin1001: END (PASS) - Cookbook wmcs.wikireplicas.add_wiki (exit_code=0)
  • 20:35 andrew@cumin1001: Added views for new wiki: mniwiktionary T273459
  • 20:16 jhuneidi@deploy1001: Synchronized php: group1 wikis to 1.36.0-wmf.32 refs T274936 (duration: 01m 10s)
  • 20:15 jhuneidi@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.32 refs T274936
  • 20:12 andrew@cumin1001: START - Cookbook wmcs.wikireplicas.add_wiki
  • 19:52 mstyles@deploy1001: Finished deploy [wikimedia/discovery/analytics@44fba51]: airflow dags for importing ttl data (duration: 00m 42s)
  • 19:51 mstyles@deploy1001: Started deploy [wikimedia/discovery/analytics@44fba51]: airflow dags for importing ttl data
  • 19:32 andrew@cumin1001: END (FAIL) - Cookbook wmcs.wikireplicas.add_wiki (exit_code=99)
  • 19:21 andrew@cumin1001: START - Cookbook wmcs.wikireplicas.add_wiki
  • 19:14 urbanecm@deploy1001: Synchronized wmf-config/CommonSettings.php: f9f968a: Remove unneeded $wgHiddenPrefs[] = visualeditor-betatempdisable (T273188) (duration: 01m 04s)
  • 19:07 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: f21fc4a: Enable SecurePoll logging for votewiki, testwiki (T273990) (duration: 01m 08s)
  • 17:40 bblack: authdns2001 - trial upgrade gdnsd to 3.6.0-1~wmf1
  • 16:49 klausman@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1003.eqiad.wmnet with reason: REIMAGE
  • 16:47 klausman@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1003.eqiad.wmnet with reason: REIMAGE
  • 16:47 klausman@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1004.eqiad.wmnet with reason: REIMAGE
  • 16:45 klausman@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1004.eqiad.wmnet with reason: REIMAGE
  • 16:45 klausman@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1002.eqiad.wmnet with reason: REIMAGE
  • 16:42 klausman@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1001.eqiad.wmnet with reason: REIMAGE
  • 16:41 klausman@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1002.eqiad.wmnet with reason: REIMAGE
  • 16:40 klausman@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1001.eqiad.wmnet with reason: REIMAGE
  • 16:15 klausman@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ml-serve[1003-1004].eqiad.wmnet with reason: Reimaging failures due to broken partman recipe
  • 16:15 klausman@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ml-serve[1003-1004].eqiad.wmnet with reason: Reimaging failures due to broken partman recipe
  • 15:54 milimetric@deploy1001: Finished deploy [analytics/refinery@5908f27] (test): Train hotfix (duration: 00m 13s)
  • 15:54 milimetric@deploy1001: Started deploy [analytics/refinery@5908f27] (test): Train hotfix
  • 15:54 milimetric@deploy1001: Finished deploy [analytics/refinery@5908f27] (thin): Train hotfix (duration: 00m 06s)
  • 15:54 milimetric@deploy1001: Started deploy [analytics/refinery@5908f27] (thin): Train hotfix
  • 15:54 milimetric@deploy1001: Finished deploy [analytics/refinery@5908f27]: Train hotfix (duration: 11m 36s)
  • 15:42 milimetric@deploy1001: Started deploy [analytics/refinery@5908f27]: Train hotfix
  • 15:38 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Migrate all WMDE Technical Wishes schemas to EventGate on all wikis (duration: 01m 05s)
  • 15:21 milimetric@deploy1001: Finished deploy [analytics/refinery@bcb1a69] (test): Regular analytics weekly train TEST [analytics/refinery@bcb1a69] (duration: 00m 13s)
  • 15:21 milimetric@deploy1001: Started deploy [analytics/refinery@bcb1a69] (test): Regular analytics weekly train TEST [analytics/refinery@bcb1a69]
  • 15:21 milimetric@deploy1001: Finished deploy [analytics/refinery@bcb1a69] (thin): Regular analytics weekly train THIN [analytics/refinery@bcb1a69] (duration: 00m 06s)
  • 15:21 milimetric@deploy1001: Started deploy [analytics/refinery@bcb1a69] (thin): Regular analytics weekly train THIN [analytics/refinery@bcb1a69]
  • 15:21 milimetric@deploy1001: Finished deploy [analytics/refinery@bcb1a69]: Regular analytics weekly train [analytics/refinery@bcb1a69] (duration: 17m 10s)
  • 15:06 godog: bounce icinga on alert1001 - reported high latency
  • 15:06 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Migrate HomepageVisit and ServerSideAccountCreation EL streams to all wikis - T267333 (duration: 01m 05s)
  • 15:03 milimetric@deploy1001: Started deploy [analytics/refinery@bcb1a69]: Regular analytics weekly train [analytics/refinery@bcb1a69]
  • 15:01 klausman@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve[1001-1004].eqiad.wmnet with reason: Reimaging for T272918
  • 15:01 klausman@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve[1001-1004].eqiad.wmnet with reason: Reimaging for T272918
  • 14:50 bblack: dns4002 - trial upgrade gdnsd to 3.6.0-1~wmf1
  • 14:29 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db2152.codfw.wmnet with reason: REIMAGE
  • 14:29 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on db2152.codfw.wmnet with reason: REIMAGE
  • 14:25 jynus@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ms-be2016.codfw.wmnet with reason: REIMAGE
  • 14:25 jynus@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2016.codfw.wmnet with reason: REIMAGE
  • 14:13 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db2151.codfw.wmnet with reason: REIMAGE
  • 14:13 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on db2151.codfw.wmnet with reason: REIMAGE
  • 14:05 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db2150.codfw.wmnet with reason: REIMAGE
  • 14:04 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on db2150.codfw.wmnet with reason: REIMAGE
  • 13:46 marostegui: Compare data between db1134 and db1163 T275343
  • 13:34 moritzm: restarting FPM/mcrouter on mw canaries to pick up openssl updates
  • 13:11 moritzm: installing openssl security updates on buster
  • 12:32 Urbanecm: Two undeployed patches were reverted to unbreak deployments (666340, 666341), cc marxarelli
  • 12:25 phuedx@deploy1001: Synchronized php-1.36.0-wmf.32/extensions/WikimediaEvents: Backport: Fix dynamically loaded instruments (duration: 01m 11s)
  • 12:20 marostegui@cumin1001: dbctl commit (dc=all): 'db1112 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P14465 and previous config saved to /var/cache/conftool/dbconfig/20210224-122043-root.json
  • 12:18 jayme@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 12:17 jayme@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 12:17 jayme@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 12:16 jayme@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 12:15 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 12:14 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 12:06 hnowlan: restarting mtail on A:mw-api or A:parsoid or A:mw-jobrunner or A:mw
  • 12:05 marostegui@cumin1001: dbctl commit (dc=all): 'db1112 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P14464 and previous config saved to /var/cache/conftool/dbconfig/20210224-120538-root.json
  • 11:54 jayme@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'zotero' for release 'production' .
  • 11:53 jayme@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'production' .
  • 11:51 jayme@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'similar-users' for release 'main' .
  • 11:50 marostegui@cumin1001: dbctl commit (dc=all): 'db1112 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P14463 and previous config saved to /var/cache/conftool/dbconfig/20210224-115034-root.json
  • 11:45 jayme@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'recommendation-api' for release 'production' .
  • 11:44 jayme@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'push-notifications' for release 'main' .
  • 11:42 jayme@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' .
  • 11:39 jayme@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
  • 11:35 marostegui@cumin1001: dbctl commit (dc=all): 'db1112 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P14462 and previous config saved to /var/cache/conftool/dbconfig/20210224-113531-root.json
  • 11:33 jayme@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'mathoid' for release 'production' .
  • 11:32 jayme@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' .
  • 11:29 jayme@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
  • 11:29 jayme@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
  • 11:24 jayme@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 11:23 jayme@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
  • 11:22 jayme@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
  • 11:22 jayme@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 11:20 marostegui@cumin1001: dbctl commit (dc=all): 'db1112 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P14461 and previous config saved to /var/cache/conftool/dbconfig/20210224-112027-root.json
  • 11:20 jayme@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
  • 11:15 jayme@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 11:14 jayme@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' .
  • 11:13 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1112 for schema change', diff saved to https://phabricator.wikimedia.org/P14460 and previous config saved to /var/cache/conftool/dbconfig/20210224-111301-marostegui.json
  • 11:12 jayme@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'apertium' for release 'production' .
  • 11:06 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudmetrics1002.eqiad.wmnet with reason: server issue
  • 11:06 aborrero@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudmetrics1002.eqiad.wmnet with reason: server issue
  • 10:52 marostegui@cumin1001: dbctl commit (dc=all): 'db1157 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P14459 and previous config saved to /var/cache/conftool/dbconfig/20210224-105204-root.json
  • 10:37 marostegui@cumin1001: dbctl commit (dc=all): 'db1157 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P14458 and previous config saved to /var/cache/conftool/dbconfig/20210224-103700-root.json
  • 10:21 marostegui@cumin1001: dbctl commit (dc=all): 'db1157 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P14457 and previous config saved to /var/cache/conftool/dbconfig/20210224-102157-root.json
  • 10:20 jayme@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'zotero' for release 'production' .
  • 10:19 moritzm: installing gnutls28 bugfix updates from Buster 10.8 point release
  • 10:16 jayme@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'termbox' for release 'production' .
  • 10:13 jayme@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'similar-users' for release 'main' .
  • 10:10 jayme@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'recommendation-api' for release 'production' .
  • 10:06 marostegui@cumin1001: dbctl commit (dc=all): 'db1157 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P14456 and previous config saved to /var/cache/conftool/dbconfig/20210224-100653-root.json
  • 10:04 jayme@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'push-notifications' for release 'main' .
  • 10:02 jayme@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' .
  • 09:56 jayme@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
  • 09:51 marostegui@cumin1001: dbctl commit (dc=all): 'db1157 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P14455 and previous config saved to /var/cache/conftool/dbconfig/20210224-095150-root.json
  • 09:45 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1157 for schema change', diff saved to https://phabricator.wikimedia.org/P14454 and previous config saved to /var/cache/conftool/dbconfig/20210224-094523-marostegui.json
  • 09:34 marostegui: Update pc2007, pc2010, db2071
  • 09:31 marostegui: Update db1077
  • 09:27 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1033.eqiad.wmnet
  • 09:20 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single for host mc1033.eqiad.wmnet
  • 09:19 effie: upgrade memcached on mc1033, mc2033
  • 09:07 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast1002.wikimedia.org with reason: REIMAGE
  • 09:06 volans: run "sudo find . -user root -exec chown netbox. '{}' \;" in /srv/deployment/netbox/deploy-cache/revs on netbox* hosts to prevent scap failures on cleanup - T265084
  • 09:05 jmm@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on bast1002.wikimedia.org with reason: REIMAGE
  • 09:01 elukey: roll restart druid brokers on druid public
  • 08:58 jayme@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'mathoid' for release 'production' .
  • 08:53 jayme@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' .
  • 08:52 jayme@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
  • 08:52 jayme@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
  • 08:50 jayme@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 08:50 jayme@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
  • 08:48 jayme@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 08:48 jayme@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
  • 08:35 moritzm: reimaging bast1002 to Buster
  • 08:33 jayme@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
  • 08:32 jayme@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 08:30 jayme@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' .
  • 08:26 jayme@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'apertium' for release 'production' .
  • 08:04 jynus: restarting db2101, db2139, db2141 T271913
  • 07:56 moritzm: installing remaining openldap updates for buster
  • 06:24 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1090.eqiad.wmnet
  • 06:18 marostegui@cumin1001: START - Cookbook sre.hosts.decommission for hosts db1090.eqiad.wmnet
  • 04:10 ryankemper: T267927 [WDQS Data Reload] Running `/srv/deployment/wdqs/wdqs/loadData.sh -n wdq -d /srv/wdqs/munged/ -s 864` on `ryankemper@wdqs2008` tmux session `data_reload`
  • 04:04 ryankemper: [WDQS] Depooled `wdqs2008`
  • 03:16 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db2149.codfw.wmnet with reason: REIMAGE
  • 03:13 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on db2149.codfw.wmnet with reason: REIMAGE
  • 03:03 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db2148.codfw.wmnet with reason: REIMAGE
  • 03:01 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on db2148.codfw.wmnet with reason: REIMAGE
  • 02:58 ryankemper: [WDQS Data Reload] Restarting reload on test node `wdqs1009` from where it last left off: `/srv/deployment/wdqs/wdqs/loadData.sh -n wdq -d /srv/wdqs/munged/ -s 947`
  • 02:57 ryankemper: [WDQS Deploy] Deploy complete. Successful test query placed on query.wikidata.org, there are no relevant criticals in Icinga, and Grafana looks good
  • 02:39 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db2147.codfw.wmnet with reason: REIMAGE
  • 02:37 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on db2147.codfw.wmnet with reason: REIMAGE
  • 02:35 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db2146.codfw.wmnet with reason: REIMAGE
  • 02:33 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on db2146.codfw.wmnet with reason: REIMAGE
  • 02:30 ryankemper: [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'`
  • 02:29 ryankemper: [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'`
  • 02:29 ryankemper: [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'`
  • 02:27 ryankemper@deploy1001: Finished deploy [wdqs/wdqs@b5fc9d5]: 0.3.64 (duration: 06m 24s)
  • 02:24 ebernhardson@deploy1001: Finished deploy [wikimedia/discovery/analytics@25549e7]: ores_bulk_ingest: use backoffs starting at 30sec (duration: 01m 37s)
  • 02:22 gehel@cumin2001: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
  • 02:22 ebernhardson@deploy1001: Started deploy [wikimedia/discovery/analytics@25549e7]: ores_bulk_ingest: use backoffs starting at 30sec
  • 02:20 ryankemper@deploy1001: Started deploy [wdqs/wdqs@b5fc9d5]: 0.3.64
  • 02:18 ryankemper@deploy1001: Finished deploy [wdqs/wdqs@b5fc9d5]: 0.3.64 (duration: 11m 22s)
  • 02:09 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db2145.codfw.wmnet with reason: REIMAGE
  • 02:07 ryankemper: [WDQS Deploy] Tests passing following deploy of `0.3.64` on canary `wdqs1003`; proceeding to rest of fleet
  • 02:07 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on db2145.codfw.wmnet with reason: REIMAGE
  • 02:06 ryankemper@deploy1001: Started deploy [wdqs/wdqs@b5fc9d5]: 0.3.64
  • 02:06 ryankemper: [WDQS Deploy] Gearing up for deploy of wdqs `0.3.64`. Pre-deploy tests passing on canary `wdqs1003`
  • 00:58 volker-e@deploy1001: Finished deploy [design/style-guide@a66b5b6]: Deploy design/style-guide: a66b5b6 “Components”: Add “Dialogs” (#430) (duration: 00m 06s)
  • 00:58 volker-e@deploy1001: Started deploy [design/style-guide@a66b5b6]: Deploy design/style-guide: a66b5b6 “Components”: Add “Dialogs” (#430)
  • 00:47 ebernhardson@deploy1001: Finished deploy [wikimedia/discovery/analytics@4ee50e3]: ores_bulk_ingest: more retry on error (duration: 01m 37s)
  • 00:45 ebernhardson@deploy1001: Started deploy [wikimedia/discovery/analytics@4ee50e3]: ores_bulk_ingest: more retry on error
  • 00:03 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db2145.codfw.wmnet with reason: REIMAGE
  • 00:02 pt1979@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on db2145.codfw.wmnet with reason: REIMAGE

2021-02-23

  • 22:52 chaomodus: Netbox 2.10 upgrade complete T265084
  • 22:28 crusnov@deploy1001: Finished deploy [netbox/deploy@dabbf5e]: Deploying Netbox 2.10.4-wmf to production T265084 (duration: 06m 11s)
  • 22:25 ppchelko@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 22:25 ppchelko@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 22:23 ppchelko@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 22:23 ppchelko@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 22:22 crusnov@deploy1001: Started deploy [netbox/deploy@dabbf5e]: Deploying Netbox 2.10.4-wmf to production T265084
  • 22:21 ppchelko@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 22:21 ppchelko@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 22:17 chaomodus: deploying Netbox 2.10 to production and associated work
  • 21:48 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Fix typos in wgEventLoggingSchemas (duration: 01m 05s)
  • 21:38 jhuneidi@deploy1001: rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.32 refs T274936
  • 21:36 ebernhardson@deploy1001: Finished deploy [wikimedia/discovery/analytics@1344853]: apply spark env_vars to executors too (duration: 01m 46s)
  • 21:34 ebernhardson@deploy1001: Started deploy [wikimedia/discovery/analytics@1344853]: apply spark env_vars to executors too
  • 21:28 jhuneidi@deploy1001: Finished scap: testwikis wikis to 1.36.0-wmf.32 refs T274936 (duration: 36m 52s)
  • 21:03 otto@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 21:03 otto@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 21:00 ebernhardson@deploy1001: Finished deploy [wikimedia/discovery/analytics@46a8ae1]: ores_bulk_ingest: namespace is not plural (duration: 01m 41s)
  • 21:00 otto@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 21:00 otto@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 20:58 ebernhardson@deploy1001: Started deploy [wikimedia/discovery/analytics@46a8ae1]: ores_bulk_ingest: namespace is not plural
  • 20:57 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host gitlab1002.eqiad.wmnet
  • 20:52 jhuneidi@deploy1001: Started scap: testwikis wikis to 1.36.0-wmf.32 refs T274936
  • 20:44 ppchelko@deploy1001: Synchronized wmf-config/CommonSettings.php: No-op: math enable talking to mathoid directly in labs, T274436 (duration: 00m 57s)
  • 20:38 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Fix typo in visualeditortemplatedialoguse - T275015 (duration: 01m 01s)
  • 20:13 razzi@cumin1001: END (PASS) - Cookbook sre.kafka.reboot-workers (exit_code=0) for Kafka jumbo cluster: Reboot kafka nodes - razzi@cumin1001
  • 20:04 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm for new host gitlab1002.eqiad.wmnet
  • 19:54 ryankemper@cumin1001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
  • 19:54 ryankemper@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 19:49 ryankemper@cumin1001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
  • 19:49 ryankemper@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 19:43 ryankemper: [WDQS Deploy] Disk space low on `wdqs1009`, rolling back so that can be addressed
  • 19:43 ryankemper@deploy1001: Finished deploy [wdqs/wdqs@b5fc9d5]: 0.3.64 (duration: 08m 01s)
  • 19:38 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Declare WMDE Technical Wishes streams and migrate to EventGate on testwiki (duration: 02m 41s)
  • 19:36 ryankemper: [WDQS Deploy] Tests passing following deploy of `0.3.64` on canary `wdqs1003`; proceeding to rest of fleet
  • 19:35 ryankemper@deploy1001: Started deploy [wdqs/wdqs@b5fc9d5]: 0.3.64
  • 19:35 ryankemper: [WDQS Deploy] Gearing up for deploy of wdqs `0.3.64`. Pre-deploy tests passing on canary `wdqs1003`
  • 19:33 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host gitlab1001.eqiad.wmnet
  • 19:32 legoktm: re-enabling puppet on registry*
  • 19:30 legoktm: pushed new wikimedia-buster image
  • 19:16 ebernhardson@deploy1001: Finished deploy [wikimedia/discovery/analytics@3969cae]: new dag ores_bulk_ingest (duration: 01m 32s)
  • 19:15 ebernhardson@deploy1001: Started deploy [wikimedia/discovery/analytics@3969cae]: new dag ores_bulk_ingest
  • 19:10 dduvall@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
  • 19:08 dduvall@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
  • 19:08 legoktm: disabling puppet on registry* except registry2001 while rolling out https://gerrit.wikimedia.org/r/664683
  • 19:04 dduvall@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
  • 18:41 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm for new host gitlab1001.eqiad.wmnet
  • 18:17 ebernhardson@deploy1001: Finished deploy [wikimedia/discovery/analytics@c2190da]: environment and venv builder for ores_bulk_ingest (duration: 01m 40s)
  • 18:15 ebernhardson@deploy1001: Started deploy [wikimedia/discovery/analytics@c2190da]: environment and venv builder for ores_bulk_ingest
  • 18:15 ebernhardson@deploy1001: deploy aborted: environment and venv builder for ores_bulk_ingest (duration: 00m 16s)
  • 18:15 ebernhardson@deploy1001: Started deploy [wikimedia/discovery/analytics@c2190da]: environment and venv builder for ores_bulk_ingest
  • 18:12 pt1979@cumin2001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 18:07 pt1979@cumin2001: START - Cookbook sre.dns.netbox
  • 17:29 hnowlan@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 17:29 hnowlan@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 17:22 longma: wmf/1.36.0-wmf.32 was branched at 03c382f for T274936
  • 17:21 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1034.eqiad.wmnet
  • 17:18 hnowlan@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 17:18 hnowlan@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 17:17 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single for host mc1034.eqiad.wmnet
  • 17:16 effie: upgrade memcached on mc1034, mc2034 - T270315
  • 17:01 hnowlan@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 17:01 hnowlan@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 16:59 hnowlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 16:59 hnowlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 16:55 hnowlan@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 16:55 hnowlan@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 16:48 mholloway-shell@deploy1001: Synchronized wmf-config/InitialiseSettings.php: WikimediaEvents: Enable session tick instrument on all wikis (T274172) (duration: 00m 58s)
  • 16:46 hnowlan@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 16:46 hnowlan@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 16:42 hnowlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 16:42 hnowlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 16:25 razzi@cumin1001: START - Cookbook sre.kafka.reboot-workers for Kafka jumbo cluster: Reboot kafka nodes - razzi@cumin1001
  • 16:02 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Declare TranslationRecommendation event streams - T271163 (duration: 00m 58s)
  • 15:52 jynus: previous message should say 15:38 T267338
  • 15:51 jynus: started swift codfw backup stress test at 14:38 with 10 threads T267338
  • 15:44 elukey: reboot an-launcher1002 for kernel updates
  • 15:35 moritzm: restarting PHP/Apache on mw canaries for gnutls update
  • 15:23 moritzm: installing gnutls28 bugfix updates from Buster 10.8 point release
  • 15:17 elukey: deploy a new term to the analytics-in4 filter on cr1/cr2-eqiad (see https://gerrit.wikimedia.org/r/c/operations/homer/public/+/665814)
  • 14:55 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Remove wgEventLoggingSchemas overrides for QuickSurvey and NavigationTiming (duration: 00m 56s)
  • 14:51 elukey: drop /srv/backup-1007 on stat1008 to free space
  • 14:41 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Migrate SpecialMuteSubmit to EventGate on all wikis - T268517 (duration: 00m 58s)
  • 14:40 otto@deploy1001: sync-file aborted: Migrate SpecialMuteSubmit to EventGate on all wikis - T268517 (duration: 00m 05s)
  • 14:17 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-worker1002.eqiad.wmnet with reason: REIMAGE
  • 14:15 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-worker1002.eqiad.wmnet with reason: REIMAGE
  • 14:07 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'staging' .
  • 14:05 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
  • 14:05 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' .
  • 14:03 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
  • 14:02 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'recommendation-api' for release 'production' .
  • 14:00 moritzm: restarting PHP/Apache on mw canaries for openldap update
  • 13:58 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'main' .
  • 13:56 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' .
  • 13:54 moritzm: installing openldap security updates on buster (just client-side tools/libs, all slapd instances already fixed)
  • 13:54 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
  • 13:50 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'staging' .
  • 13:49 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' .
  • 12:54 kharlan@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 12:54 kharlan@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 12:48 kharlan@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 12:48 kharlan@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 12:44 kharlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 12:40 urbanecm@deploy1001: Synchronized php-1.36.0-wmf.31/extensions/ContentTranslation/app/: ee77c4a: bump ContentTranslation (T275385) (duration: 00m 59s)
  • 12:37 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
  • 12:36 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 12:35 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
  • 12:34 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 12:32 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' .
  • 12:32 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
  • 12:31 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
  • 12:31 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 8b7ca4c: thwikisource: Add NS 102 and NS 114 as content namespace (T275282) (duration: 00m 56s)
  • 12:30 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' .
  • 12:29 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 12:27 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 12:26 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
  • 12:19 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'staging' .
  • 12:17 jayme: running puppet on deploy1001
  • 12:15 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Add sources to specialSiteLinkGroups Wikibase setting (T138332) (duration: 01m 00s)
  • 11:27 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1035.eqiad.wmnet
  • 11:20 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single for host mc1035.eqiad.wmnet
  • 11:18 effie: upgrade memcached on mc1035, mc2035 - T270315
  • 10:06 kormat@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dbmonitor2001.wikimedia.org
  • 09:58 kormat@cumin1001: START - Cookbook sre.hosts.decommission for hosts dbmonitor2001.wikimedia.org
  • 09:45 vgutierrez: reload nginx on cloudelastic100[56]
  • 09:44 moritzm: installing screen security updates on stretch
  • 09:40 kormat@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on 6 hosts with reason: Restart mariadb to pick up config changes T266913
  • 09:40 kormat@cumin1001: START - Cookbook sre.hosts.downtime for 1:30:00 on 6 hosts with reason: Restart mariadb to pick up config changes T266913
  • 09:35 moritzm: installing bind security updates on buster (client-side tools/libs)
  • 09:10 oblivian@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 09:10 jayme@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
  • 09:06 jayme@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
  • 08:55 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1001.eqiad.wmnet
  • 08:50 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
  • 08:47 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host cumin1001.eqiad.wmnet
  • 08:40 Urbanecm: [urbanecm@mwmaint1002 ~/altwiki]$ mwscript namespaceDupes.php altwiki --fix
  • 08:39 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 9f434e2: Add ВП as an alias for NS_PROJECT in altwiki (T271980) (duration: 00m 59s)
  • 08:39 Urbanecm: Run mwscript updateSpecialPages.php --wiki=altwiki
  • 08:02 oblivian@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 07:56 oblivian@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 07:56 oblivian@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 07:13 hashar: Restarting CI Jenkins for plugin upgrade # T271683
  • 05:13 krinkle@deploy1001: Finished deploy [integration/docroot@44d5685]: I307e8f4f6979 (duration: 00m 06s)
  • 05:13 krinkle@deploy1001: Started deploy [integration/docroot@44d5685]: I307e8f4f6979
  • 00:46 eileen: civicrm revision changed from c535ac603a to 5e042e6e57, config revision is ef64f705bb

2021-02-22

  • 23:59 mutante: logstash2031 - systemctl reset-failed
  • 23:53 mutante: stat1007 - same problem and alerts as stat1004
  • 23:52 mutante: stat1004 - systemctl reset-failed to clear icinga alerts for systemd state caused by jupyterhub singleuser services
  • 23:47 dpifke@deploy1001: Finished deploy [performance/arc-lamp@1f3bce1]: Revert https://gerrit.wikimedia.org/r/c/performance/arc-lamp/+/664600 (duration: 00m 05s)
  • 23:47 dpifke@deploy1001: Started deploy [performance/arc-lamp@1f3bce1]: Revert https://gerrit.wikimedia.org/r/c/performance/arc-lamp/+/664600
  • 23:37 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1286.eqiad.wmnet
  • 23:36 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1286.eqiad.wmnet
  • 23:34 milimetric@deploy1001: Finished deploy [analytics/refinery@3de01b5] (thin): Fix camus (duration: 00m 07s)
  • 23:34 milimetric@deploy1001: Started deploy [analytics/refinery@3de01b5] (thin): Fix camus
  • 23:33 milimetric@deploy1001: Finished deploy [analytics/refinery@3de01b5]: Fix camus (duration: 14m 03s)
  • 23:27 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1014.eqiad.wmnet with reason: REIMAGE
  • 23:25 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1014.eqiad.wmnet with reason: REIMAGE
  • 23:22 oblivian@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 23:22 oblivian@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 23:19 milimetric@deploy1001: Started deploy [analytics/refinery@3de01b5]: Fix camus
  • 23:18 oblivian@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 23:18 oblivian@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 23:09 ppchelko@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 23:09 ppchelko@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 23:06 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1410.eqiad.wmnet
  • 23:06 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1412.eqiad.wmnet
  • 23:02 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1412.eqiad.wmnet
  • 23:00 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1410.eqiad.wmnet
  • 22:52 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1286.eqiad.wmnet with reason: REIMAGE
  • 22:50 legoktm: disabling puppet on mwdebug1001 to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/664903
  • 22:49 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1286.eqiad.wmnet with reason: REIMAGE
  • 22:45 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1412.eqiad.wmnet with reason: REIMAGE
  • 22:43 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1410.eqiad.wmnet with reason: REIMAGE
  • 22:42 krinkle@deploy1001: Synchronized w/fatal-error.php: df694d695 (duration: 00m 56s)
  • 22:42 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1412.eqiad.wmnet with reason: REIMAGE
  • 22:41 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1410.eqiad.wmnet with reason: REIMAGE
  • 22:31 ppchelko@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 22:31 ppchelko@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 22:18 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1279.eqiad.wmnet
  • 22:18 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1312.eqiad.wmnet
  • 22:16 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1314.eqiad.wmnet
  • 21:46 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1314.eqiad.wmnet
  • 21:46 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1279.eqiad.wmnet
  • 21:45 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1312.eqiad.wmnet
  • 21:00 Amir1: end of foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https (T273463 T271985 T273468)
  • 20:59 sbassett: Deployed security patch for T274883
  • 20:59 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1279.eqiad.wmnet with reason: REIMAGE
  • 20:57 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1279.eqiad.wmnet with reason: REIMAGE
  • 20:50 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1312.eqiad.wmnet with reason: REIMAGE
  • 20:48 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1312.eqiad.wmnet with reason: REIMAGE
  • 20:46 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1314.eqiad.wmnet with reason: REIMAGE
  • 20:44 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1314.eqiad.wmnet with reason: REIMAGE
  • 20:39 Amir1: start of foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https (T273463 T271985 T273468)
  • 20:29 mutante: mw1279 (canary) - reimaging to buster
  • 20:29 mutante: mw1279 (canary) - reimaging to stretch
  • 20:27 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1349.eqiad.wmnet
  • 20:25 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1349.eqiad.wmnet
  • 20:19 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1316.eqiad.wmnet
  • 20:16 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1316.eqiad.wmnet
  • 20:15 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1315.eqiad.wmnet
  • 20:08 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1315.eqiad.wmnet
  • 20:06 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1349.eqiad.wmnet with reason: REIMAGE
  • 20:04 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1349.eqiad.wmnet with reason: REIMAGE
  • 19:36 urbanecm@deploy1001: Synchronized wmf-config/config/rowiki.yaml: fc7b071: Enable GrowthExperiments on rowiki (T275130; 3/3) (duration: 00m 55s)
  • 19:35 urbanecm@deploy1001: Synchronized dblists/growthexperiments.dblist: fc7b071: Enable GrowthExperiments on rowiki (T275130; 2/3) (duration: 00m 55s)
  • 19:33 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: fc7b071: Enable GrowthExperiments on rowiki (T275130; 1/3) (duration: 00m 55s)
  • 19:25 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1315.eqiad.wmnet with reason: REIMAGE
  • 19:23 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1315.eqiad.wmnet with reason: REIMAGE
  • 19:22 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1316.eqiad.wmnet with reason: REIMAGE
  • 19:19 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1316.eqiad.wmnet with reason: REIMAGE
  • 19:08 urbanecm@deploy1001: Synchronized dblists/growthexperiments.dblist: 902b685: Enable GrowthExperiments on thwiki (T274646) (duration: 00m 54s)
  • 19:06 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 902b685: Enable GrowthExperiments on thwiki (T274646) (duration: 00m 56s)
  • 17:18 ppchelko@deploy1001: Finished deploy [restbase/deploy@c5c4b2d] (dev-cluster): remove graphoid (duration: 03m 09s)
  • 17:15 ppchelko@deploy1001: Started deploy [restbase/deploy@c5c4b2d] (dev-cluster): remove graphoid
  • 16:51 Urbanecm: Run scap pull on mwmaint1002 to clear any local changes
  • 16:50 urbanecm@deploy1001: Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 22s)
  • 16:47 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Creating mniwiktionary (T273457) (duration: 00m 56s)
  • 16:46 urbanecm@deploy1001: rebuilt and synchronized wikiversions files: Creating mniwiktionary (T273457)
  • 16:45 urbanecm@deploy1001: Synchronized dblists: Creating mniwiktionary (T273457) (duration: 00m 56s)
  • 16:44 hnowlan@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
  • 16:44 hnowlan@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
  • 16:44 urbanecm@deploy1001: Synchronized wmf-config/db-codfw.php: Creating mniwiktionary (T273457) (duration: 00m 56s)
  • 16:42 urbanecm@deploy1001: Synchronized wmf-config/db-eqiad.php: Creating mniwiktionary (T273457) (duration: 00m 55s)
  • 16:37 hnowlan@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
  • 16:36 hnowlan@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
  • 16:26 dpifke@deploy1001: Finished deploy [performance/arc-lamp@1f3bce1]: Deploy ArcLamp fixes for T273565 and T273640 (duration: 00m 05s)
  • 16:26 dpifke@deploy1001: Started deploy [performance/arc-lamp@1f3bce1]: Deploy ArcLamp fixes for T273565 and T273640
  • 16:19 urbanecm@deploy1001: Synchronized langlist: Creating mniwiki (T273456) (duration: 00m 54s)
  • 16:18 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Creating mniwiki (T273456) (duration: 00m 56s)
  • 16:17 urbanecm@deploy1001: Synchronized wmf-config/logos.php: Creating mniwiki (T273456) (duration: 00m 56s)
  • 16:15 urbanecm@deploy1001: Synchronized static/images/project-logos/: Creating mniwiki (T273456) (duration: 00m 55s)
  • 16:14 urbanecm@deploy1001: rebuilt and synchronized wikiversions files: Creating mniwiki (T273456)
  • 16:13 urbanecm@deploy1001: Synchronized dblists: Creating mniwiki (T273456) (duration: 00m 57s)
  • 16:12 urbanecm@deploy1001: Synchronized wmf-config/db-codfw.php: Creating mniwiki (T273456) (duration: 00m 55s)
  • 16:11 urbanecm@deploy1001: Synchronized wmf-config/db-eqiad.php: Creating mniwiki (T273456) (duration: 00m 56s)
  • 16:08 urbanecm@deploy1001: Synchronized langlist: Creating altwiki (T271980) (duration: 00m 55s)
  • 16:03 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Creating altwiki (T271980) (duration: 00m 55s)
  • 16:02 urbanecm@deploy1001: rebuilt and synchronized wikiversions files: Creating altwiki (T271980)
  • 16:00 urbanecm@deploy1001: Synchronized dblists: Creating altwiki (T271980) (duration: 00m 54s)
  • 15:59 urbanecm@deploy1001: Synchronized wmf-config/db-codfw.php: Creating altwiki (T271980) (duration: 00m 59s)
  • 15:57 Urbanecm: Temporarily replace /srv/mediawiki/php-1.36.0-wmf.31/extensions/WikimediaMaintenance/addWiki.php with /home/urbanecm/addWiki.php at mwmaint1002 to unbreak addWiki.php
  • 15:53 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
  • 15:43 urbanecm@deploy1001: Synchronized wmf-config/db-eqiad.php: Creating altwiki (T271980) (duration: 00m 56s)
  • 15:32 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
  • 14:16 herron: roll restarting kafkamon hosts for updates
  • 13:57 filippo@cumin1001: conftool action : set/pooled=false; selector: dnsdisc=swift,name=codfw
  • 13:47 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus4001.ulsfo.wmnet
  • 13:40 urbanecm@deploy1001: Synchronized php-1.36.0-wmf.31/extensions/ContentTranslation/app/: f9e823e: CX3 Build 0.1.0+20210216 (fixes missing bits in T271397) (duration: 00m 55s)
  • 13:40 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus3001.esams.wmnet
  • 13:37 moritzm: installing openldap security updates on corp replicas
  • 13:36 urbanecm@deploy1001: Synchronized php-1.36.0-wmf.31/extensions/FlaggedRevs/extension.json: a4cd98e: Grant sysops review and unreviewed pages right by default (apparently i forgot to rebase the first time, resync; T275293) (duration: 00m 57s)
  • 13:32 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host prometheus4001.ulsfo.wmnet
  • 13:31 godog: reset-failed ifup@ens14 on prometheus3001 - T273026
  • 13:30 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cuminunpriv1001.eqiad.wmnet
  • 13:29 akosiaris: repool sessionstore in eqiad after sessionstore certificate refresh. T274564
  • 13:29 akosiaris@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=sessionstore
  • 13:27 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host prometheus3001.esams.wmnet
  • 13:26 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host cuminunpriv1001.eqiad.wmnet
  • 13:17 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on cloudvirt-wdqs[1001-1003].eqiad.wmnet with reason: Restarting cloudcanary instances
  • 13:16 dcaro@cumin1001: START - Cookbook sre.hosts.downtime for 5:00:00 on cloudvirt-wdqs[1001-1003].eqiad.wmnet with reason: Restarting cloudcanary instances
  • 13:16 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 27 hosts with reason: Restarting cloudcanary instances
  • 13:16 dcaro@cumin1001: START - Cookbook sre.hosts.downtime for 5:00:00 on 27 hosts with reason: Restarting cloudcanary instances
  • 13:11 marostegui@cumin1001: dbctl commit (dc=all): 'db1175 (re)pooling @ 100%: Slowly repool db1175', diff saved to https://phabricator.wikimedia.org/P14439 and previous config saved to /var/cache/conftool/dbconfig/20210222-131153-root.json
  • 12:56 marostegui@cumin1001: dbctl commit (dc=all): 'db1175 (re)pooling @ 75%: Slowly repool db1175', diff saved to https://phabricator.wikimedia.org/P14438 and previous config saved to /var/cache/conftool/dbconfig/20210222-125650-root.json
  • 12:41 marostegui@cumin1001: dbctl commit (dc=all): 'db1175 (re)pooling @ 50%: Slowly repool db1175', diff saved to https://phabricator.wikimedia.org/P14437 and previous config saved to /var/cache/conftool/dbconfig/20210222-124146-root.json
  • 12:32 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=sessionstore
  • 12:28 akosiaris@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=sessionstore
  • 12:26 marostegui@cumin1001: dbctl commit (dc=all): 'db1175 (re)pooling @ 25%: Slowly repool db1175', diff saved to https://phabricator.wikimedia.org/P14436 and previous config saved to /var/cache/conftool/dbconfig/20210222-122643-root.json
  • 12:24 urbanecm@deploy1001: Synchronized wmf-config//throttle.php: d806f3a: Add a throttle rule for edit-a-thon (T275237) (duration: 00m 54s)
  • 12:22 akosiaris: depool sessionstore in eqiad for sessionstore certificate refresh. T274564
  • 12:21 akosiaris: repool sessionstore in codfw after sessionstore certificate refresh. T274564
  • 12:21 akosiaris@cumin1001: conftool action : set/pooled=true; selector: name=codfw,dnsdisc=sessionstore
  • 12:20 urbanecm@deploy1001: Synchronized php-1.36.0-wmf.31/extensions/FlaggedRevs/extension.json: a4cd98e: Grant sysops review and unreviewed pages right by default (T275293) (duration: 00m 55s)
  • 12:18 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 7bd26dc: Add inaturalist-open-data.s3.amazonaws.com to copyupload list (T275318) (duration: 00m 56s)
  • 12:15 urbanecm@deploy1001: Synchronized wmf-config/abusefilter.php: 391900b: ukwikivoyage: Enable block AbuseFilter action (T275271) (duration: 00m 55s)
  • 12:11 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: a1f8ce4: Enable Section Translation on Bengali Wikipedia (T271397) (duration: 00m 56s)
  • 12:11 marostegui@cumin1001: dbctl commit (dc=all): 'db1175 (re)pooling @ 10%: Slowly repool db1175', diff saved to https://phabricator.wikimedia.org/P14435 and previous config saved to /var/cache/conftool/dbconfig/20210222-121139-root.json
  • 12:07 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1175 for schema change', diff saved to https://phabricator.wikimedia.org/P14434 and previous config saved to /var/cache/conftool/dbconfig/20210222-120717-marostegui.json
  • 12:05 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 4775fb6: Adjust CX MT threshold to 90 for Vietnamese Wikipedia (T275121) (duration: 00m 57s)
  • 12:02 moritzm: installing openldap security updates on serpens/seaborgium
  • 11:59 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1036.eqiad.wmnet
  • 11:54 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single for host mc1036.eqiad.wmnet
  • 11:53 effie: upgrading memcached to 1.6 on mc1036
  • 11:50 volans: upgrading python3-wmflib fleet wide to 0.0.7-1+deb10u1
  • 11:27 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 27 hosts with reason: Restarting cloudcanary instances
  • 11:27 dcaro@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on 27 hosts with reason: Restarting cloudcanary instances
  • 11:26 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cloudvirt-wdqs[1001-1003].eqiad.wmnet with reason: Restarting cloudcanary instances
  • 11:26 dcaro@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on cloudvirt-wdqs[1001-1003].eqiad.wmnet with reason: Restarting cloudcanary instances
  • 11:26 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cloudvirt-wdqs[1001-1003].eqiad.wmnet with reason: Restarting cloudcanary instances
  • 11:26 dcaro@cumin1001: START - Cookbook sre.hosts.downtime for 4:00:00 on cloudvirt-wdqs[1001-1003].eqiad.wmnet with reason: Restarting cloudcanary instances
  • 11:22 godog: roll restart prometheus on cloudmetrics*
  • 11:21 godog: roll restart prometheus on prometheus*
  • 11:12 godog: restart prometheus on prometheus2004 to apply changes - T273278
  • 11:10 marostegui@cumin1001: dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: Slowly repool db1166', diff saved to https://phabricator.wikimedia.org/P14433 and previous config saved to /var/cache/conftool/dbconfig/20210222-111032-root.json
  • 10:55 marostegui@cumin1001: dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: Slowly repool db1166', diff saved to https://phabricator.wikimedia.org/P14432 and previous config saved to /var/cache/conftool/dbconfig/20210222-105528-root.json
  • 10:49 _joe_: removing stray old builds from compiler1003
  • 10:40 marostegui@cumin1001: dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: Slowly repool db1166', diff saved to https://phabricator.wikimedia.org/P14431 and previous config saved to /var/cache/conftool/dbconfig/20210222-104025-root.json
  • 10:36 _joe_: manually removed the restbase-http ipvs entry from the load balancers
  • 10:30 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=codfw,dnsdisc=sessionstore
  • 10:29 akosiaris: depool sessionstore in codfw for sessionstore certificate refresh. T274564
  • 10:25 marostegui@cumin1001: dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: Slowly repool db1166', diff saved to https://phabricator.wikimedia.org/P14430 and previous config saved to /var/cache/conftool/dbconfig/20210222-102521-root.json
  • 10:16 _joe_: restarting pybal on lvs1015 to pick up restbase http removal
  • 10:12 _joe_: restarting pybal on lvs1016 to pick up restbase http removal
  • 10:10 marostegui@cumin1001: dbctl commit (dc=all): 'db1166 (re)pooling @ 10%: Slowly repool db1166', diff saved to https://phabricator.wikimedia.org/P14429 and previous config saved to /var/cache/conftool/dbconfig/20210222-101018-root.json
  • 10:06 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1166 for schema change', diff saved to https://phabricator.wikimedia.org/P14428 and previous config saved to /var/cache/conftool/dbconfig/20210222-100653-marostegui.json
  • 09:51 _joe_: restarting low-traffic pybals in codfw to remove the restbase http endpoint
  • 09:35 marostegui: Deploy schema change on s3 codfw master, there will be lag on s3 codfw - T273359
  • 09:30 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1008.eqiad.wmnet
  • 09:20 elukey@cumin1001: START - Cookbook sre.hosts.reboot-single for host stat1008.eqiad.wmnet
  • 09:08 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1005.eqiad.wmnet
  • 09:04 moritzm: installing screen security updates on Buster
  • 09:00 elukey@cumin1001: START - Cookbook sre.hosts.reboot-single for host stat1005.eqiad.wmnet
  • 08:40 godog: swift codfw-prod: more weight to ms-be20[58-61] - T269337
  • 08:39 gehel: depool elastic2045 and ban from clusters - T275345
  • 08:12 urbanecm@deploy1001: Synchronized wmf-config/flaggedrevs.php: cea41a2: fiwiki: Assign stablesettings to reviewers in IS.php rather than FR-specific file (T275017; 2/2) (duration: 00m 55s)
  • 08:11 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: cea41a2: fiwiki: Assign stablesettings to reviewers in IS.php rather than FR-specific file (T275017; 1/2) (duration: 01m 08s)
  • 07:54 marostegui@cumin1001: dbctl commit (dc=all): 'Remove db1090* from dbctl T274333', diff saved to https://phabricator.wikimedia.org/P14426 and previous config saved to /var/cache/conftool/dbconfig/20210222-075437-marostegui.json
  • 07:38 moritzm: installing openldap security updates on LDAP replicas
  • 07:29 hashar: Restarting CI Jenkins to downgrade plugin # T271683
  • 07:14 hashar: Restarting CI Jenkins for plugin upgrade # T271683
  • 07:11 elukey: powercycle elastic2045 - com2 available, no ssh, no root login (hangs indefinitely), no prometheus metrics reported

2021-02-21

  • 16:02 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1162 - crashed', diff saved to https://phabricator.wikimedia.org/P14424 and previous config saved to /var/cache/conftool/dbconfig/20210221-160258-marostegui.json
  • 10:07 ariel@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1008.eqiad.wmnet with reason: REIMAGE
  • 10:05 ariel@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1008.eqiad.wmnet with reason: REIMAGE
  • 09:32 ariel@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1008.eqiad.wmnet with reason: REIMAGE
  • 09:30 ariel@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1008.eqiad.wmnet with reason: REIMAGE
  • 09:29 ariel@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1002.eqiad.wmnet
  • 09:23 ariel@cumin1001: START - Cookbook sre.hosts.reboot-single for host dumpsdata1002.eqiad.wmnet

2021-02-20

  • 00:17 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1317.eqiad.wmnet
  • 00:16 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1317.eqiad.wmnet
  • 00:15 ebernhardson: start batch processing images through MachineVision fetchSuggestions.php for T274220 on mwmaint1002
  • 00:15 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1333.eqiad.wmnet
  • 00:13 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1333.eqiad.wmnet
  • 00:13 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1339.eqiad.wmnet
  • 00:13 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1342.eqiad.wmnet
  • 00:11 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1342.eqiad.wmnet

2021-02-19

  • 23:09 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1339.eqiad.wmnet
  • 23:05 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1317.eqiad.wmnet with reason: REIMAGE
  • 23:03 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1317.eqiad.wmnet with reason: REIMAGE
  • 22:49 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1342.eqiad.wmnet with reason: REIMAGE
  • 22:47 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1342.eqiad.wmnet with reason: REIMAGE
  • 22:44 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1333.eqiad.wmnet with reason: REIMAGE
  • 22:42 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1333.eqiad.wmnet with reason: REIMAGE
  • 22:33 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1340.eqiad.wmnet
  • 22:26 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1339.eqiad.wmnet with reason: REIMAGE
  • 22:24 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1339.eqiad.wmnet with reason: REIMAGE
  • 22:18 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1340.eqiad.wmnet
  • 22:16 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1320.eqiad.wmnet
  • 22:14 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1320.eqiad.wmnet
  • 22:10 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1262.eqiad.wmnet
  • 22:10 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1261.eqiad.wmnet
  • 22:09 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1261.eqiad.wmnet
  • 22:08 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1261.eqiad.wmnet
  • 21:50 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw2336.codfw.wmnet
  • 21:44 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2336.codfw.wmnet
  • 21:34 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1340.eqiad.wmnet with reason: REIMAGE
  • 21:32 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1340.eqiad.wmnet with reason: REIMAGE
  • 21:27 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1320.eqiad.wmnet with reason: REIMAGE
  • 21:24 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1262.eqiad.wmnet with reason: REIMAGE
  • 21:24 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1320.eqiad.wmnet with reason: REIMAGE
  • 21:22 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2336.codfw.wmnet with reason: REIMAGE
  • 21:22 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1262.eqiad.wmnet with reason: REIMAGE
  • 21:20 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2336.codfw.wmnet with reason: REIMAGE
  • 20:59 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1287.eqiad.wmnet
  • 20:57 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1287.eqiad.wmnet
  • 20:51 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw2257.codfw.wmnet
  • 20:50 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw2257.codfw.cwmnet
  • 20:49 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1270.eqiad.wmnet
  • 20:48 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1261.eqiad.wmnet
  • 20:42 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1270.eqiad.wmnet
  • 20:42 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1261.eqiad.wmnet
  • 20:42 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1287.eqiad.wmnet
  • 20:33 mutante: mw1261, mw1270 - scap pull
  • 20:33 cdanis: ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕞🍵 sudo cumin 'mw1261*,mw1270*,mw1287*' 'depool'
  • 20:32 mutante: mw1287 - scap pull
  • 20:17 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2257.codfw.wmnet
  • 20:17 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1270.eqiad.wmnet
  • 20:17 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1261.eqiad.wmnet
  • 20:15 dduvall@deploy1001: Pruned MediaWiki: 1.36.0-wmf.29 (duration: 01m 42s)
  • 20:06 dduvall@deploy1001: Pruned MediaWiki: 1.36.0-wmf.28 (duration: 01m 50s)
  • 20:04 dduvall@deploy1001: Pruned MediaWiki: 1.36.0-wmf.27 (duration: 02m 12s)
  • 20:01 dduvall@deploy1001: Pruned MediaWiki: 1.36.0-wmf.26 (duration: 02m 12s)
  • 19:57 dduvall@deploy1001: Pruned MediaWiki: 1.36.0-wmf.25 (duration: 04m 09s)
  • 19:48 marxarelli: 1.36.0-wmf.31 re-rolled to all wikis (T271345)
  • 19:24 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1287.eqiad.wmnet with reason: REIMAGE
  • 19:22 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1129.eqiad.wmnet with reason: REIMAGE
  • 19:22 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1270.eqiad.wmnet with reason: REIMAGE
  • 19:20 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1287.eqiad.wmnet with reason: REIMAGE
  • 19:19 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1261.eqiad.wmnet with reason: REIMAGE
  • 19:19 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1270.eqiad.wmnet with reason: REIMAGE
  • 19:17 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1261.eqiad.wmnet with reason: REIMAGE
  • 19:11 dduvall@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.31
  • 19:01 dduvall@deploy1001: Synchronized php-1.36.0-wmf.31/extensions/Echo/includes/model/Event.php: backport: Echo::create: Convert UserIdentityValue to plain User (T275161) (duration: 01m 20s)
  • 18:52 marxarelli: fetching backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Echo/+/665177 for sync prior to all wikis (re)deploy (T275161)
  • 18:38 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1367.eqiad.wmnet
  • 18:36 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1341.eqiad.wmnet
  • 18:35 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1367.eqiad.wmnet
  • 18:32 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw2272.codfw.wmnet
  • 18:30 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1341.eqiad.wmnet
  • 18:30 mutante: mw1367 - powercycled - stuck in reboot
  • 18:29 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2272.codfw.wmnet
  • 18:07 Urbanecm: Password reset for User:Kolyma (T274737)
  • 17:36 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1341.eqiad.wmnet with reason: REIMAGE
  • 17:34 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1341.eqiad.wmnet with reason: REIMAGE
  • 17:33 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2272.codfw.wmnet with reason: REIMAGE
  • 17:31 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw2272.codfw.wmnet with reason: REIMAGE
  • 17:29 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1367.eqiad.wmnet with reason: REIMAGE
  • 17:27 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1367.eqiad.wmnet with reason: REIMAGE
  • 16:57 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1141.eqiad.wmnet with reason: REIMAGE
  • 16:55 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1140.eqiad.wmnet with reason: REIMAGE
  • 16:55 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1141.eqiad.wmnet with reason: REIMAGE
  • 16:53 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1134.eqiad.wmnet with reason: REIMAGE
  • 16:53 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1140.eqiad.wmnet with reason: REIMAGE
  • 16:51 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1134.eqiad.wmnet with reason: REIMAGE
  • 14:29 mbsantos@deploy1001: Finished deploy [tilerator/deploy@937deb5]: (no justification provided) (duration: 00m 15s)
  • 14:28 mbsantos@deploy1001: Started deploy [tilerator/deploy@937deb5]: (no justification provided)
  • 14:00 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'production' .
  • 14:00 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'staging' .
  • 13:43 akosiaris@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'echostore' for release 'staging' .
  • 13:43 akosiaris@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'echostore' for release 'production' .
  • 13:43 akosiaris@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'echostore' for release 'production' .
  • 13:43 akosiaris@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'echostore' for release 'staging' .
  • 13:43 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'production' .
  • 13:43 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'staging' .
  • 13:43 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'staging' .
  • 13:43 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'production' .
  • 13:42 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'staging' .
  • 13:42 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'production' .
  • 13:41 godog: reset-failed ifup@ens13 on prometheus5001 - T273026
  • 13:39 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus5001.eqsin.wmnet
  • 13:31 gehel@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1010.eqiad.wmnet with reason: REIMAGE
  • 13:29 gehel@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1010.eqiad.wmnet with reason: REIMAGE
  • 13:22 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host prometheus5001.eqsin.wmnet
  • 09:27 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0) for Hadoop backup cluster: Stop the Hadoop cluster before maintenance. - elukey@cumin1001
  • 09:16 elukey@cumin1001: START - Cookbook sre.hadoop.stop-cluster for Hadoop backup cluster: Stop the Hadoop cluster before maintenance. - elukey@cumin1001
  • 08:40 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-airflow1001.eqiad.wmnet
  • 08:34 elukey@cumin1001: START - Cookbook sre.hosts.reboot-single for host an-airflow1001.eqiad.wmnet
  • 08:06 godog: swift codfw-prod: more weight to ms-be20[58-61] - T269337
  • 08:04 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1108.eqiad.wmnet
  • 07:47 elukey@cumin1001: START - Cookbook sre.hosts.reboot-single for host an-worker1108.eqiad.wmnet
  • 02:26 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1133.eqiad.wmnet with reason: REIMAGE
  • 02:24 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1133.eqiad.wmnet with reason: REIMAGE
  • 01:22 mutante: mwmaint2001 back on buster and back in scap dsh groups (if anything pops up you can revert 665175)
  • 01:19 mutante: deleting my huge build from puppet-compiler that failed because it made the compiler instance run out of disk to run on *
  • 01:03 urbanecm@deploy1001: Synchronized php-1.36.0-wmf.30/includes/ProtectionForm.php: d305308: field descriptors in HTMLForm must have keys (T275018; T274980) (duration: 01m 08s)
  • 01:02 urbanecm@deploy1001: Synchronized php-1.36.0-wmf.31/includes/ProtectionForm.php: 2487c25: field descriptors in HTMLForm must have keys (T275018; T274980) (duration: 01m 10s)
  • 00:54 mutante: mwmaint2001 - back from reimage - scap pull
  • 00:26 urbanecm@deploy1001: Synchronized static/images/project-logos/wikimedia-cloud-services.svg: 686acba: Restore logos on Vector (classic version) and use cloud icon for labs (T274210) (duration: 01m 07s)
  • 00:14 dpifke@deploy1001: Synchronized wmf-config/PhpAutoPrepend.php: Deploying excimer-wall profiler pipeline T253160 (duration: 01m 03s)
  • 00:12 dpifke@deploy1001: Synchronized wmf-config/profiler.php: Deploying excimer-wall profiler pipeline T253160 (duration: 01m 02s)
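
Many entries in this log come in START/END pairs written by cookbook runs. As a rough illustration only, the following Python sketch pairs those entries within one day's list and reports how long each run took. The regular expressions and the function name `cookbook_durations` are assumptions made for this example, inferred from the entry layout seen above; they are not part of any Wikimedia tooling. The sketch also assumes the list is newest-first and does not cross midnight.

```python
import re
from datetime import datetime

# Assumed entry layout: optional bullet, "HH:MM", actor, ": ", message.
ENTRY = re.compile(r"^\s*(?:[•*]\s*)?(?P<time>\d{2}:\d{2}) (?P<actor>[^:]+): (?P<msg>.*)$")

# Assumed cookbook layout: "START - Cookbook NAME ..." or
# "END (RESULT) - Cookbook NAME (exit_code=N) ...".
COOKBOOK = re.compile(
    r"^(?P<phase>START|END \((?P<result>\w+)\)) - Cookbook (?P<name>\S+)"
    r"(?: \(exit_code=\d+\))?(?P<rest>.*)$"
)


def cookbook_durations(lines):
    """Pair START/END cookbook entries from one day's newest-first list.

    Returns a list of (cookbook name, details, result, duration) tuples.
    """
    open_runs = {}
    finished = []
    for line in reversed(lines):  # walk from oldest to newest
        entry = ENTRY.match(line)
        if not entry:
            continue
        run = COOKBOOK.match(entry["msg"])
        if not run:
            continue
        when = datetime.strptime(entry["time"], "%H:%M")
        details = run["rest"].strip()
        key = (entry["actor"], run["name"], details)
        if run["phase"] == "START":
            open_runs[key] = when
        elif key in open_runs:
            started = open_runs.pop(key)
            finished.append((run["name"], details, run["result"], when - started))
    return finished
```

Passing the two mw1261 downtime lines from 19:17 and 19:19 above would yield a single PASS result with a two-minute duration. END entries whose START falls on the previous day are skipped rather than guessed.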

2021-02-18

  • 23:48 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mwmaint2001.codfw.wmnet with reason: REIMAGE
  • 23:46 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mwmaint2001.codfw.wmnet with reason: REIMAGE
  • 23:26 dancy@deploy1001: Synchronized wmf-config/: Syncing https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/634552 (duration: 01m 07s)
  • 23:22 dancy@deploy1001: Synchronized wmf-config/CommonSettings.php: Syncing https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/634551 (duration: 01m 08s)
  • 23:15 dancy@deploy1001: Synchronized src/ServiceConfig.php: (no justification provided) (duration: 03m 21s)
  • 23:11 mutante: mwmaint2001 - will be rebooted for OS upgrade - T267607
  • 23:10 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mwmaint2001.codfw.wmnet with reason: OS upgrade
  • 23:10 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on mwmaint2001.codfw.wmnet with reason: OS upgrade
  • 23:04 mutante: mwmaint1002 - rsyncing data from mwmaint2001
  • 22:30 mutante: mwmaint2001 - tar-gzipping a lot of old user home data I keep finding, partially museum worthy from several maintenance hosts ago, in places like /root/home-mwmaint1001/username/home-terbium/iron/ :p
  • 21:29 marxarelli: 1.36.0-wmf.31 rolled back due to T275161 and new logspam (T271345)
  • 21:26 dduvall@deploy1001: rebuilt and synchronized wikiversions files: Revert "all wikis to 1.36.0-wmf.31"
  • 20:09 dduvall@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.31
  • 19:27 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: f33f9f7: Make DiscussionTools replytool available for everyone on gomwiktionary (T258554) (duration: 01m 05s)
  • 19:25 mutante: mwmaint2001 - deleting 'home-terbium' from all home directories (yes, it's in Bacula if you really used that, hope you didn't, it's been years since terbium)
  • 19:25 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: da7b812: Enable DiscussionTools beta feature for newtopictool on arwiki, cswiki, huwiki (T273145) (duration: 01m 12s)
  • 19:20 urbanecm@deploy1001: Synchronized php-1.36.0-wmf.31/extensions/DiscussionTools/: 1cc29df: 6b88aff: DiscussionTools backports (T272666; T274949) (duration: 01m 08s)
  • 19:19 urbanecm@deploy1001: sync-file aborted: 1cc29df DiscussionTools backports (T272666; T274949) (duration: 00m 00s)
  • 19:17 urbanecm@deploy1001: Synchronized php-1.36.0-wmf.30/extensions/DiscussionTools/: 9c6cdf5: 97acef6: DiscussionTools backports (T272666; T274949) (duration: 01m 26s)
  • 19:04 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mwmaint2001.codfw.wmnet with reason: OS upgrade
  • 19:04 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mwmaint2001.codfw.wmnet with reason: OS upgrade
  • 16:51 volans: uploaded python3-wmflib_0.0.7 to apt.wikimedia.org buster-wikimedia
  • 16:23 shdubsh: restart ircecho on kraz -- deploying new metrics endpoint T216611
  • 16:05 moritzm: installing libmaxminddb updates from buster 10.8 point release
  • 15:33 _joe_: rebuilding base images for stretch,buster
  • 15:30 moritzm: installing PHP 7.3 security updates on buster
  • 15:06 godog: swift codfw-prod decrease HDD weight for ms-be20[16-27] - T272837
  • 14:35 moritzm: installing libzstd security updates on Buster
  • 13:59 moritzm: installing intel-microcode security updates on buster
  • 13:49 jynus: restart db1150 T271913
  • 12:20 jynus: restart db1140 T271913
  • 12:01 urbanecm@deploy1001: Synchronized php-1.36.0-wmf.31/includes/HookContainer/DeprecatedHooks.php: 28aa871: Silent deprecate ProtectionForm::buildForm (T274889) (duration: 01m 14s)
  • 11:49 jynus: restart db1102 T271913
  • 11:13 marostegui@deploy1001: Synchronized wmf-config/db-eqiad.php: Repool pc1009 (duration: 01m 09s)
  • 11:04 marostegui: Upgrade and reboot pc1009
  • 11:03 marostegui@deploy1001: Synchronized wmf-config/db-eqiad.php: Depool pc1009 (duration: 01m 08s)
  • 10:47 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 33ab68f: Add https://seer.ufrgs.br to the wgCopyUploadsDomains allowlist of Wikimedia Commons (T270962) (duration: 01m 09s)
  • 10:45 urbanecm@deploy1001: Synchronized static/images: d1db300: Revert "Temporarily add cswiki-black-ribbon.png as a static resource" (duration: 01m 09s)
  • 10:42 jynus: restarting dbprov* hosts T271913
  • 10:34 dcaro@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudmetrics1001.eqiad.wmnet
  • 10:30 oblivian@deploy1001: Synchronized wmf-config/ProductionServices.php: Switch restbase calls to envoy (duration: 01m 15s)
  • 10:27 dcaro@cumin1001: START - Cookbook sre.hosts.reboot-single for host cloudmetrics1001.eqiad.wmnet
  • 09:48 jynus: restarting backup* hosts T271913
  • 09:46 elukey: upgrade presto to 0.246-wmf on an-coord1001, an-presto*, stat100x
  • 08:48 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1090:3312, db1090:3317 T274333', diff saved to https://phabricator.wikimedia.org/P14408 and previous config saved to /var/cache/conftool/dbconfig/20210218-084758-marostegui.json
  • 08:31 marostegui: Upgrade kernel on db1154 and db1155 (sanitarium running buster hosts)
  • 08:23 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-worker1003.eqiad.wmnet with reason: REIMAGE
  • 08:21 elukey@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-worker1003.eqiad.wmnet with reason: REIMAGE
  • 08:01 godog: upgrade grafana* to 7.4.2 - T263747
  • 07:59 marostegui: Reboot es2029, es2030, es2031, es2032, es2033, es2034 for kernel upgrade
  • 07:32 marostegui: Reboot es2026, es2027, es2028 for kernel upgrade
  • 06:56 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host archiva1002.wikimedia.org
  • 06:54 elukey@cumin1001: START - Cookbook sre.hosts.reboot-single for host archiva1002.wikimedia.org
  • 06:53 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host zookeeper-test1002.eqiad.wmnet
  • 06:49 elukey@cumin1001: START - Cookbook sre.hosts.reboot-single for host zookeeper-test1002.eqiad.wmnet
  • 06:25 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1075.eqiad.wmnet
  • 06:10 marostegui: Reboot dbproxy1014 for kernel upgrade
  • 01:56 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: fe64695: hewikisource: Allow sysops to grant/revoke reviewer (T274796) (duration: 01m 07s)
  • 01:38 robh@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 01:32 robh@cumin1001: START - Cookbook sre.dns.netbox
  • 00:58 robh@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 00:49 robh@cumin1001: START - Cookbook sre.dns.netbox
  • 00:34 urbanecm@deploy1001: Synchronized php-1.36.0-wmf.30/extensions/CentralNotice/resources/ext.centralNotice.display/state.js: dd64e44: Remove optedOutCampaigns property from impression data (T275054) (duration: 01m 08s)
  • 00:32 urbanecm@deploy1001: Synchronized php-1.36.0-wmf.31/extensions/CentralNotice/resources/ext.centralNotice.display/state.js: ff444c2: Remove optedOutCampaigns property from impression data (T275054) (duration: 01m 09s)
  • 00:31 urbanecm@deploy1001: Synchronized wmf-config/CommonSettings.php: 08b32c4: Remove wgCentralNoticeImpressionEventSampleRate; will default to 0 (T275054) (duration: 02m 17s)
  • 00:28 urbanecm@deploy1001: sync-file aborted: 08b32c4: Remove wgCentralNoticeImpressionEventSampleRate; will default to 0 (T275054) (duration: 00m 00s)
  • 00:03 ebernhardson@deploy1001: Finished deploy [wikimedia/discovery/analytics@8ca6884]: cirrus_namespace_map: Use retries when fetching (duration: 01m 21s)
  • 00:02 ebernhardson@deploy1001: Started deploy [wikimedia/discovery/analytics@8ca6884]: cirrus_namespace_map: Use retries when fetching

2021-02-17

  • 20:31 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1034.eqiad.wmnet with reason: REIMAGE
  • 20:29 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1034.eqiad.wmnet with reason: REIMAGE
  • 20:23 marxarelli: 1.36.0-wmf.31 rolled to group1. no new errors for wmf.31 (T271345)
  • 20:17 dduvall@deploy1001: Synchronized php: group1 wikis to 1.36.0-wmf.31 (duration: 01m 15s)
  • 20:15 dduvall@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.31
  • 19:45 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 2e521f7: hewikisource: Allow reviewers to rollback (T274796) (duration: 01m 10s)
  • 19:42 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 88e6ebc: hewikisource: Add bureaucrats the ability to grant/revoke (trans)import (T274796) (duration: 01m 09s)
  • 19:36 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 6c5c5f0: arbcom_ruwiki: Add arbcom user group (T274844) (duration: 01m 12s)
  • 19:34 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1033.eqiad.wmnet with reason: REIMAGE
  • 19:32 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1033.eqiad.wmnet with reason: REIMAGE
  • 19:27 Urbanecm: urbanecm@mwmaint1002:~$ mwscript namespaceDupes.php --wiki=tlwikibooks --fix # T274976 # P14404
  • 19:26 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: c37fa01: tlwikibooks: Add Wikijunior namespace (T274976) (duration: 01m 09s)
  • 19:24 Urbanecm: [urbanecm@mwmaint1002 ~]$ mwscript namespaceDupes.php --wiki=tlwikibooks --fix # T274977 # P14403
  • 19:21 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: a7eb726: tlwikibooks: Add WB as an alias to NS_PROJECT (T274977) (duration: 01m 09s)
  • 19:14 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 352dd72: Enable GlobalWatchlist extension on metawiki (T260862) (duration: 01m 07s)
  • 19:08 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 6ac78bd: Remove uses of removed VisualEditor config variables (T273177; 2/2) (duration: 01m 07s)
  • 19:06 urbanecm@deploy1001: Synchronized wmf-config/CommonSettings.php: 6ac78bd: Remove uses of removed VisualEditor config variables (T273177; 1/2) (duration: 01m 14s)
  • 18:40 ppchelko@deploy1001: Finished deploy [restbase/deploy@c5c4b2d]: Remove graphoid T242855 (duration: 19m 54s)
  • 18:27 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1350.eqiad.wmnet
  • 18:26 effie: enable puppet on mw*
  • 18:24 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1344.eqiad.wmnet
  • 18:24 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1343.eqiad.wmnet
  • 18:21 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1275.eqiad.wmnet
  • 18:20 ppchelko@deploy1001: Started deploy [restbase/deploy@c5c4b2d]: Remove graphoid T242855
  • 18:19 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1350.eqiad.wmnet
  • 18:14 mutante: mw1350 - powercycled via mgmt
  • 18:11 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1343.eqiad.wmnet
  • 18:08 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1344.eqiad.wmnet
  • 18:08 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1275.eqiad.wmnet
  • 18:07 effie: disable puppet on mw* in eqiad
  • 17:36 godog: roll-restart logstash7 in codfw/eqiad to apply ulogd filters - T234565
  • 17:28 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1035.eqiad.wmnet with reason: REIMAGE
  • 17:26 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1035.eqiad.wmnet with reason: REIMAGE
  • 17:16 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1343.eqiad.wmnet with reason: REIMAGE
  • 17:14 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1344.eqiad.wmnet with reason: REIMAGE
  • 17:13 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1343.eqiad.wmnet with reason: REIMAGE
  • 17:12 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1275.eqiad.wmnet with reason: REIMAGE
  • 17:12 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1344.eqiad.wmnet with reason: REIMAGE
  • 17:10 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1275.eqiad.wmnet with reason: REIMAGE
  • 17:07 jiji@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 17:05 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1350.eqiad.wmnet with reason: REIMAGE
  • 17:03 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1350.eqiad.wmnet with reason: REIMAGE
  • 16:58 jiji@cumin1001: START - Cookbook sre.dns.netbox
  • 16:46 godog: roll-restart logstash to apply ulogd filter - T234565
  • 16:42 akosiaris@deploy1001: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 16:41 akosiaris@deploy1001: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 16:33 akosiaris@deploy1001: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 16:32 moritzm: installing intel-microcode security updates on buster
  • 16:23 akosiaris@deploy1001: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 16:08 akosiaris@deploy1001: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 16:06 oblivian@deploy1001: Finished deploy [docker-pkg/deploy@b5f4a3e]: (no justification provided) (duration: 00m 30s)
  • 16:05 oblivian@deploy1001: Started deploy [docker-pkg/deploy@b5f4a3e]: (no justification provided)
  • 15:36 cdanis: T275028 rolling restart done; check for fetch failures once caches re-fill
  • 15:34 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host htmldumper1001.eqiad.wmnet
  • 15:31 moritzm: uploaded jasper 1.900.1-debian1-2.4+deb8u6+wmf3 to apt.wikimedia.org
  • 15:28 root@cumin1001: START - Cookbook sre.hosts.reboot-single for host htmldumper1001.eqiad.wmnet
  • 15:26 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1003.eqiad.wmnet
  • 15:20 root@cumin1001: START - Cookbook sre.hosts.reboot-single for host dumpsdata1003.eqiad.wmnet
  • 15:17 root@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1001.eqiad.wmnet
  • 15:13 root@cumin1001: START - Cookbook sre.hosts.reboot-single for host dumpsdata1001.eqiad.wmnet
  • 14:26 cdanis: starting rolling restart of cp-upload@eqsin varnish-fe T275028
  • 13:55 marostegui@cumin1001: dbctl commit (dc=all): 'db1172 (re)pooling @ 100%: Slowly repool db1172', diff saved to https://phabricator.wikimedia.org/P14396 and previous config saved to /var/cache/conftool/dbconfig/20210217-135533-root.json
  • 13:40 marostegui@cumin1001: dbctl commit (dc=all): 'db1172 (re)pooling @ 80%: Slowly repool db1172', diff saved to https://phabricator.wikimedia.org/P14395 and previous config saved to /var/cache/conftool/dbconfig/20210217-134030-root.json
  • 13:30 akosiaris@deploy1001: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 13:30 akosiaris@deploy1001: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 13:28 moritzm: installing libzstd security updates on Buster
  • 13:25 marostegui@cumin1001: dbctl commit (dc=all): 'db1172 (re)pooling @ 60%: Slowly repool db1172', diff saved to https://phabricator.wikimedia.org/P14393 and previous config saved to /var/cache/conftool/dbconfig/20210217-132526-root.json
  • 13:19 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Config: Enable Wikibase Repo ID generator rate limiting on Wikidata (T272032) (duration: 01m 11s)
  • 13:10 marostegui@cumin1001: dbctl commit (dc=all): 'db1172 (re)pooling @ 50%: Slowly repool db1172', diff saved to https://phabricator.wikimedia.org/P14392 and previous config saved to /var/cache/conftool/dbconfig/20210217-131022-root.json
  • 13:06 akosiaris@deploy1001: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 13:05 akosiaris@deploy1001: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 12:55 akosiaris@deploy1001: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 12:55 akosiaris@deploy1001: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 12:55 marostegui@cumin1001: dbctl commit (dc=all): 'db1172 (re)pooling @ 40%: Slowly repool db1172', diff saved to https://phabricator.wikimedia.org/P14391 and previous config saved to /var/cache/conftool/dbconfig/20210217-125519-root.json
  • 12:50 akosiaris@deploy1001: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 12:49 akosiaris@deploy1001: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 12:45 akosiaris@deploy1001: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 12:45 akosiaris@deploy1001: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 12:42 akosiaris@deploy1001: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 12:42 akosiaris@deploy1001: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 12:40 akosiaris@deploy1001: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 12:40 marostegui@cumin1001: dbctl commit (dc=all): 'db1172 (re)pooling @ 20%: Slowly repool db1172', diff saved to https://phabricator.wikimedia.org/P14390 and previous config saved to /var/cache/conftool/dbconfig/20210217-124015-root.json
  • 12:40 akosiaris@deploy1001: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 12:14 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 6eeee95: vector: Enable search treatment AB test on test wikis (T259798) (duration: 01m 08s)
  • 12:10 urbanecm@deploy1001: Synchronized dblists/desktop-improvements.dblist: 7872251: Revert "Revert "vector: Enable WVUI search on test wikis"" (T259798) (duration: 01m 09s)
  • 12:07 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 7872251: Revert "Revert "vector: Enable WVUI search on test wikis"" (T259798) (duration: 01m 25s)
  • 11:58 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2001.wikimedia.org
  • 11:55 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host netbox-dev2001.wikimedia.org
  • 11:24 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly pool db1172 in s8 - T258361', diff saved to https://phabricator.wikimedia.org/P14389 and previous config saved to /var/cache/conftool/dbconfig/20210217-112422-marostegui.json
  • 11:08 akosiaris@deploy1001: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 11:08 akosiaris@deploy1001: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 11:04 akosiaris@deploy1001: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 11:04 akosiaris@deploy1001: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 11:04 akosiaris@deploy1001: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 11:03 akosiaris@deploy1001: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 10:52 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on cloudnet1004.eqiad.wmnet with reason: hardware failure
  • 10:52 aborrero@cumin1001: START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on cloudnet1004.eqiad.wmnet with reason: hardware failure
  • 10:13 _joe_: depooling mw1331 to perform some tests for T266855
  • 10:08 aborrero@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 10:01 aborrero@cumin1001: START - Cookbook sre.dns.netbox
  • 09:32 elukey: reboot dbstore100[3-5] for kernel upgrades
  • 08:44 marostegui: upgrade es2020 es2021 es2022's kernel
  • 08:41 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly pool db1172 in s8 - T258361', diff saved to https://phabricator.wikimedia.org/P14388 and previous config saved to /var/cache/conftool/dbconfig/20210217-084120-marostegui.json
  • 08:08 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1004.eqiad.wmnet
  • 08:04 elukey@cumin1001: START - Cookbook sre.hosts.reboot-single for host stat1004.eqiad.wmnet
  • 07:41 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly pool db1172 in s8 - T258361', diff saved to https://phabricator.wikimedia.org/P14387 and previous config saved to /var/cache/conftool/dbconfig/20210217-074107-marostegui.json
  • 07:40 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1007.eqiad.wmnet
  • 07:33 elukey@cumin1001: START - Cookbook sre.hosts.reboot-single for host stat1007.eqiad.wmnet
  • 07:30 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1006.eqiad.wmnet
  • 07:23 elukey@cumin1001: START - Cookbook sre.hosts.reboot-single for host stat1006.eqiad.wmnet
  • 07:21 marostegui@cumin1001: dbctl commit (dc=all): 'Pool db1172 in s8 for the first time - T258361', diff saved to https://phabricator.wikimedia.org/P14386 and previous config saved to /var/cache/conftool/dbconfig/20210217-072131-marostegui.json
  • 07:21 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1004.eqiad.wmnet
  • 07:16 marostegui: Add x1 to orchestrator
  • 07:04 elukey@cumin1001: START - Cookbook sre.hosts.reboot-single for host stat1004.eqiad.wmnet
  • 07:01 marostegui: Restart db1103 (x1) primary master DONE - T273758
  • 07:00 marostegui: Restart db1103 (x1) primary master - T273758
  • 06:39 marostegui@cumin1001: dbctl commit (dc=all): 'Add db1172 to dbctl, but not pooled yet T258361', diff saved to https://phabricator.wikimedia.org/P14385 and previous config saved to /var/cache/conftool/dbconfig/20210217-063915-marostegui.json
  • 01:41 mutante: mwdebug1001 - back on buster and pooled
  • 01:41 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mwdebug1001.eqiad.wmnet
  • 01:39 mutante: mwdebug1001 - rebooting
  • 01:04 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1345.eqiad.wmnet
  • 01:04 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1351.eqiad.wmnet
  • 01:00 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mwdebug1001.eqiad.wmnet
  • 01:00 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mwdebug1001.eqiad.wmnet
  • 00:58 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1345.eqiad.wmnet
  • 00:49 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1351.eqiad.wmnet
  • 00:33 mutante: mw1351 - powercycled
  • 00:27 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mwdebug1001.eqiad.wmnet
  • 00:17 legoktm@deploy1001: Synchronized php-1.36.0-wmf.30/extensions/timeline/: Add $wgTimelineFontDirectory to be passed as GDFONTPATH (T274822) (duration: 01m 06s)
  • 00:15 legoktm@deploy1001: Synchronized php-1.36.0-wmf.31/extensions/timeline/: Add $wgTimelineFontDirectory to be passed as GDFONTPATH (T274822) (duration: 01m 02s)
  • 00:13 legoktm@deploy1001: Synchronized wmf-config/timeline.php: Set $wgTimelineFontDirectory (T274822) (duration: 01m 05s)
  • 00:04 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1345.eqiad.wmnet with reason: REIMAGE
  • 00:02 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1345.eqiad.wmnet with reason: REIMAGE
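
The conftool entries above toggle hosts between pooled=no and pooled=yes, so the last action per host is what matters when reading the list. A minimal Python sketch of that replay follows. The pattern `CONFTOOL` and the function `final_pooled_state` are hypothetical names chosen for this example, based only on the entry text in this log; selectors are treated as opaque strings and are not expanded, and entries sharing a minute are ordered by their position in the list.

```python
import re

# Assumed layout of a conftool entry, e.g.
# "conftool action : set/pooled=yes; selector: name=mw1350.eqiad.wmnet"
CONFTOOL = re.compile(
    r"conftool action : set/pooled=(?P<state>\w+); selector: name=(?P<host>\S+)"
)


def final_pooled_state(lines):
    """Replay a newest-first list and return the last pooled state per selector."""
    state = {}
    for line in reversed(lines):  # oldest entry first, so later entries win
        match = CONFTOOL.search(line)
        if match:
            state[match["host"]] = match["state"]
    return state
```

Fed the 18:19 and 18:27 entries for mw1350 from this day, the function reports that host as `yes`, matching the order in which the actions were logged.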

2021-02-16

  • 23:54 mutante: puppetmaster1001 - puppet cert clean mwdebug1001, sign new request, initial puppet run, now on buster (T274023)
  • 23:54 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1351.eqiad.wmnet with reason: REIMAGE
  • 23:52 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1351.eqiad.wmnet with reason: REIMAGE
  • 23:44 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mwdebug1001.eqiad.wmnet
  • 23:44 mutante: reimaging mwdebug1001 with buster
  • 23:43 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mwdebug1001.eqiad.wmnet
  • 23:37 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mwdebug1001.eqiad.wmnet with reason: OS upgrade
  • 23:37 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mwdebug1001.eqiad.wmnet with reason: OS upgrade
  • 23:09 twentyafterfour@deploy1001: Synchronized php-1.36.0-wmf.30/includes/HookContainer/DeprecatedHooks.php: silence deprecation refs T274889 (duration: 01m 14s)
  • 22:52 jgleeson: updated payments-wiki config to 3d1b4564a2
  • 22:39 gehel: restarting wdqs-updater on wdqs2001
  • 22:35 bstorm@cumin1001: END (FAIL) - Cookbook wmcs.wikireplicas.add_wiki (exit_code=99)
  • 22:23 bstorm@cumin1001: START - Cookbook wmcs.wikireplicas.add_wiki
  • 22:22 akosiaris: re-enable puppet and squid on install1003. wdqs seems to be mildly related to the outage, restart it
  • 22:09 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.reboot-workers (exit_code=0) for Hadoop analytics cluster
  • 21:45 akosiaris: stop squid as a stopgap on install1003 and disable puppet so that it is not restarted while we figure out what wdqs updater is doing to cause issues for mediawiki
  • 20:47 marxarelli: 1.36.0-wmf.31 rolled to group0. no new errors for wmf.31 (T271345)
  • 20:33 dduvall@deploy1001: rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.31
  • 20:20 mutante: mwdebug1002 has been recreated on buster and has been repooled after scap pull - you can find a .tar.gz in your home with the contents of your home before reimaging, fingerprint at T274023#6835116
  • 20:18 legoktm@cumin1001: conftool action : set/pooled=yes; selector: name=mw1297.eqiad.wmnet
  • 20:18 legoktm@cumin1001: conftool action : set/pooled=yes; selector: name=mw1290.eqiad.wmnet
  • 20:18 legoktm@cumin1001: conftool action : set/pooled=yes; selector: name=mw1289.eqiad.wmnet
  • 20:18 legoktm@cumin1001: conftool action : set/pooled=yes; selector: name=mw1288.eqiad.wmnet
  • 20:17 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mwdebug1002.eqiad.wmnet
  • 20:15 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mwdebug1002.eqiad.wmnet
  • 20:04 legoktm@cumin1001: conftool action : set/pooled=no; selector: name=mw1297.eqiad.wmnet
  • 20:04 legoktm@cumin1001: conftool action : set/pooled=no; selector: name=mw1290.eqiad.wmnet
  • 20:04 legoktm@cumin1001: conftool action : set/pooled=no; selector: name=mw1289.eqiad.wmnet
  • 20:03 legoktm@cumin1001: conftool action : set/pooled=no; selector: name=mw1288.eqiad.wmnet
  • 19:58 ryankemper: [WDQS] De-pooled `wdqs100[4,7]` to catch up on lag, and pooled `wdqs100[5,6]`
  • 19:09 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mwdebug1002.eqiad.wmnet with reason: OS upgrade
  • 19:09 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 3:00:00 on mwdebug1002.eqiad.wmnet with reason: OS upgrade
  • 19:06 legoktm@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1297.eqiad.wmnet with reason: REIMAGE
  • 19:04 legoktm@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1290.eqiad.wmnet with reason: REIMAGE
  • 19:03 legoktm@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1297.eqiad.wmnet with reason: REIMAGE
  • 19:02 legoktm@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1289.eqiad.wmnet with reason: REIMAGE
  • 19:01 legoktm@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1290.eqiad.wmnet with reason: REIMAGE
  • 19:00 legoktm@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1288.eqiad.wmnet with reason: REIMAGE
  • 18:59 mutante: puppetmaster1002 - puppet cert clean mwdebug1002.eqiad.wmnet, sign new request, initial puppet run (T274023)
  • 18:59 legoktm@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1289.eqiad.wmnet with reason: REIMAGE
  • 18:58 legoktm@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1288.eqiad.wmnet with reason: REIMAGE
  • 18:52 mutante: re-creating mwdebug1002
  • 18:49 dduvall@deploy1001: Finished scap: testwikis wikis to 1.36.0-wmf.31 (duration: 49m 37s)
  • 18:41 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1346.eqiad.wmnet
  • 18:38 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1352.eqiad.wmnet
  • 18:37 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1347.eqiad.wmnet
  • 18:35 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1346.eqiad.wmnet
  • 18:33 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1352.eqiad.wmnet
  • 18:32 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1347.eqiad.wmnet
  • 18:28 mutante: mw1352 - powercycle via mgmt
  • 18:04 dduvall@deploy1001: Started scap: testwikis wikis to 1.36.0-wmf.31
  • 17:41 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1346.eqiad.wmnet with reason: REIMAGE
  • 17:39 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1347.eqiad.wmnet with reason: REIMAGE
  • 17:39 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1346.eqiad.wmnet with reason: REIMAGE
  • 17:37 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1347.eqiad.wmnet with reason: REIMAGE
  • 17:36 marxarelli: 1.36.0-wmf.31 was branched at c49ac6d (T271345)
  • 17:33 akosiaris@deploy1001: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 17:32 akosiaris@deploy1001: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 17:31 akosiaris@deploy1001: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 17:30 akosiaris@deploy1001: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 17:30 akosiaris@deploy1001: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 17:30 akosiaris@deploy1001: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 17:29 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1352.eqiad.wmnet with reason: REIMAGE
  • 17:27 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1352.eqiad.wmnet with reason: REIMAGE
  • 17:24 godog: swift codfw-prod decrease HDD weight for ms-be20[16-27] - T272837
  • 17:23 jforrester@deploy1001: Finished deploy [integration/docroot@8ab9125]: Update docroot with Special:MyLanguage links. (duration: 00m 11s)
  • 17:23 jforrester@deploy1001: Started deploy [integration/docroot@8ab9125]: Update docroot with Special:MyLanguage links.
  • 17:21 akosiaris@deploy1001: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 17:21 akosiaris@deploy1001: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 17:18 akosiaris@deploy1001: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 16:25 klausman@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-etcd[2001-2003].codfw.wmnet with reason: klausman: Pushing new etcd changes from T273071
  • 16:25 klausman@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on ml-etcd[2001-2003].codfw.wmnet with reason: klausman: Pushing new etcd changes from T273071
  • 16:17 moritzm: installing edk2 security updates
  • 16:09 moritzm: installing python-bottle security updates on buster
  • 15:58 papaul: power down ms-be2031 for firmware upgrade
  • 15:44 klausman@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-etcd[1001-1003].eqiad.wmnet with reason: klausman: Pushing new etcd changes from T273071
  • 15:44 klausman@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on ml-etcd[1001-1003].eqiad.wmnet with reason: klausman: Pushing new etcd changes from T273071
  • 15:27 cdanis: re-enabling Puppet on cp-upload@eqsin to deploy Iab4d211 T274888
  • 15:26 akosiaris@deploy1001: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 15:25 akosiaris@deploy1001: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 15:25 akosiaris@deploy1001: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 15:25 akosiaris@deploy1001: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 15:17 akosiaris@deploy1001: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 15:17 akosiaris@deploy1001: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 15:16 akosiaris@deploy1001: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
  • 15:16 akosiaris@deploy1001: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
  • 15:15 mholloway-shell@deploy1001: Synchronized wmf-config/InitialiseSettings.php: WikimediaEvents: Sample mediawiki.client.session_tick at 1:100 (T274172) (duration: 01m 00s)
  • 15:14 akosiaris@deploy1001: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 15:14 akosiaris@deploy1001: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 15:13 akosiaris@deploy1001: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 15:13 akosiaris@deploy1001: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 15:12 cdanis: previous message was re: T274888
  • 15:11 cdanis: ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕙☕ sudo cumin 'A:cp-upload and A:eqsin' 'disable-puppet "cdanis deploying Iab4d211 T263496"'
  • 14:38 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.30 refs T271344 bfc73b6
  • 14:24 twentyafterfour: MediaWiki train: prepare to promote all wikis to 1.36.0-wmf.30 refs T271344
  • 14:07 akosiaris: rolling restart of cp500[1-6]
  • 13:40 marostegui: Deploy schema change on s2 codfw - T273359
  • 13:13 urbanecm@deploy1001: Synchronized static/images/cswiki-black-ribbon.png: 5d5b5c4: Temporarily add cswiki-black-ribbon.png as a static resource (duration: 01m 07s)
  • 13:02 jayme@deploy1001: helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
  • 12:53 aborrero@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 12:46 aborrero@cumin1001: START - Cookbook sre.dns.netbox
  • 12:41 jayme@deploy1001: helmfile [staging-codfw] START helmfile.d/admin 'sync'.
  • 12:39 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/Wikibase.php: Config: Enable Wikibase Repo ID generator rate limiting on Test Wikidata (T272032) 2/2 (duration: 01m 06s)
  • 12:38 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Config: Enable Wikibase Repo ID generator rate limiting on Test Wikidata (T272032) 1/2 (duration: 01m 12s)
  • 12:29 jayme@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'eventrouter' .
  • 12:08 jayme@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'kube-system' for release 'eventrouter' .
  • 12:06 marostegui: Deploy schema change on s5 codfw - T273359
  • 11:54 urbanecm@deploy1001: Synchronized php-1.36.0-wmf.30/extensions/DiscussionTools/includes/CommentFormatter.php: 5f4f516: CommentFormatter: Fix problems with editsection and quotes (T274709) (duration: 01m 12s)
  • 11:54 kharlan@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 11:54 kharlan@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 11:52 kharlan@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 11:52 kharlan@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 11:47 kharlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 11:47 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1023.eqiad.wmnet
  • 11:45 marostegui: Failover m2-master back from dbproxy1015 to dbproxy1013
  • 11:42 effie: upgrade mc2037 to memcached 1.6 - T270315
  • 11:41 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single for host mc1023.eqiad.wmnet
  • 11:40 marostegui: Reboot dbproxy1013 for kernel upgrade
  • 11:29 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'eventrouter' .
  • 11:28 elukey@cumin1001: START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster
  • 11:27 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.reboot-workers (exit_code=0) for Hadoop test cluster
  • 10:53 marostegui: Reboot es2023, es2024 and es2025 for kernel upgrade
  • 10:46 elukey@cumin1001: START - Cookbook sre.hadoop.reboot-workers for Hadoop test cluster
  • 10:37 marostegui@cumin1001: dbctl commit (dc=all): 'db1092 (re)pooling @ 100%: Slowly repool db1092', diff saved to https://phabricator.wikimedia.org/P14373 and previous config saved to /var/cache/conftool/dbconfig/20210216-103730-root.json
  • 10:22 marostegui@cumin1001: dbctl commit (dc=all): 'db1092 (re)pooling @ 80%: Slowly repool db1092', diff saved to https://phabricator.wikimedia.org/P14372 and previous config saved to /var/cache/conftool/dbconfig/20210216-102227-root.json
  • 10:19 akosiaris@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'apertium' for release 'production' .
  • 10:18 akosiaris@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'apertium' for release 'staging' .
  • 10:18 marostegui: Reboot pc1010 for kernel upgrade
  • 10:17 marostegui@cumin1001: dbctl commit (dc=all): 'Remove db1075 from dbctl T274235', diff saved to https://phabricator.wikimedia.org/P14371 and previous config saved to /var/cache/conftool/dbconfig/20210216-101710-marostegui.json
  • 10:07 marostegui@cumin1001: dbctl commit (dc=all): 'db1092 (re)pooling @ 60%: Slowly repool db1092', diff saved to https://phabricator.wikimedia.org/P14370 and previous config saved to /var/cache/conftool/dbconfig/20210216-100723-root.json
  • 09:52 marostegui@cumin1001: dbctl commit (dc=all): 'db1092 (re)pooling @ 40%: Slowly repool db1092', diff saved to https://phabricator.wikimedia.org/P14369 and previous config saved to /var/cache/conftool/dbconfig/20210216-095220-root.json
  • 09:40 akosiaris: deploy new certs for apertium
  • 09:40 akosiaris@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'apertium' for release 'production' .
  • 09:40 akosiaris@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'apertium' for release 'staging' .
  • 09:37 marostegui@cumin1001: dbctl commit (dc=all): 'db1092 (re)pooling @ 20%: Slowly repool db1092', diff saved to https://phabricator.wikimedia.org/P14368 and previous config saved to /var/cache/conftool/dbconfig/20210216-093716-root.json
  • 09:28 marostegui: Failover m2-master from dbproxy1013 to dbproxy1015
  • 09:22 marostegui@cumin1001: dbctl commit (dc=all): 'db1092 (re)pooling @ 10%: Slowly repool db1092', diff saved to https://phabricator.wikimedia.org/P14367 and previous config saved to /var/cache/conftool/dbconfig/20210216-092213-root.json
  • 08:37 godog: swift eqiad-prod: decrease weight for SSDs on ms-be[1019-1026] - T272836
  • 08:30 marostegui: Deploy schema change on s6 codfw - T273359
  • 07:40 dcausse: restarting blazegraph on wdqs1013
  • 07:27 marostegui: Reboot dbproxy1021 for kernel upgrade
  • 07:21 marostegui: Reboot dbproxy1012, 1015, 1016, 1017 for kernel upgrade
  • 07:18 marostegui: Reboot dbproxy2* for kernel upgrade
  • 06:49 marostegui: Reboot pc2010 pc2009 pc2008 pc2007 for kernel upgrade
  • 06:46 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1092 to clone db1172 T258361', diff saved to https://phabricator.wikimedia.org/P14365 and previous config saved to /var/cache/conftool/dbconfig/20210216-064602-marostegui.json
  • 06:43 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 06:37 marostegui@cumin1001: START - Cookbook sre.hosts.decommission
  • 06:32 marostegui@cumin1001: dbctl commit (dc=all): 'Remove db1093 from dbctl T273955', diff saved to https://phabricator.wikimedia.org/P14364 and previous config saved to /var/cache/conftool/dbconfig/20210216-063250-marostegui.json
  • 04:17 jforrester@deploy1001: Finished deploy [integration/docroot@864afdb]: Update docroot with changes from this weekend. (duration: 00m 17s)
  • 04:17 jforrester@deploy1001: Started deploy [integration/docroot@864afdb]: Update docroot with changes from this weekend.

2021-02-15

  • 21:33 eileen: civicrm revision changed from dfbb8f41bc to c535ac603a, config revision is ba9b2380b1
  • 16:46 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestage1002.eqiad.wmnet
  • 16:39 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host kubestage1002.eqiad.wmnet
  • 16:33 volans: restarted netbox on netbox1001
  • 16:32 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestage1001.eqiad.wmnet
  • 16:27 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host kubestage1001.eqiad.wmnet
  • 16:26 jayme: rolled back linkrecommendation helm releases to the most recent revision running chart version linkrecommendation-0.0.4 on clusters codfw and eqiad (cc: kostajh)
  • 16:22 jmm@puppetmaster1001: conftool action : set/pooled=inactive; selector: name=mwdebug1002.eqiad.wmnet
  • 16:18 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestage2002.codfw.wmnet
  • 16:14 hoo: Updated the Wikidata property suggester with data from the 2021-02-01 JSON dump (with pre-applied T132839 workarounds)
  • 16:12 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host kubestage2002.codfw.wmnet
  • 16:12 jayme@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestage2001.codfw.wmnet
  • 16:09 aborrero@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet2003-dev.codfw.wmnet
  • 16:07 jayme@cumin1001: START - Cookbook sre.hosts.reboot-single for host kubestage2001.codfw.wmnet
  • 16:05 aborrero@cumin2001: START - Cookbook sre.hosts.reboot-single for host cloudnet2003-dev.codfw.wmnet
  • 15:58 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema2004.codfw.wmnet
  • 15:53 elukey@cumin1001: START - Cookbook sre.hosts.reboot-single for host schema2004.codfw.wmnet
  • 15:51 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema2003.codfw.wmnet
  • 15:48 elukey@cumin1001: START - Cookbook sre.hosts.reboot-single for host schema2003.codfw.wmnet
  • 15:48 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 15:46 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema1004.eqiad.wmnet
  • 15:38 elukey@cumin1001: START - Cookbook sre.hosts.reboot-single for host schema1004.eqiad.wmnet
  • 15:36 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast3004.wikimedia.org with reason: REIMAGE
  • 15:36 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 15:36 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 15:36 jayme@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 15:34 jmm@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on bast3004.wikimedia.org with reason: REIMAGE
  • 15:33 moritzm: installing linux-4.19 update for Stretch on servers which have it installed (no reboots, just updating the kernels)
  • 15:30 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema1003.eqiad.wmnet
  • 15:17 elukey@cumin1001: START - Cookbook sre.hosts.reboot-single for host schema1003.eqiad.wmnet
  • 15:09 moritzm: reimaging bast3004 to buster
  • 15:04 godog: upgrade grafana to 7.4.1 on grafana1002 - T263747
  • 14:17 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 00905c4: Add *.president.az to the wgCopyUploadsDomains allowlist of Wikimedia Commons (T274789) (duration: 01m 09s)
  • 14:08 godog: swift eqiad-prod: add weight back to sdg on ms-be1054 - T273582
  • 13:57 moritzm: installing libonig security update for stretch
  • 13:53 gehel@cumin2001: START - Cookbook sre.wdqs.data-reload
  • 13:38 moritzm: installing subversion security updates
  • 13:33 marostegui: Stop MySQL on db1093 - T273955
  • 13:19 kharlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
  • 13:06 godog: swift eqiad-prod: decrease weight for SSDs on ms-be[1019-1026] - T272836
  • 13:05 Lucas_WMDE: notice: stashbot had issues between 8:19 and 12:50, see https://wm-bot.wmflabs.org/browser/index.php?start=02%2F15%2F2021&end=02%2F15%2F2021&display=%23wikimedia-operations for missed !log messages
  • 13:03 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast4002.wikimedia.org with reason: REIMAGE
  • 13:02 kharlan@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 13:02 kharlan@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 13:01 jmm@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on bast4002.wikimedia.org with reason: REIMAGE
  • 12:58 kharlan@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
  • 12:58 kharlan@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
  • 08:04 marostegui@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 4%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14343 and previous config saved to /var/cache/conftool/dbconfig/20210215-080435-root.json
  • 07:49 marostegui@cumin1001: dbctl commit (dc=all): 'db1162 (re)pooling @ 3%: Slowly pool db1162', diff saved to https://phabricator.wikimedia.org/P14342 and previous config saved to /var/cache/conftool/dbconfig/20210215-074932-root.json
  • 07:42 elukey@cumin1001: START - Cookbook sre.druid.reboot-workers for Druid analytics cluster: Reboot Druid nodes - elukey@cumin1001
  • 07:33 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1010.eqiad.wmnet
  • 07:28 elukey@cumin1001: START - Cookbook sre.hosts.reboot-single for host an-tool1010.eqiad.wmnet
  • 07:26 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1009.eqiad.wmnet
  • 07:24 elukey@cumin1001: START - Cookbook sre.hosts.reboot-single for host an-tool1009.eqiad.wmnet
  • 07:22 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1008.eqiad.wmnet
  • 07:20 elukey@cumin1001: START - Cookbook sre.hosts.reboot-single for host an-tool1008.eqiad.wmnet
  • 07:16 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1007.eqiad.wmnet
  • 07:14 elukey@cumin1001: START - Cookbook sre.hosts.reboot-single for host an-tool1007.eqiad.wmnet
  • 07:02 marostegui@cumin1001: dbctl commit (dc=all): 'Pool db1162 with minimal weight T258361', diff saved to https://phabricator.wikimedia.org/P14341 and previous config saved to /var/cache/conftool/dbconfig/20210215-070206-marostegui.json
  • 06:46 marostegui@cumin1001: dbctl commit (dc=all): 'Pool db1162 with minimal weight T258361', diff saved to https://phabricator.wikimedia.org/P14340 and previous config saved to /var/cache/conftool/dbconfig/20210215-064628-marostegui.json
  • 06:40 marostegui@cumin1001: dbctl commit (dc=all): 'Add db1162 to dbctl - depooled T258361', diff saved to https://phabricator.wikimedia.org/P14339 and previous config saved to /var/cache/conftool/dbconfig/20210215-064001-marostegui.json

2021-02-14

  • 13:13 akosiaris: sudo cumin -b 1 -s 120 'cp500[2,3,5,6].eqsin.wmnet' 'systemctl restart varnish-frontend.service'
  • 13:10 _joe_: restarted varnish-fe on cp5004
  • 13:09 akosiaris: restart varnish-fe on cp5001
  • 09:27 joal@deploy1001: Finished deploy [analytics/refinery@dd5f947] (thin): Hotfix analytics deployment - THIN [analytics/refinery@dd5f947] (duration: 00m 06s)
  • 09:27 joal@deploy1001: Started deploy [analytics/refinery@dd5f947] (thin): Hotfix analytics deployment - THIN [analytics/refinery@dd5f947]
  • 09:27 joal@deploy1001: Finished deploy [analytics/refinery@dd5f947]: Hotfix analytics deployment [analytics/refinery@dd5f947] (duration: 12m 52s)
  • 09:14 joal@deploy1001: Started deploy [analytics/refinery@dd5f947]: Hotfix analytics deployment [analytics/refinery@dd5f947]

2021-02-13

  • 03:23 ryankemper: Depooled `wdqs1006` to catch up on lag
  • 03:23 ryankemper: Restarted blazegraph on `wdqs1006`
  • 01:30 crusnov@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host mwdebug1002.eqiad.wmnet
  • 01:00 crusnov@cumin1001: START - Cookbook sre.ganeti.makevm for new host mwdebug1002.eqiad.wmnet
  • 00:55 dzahn@cumin1001: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host mwdebug1002.eqiad.wmnet
  • 00:49 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm for new host mwdebug1002.eqiad.wmnet
  • 00:38 dzahn@cumin1001: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host mwdebug1002.eqiad.wmnet
  • 00:31 legoktm@cumin1001: conftool action : set/pooled=yes; selector: name=mw1281.eqiad.wmnet
  • 00:31 legoktm@cumin1001: conftool action : set/pooled=yes; selector: name=mw1282.eqiad.wmnet
  • 00:31 legoktm@cumin1001: conftool action : set/pooled=yes; selector: name=mw1283.eqiad.wmnet
  • 00:31 legoktm@cumin1001: conftool action : set/pooled=yes; selector: name=mw1284.eqiad.wmnet
  • 00:30 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm for new host mwdebug1002.eqiad.wmnet
  • 00:30 dzahn@cumin1001: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host mwdebug1002.eqiad.wmnet
  • 00:26 mutante: ganeti - attempting to recreate VM mwdebug1002, which was previously deleted manually, with the cookbook (T274689 T274023)
  • 00:25 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm for new host mwdebug1002.eqiad.wmnet
  • 00:08 mutante: ganeti1011 - manually deleting VM mwdebug1002 - T274689 T274023

2021-02-12

  • 23:59 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1348.eqiad.wmnet
  • 23:58 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1356.eqiad.wmnet
  • 23:43 legoktm@cumin1001: conftool action : set/pooled=no; selector: name=mw1284.eqiad.wmnet
  • 23:42 legoktm@cumin1001: conftool action : set/pooled=no; selector: name=mw1283.eqiad.wmnet
  • 23:42 legoktm@cumin1001: conftool action : set/pooled=no; selector: name=mw1282.eqiad.wmnet
  • 23:42 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1348.eqiad.wmnet
  • 23:41 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1356.eqiad.wmnet
  • 23:41 legoktm@cumin1001: conftool action : set/pooled=inactive; selector: name=mw1221.eqiad.wmnet
  • 23:39 legoktm@cumin1001: conftool action : set/pooled=no; selector: name=mw1221.eqiad.wmnet
  • 23:38 legoktm@cumin1001: conftool action : set/pooled=no; selector: name=mw1281.eqiad.wmnet
  • 23:26 dduvall@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
  • 23:24 dduvall@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
  • 23:14 dduvall@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
  • 23:02 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
  • 22:52 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 22:51 legoktm@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1284.eqiad.wmnet with reason: REIMAGE
  • 22:49 legoktm@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1283.eqiad.wmnet with reason: REIMAGE
  • 22:48 legoktm@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1284.eqiad.wmnet with reason: REIMAGE
  • 22:47 legoktm@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1283.eqiad.wmnet with reason: REIMAGE
  • 22:47 legoktm@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1282.eqiad.wmnet with reason: REIMAGE
  • 22:45 legoktm@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1282.eqiad.wmnet with reason: REIMAGE
  • 22:44 legoktm@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1281.eqiad.wmnet with reason: REIMAGE
  • 22:42 legoktm@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1281.eqiad.wmnet with reason: REIMAGE
  • 22:32 krinkle@deploy1001: Synchronized wmf-config/PoolCounterSettings.php: Idc385de0 cleanup (duration: 05m 14s)
  • 22:15 krinkle@deploy1001: Synchronized wmf-config/etcd.php: b3447343a cleanup (duration: 05m 20s)
  • 22:00 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1348.eqiad.wmnet with reason: REIMAGE
  • 21:57 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1348.eqiad.wmnet with reason: REIMAGE
  • 21:26 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1357.eqiad.wmnet
  • 20:52 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1357.eqiad.wmnet with reason: REIMAGE
  • 20:50 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1356.eqiad.wmnet with reason: REIMAGE
  • 20:48 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1357.eqiad.wmnet with reason: REIMAGE
  • 20:48 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1356.eqiad.wmnet with reason: REIMAGE
  • 20:36 mutante: mwdebug1003 now on buster - mwdebug1002 rebooting and reimaging to buster
  • 20:36 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mwdebug1002.eqiad.wmnet with reason: OS upgrade
  • 20:35 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 3:00:00 on mwdebug1002.eqiad.wmnet with reason: OS upgrade
  • 20:32 mutante: mw1353, mw1358 - scap pull, repooled
  • 20:30 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1353.eqiad.wmnet
  • 20:30 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1358.eqiad.wmnet
  • 20:17 mutante: mwdebug2001 - restarted memcached
  • 20:09 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1358.eqiad.wmnet
  • 20:09 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1353.eqiad.wmnet
  • 19:56 mutante: mwdebug2002 - restart memcached
  • 19:53 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw1358.eqiad.wmnet with reason: OS upgrade
  • 19:53 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 3:00:00 on mw1358.eqiad.wmnet with reason: OS upgrade
  • 19:53 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw1353.eqiad.wmnet with reason: OS upgrade
  • 19:53 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 3:00:00 on mw1353.eqiad.wmnet with reason: OS upgrade
  • 19:46 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mwdebug1003.eqiad.wmnet with reason: OS upgrade
  • 19:46 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 3:00:00 on mwdebug1003.eqiad.wmnet with reason: OS upgrade
  • 19:43 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: roll back commonswiki to 1.36.0-wmf.27 due to T274589
  • 19:42 mutante: mwdebug2001 now on buster - mwdebug1003 rebooting and reimaging to buster
  • 19:38 milimetric@deploy1001: Finished deploy [analytics/refinery@e0c09a2] (thin): Fix for mediarequest per file cassandra job - 2 (duration: 00m 06s)
  • 19:38 milimetric@deploy1001: Started deploy [analytics/refinery@e0c09a2] (thin): Fix for mediarequest per file cassandra job - 2
  • 19:38 milimetric@deploy1001: Finished deploy [analytics/refinery@e0c09a2]: Fix for mediarequest per file cassandra job - 2 (duration: 11m 01s)
  • 19:34 twentyafterfour: Train status: Rolling back commonswiki to wmf.27 due to T274589 (refs T271344)
  • 19:29 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1358.eqiad.wmnet with reason: REIMAGE
  • 19:28 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1353.eqiad.wmnet with reason: REIMAGE
  • 19:28 robh@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on aqs1015.eqiad.wmnet with reason: REIMAGE
  • 19:27 milimetric@deploy1001: Started deploy [analytics/refinery@e0c09a2]: Fix for mediarequest per file cassandra job - 2
  • 19:24 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1358.eqiad.wmnet with reason: REIMAGE
  • 19:23 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1353.eqiad.wmnet with reason: REIMAGE
  • 19:23 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1015.eqiad.wmnet with reason: REIMAGE
  • 19:20 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1013.eqiad.wmnet with reason: REIMAGE
  • 19:18 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1013.eqiad.wmnet with reason: REIMAGE
  • 19:18 milimetric@deploy1001: Finished deploy [analytics/refinery@366962f]: Fix for mediarequest per file cassandra job (duration: 11m 58s)
  • 19:17 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1012.eqiad.wmnet with reason: REIMAGE
  • 19:15 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1011.eqiad.wmnet with reason: REIMAGE
  • 19:15 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1012.eqiad.wmnet with reason: REIMAGE
  • 19:13 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1010.eqiad.wmnet with reason: REIMAGE
  • 19:13 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1011.eqiad.wmnet with reason: REIMAGE
  • 19:11 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1010.eqiad.wmnet with reason: REIMAGE
  • 19:06 milimetric@deploy1001: Started deploy [analytics/refinery@366962f]: Fix for mediarequest per file cassandra job
  • 19:02 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mwdebug2001.codfw.wmnet with reason: OS upgrade
  • 19:02 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 3:00:00 on mwdebug2001.codfw.wmnet with reason: OS upgrade
  • 19:02 mutante: rebooting and reimaging mwdebug2001 to buster T274023
  • 18:35 mutante: mwdebug2002 now a buster VM; you can find a .tar.gz in your home dir with the contents of your previous home
  • 18:30 ebernhardson@deploy1001: Finished deploy [wikimedia/discovery/analytics@98264b8]: airflow: review and correct usage of catchup=False (duration: 03m 10s)
  • 18:27 ebernhardson@deploy1001: Started deploy [wikimedia/discovery/analytics@98264b8]: airflow: review and correct usage of catchup=False
  • 17:33 elukey@cumin1001: END (PASS) - Cookbook sre.presto.reboot-workers (exit_code=0) for Presto analytics cluster: Reboot Presto nodes - elukey@cumin1001
  • 17:23 bblack: cp*: re-enabling puppet after successful agent run on one host as a test!
  • 17:13 bblack: cp*: disable puppet ahead of https://gerrit.wikimedia.org/r/c/operations/puppet/+/663845
  • 17:08 elukey@cumin1001: START - Cookbook sre.presto.reboot-workers for Presto analytics cluster: Reboot Presto nodes - elukey@cumin1001
  • 17:01 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1005.eqiad.wmnet
  • 16:48 elukey@cumin1001: START - Cookbook sre.hosts.reboot-single for host an-tool1005.eqiad.wmnet
  • 16:45 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-ui1001.eqiad.wmnet
  • 16:43 elukey@cumin1001: START - Cookbook sre.hosts.reboot-single for host an-test-ui1001.eqiad.wmnet
  • 16:43 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-client1001.eqiad.wmnet
  • 16:38 elukey@cumin1001: START - Cookbook sre.hosts.reboot-single for host an-test-client1001.eqiad.wmnet
  • 16:30 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-druid1001.eqiad.wmnet
  • 16:26 elukey@cumin1001: START - Cookbook sre.hosts.reboot-single for host an-test-druid1001.eqiad.wmnet
  • 16:12 mforns@deploy1001: Finished deploy [analytics/refinery@9cd1297] (hadoop-test): Fix for data quality alarms after BigTop migration TEST [analytics/refinery@9cd129764edbac04b192c922ec0a975bc47455a5] (duration: 04m 05s)
  • 16:11 hnowlan: joining maps2007 to cassandra cluster
  • 16:08 mforns@deploy1001: Started deploy [analytics/refinery@9cd1297] (hadoop-test): Fix for data quality alarms after BigTop migration TEST [analytics/refinery@9cd129764edbac04b192c922ec0a975bc47455a5]
  • 16:08 mforns@deploy1001: Finished deploy [analytics/refinery@9cd1297] (thin): Fix for data quality alarms after BigTop migration THIN [analytics/refinery@9cd129764edbac04b192c922ec0a975bc47455a5] (duration: 00m 06s)
  • 16:07 mforns@deploy1001: Started deploy [analytics/refinery@9cd1297] (thin): Fix for data quality alarms after BigTop migration THIN [analytics/refinery@9cd129764edbac04b192c922ec0a975bc47455a5]
  • 16:07 mforns@deploy1001: Finished deploy [analytics/refinery@9cd1297]: Fix for data quality alarms after BigTop migration [analytics/refinery@9cd129764edbac04b192c922ec0a975bc47455a5] (duration: 38m 56s)
  • 15:28 mforns@deploy1001: Started deploy [analytics/refinery@9cd1297]: Fix for data quality alarms after BigTop migration [analytics/refinery@9cd129764edbac04b192c922ec0a975bc47455a5]
  • 15:22 herron: rolling reboot of alert[12]001 hosts for updates
  • 15:16 elukey: roll restart druid broker on druid-public to pick up new settings
  • 14:39 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1022.eqiad.wmnet
  • 14:32 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single for host mc1022.eqiad.wmnet
  • 14:04 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1090.eqiad.wmnet
  • 14:04 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1089.eqiad.wmnet
  • 14:03 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3065.esams.wmnet
  • 14:03 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3064.esams.wmnet
  • 13:52 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp3065.esams.wmnet
  • 13:52 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp3064.esams.wmnet
  • 13:52 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp1090.eqiad.wmnet
  • 13:51 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp1089.eqiad.wmnet
  • 13:10 hnowlan@puppetmaster1001: conftool action : set/pooled=yes:weight=10; selector: name=maps1005.eqiad.wmnet
  • 12:12 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on maps2007.codfw.wmnet with reason: Resyncing database
  • 12:11 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on maps2007.codfw.wmnet with reason: Resyncing database
  • 11:44 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2042.codfw.wmnet
  • 11:44 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2041.codfw.wmnet
  • 11:32 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp2042.codfw.wmnet
  • 11:32 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp2041.codfw.wmnet
  • 11:27 moritzm: installing emacs updates from buster point release
  • 11:25 moritzm: installing device-tree-compiler updates from buster point release
  • 11:22 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1088.eqiad.wmnet
  • 11:22 moritzm: installing node-ini security updates
  • 11:22 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1087.eqiad.wmnet
  • 11:22 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3062.esams.wmnet
  • 11:21 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3063.esams.wmnet
  • 11:15 aborrero@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw2002-dev.codfw.wmnet with reason: REIMAGE
  • 11:14 moritzm: installing golang-1.11 security updates
  • 11:13 aborrero@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2002-dev.codfw.wmnet with reason: REIMAGE
  • 11:10 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp3063.esams.wmnet
  • 11:10 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp3062.esams.wmnet
  • 11:10 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp1088.eqiad.wmnet
  • 11:10 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp1087.eqiad.wmnet
  • 11:10 jynus@cumin1001: dbctl commit (dc=all): 'Increase db1163 traffic to 100%', diff saved to https://phabricator.wikimedia.org/P14337 and previous config saved to /var/cache/conftool/dbconfig/20210212-111010-jynus.json
  • 11:06 moritzm: installing xcftools security updates
  • 10:58 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2039.codfw.wmnet
  • 10:57 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2040.codfw.wmnet
  • 10:50 legoktm: repooled registry1002 after revert
  • 10:42 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp2040.codfw.wmnet
  • 10:41 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp2039.codfw.wmnet
  • 10:39 jynus@cumin1001: dbctl commit (dc=all): 'Increase db1163 traffic to 75%', diff saved to https://phabricator.wikimedia.org/P14336 and previous config saved to /var/cache/conftool/dbconfig/20210212-103921-jynus.json
  • 10:24 moritzm: installing wireshark security updates for stretch
  • 10:22 legoktm: depooled registry1002 while fixing/debugging nginx config
  • 10:22 Urbanecm: mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user=Victorgrigas . # T274608
  • 10:18 jynus@cumin1001: dbctl commit (dc=all): 'Increase db1163 traffic to 50%', diff saved to https://phabricator.wikimedia.org/P14335 and previous config saved to /var/cache/conftool/dbconfig/20210212-101814-jynus.json
  • 10:12 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thumbor2004.codfw.wmnet
  • 10:06 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4032.ulsfo.wmnet
  • 10:05 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4026.ulsfo.wmnet
  • 10:04 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1086.eqiad.wmnet
  • 10:04 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3061.esams.wmnet
  • 10:04 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3060.esams.wmnet
  • 10:04 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1085.eqiad.wmnet
  • 10:04 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2038.codfw.wmnet
  • 10:04 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5012.eqsin.wmnet
  • 10:04 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2037.codfw.wmnet
  • 10:02 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5006.eqsin.wmnet
  • 10:01 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single for host thumbor2004.codfw.wmnet
  • 09:57 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thumbor2003.codfw.wmnet
  • 09:54 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp5006.eqsin.wmnet
  • 09:54 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp5012.eqsin.wmnet
  • 09:54 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp4026.ulsfo.wmnet
  • 09:54 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp4032.ulsfo.wmnet
  • 09:53 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp3061.esams.wmnet
  • 09:53 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp3060.esams.wmnet
  • 09:53 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp2038.codfw.wmnet
  • 09:53 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp2037.codfw.wmnet
  • 09:53 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp1086.eqiad.wmnet
  • 09:52 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp1085.eqiad.wmnet
  • 09:46 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single for host thumbor2003.codfw.wmnet
  • 09:45 jynus@cumin1001: dbctl commit (dc=all): 'Increase db1163 traffic to 30%', diff saved to https://phabricator.wikimedia.org/P14334 and previous config saved to /var/cache/conftool/dbconfig/20210212-094520-jynus.json
  • 09:32 jynus@cumin1001: dbctl commit (dc=all): 'Increase db1163 traffic to 20%', diff saved to https://phabricator.wikimedia.org/P14333 and previous config saved to /var/cache/conftool/dbconfig/20210212-093211-jynus.json
  • 09:31 moritzm: installing node-y18n security updates
  • 08:31 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast2002.wikimedia.org with reason: REIMAGE
  • 08:29 jmm@cumin2001: START - Cookbook sre.hosts.downtime for 2:00:00 on bast2002.wikimedia.org with reason: REIMAGE
  • 08:25 jynus@cumin1001: dbctl commit (dc=all): 'Increase db1163 traffic to 10%', diff saved to https://phabricator.wikimedia.org/P14331 and previous config saved to /var/cache/conftool/dbconfig/20210212-082526-jynus.json
  • 08:15 moritzm: reimaging bast2002 to buster
  • 07:54 elukey: roll restart of druid brokers on druid-public - locked after scheduled datasource deletion
  • 03:36 krinkle@deploy1001: Finished deploy [integration/docroot@3c943ba]: I89e1ec881 (duration: 00m 08s)
  • 03:36 krinkle@deploy1001: Started deploy [integration/docroot@3c943ba]: I89e1ec881
  • 01:21 legoktm@cumin1001: conftool action : set/pooled=yes; selector: name=mw1329.eqiad.wmnet
  • 01:21 legoktm@cumin1001: conftool action : set/pooled=yes; selector: name=mw1330.eqiad.wmnet
  • 01:21 legoktm@cumin1001: conftool action : set/pooled=yes; selector: name=mw1331.eqiad.wmnet
  • 01:21 legoktm@cumin1001: conftool action : set/pooled=yes; selector: name=mw1332.eqiad.wmnet
  • 01:08 legoktm@cumin1001: conftool action : set/pooled=no; selector: name=mw1332.eqiad.wmnet
  • 01:08 legoktm@cumin1001: conftool action : set/pooled=no; selector: name=mw1331.eqiad.wmnet
  • 01:08 legoktm@cumin1001: conftool action : set/pooled=no; selector: name=mw1330.eqiad.wmnet
  • 01:07 legoktm@cumin1001: conftool action : set/pooled=no; selector: name=mw1329.eqiad.wmnet
  • 01:06 Urbanecm: Evening B&C done
  • 01:04 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 389f7f1: Enable DiscussionTools Reply Tool A/B test (T273554) (duration: 01m 08s)
  • 01:02 urbanecm@deploy1001: sync-file aborted: 389f7f1: Enable DiscussionTools Reply Tool A/B test (duration: 00m 48s)
  • 01:01 urbanecm@deploy1001: Synchronized php-1.36.0-wmf.27/extensions/VisualEditor/: c86cd00: de4a562: VE backports (T273096) (duration: 01m 15s)
  • 00:56 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 5d92ed1: Add import sources for zh_yuewiki (T274597) (duration: 01m 13s)
  • 00:34 foks: removing 2 files for legal compliance
  • 00:32 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: a022f2b: Oversample DiscussionTools EditAttemptStep logging (T273946) (duration: 01m 08s)
  • 00:30 Urbanecm: mwscript namespaceDupes.php itwikiquote --fix --add-prefix=BROKEN # T273362
  • 00:29 Urbanecm: mwscript namespaceDupes.php itwikiquote --fix # T273362
  • 00:24 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: f051c6c: Adding WQ as namespace alias for itwikiquote (T273362) (duration: 01m 10s)
  • 00:20 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 53229b0: Enabling extension SandboxLink on ltwiki (T273957) (duration: 01m 07s)
  • 00:11 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on mwdebug2002.codfw.wmnet with reason: OS upgrade
  • 00:11 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on mwdebug2002.codfw.wmnet with reason: OS upgrade
  • 00:07 ejegg: updated fundraising civicrm from b81cb5e702 to dfbb8f41bc

2021-02-11

  • 23:50 Urbanecm: Deploy security patch for T274514
  • 23:47 mutante: reimaged mwdebug2002 with buster - since this is a VM: manually cleaned puppet cert on puppetmaster1001, signed new cert for same hostname, initial puppet run etc (T274023)
  • 23:44 twentyafterfour: Train status for wmf.30 (T271344) is blocked until monday. leaving wmf.30 on group1 and wmf.27 on group2 in spite of T260401
  • 23:34 robh@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mwlog1002.eqiad.wmnet with reason: REIMAGE
  • 23:32 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mwlog1002.eqiad.wmnet with reason: REIMAGE
  • 23:23 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mwdebug2002.codfw.wmnet with reason: OS upgrade
  • 23:23 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mwdebug2002.codfw.wmnet with reason: OS upgrade
  • 23:20 mutante: reimaging mwdebug2002 - stretch -> buster
  • 22:57 Urbanecm: Run scap pull at mwmaint1002
  • 22:53 mutante: powercycling crashed mwmaint1002
  • 22:53 Urbanecm: Deploy security patch for T274514
  • 22:11 legoktm@deploy1001: Synchronized php-1.36.0-wmf.30/extensions/GlobalWatchlist: GlobalWatchlist backports (duration: 01m 11s)
  • 22:05 legoktm@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1332.eqiad.wmnet with reason: REIMAGE
  • 22:03 legoktm@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1331.eqiad.wmnet with reason: REIMAGE
  • 22:03 legoktm@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1332.eqiad.wmnet with reason: REIMAGE
  • 22:01 legoktm@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1330.eqiad.wmnet with reason: REIMAGE
  • 22:01 legoktm@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1331.eqiad.wmnet with reason: REIMAGE
  • 21:59 legoktm@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1329.eqiad.wmnet with reason: REIMAGE
  • 21:59 legoktm@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1330.eqiad.wmnet with reason: REIMAGE
  • 21:57 legoktm@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1329.eqiad.wmnet with reason: REIMAGE
  • 21:55 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1354.eqiad.wmnet
  • 21:50 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1354.eqiad.wmnet
  • 21:50 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1355.eqiad.wmnet
  • 21:45 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1355.eqiad.wmnet
  • 21:44 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1359.eqiad.wmnet
  • 21:40 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1359.eqiad.wmnet
  • 21:37 mutante: mw1355, mw1359 - power cycling
  • 21:23 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1354.eqiad.wmnet with reason: REIMAGE
  • 21:21 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1354.eqiad.wmnet with reason: REIMAGE
  • 21:20 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1360.eqiad.wmnet
  • 21:12 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1360.eqiad.wmnet
  • 21:05 mutante: mw1360 - powercycling
  • 21:01 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1364.eqiad.wmnet
  • 20:59 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1364.eqiad.wmnet
  • 20:52 mutante: mw1364 - powercycled
  • 20:44 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1355.eqiad.wmnet with reason: REIMAGE
  • 20:42 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1355.eqiad.wmnet with reason: REIMAGE
  • 20:31 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1359.eqiad.wmnet with reason: REIMAGE
  • 20:29 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1359.eqiad.wmnet with reason: REIMAGE
  • 20:26 twentyafterfour: new train blocker preventing deploy of 1.36.0-wmf.30 to all wikis. T274589 blocks T271344
  • 20:24 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1365.eqiad.wmnet
  • 20:23 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1365.eqiad.wmnet
  • 20:15 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1360.eqiad.wmnet with reason: REIMAGE
  • 20:13 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1360.eqiad.wmnet with reason: REIMAGE
  • 20:11 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1361.eqiad.wmnet
  • 20:09 mutante: mw1365 - powercycle - reboot issue
  • 20:08 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1361.eqiad.wmnet
  • 20:02 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1364.eqiad.wmnet with reason: REIMAGE
  • 20:00 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1364.eqiad.wmnet with reason: REIMAGE
  • 19:55 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1362.eqiad.wmnet
  • 19:54 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1362.eqiad.wmnet
  • 19:42 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1368.eqiad.wmnet
  • 19:41 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1361.eqiad.wmnet with reason: REIMAGE
  • 19:40 mutante: mw1368 - had the reboot via IPMI issue, did DRAC reset and repeated wmf-autoreimage, issue did not happen again
  • 19:40 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1368.eqiad.wmnet
  • 19:39 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1361.eqiad.wmnet with reason: REIMAGE
  • 19:32 urbanecm@deploy1001: Synchronized wmf-config/logos.php: noop: a1244df: Add inline documentation to configuration about updating logos regarding labs (duration: 01m 08s)
  • 19:24 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1365.eqiad.wmnet with reason: REIMAGE
  • 19:24 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 93e168c: Added Kokebok namespace to nowikibooks (T274265) (duration: 01m 20s)
  • 19:23 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1362.eqiad.wmnet with reason: REIMAGE
  • 19:22 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1365.eqiad.wmnet with reason: REIMAGE
  • 19:20 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1362.eqiad.wmnet with reason: REIMAGE
  • 19:20 robh@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 19:15 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1363.eqiad.wmnet
  • 19:13 robh@cumin1001: START - Cookbook sre.dns.netbox
  • 19:13 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1363.eqiad.wmnet
  • 19:13 robh@cumin1001: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
  • 19:12 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1368.eqiad.wmnet with reason: REIMAGE
  • 19:10 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1368.eqiad.wmnet with reason: REIMAGE
  • 19:04 mutante: mw1363 - powercycled, reboot issue
  • 18:56 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1374.eqiad.wmnet
  • 18:48 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1374.eqiad.wmnet
  • 18:46 mutante: mw1368 - racadm racreset
  • 18:46 mutante: mw1368 - reboot via IPMI issue & can't powercycle "Unable to perform requested operation." - racreset
  • 18:43 mutante: mw1374 - powercycled, reboot via ipmi issue
  • 18:19 robh@cumin1001: START - Cookbook sre.dns.netbox
  • 18:18 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 18:11 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
  • 18:00 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1368.eqiad.wmnet with reason: REIMAGE
  • 17:59 bblack: lvs2007 - downtimes ended, back in service - T274571
  • 17:58 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1363.eqiad.wmnet with reason: REIMAGE
  • 17:57 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1368.eqiad.wmnet with reason: REIMAGE
  • 17:56 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1363.eqiad.wmnet with reason: REIMAGE
  • 17:56 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1374.eqiad.wmnet with reason: REIMAGE
  • 17:54 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1374.eqiad.wmnet with reason: REIMAGE
  • 17:52 bblack: lvs2007 - starting up puppet + pybal - T274571
  • 17:36 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1375.eqiad.wmnet
  • 17:35 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1375.eqiad.wmnet
  • 17:32 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1376.eqiad.wmnet
  • 17:31 bblack: lvs2007 - shutting down host - T274571
  • 17:27 bblack: lvs2007 - stopping pybal - T274571
  • 17:26 bblack: lvs2007 - puppet disabled, downtimed in icinga - T274571
  • 17:20 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 17:11 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1376.eqiad.wmnet
  • 17:09 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
  • 17:07 mutante: mw1375 - powercycle - stuck at reboot
  • 17:03 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1376.eqiad.wmnet
  • 16:39 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1368.eqiad.wmnet with reason: cumin execution failed during wmf-reimaged
  • 16:39 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1368.eqiad.wmnet with reason: cumin execution failed during wmf-reimaged
  • 16:38 mutante: mw1368 - File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 637, in _execute raise RemoteExecutionError(ret, 'Cumin execution failed')
  • 16:33 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1368.eqiad.wmnet with reason: REIMAGE
  • 16:32 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1375.eqiad.wmnet with reason: REIMAGE
  • 16:30 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1376.eqiad.wmnet with reason: REIMAGE
  • 16:30 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1375.eqiad.wmnet with reason: REIMAGE
  • 16:28 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1368.eqiad.wmnet with reason: REIMAGE
  • 16:28 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1376.eqiad.wmnet with reason: REIMAGE
  • 16:24 ejegg: updated payments-wiki from a232fc3438 to 4b7b195c8a
  • 16:13 kormat@cumin1001: dbctl commit (dc=all): 'Pool db1163 at 1%, again T258361', diff saved to https://phabricator.wikimedia.org/P14323 and previous config saved to /var/cache/conftool/dbconfig/20210211-161308-kormat.json
  • 15:52 jynus: deploying fixed grants to db1163
  • 15:50 gehel: ban elastic2054 from shard allocation - T274555
  • 15:49 jynus@cumin1001: dbctl commit (dc=all): 'Depool 1163', diff saved to https://phabricator.wikimedia.org/P14321 and previous config saved to /var/cache/conftool/dbconfig/20210211-154902-jynus.json
  • 15:47 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host serpens.wikimedia.org
  • 15:46 gehel: depooling elastic2054 - T274555
  • 15:45 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host serpens.wikimedia.org
  • 15:45 kormat@cumin1001: dbctl commit (dc=all): 'Pool db1163 at 1% T258361', diff saved to https://phabricator.wikimedia.org/P14320 and previous config saved to /var/cache/conftool/dbconfig/20210211-154501-kormat.json
  • 15:39 gehel: powercycle elastic2054 - T274555
  • 15:39 gehel: powercycle elastic2054
  • 14:44 kormat@cumin1001: dbctl commit (dc=all): 'Add db1163 to s1 T258361', diff saved to https://phabricator.wikimedia.org/P14318 and previous config saved to /var/cache/conftool/dbconfig/20210211-144445-kormat.json
  • 14:24 mholloway-shell@deploy1001: Synchronized wmf-config/InitialiseSettings.php: EventStreams: Update sampling config syntax for test.instrumentation.sampled (duration: 01m 08s)
  • 14:11 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2001.wikimedia.org
  • 14:02 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host netmon2001.wikimedia.org
  • 13:53 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe1003.eqiad.wmnet
  • 13:48 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host thanos-fe1003.eqiad.wmnet
  • 13:48 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe1002.eqiad.wmnet
  • 13:41 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host thanos-fe1002.eqiad.wmnet
  • 13:28 godog: test grafana 7.4.1 upgrade on grafana2001 - T263747
  • 13:27 moritzm: re-adding ganeti5002 to the eqsin Ganeti cluster following mainboard replacement/reinstall T261130
  • 13:22 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe1001.eqiad.wmnet
  • 13:16 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host thanos-fe1001.eqiad.wmnet
  • 13:08 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2003.codfw.wmnet
  • 13:04 hnowlan@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'similar-users' for release 'main' .
  • 13:03 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host thanos-fe2003.codfw.wmnet
  • 13:00 hnowlan@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'similar-users' for release 'main' .
  • 12:57 hnowlan@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
  • 12:53 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2002.codfw.wmnet
  • 12:45 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host thanos-fe2002.codfw.wmnet
  • 12:41 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet
  • 12:40 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2001.codfw.wmnet
  • 12:40 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: d2b1df1: Changing frwiktionary wmgBabelMainCategory (T274137) (duration: 01m 08s)
  • 12:37 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
  • 12:35 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host thanos-fe2001.codfw.wmnet
  • 12:18 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Config: wikidata: post edit constraint jobs on 50% of edits (T204031) (up from 40%) (duration: 01m 08s)
  • 12:15 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Config: wikidata: add Dagbani to wmgExtraLanguageNames (T272242) (duration: 01m 29s)
  • 12:06 jynus: restart-failed systemd on cumin1001 after s5 eqiad snapshot failed
  • 11:49 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thumbor2002.codfw.wmnet
  • 11:45 mvolz@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' .
  • 11:41 jynus@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1163.eqiad.wmnet with reason: REIMAGE
  • 11:40 mvolz@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' .
  • 11:39 jynus@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1163.eqiad.wmnet with reason: REIMAGE
  • 11:39 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single for host thumbor2002.codfw.wmnet
  • 11:35 mvolz@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' .
  • 11:35 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thumbor2001.codfw.wmnet
  • 11:25 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single for host thumbor2001.codfw.wmnet
  • 11:25 jiji@cumin1001: conftool action : set/pooled=yes; selector: name=thumbor1004.eqiad.wmnet
  • 11:17 mvolz@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'zotero' for release 'production' .
  • 11:13 mvolz@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'zotero' for release 'production' .
  • 11:06 mvolz@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'staging' .
  • 11:04 kormat@cumin1001: dbctl commit (dc=all): 'db1118 (re)pooling @ 100%: changed binlog_format T274472', diff saved to https://phabricator.wikimedia.org/P14315 and previous config saved to /var/cache/conftool/dbconfig/20210211-110447-kormat.json
  • 11:03 moritzm: installing firejail security updates on Stretch
  • 10:56 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2004.codfw.wmnet
  • 10:50 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host thanos-be2004.codfw.wmnet
  • 10:49 kormat@cumin1001: dbctl commit (dc=all): 'db1118 (re)pooling @ 66%: changed binlog_format T274472', diff saved to https://phabricator.wikimedia.org/P14314 and previous config saved to /var/cache/conftool/dbconfig/20210211-104943-kormat.json
  • 10:48 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2003.codfw.wmnet
  • 10:40 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host thanos-be2003.codfw.wmnet
  • 10:39 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2002.codfw.wmnet
  • 10:34 kormat@cumin1001: dbctl commit (dc=all): 'db1118 (re)pooling @ 33%: changed binlog_format T274472', diff saved to https://phabricator.wikimedia.org/P14313 and previous config saved to /var/cache/conftool/dbconfig/20210211-103440-kormat.json
  • 10:33 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host thanos-be2002.codfw.wmnet
  • 10:25 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2001.codfw.wmnet
  • 10:20 kormat@cumin1001: dbctl commit (dc=all): 'db1118 depooling: change binlog_format', diff saved to https://phabricator.wikimedia.org/P14312 and previous config saved to /var/cache/conftool/dbconfig/20210211-101959-kormat.json
  • 10:19 kormat@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1118.eqiad.wmnet with reason: Depooling to change binlog_format T274472
  • 10:19 kormat@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on db1118.eqiad.wmnet with reason: Depooling to change binlog_format T274472
  • 10:18 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host thanos-be2001.codfw.wmnet
  • 10:15 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1004.eqiad.wmnet
  • 10:14 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4025.ulsfo.wmnet
  • 10:14 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4031.ulsfo.wmnet
  • 10:13 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2035.codfw.wmnet
  • 10:12 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1084.eqiad.wmnet
  • 10:12 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3059.esams.wmnet
  • 10:12 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1083.eqiad.wmnet
  • 10:12 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3058.esams.wmnet
  • 10:12 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2036.codfw.wmnet
  • 10:10 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5005.eqsin.wmnet
  • 10:08 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5011.eqsin.wmnet
  • 10:07 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host thanos-be1004.eqiad.wmnet
  • 10:02 jynus: switching db1118 to binlog_format=STATEMENT as new s1 master candidate
  • 10:00 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp5005.eqsin.wmnet
  • 10:00 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp5011.eqsin.wmnet
  • 10:00 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp4025.ulsfo.wmnet
  • 09:59 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp4031.ulsfo.wmnet
  • 09:59 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp3059.esams.wmnet
  • 09:59 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp3058.esams.wmnet
  • 09:59 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp2036.codfw.wmnet
  • 09:59 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp2035.codfw.wmnet
  • 09:58 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp1084.eqiad.wmnet
  • 09:58 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp1083.eqiad.wmnet
  • 09:53 jiji@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host thumbor1004.eqiad.wmnet
  • 09:44 akosiaris@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 09:44 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet
  • 09:43 akosiaris@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 09:38 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet
  • 09:13 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2001.codfw.wmnet
  • 09:12 akosiaris@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
  • 09:10 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1003.eqiad.wmnet
  • 09:09 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host rpki2001.codfw.wmnet
  • 09:03 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host thanos-be1003.eqiad.wmnet
  • 09:00 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1002.eqiad.wmnet
  • 08:59 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single for host thumbor1004.eqiad.wmnet
  • 08:59 jiji@cumin1001: conftool action : set/pooled=yes; selector: name=thumbor1003.eqiad.wmnet
  • 08:52 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host thanos-be1002.eqiad.wmnet
  • 08:48 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1001.eqiad.wmnet
  • 08:41 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thumbor1003.eqiad.wmnet
  • 08:38 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single for host thanos-be1001.eqiad.wmnet
  • 08:35 godog: swift codfw-prod decrease HDD weight for ms-be20[16-27] - T272837
  • 08:29 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single for host thumbor1003.eqiad.wmnet
  • 08:24 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast3005.wikimedia.org
  • 08:18 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host bast3005.wikimedia.org
  • 08:11 legoktm@deploy1001: Synchronized php-1.36.0-wmf.30/vendor/wikimedia/shellbox/src/Command/BashWrapper.php: wikimedia/shellbox: Don't unconditionally allowPath( 'limit.sh' ) - T274474 (duration: 01m 32s)
  • 08:09 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single for host thumbor1002.eqiad.wmnet
  • 08:07 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4003.wikimedia.org
  • 08:01 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host bast4003.wikimedia.org
  • 07:59 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5002.wikimedia.org
  • 07:51 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single for host bast5002.wikimedia.org
  • 07:46 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1021.eqiad.wmnet
  • 07:44 XioNoX: push improved loopback dhcp term to all routers
  • 07:39 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single for host mc1021.eqiad.wmnet
  • 07:25 effie: pool thumbor1001
  • 07:06 elukey@puppetmaster1001: conftool action : set/pooled=no; selector: name=thumbor1001.eqiad.wmnet
  • 07:06 elukey: powercycle thumbor1001 - no ssh, no mgmt serial tty available, no racadm getsel infos
  • 06:45 kart_: Updated cxserver to 2021-02-10-134029-production (T274133, T273456, T271980)
  • 06:41 kartik@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 06:35 kartik@deploy1001: helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 06:33 kartik@deploy1001: helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
  • 03:10 rzl@cumin1001: dbctl commit (dc=all): 'depool db1134', diff saved to https://phabricator.wikimedia.org/P14310 and previous config saved to /var/cache/conftool/dbconfig/20210211-031048-rzl.json
  • 03:10 rzl: depooled db1134
  • 02:18 milimetric@deploy1001: Finished deploy [analytics/refinery@01d811f] (thin): Fix spelling error in mediacounts job (duration: 00m 06s)
  • 02:18 milimetric@deploy1001: Started deploy [analytics/refinery@01d811f] (thin): Fix spelling error in mediacounts job
  • 02:18 milimetric@deploy1001: Finished deploy [analytics/refinery@01d811f]: Fix spelling error in mediacounts job (duration: 11m 06s)
  • 02:07 milimetric@deploy1001: Started deploy [analytics/refinery@01d811f]: Fix spelling error in mediacounts job
  • 02:05 dwisehaupt: move payments1* and frpig1* out of maintenance mode
  • 02:04 eileen: process-control config revision is 726db3446a
  • 02:02 dwisehaupt: move civi1001 out of maintenance mode
  • 01:54 eileen: civicrm revision changed from 3776363c90 to b81cb5e702, config revision is f216d8fe8e
  • 01:35 dwisehaupt: applying new civicrm triggers to frdb1002
  • 01:14 eileen: civicrm revision changed from 2ce8194c07 to 3776363c90, config revision is f216d8fe8e
  • 01:06 dwisehaupt: stopping mariadb replication on frdev1001 and frdb1004
  • 01:05 dwisehaupt: Move payments/civi/frpig into maint mode for civi upgrade
  • 01:04 eileen: process-control config revision is f216d8fe8e
  • 00:26 legoktm@deploy1001: Synchronized wmf-config/profiler.php: Revert "profiler: Send data to excimer-buster pipeline" (duration: 02m 00s)
  • 00:03 milimetric@deploy1001: Finished deploy [analytics/refinery@3da19b6] (thin): More fixes for jobs after cluster upgrade (duration: 00m 07s)
  • 00:03 milimetric@deploy1001: Started deploy [analytics/refinery@3da19b6] (thin): More fixes for jobs after cluster upgrade

2021-02-10

  • 23:53 milimetric@deploy1001: Finished deploy [analytics/refinery@3da19b6]: More fixes for jobs after cluster upgrade (duration: 14m 23s)
  • 23:49 legoktm@cumin1001: conftool action : set/pooled=yes; selector: name=mw1328.eqiad.wmnet
  • 23:49 legoktm@cumin1001: conftool action : set/pooled=yes; selector: name=mw1327.eqiad.wmnet
  • 23:49 legoktm@cumin1001: conftool action : set/pooled=yes; selector: name=mw1326.eqiad.wmnet
  • 23:49 legoktm@cumin1001: conftool action : set/pooled=yes; selector: name=mw1325.eqiad.wmnet
  • 23:38 milimetric@deploy1001: Started deploy [analytics/refinery@3da19b6]: More fixes for jobs after cluster upgrade
  • 23:36 eileen: civicrm revision changed from ae24f87158 to 2ce8194c07, config revision is a48a7db0a2
  • 22:37 legoktm@cumin1001: conftool action : set/pooled=no; selector: name=mw1328.eqiad.wmnet
  • 22:37 legoktm@cumin1001: conftool action : set/pooled=no; selector: name=mw1327.eqiad.wmnet
  • 22:37 legoktm@cumin1001: conftool action : set/pooled=no; selector: name=mw1326.eqiad.wmnet
  • 22:37 legoktm@cumin1001: conftool action : set/pooled=no; selector: name=mw1325.eqiad.wmnet
  • 22:32 ebernhardson@deploy1001: Finished deploy [wikimedia/discovery/analytics@d97f7d9]: query_clicks: Remove result file merging (duration: 01m 27s)
  • 22:30 ebernhardson@deploy1001: Started deploy [wikimedia/discovery/analytics@d97f7d9]: query_clicks: Remove result file merging
  • 22:24 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1377.eqiad.wmnet
  • 22:23 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1369.eqiad.wmnet
  • 22:14 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1377.eqiad.wmnet
  • 22:13 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1369.eqiad.wmnet
  • 22:07 mutante: mw1369, mw1377 - all servers in this section now consistently fail to reboot when triggered as the last step of wmf-reimage script
  • 21:43 legoktm@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1328.eqiad.wmnet with reason: REIMAGE
  • 21:41 legoktm@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1327.eqiad.wmnet with reason: REIMAGE
  • 21:41 legoktm@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1328.eqiad.wmnet with reason: REIMAGE
  • 21:39 legoktm@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1326.eqiad.wmnet with reason: REIMAGE
  • 21:39 legoktm@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1327.eqiad.wmnet with reason: REIMAGE
  • 21:37 legoktm@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1325.eqiad.wmnet with reason: REIMAGE
  • 21:37 legoktm@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1326.eqiad.wmnet with reason: REIMAGE
  • 21:35 legoktm@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1325.eqiad.wmnet with reason: REIMAGE
  • 21:06 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1377.eqiad.wmnet with reason: REIMAGE
  • 21:04 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1369.eqiad.wmnet with reason: REIMAGE
  • 21:02 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1377.eqiad.wmnet with reason: REIMAGE
  • 21:02 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1369.eqiad.wmnet with reason: REIMAGE
  • 20:39 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1293.eqiad.wmnet
  • 20:37 eileen: civicrm revision changed from f161a34266 to ae24f87158, config revision is a48a7db0a2
  • 20:36 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1293.eqiad.wmnet
  • 20:36 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1370.eqiad.wmnet
  • 20:36 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1378.eqiad.wmnet
  • 20:31 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1378.eqiad.wmnet
  • 20:31 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1370.eqiad.wmnet
  • 20:23 mutante: mw1370, mw1378 - powercycling via DRAC
  • 20:21 mutante: mw1370, mw1378 - again failing to reboot as the last step of reimaging script
  • 20:19 jgleeson: updated civicrm from 1e9a86dd6e to f161a34266
  • 20:13 twentyafterfour@deploy1001: Synchronized php: group1 wikis to 1.36.0-wmf.30 (duration: 01m 02s)
  • 20:12 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.30
  • 20:05 legoktm@cumin1001: conftool action : set/pooled=yes; selector: name=mw1324.eqiad.wmnet
  • 20:01 ebernhardson@deploy1001: Finished deploy [wikimedia/discovery/analytics@1c5477d]: query_clicks: timestamp is now a reserved keyword (duration: 02m 19s)
  • 20:01 legoktm@cumin1001: conftool action : set/pooled=yes; selector: name=mw1323.eqiad.wmnet
  • 20:00 legoktm@cumin1001: conftool action : set/pooled=yes; selector: name=mw1322.eqiad.wmnet
  • 20:00 legoktm@cumin1001: conftool action : set/pooled=yes; selector: name=mw1321.eqiad.wmnet
  • 19:59 ebernhardson@deploy1001: Started deploy [wikimedia/discovery/analytics@1c5477d]: query_clicks: timestamp is now a reserved keyword
  • 19:54 legoktm@cumin1001: conftool action : set/pooled=no; selector: name=mw1324.eqiad.wmnet
  • 19:54 legoktm@cumin1001: conftool action : set/pooled=no; selector: name=mw1323.eqiad.wmnet
  • 19:54 legoktm@cumin1001: conftool action : set/pooled=no; selector: name=mw1322.eqiad.wmnet
  • 19:54 legoktm@cumin1001: conftool action : set/pooled=no; selector: name=mw1321.eqiad.wmnet
  • 19:53 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 19:49 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1293.eqiad.wmnet with reason: REIMAGE
  • 19:47 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
  • 19:46 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1293.eqiad.wmnet with reason: REIMAGE
  • 19:25 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1378.eqiad.wmnet with reason: REIMAGE
  • 19:23 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1378.eqiad.wmnet with reason: REIMAGE
  • 19:20 thcipriani@deploy1001: Synchronized wmf-config/ProductionServices.php: Remove a couple of useless DNS lookups from mediawiki-config T231025 (duration: 01m 10s)
  • 19:19 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1294.eqiad.wmnet
  • 19:17 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1294.eqiad.wmnet
  • 19:17 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1379.eqiad.wmnet
  • 19:16 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1370.eqiad.wmnet with reason: REIMAGE
  • 19:15 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1379.eqiad.wmnet
  • 19:14 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1370.eqiad.wmnet with reason: REIMAGE
  • 19:12 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1294.eqiad.wmnet
  • 19:04 mutante: mw1379 - racadm racreset - host did not come back from reboot and DRAC says it can't powercycle it, while it is also ALREADY ON
  • 19:00 legoktm@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1324.eqiad.wmnet with reason: REIMAGE
  • 19:00 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw1379.eqiad.wmnet
  • 18:58 legoktm@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1323.eqiad.wmnet with reason: REIMAGE
  • 18:58 legoktm@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1324.eqiad.wmnet with reason: REIMAGE
  • 18:57 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1371.eqiad.wmnet
  • 18:56 legoktm@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1322.eqiad.wmnet with reason: REIMAGE
  • 18:56 legoktm@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1323.eqiad.wmnet with reason: REIMAGE
  • 18:54 legoktm@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1321.eqiad.wmnet with reason: REIMAGE
  • 18:54 legoktm@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1322.eqiad.wmnet with reason: REIMAGE
  • 18:52 legoktm@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1321.eqiad.wmnet with reason: REIMAGE
  • 18:36 andrew@deploy1001: Finished deploy [horizon/deploy@02cb8a4]: security group dashboard policy updates, now after doing a submodule update! (duration: 03m 31s)
  • 18:34 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1379.eqiad.wmnet
  • 18:33 andrew@deploy1001: Started deploy [horizon/deploy@02cb8a4]: security group dashboard policy updates, now after doing a submodule update!
  • 18:32 andrew@deploy1001: Finished deploy [horizon/deploy@4f5a5a7]: security group dashboard policy updates (duration: 00m 07s)
  • 18:32 andrew@deploy1001: Started deploy [horizon/deploy@4f5a5a7]: security group dashboard policy updates
  • 18:32 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1371.eqiad.wmnet
  • 18:20 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1294.eqiad.wmnet with reason: REIMAGE
  • 18:18 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1294.eqiad.wmnet with reason: REIMAGE
  • 18:14 jiji@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host thumbor1001.eqiad.wmnet
  • 17:51 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1295.eqiad.wmnet
  • 17:43 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1295.eqiad.wmnet
  • 17:19 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single for host thumbor1001.eqiad.wmnet
  • 17:18 shdubsh: restart pybal on low-traffic lvs1015
  • 17:13 shdubsh: restart pybal on backup lvs1016
  • 17:13 andrew@deploy1001: Finished deploy [horizon/deploy@4f5a5a7]: puppet dashboard policy updates (duration: 03m 53s)
  • 17:09 andrew@deploy1001: Started deploy [horizon/deploy@4f5a5a7]: puppet dashboard policy updates
  • 16:57 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1295.eqiad.wmnet with reason: REIMAGE
  • 16:53 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1295.eqiad.wmnet with reason: REIMAGE
  • 16:44 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1379.eqiad.wmnet with reason: REIMAGE
  • 16:42 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1379.eqiad.wmnet with reason: REIMAGE
  • 16:42 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1371.eqiad.wmnet with reason: REIMAGE
  • 16:40 dzahn@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on mw1371.eqiad.wmnet with reason: REIMAGE
  • 16:20 moritzm: installing unzip security updates
  • 16:12 moritzm: installing atftp security updates
  • 16:01 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on maps1005.eqiad.wmnet with reason: Resyncing database, still
  • 16:01 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime for 20:00:00 on maps1005.eqiad.wmnet with reason: Resyncing database, still
  • 15:26 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Do not produce canary events for rdf-streaming-updater streams - T269619 (duration: 01m 13s)
  • 15:11 hashar@deploy1001: rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.30
  • 15:05 hashar: group0 wikis to 1.36.0-wmf.30 T271344
  • 14:59 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2033.codfw.wmnet
  • 14:57 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4024.ulsfo.wmnet
  • 14:56 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3057.esams.wmnet
  • 14:51 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1082.eqiad.wmnet
  • 14:51 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1081.eqiad.wmnet
  • 14:51 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5010.eqsin.wmnet
  • 14:51 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4030.ulsfo.wmnet
  • 14:51 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3056.esams.wmnet
  • 14:51 jynus: updating puppet-compiler-facts
  • 14:50 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5004.eqsin.wmnet
  • 14:50 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2034.codfw.wmnet
  • 14:45 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp2033.codfw.wmnet
  • 14:41 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp5004.eqsin.wmnet
  • 14:41 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp5010.eqsin.wmnet
  • 14:41 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp4024.ulsfo.wmnet
  • 14:40 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp4030.ulsfo.wmnet
  • 14:40 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp3057.esams.wmnet
  • 14:39 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp3056.esams.wmnet
  • 14:39 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp2034.codfw.wmnet
  • 14:39 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp1082.eqiad.wmnet
  • 14:39 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp1081.eqiad.wmnet
  • 12:26 dcausse@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T269619: [wdqs] Add flink sideoutput stream definitions (duration: 01m 06s)
  • 12:20 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings-labs.php: Config: Remove Wikibase.NewItemIdFormatter log channel (T268870) 2/2 (prod no-op) (duration: 01m 08s)
  • 12:18 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Config: Remove Wikibase.NewItemIdFormatter log channel (T268870) 1/2 (duration: 01m 07s)
  • 12:11 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: e8214ee: Enable GrowthExperiments on bnwiki (T266020) (duration: 01m 08s)
  • 12:07 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 2d8cb10: Set wgGEHelpPanelAskMentor to true for several wikis (T272753) (duration: 01m 21s)
  • 12:05 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5003.eqsin.wmnet
  • 12:01 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4029.ulsfo.wmnet
  • 11:56 vgutierrez: powercycle cp5003
  • 11:55 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3055.esams.wmnet
  • 11:55 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5009.eqsin.wmnet
  • 11:54 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1080.eqiad.wmnet
  • 11:54 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3054.esams.wmnet
  • 11:54 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1079.eqiad.wmnet
  • 11:54 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2032.codfw.wmnet
  • 11:54 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2031.codfw.wmnet
  • 11:45 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp5003.eqsin.wmnet
  • 11:45 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp5009.eqsin.wmnet
  • 11:44 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp4029.ulsfo.wmnet
  • 11:44 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp3055.esams.wmnet
  • 11:43 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp3054.esams.wmnet
  • 11:43 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp2032.codfw.wmnet
  • 11:43 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp2031.codfw.wmnet
  • 11:43 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp1080.eqiad.wmnet
  • 11:42 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp1079.eqiad.wmnet
  • 11:28 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4023.ulsfo.wmnet
  • 11:22 legoktm@cumin1001: conftool action : set/pooled=yes; selector: name=mw1404.eqiad.wmnet
  • 11:18 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp4023.ulsfo.wmnet
  • 11:00 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5008.eqsin.wmnet
  • 10:46 marostegui@cumin1001: dbctl commit (dc=all): 'db1127 (re)pooling @ 100%: Slowly repool db1127', diff saved to https://phabricator.wikimedia.org/P14301 and previous config saved to /var/cache/conftool/dbconfig/20210210-104649-root.json
  • 10:42 vgutierrez: powercycle cp5008
  • 10:40 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4028.ulsfo.wmnet
  • 10:40 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5002.eqsin.wmnet
  • 10:40 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4022.ulsfo.wmnet
  • 10:39 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3052.esams.wmnet
  • 10:39 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2030.codfw.wmnet
  • 10:38 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3053.esams.wmnet
  • 10:38 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2029.codfw.wmnet
  • 10:37 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1078.eqiad.wmnet
  • 10:37 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1077.eqiad.wmnet
  • 10:31 marostegui@cumin1001: dbctl commit (dc=all): 'db1127 (re)pooling @ 80%: Slowly repool db1127', diff saved to https://phabricator.wikimedia.org/P14300 and previous config saved to /var/cache/conftool/dbconfig/20210210-103146-root.json
  • 10:28 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp5002.eqsin.wmnet
  • 10:28 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp5008.eqsin.wmnet
  • 10:28 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp4022.ulsfo.wmnet
  • 10:27 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp4028.ulsfo.wmnet
  • 10:27 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp3053.esams.wmnet
  • 10:26 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp3052.esams.wmnet
  • 10:26 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp2030.codfw.wmnet
  • 10:25 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp2029.codfw.wmnet
  • 10:25 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp1078.eqiad.wmnet
  • 10:25 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp1077.eqiad.wmnet
  • 10:16 marostegui@cumin1001: dbctl commit (dc=all): 'db1127 (re)pooling @ 60%: Slowly repool db1127', diff saved to https://phabricator.wikimedia.org/P14299 and previous config saved to /var/cache/conftool/dbconfig/20210210-101642-root.json
  • 10:16 moritzm: installing firejail security updates
  • 10:14 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 10:10 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4021.ulsfo.wmnet
  • 10:05 marostegui@cumin1001: START - Cookbook sre.hosts.decommission
  • 10:01 marostegui@cumin1001: dbctl commit (dc=all): 'db1127 (re)pooling @ 40%: Slowly repool db1127', diff saved to https://phabricator.wikimedia.org/P14298 and previous config saved to /var/cache/conftool/dbconfig/20210210-100139-root.json
  • 10:01 marostegui@cumin1001: dbctl commit (dc=all): 'db1076 (re)pooling @ 100%: Slowly repooling db1076 after cloning db1162', diff saved to https://phabricator.wikimedia.org/P14297 and previous config saved to /var/cache/conftool/dbconfig/20210210-100111-root.json
  • 10:00 vgutierrez: power cycling cp4021
  • 09:57 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5007.eqsin.wmnet
  • 09:56 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5001.eqsin.wmnet
  • 09:50 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4027.ulsfo.wmnet
  • 09:48 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3051.esams.wmnet
  • 09:47 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3050.esams.wmnet
  • 09:47 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2027.codfw.wmnet
  • 09:46 marostegui@cumin1001: dbctl commit (dc=all): 'db1127 (re)pooling @ 20%: Slowly repool db1127', diff saved to https://phabricator.wikimedia.org/P14296 and previous config saved to /var/cache/conftool/dbconfig/20210210-094635-root.json
  • 09:46 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2028.codfw.wmnet
  • 09:46 marostegui@cumin1001: dbctl commit (dc=all): 'db1076 (re)pooling @ 75%: Slowly repooling db1076 after cloning db1162', diff saved to https://phabricator.wikimedia.org/P14295 and previous config saved to /var/cache/conftool/dbconfig/20210210-094608-root.json
  • 09:46 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1075.eqiad.wmnet
  • 09:45 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1076.eqiad.wmnet
  • 09:40 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp5001.eqsin.wmnet
  • 09:40 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp5007.eqsin.wmnet
  • 09:38 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp4021.ulsfo.wmnet
  • 09:38 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp4027.ulsfo.wmnet
  • 09:37 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp3051.esams.wmnet
  • 09:36 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp3050.esams.wmnet
  • 09:35 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp2028.codfw.wmnet
  • 09:34 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp2027.codfw.wmnet
  • 09:33 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp1076.eqiad.wmnet
  • 09:31 vgutierrez@cumin1001: START - Cookbook sre.hosts.reboot-single for host cp1075.eqiad.wmnet
  • 09:31 marostegui@cumin1001: dbctl commit (dc=all): 'db1127 (re)pooling @ 10%: Slowly repool db1127', diff saved to https://phabricator.wikimedia.org/P14294 and previous config saved to /var/cache/conftool/dbconfig/20210210-093132-root.json
  • 09:31 marostegui@cumin1001: dbctl commit (dc=all): 'db1076 (re)pooling @ 50%: Slowly repooling db1076 after cloning db1162', diff saved to https://phabricator.wikimedia.org/P14293 and previous config saved to /var/cache/conftool/dbconfig/20210210-093104-root.json
  • 09:30 marostegui@cumin1001: dbctl commit (dc=all): 'db1157 (re)pooling @ 100%: Slowly repool db1127', diff saved to https://phabricator.wikimedia.org/P14292 and previous config saved to /var/cache/conftool/dbconfig/20210210-093011-root.json
  • 09:23 vgutierrez: rolling restart of cp nodes to catch up on kernel upgrades
  • 09:16 marostegui@cumin1001: dbctl commit (dc=all): 'db1076 (re)pooling @ 25%: Slowly repooling db1076 after cloning db1162', diff saved to https://phabricator.wikimedia.org/P14290 and previous config saved to /var/cache/conftool/dbconfig/20210210-091601-root.json
  • 09:15 marostegui@cumin1001: dbctl commit (dc=all): 'db1157 (re)pooling @ 80%: Slowly repool db1127', diff saved to https://phabricator.wikimedia.org/P14289 and previous config saved to /var/cache/conftool/dbconfig/20210210-091507-root.json
  • 09:11 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh-clients (exit_code=0) for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001
  • 09:10 elukey@cumin1001: START - Cookbook sre.hadoop.change-distro-from-cdh-clients for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001
  • 09:00 marostegui@cumin1001: dbctl commit (dc=all): 'db1076 (re)pooling @ 10%: Slowly repooling db1076 after cloning db1162', diff saved to https://phabricator.wikimedia.org/P14288 and previous config saved to /var/cache/conftool/dbconfig/20210210-090057-root.json
  • 09:00 marostegui@cumin1001: dbctl commit (dc=all): 'db1157 (re)pooling @ 60%: Slowly repool db1127', diff saved to https://phabricator.wikimedia.org/P14287 and previous config saved to /var/cache/conftool/dbconfig/20210210-090004-root.json
  • 08:45 marostegui@cumin1001: dbctl commit (dc=all): 'db1157 (re)pooling @ 40%: Slowly repool db1127', diff saved to https://phabricator.wikimedia.org/P14286 and previous config saved to /var/cache/conftool/dbconfig/20210210-084500-root.json
  • 08:41 legoktm: depooling mw1404.eqiad.wmnet for perf benchmarking (T274041)
  • 08:41 legoktm@cumin1001: conftool action : set/pooled=no; selector: name=mw1404.eqiad.wmnet
  • 08:29 marostegui@cumin1001: dbctl commit (dc=all): 'db1157 (re)pooling @ 20%: Slowly repool db1127', diff saved to https://phabricator.wikimedia.org/P14285 and previous config saved to /var/cache/conftool/dbconfig/20210210-082957-root.json
  • 08:19 godog: swift eqiad-prod: decrease weight for SSDs on ms-be[1019-1026] - T272836
  • 08:14 marostegui@cumin1001: dbctl commit (dc=all): 'db1157 (re)pooling @ 10%: Slowly repool db1127', diff saved to https://phabricator.wikimedia.org/P14284 and previous config saved to /var/cache/conftool/dbconfig/20210210-081453-root.json
  • 08:05 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1127 T266483', diff saved to https://phabricator.wikimedia.org/P14283 and previous config saved to /var/cache/conftool/dbconfig/20210210-080512-marostegui.json
  • 06:43 marostegui@cumin1001: dbctl commit (dc=all): 'Fully pool db1170:3312, db1170:3317 T258361', diff saved to https://phabricator.wikimedia.org/P14282 and previous config saved to /var/cache/conftool/dbconfig/20210210-064330-marostegui.json
  • 06:35 marostegui@cumin1001: dbctl commit (dc=all): 'Give more weight to db1170:3312, db1170:3317 T258361', diff saved to https://phabricator.wikimedia.org/P14281 and previous config saved to /var/cache/conftool/dbconfig/20210210-063534-marostegui.json
  • 06:22 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1162.eqiad.wmnet with reason: REIMAGE
  • 06:20 marostegui@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1162.eqiad.wmnet with reason: REIMAGE
  • 06:19 marostegui@cumin1001: dbctl commit (dc=all): 'Pool db1170:3312, db1170:3317 with minimal weight for the first time T258361', diff saved to https://phabricator.wikimedia.org/P14279 and previous config saved to /var/cache/conftool/dbconfig/20210210-061924-marostegui.json
  • 06:16 marostegui@cumin1001: dbctl commit (dc=all): 'Add db1170:3312 and db1170:3317 to dbctl, depooled T258361', diff saved to https://phabricator.wikimedia.org/P14278 and previous config saved to /var/cache/conftool/dbconfig/20210210-061638-marostegui.json
  • 06:11 jiji@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1020.eqiad.wmnet
  • 06:04 jiji@cumin1001: START - Cookbook sre.hosts.reboot-single for host mc1020.eqiad.wmnet
  • 05:58 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1076 to clone db1162 T258361', diff saved to https://phabricator.wikimedia.org/P14277 and previous config saved to /var/cache/conftool/dbconfig/20210210-055846-marostegui.json
  • 03:46 ryankemper: `ryankemper@wdqs1012:~$ sudo systemctl restart wdqs-blazegraph.service`
  • 01:54 krinkle@deploy1001: Finished deploy [integration/docroot@0234db2]: Unbreak doc.wm.o (2) - Ib67da9 (duration: 00m 06s)
  • 01:54 krinkle@deploy1001: Started deploy [integration/docroot@0234db2]: Unbreak doc.wm.o (2) - Ib67da9
  • 01:43 krinkle@deploy1001: Finished deploy [integration/docroot@fddc7c9]: Unbreak doc.wm.o - Ibf28e02ec03 (duration: 00m 06s)
  • 01:43 krinkle@deploy1001: Started deploy [integration/docroot@fddc7c9]: Unbreak doc.wm.o - Ibf28e02ec03
  • 01:06 milimetric@deploy1001: Finished deploy [analytics/refinery@b539bf6] (thin): Job fixes after Hadoop upgrade (duration: 00m 06s)
  • 01:06 milimetric@deploy1001: Started deploy [analytics/refinery@b539bf6] (thin): Job fixes after Hadoop upgrade
  • 01:06 milimetric@deploy1001: Finished deploy [analytics/refinery@b539bf6]: Job fixes after Hadoop upgrade (duration: 10m 55s)
  • 00:58 mutante: doc1001 - reloaded apache2
  • 00:55 milimetric@deploy1001: Started deploy [analytics/refinery@b539bf6]: Job fixes after Hadoop upgrade
  • 00:42 Amir1: changing frwiki to wmf.30 in mwdebug1002 to test T264391
  • 00:33 ladsgroup@deploy1001: Synchronized php-1.36.0-wmf.30/extensions/FeaturedFeeds: Fix iss