You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Server Admin Log

From Wikitech-static
Revision as of 01:42, 3 March 2022 by imported>Stashbot (razzi@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on datahubsearch[1001-1003].eqiad.wmnet with reason: Still having errors setting up opensearch)
Jump to navigation Jump to search

2022-03-03

  • 01:42 razzi@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on datahubsearch[1001-1003].eqiad.wmnet with reason: Still having errors setting up opensearch
  • 01:42 razzi@cumin1001: START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on datahubsearch[1001-1003].eqiad.wmnet with reason: Still having errors setting up opensearch
  • 00:31 robh@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts dumpsdata1007.eqiad.wmnet
  • 00:31 robh@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 00:25 robh@cumin1001: START - Cookbook sre.dns.netbox
  • 00:21 robh@cumin1001: START - Cookbook sre.hosts.decommission for hosts dumpsdata1007.eqiad.wmnet

2022-03-02

  • 23:47 robh@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dumpsdata1007.eqiad.wmnet with OS bullseye
  • 23:37 robh@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on dumpsdata1007.eqiad.wmnet with reason: host reimage
  • 23:32 robh@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on dumpsdata1007.eqiad.wmnet with reason: host reimage
  • 23:25 ryankemper: T276198 Re-enabled puppet across fleet: `ryankemper@cumin1001:~$ sudo -E cumin 'R:Elasticsearch::instance' 'enable-puppet "deploy fix from T276198"'`
  • 23:21 robh@cumin1001: START - Cookbook sre.hosts.reimage for host dumpsdata1007.eqiad.wmnet with OS bullseye
  • 23:21 ryankemper: T276198 https://gerrit.wikimedia.org/r/c/operations/puppet/+/767600 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/767603/ fixed all the problems. Re-enabling puppet on elastic*, cloudelastic*, and relforge* shortly
  • 23:15 robh@cumin1001: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dumpsdata1007.eqiad.wmnet with OS bullseye
  • 23:08 robh@cumin1001: START - Cookbook sre.hosts.reimage for host dumpsdata1007.eqiad.wmnet with OS bullseye
  • 22:56 rzl@deploy1002: helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply
  • 22:55 rzl@deploy1002: helmfile [codfw] START helmfile.d/services/shellbox-media: apply
  • 22:55 rzl@deploy1002: helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply
  • 22:54 rzl@deploy1002: helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply
  • 22:54 rzl@deploy1002: helmfile [codfw] DONE helmfile.d/services/proton: apply
  • 22:52 rzl@deploy1002: helmfile [codfw] START helmfile.d/services/proton: apply
  • 22:52 rzl@deploy1002: helmfile [codfw] DONE helmfile.d/services/mathoid: apply
  • 22:51 rzl@deploy1002: helmfile [codfw] START helmfile.d/services/mathoid: apply
  • 22:51 rzl@deploy1002: helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply
  • 22:50 rzl@deploy1002: helmfile [codfw] START helmfile.d/services/linkrecommendation: apply
  • 22:50 rzl@deploy1002: helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply
  • 22:49 rzl@deploy1002: helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply
  • 22:49 rzl@deploy1002: helmfile [codfw] DONE helmfile.d/services/eventstreams: apply
  • 22:48 rzl@deploy1002: helmfile [codfw] START helmfile.d/services/eventstreams: apply
  • 22:48 rzl@deploy1002: helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply
  • 22:47 rzl@deploy1002: helmfile [codfw] START helmfile.d/services/eventgate-main: apply
  • 22:47 rzl@deploy1002: helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply
  • 22:46 rzl@deploy1002: helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply
  • 22:46 rzl@deploy1002: helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply
  • 22:45 rzl@deploy1002: helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply
  • 22:45 rzl@deploy1002: helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply
  • 22:43 rzl@deploy1002: helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply
  • 22:43 rzl@deploy1002: helmfile [codfw] DONE helmfile.d/services/cxserver: apply
  • 22:43 rzl@deploy1002: helmfile [codfw] START helmfile.d/services/cxserver: apply
  • 22:43 rzl@deploy1002: helmfile [codfw] DONE helmfile.d/services/blubberoid: apply
  • 22:42 rzl@deploy1002: helmfile [codfw] START helmfile.d/services/blubberoid: apply
  • 22:42 rzl@deploy1002: helmfile [codfw] DONE helmfile.d/services/apertium: apply
  • 22:41 rzl@deploy1002: helmfile [codfw] START helmfile.d/services/apertium: apply
  • 22:21 ryankemper: T276198 Downtimed `elastic1052` for 2 hours while troubleshooting
  • 22:16 ryankemper: T276198 Testing https://gerrit.wikimedia.org/r/c/operations/puppet/+/766876/ on `elastic1052`; elasticsearch service fails to start. It's expecting to find `/etc/tmpfiles.d/elasticsearch-production-search-psi-eqiad.conf` but the actual filename is `elasticsearch-production-search-psi-eqiad-conf.conf`. Not sure why that trailing `-conf` is there in the filename. It doesn't look like something `systemd::tmpfile` is doing.
  • 22:05 robh@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dumpsdata1007.eqiad.wmnet with OS bullseye
  • 21:59 brennen@deploy1002: Synchronized php-1.38.0-wmf.24/extensions/Linter/includes/Hooks.php: Backport: Hooks.php: Check for non-array $tags (T302918) (duration: 00m 50s)
  • 21:53 ryankemper: T276198 Disabled puppet across all of elastic*, cloudelastic*, and relforge* to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/766876/ on a single elastic host
  • 21:44 mutante: rolling out scap 4.4.2 on 'all' T302919
  • 21:36 robh@cumin1001: START - Cookbook sre.hosts.reimage for host dumpsdata1007.eqiad.wmnet with OS bullseye
  • 21:19 dancy@deploy1002: Synchronized wmf-config/InitialiseSettings.php: Config: wmf-config: Undeploy the fawiki test survey from production (T300291) (duration: 00m 50s)
  • 21:13 robh@cumin1001: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dumpsdata1007.eqiad.wmnet with OS bullseye
  • 21:10 dancy@deploy1002: rebuilt and synchronized wikiversions files: testing scap 4.4.2
  • 21:05 robh@cumin1001: START - Cookbook sre.hosts.reimage for host dumpsdata1007.eqiad.wmnet with OS bullseye
  • 21:00 mutante: deploy1002 - upgraded scap to 4.4.2-1 T302919
  • 20:48 mutante: running test-deploy to devcluster (restbase) to test new scap version, succesful and then rolled back, as the docs say T302919
  • 20:48 dzahn@deploy1002: Finished deploy [restbase/deploy@0848b15] (dev-cluster): (no justification provided) (duration: 00m 41s)
  • 20:47 dzahn@deploy1002: Started deploy [restbase/deploy@0848b15] (dev-cluster): (no justification provided)
  • 20:44 mutante: testec 'scap pull' still worked on mwdebug1001; rolling out scap 4.4.2 to A:restbase-canary (T302919)
  • 20:38 mutante: rolling out scap 4.4.2 to A:mw-canary or A:parsoid-canary or A:mw-jobrunner-canary (T302919)
  • 20:20 robh@cumin1001: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dumpsdata1007.mgmt.eqiad.wmnet with reboot policy FORCED
  • 20:11 robh@cumin1001: START - Cookbook sre.hosts.provision for host dumpsdata1007.mgmt.eqiad.wmnet with reboot policy FORCED
  • 20:07 robh@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 20:03 robh@cumin1001: START - Cookbook sre.dns.netbox
  • 19:57 brennen@deploy1002: rebuilt and synchronized wikiversions files: (no justification provided)
  • 19:53 brennen@deploy1002: rebuilt and synchronized wikiversions files: (no justification provided)
  • 19:47 robh@cumin1001: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dumpsdata1006.mgmt.eqiad.wmnet with reboot policy FORCED
  • 19:46 brennen@deploy1002: Synchronized php-1.38.0-wmf.24/extensions/ApiFeatureUsage: Backport: Add a non-namespaced alias for ApiFeatureUsageQueryEngineElastica (T302907) (duration: 00m 50s)
  • 19:45 robh@cumin1001: START - Cookbook sre.hosts.provision for host dumpsdata1006.mgmt.eqiad.wmnet with reboot policy FORCED
  • 19:36 robh@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 19:33 robh@cumin1001: START - Cookbook sre.dns.netbox
  • 19:30 mutante: stopped icinga-wm
  • 19:14 brennen@deploy1002: Synchronized php: group1 wikis to 1.38.0-wmf.24 refs T300200 (duration: 00m 50s)
  • 19:13 brennen@deploy1002: rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.24 refs T300200
  • 19:13 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148 (T300992)', diff saved to https://phabricator.wikimedia.org/P21729 and previous config saved to /var/cache/conftool/dbconfig/20220302-191323-ladsgroup.json
  • 19:10 brennen: 1.38.0-wmf.24 train (T300200): no current blockers; proceeding to group1
  • 18:58 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P21728 and previous config saved to /var/cache/conftool/dbconfig/20220302-185819-ladsgroup.json
  • 18:43 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P21727 and previous config saved to /var/cache/conftool/dbconfig/20220302-184314-ladsgroup.json
  • 18:30 cmooney@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 18:28 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1148 (T300992)', diff saved to https://phabricator.wikimedia.org/P21726 and previous config saved to /var/cache/conftool/dbconfig/20220302-182809-ladsgroup.json
  • 18:26 cmooney@cumin1001: START - Cookbook sre.dns.netbox
  • 18:21 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1148 (T300992)', diff saved to https://phabricator.wikimedia.org/P21725 and previous config saved to /var/cache/conftool/dbconfig/20220302-182153-ladsgroup.json
  • 18:21 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance
  • 18:21 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance
  • 18:21 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149 (T300992)', diff saved to https://phabricator.wikimedia.org/P21724 and previous config saved to /var/cache/conftool/dbconfig/20220302-182145-ladsgroup.json
  • 18:14 rzl@deploy1002: helmfile [staging] DONE helmfile.d/services/shellbox-media: apply
  • 18:14 rzl@deploy1002: helmfile [staging] START helmfile.d/services/shellbox-media: apply
  • 18:14 rzl@deploy1002: helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
  • 18:13 rzl@deploy1002: helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
  • 18:13 rzl@deploy1002: helmfile [staging] DONE helmfile.d/services/proton: apply
  • 18:13 rzl@deploy1002: helmfile [staging] START helmfile.d/services/proton: apply
  • 18:13 rzl@deploy1002: helmfile [staging] DONE helmfile.d/services/mathoid: apply
  • 18:12 rzl@deploy1002: helmfile [staging] START helmfile.d/services/mathoid: apply
  • 18:12 rzl@deploy1002: helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply
  • 18:12 rzl@deploy1002: helmfile [staging] START helmfile.d/services/linkrecommendation: apply
  • 18:12 rzl@deploy1002: helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply
  • 18:12 rzl@deploy1002: helmfile [staging] START helmfile.d/services/eventstreams-internal: apply
  • 18:12 rzl@deploy1002: helmfile [staging] DONE helmfile.d/services/eventstreams: apply
  • 18:11 rzl@deploy1002: helmfile [staging] START helmfile.d/services/eventstreams: apply
  • 18:11 rzl@deploy1002: helmfile [staging] DONE helmfile.d/services/eventgate-main: apply
  • 18:11 rzl@deploy1002: helmfile [staging] START helmfile.d/services/eventgate-main: apply
  • 18:11 rzl@deploy1002: helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply
  • 18:10 rzl@deploy1002: helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply
  • 18:10 rzl@deploy1002: helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply
  • 18:10 rzl@deploy1002: helmfile [staging] START helmfile.d/services/eventgate-analytics: apply
  • 18:10 rzl@deploy1002: helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply
  • 18:10 rzl@deploy1002: helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply
  • 18:10 rzl@deploy1002: helmfile [staging] DONE helmfile.d/services/cxserver: apply
  • 18:09 rzl@deploy1002: helmfile [staging] START helmfile.d/services/cxserver: apply
  • 18:09 rzl@deploy1002: helmfile [staging] DONE helmfile.d/services/blubberoid: apply
  • 18:09 rzl@deploy1002: helmfile [staging] START helmfile.d/services/blubberoid: apply
  • 18:09 rzl@deploy1002: helmfile [staging] DONE helmfile.d/services/apertium: apply
  • 18:09 rzl@deploy1002: helmfile [staging] START helmfile.d/services/apertium: apply
  • 18:06 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P21723 and previous config saved to /var/cache/conftool/dbconfig/20220302-180640-ladsgroup.json
  • 17:51 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P21722 and previous config saved to /var/cache/conftool/dbconfig/20220302-175136-ladsgroup.json
  • 17:36 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1149 (T300992)', diff saved to https://phabricator.wikimedia.org/P21721 and previous config saved to /var/cache/conftool/dbconfig/20220302-173631-ladsgroup.json
  • 17:31 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1149 (T300992)', diff saved to https://phabricator.wikimedia.org/P21720 and previous config saved to /var/cache/conftool/dbconfig/20220302-173112-ladsgroup.json
  • 17:31 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance
  • 17:31 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance
  • 17:31 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1160 (T300992)', diff saved to https://phabricator.wikimedia.org/P21719 and previous config saved to /var/cache/conftool/dbconfig/20220302-173104-ladsgroup.json
  • 17:16 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P21718 and previous config saved to /var/cache/conftool/dbconfig/20220302-171559-ladsgroup.json
  • 17:00 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P21717 and previous config saved to /var/cache/conftool/dbconfig/20220302-170055-ladsgroup.json
  • 16:51 vgutierrez: pool cp3061 running HAProxy as TLS termination layer - T290005 T271421
  • 16:50 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3061.esams.wmnet with OS buster
  • 16:45 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1160 (T300992)', diff saved to https://phabricator.wikimedia.org/P21716 and previous config saved to /var/cache/conftool/dbconfig/20220302-164550-ladsgroup.json
  • 16:33 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1160 (T300992)', diff saved to https://phabricator.wikimedia.org/P21715 and previous config saved to /var/cache/conftool/dbconfig/20220302-163329-ladsgroup.json
  • 16:33 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance
  • 16:33 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance
  • 16:33 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1121 (T300992)', diff saved to https://phabricator.wikimedia.org/P21714 and previous config saved to /var/cache/conftool/dbconfig/20220302-163322-ladsgroup.json
  • 16:27 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3061.esams.wmnet with reason: host reimage
  • 16:24 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on cp3061.esams.wmnet with reason: host reimage
  • 16:18 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P21713 and previous config saved to /var/cache/conftool/dbconfig/20220302-161817-ladsgroup.json
  • 16:03 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P21711 and previous config saved to /var/cache/conftool/dbconfig/20220302-160312-ladsgroup.json
  • 15:56 vgutierrez@cumin1001: START - Cookbook sre.hosts.reimage for host cp3061.esams.wmnet with OS buster
  • 15:49 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5014.eqsin.wmnet with OS buster
  • 15:48 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1121 (T300992)', diff saved to https://phabricator.wikimedia.org/P21710 and previous config saved to /var/cache/conftool/dbconfig/20220302-154807-ladsgroup.json
  • 15:47 vgutierrez: pool cp5014 running HAProxy as TLS termination layer - T290005 T271421
  • 15:45 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
  • 15:41 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
  • 15:40 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1121 (T300992)', diff saved to https://phabricator.wikimedia.org/P21709 and previous config saved to /var/cache/conftool/dbconfig/20220302-154039-ladsgroup.json
  • 15:40 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
  • 15:40 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
  • 15:40 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance
  • 15:40 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance
  • 15:40 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1143 (T300992)', diff saved to https://phabricator.wikimedia.org/P21708 and previous config saved to /var/cache/conftool/dbconfig/20220302-154026-ladsgroup.json
  • 15:25 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P21707 and previous config saved to /var/cache/conftool/dbconfig/20220302-152519-ladsgroup.json
  • 15:23 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5014.eqsin.wmnet with reason: host reimage
  • 15:18 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on cp5014.eqsin.wmnet with reason: host reimage
  • 15:10 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P21706 and previous config saved to /var/cache/conftool/dbconfig/20220302-151015-ladsgroup.json
  • 14:55 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1143 (T300992)', diff saved to https://phabricator.wikimedia.org/P21705 and previous config saved to /var/cache/conftool/dbconfig/20220302-145510-ladsgroup.json
  • 14:52 vgutierrez@cumin1001: START - Cookbook sre.hosts.reimage for host cp5014.eqsin.wmnet with OS buster
  • 14:50 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1143 (T300992)', diff saved to https://phabricator.wikimedia.org/P21704 and previous config saved to /var/cache/conftool/dbconfig/20220302-145054-ladsgroup.json
  • 14:50 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance
  • 14:50 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance
  • 14:50 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T300992)', diff saved to https://phabricator.wikimedia.org/P21703 and previous config saved to /var/cache/conftool/dbconfig/20220302-145046-ladsgroup.json
  • 14:41 vgutierrez@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4034.ulsfo.wmnet with OS buster
  • 14:41 vgutierrez@cumin1001: START - Cookbook sre.hosts.reimage for host cp4034.ulsfo.wmnet with OS buster
  • 14:38 vgutierrez@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4034.ulsfo.wmnet with OS buster
  • 14:37 vgutierrez@cumin1001: START - Cookbook sre.hosts.reimage for host cp4034.ulsfo.wmnet with OS buster
  • 14:35 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P21702 and previous config saved to /var/cache/conftool/dbconfig/20220302-143541-ladsgroup.json
  • 14:34 vgutierrez@cumin1001: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4034.ulsfo.wmnet with OS buster
  • 14:27 moritzm: rebalance VMs in Ganeti row A after adding new servers (and decomissioning old ones)
  • 14:26 vgutierrez@cumin1001: START - Cookbook sre.hosts.reimage for host cp4034.ulsfo.wmnet with OS buster
  • 14:24 vgutierrez@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4034.ulsfo.wmnet with OS buster
  • 14:21 ladsgroup@deploy1002: Synchronized php-1.38.0-wmf.23/extensions/FlaggedRevs/modules/ext.flaggedRevs.review/review.js: Backport: ext.flaggedRevs.review: Restore tolerance when setting "disabled" prop (duration: 00m 52s)
  • 14:20 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P21701 and previous config saved to /var/cache/conftool/dbconfig/20220302-142037-ladsgroup.json
  • 14:13 mmandere: pool cp6013
  • 14:05 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T300992)', diff saved to https://phabricator.wikimedia.org/P21700 and previous config saved to /var/cache/conftool/dbconfig/20220302-140532-ladsgroup.json
  • 14:01 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1144:3314 (T300992)', diff saved to https://phabricator.wikimedia.org/P21699 and previous config saved to /var/cache/conftool/dbconfig/20220302-140112-ladsgroup.json
  • 14:01 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
  • 14:01 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
  • 14:01 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141 (T300992)', diff saved to https://phabricator.wikimedia.org/P21698 and previous config saved to /var/cache/conftool/dbconfig/20220302-140105-ladsgroup.json
  • 13:50 vgutierrez@cumin1001: START - Cookbook sre.hosts.reimage for host cp4034.ulsfo.wmnet with OS buster
  • 13:46 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P21697 and previous config saved to /var/cache/conftool/dbconfig/20220302-134600-ladsgroup.json
  • 13:30 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P21696 and previous config saved to /var/cache/conftool/dbconfig/20220302-133055-ladsgroup.json
  • 13:15 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1141 (T300992)', diff saved to https://phabricator.wikimedia.org/P21695 and previous config saved to /var/cache/conftool/dbconfig/20220302-131550-ladsgroup.json
  • 13:10 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1141 (T300992)', diff saved to https://phabricator.wikimedia.org/P21694 and previous config saved to /var/cache/conftool/dbconfig/20220302-131032-ladsgroup.json
  • 13:10 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance
  • 13:10 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance
  • 13:10 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142 (T300992)', diff saved to https://phabricator.wikimedia.org/P21693 and previous config saved to /var/cache/conftool/dbconfig/20220302-131024-ladsgroup.json
  • 12:55 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P21692 and previous config saved to /var/cache/conftool/dbconfig/20220302-125519-ladsgroup.json
  • 12:47 reedy@deploy1002: Finished scap: Fix MassMessage translations T302840 (duration: 01m 50s)
  • 12:45 reedy@deploy1002: Started scap: Fix MassMessage translations T302840
  • 12:43 vgutierrez@cumin1001: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4034.ulsfo.wmnet with OS buster
  • 12:40 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P21690 and previous config saved to /var/cache/conftool/dbconfig/20220302-124014-ladsgroup.json
  • 12:25 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1142 (T300992)', diff saved to https://phabricator.wikimedia.org/P21689 and previous config saved to /var/cache/conftool/dbconfig/20220302-122510-ladsgroup.json
  • 12:20 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1142 (T300992)', diff saved to https://phabricator.wikimedia.org/P21688 and previous config saved to /var/cache/conftool/dbconfig/20220302-122049-ladsgroup.json
  • 12:20 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance
  • 12:20 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance
  • 12:18 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 12 hosts with reason: Maintenance
  • 12:18 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on 12 hosts with reason: Maintenance
  • 12:18 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance
  • 12:18 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance
  • 12:17 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1147 (T300992)', diff saved to https://phabricator.wikimedia.org/P21687 and previous config saved to /var/cache/conftool/dbconfig/20220302-121754-ladsgroup.json
  • 12:09 vgutierrez@cumin1001: START - Cookbook sre.hosts.reimage for host cp4034.ulsfo.wmnet with OS buster
  • 12:02 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P21686 and previous config saved to /var/cache/conftool/dbconfig/20220302-120250-ladsgroup.json
  • 11:47 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P21685 and previous config saved to /var/cache/conftool/dbconfig/20220302-114745-ladsgroup.json
  • 11:32 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1147 (T300992)', diff saved to https://phabricator.wikimedia.org/P21684 and previous config saved to /var/cache/conftool/dbconfig/20220302-113240-ladsgroup.json
  • 11:28 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1147 (T300992)', diff saved to https://phabricator.wikimedia.org/P21683 and previous config saved to /var/cache/conftool/dbconfig/20220302-112824-ladsgroup.json
  • 11:28 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance
  • 11:28 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance
  • 11:26 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
  • 11:26 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
  • 11:23 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
  • 11:23 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
  • 11:23 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T300992)', diff saved to https://phabricator.wikimedia.org/P21682 and previous config saved to /var/cache/conftool/dbconfig/20220302-112347-ladsgroup.json
  • 11:23 mbsantos@deploy1002: Finished deploy [kartotherian/deploy@3dc404c] (eqiad): Merge "Update kartotherian-package to f239c6e" (duration: 01m 29s)
  • 11:22 mbsantos: rollback maps eqiad to a previous working state to mitigate geoshape errors
  • 11:21 mbsantos@deploy1002: Started deploy [kartotherian/deploy@3dc404c] (eqiad): Merge "Update kartotherian-package to f239c6e"
  • 11:08 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P21681 and previous config saved to /var/cache/conftool/dbconfig/20220302-110842-ladsgroup.json
  • 11:05 moritzm: installing expat security updates
  • 10:56 moritzm: restarting apache2 and mailman3-web on lists.wikimedia.org for expat security update
  • 10:53 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P21680 and previous config saved to /var/cache/conftool/dbconfig/20220302-105336-ladsgroup.json
  • 10:38 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T300992)', diff saved to https://phabricator.wikimedia.org/P21678 and previous config saved to /var/cache/conftool/dbconfig/20220302-103832-ladsgroup.json
  • 10:34 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1146:3314 (T300992)', diff saved to https://phabricator.wikimedia.org/P21677 and previous config saved to /var/cache/conftool/dbconfig/20220302-103407-ladsgroup.json
  • 10:34 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance
  • 10:34 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance
  • 10:31 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
  • 10:31 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
  • 10:20 jayme@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mathoid: apply
  • 10:18 jayme@deploy1002: helmfile [eqiad] START helmfile.d/services/mathoid: apply
  • 10:15 jayme@deploy1002: helmfile [codfw] DONE helmfile.d/services/mathoid: apply
  • 10:15 jgiannelos@deploy1002: Finished deploy [kartotherian/deploy@d049589] (eqiad): Revert "Temporarily increase poolsize for debugging" (duration: 01m 45s)
  • 10:14 jayme@deploy1002: helmfile [codfw] START helmfile.d/services/mathoid: apply
  • 10:13 jgiannelos@deploy1002: Started deploy [kartotherian/deploy@d049589] (eqiad): Revert "Temporarily increase poolsize for debugging"
  • 10:13 jgiannelos@deploy1002: Finished deploy [kartotherian/deploy@d049589] (codfw): Revert "Temporarily increase poolsize for debugging" (duration: 01m 36s)
  • 10:11 jgiannelos@deploy1002: Started deploy [kartotherian/deploy@d049589] (codfw): Revert "Temporarily increase poolsize for debugging"
  • 10:09 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1110 (T300992)', diff saved to https://phabricator.wikimedia.org/P21676 and previous config saved to /var/cache/conftool/dbconfig/20220302-100903-ladsgroup.json
  • 10:04 klausman@cumin2002: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-staging-ctrl2002.codfw.wmnet
  • 09:56 jayme@deploy1002: helmfile [staging] DONE helmfile.d/services/mathoid: apply
  • 09:55 jayme@deploy1002: helmfile [staging] START helmfile.d/services/mathoid: apply
  • 09:55 klausman@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 09:53 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P21675 and previous config saved to /var/cache/conftool/dbconfig/20220302-095358-ladsgroup.json
  • 09:51 jgiannelos@deploy1002: Finished deploy [kartotherian/deploy@fd6bc59] (codfw): Temporarily increase poolsize for debugging (duration: 04m 26s)
  • 09:49 klausman@cumin2002: START - Cookbook sre.dns.netbox
  • 09:49 klausman@cumin2002: START - Cookbook sre.ganeti.makevm for new host ml-staging-ctrl2002.codfw.wmnet
  • 09:48 klausman@cumin2002: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-staging-ctrl2001.codfw.wmnet
  • 09:47 jgiannelos@deploy1002: Started deploy [kartotherian/deploy@fd6bc59] (codfw): Temporarily increase poolsize for debugging
  • 09:46 jgiannelos@deploy1002: Finished deploy [kartotherian/deploy@fd6bc59] (eqiad): Temporarily increase poolsize for debugging (duration: 02m 13s)
  • 09:44 jgiannelos@deploy1002: Started deploy [kartotherian/deploy@fd6bc59] (eqiad): Temporarily increase poolsize for debugging
  • 09:39 klausman@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 09:38 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P21674 and previous config saved to /var/cache/conftool/dbconfig/20220302-093853-ladsgroup.json
  • 09:35 klausman@cumin2002: START - Cookbook sre.dns.netbox
  • 09:35 klausman@cumin2002: START - Cookbook sre.ganeti.makevm for new host ml-staging-ctrl2001.codfw.wmnet
  • 09:30 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1167 (T302185)', diff saved to https://phabricator.wikimedia.org/P21673 and previous config saved to /var/cache/conftool/dbconfig/20220302-093027-ladsgroup.json
  • 09:23 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1110 (T300992)', diff saved to https://phabricator.wikimedia.org/P21672 and previous config saved to /var/cache/conftool/dbconfig/20220302-092348-ladsgroup.json
  • 09:21 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1110 (T300992)', diff saved to https://phabricator.wikimedia.org/P21671 and previous config saved to /var/cache/conftool/dbconfig/20220302-092128-ladsgroup.json
  • 09:21 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance
  • 09:21 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance
  • 09:21 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1100 (T300992)', diff saved to https://phabricator.wikimedia.org/P21670 and previous config saved to /var/cache/conftool/dbconfig/20220302-092120-ladsgroup.json
  • 09:16 mmandere: rolling restart of varnishkafka-* on cp6*
  • 09:15 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P21669 and previous config saved to /var/cache/conftool/dbconfig/20220302-091523-ladsgroup.json
  • 09:06 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P21668 and previous config saved to /var/cache/conftool/dbconfig/20220302-090615-ladsgroup.json
  • 09:05 XioNoX: push Capirca managed labs-in firewall filter to eqiad routers
  • 09:00 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P21667 and previous config saved to /var/cache/conftool/dbconfig/20220302-090018-ladsgroup.json
  • 08:51 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P21666 and previous config saved to /var/cache/conftool/dbconfig/20220302-085111-ladsgroup.json
  • 08:45 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1167 (T302185)', diff saved to https://phabricator.wikimedia.org/P21665 and previous config saved to /var/cache/conftool/dbconfig/20220302-084513-ladsgroup.json
  • 08:38 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1167.eqiad.wmnet with OS bullseye
  • 08:36 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1100 (T300992)', diff saved to https://phabricator.wikimedia.org/P21664 and previous config saved to /var/cache/conftool/dbconfig/20220302-083606-ladsgroup.json
  • 08:33 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1100 (T300992)', diff saved to https://phabricator.wikimedia.org/P21663 and previous config saved to /var/cache/conftool/dbconfig/20220302-083345-ladsgroup.json
  • 08:33 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance
  • 08:33 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance
  • 08:33 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21662 and previous config saved to /var/cache/conftool/dbconfig/20220302-083338-ladsgroup.json
  • 08:24 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1167.eqiad.wmnet with reason: host reimage
  • 08:20 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1167.eqiad.wmnet with reason: host reimage
  • 08:18 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P21661 and previous config saved to /var/cache/conftool/dbconfig/20220302-081832-ladsgroup.json
  • 08:09 godog: test thanos 0.24.0 on thanos-fe2001 to check if https://github.com/thanos-io/thanos/issues/4531 is fixed
  • 08:09 ladsgroup@cumin1001: START - Cookbook sre.hosts.reimage for host db1167.eqiad.wmnet with OS bullseye
  • 08:03 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P21660 and previous config saved to /var/cache/conftool/dbconfig/20220302-080327-ladsgroup.json
  • 08:02 Amir1: killing all entity dumpers of wikidata in snapshot1008 (T300255)
  • 07:48 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21659 and previous config saved to /var/cache/conftool/dbconfig/20220302-074822-ladsgroup.json
  • 07:46 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1096:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21658 and previous config saved to /var/cache/conftool/dbconfig/20220302-074602-ladsgroup.json
  • 07:45 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance
  • 07:45 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance
  • 07:45 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance
  • 07:45 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance
  • 07:45 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance
  • 07:45 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance
  • 07:42 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1167 (T302185)', diff saved to https://phabricator.wikimedia.org/P21657 and previous config saved to /var/cache/conftool/dbconfig/20220302-074210-ladsgroup.json
  • 07:42 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
  • 07:42 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
  • 07:42 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance
  • 07:41 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance
  • 07:36 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T302185)', diff saved to https://phabricator.wikimedia.org/P21656 and previous config saved to /var/cache/conftool/dbconfig/20220302-073610-ladsgroup.json
  • 07:35 _joe_: filling request patterns in etcd
  • 07:21 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P21655 and previous config saved to /var/cache/conftool/dbconfig/20220302-072105-ladsgroup.json
  • 07:09 _joe_: installing scap 4.4.1 everywhere T302464
  • 07:06 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P21654 and previous config saved to /var/cache/conftool/dbconfig/20220302-070601-ladsgroup.json
  • 06:50 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T302185)', diff saved to https://phabricator.wikimedia.org/P21653 and previous config saved to /var/cache/conftool/dbconfig/20220302-065056-ladsgroup.json
  • 06:39 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T302185)', diff saved to https://phabricator.wikimedia.org/P21652 and previous config saved to /var/cache/conftool/dbconfig/20220302-063933-ladsgroup.json
  • 06:24 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P21651 and previous config saved to /var/cache/conftool/dbconfig/20220302-062428-ladsgroup.json
  • 06:09 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P21650 and previous config saved to /var/cache/conftool/dbconfig/20220302-060924-ladsgroup.json
  • 05:54 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T302185)', diff saved to https://phabricator.wikimedia.org/P21649 and previous config saved to /var/cache/conftool/dbconfig/20220302-055419-ladsgroup.json
  • 05:48 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1101.eqiad.wmnet with OS bullseye
  • 05:34 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1101.eqiad.wmnet with reason: host reimage
  • 05:32 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1101.eqiad.wmnet with reason: host reimage
  • 05:23 ladsgroup@cumin1001: START - Cookbook sre.hosts.reimage for host db1101.eqiad.wmnet with OS bullseye
  • 05:20 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1101:3318 (T302185)', diff saved to https://phabricator.wikimedia.org/P21648 and previous config saved to /var/cache/conftool/dbconfig/20220302-052033-ladsgroup.json
  • 05:19 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21647 and previous config saved to /var/cache/conftool/dbconfig/20220302-051947-ladsgroup.json
  • 05:18 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1101:3317 (T302185)', diff saved to https://phabricator.wikimedia.org/P21646 and previous config saved to /var/cache/conftool/dbconfig/20220302-051853-ladsgroup.json
  • 05:18 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance
  • 05:18 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance
  • 05:05 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1104 (T302185)', diff saved to https://phabricator.wikimedia.org/P21645 and previous config saved to /var/cache/conftool/dbconfig/20220302-050526-ladsgroup.json
  • 05:04 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P21644 and previous config saved to /var/cache/conftool/dbconfig/20220302-050442-ladsgroup.json
  • 04:50 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P21643 and previous config saved to /var/cache/conftool/dbconfig/20220302-045021-ladsgroup.json
  • 04:49 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P21642 and previous config saved to /var/cache/conftool/dbconfig/20220302-044938-ladsgroup.json
  • 04:35 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P21641 and previous config saved to /var/cache/conftool/dbconfig/20220302-043516-ladsgroup.json
  • 04:34 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21640 and previous config saved to /var/cache/conftool/dbconfig/20220302-043433-ladsgroup.json
  • 04:33 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1113:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21639 and previous config saved to /var/cache/conftool/dbconfig/20220302-043313-ladsgroup.json
  • 04:33 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
  • 04:33 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
  • 04:32 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
  • 04:32 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
  • 04:32 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
  • 04:32 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
  • 04:32 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1161 (T300992)', diff saved to https://phabricator.wikimedia.org/P21638 and previous config saved to /var/cache/conftool/dbconfig/20220302-043229-ladsgroup.json
  • 04:20 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1104 (T302185)', diff saved to https://phabricator.wikimedia.org/P21637 and previous config saved to /var/cache/conftool/dbconfig/20220302-042012-ladsgroup.json
  • 04:17 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P21636 and previous config saved to /var/cache/conftool/dbconfig/20220302-041725-ladsgroup.json
  • 04:15 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1104.eqiad.wmnet with OS bullseye
  • 04:02 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P21635 and previous config saved to /var/cache/conftool/dbconfig/20220302-040220-ladsgroup.json
  • 04:00 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1104.eqiad.wmnet with reason: host reimage
  • 03:57 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on db1104.eqiad.wmnet with reason: host reimage
  • 03:48 ladsgroup@cumin1001: START - Cookbook sre.hosts.reimage for host db1104.eqiad.wmnet with OS bullseye
  • 03:47 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1161 (T300992)', diff saved to https://phabricator.wikimedia.org/P21634 and previous config saved to /var/cache/conftool/dbconfig/20220302-034715-ladsgroup.json
  • 03:45 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1104 (T302185)', diff saved to https://phabricator.wikimedia.org/P21633 and previous config saved to /var/cache/conftool/dbconfig/20220302-034502-ladsgroup.json
  • 03:44 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1104.eqiad.wmnet with reason: Maintenance
  • 03:44 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1104.eqiad.wmnet with reason: Maintenance
  • 03:44 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1161 (T300992)', diff saved to https://phabricator.wikimedia.org/P21632 and previous config saved to /var/cache/conftool/dbconfig/20220302-034454-ladsgroup.json
  • 03:44 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
  • 03:44 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
  • 03:44 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance
  • 03:44 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance
  • 03:43 ejegg: updated CiviCRM from e9f0eff5 to cb0605ed
  • 02:13 ejegg: Fundraising CiviCRM updated from 2874d623 to e9f0eff5
  • 00:15 topranks: Re-enabling Lumen AS3356 BGP session over IPv4 on cr3-ulsfo to assess affect on currently broken routing to ulsfo.
  • 00:07 topranks: disabling Lumen AS3356 BGP session over IPv4 on cr3-ulsfo to assess affect on currently broken routing to ulsfo.

2022-03-01

  • 22:51 inflatador: T276198 reenabled puppet on elastic1052.eqiad.wmnet
  • 22:37 inflatador: T276198 rebooting elastic1052.eqiad.wmnet to test failure condition
  • 22:33 sukhe@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp6016.drmrs.wmnet with reason: debugging till we find the root cause of the purged OOM issue; no traffic served
  • 22:33 sukhe@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp6016.drmrs.wmnet with reason: debugging till we find the root cause of the purged OOM issue; no traffic served
  • 22:32 inflatador: T276198 disabling puppet on elastic1052.eqiad.wmnet to test failure condition (rebooting shortly)
  • 21:53 dancy@deploy1002: Finished scap: Resync to try to clear alerts (duration: 12m 08s)
  • 21:41 dancy@deploy1002: Started scap: Resync to try to clear alerts
  • 21:36 dancy@deploy1002: Started scap: Resync to try to clear alerts
  • 20:36 brennen@deploy1002: rebuilt and synchronized wikiversions files: group0 wikis to 1.38.0-wmf.24 refs T300200
  • 20:33 brennen: 1.38.0-wmf.24 train (T300200): no current blockers; proceeding to group0; note this may briefly trigger some version alerts
  • 20:30 brennen@deploy1002: Synchronized php-1.38.0-wmf.24/includes: Backport: Revert "preferences: Use a faster and simpler form descriptor when validating" (T302643) (duration: 00m 55s)
  • 20:05 mutante: alert1001 - re-enabled puppet
  • 20:05 brennen@deploy1002: Finished scap: testwikis wikis to 1.38.0-wmf.24 refs T300200 (duration: 53m 17s)
  • 19:45 mutante: alert1001 - disable puppet, systemctl stop ircecho - to stop bot spam, caused somehow by new scap version breaking "mw versions mismwatch" alerting - affects labtestwiki,testwiki,testwikidatawiki
  • 19:38 mutante: mw1449 - scap pull
  • 19:36 mutante: mw1414 - scap pull
  • 19:11 brennen@deploy1002: Started scap: testwikis wikis to 1.38.0-wmf.24 refs T300200
  • 19:07 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti2008.codfw.wmnet
  • 19:01 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 18:58 jmm@cumin2002: START - Cookbook sre.dns.netbox
  • 18:57 brennen: 1.38.0-wmf.24 train (T300200): there's currently a single blocker at T302643; staging to testwikis and holding there until backport's available
  • 18:54 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ganeti2008.codfw.wmnet
  • 18:45 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2008.codfw.wmnet with reason: Remove from Ganeti cluster for decom
  • 18:45 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ganeti2008.codfw.wmnet with reason: Remove from Ganeti cluster for decom
  • 18:02 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21626 and previous config saved to /var/cache/conftool/dbconfig/20220301-180216-ladsgroup.json
  • 17:52 cwhite: completed grafana upgrade in eqiad T282863
  • 17:50 herron: re-enabling puppet and ircecho on alert1001
  • 17:47 cwhite: upgrade grafana in eqiad T282863
  • 17:47 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P21625 and previous config saved to /var/cache/conftool/dbconfig/20220301-174711-ladsgroup.json
  • 17:44 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
  • 17:34 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
  • 17:32 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P21624 and previous config saved to /var/cache/conftool/dbconfig/20220301-173206-ladsgroup.json
  • 17:24 dancy@deploy1002: Finished scap: testing container image build (duration: 28m 39s)
  • 17:17 herron: stopped ircecho on alert1001 due to systemd unit alert shower
  • 17:17 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21622 and previous config saved to /var/cache/conftool/dbconfig/20220301-171701-ladsgroup.json
  • 17:14 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1144:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21621 and previous config saved to /var/cache/conftool/dbconfig/20220301-171441-ladsgroup.json
  • 17:14 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
  • 17:14 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
  • 16:55 dancy@deploy1002: Started scap: testing container image build
  • 16:24 ebysans@deploy1002: Finished deploy [airflow-dags/analytics@cac16e8]: (no justification provided) (duration: 00m 03s)
  • 16:23 ebysans@deploy1002: Started deploy [airflow-dags/analytics@cac16e8]: (no justification provided)
  • 16:12 moritzm: restarting apache on logstash nodes to pick up expat update
  • 16:11 elukey@deploy1002: Finished deploy [ores/deploy@29de1cc]: ORES Winter deployment - T300195 (duration: 36m 13s)
  • 16:05 moritzm: restarting nginx on wcqs* nodes to pick up expat update
  • 15:35 elukey@deploy1002: Started deploy [ores/deploy@29de1cc]: ORES Winter deployment - T300195
  • 15:21 ntsako@deploy1002: Finished deploy [airflow-dags/analytics@cac16e8]: (no justification provided) (duration: 00m 07s)
  • 15:21 ntsako@deploy1002: Started deploy [airflow-dags/analytics@cac16e8]: (no justification provided)
  • 15:06 klausman@cumin2002: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-staging-etcd2003.codfw.wmnet
  • 14:57 klausman@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 14:52 elukey: elukey@deploy1002:~$ sudo kill `pgrep -u zpapierski` (offboarded user, puppet broken on the node)
  • 14:51 klausman@cumin2002: START - Cookbook sre.dns.netbox
  • 14:51 klausman@cumin2002: START - Cookbook sre.ganeti.makevm for new host ml-staging-etcd2003.codfw.wmnet
  • 14:48 klausman@cumin2002: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-staging-etcd2002.codfw.wmnet
  • 14:42 hnowlan@deploy1002: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync
  • 14:41 hnowlan@deploy1002: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync
  • 14:38 klausman@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 14:36 vgutierrez: pool cp1087 running HAProxy as TLS termination layer - T290005 T271421
  • 14:35 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1087.eqiad.wmnet with OS buster
  • 14:35 klausman@cumin2002: START - Cookbook sre.dns.netbox
  • 14:35 klausman@cumin2002: START - Cookbook sre.ganeti.makevm for new host ml-staging-etcd2002.codfw.wmnet
  • 14:32 klausman@cumin2002: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ml-staging-etcd2003.codfw.wmnet
  • 14:32 klausman@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 14:28 klausman@cumin2002: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-staging-etcd2001.codfw.wmnet
  • 14:19 klausman@cumin2002: START - Cookbook sre.dns.netbox
  • 14:19 klausman@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 14:15 klausman@cumin2002: START - Cookbook sre.dns.netbox
  • 14:14 klausman@cumin2002: START - Cookbook sre.ganeti.makevm for new host ml-staging-etcd2003.codfw.wmnet
  • 14:09 moritzm: restarting nginx on wdqs* nodes to pick up expat update
  • 14:03 klausman@cumin2002: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ml-staging-etcd2002.codfw.wmnet
  • 14:03 klausman@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 13:57 klausman@cumin2002: START - Cookbook sre.dns.netbox
  • 13:57 klausman@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 13:53 mmandere: restart purged on cp60[15-16]
  • 13:49 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1087.eqiad.wmnet with reason: host reimage
  • 13:48 klausman@cumin2002: START - Cookbook sre.dns.netbox
  • 13:48 klausman@cumin2002: START - Cookbook sre.ganeti.makevm for new host ml-staging-etcd2002.codfw.wmnet
  • 13:48 klausman@cumin2002: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ml-staging-etcd2002.codfw.wmnet
  • 13:48 klausman@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 13:47 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on cp1087.eqiad.wmnet with reason: host reimage
  • 13:44 klausman@cumin2002: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ml-staging-etcd2003.codfw.wmnet
  • 13:43 klausman@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 13:43 klausman@cumin2002: START - Cookbook sre.dns.netbox
  • 13:43 klausman@cumin2002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
  • 13:40 kormat: Deploying wmfmariadbpy 0.9 T302796
  • 13:40 kormat: uploaded wmfmariadbpy 0.9 to apt.wm.o T302796
  • 13:39 klausman@cumin2002: START - Cookbook sre.dns.netbox
  • 13:39 klausman@cumin2002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
  • 13:39 klausman@cumin2002: START - Cookbook sre.dns.netbox
  • 13:39 klausman@cumin2002: START - Cookbook sre.ganeti.makevm for new host ml-staging-etcd2003.codfw.wmnet
  • 13:39 klausman@cumin2002: START - Cookbook sre.dns.netbox
  • 13:39 klausman@cumin2002: START - Cookbook sre.ganeti.makevm for new host ml-staging-etcd2002.codfw.wmnet
  • 13:32 moritzm: restarting nginx on registry* nodes to pick up expat update
  • 13:31 vgutierrez@cumin1001: START - Cookbook sre.hosts.reimage for host cp1087.eqiad.wmnet with OS buster
  • 13:15 XioNoX: restart cr1-drmrs for software upgrade
  • 13:03 moritzm: restarting FPM/Apache on parsoid hosts to pick up expat update
  • 12:50 vgutierrez: pool cp3062 running HAProxy as TLS termination layer - T290005 T271421
  • 12:47 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3062.esams.wmnet with OS buster
  • 12:39 moritzm: installing expat security updates
  • 12:34 mmandere: restart purged on cp60[12-14]
  • 12:32 jgiannelos@deploy1002: Finished deploy [kartotherian/deploy@41d2498] (eqiad): Reduce pool size to 1 connection per node worker (duration: 01m 06s)
  • 12:31 jgiannelos@deploy1002: Started deploy [kartotherian/deploy@41d2498] (eqiad): Reduce pool size to 1 connection per node worker
  • 12:30 jgiannelos@deploy1002: Finished deploy [kartotherian/deploy@41d2498] (codfw): Reduce pool size to 1 connection per node worker (duration: 01m 30s)
  • 12:28 jgiannelos@deploy1002: Started deploy [kartotherian/deploy@41d2498] (codfw): Reduce pool size to 1 connection per node worker
  • 12:15 jgiannelos@deploy1002: Finished deploy [kartotherian/deploy@51d5a07] (codfw): Fix pool size configuration (duration: 01m 41s)
  • 12:13 jgiannelos@deploy1002: Started deploy [kartotherian/deploy@51d5a07] (codfw): Fix pool size configuration
  • 12:11 jgiannelos@deploy1002: Finished deploy [kartotherian/deploy@51d5a07] (eqiad): Fix pool size configuration (duration: 02m 01s)
  • 12:09 jgiannelos@deploy1002: Started deploy [kartotherian/deploy@51d5a07] (eqiad): Fix pool size configuration
  • 11:43 klausman@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 11:36 kharlan@deploy1002: helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply
  • 11:35 klausman@cumin2002: START - Cookbook sre.dns.netbox
  • 11:35 klausman@cumin2002: START - Cookbook sre.ganeti.makevm for new host ml-staging-etcd2001.codfw.wmnet
  • 11:33 kharlan@deploy1002: helmfile [codfw] START helmfile.d/services/linkrecommendation: apply
  • 11:32 kharlan@deploy1002: helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply
  • 11:30 kharlan@deploy1002: helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply
  • 11:28 kharlan@deploy1002: helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply
  • 11:27 kharlan@deploy1002: helmfile [staging] START helmfile.d/services/linkrecommendation: apply
  • 11:27 cmooney@cumin1001: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1148.mgmt.eqiad.wmnet with reboot policy FORCED
  • 11:21 _joe_: restarted pybal, removed ipvsadm entry on lvs1019. Now all of MediaWiki has no http LVS endpoint available.T244843
  • 11:18 _joe_: also removed the ipvsadm entry for apaches:80 T244843
  • 11:17 jayme: rolled back linkrecommendation staging helm release to revision 12 - T302744
  • 11:17 _joe_: restarting pybal on lvs1020 T244843
  • 11:11 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3062.esams.wmnet with reason: host reimage
  • 11:11 _joe_: restarted pybal on lvs2009, T244843
  • 11:09 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on cp3062.esams.wmnet with reason: host reimage
  • 11:07 _joe_: restarted pybal on lvs2010, T244843
  • 11:02 mmandere: restart purged on cp60[09,10,11]
  • 11:00 cmooney@cumin1001: START - Cookbook sre.hosts.provision for host an-worker1148.mgmt.eqiad.wmnet with reboot policy FORCED
  • 10:47 cmooney@cumin1001: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1147.mgmt.eqiad.wmnet with reboot policy FORCED
  • 10:40 vgutierrez@cumin1001: START - Cookbook sre.hosts.reimage for host cp3062.esams.wmnet with OS buster
  • 10:40 jmm@cumin2002: END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Ema out of all services on: 259 hosts
  • 10:40 jmm@cumin2002: START - Cookbook sre.idm.logout Logging Ema out of all services on: 259 hosts
  • 10:40 jmm@cumin2002: END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Ema out of all services on: 1353 hosts
  • 10:39 jmm@cumin2002: START - Cookbook sre.idm.logout Logging Ema out of all services on: 1353 hosts
  • 10:31 mmandere: restart purged on cp600[6-8]
  • 10:28 cmooney@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 10:24 cmooney@cumin1001: START - Cookbook sre.dns.netbox
  • 10:05 vgutierrez: pool cp2039 running HAProxy as TLS termination layer - T290005 T271421
  • 09:48 elukey: elukey@stat1004:~$ sudo kill `pgrep -u zpapierski` (offboarded user, puppet broken on the host)
  • 09:45 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2039.codfw.wmnet with OS buster
  • 09:33 _joe_: restarted pybal on lvs1019, removed the mw api from ipvsadm, the mw api is internally fully encrypted
  • 09:31 _joe_: restart pybal on lvs1020
  • 09:25 jmm@cumin2002: END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Amuigai out of all services on: 1881 hosts
  • 09:25 elukey: restart varnishkafka-webrequest on cp6009 as attempt to clear a weird status of librdkafka (delivery errors to kafka)
  • 09:25 _joe_: manually removed ipvs entries on lvs2*, so it is actually now that the http api is not available in codfw anymore
  • 09:24 jmm@cumin2002: START - Cookbook sre.idm.logout Logging Amuigai out of all services on: 1881 hosts
  • 09:24 jmm@cumin2002: END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging ZPapierski out of all services on: 1881 hosts
  • 09:22 jmm@cumin2002: START - Cookbook sre.idm.logout Logging ZPapierski out of all services on: 1881 hosts
  • 09:22 _joe_: restarted pybal on lvs2009, the mw api is now effectively https-only in codfw T287820
  • 09:20 _joe_: restarted pybal on lvs2010
  • 09:14 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2039.codfw.wmnet with reason: host reimage
  • 09:12 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on cp2039.codfw.wmnet with reason: host reimage
  • 09:06 elukey: restart purged on cp6005
  • 08:57 elukey: restart purged on cp6004
  • 08:54 vgutierrez@cumin1001: START - Cookbook sre.hosts.reimage for host cp2039.codfw.wmnet with OS buster
  • 08:27 urbanecm: UTC morning B&C window done
  • 08:25 elukey: restart purged on cp6003
  • 08:16 moritzm: drain instances off ganeti2008 for eventual decom
  • 08:08 urbanecm@deploy1002: Synchronized wmf-config/ProductionServices.php: d149208: Use service-proxy to connect to linkrecommendation (T302719) (duration: 00m 49s)
  • 07:59 elukey: restart purged on cp6002
  • 06:58 oblivian@deploy1002: Finished deploy [restbase/deploy@0848b15] (dev-cluster): T302464 test (duration: 00m 17s)
  • 06:57 oblivian@deploy1002: Started deploy [restbase/deploy@0848b15] (dev-cluster): T302464 test
  • 06:56 elukey: restart purged on cp6001 to clear stale kafka TLS consumer state (or attempting to)
  • 06:46 _joe_: uploaded scap 4.4.1 to {stretch,buster,bullseye} T302464
  • 06:46 _joe_: uploaded scap 4.4.1 to {stretch,buster,bullseye}
  • 02:59 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1104 (T302185)', diff saved to https://phabricator.wikimedia.org/P21618 and previous config saved to /var/cache/conftool/dbconfig/20220301-025938-ladsgroup.json
  • 02:44 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P21617 and previous config saved to /var/cache/conftool/dbconfig/20220301-024433-ladsgroup.json
  • 02:29 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P21616 and previous config saved to /var/cache/conftool/dbconfig/20220301-022928-ladsgroup.json
  • 02:14 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1104 (T302185)', diff saved to https://phabricator.wikimedia.org/P21615 and previous config saved to /var/cache/conftool/dbconfig/20220301-021424-ladsgroup.json
  • 01:14 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1104 (T302185)', diff saved to https://phabricator.wikimedia.org/P21614 and previous config saved to /var/cache/conftool/dbconfig/20220301-011404-ladsgroup.json
  • 01:14 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1104.eqiad.wmnet with reason: Maintenance
  • 01:13 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1104.eqiad.wmnet with reason: Maintenance
  • 00:17 mutante: 15.wikipedia.org on k8s (staging) deploy1002:~] $ curl -s --resolve "15.wikipedia.org:4111:staging.svc.eqiad.wmnet" 'https://15.wikipedia.org' | grep grandpa => "“Wikipedia is like an all-knowing grandpa.”" | T300171

Archives

See Server Admin Log/Archives.