Server Admin Log
2022-03-02
- 00:15 topranks: Re-enabling Lumen AS3356 BGP session over IPv4 on cr3-ulsfo to assess effect on currently broken routing to ulsfo.
- 00:07 topranks: disabling Lumen AS3356 BGP session over IPv4 on cr3-ulsfo to assess effect on currently broken routing to ulsfo.
2022-03-01
- 22:51 inflatador: T276198 reenabled puppet on elastic1052.eqiad.wmnet
- 22:37 inflatador: T276198 rebooting elastic1052.eqiad.wmnet to test failure condition
- 22:33 sukhe@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp6016.drmrs.wmnet with reason: debugging till we find the root cause of the purged OOM issue; no traffic served
- 22:33 sukhe@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp6016.drmrs.wmnet with reason: debugging till we find the root cause of the purged OOM issue; no traffic served
- 22:32 inflatador: T276198 disabling puppet on elastic1052.eqiad.wmnet to test failure condition (rebooting shortly)
- 21:53 dancy@deploy1002: Finished scap: Resync to try to clear alerts (duration: 12m 08s)
- 21:41 dancy@deploy1002: Started scap: Resync to try to clear alerts
- 21:36 dancy@deploy1002: Started scap: Resync to try to clear alerts
- 20:36 brennen@deploy1002: rebuilt and synchronized wikiversions files: group0 wikis to 1.38.0-wmf.24 refs T300200
- 20:33 brennen: 1.38.0-wmf.24 train (T300200): no current blockers; proceeding to group0; note this may briefly trigger some version alerts
- 20:30 brennen@deploy1002: Synchronized php-1.38.0-wmf.24/includes: Backport: Revert "preferences: Use a faster and simpler form descriptor when validating" (T302643) (duration: 00m 55s)
- 20:05 mutante: alert1001 - re-enabled puppet
- 20:05 brennen@deploy1002: Finished scap: testwikis wikis to 1.38.0-wmf.24 refs T300200 (duration: 53m 17s)
- 19:45 mutante: alert1001 - disable puppet, systemctl stop ircecho - to stop bot spam, caused somehow by new scap version breaking "mw versions mismatch" alerting - affects labtestwiki,testwiki,testwikidatawiki
- 19:38 mutante: mw1449 - scap pull
- 19:36 mutante: mw1414 - scap pull
- 19:11 brennen@deploy1002: Started scap: testwikis wikis to 1.38.0-wmf.24 refs T300200
- 19:07 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti2008.codfw.wmnet
- 19:01 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
- 18:58 jmm@cumin2002: START - Cookbook sre.dns.netbox
- 18:57 brennen: 1.38.0-wmf.24 train (T300200): there's currently a single blocker at T302643; staging to testwikis and holding there until backport's available
- 18:54 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ganeti2008.codfw.wmnet
- 18:45 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2008.codfw.wmnet with reason: Remove from Ganeti cluster for decom
- 18:45 jmm@cumin2002: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ganeti2008.codfw.wmnet with reason: Remove from Ganeti cluster for decom
- 18:02 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21626 and previous config saved to /var/cache/conftool/dbconfig/20220301-180216-ladsgroup.json
- 17:52 cwhite: completed grafana upgrade in eqiad T282863
- 17:50 herron: re-enabling puppet and ircecho on alert1001
- 17:47 cwhite: upgrade grafana in eqiad T282863
- 17:47 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P21625 and previous config saved to /var/cache/conftool/dbconfig/20220301-174711-ladsgroup.json
- 17:44 mwdebug-deploy@deploy1002: helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
- 17:34 mwdebug-deploy@deploy1002: helmfile [eqiad] START helmfile.d/services/mwdebug: apply
- 17:32 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P21624 and previous config saved to /var/cache/conftool/dbconfig/20220301-173206-ladsgroup.json
- 17:24 dancy@deploy1002: Finished scap: testing container image build (duration: 28m 39s)
- 17:17 herron: stopped ircecho on alert1001 due to systemd unit alert shower
- 17:17 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21622 and previous config saved to /var/cache/conftool/dbconfig/20220301-171701-ladsgroup.json
- 17:14 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1144:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21621 and previous config saved to /var/cache/conftool/dbconfig/20220301-171441-ladsgroup.json
- 17:14 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
- 17:14 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
- 16:55 dancy@deploy1002: Started scap: testing container image build
- 16:24 ebysans@deploy1002: Finished deploy [airflow-dags/analytics@cac16e8]: (no justification provided) (duration: 00m 03s)
- 16:23 ebysans@deploy1002: Started deploy [airflow-dags/analytics@cac16e8]: (no justification provided)
- 16:12 moritzm: restarting apache on logstash nodes to pick up expat update
- 16:11 elukey@deploy1002: Finished deploy [ores/deploy@29de1cc]: ORES Winter deployment - T300195 (duration: 36m 13s)
- 16:05 moritzm: restarting nginx on wcqs* nodes to pick up expat update
- 15:35 elukey@deploy1002: Started deploy [ores/deploy@29de1cc]: ORES Winter deployment - T300195
- 15:21 ntsako@deploy1002: Finished deploy [airflow-dags/analytics@cac16e8]: (no justification provided) (duration: 00m 07s)
- 15:21 ntsako@deploy1002: Started deploy [airflow-dags/analytics@cac16e8]: (no justification provided)
- 15:06 klausman@cumin2002: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-staging-etcd2003.codfw.wmnet
- 14:57 klausman@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
- 14:52 elukey: elukey@deploy1002:~$ sudo kill `pgrep -u zpapierski` (offboarded user, puppet broken on the node)
- 14:51 klausman@cumin2002: START - Cookbook sre.dns.netbox
- 14:51 klausman@cumin2002: START - Cookbook sre.ganeti.makevm for new host ml-staging-etcd2003.codfw.wmnet
- 14:48 klausman@cumin2002: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-staging-etcd2002.codfw.wmnet
- 14:42 hnowlan@deploy1002: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync
- 14:41 hnowlan@deploy1002: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync
- 14:38 klausman@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
- 14:36 vgutierrez: pool cp1087 running HAProxy as TLS termination layer - T290005 T271421
- 14:35 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1087.eqiad.wmnet with OS buster
- 14:35 klausman@cumin2002: START - Cookbook sre.dns.netbox
- 14:35 klausman@cumin2002: START - Cookbook sre.ganeti.makevm for new host ml-staging-etcd2002.codfw.wmnet
- 14:32 klausman@cumin2002: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ml-staging-etcd2003.codfw.wmnet
- 14:32 klausman@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
- 14:28 klausman@cumin2002: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-staging-etcd2001.codfw.wmnet
- 14:19 klausman@cumin2002: START - Cookbook sre.dns.netbox
- 14:19 klausman@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
- 14:15 klausman@cumin2002: START - Cookbook sre.dns.netbox
- 14:14 klausman@cumin2002: START - Cookbook sre.ganeti.makevm for new host ml-staging-etcd2003.codfw.wmnet
- 14:09 moritzm: restarting nginx on wdqs* nodes to pick up expat update
- 14:03 klausman@cumin2002: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ml-staging-etcd2002.codfw.wmnet
- 14:03 klausman@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
- 13:57 klausman@cumin2002: START - Cookbook sre.dns.netbox
- 13:57 klausman@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
- 13:53 mmandere: restart purged on cp60[15-16]
- 13:49 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1087.eqiad.wmnet with reason: host reimage
- 13:48 klausman@cumin2002: START - Cookbook sre.dns.netbox
- 13:48 klausman@cumin2002: START - Cookbook sre.ganeti.makevm for new host ml-staging-etcd2002.codfw.wmnet
- 13:48 klausman@cumin2002: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ml-staging-etcd2002.codfw.wmnet
- 13:48 klausman@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
- 13:47 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on cp1087.eqiad.wmnet with reason: host reimage
- 13:44 klausman@cumin2002: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ml-staging-etcd2003.codfw.wmnet
- 13:43 klausman@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
- 13:43 klausman@cumin2002: START - Cookbook sre.dns.netbox
- 13:43 klausman@cumin2002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
- 13:40 kormat: Deploying wmfmariadbpy 0.9 T302796
- 13:40 kormat: uploaded wmfmariadbpy 0.9 to apt.wm.o T302796
- 13:39 klausman@cumin2002: START - Cookbook sre.dns.netbox
- 13:39 klausman@cumin2002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
- 13:39 klausman@cumin2002: START - Cookbook sre.dns.netbox
- 13:39 klausman@cumin2002: START - Cookbook sre.ganeti.makevm for new host ml-staging-etcd2003.codfw.wmnet
- 13:39 klausman@cumin2002: START - Cookbook sre.dns.netbox
- 13:39 klausman@cumin2002: START - Cookbook sre.ganeti.makevm for new host ml-staging-etcd2002.codfw.wmnet
- 13:32 moritzm: restarting nginx on registry* nodes to pick up expat update
- 13:31 vgutierrez@cumin1001: START - Cookbook sre.hosts.reimage for host cp1087.eqiad.wmnet with OS buster
- 13:15 XioNoX: restart cr1-drmrs for software upgrade
- 13:03 moritzm: restarting FPM/Apache on parsoid hosts to pick up expat update
- 12:50 vgutierrez: pool cp3062 running HAProxy as TLS termination layer - T290005 T271421
- 12:47 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3062.esams.wmnet with OS buster
- 12:39 moritzm: installing expat security updates
- 12:34 mmandere: restart purged on cp60[12-14]
- 12:32 jgiannelos@deploy1002: Finished deploy [kartotherian/deploy@41d2498] (eqiad): Reduce pool size to 1 connection per node worker (duration: 01m 06s)
- 12:31 jgiannelos@deploy1002: Started deploy [kartotherian/deploy@41d2498] (eqiad): Reduce pool size to 1 connection per node worker
- 12:30 jgiannelos@deploy1002: Finished deploy [kartotherian/deploy@41d2498] (codfw): Reduce pool size to 1 connection per node worker (duration: 01m 30s)
- 12:28 jgiannelos@deploy1002: Started deploy [kartotherian/deploy@41d2498] (codfw): Reduce pool size to 1 connection per node worker
- 12:15 jgiannelos@deploy1002: Finished deploy [kartotherian/deploy@51d5a07] (codfw): Fix pool size configuration (duration: 01m 41s)
- 12:13 jgiannelos@deploy1002: Started deploy [kartotherian/deploy@51d5a07] (codfw): Fix pool size configuration
- 12:11 jgiannelos@deploy1002: Finished deploy [kartotherian/deploy@51d5a07] (eqiad): Fix pool size configuration (duration: 02m 01s)
- 12:09 jgiannelos@deploy1002: Started deploy [kartotherian/deploy@51d5a07] (eqiad): Fix pool size configuration
- 11:43 klausman@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
- 11:36 kharlan@deploy1002: helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply
- 11:35 klausman@cumin2002: START - Cookbook sre.dns.netbox
- 11:35 klausman@cumin2002: START - Cookbook sre.ganeti.makevm for new host ml-staging-etcd2001.codfw.wmnet
- 11:33 kharlan@deploy1002: helmfile [codfw] START helmfile.d/services/linkrecommendation: apply
- 11:32 kharlan@deploy1002: helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply
- 11:30 kharlan@deploy1002: helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply
- 11:28 kharlan@deploy1002: helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply
- 11:27 kharlan@deploy1002: helmfile [staging] START helmfile.d/services/linkrecommendation: apply
- 11:27 cmooney@cumin1001: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1148.mgmt.eqiad.wmnet with reboot policy FORCED
- 11:21 _joe_: restarted pybal, removed ipvsadm entry on lvs1019. Now all of MediaWiki has no http LVS endpoint available. T244843
- 11:18 _joe_: also removed the ipvsadm entry for apaches:80 T244843
- 11:17 jayme: rolled back linkrecommendation staging helm release to revision 12 - T302744
- 11:17 _joe_: restarting pybal on lvs1020 T244843
- 11:11 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3062.esams.wmnet with reason: host reimage
- 11:11 _joe_: restarted pybal on lvs2009, T244843
- 11:09 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on cp3062.esams.wmnet with reason: host reimage
- 11:07 _joe_: restarted pybal on lvs2010, T244843
- 11:02 mmandere: restart purged on cp60[09,10,11]
- 11:00 cmooney@cumin1001: START - Cookbook sre.hosts.provision for host an-worker1148.mgmt.eqiad.wmnet with reboot policy FORCED
- 10:47 cmooney@cumin1001: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1147.mgmt.eqiad.wmnet with reboot policy FORCED
- 10:40 vgutierrez@cumin1001: START - Cookbook sre.hosts.reimage for host cp3062.esams.wmnet with OS buster
- 10:40 jmm@cumin2002: END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Ema out of all services on: 259 hosts
- 10:40 jmm@cumin2002: START - Cookbook sre.idm.logout Logging Ema out of all services on: 259 hosts
- 10:40 jmm@cumin2002: END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Ema out of all services on: 1353 hosts
- 10:39 jmm@cumin2002: START - Cookbook sre.idm.logout Logging Ema out of all services on: 1353 hosts
- 10:31 mmandere: restart purged on cp600[6-8]
- 10:28 cmooney@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
- 10:24 cmooney@cumin1001: START - Cookbook sre.dns.netbox
- 10:05 vgutierrez: pool cp2039 running HAProxy as TLS termination layer - T290005 T271421
- 09:48 elukey: elukey@stat1004:~$ sudo kill `pgrep -u zpapierski` (offboarded user, puppet broken on the host)
- 09:45 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2039.codfw.wmnet with OS buster
- 09:33 _joe_: restarted pybal on lvs1019, removed the mw api from ipvsadm, the mw api is internally fully encrypted
- 09:31 _joe_: restart pybal on lvs1020
- 09:25 jmm@cumin2002: END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Amuigai out of all services on: 1881 hosts
- 09:25 elukey: restart varnishkafka-webrequest on cp6009 as attempt to clear a weird status of librdkafka (delivery errors to kafka)
- 09:25 _joe_: manually removed ipvs entries on lvs2*, so it is actually now that the http api is not available in codfw anymore
- 09:24 jmm@cumin2002: START - Cookbook sre.idm.logout Logging Amuigai out of all services on: 1881 hosts
- 09:24 jmm@cumin2002: END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging ZPapierski out of all services on: 1881 hosts
- 09:22 jmm@cumin2002: START - Cookbook sre.idm.logout Logging ZPapierski out of all services on: 1881 hosts
- 09:22 _joe_: restarted pybal on lvs2009, the mw api is now effectively https-only in codfw T287820
- 09:20 _joe_: restarted pybal on lvs2010
- 09:14 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2039.codfw.wmnet with reason: host reimage
- 09:12 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime for 2:00:00 on cp2039.codfw.wmnet with reason: host reimage
- 09:06 elukey: restart purged on cp6005
- 08:57 elukey: restart purged on cp6004
- 08:54 vgutierrez@cumin1001: START - Cookbook sre.hosts.reimage for host cp2039.codfw.wmnet with OS buster
- 08:27 urbanecm: UTC morning B&C window done
- 08:25 elukey: restart purged on cp6003
- 08:16 moritzm: drain instances off ganeti2008 for eventual decom
- 08:08 urbanecm@deploy1002: Synchronized wmf-config/ProductionServices.php: d149208: Use service-proxy to connect to linkrecommendation (T302719) (duration: 00m 49s)
- 07:59 elukey: restart purged on cp6002
- 06:58 oblivian@deploy1002: Finished deploy [restbase/deploy@0848b15] (dev-cluster): T302464 test (duration: 00m 17s)
- 06:57 oblivian@deploy1002: Started deploy [restbase/deploy@0848b15] (dev-cluster): T302464 test
- 06:56 elukey: restart purged on cp6001 to clear stale kafka TLS consumer state (or attempting to)
- 06:46 _joe_: uploaded scap 4.4.1 to {stretch,buster,bullseye} T302464
- 06:46 _joe_: uploaded scap 4.4.1 to {stretch,buster,bullseye}
- 02:59 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1104 (T302185)', diff saved to https://phabricator.wikimedia.org/P21618 and previous config saved to /var/cache/conftool/dbconfig/20220301-025938-ladsgroup.json
- 02:44 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P21617 and previous config saved to /var/cache/conftool/dbconfig/20220301-024433-ladsgroup.json
- 02:29 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P21616 and previous config saved to /var/cache/conftool/dbconfig/20220301-022928-ladsgroup.json
- 02:14 ladsgroup@cumin1001: dbctl commit (dc=all): 'Repooling after maintenance db1104 (T302185)', diff saved to https://phabricator.wikimedia.org/P21615 and previous config saved to /var/cache/conftool/dbconfig/20220301-021424-ladsgroup.json
- 01:14 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1104 (T302185)', diff saved to https://phabricator.wikimedia.org/P21614 and previous config saved to /var/cache/conftool/dbconfig/20220301-011404-ladsgroup.json
- 01:14 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1104.eqiad.wmnet with reason: Maintenance
- 01:13 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1104.eqiad.wmnet with reason: Maintenance
- 00:17 mutante: 15.wikipedia.org on k8s (staging) deploy1002:~] $ curl -s --resolve "15.wikipedia.org:4111:staging.svc.eqiad.wmnet" 'https://15.wikipedia.org' | grep grandpa => "“Wikipedia is like an all-knowing grandpa.”" | T300171