Server Admin Log

Revision as of 00:26, 11 February 2020 by imported>Stashbot (ebernhardson@deploy1001: Synchronized php-1.35.0-wmf.18/skins/MinervaNeue: SWAT: Revert: Reduce userContributions icon code (duration: 01m 06s))

2020-02-11

  • 00:26 ebernhardson@deploy1001: Synchronized php-1.35.0-wmf.18/skins/MinervaNeue: SWAT: Revert: Reduce userContributions icon code (duration: 01m 06s)
  • 00:20 ebernhardson@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Give NS_HELP same weight as NS_MAIN in search on wikitech (duration: 01m 06s)
  • 00:15 ebernhardson@deploy1001: Synchronized wmf-config/: SWAT: Enable SpecialMute page on all wikis (duration: 01m 06s)

2020-02-10

  • 23:30 robh: cp108[23] returned to service via T243167
  • 23:28 legoktm: restarting zuul
  • 23:26 reedy@deploy1001: Synchronized php-1.35.0-wmf.18/extensions/OATHAuth/src/Key/TOTPKey.php: T244308 (duration: 01m 04s)
  • 23:25 reedy@deploy1001: Synchronized php-1.35.0-wmf.16/extensions/OATHAuth/src/Key/TOTPKey.php: T244308 (duration: 01m 07s)
  • 23:06 robh: cp108[01] returned to service, cp108[23] offline for bios update via T243167
  • 22:50 chasemp: phab1001:~# sudo /srv/phab/phabricator/bin/bulk make-silent --id 2164
  • 22:45 sbassett@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Add authevents as monolog channel (duration: 01m 06s)
  • 22:43 robh: cp107[789] returned to service, cp108[01] offline for bios update via T243167
  • 22:42 robh: cp107[89] returned to service, cp108[01] offline for bios update via T243167
  • 21:58 robh: cp107[56] returned to service, cp107[78] offline for bios update via T243167
  • 21:43 arlolra: Updated Parsoid to 612106d2 (T244412, T244413, T242746, T235273, T235307, T238845, T204618, T240054)
  • 21:38 robh: cp1075 & cp1076 offline for bios updates per T243167
  • 21:36 robh: cp1075 and cp1076 going offline for bios updates. This will cause a bit of cp irc icinga noise, but no paging. Not putting into maint mode, as there is no way to maint mode the noisiest check (which checks all backends and thus shouldn't be disabled)
  • 21:33 arlolra@deploy1001: Finished deploy [parsoid/deploy@d2d4870]: Updating Parsoid to 612106d2 (duration: 10m 26s)
  • 21:32 XioNoX: clamp tcp-mss on cr2-eqiad:xe-3/3/3
  • 21:23 arlolra@deploy1001: Started deploy [parsoid/deploy@d2d4870]: Updating Parsoid to 612106d2
  • 21:12 halfak@deploy1001: Finished deploy [ores/deploy@a6f4f14]: T242705 (duration: 12m 18s)
  • 21:00 halfak@deploy1001: Started deploy [ores/deploy@a6f4f14]: T242705
  • 20:55 mholloway-shell@deploy1001: Synchronized php-1.35.0-wmf.16/extensions/MachineVision: MachineVision: Fix page id parsing from imageinfo results (T244752) (duration: 01m 11s)
  • 20:14 mholloway-shell@deploy1001: Synchronized php-1.35.0-wmf.18/extensions/MachineVision: MachineVision: Fix page id parsing from imageinfo results (T244752) (duration: 01m 15s)
  • 19:31 ppchelko@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: gerrit:570393 Config: Session Store: Switch group0 and group1 to kask-session T243106 (duration: 01m 06s)
  • 19:28 mutante: Gerrit - added eevans to 'wmf-deployment' group (T244508)
  • 19:12 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: T242122 Load new EventStreamConfig extension if so configured (duration: 01m 06s)
  • 19:07 jforrester@deploy1001: Scap failed!: Call to mwscript eval.php stderr: not empty
  • 19:06 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T242122 Set default of wmgUseEventStreamConfig false everywhere (duration: 01m 06s)
  • 18:39 twentyafterfour@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.18 refs T233866 (duration: 01m 05s)
  • 18:38 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.18 refs T233866
  • 18:25 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.18 refs T233867
  • 18:21 twentyafterfour: MediaWiki train: finally moving forward with group0 wikis to 1.35.0-wmf.18 refs T233866
  • 17:52 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: T244561 Set Kartographer servers to Wikimedia servers (duration: 01m 06s)
  • 16:48 moritzm: installing libexif security updates on jessie
  • 16:22 vgutierrez: pooling cp5002 and cp5009 running buster - T242093
  • 15:45 XioNoX: push outbound flowspec support to core routers
  • 15:45 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1107 after first day of 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10366 and previous config saved to /var/cache/conftool/dbconfig/20200210-154552-marostegui.json
  • 15:41 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 15:41 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 15:33 godog: roll restart cassandra on session* to apply logging changes - T242585
  • 15:23 moritzm: uploading debdeploy 0.0.99.13 to apt.wikimedia.org
  • 15:22 godog: roll restart cassandra on restbase* to apply logging changes - T242585
  • 15:19 vgutierrez@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 15:19 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:19 vgutierrez@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 15:19 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:06 marostegui: Reload haproxy on dbproxy1017 and dbproxy1017 - T244209
  • 15:04 twentyafterfour@deploy1001: Finished scap: full scap sync prior to wmf.18 rollout (duration: 20m 13s)
  • 15:04 godog: roll restart cassandra on maps* to apply logging changes - T242585
  • 15:03 vgutierrez: rolling restart of ats-tls - T240950
  • 15:00 marostegui: Restart mysql on m5 master (wikitech will go down) - T244209
  • 14:52 vgutierrez: rolling restart of ats-tls in ulsfo - T244464
  • 14:46 vgutierrez: depool cp5002 and cp5009 and reimage as buster - T242093
  • 14:44 twentyafterfour@deploy1001: Started scap: full scap sync prior to wmf.18 rollout
  • 14:42 vgutierrez: repool cp5003 and cp5010 running buster - T242093
  • 14:41 marostegui: Full-upgrade db1133 (without restarting mysql) - T244209
  • 14:40 twentyafterfour: MediaWiki Train: Running a full scap to prepare for moving forward to 1.35.0-wmf.18 ( T233866 )
  • 14:32 marostegui: Downtime m5 hosts for the upcoming maintenance - T244209
  • 14:19 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:17 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:17 vgutierrez@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 14:17 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:11 XioNoX: remove TCP-MSS clamping on cr3-knams
  • 13:48 vgutierrez: depool cp5003 and reimage as buster - T242093
  • 13:47 vgutierrez: pooling cp5004 with buster - T242093
  • 13:46 vgutierrez: depool cp5010 and reimage as buster - T242093
  • 13:45 vgutierrez: pooling cp5011 with buster - T242093
  • 13:28 godog: roll restart cassandra on aqs to apply logging changes - T242585
  • 13:03 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.18/extensions/Wikibase: Revert "wbterms: Set default for the term store to read new" (T244529) (duration: 01m 00s)
  • 13:03 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:00 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:59 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:58 Urbanecm: EU SWAT is done
  • 12:58 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:56 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 989c9f8: Revert "Revert "Remove handler deleted from the MachineVision extension"" (duration: 00m 58s)
  • 12:51 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 989c9f8: Revert "Revert "Remove handler deleted from the MachineVision extension"" (duration: 00m 59s)
  • 12:49 urbanecm@deploy1001: Finished scap: SWAT: 799224f: 137a40e (T241242; T243974) (duration: 20m 18s)
  • 12:30 vgutierrez: depool cp5004 and reimage as buster - T242093
  • 12:29 vgutierrez: pooling cp5005 with buster - T242093
  • 12:28 urbanecm@deploy1001: Started scap: SWAT: 799224f: 137a40e (T241242; T243974)
  • 12:23 vgutierrez: pooling ncredir1001 with buster - T243391
  • 12:18 _joe_: running puppet, scap pull on mwdebug1001
  • 12:17 vgutierrez: upload trafficserver 8.0.5-1wm15 to apt.wm.o (buster) - T244538
  • 12:08 vgutierrez@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 12:08 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:06 vgutierrez: testing ats 8.0.5-1-wm15 on cp4032 - T244538
  • 12:06 urbanecm@deploy1001: Synchronized wmf-config/throttle.php: SWAT: 014405a: Add throttle rules for OSU Editathon and workshop for cawiki, remove expired ones (T244608, T244645) (duration: 01m 03s)
  • 11:57 vgutierrez: depool ncredir1001 and reimage as buster - T243391
  • 11:57 vgutierrez: pooling ncredir1002 with buster - T243391
  • 11:43 vgutierrez: pooling cp4027 with buster - T242093
  • 11:38 vgutierrez: depool ncredir1002 and reimage as buster - T243391
  • 11:31 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:29 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:22 vgutierrez: depooling cp5011 and cp5005 & reimage as buster - T242093
  • 11:07 vgutierrez: depool cp4027 & reimage as buster - T242093
  • 11:07 vgutierrez: pooling ncredir2001 with buster - T243391
  • 11:03 vgutierrez: pooling cp4028 with buster - T242093
  • 10:50 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:48 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:47 godog: remove old logs from /var/log/swift on swift hosts
  • 10:31 vgutierrez: depool ncredir2001 and reimage as buster - T243391
  • 10:26 vgutierrez: depool cp4028 & reimage as buster - T242093
  • 10:14 moritzm: installing sudo security updates for buster
  • 08:53 vgutierrez: pooling cp4029 with buster - T242093
  • 08:44 marostegui@cumin1001: dbctl commit (dc=all): 'Increase weight from 1 to 5 for db1107 - T242702', diff saved to https://phabricator.wikimedia.org/P10364 and previous config saved to /var/cache/conftool/dbconfig/20200210-084446-marostegui.json
  • 08:43 vgutierrez: pooling ncredir2002 with buster - T243391
  • 08:34 effie: rolling restart php-fpm on labweb[1001-1002].wikimedia.org,mw*.eqiad.wmnet,scandium.eqiad.wmnet, wtp[1025-1048].eqiad.wmnet
  • 08:32 effie: update php-apcu on eqiad - T236800
  • 08:29 effie: rolling restart php-fpm on cloudweb2001-dev.wikimedia.org,mw[2135-2147,2150-2212,2214-2290].codfw.wmnet,wtp[2001-2020].codfw.wmnet
  • 08:23 effie: update php-apcu on codfw - T236800
  • 07:58 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 07:56 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 07:54 moritzm: updating d-i netinst image for Stretch 9.12 point release (which bumped the kernel ABI)
  • 07:29 moritzm: updating d-i netinst image for Buster 10.3 point release (which bumped the kernel ABI)
  • 07:09 elukey: restore mw1347's mcrouter settings to its default (proxy threads 10 -> 5)
  • 07:01 marostegui@cumin1001: dbctl commit (dc=all): 'Place db1107 - MariaDB 10.4 on s1 with minimal weight - T242702', diff saved to https://phabricator.wikimedia.org/P10363 and previous config saved to /var/cache/conftool/dbconfig/20200210-070140-marostegui.json
  • 06:55 vgutierrez: depool ncredir2002 and reimage as buster - T243391
  • 06:53 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool es1019', diff saved to https://phabricator.wikimedia.org/P10362 and previous config saved to /var/cache/conftool/dbconfig/20200210-065326-marostegui.json
  • 06:51 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1091 T232446', diff saved to https://phabricator.wikimedia.org/P10361 and previous config saved to /var/cache/conftool/dbconfig/20200210-065135-marostegui.json
  • 06:47 vgutierrez: depool cp4029 & reimage as buster - T242093
  • 06:45 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool es1019', diff saved to https://phabricator.wikimedia.org/P10360 and previous config saved to /var/cache/conftool/dbconfig/20200210-064553-marostegui.json
  • 06:45 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1091 T232446', diff saved to https://phabricator.wikimedia.org/P10359 and previous config saved to /var/cache/conftool/dbconfig/20200210-064458-marostegui.json
  • 06:39 marostegui: Compress db1124:3318 - this will generate lag on s8 wiki replicas - T232446
  • 06:37 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1091 T232446', diff saved to https://phabricator.wikimedia.org/P10358 and previous config saved to /var/cache/conftool/dbconfig/20200210-063716-marostegui.json
  • 06:23 marostegui: Remove partitions from db1099:3311, db1099:3318 T239453
  • 06:21 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1099:3318 T239453', diff saved to https://phabricator.wikimedia.org/P10357 and previous config saved to /var/cache/conftool/dbconfig/20200210-062112-marostegui.json
  • 06:18 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1099:3311 T239453', diff saved to https://phabricator.wikimedia.org/P10356 and previous config saved to /var/cache/conftool/dbconfig/20200210-061822-marostegui.json
  • 06:16 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1091 T232446', diff saved to https://phabricator.wikimedia.org/P10355 and previous config saved to /var/cache/conftool/dbconfig/20200210-061656-marostegui.json

2020-02-09

  • 05:11 cdanis: T238305 hardreset cp3051

2020-02-08

  • 19:12 _joe_: set cpufreq governor to performance on mw1328
  • 17:04 _joe_: restarted php7.2-fpm on mw1332
  • 16:53 Urbanecm: mwscript resetAuthenticationThrottle.php --wiki=enwiki --signup --ip 12.24.27.50
  • 16:47 gjg@deploy1001: Synchronized wmf-config/throttle.php: SWAT: Editathon in Charolette (duration: 00m 58s)
  • 00:05 Jeff_Green: switched payments.wikimedia.org to codfw datacenter due to T244610

2020-02-07

  • 22:20 jeh: ceph: round 2 OSD failover and recovery testing on cloudcephosd1003.wikimedia.org T240718
  • 20:47 mutante: OS install on new install_server VMs worked on second attempt, issues are gone. signed puppet certs for install1003.eqiad.wmnet, install2003.codfw.wmnet, initial puppet runs (T224576)
  • 20:42 jeh: ceph: OSD failover and recovery testing on cloudcephosd1003.wikimedia.org T240718
  • 20:32 mutante: ganeti: attempting to reinstall install1003 which failed last time
  • 17:38 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool es1019 after on-site maintenance T243963', diff saved to https://phabricator.wikimedia.org/P10350 and previous config saved to /var/cache/conftool/dbconfig/20200207-173850-marostegui.json
  • 17:36 twentyafterfour@deploy1001: Synchronized wmf-config/InitialiseSettings.php: sync InitializeSettings again for lols refs T233866 (duration: 01m 03s)
  • 17:32 twentyafterfour@deploy1001: Synchronized wmf-config/InitialiseSettings.php: sync https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/570929 refs T233866 (duration: 01m 02s)
  • 17:25 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool es1019 after on-site maintenance T243963', diff saved to https://phabricator.wikimedia.org/P10349 and previous config saved to /var/cache/conftool/dbconfig/20200207-172541-marostegui.json
  • 17:22 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: roll back all wikis to 1.35.0-wmf.16 refs T233866
  • 17:19 marostegui: Start MySQL on es1019 after onsite maintenance T243963
  • 16:46 filippo@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 16:38 filippo@cumin1001: START - Cookbook sre.ganeti.makevm
  • 16:13 XioNoX: remove MSS clamping from eqiad/eqord/knams/esams
  • 16:05 andrew@deploy1001: Finished deploy [horizon/deploy@bc777d6]: Fix for T243422 (duration: 03m 45s)
  • 16:04 vgutierrez: pooling cp4030 with buster - T242093
  • 16:03 bblack: removing GRE MTU mitigations from cp[135]xxx - T232602
  • 16:01 andrew@deploy1001: Started deploy [horizon/deploy@bc777d6]: Fix for T243422
  • 15:50 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:48 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:25 vgutierrez: depool & reimage cp4030 as buster - T242093
  • 15:21 vgutierrez: pooling cp4031 with buster - T242093
  • 15:20 vgutierrez: pooling ncredir3001 running buster - T243391
  • 15:18 marostegui: Restart all instances on db1124 and db1125 to pick up a new replication filter - T240094
  • 15:11 marostegui: Restart all instances on db2094 and db2095 to pick up a new replication filter - T240094
  • 14:56 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:53 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:43 hoo@deploy1001: Synchronized wmf-config/Wikibase.php: REVERT: Wikibase Client: Fix setting name typo (T244529) (duration: 01m 40s)
  • 14:43 Amir1: ladsgroup@mwmaint1002:~$ mwscript createAndPromote.php --wiki=zhwiki --force "Amir Sarabadani (WMDE)" --sysop (T244578)
  • 14:40 hoo@deploy1001: Scap failed!: 9/11 canaries failed their endpoint checks(http://en.wikipedia.org)
  • 14:38 hoo@deploy1001: Synchronized wmf-config/Wikibase.php: Wikibase Client: Fix setting name typo (T244529) (duration: 01m 20s)
  • 14:33 vgutierrez: depool and reimage ncredir3001 as buster - T243391
  • 14:32 vgutierrez: depool & reimage cp4031 as buster - T242093
  • 14:23 vgutierrez: pooling ncredir3002 running buster - T243391
  • 13:26 vgutierrez: pooling cp4021 with buster - T242093
  • 13:05 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:03 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:51 vgutierrez: depool and reimage ncredir3002 as buster - T243391
  • 12:42 vgutierrez: depool & reimage cp4021 as buster - T242093
  • 12:08 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:08 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:58 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:57 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:25 vgutierrez: pooling ncredir5001 running buster - T243391
  • 11:24 vgutierrez: pooling cp4022 with buster - T242093
  • 11:09 akosiaris: undo wikifeeds experiments
  • 11:07 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
  • 10:42 akosiaris@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
  • 10:40 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:37 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:36 akosiaris: conduct experiments with stopping/starting uwsgi-ores on ores2001 T242705
  • 10:24 akosiaris@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
  • 10:23 vgutierrez: depool and reimage ncredir5001 as buster - T243391
  • 10:14 vgutierrez: depool & reimage cp4022 as buster - T242093
  • 10:02 akosiaris: increase capacity for wikifeeds by 50% T244535
  • 10:02 akosiaris@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
  • 10:01 akosiaris@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
  • 09:53 ema: A:mw: increase keepalive_requests from 100 to 200 https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/570670/ T241145
  • 09:09 godog: roll restart cassandra instance on restbase-dev
  • 09:03 akosiaris@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
  • 09:03 godog: restart cassandra on restbase-dev1004 to test logging pipeline onboard
  • 09:01 akosiaris@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
  • 08:59 akosiaris@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' .
  • 08:58 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1090:3312, db1090:3317', diff saved to https://phabricator.wikimedia.org/P10343 and previous config saved to /var/cache/conftool/dbconfig/20200207-085846-marostegui.json
  • 08:54 marostegui: Upgrade db1090:3312, db1090:3317
  • 08:54 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1090:3312, db1090:3317 for upgrade', diff saved to https://phabricator.wikimedia.org/P10342 and previous config saved to /var/cache/conftool/dbconfig/20200207-085432-marostegui.json
  • 08:44 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1101:3317 T239453', diff saved to https://phabricator.wikimedia.org/P10341 and previous config saved to /var/cache/conftool/dbconfig/20200207-084447-marostegui.json
  • 08:44 moritzm: installing libexif security updates
  • 08:21 akosiaris: deploy https://gerrit.wikimedia.org/r/570726 T244535 to avoid CPU throttling of wikifeeds
  • 08:21 akosiaris@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' .
  • 07:53 marostegui@cumin1001: dbctl commit (dc=all): 'Increase base weight for db1126', diff saved to https://phabricator.wikimedia.org/P10340 and previous config saved to /var/cache/conftool/dbconfig/20200207-075323-marostegui.json
  • 07:52 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1101:3317 T239453', diff saved to https://phabricator.wikimedia.org/P10339 and previous config saved to /var/cache/conftool/dbconfig/20200207-075234-marostegui.json
  • 07:48 marostegui: Remove revision partitions from db2085:3318 T239453
  • 07:45 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1126 T232446', diff saved to https://phabricator.wikimedia.org/P10338 and previous config saved to /var/cache/conftool/dbconfig/20200207-074511-marostegui.json
  • 07:44 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2085:3318 T239453', diff saved to https://phabricator.wikimedia.org/P10337 and previous config saved to /var/cache/conftool/dbconfig/20200207-074407-marostegui.json
  • 07:42 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1101:3317 T239453', diff saved to https://phabricator.wikimedia.org/P10336 and previous config saved to /var/cache/conftool/dbconfig/20200207-074258-marostegui.json
  • 07:31 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1101:3317 T239453', diff saved to https://phabricator.wikimedia.org/P10335 and previous config saved to /var/cache/conftool/dbconfig/20200207-073130-marostegui.json
  • 07:30 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1126 T232446', diff saved to https://phabricator.wikimedia.org/P10334 and previous config saved to /var/cache/conftool/dbconfig/20200207-073026-marostegui.json
  • 06:38 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1126 T232446', diff saved to https://phabricator.wikimedia.org/P10333 and previous config saved to /var/cache/conftool/dbconfig/20200207-063831-marostegui.json
  • 06:34 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1105:3311 T239453', diff saved to https://phabricator.wikimedia.org/P10332 and previous config saved to /var/cache/conftool/dbconfig/20200207-063402-marostegui.json
  • 06:31 elukey: force a puppet run on all ores[12] nodes
  • 06:27 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1105:3311 T239453', diff saved to https://phabricator.wikimedia.org/P10331 and previous config saved to /var/cache/conftool/dbconfig/20200207-062731-marostegui.json
  • 06:26 marostegui: Reboot db1107 for update - T242702
  • 06:25 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1126 T232446', diff saved to https://phabricator.wikimedia.org/P10330 and previous config saved to /var/cache/conftool/dbconfig/20200207-062502-marostegui.json
  • 06:23 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1105:3311 T239453', diff saved to https://phabricator.wikimedia.org/P10329 and previous config saved to /var/cache/conftool/dbconfig/20200207-062345-marostegui.json
  • 06:20 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1105:3311 T239453', diff saved to https://phabricator.wikimedia.org/P10328 and previous config saved to /var/cache/conftool/dbconfig/20200207-062043-marostegui.json
  • 04:49 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 04:46 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 04:16 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 04:14 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 04:13 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 04:11 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 03:51 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 03:49 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 03:42 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 03:40 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 01:27 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 01:25 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 01:24 robh: eqsin pdu work ongoing starting now. ps1-603 swapping per T242250
  • 00:13 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 00:11 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 00:09 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 00:08 pt1979@cumin2001: START - Cookbook sre.hosts.downtime

2020-02-06

  • 23:44 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 23:42 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 23:37 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 23:35 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 23:25 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T244133 [cswikisource] Enable VisualEditor in the Edice namespace (duration: 01m 07s)
  • 23:22 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T159711 T161365 T164435 [nlwiki] Enable VisualEditor in the Project namespace (duration: 01m 08s)
  • 23:21 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 23:19 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 23:15 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 23:13 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 23:10 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: T244405 Don't trying to assign to if it's unset (duration: 01m 07s)
  • 22:50 jforrester@deploy1001: Synchronized php-1.35.0-wmf.18/extensions/VisualEditor: T242184 Change tags method so anon edits will go through (duration: 01m 08s)
  • 22:42 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 22:40 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 22:39 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 22:38 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 22:18 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 22:15 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 22:13 mutante: turning mw2271 and mw2163 into canary appservers for codfw, this adds mediawiki-testers shell users and removes scap sql scripts, rest stays as is (T242606)
  • 21:54 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 21:52 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 21:40 twentyafterfour: train blocked due to serious incident related to deploying the latest branch. Incident documentation: https://wikitech.wikimedia.org/wiki/Incident_documentation/20200206-mediawiki refs T233866
  • 21:30 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 21:27 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 21:05 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 21:03 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 20:52 akosiaris: restart all wikifeeds pods
  • 20:48 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
  • 20:45 akosiaris: restart restbase on restbase1027
  • 20:32 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: (no justification provided)
  • 20:30 twentyafterfour: sync-wikiversions --force
  • 20:30 twentyafterfour@deploy1001: Scap failed!: 9/11 canaries failed their endpoint checks(http://en.wikipedia.org)
  • 20:25 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.18 refs T233866
  • 19:45 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: T244405 Set wgLogoHD before adding wordmark (duration: 01m 06s)
  • 19:36 bblack: re-pool cp1075 (eqiad text)
  • 19:33 addshore: SWAT done!
  • 19:32 addshore@deploy1001: Synchronized php-1.35.0-wmf.18/extensions/WikibaseLexemeCirrusSearch: T244479 Update namespace for PrefetchingTermLookup & fix tests (duration: 01m 06s)
  • 19:31 bblack: depool cp1075 (eqiad text) for minor experimentation
  • 19:29 addshore@deploy1001: Synchronized php-1.35.0-wmf.16/extensions/Babel/includes/Babel.php: T243713 Timeout for meta api call from 10 to 2 seconds. (duration: 01m 07s)
  • 19:28 addshore@deploy1001: Synchronized php-1.35.0-wmf.18/extensions/Babel/includes/Babel.php: T243713 Timeout for meta api call from 10 to 2 seconds. (duration: 01m 07s)
  • 19:25 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Fix incorrect spellings of "RESTBase" in config variables (2/2) 2.IS (duration: 01m 06s)
  • 19:23 addshore@deploy1001: Synchronized wmf-config/CommonSettings.php: Fix incorrect spellings of "RESTBase" in config variables (2/2) 1.CS (duration: 01m 07s)
  • 19:23 cdanis: manual puppet run on netflow1001 looked good; ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕑☕ sudo cumin A:netflow "run-puppet-agent --enable 'rollout of I60692f0e8 T237587 cdanis'"
  • 19:22 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 19:20 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Fix incorrect spellings of "RESTBase" in config variables (1/2) (duration: 01m 06s)
  • 19:20 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 19:14 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable EntitySourceBasedFederation everywhere T243395, sync again for luck (duration: 01m 06s)
  • 19:12 cdanis: ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕑☕ sudo cumin A:netflow "disable-puppet 'rollout of I60692f0e8 T237587 cdanis'"
  • 19:10 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable EntitySourceBasedFederation everywhere T243395 (duration: 01m 07s)
  • 19:05 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable EntitySourceBasedFederation for group1 T243395 (duration: 01m 10s)
  • 19:01 moritzm: restarting exim on mendelevium to pick up cyrus-sasl security updates
  • 18:58 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 18:56 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 18:55 moritzm: restarting apache on tungsten/dbmonitor to pick up cyrus-sasl security updates
  • 18:53 mholloway-shell@deploy1001: Finished deploy [mobileapps/deploy@8e15868]: Update mobileapps to ceeb950 (duration: 06m 27s)
  • 18:46 mholloway-shell@deploy1001: Started deploy [mobileapps/deploy@8e15868]: Update mobileapps to ceeb950
  • 18:36 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 18:34 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 18:06 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 18:04 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 17:32 herron: set performance cpu scaling governor on maps*
  • 16:49 vgutierrez: pooling ncredir5002 running buster - T243391
  • 16:38 vgutierrez: pooling cp4023 with buster - T242093
  • 16:36 ebernhardson@deploy1001: Finished deploy [wikimedia/discovery/analytics@524be2b]: airflow: Update ores data transfer from drafttopic -> articletopic (duration: 00m 19s)
  • 16:35 ebernhardson@deploy1001: Started deploy [wikimedia/discovery/analytics@524be2b]: airflow: Update ores data transfer from drafttopic -> articletopic
  • 16:35 XioNoX: remove AS prepending in esams/knams
  • 16:31 bblack: lvs1013 - restart pybal for dual bgp session config - T180069
  • 16:30 bblack: lvs1014 - restart pybal for dual bgp session config - T180069
  • 16:30 bblack: lvs1015 - restart pybal for dual bgp session config - T180069
  • 16:29 bblack: lvs1016 - restart pybal for dual bgp session config - T180069
  • 16:28 moritzm: restarting apache on bromine to pick up SASL security updates
  • 16:24 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 16:22 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 16:22 moritzm: installing cyrus-sasl2 security updates on jessie
  • 16:20 bblack: lvs2001 - restart pybal for dual bgp session config - T180069
  • 16:19 bblack: lvs2002 - restart pybal for dual bgp session config - T180069
  • 16:19 bblack: lvs2003 - restart pybal for dual bgp session config - T180069
  • 16:07 vgutierrez: depool and reimage ncredir5002 as buster - T243391
  • 16:07 bblack: lvs4005 - restart pybal for dual bgp session config - T180069
  • 16:06 bblack: lvs4006 - restart pybal for dual bgp session config - T180069
  • 16:06 bblack: lvs4007 - restart pybal for dual bgp session config - T180069
  • 16:03 vgutierrez: depool & reimage cp4023 as buster - T242093
  • 16:03 vgutierrez: pooling cp4024 with buster - T242093
  • 15:59 akosiaris: repool eventgate-analytics/eqiad. Experiment proved the failover wouldn't cause (on its own) a problem. Experiment done.
  • 15:58 akosiaris@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=eventgate-analytics
  • 15:57 halfak@deploy1001: Finished deploy [ores/deploy@50a101a]: T242705 (duration: 04m 35s)
  • 15:56 vgutierrez: pooling ncredir4001 running buster - T243391
  • 15:55 moritzm: installing qemu security updates
  • 15:54 bblack: lvs5001 - restart pybal for dual bgp session config - T180069
  • 15:53 bblack: lvs5002 - restart pybal for dual bgp session config - T180069
  • 15:53 halfak@deploy1001: Started deploy [ores/deploy@50a101a]: T242705
  • 15:52 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:52 bblack: lvs5003 - restart pybal for dual bgp session config - T180069
  • 15:50 moritzm: installing python-ecdsa security updates
  • 15:50 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:41 moritzm: installing jsoup security updates
  • 15:30 vgutierrez: depool & reimage ncredir4001 as buster - T243391
  • 15:29 vgutierrez: depool & reimage cp4024 as buster - T242093
  • 15:28 vgutierrez: pooling ncredir4002 running buster - T243391
  • 15:27 moritzm: installing sudo security updates on jessie
  • 15:23 vgutierrez: pooling cp4025 with buster - T242093
  • 15:14 ema: A:mw-api: force puppet run to increase keepalive_requests from 100 to 200 https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/570670/ T241145
  • 15:09 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:07 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:59 godog: extend graphite1004 / graphite2003 fs +200G
  • 14:56 vgutierrez: depool and reimage ncredir4002 as buster - T243391
  • 14:46 vgutierrez: depool & reimage cp4025 as buster - T242093
  • 14:16 akosiaris: 20mins in with eventgate-analytics/eqiad depooled from discovery, no issues yet.
  • 14:14 ema: run puppet on mw-api-canary to revert nginx keepalive_requests bump T241145
  • 13:55 marostegui: Stop MySQL on es1019, upgrade and poweroff for on-site maintenance - T243963
  • 13:54 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventgate-analytics
  • 13:53 akosiaris: depool eqiad eventgate-analytics for testing purposes. Requests will flow to codfw, monitoring https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=now-30m&to=now for issues.
  • 13:51 marostegui@cumin1001: dbctl commit (dc=all): 'Depool es1019 for onsite maintenance T243963', diff saved to https://phabricator.wikimedia.org/P10321 and previous config saved to /var/cache/conftool/dbconfig/20200206-135157-marostegui.json
  • 13:45 XioNoX: rollback deactivate BGP transits on cr3-knams
  • 13:34 elukey: repool mw1347 with mcrouter running with 10 proxy threads (was: 5)
  • 13:31 XioNoX: reboot cr3-knams
  • 13:31 elukey: depool mw1347 to test some mcrouter settings
  • 13:27 XioNoX: deactivate BGP transits on cr3-knams
  • 13:22 vgutierrez: Enable server session sharing on ats-tls in cp4031 - T244464
  • 13:10 XioNoX: rollback: deactivate BGP transits on cr2-eqsin
  • 13:00 XioNoX: reboot cr2-eqsin for sw upgrade
  • 13:00 addshore: SWAT done
  • 13:00 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: resync REVERT Enable EntitySourceBasedFederation for group1 (duration: 01m 07s)
  • 12:59 XioNoX: deactivate BGP transits on cr2-eqsin
  • 12:58 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: REVERT Enable EntitySourceBasedFederation for group1 T243395, due to T244479 (duration: 01m 07s)
  • 12:52 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable EntitySourceBasedFederation for group1 T243395 (duration: 01m 06s)
  • 12:46 addshore@deploy1001: Synchronized php-1.35.0-wmf.18/extensions/Babel: REVERT Fetch central babel information over SQL query, not API (T243726) (duration: 01m 07s)
  • 12:44 addshore@deploy1001: sync-file aborted: Fetch central babel information over SQL query, not API (T243726) (duration: 01m 04s)
  • 12:40 vgutierrez: pooling cp3065 - T242093
  • 12:39 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable EntitySourceBasedFederation for group0 T243395 (duration: 01m 07s)
  • 12:34 cparle@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Re-enable delayed new upload jobs for MachineVision extension (duration: 01m 08s)
  • 12:26 cparle@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Remove handler deleted from the MachineVision extension (duration: 01m 05s)
  • 12:25 XioNoX: remove full-duplex statement from eqsin Tata link (not supported on Junos 18, as 10G is full duplex anyway)
  • 12:24 cparle@deploy1001: Synchronized php-1.35.0-wmf.18/extensions/MachineVision: Use the wbsetclaim API to add depicts statements (duration: 01m 09s)
  • 12:07 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 5e1cbb2: Enable CX in te, kn, gu, mr and pawiki as a default tool (T243271, T243272, T243273, T243274, T243275) (duration: 01m 09s)
  • 11:41 akosiaris: upgrade etherpad-lite on etherpad1002 to 1.8.0-1
  • 11:38 kart_: Updated cxserver to 2020-02-05-051751-production (T244230, T234323)
  • 11:35 kartik@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' .
  • 11:33 akosiaris: upload etherpad-lite_1.8.0-1 to apt.wikimedia.org buster-wikimedia/main
  • 11:31 kartik@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' .
  • 11:28 kartik@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' .
  • 11:14 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:11 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:21 akosiaris: undo "switchover selectively eventgate-analytics.discovery.wmnet to codfw for mw1331 and mw1348". no effect observed
  • 10:20 akosiaris: undo "switchover selectively eventgate-analytics.discovery.wmnet to codfw for mw1331 and mw1348"
  • 10:19 vgutierrez: Enabling HTTP keepalive between ats-tls and varnish-frontend on cp4031 - T244464
  • 10:00 vgutierrez: depool and reimage cp3065 as buster - T242093
  • 09:59 vgutierrez: upload trafficserver 8.0.5-1wm14 to apt.wm.o (buster) - T242093
  • 09:08 dcausse@deploy1001: Finished deploy [wdqs/wdqs@4306c64]: deploying wdqs 0.3.14-SNAPSHOT and gui 5a1af3b (duration: 11m 41s)
  • 08:56 dcausse@deploy1001: Started deploy [wdqs/wdqs@4306c64]: deploying wdqs 0.3.14-SNAPSHOT and gui 5a1af3b
  • 08:45 dcausse@deploy1001: Finished deploy [wdqs/wdqs@4306c64]: deploying wdqs 0.3.14-SNAPSHOT and gui 5a1af3b to wdqs1010.eqiad.wmnet (duration: 00m 29s)
  • 08:44 dcausse@deploy1001: Started deploy [wdqs/wdqs@4306c64]: deploying wdqs 0.3.14-SNAPSHOT and gui 5a1af3b to wdqs1010.eqiad.wmnet
  • 08:23 marostegui: Reboot dbproxy1012 and dbproxy1014 for upgrade
  • 08:18 dcausse: restarting blazegraph on wdqs1006: T242453
  • 08:17 akosiaris: switchover selectively eventgate-analytics.discovery.wmnet to codfw for mw1331 and mw1348 to
  • 06:59 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1101:3317 - T239453', diff saved to https://phabricator.wikimedia.org/P10319 and previous config saved to /var/cache/conftool/dbconfig/20200206-065906-marostegui.json
  • 06:52 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1098:3317 - T239453', diff saved to https://phabricator.wikimedia.org/P10318 and previous config saved to /var/cache/conftool/dbconfig/20200206-065238-marostegui.json
  • 06:46 elukey: run puppet on all ores[12]* nodes
  • 02:49 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 02:42 mutante: ganeti - Creating new VM named install2003.codfw.wmnet in codfw with row=A vcpu=1 memory=1 gigabytes disk=20 gigabytes link=private (T244390)
  • 02:39 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 02:30 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 02:21 mutante: ganeti - Creating new VM named install1003.eqiad.wmnet in eqiad with row=C vcpu=1 memory=1 gigabytes disk=20 gigabytes link=private (T244390)
  • 02:20 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm

2020-02-05

  • 23:30 ebernhardson: delete search indices duplicated on multiple clusters for: hywwiki, chrwiktionary, gcrwiki, mnwwiki, noboard_chapterswikimedia, nqowiki, nrmwiki, outreachwiki and srnwiki
  • 23:08 mholloway-shell@deploy1001: Finished deploy [mobileapps/deploy@a51f927]: Update mobileapps to a7928fa (duration: 10m 48s)
  • 22:57 mholloway-shell@deploy1001: Started deploy [mobileapps/deploy@a51f927]: Update mobileapps to a7928fa
  • 22:07 mutante: Gerrit - added ppchelko to 'wmf-deployment' Gerrit group (he is already in deployment admin group) (T244389)
  • 21:37 arlolra@deploy1001: Finished deploy [parsoid/deploy@01d9d3d]: Updating Parsoid to 74730a3 (duration: 03m 07s)
  • 21:33 arlolra@deploy1001: Started deploy [parsoid/deploy@01d9d3d]: Updating Parsoid to 74730a3
  • 21:31 mutante: killing and restarting wikibugs, it was reporting each update twice
  • 20:51 joal@deploy1001: Finished deploy [analytics/refinery@a47f0d5] (thin): Analytics regular weekly deploy (duration: 00m 07s)
  • 20:51 joal@deploy1001: Started deploy [analytics/refinery@a47f0d5] (thin): Analytics regular weekly deploy
  • 20:51 joal@deploy1001: Finished deploy [analytics/refinery@a47f0d5]: Analytics regular weekly deploy (duration: 13m 28s)
  • 20:50 mutante: ores1004 - systemctl start celery-ores-worker
  • 20:45 twentyafterfour@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.18 refs T233866 (duration: 01m 07s)
  • 20:44 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.18 refs T233866
  • 20:37 joal@deploy1001: Started deploy [analytics/refinery@a47f0d5]: Analytics regular weekly deploy
  • 20:34 dzahn@cumin1001: conftool action : set/weight=25; selector: name=mw1269.eqiad.wmnet
  • 20:25 dzahn@cumin1001: conftool action : set/weight=25; selector: name=mw1267.eqiad.wmnet
  • 20:25 mutante: mw1267 restarting php7.2-fpm
  • 20:21 joal@deploy1001: Finished deploy [analytics/hdfs-tools/deploy@714e2d0]: Deploy bug fix version (duration: 00m 08s)
  • 20:21 joal@deploy1001: Started deploy [analytics/hdfs-tools/deploy@714e2d0]: Deploy bug fix version
  • 20:09 twentyafterfour: Preparing to deploy wmf/1.35.0-wmf.18 to group1 wikis refs T233866
  • 20:09 moritzm: installing git security updates for jessie
  • 20:00 moritzm: installing unzip security updates
  • 19:44 mutante: LDAP - added spramduya to wmf group (T243802)
  • 19:38 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Clean up VisualEditor settings (duration: 01m 07s)
  • 19:38 ebernhardson: restart mjolnir-kafka-bulk-daemon across eqiad, daemons appear stuck and not reading new messages
  • 19:19 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T238029 Enable InukaPageView logging on production Wikipedias (duration: 01m 07s)
  • 19:15 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: Sync back revert of 975b4bbb9 (duration: 01m 06s)
  • 19:10 jforrester@deploy1001: scap failed: average error rate on 4/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details)
  • 18:35 vgutierrez: pooling cp5012 - T242093
  • 18:23 vgutierrez: rebooting cp5012 - T242093
  • 18:21 elukey: restart memcached on mc1025 with 8 threads (rollback - revert https://gerrit.wikimedia.org/r/#/c/570370/, run puppet, restart memcached)
  • 17:51 mutante: ganeti1017 - rebooting (not in use yet)
  • 17:34 reedy@deploy1001: Synchronized php-1.35.0-wmf.18/languages/: T244300 (duration: 01m 13s)
  • 17:33 reedy@deploy1001: Synchronized php-1.35.0-wmf.18/includes/: T244300 (duration: 01m 14s)
  • 16:53 urandom: Sessionstore deployment (mediawiki-config) is done
  • 16:37 ppchelko@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: gerrit:569678 Config: Enable sessionstore on group0 and 1 T243106 (duration: 01m 08s)
  • 16:25 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: T232140 Restore wgLogoHD to wikis without a MinervaCustomLogos defined (duration: 01m 09s)
  • 16:07 elukey: update puppet compiler's facts
  • 15:54 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:52 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:29 effie: restart php-fpm on canaries - T236800
  • 15:24 effie: Rollout php-apcu_5.1.17+4.0.11-1+0~20190217111312.9+stretch~1.gbp192528+wmf2 to api, app and jobrunner canaries - T236800
  • 15:15 vgutierrez: depooling & reimaging cp5012 as buster - T242093
  • 15:12 ema: cp: unset Accept-Encoding from ats-be requests to applayer T242478
  • 14:35 vgutierrez: updating acme-chief to version 0.24 - T244236
  • 14:32 _joe_: restarting mcrouter at nice -19 on mw1331 for testing effects of that change
  • 14:30 vgutierrez: upload acme-chief 0.24 to apt.wm.o (buster) - T244236
  • 14:26 XioNoX: push initial flowspec config to all routers
  • 14:23 vgutierrez: pooling cp5006 - T242093
  • 14:13 ema: cp1075: back to leaving Accept-Encoding as it is due to unrelated applayer issues T242478
  • 13:46 marostegui: Decrease buffer pool size on db1107 for testing - T242702
  • 13:45 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:43 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 13:42 akosiaris: undo the manually set 10.2.1.42 eventgate-analytics.discovery.wmnet in /etc/hosts for mw1331, mw1348. Verify hypothesis that this should cause increased latency. Restart php-fpm
  • 13:41 ema: cp1075: unset Accept-Encoding on origin server requests T242478
  • 13:39 Amir1: EU SWAT is done
  • 13:38 ema: cp: disable puppet and merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/570311/ T242478
  • 13:35 XioNoX: rollback traffic steering off cr2-eqord
  • 13:29 akosiaris: manually set 10.2.1.42 eventgate-analytics.discovery.wmnet in /etc/hosts for mw1331, mw1348. Verify hypothesis that this should cause increased latency
  • 13:25 XioNoX: reboot cr2-eqord for software upgrade - yaaaaa
  • 13:24 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.18/extensions/Wikibase/lib/includes/Store/CachingPropertyInfoLookup.php: SWAT: Cache PropertyInfoLookup internally (T243955) (duration: 01m 07s)
  • 13:17 XioNoX: increase ospf cost for cr2-eqord links