You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Difference between revisions of "Server Admin Log"

Revision as of 05:11, 9 February 2020

2020-02-09

  • 05:11 cdanis: T238305 hardreset cp3051

2020-02-08

  • 19:12 _joe_: set cpufreq governor to performance on mw1328
  • 17:04 _joe_: restarted php7.2-fpm on mw1332
  • 16:53 Urbanecm: mwscript resetAuthenticationThrottle.php --wiki=enwiki --signup --ip 12.24.27.50
  • 16:47 gjg@deploy1001: Synchronized wmf-config/throttle.php: SWAT: Editathon in Charlotte (duration: 00m 58s)
  • 00:05 Jeff_Green: switched payments.wikimedia.org to codfw datacenter due to T244610

2020-02-07

  • 22:20 jeh: ceph: round 2 OSD failover and recovery testing on cloudcephosd1003.wikimedia.org T240718
  • 20:47 mutante: OS install on new install_server VMs worked on second attempt, issues are gone. signed puppet certs for install1003.eqiad.wmnet, install2003.codfw.wmnet, initial puppet runs (T224576)
  • 20:42 jeh: ceph: OSD failover and recovery testing on cloudcephosd1003.wikimedia.org T240718
  • 20:32 mutante: ganeti: attempting to reinstall install1003 which failed last time
  • 17:38 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool es1019 after on-site maintenance T243963', diff saved to https://phabricator.wikimedia.org/P10350 and previous config saved to /var/cache/conftool/dbconfig/20200207-173850-marostegui.json
  • 17:36 twentyafterfour@deploy1001: Synchronized wmf-config/InitialiseSettings.php: sync InitializeSettings again for lols refs T233866 (duration: 01m 03s)
  • 17:32 twentyafterfour@deploy1001: Synchronized wmf-config/InitialiseSettings.php: sync https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/570929 refs T233866 (duration: 01m 02s)
  • 17:25 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool es1019 after on-site maintenance T243963', diff saved to https://phabricator.wikimedia.org/P10349 and previous config saved to /var/cache/conftool/dbconfig/20200207-172541-marostegui.json
  • 17:22 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: roll back all wikis to 1.35.0-wmf.16 refs T233866
  • 17:19 marostegui: Start MySQL on es1019 after onsite maintenance T243963
  • 16:46 filippo@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 16:38 filippo@cumin1001: START - Cookbook sre.ganeti.makevm
  • 16:13 XioNoX: remove MSS clamping from eqiad/eqord/knams/esams
  • 16:05 andrew@deploy1001: Finished deploy [horizon/deploy@bc777d6]: Fix for T243422 (duration: 03m 45s)
  • 16:04 vgutierrez: pooling cp4030 with buster - T242093
  • 16:03 bblack: removing GRE MTU mitigations from cp[135]xxx - T232602
  • 16:01 andrew@deploy1001: Started deploy [horizon/deploy@bc777d6]: Fix for T243422
  • 15:50 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:48 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:25 vgutierrez: depool & reimage cp4030 as buster - T242093
  • 15:21 vgutierrez: pooling cp4031 with buster - T242093
  • 15:20 vgutierrez: pooling ncredir3001 running buster - T243391
  • 15:18 marostegui: Restart all instances on db1124 and db1125 to pick up a new replication filter - T240094
  • 15:11 marostegui: Restart all instances on db2094 and db2095 to pick up a new replication filter - T240094
  • 14:56 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:53 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:43 hoo@deploy1001: Synchronized wmf-config/Wikibase.php: REVERT: Wikibase Client: Fix setting name typo (T244529) (duration: 01m 40s)
  • 14:43 Amir1: ladsgroup@mwmaint1002:~$ mwscript createAndPromote.php --wiki=zhwiki --force "Amir Sarabadani (WMDE)" --sysop (T244578)
  • 14:40 hoo@deploy1001: Scap failed!: 9/11 canaries failed their endpoint checks(http://en.wikipedia.org)
  • 14:38 hoo@deploy1001: Synchronized wmf-config/Wikibase.php: Wikibase Client: Fix setting name typo (T244529) (duration: 01m 20s)
  • 14:33 vgutierrez: depool and reimage ncredir3001 as buster - T243391
  • 14:32 vgutierrez: depool & reimage cp4031 as buster - T242093
  • 14:23 vgutierrez: pooling ncredir3002 running buster - T243391
  • 13:26 vgutierrez: pooling cp4021 with buster - T242093
  • 13:05 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:03 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:51 vgutierrez: depool and reimage ncredir3002 as buster - T243391
  • 12:42 vgutierrez: depool & reimage cp4021 as buster - T242093
  • 12:08 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:08 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:58 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:57 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:25 vgutierrez: pooling ncredir5001 running buster - T243391
  • 11:24 vgutierrez: pooling cp4022 with buster - T242093
  • 11:09 akosiaris: undo wikifeeds experiments
  • 11:07 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
  • 10:42 akosiaris@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
  • 10:40 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:37 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:36 akosiaris: conduct experiments with stopping/starting uwsgi-ores on ores2001 T242705
  • 10:24 akosiaris@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
  • 10:23 vgutierrez: depool and reimage ncredir5001 as buster - T243391
  • 10:14 vgutierrez: depool & reimage cp4022 as buster - T242093
  • 10:02 akosiaris: increase capacity for wikifeeds by 50% T244535
  • 10:02 akosiaris@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
  • 10:01 akosiaris@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
  • 09:53 ema: A:mw: increase keepalive_requests from 100 to 200 https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/570670/ T241145
  • 09:09 godog: roll restart cassandra instance on restbase-dev
  • 09:03 akosiaris@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
  • 09:03 godog: restart cassandra on restbase-dev1004 to test logging pipeline onboard
  • 09:01 akosiaris@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
  • 08:59 akosiaris@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' .
  • 08:58 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1090:3312, db1090:3317', diff saved to https://phabricator.wikimedia.org/P10343 and previous config saved to /var/cache/conftool/dbconfig/20200207-085846-marostegui.json
  • 08:54 marostegui: Upgrade db1090:3312, db1090:3317
  • 08:54 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1090:3312, db1090:3317 for upgrade', diff saved to https://phabricator.wikimedia.org/P10342 and previous config saved to /var/cache/conftool/dbconfig/20200207-085432-marostegui.json
  • 08:44 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1101:3317 T239453', diff saved to https://phabricator.wikimedia.org/P10341 and previous config saved to /var/cache/conftool/dbconfig/20200207-084447-marostegui.json
  • 08:44 moritzm: installing libexif security updates
  • 08:21 akosiaris: deploy https://gerrit.wikimedia.org/r/570726 T244535 to avoid CPU throttling of wikifeeds
  • 08:21 akosiaris@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' .
  • 07:53 marostegui@cumin1001: dbctl commit (dc=all): 'Increase base weight for db1126', diff saved to https://phabricator.wikimedia.org/P10340 and previous config saved to /var/cache/conftool/dbconfig/20200207-075323-marostegui.json
  • 07:52 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1101:3317 T239453', diff saved to https://phabricator.wikimedia.org/P10339 and previous config saved to /var/cache/conftool/dbconfig/20200207-075234-marostegui.json
  • 07:48 marostegui: Remove revision partitions from db2085:3318 T239453
  • 07:45 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1126 T232446', diff saved to https://phabricator.wikimedia.org/P10338 and previous config saved to /var/cache/conftool/dbconfig/20200207-074511-marostegui.json
  • 07:44 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2085:3318 T239453', diff saved to https://phabricator.wikimedia.org/P10337 and previous config saved to /var/cache/conftool/dbconfig/20200207-074407-marostegui.json
  • 07:42 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1101:3317 T239453', diff saved to https://phabricator.wikimedia.org/P10336 and previous config saved to /var/cache/conftool/dbconfig/20200207-074258-marostegui.json
  • 07:31 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1101:3317 T239453', diff saved to https://phabricator.wikimedia.org/P10335 and previous config saved to /var/cache/conftool/dbconfig/20200207-073130-marostegui.json
  • 07:30 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1126 T232446', diff saved to https://phabricator.wikimedia.org/P10334 and previous config saved to /var/cache/conftool/dbconfig/20200207-073026-marostegui.json
  • 06:38 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1126 T232446', diff saved to https://phabricator.wikimedia.org/P10333 and previous config saved to /var/cache/conftool/dbconfig/20200207-063831-marostegui.json
  • 06:34 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1105:3311 T239453', diff saved to https://phabricator.wikimedia.org/P10332 and previous config saved to /var/cache/conftool/dbconfig/20200207-063402-marostegui.json
  • 06:31 elukey: force a puppet run on all ores[12] nodes
  • 06:27 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1105:3311 T239453', diff saved to https://phabricator.wikimedia.org/P10331 and previous config saved to /var/cache/conftool/dbconfig/20200207-062731-marostegui.json
  • 06:26 marostegui: Reboot db1107 for update - T242702
  • 06:25 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1126 T232446', diff saved to https://phabricator.wikimedia.org/P10330 and previous config saved to /var/cache/conftool/dbconfig/20200207-062502-marostegui.json
  • 06:23 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1105:3311 T239453', diff saved to https://phabricator.wikimedia.org/P10329 and previous config saved to /var/cache/conftool/dbconfig/20200207-062345-marostegui.json
  • 06:20 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1105:3311 T239453', diff saved to https://phabricator.wikimedia.org/P10328 and previous config saved to /var/cache/conftool/dbconfig/20200207-062043-marostegui.json
  • 04:49 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 04:46 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 04:16 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 04:14 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 04:13 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 04:11 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 03:51 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 03:49 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 03:42 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 03:40 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 01:27 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 01:25 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 01:24 robh: eqsin pdu work ongoing starting now. ps1-603 swapping per T242250
  • 00:13 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 00:11 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 00:09 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 00:08 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
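
The entries above share a regular `HH:MM actor: message` shape, so cookbook outcomes such as the repeated `sre.hosts.downtime` failures can be tallied mechanically. The following is only a minimal sketch under that assumption: the names `parse_entry` and `cookbook_outcomes` and the regular expressions are hypothetical helpers written for illustration, not part of any Wikimedia tooling, and the sample lines are copied from this log.

```python
import re
from collections import Counter

# Assumed entry shape: optional indent, bullet, "HH:MM", actor, ": ", message.
ENTRY = re.compile(r"^\s*•\s*(\d{2}):(\d{2}) (\S+): (.*)$")
# Assumed shape of a cookbook completion message.
COOKBOOK_END = re.compile(r"END \((PASS|FAIL)\) - Cookbook (\S+)")


def parse_entry(line):
    """Split one log line into time, actor and message; None if it does not match."""
    match = ENTRY.match(line)
    if match is None:
        return None
    hours, minutes, actor, message = match.groups()
    return {
        "minutes": int(hours) * 60 + int(minutes),  # minutes since midnight
        "actor": actor,
        "message": message,
    }


def cookbook_outcomes(lines):
    """Count END (PASS) / END (FAIL) entries per cookbook name."""
    counts = Counter()
    for line in lines:
        entry = parse_entry(line)
        if entry is None:
            continue
        end = COOKBOOK_END.match(entry["message"])
        if end:
            outcome, cookbook = end.groups()
            counts[(cookbook, outcome)] += 1
    return counts


# Sample lines taken verbatim from the 2020-02-07 section.
sample = [
    "  • 04:49 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)",
    "  • 04:46 pt1979@cumin2001: START - Cookbook sre.hosts.downtime",
    "  • 00:13 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)",
    "  • 16:13 XioNoX: remove MSS clamping from eqiad/eqord/knams/esams",
]

if __name__ == "__main__":
    for key, count in sorted(cookbook_outcomes(sample).items()):
        print(key, count)
```

Date headings and blank lines do not match the entry pattern and are skipped, so a whole day's section can be passed in unchanged.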

2020-02-06

  • 23:44 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 23:42 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 23:37 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 23:35 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 23:25 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T244133 [cswikisource] Enable VisualEditor in the Edice namespace (duration: 01m 07s)
  • 23:22 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T159711 T161365 T164435 [nlwiki] Enable VisualEditor in the Project namespace (duration: 01m 08s)
  • 23:21 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 23:19 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 23:15 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 23:13 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 23:10 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: T244405 Don't trying to assign to if it's unset (duration: 01m 07s)
  • 22:50 jforrester@deploy1001: Synchronized php-1.35.0-wmf.18/extensions/VisualEditor: T242184 Change tags method so anon edits will go through (duration: 01m 08s)
  • 22:42 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 22:40 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 22:39 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 22:38 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 22:18 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 22:15 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 22:13 mutante: turning mw2271 and mw2163 into canary appservers for codfw, this adds mediawiki-testers shell users and removes scap sql scripts, rest stays as is (T242606)
  • 21:54 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 21:52 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 21:40 twentyafterfour: train blocked due to serious incident related to deploying the latest branch. Incident documentation: https://wikitech.wikimedia.org/wiki/Incident_documentation/20200206-mediawiki refs T233866
  • 21:30 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 21:27 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 21:05 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 21:03 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 20:52 akosiaris: restart all wikifeeds pods
  • 20:48 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
  • 20:45 akosiaris: restart restbase on restbase1027
  • 20:32 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: (no justification provided)
  • 20:30 twentyafterfour: sync-wikiversions --force
  • 20:30 twentyafterfour@deploy1001: Scap failed!: 9/11 canaries failed their endpoint checks(http://en.wikipedia.org)
  • 20:25 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.18 refs T233866
  • 19:45 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: T244405 Set wgLogoHD before adding wordmark (duration: 01m 06s)
  • 19:36 bblack: re-pool cp1075 (eqiad text)
  • 19:33 addshore: SWAT done!
  • 19:32 addshore@deploy1001: Synchronized php-1.35.0-wmf.18/extensions/WikibaseLexemeCirrusSearch: T244479 Update namespace for PrefetchingTermLookup & fix tests (duration: 01m 06s)
  • 19:31 bblack: depool cp1075 (eqiad text) for minor experimentation
  • 19:29 addshore@deploy1001: Synchronized php-1.35.0-wmf.16/extensions/Babel/includes/Babel.php: T243713 Timeout for meta api call from 10 to 2 seconds. (duration: 01m 07s)
  • 19:28 addshore@deploy1001: Synchronized php-1.35.0-wmf.18/extensions/Babel/includes/Babel.php: T243713 Timeout for meta api call from 10 to 2 seconds. (duration: 01m 07s)
  • 19:25 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Fix incorrect spellings of "RESTBase" in config variables (2/2) 2.IS (duration: 01m 06s)
  • 19:23 addshore@deploy1001: Synchronized wmf-config/CommonSettings.php: Fix incorrect spellings of "RESTBase" in config variables (2/2) 1.CS (duration: 01m 07s)
  • 19:23 cdanis: manual puppet run on netflow1001 looked good; ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕑☕ sudo cumin A:netflow "run-puppet-agent --enable 'rollout of I60692f0e8 T237587 cdanis'"
  • 19:22 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 19:20 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Fix incorrect spellings of "RESTBase" in config variables (1/2) (duration: 01m 06s)
  • 19:20 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 19:14 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable EntitySourceBasedFederation everywhere T243395, sync again for luck (duration: 01m 06s)
  • 19:12 cdanis: ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕑☕ sudo cumin A:netflow "disable-puppet 'rollout of I60692f0e8 T237587 cdanis'"
  • 19:10 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable EntitySourceBasedFederation everywhere T243395 (duration: 01m 07s)
  • 19:05 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable EntitySourceBasedFederation for group1 T243395 (duration: 01m 10s)
  • 19:01 moritzm: restarting exim on mendelevium to pick up cyrus-sasl security updates
  • 18:58 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 18:56 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 18:55 moritzm: restarting apache on tungsten/dbmonitor to pick up cyrus-sasl security updates
  • 18:53 mholloway-shell@deploy1001: Finished deploy [mobileapps/deploy@8e15868]: Update mobileapps to ceeb950 (duration: 06m 27s)
  • 18:46 mholloway-shell@deploy1001: Started deploy [mobileapps/deploy@8e15868]: Update mobileapps to ceeb950
  • 18:36 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 18:34 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 18:06 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 18:04 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 17:32 herron: set performance cpu scaling governor on maps*
  • 16:49 vgutierrez: pooling ncredir5002 running buster - T243391
  • 16:38 vgutierrez: pooling cp4023 with buster - T242093
  • 16:36 ebernhardson@deploy1001: Finished deploy [wikimedia/discovery/analytics@524be2b]: airflow: Update ores data transfer from drafttopic -> articletopic (duration: 00m 19s)
  • 16:35 ebernhardson@deploy1001: Started deploy [wikimedia/discovery/analytics@524be2b]: airflow: Update ores data transfer from drafttopic -> articletopic
  • 16:35 XioNoX: remove AS prepending in esams/knams
  • 16:31 bblack: lvs1013 - restart pybal for dual bgp session config - T180069
  • 16:30 bblack: lvs1014 - restart pybal for dual bgp session config - T180069
  • 16:30 bblack: lvs1015 - restart pybal for dual bgp session config - T180069
  • 16:29 bblack: lvs1016 - restart pybal for dual bgp session config - T180069
  • 16:28 moritzm: restarting apache on bromine to pick up SASL security updates
  • 16:24 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 16:22 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 16:22 moritzm: installing cyrus-sasl2 security updates on jessie
  • 16:20 bblack: lvs2001 - restart pybal for dual bgp session config - T180069
  • 16:19 bblack: lvs2002 - restart pybal for dual bgp session config - T180069
  • 16:19 bblack: lvs2003 - restart pybal for dual bgp session config - T180069
  • 16:07 vgutierrez: depool and reimage ncredir5002 as buster - T243391
  • 16:07 bblack: lvs4005 - restart pybal for dual bgp session config - T180069
  • 16:06 bblack: lvs4006 - restart pybal for dual bgp session config - T180069
  • 16:06 bblack: lvs4007 - restart pybal for dual bgp session config - T180069
  • 16:03 vgutierrez: depool & reimage cp4023 as buster - T242093
  • 16:03 vgutierrez: pooling cp4024 with buster - T242093
  • 15:59 akosiaris: repool eventgate-analytics/eqiad. Experiment proved the failover wouldn't cause (on its own) a problem. Experiment done.
  • 15:58 akosiaris@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=eventgate-analytics
  • 15:57 halfak@deploy1001: Finished deploy [ores/deploy@50a101a]: T242705 (duration: 04m 35s)
  • 15:56 vgutierrez: pooling ncredir4001 running buster - T243391
  • 15:55 moritzm: installing qemu security updates
  • 15:54 bblack: lvs5001 - restart pybal for dual bgp session config - T180069
  • 15:53 bblack: lvs5002 - restart pybal for dual bgp session config - T180069
  • 15:53 halfak@deploy1001: Started deploy [ores/deploy@50a101a]: T242705
  • 15:52 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:52 bblack: lvs5003 - restart pybal for dual bgp session config - T180069
  • 15:50 moritzm: installing python-ecdsa security updates
  • 15:50 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:41 moritzm: installing jsoup security updates
  • 15:30 vgutierrez: depool & reimage ncredir4001 as buster - T243391
  • 15:29 vgutierrez: depool & reimage cp4024 as buster - T242093
  • 15:28 vgutierrez: pooling ncredir4002 running buster - T243391
  • 15:27 moritzm: installing sudo security updates on jessie
  • 15:23 vgutierrez: pooling cp4025 with buster - T242093
  • 15:14 ema: A:mw-api: force puppet run to increase keepalive_requests from 100 to 200 https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/570670/ T241145
  • 15:09 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:07 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:59 godog: extend graphite1004 / graphite2003 fs +200G
  • 14:56 vgutierrez: depool and reimage ncredir4002 as buster - T243391
  • 14:46 vgutierrez: depool & reimage cp4025 as buster - T242093
  • 14:16 akosiaris: 20mins in with eventgate-analytics/eqiad depooled from discovery, no issues yet.
  • 14:14 ema: run puppet on mw-api-canary to revert nginx keepalive_requests bump T241145
  • 13:55 marostegui: Stop MySQL on es1019, upgrade and poweroff for on-site maintenance - T243963
  • 13:54 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventgate-analytics
  • 13:53 akosiaris: depool eqiad eventgate-analytics for testing purposes. Requests will flow to codfw, monitoring https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=now-30m&to=now for issues.
  • 13:51 marostegui@cumin1001: dbctl commit (dc=all): 'Depool es1019 for onsite maintenance T243963', diff saved to https://phabricator.wikimedia.org/P10321 and previous config saved to /var/cache/conftool/dbconfig/20200206-135157-marostegui.json
  • 13:45 XioNoX: rollback deactivate BGP transits on cr3-knams
  • 13:34 elukey: repool mw1347 with mcrouter running with 10 proxy threads (was: 5)
  • 13:31 XioNoX: reboot cr3-knams
  • 13:31 elukey: depool mw1347 to test some mcrouter settings
  • 13:27 XioNoX: deactivate BGP transits on cr3-knams
  • 13:22 vgutierrez: Enable server session sharing on ats-tls in cp4031 - T244464
  • 13:10 XioNoX: rollback: deactivate BGP transits on cr2-eqsin
  • 13:00 XioNoX: reboot cr2-eqsin for sw upgrade
  • 13:00 addshore: SWAT done
  • 13:00 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: resync REVERT Enable EntitySourceBasedFederation for group1 (duration: 01m 07s)
  • 12:59 XioNoX: deactivate BGP transits on cr2-eqsin
  • 12:58 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: REVERT Enable EntitySourceBasedFederation for group1 T243395, due to T244479 (duration: 01m 07s)
  • 12:52 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable EntitySourceBasedFederation for group1 T243395 (duration: 01m 06s)
  • 12:46 addshore@deploy1001: Synchronized php-1.35.0-wmf.18/extensions/Babel: REVERT Fetch central babel information over SQL query, not API (T243726) (duration: 01m 07s)
  • 12:44 addshore@deploy1001: sync-file aborted: Fetch central babel information over SQL query, not API (T243726) (duration: 01m 04s)
  • 12:40 vgutierrez: pooling cp3065 - T242093
  • 12:39 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable EntitySourceBasedFederation for group0 T243395 (duration: 01m 07s)
  • 12:34 cparle@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Re-enable delayed new upload jobs for MachineVision extension (duration: 01m 08s)
  • 12:26 cparle@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Remove handler deleted from the MachineVision extension (duration: 01m 05s)
  • 12:25 XioNoX: remove full-duplex statement from eqsin Tata link (not supported on Junos 18, as 10G is full duplex anyway)
  • 12:24 cparle@deploy1001: Synchronized php-1.35.0-wmf.18/extensions/MachineVision: Use the wbsetclaim API to add depicts statements (duration: 01m 09s)
  • 12:07 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 5e1cbb2: Enable CX in te, kn, gu, mr and pawiki as a default tool (T243271, T243272, T243273, T243274, T243275) (duration: 01m 09s)
  • 11:41 akosiaris: upgrade etherpad-lite on etherpad1002 to 1.8.0-1
  • 11:38 kart_: Updated cxserver to 2020-02-05-051751-production (T244230, T234323)
  • 11:35 kartik@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' .
  • 11:33 akosiaris: upload etherpad-lite_1.8.0-1 to apt.wikimedia.org buster-wikimedia/main
  • 11:31 kartik@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' .
  • 11:28 kartik@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' .
  • 11:14 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:11 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:21 akosiaris: undo "switchover selectively eventgate-analytics.discovery.wmnet to codfw for mw1331 and mw1348". no effect observed
  • 10:20 akosiaris: undo "switchover selectively eventgate-analytics.discovery.wmnet to codfw for mw1331 and mw1348"
  • 10:19 vgutierrez: Enabling HTTP keepalive between ats-tls and varnish-frontend on cp4031 - T244464
  • 10:00 vgutierrez: depool and reimage cp3065 as buster - T242093
  • 09:59 vgutierrez: upload trafficserver 8.0.5-1wm14 to apt.wm.o (buster) - T242093
  • 09:08 dcausse@deploy1001: Finished deploy [wdqs/wdqs@4306c64]: deploying wdqs 0.3.14-SNAPSHOT and gui 5a1af3b (duration: 11m 41s)
  • 08:56 dcausse@deploy1001: Started deploy [wdqs/wdqs@4306c64]: deploying wdqs 0.3.14-SNAPSHOT and gui 5a1af3b
  • 08:45 dcausse@deploy1001: Finished deploy [wdqs/wdqs@4306c64]: deploying wdqs 0.3.14-SNAPSHOT and gui 5a1af3b to wdqs1010.eqiad.wmnet (duration: 00m 29s)
  • 08:44 dcausse@deploy1001: Started deploy [wdqs/wdqs@4306c64]: deploying wdqs 0.3.14-SNAPSHOT and gui 5a1af3b to wdqs1010.eqiad.wmnet
  • 08:23 marostegui: Reboot dbproxy1012 and dbproxy1014 for upgrade
  • 08:18 dcausse: restarting blazegraph on wdqs1006: T242453
  • 08:17 akosiaris: switchover selectively eventgate-analytics.discovery.wmnet to codfw for mw1331 and mw1348 to
  • 06:59 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1101:3317 - T239453', diff saved to https://phabricator.wikimedia.org/P10319 and previous config saved to /var/cache/conftool/dbconfig/20200206-065906-marostegui.json
  • 06:52 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1098:3317 - T239453', diff saved to https://phabricator.wikimedia.org/P10318 and previous config saved to /var/cache/conftool/dbconfig/20200206-065238-marostegui.json
  • 06:46 elukey: run puppet on all ores[12]* nodes
  • 02:49 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 02:42 mutante: ganeti - Creating new VM named install2003.codfw.wmnet in codfw with row=A vcpu=1 memory=1 gigabytes disk=20 gigabytes link=private (T244390)
  • 02:39 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 02:30 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 02:21 mutante: ganeti - Creating new VM named install1003.eqiad.wmnet in eqiad with row=C vcpu=1 memory=1 gigabytes disk=20 gigabytes link=private (T244390)
  • 02:20 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm

2020-02-05

  • 23:30 ebernhardson: delete search indices duplicated on multiple clusters for: hywwiki, chrwiktionary, gcrwiki, mnwwiki, noboard_chapterswikimedia, nqowiki, nrmwiki, outreachwiki and srnwiki
  • 23:08 mholloway-shell@deploy1001: Finished deploy [mobileapps/deploy@a51f927]: Update mobileapps to a7928fa (duration: 10m 48s)
  • 22:57 mholloway-shell@deploy1001: Started deploy [mobileapps/deploy@a51f927]: Update mobileapps to a7928fa
  • 22:07 mutante: Gerrit - added ppchelko to 'wmf-deployment' Gerrit group (he is already in deployment admin group) (T244389)
  • 21:37 arlolra@deploy1001: Finished deploy [parsoid/deploy@01d9d3d]: Updating Parsoid to 74730a3 (duration: 03m 07s)
  • 21:33 arlolra@deploy1001: Started deploy [parsoid/deploy@01d9d3d]: Updating Parsoid to 74730a3
  • 21:31 mutante: killing and restarting wikibugs, it was reporting each update twice
  • 20:51 joal@deploy1001: Finished deploy [analytics/refinery@a47f0d5] (thin): Analytics regular weekly deploy (duration: 00m 07s)
  • 20:51 joal@deploy1001: Started deploy [analytics/refinery@a47f0d5] (thin): Analytics regular weekly deploy
  • 20:51 joal@deploy1001: Finished deploy [analytics/refinery@a47f0d5]: Analytics regular weekly deploy (duration: 13m 28s)
  • 20:50 mutante: ores1004 - systemctl start celery-ores-worker
  • 20:45 twentyafterfour@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.18 refs T233866 (duration: 01m 07s)
  • 20:44 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.18 refs T233866
  • 20:37 joal@deploy1001: Started deploy [analytics/refinery@a47f0d5]: Analytics regular weekly deploy
  • 20:34 dzahn@cumin1001: conftool action : set/weight=25; selector: name=mw1269.eqiad.wmnet
  • 20:25 dzahn@cumin1001: conftool action : set/weight=25; selector: name=mw1267.eqiad.wmnet
  • 20:25 mutante: mw1267 restarting php7.2-fpm
  • 20:21 joal@deploy1001: Finished deploy [analytics/hdfs-tools/deploy@714e2d0]: Deploy bug fix version (duration: 00m 08s)
  • 20:21 joal@deploy1001: Started deploy [analytics/hdfs-tools/deploy@714e2d0]: Deploy bug fix version
  • 20:09 twentyafterfour: Preparing to deploy wmf/1.35.0-wmf.18 to group1 wikis refs T233866
  • 20:09 moritzm: installing git security updates for jessie
  • 20:00 moritzm: installing unzip security updates
  • 19:44 mutante: LDAP - added spramduya to wmf group (T243802)
  • 19:38 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Clean up VisualEditor settings (duration: 01m 07s)
  • 19:38 ebernhardson: restart mjolnir-kafka-bulk-daemon across eqiad, daemons appear stuck and not reading new messages
  • 19:19 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T238029 Enable InukaPageView logging on production Wikipedias (duration: 01m 07s)
  • 19:15 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: Sync back revert of 975b4bbb9 (duration: 01m 06s)
  • 19:10 jforrester@deploy1001: scap failed: average error rate on 4/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details)
  • 18:35 vgutierrez: pooling cp5012 - T242093
  • 18:23 vgutierrez: rebooting cp5012 - T242093
  • 18:21 elukey: restart memcached on mc1025 with 8 threads (rollback - revert https://gerrit.wikimedia.org/r/#/c/570370/, run puppet, restart memcached)
  • 17:51 mutante: ganeti1017 - rebooting (not in use yet)
  • 17:34 reedy@deploy1001: Synchronized php-1.35.0-wmf.18/languages/: T244300 (duration: 01m 13s)
  • 17:33 reedy@deploy1001: Synchronized php-1.35.0-wmf.18/includes/: T244300 (duration: 01m 14s)
  • 16:53 urandom: Sessionstore deployment (mediawiki-config) is done
  • 16:37 ppchelko@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: gerrit:569678 Config: Enable sessionstore on group0 and 1 T243106 (duration: 01m 08s)
  • 16:25 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: T232140 Restore wgLogoHD to wikis without a MinervaCustomLogos defined (duration: 01m 09s)
  • 16:07 elukey: update puppet compiler's facts
  • 15:54 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:52 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:29 effie: restart php-fpm on canaries - T236800
  • 15:24 effie: Rollout php-apcu_5.1.17+4.0.11-1+0~20190217111312.9+stretch~1.gbp192528+wmf2 to api, app and jobrunner canaries - T236800
  • 15:15 vgutierrez: depooling & reimaging cp5012 as buster - T242093
  • 15:12 ema: cp: unset Accept-Encoding from ats-be requests to applayer T242478
  • 14:35 vgutierrez: updating acme-chief to version 0.24 - T244236
  • 14:32 _joe_: restarting mcrouter at nice -19 on mw1331 for testing effects of that change
  • 14:30 vgutierrez: upload acme-chief 0.24 to apt.wm.o (buster) - T244236
  • 14:26 XioNoX: push initial flowspec config to all routers
  • 14:23 vgutierrez: pooling cp5006 - T242093
  • 14:13 ema: cp1075: back to leaving Accept-Encoding as it is due to unrelated applayer issues T242478
  • 13:46 marostegui: Decrease buffer pool size on db1107 for testing - T242702
  • 13:45 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:43 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 13:42 akosiaris: undo the manually set 10.2.1.42 eventgate-analytics.discovery.wmnet in /etc/hosts for mw1331, mw1348. Verify hypothesis that this should cause increased latency. Restart php-fpm
  • 13:41 ema: cp1075: unset Accept-Encoding on origin server requests T242478
  • 13:39 Amir1: EU SWAT is done
  • 13:38 ema: cp: disable puppet and merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/570311/ T242478
  • 13:35 XioNoX: rollback traffic steering off cr2-eqord
  • 13:29 akosiaris: manually set 10.2.1.42 eventgate-analytics.discovery.wmnet in /etc/hosts for mw1331, mw1348. Verify hypothesis that this should cause increased latency
  • 13:25 XioNoX: reboot cr2-eqord for software upgrade - yaaaaa
  • 13:24 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.18/extensions/Wikibase/lib/includes/Store/CachingPropertyInfoLookup.php: SWAT: Cache PropertyInfoLookup internally (T243955) (duration: 01m 07s)
  • 13:17 XioNoX: increase ospf cost for cr2-eqord links