You are browsing a read-only backup copy of Wikitech. The primary site can be found at wikitech.wikimedia.org

Server Admin Log

From Wikitech-static

2020-07-09

  • 00:58 cdanis: ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕘🍺 sudo cumin A:cp 'enable-puppet "cdanis deploying I6c1b646e T256395"'
  • 00:49 cdanis: ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕘🍺 sudo cumin A:cp 'disable-puppet "cdanis deploying I6c1b646e T256395"'

2020-07-08

  • 21:56 mutante: deleting files from releases2001 that are not existing on releases1001 to make them mirrors. rsync with --delete and the command from quickdatacopy class (T247652)
  • 21:55 mutante: rsyncing releases files from releases1001 to releases2002 and releases1002. deleting files from releases2002 not existing on releases1002 to make them mirrors (T247652)
  • 20:59 cstone: civicrm revision changed from d73ee2e73f to 8b09c87ce2
  • 20:27 Amir1: end of foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https (T256012)
  • 20:08 Amir1_: start of foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https (T256012)
  • 19:18 twentyafterfour@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.40 refs T256668 (duration: 01m 04s)
  • 19:17 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.40 refs T256668
  • 18:58 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 091442c: Add *.nga.gov to the wgCopyUploadsDomains allowlist of Wikimedia Commons (T256518) (duration: 01m 04s)
  • 18:55 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 2e5943d: Add scan-bugs.org to $wgCopyUploadsDomains (T256569) (duration: 01m 04s)
  • 18:46 urbanecm@deploy1001: Synchronized static/images/project-logos/: f42cdf2: Change bnwiki logo (T255328) (duration: 01m 04s)
  • 18:27 ppchelko@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Cleanup: remove temporary wmgDisableHTCP variable gerrit:607596 T250781 IS.php (duration: 01m 01s)
  • 18:20 ppchelko@deploy1001: Synchronized wmf-config/CommonSettings.php: Disable HTCP purging everywhere gerrit:607593 T250781 CS.php (duration: 01m 03s)
  • 18:18 ppchelko@deploy1001: Synchronized wmf-config/wikitech.php: Disable HTCP purging everywhere gerrit:607593 T250781 wikitech.php (duration: 01m 04s)
  • 18:17 ppchelko@deploy1001: Synchronized wmf-config/reverse-proxy.php: Disable HTCP purging everywhere gerrit:607593 T250781 reverse-proxy.php (duration: 01m 04s)
  • 18:11 ppchelko@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Add wgEventServiceDefault to refactor EventBus event stream config gerrit:610160 T229863, IS.php (duration: 01m 03s)
  • 18:04 ppchelko@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Add wgEventServiceDefault to refactor EventBus event stream config gerrit:610160 T229863 (duration: 01m 04s)
  • 17:34 elukey@cumin1001: END (FAIL) - Cookbook sre.hadoop.change-distro (exit_code=99)
  • 17:16 elukey@cumin1001: START - Cookbook sre.hadoop.change-distro
  • 17:16 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0)
  • 17:08 elukey@cumin1001: START - Cookbook sre.hadoop.stop-cluster
  • 16:57 _joe_: restarting restbase across the fleet to transition to using envoy
  • 16:40 _joe_: restarting restbase on restbase2010 to route calls to mediawiki, parsoid via envoy
  • 16:40 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 16:37 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
  • 16:32 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 16:27 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
  • 15:22 jgleeson: updated fundraising-tools from a244e0e85f --> f5b8528214
  • 15:16 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 15:12 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 15:12 moritzm: rebooting people1002 (people.wikimedia.org) for kernel security update
  • 15:06 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 15:04 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 14:48 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 14:46 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 14:46 moritzm: installing isc-dhcp security updates
  • 14:31 elukey@cumin1001: END (FAIL) - Cookbook sre.hadoop.change-distro (exit_code=99)
  • 14:31 moritzm: installing gdk-pixbuf security updates
  • 14:26 _joe_: repooling mw1346
  • 14:24 _joe_: php7adm /opcache-free on mw1346
  • 14:15 jbond42: switch icinga authentication to CAS SSO
  • 14:12 _joe_: depooling mw1346
  • 14:12 elukey@cumin1001: START - Cookbook sre.hadoop.change-distro
  • 14:11 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0)
  • 14:06 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 14:04 elukey@cumin1001: START - Cookbook sre.hadoop.stop-cluster
  • 14:04 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 14:04 moritzm: rebooting idp-test1001 for kernel update
  • 13:59 elukey@cumin1001: END (ERROR) - Cookbook sre.hadoop.stop-cluster (exit_code=97)
  • 13:59 elukey@cumin1001: START - Cookbook sre.hadoop.stop-cluster
  • 13:39 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.change-distro (exit_code=0)
  • 13:31 jynus: replacing ssh key for ci_docroot at deploy1001
  • 13:31 moritzm: imported git 2.20.1-2+deb10u3~wmf1 for stretch-wikimedia component/git T257308
  • 13:10 elukey@cumin1001: START - Cookbook sre.hadoop.change-distro
  • 13:07 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0)
  • 13:00 elukey@cumin1001: START - Cookbook sre.hadoop.stop-cluster
  • 12:41 marostegui: Deploy schema change on s7 codfw, lag is expected
  • 12:17 xionox-tmp: rollout less frequent option-refresh-rate - T240658
  • 12:01 xionox-tmp: renumber eqiad NTT link - T254877
  • 11:42 awight: EU BACON complete
  • 11:41 awight@deploy1001: Synchronized wmf-config/InitialiseSettings.php: BACON: Undeploy graphoid for phase 1 wikis (T257402) (duration: 01m 03s)
  • 11:31 awight@deploy1001: Synchronized wmf-config/InitialiseSettings.php: BACON: Add nature.com to commonswiki wgCopyUploadsDomains (T254342) (duration: 01m 03s)
  • 11:29 moritzm: installing freetype security updates
  • 11:26 awight@deploy1001: Synchronized wmf-config/InitialiseSettings.php: BACON: [hiwikibooks] Translate sitename for hi.wikibooks (T256587) (duration: 01m 03s)
  • 11:19 awight@deploy1001: Synchronized wmf-config/InitialiseSettings.php: BACON: [arwiki] Grant 'patrolmarks' to all (T257106) (duration: 01m 04s)
  • 11:18 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'proton' for release 'production' .
  • 11:18 moritzm: installing libgcrypt20 security updates
  • 11:16 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'proton' for release 'production' .
  • 11:07 awight@deploy1001: Synchronized wmf-config/InitialiseSettings.php: BACON: Provision WMDE TeWü survey for prototype 1 (T257306), file 2/2 (duration: 01m 03s)
  • 11:06 awight@deploy1001: Synchronized wmf-config/InitialiseSettings-labs.php: BACON: Provision WMDE TeWü survey for prototype 1 (T257306), file 1/2 (duration: 01m 16s)
  • 11:05 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'proton' for release 'production' .
  • 11:05 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1121', diff saved to https://phabricator.wikimedia.org/P11818 and previous config saved to /var/cache/conftool/dbconfig/20200708-110546-marostegui.json
  • 10:51 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 10:51 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 10:50 akosiaris: apply calico egress policies
  • 10:50 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 10:45 moritzm: installing json-c security updates
  • 10:25 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1121', diff saved to https://phabricator.wikimedia.org/P11817 and previous config saved to /var/cache/conftool/dbconfig/20200708-102553-marostegui.json
  • 10:25 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1084', diff saved to https://phabricator.wikimedia.org/P11816 and previous config saved to /var/cache/conftool/dbconfig/20200708-102500-marostegui.json
  • 10:13 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1084', diff saved to https://phabricator.wikimedia.org/P11815 and previous config saved to /var/cache/conftool/dbconfig/20200708-101313-marostegui.json
  • 09:58 kormat@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:57 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0)
  • 09:56 kormat@cumin2001: START - Cookbook sre.hosts.downtime
  • 09:50 elukey@cumin1001: START - Cookbook sre.hadoop.stop-cluster
  • 09:45 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1149', diff saved to https://phabricator.wikimedia.org/P11814 and previous config saved to /var/cache/conftool/dbconfig/20200708-094539-marostegui.json
  • 09:26 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1149', diff saved to https://phabricator.wikimedia.org/P11813 and previous config saved to /var/cache/conftool/dbconfig/20200708-092650-marostegui.json
  • 09:26 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1148', diff saved to https://phabricator.wikimedia.org/P11812 and previous config saved to /var/cache/conftool/dbconfig/20200708-092627-marostegui.json
  • 09:24 xionox-tmp: renumber eqord NTT link - T254877
  • 09:18 xionox-tmp: remove eqord-eqiad tunnel - T254877
  • 09:15 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1148', diff saved to https://phabricator.wikimedia.org/P11811 and previous config saved to /var/cache/conftool/dbconfig/20200708-091557-marostegui.json
  • 08:57 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1147', diff saved to https://phabricator.wikimedia.org/P11810 and previous config saved to /var/cache/conftool/dbconfig/20200708-085745-marostegui.json
  • 08:55 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 08:54 marostegui@cumin1001: START - Cookbook sre.hosts.decommission
  • 08:54 marostegui@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99)
  • 08:54 marostegui@cumin1001: START - Cookbook sre.hosts.decommission
  • 08:50 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1147', diff saved to https://phabricator.wikimedia.org/P11809 and previous config saved to /var/cache/conftool/dbconfig/20200708-085024-marostegui.json
  • 08:42 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1074', diff saved to https://phabricator.wikimedia.org/P11808 and previous config saved to /var/cache/conftool/dbconfig/20200708-084227-marostegui.json
  • 08:40 moritzm: upgrading docker on remaining buster hosts
  • 08:38 hashar: Upgraded docker.io on contint1001 and contint2001
  • 08:28 marostegui: Remove dbproxy1003 grants from misc hosts T231280
  • 08:26 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1146:3314', diff saved to https://phabricator.wikimedia.org/P11807 and previous config saved to /var/cache/conftool/dbconfig/20200708-082624-marostegui.json
  • 08:20 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1146:3314', diff saved to https://phabricator.wikimedia.org/P11806 and previous config saved to /var/cache/conftool/dbconfig/20200708-082040-marostegui.json
  • 08:16 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1144:3314', diff saved to https://phabricator.wikimedia.org/P11805 and previous config saved to /var/cache/conftool/dbconfig/20200708-081647-marostegui.json
  • 08:15 kormat@cumin1001: dbctl commit (dc=all): 'Depool es2020 for reimaging T257284', diff saved to https://phabricator.wikimedia.org/P11804 and previous config saved to /var/cache/conftool/dbconfig/20200708-081519-kormat.json
  • 08:00 marostegui: Failover m1 from db1097 to db1080 - T256717
  • 07:57 kormat: reimaging es2020 to buster T257284
  • 07:51 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 07:49 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1144:3314', diff saved to https://phabricator.wikimedia.org/P11803 and previous config saved to /var/cache/conftool/dbconfig/20200708-074939-marostegui.json
  • 07:49 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 07:48 jynus: stop bacula-director on backup1001 in preparation for m1 switchover T256717
  • 07:47 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 07:47 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 07:47 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 07:47 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 07:47 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 07:47 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 07:45 moritzm: installing PHP 7.3 security updates
  • 07:35 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1143', diff saved to https://phabricator.wikimedia.org/P11802 and previous config saved to /var/cache/conftool/dbconfig/20200708-073548-marostegui.json
  • 07:30 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1143', diff saved to https://phabricator.wikimedia.org/P11801 and previous config saved to /var/cache/conftool/dbconfig/20200708-073037-marostegui.json
  • 07:30 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1142', diff saved to https://phabricator.wikimedia.org/P11800 and previous config saved to /var/cache/conftool/dbconfig/20200708-073011-marostegui.json
  • 07:24 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1142', diff saved to https://phabricator.wikimedia.org/P11799 and previous config saved to /var/cache/conftool/dbconfig/20200708-072431-marostegui.json
  • 07:09 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1141', diff saved to https://phabricator.wikimedia.org/P11798 and previous config saved to /var/cache/conftool/dbconfig/20200708-070921-marostegui.json
  • 07:04 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1141', diff saved to https://phabricator.wikimedia.org/P11797 and previous config saved to /var/cache/conftool/dbconfig/20200708-070432-marostegui.json
  • 07:04 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1138', diff saved to https://phabricator.wikimedia.org/P11796 and previous config saved to /var/cache/conftool/dbconfig/20200708-070403-marostegui.json
  • 06:47 marostegui: start topology changes on m1 T256717
  • 06:43 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1138', diff saved to https://phabricator.wikimedia.org/P11795 and previous config saved to /var/cache/conftool/dbconfig/20200708-064354-marostegui.json
  • 06:36 marostegui: Deploy schema change on s2 primary master db1122 T238966
  • 06:18 _joe_: rolling restart of restbase to pick up the proton url change
  • 03:36 andrew@deploy1001: Finished deploy [horizon/deploy@505819d]: further fixes for proxy editing --bug 610130 (duration: 03m 44s)
  • 03:32 andrew@deploy1001: Started deploy [horizon/deploy@505819d]: further fixes for proxy editing --bug 610130

2020-07-07

  • 22:41 mutante: new Wikimedia Annual Report 2019 now available on annual.wikimedia.org
  • 21:29 andrew@deploy1001: Finished deploy [horizon/deploy@fce8183]: further fixes for proxy editing --bug 610130 (duration: 03m 35s)
  • 21:25 andrew@deploy1001: Started deploy [horizon/deploy@fce8183]: further fixes for proxy editing --bug 610130
  • 21:10 andrew@deploy1001: Finished deploy [horizon/deploy@abcd051]: further fixes for proxy editing --bug 610130 (duration: 03m 26s)
  • 21:07 andrew@deploy1001: Started deploy [horizon/deploy@abcd051]: further fixes for proxy editing --bug 610130
  • 20:41 ppchelko@deploy1001: Finished deploy [restbase/deploy@05b8bd5]: Remove restbase2009, take 2 (duration: 09m 15s)
  • 20:32 ppchelko@deploy1001: Started deploy [restbase/deploy@05b8bd5]: Remove restbase2009, take 2
  • 20:32 ppchelko@deploy1001: Finished deploy [restbase/deploy@05b8bd5]: Remove restbase2009 (duration: 14m 28s)
  • 20:24 mutante: kubernetes1003 - starting nagios-nrpe-server
  • 20:23 mutante: kubernetes1001 - starting nagios-nrpe-server
  • 20:17 ppchelko@deploy1001: Started deploy [restbase/deploy@05b8bd5]: Remove restbase2009
  • 19:29 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 19:27 mutante: destroying VM gerrit1002 - decom cookbook
  • 19:26 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 19:18 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.40 refs T256668
  • 19:04 mutante: contint2001 - move /var/lib/zuul/.ssh/known_hosts to root and run puppet to recreate it
  • 18:38 andrew@deploy1001: Finished deploy [horizon/deploy@eaa056e]: fix for proxy editing --bug 610130 (duration: 03m 18s)
  • 18:35 andrew@deploy1001: Started deploy [horizon/deploy@eaa056e]: fix for proxy editing --bug 610130
  • 18:27 andrew@deploy1001: Finished deploy [horizon/deploy@a39e86c]: update proxy UI to support editing existing proxies (duration: 03m 26s)
  • 18:23 andrew@deploy1001: Started deploy [horizon/deploy@a39e86c]: update proxy UI to support editing existing proxies
  • 18:10 krinkle@deploy1001: Synchronized w/: remove untracked test cookie file (duration: 01m 04s)
  • 18:08 krinkle@deploy1001: Synchronized php-1.35.0-wmf.40/includes/Revision/RevisionStore.php: I8f986daeab4 (duration: 01m 05s)
  • 17:59 herron: imported (logstash|kibana|elasticsearch)-oss-7.8.0 into buster-wikimedia thirdparty/elastic78
  • 17:54 hnowlan: finished removing restbase2009 from cassandra pool
  • 17:06 hnowlan: removed restbase2009-b from cassandra pool, removing restbase2009-c
  • 16:40 lucaswerkmeister-wmde@deploy1001: Synchronized php-1.35.0-wmf.40/extensions/Wikibase: Backport: Revert "Don’t load $wgWBClientSettings in WikibaseClient.php" (T257296) (duration: 01m 10s)
  • 15:49 hnowlan: running nodetool removenode for restbase2009-a
  • 15:38 hnowlan@deploy1001: Started restart [restbase/deploy@05b8bd5]: Restarting restbase after removal of restbase2009
  • 15:27 elukey: root-tmux on cumin1001 - cumin 'c:profile::mediawiki::mcrouter_wancache' '/usr/local/sbin/restart-mcrouter' -b 2 -s 5 - roll restart of mw-mcrouter to pick up new settings - T255511
  • 15:13 hnowlan@deploy1001: Started restart [restbase/deploy@05b8bd5]: Restarting restbase after removal of restbase2009
  • 15:12 otto@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
  • 15:12 otto@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 15:09 otto@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 15:09 otto@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 15:06 otto@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
  • 15:04 otto@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 15:04 otto@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
  • 15:02 otto@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 15:02 otto@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
  • 15:01 otto@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
  • 14:58 hashar@deploy1001: Finished deploy [integration/docroot@708d3eb]: Second deployment to ensure everything works fine. Thank you jynus (duration: 00m 04s)
  • 14:58 hashar@deploy1001: Started deploy [integration/docroot@708d3eb]: Second deployment to ensure everything works fine. Thank you jynus
  • 14:53 _joe_: restarted restbase on restbase2022 after removing restbase2009 from the cassandra seeds
  • 14:48 otto@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 14:47 otto@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 14:38 otto@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 14:38 otto@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
  • 14:31 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 14:31 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
  • 14:30 papaul: replacing msw-a5,a6,a7 and a8
  • 14:30 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
  • 14:24 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 14:24 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 14:20 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 14:20 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
  • 14:16 hashar@deploy1001: Finished deploy [integration/docroot@708d3eb]: (no justification provided) (duration: 00m 09s)
  • 14:16 hashar@deploy1001: Started deploy [integration/docroot@708d3eb]: (no justification provided)
  • 13:38 _joe_: rolling restart of restbase to pick up using envoy
  • 13:31 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 13:31 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
  • 13:29 XioNoX: cr2-eqiad> request vmhost snapshot routing-engine both - T257153
  • 13:24 XioNoX: cr1-eqiad> request vmhost snapshot routing-engine both - T257153
  • 13:15 kormat@cumin1001: dbctl commit (dc=all): 'Promote es2021 to es4 master T257284', diff saved to https://phabricator.wikimedia.org/P11789 and previous config saved to /var/cache/conftool/dbconfig/20200707-131524-kormat.json
  • 12:59 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:57 elukey@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:44 kormat: starting (codfw) es5 failover from es2020 to es2021 T257284
  • 12:30 kormat@cumin1001: dbctl commit (dc=all): 'Set es2021 to weight 50 T257284', diff saved to https://phabricator.wikimedia.org/P11787 and previous config saved to /var/cache/conftool/dbconfig/20200707-123003-kormat.json
  • 12:12 jforrester@deploy1001: Finished scap: Full scap and testwikis to 1.35.0-wmf.40 for T256668 (duration: 33m 09s)
  • 12:01 marostegui: Deploy schema change on labswiki (wikitech) master - T253276
  • 11:58 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1082', diff saved to https://phabricator.wikimedia.org/P11786 and previous config saved to /var/cache/conftool/dbconfig/20200707-115838-marostegui.json
  • 11:39 jforrester@deploy1001: Started scap: Full scap and testwikis to 1.35.0-wmf.40 for T256668
  • 11:38 jforrester@deploy1001: scap failed: LockFailedError Failed to acquire lock "/var/lock/scap.operations_mediawiki-config.lock"; owner is "jforrester"; reason is "testwikis wikis to 1.35.0-wmf.40" (duration: 00m 00s)
  • 11:33 moritzm: installing PHP 7.0 security updates
  • 11:29 marostegui: Deploy schema change on db1082, this will create lag on s5 labs
  • 11:29 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1082', diff saved to https://phabricator.wikimedia.org/P11784 and previous config saved to /var/cache/conftool/dbconfig/20200707-112926-marostegui.json
  • 11:28 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1130', diff saved to https://phabricator.wikimedia.org/P11783 and previous config saved to /var/cache/conftool/dbconfig/20200707-112830-marostegui.json
  • 11:26 godog: test bumping logstash7 batch size to 256
  • 11:17 moritzm: prune PHP 7.0 packages from mwdebug1001/2001/2002
  • 11:05 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1130', diff saved to https://phabricator.wikimedia.org/P11782 and previous config saved to /var/cache/conftool/dbconfig/20200707-110506-marostegui.json
  • 11:04 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1110', diff saved to https://phabricator.wikimedia.org/P11781 and previous config saved to /var/cache/conftool/dbconfig/20200707-110412-marostegui.json
  • 10:57 moritzm: prune PHP 7.0 packages from mw2190-mw2214
  • 10:46 jforrester@deploy1001: Started scap: testwikis wikis to 1.35.0-wmf.40
  • 10:44 jforrester@deploy1001: Pruned MediaWiki: 1.35.0-wmf.38 (duration: 17m 23s)
  • 10:32 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1110', diff saved to https://phabricator.wikimedia.org/P11780 and previous config saved to /var/cache/conftool/dbconfig/20200707-103255-marostegui.json
  • 10:27 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1113:3315', diff saved to https://phabricator.wikimedia.org/P11779 and previous config saved to /var/cache/conftool/dbconfig/20200707-102757-marostegui.json
  • 10:26 moritzm: prune PHP 7.0 packages from mw2135-mw2147
  • 10:12 addshore@deploy1001: Synchronized wmf-config/config/testcommonswiki.yaml: gerrit:609985 Make testcommonswiki a testwikidata client T257266 PT2/2 (duration: 00m 55s)
  • 10:11 addshore@deploy1001: sync-file aborted: gerrit:609985 Make testcommonswiki a testwikidata client T257266 PT1/2 (duration: 00m 00s)
  • 10:10 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1113:3315', diff saved to https://phabricator.wikimedia.org/P11778 and previous config saved to /var/cache/conftool/dbconfig/20200707-101043-marostegui.json
  • 10:10 addshore@deploy1001: Synchronized dblists/wikidataclient-test.dblist: gerrit:609985 Make testcommonswiki a testwikidata client T257266 PT1/2 (duration: 00m 56s)
  • 10:08 addshore@deploy1001: sync-file aborted: gerrit:609985 Make testcommonswiki a testwikidata client T257266 PT1/2 (duration: 00m 36s)
  • 10:06 elukey: decommission archiva1001
  • 10:05 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 10:03 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1144:3315', diff saved to https://phabricator.wikimedia.org/P11777 and previous config saved to /var/cache/conftool/dbconfig/20200707-100328-marostegui.json
  • 10:03 elukey@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 10:03 elukey@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:03 elukey@cumin1001: START - Cookbook sre.hosts.decommission
  • 09:54 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1144:3315', diff saved to https://phabricator.wikimedia.org/P11776 and previous config saved to /var/cache/conftool/dbconfig/20200707-095443-marostegui.json
  • 09:54 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1096:3315', diff saved to https://phabricator.wikimedia.org/P11775 and previous config saved to /var/cache/conftool/dbconfig/20200707-095428-marostegui.json
  • 09:42 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: gerrit:609971 T257266 Enable sitelinks to testcommons from test wikidata sites (duration: 00m 56s)
  • 09:40 kormat@cumin1001: dbctl commit (dc=all): 'Repool es2021 after reimaging T257284', diff saved to https://phabricator.wikimedia.org/P11774 and previous config saved to /var/cache/conftool/dbconfig/20200707-094017-kormat.json
  • 09:37 addshore@deploy1001: Synchronized wmf-config: gerrit:609986 T257266 T241975 Wikibase: Remove config option wmgUseEntitySourceBasedFederation (take2) (duration: 00m 57s)
  • 09:36 _joe_: errata: restbase2010, not 2009
  • 09:36 _joe_: applying the new configuration using the service proxy to restbase2009 too
  • 09:34 godog: bounce logstash on logstash1023
  • 09:33 addshore@deploy1001: Synchronized wmf-config/Wikibase.php: gerrit:609645 T257266 T241975 Wikibase: stop using wmgUseEntitySourceBasedFederation (take2) (duration: 00m 59s)
  • 09:33 _joe_: depooling restbase1025 while we fix the troubled relationship between envoy and proton
  • 09:33 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1096:3315', diff saved to https://phabricator.wikimedia.org/P11773 and previous config saved to /var/cache/conftool/dbconfig/20200707-093345-marostegui.json
  • 09:26 marostegui@cumin1001: dbctl commit (dc=all): 'Remove weight from es1024 as it is the current master T255755', diff saved to https://phabricator.wikimedia.org/P11772 and previous config saved to /var/cache/conftool/dbconfig/20200707-092635-marostegui.json
  • 09:24 James_F: 1.35.0-wmf.40 was branched at 88ecd6d for T256668
  • 09:23 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool es1023 after reimage T255755', diff saved to https://phabricator.wikimedia.org/P11771 and previous config saved to /var/cache/conftool/dbconfig/20200707-092357-marostegui.json
  • 09:10 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool es1023 after reimage T255755', diff saved to https://phabricator.wikimedia.org/P11770 and previous config saved to /var/cache/conftool/dbconfig/20200707-091015-marostegui.json
  • 08:33 kormat@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:31 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool es1023 after reimage T255755', diff saved to https://phabricator.wikimedia.org/P11769 and previous config saved to /var/cache/conftool/dbconfig/20200707-083144-marostegui.json
  • 08:30 kormat@cumin2001: START - Cookbook sre.hosts.downtime
  • 08:26 XioNoX: cr2-codfw> request vmhost snapshot routing-engine both - T257153
  • 08:22 XioNoX: cr2-eqsin> request vmhost snapshot - T257153
  • 08:19 XioNoX: cr2-eqord> request vmhost snapshot - T257153
  • 08:19 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool es1023 after reimage T255755', diff saved to https://phabricator.wikimedia.org/P11768 and previous config saved to /var/cache/conftool/dbconfig/20200707-081909-marostegui.json
  • 08:18 elukey@cumin1001: END (ERROR) - Cookbook sre.hadoop.change-distro (exit_code=97)
  • 08:17 XioNoX: cr2-eqdfw> request vmhost snapshot - T257153
  • 08:15 XioNoX: cr3-knams> request vmhost snapshot - T257153
  • 08:15 hashar: upgrading and restart CI Jenkins on contint2001 # T256978
  • 08:12 XioNoX: cr4-ulsfo> request vmhost snapshot - T257153
  • 08:09 kormat@cumin1001: dbctl commit (dc=all): 'Depool es2021 for reimaging T257284', diff saved to https://phabricator.wikimedia.org/P11767 and previous config saved to /var/cache/conftool/dbconfig/20200707-080914-kormat.json
  • 07:50 marostegui: Stop MySQL on db1074 to deploy schema change and remove triggers - T238966
  • 07:45 _joe_: restarting restbase again on rb1025
  • 07:44 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1074 for schema change', diff saved to https://phabricator.wikimedia.org/P11766 and previous config saved to /var/cache/conftool/dbconfig/20200707-074435-marostegui.json
  • 07:41 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 07:39 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1079 and db1136 T257216', diff saved to https://phabricator.wikimedia.org/P11765 and previous config saved to /var/cache/conftool/dbconfig/20200707-073918-marostegui.json
  • 07:38 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 07:31 _joe_: restarting restbase on restbase1025, reaching proton via envoy for now
  • 07:31 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Config: Revert "Commons: Define entity sources configuration" (T256906, T256907, T256909, T254315, T257266) (forgot to git rebase so the last sync was a no-op) (duration: 00m 56s)
  • 07:27 elukey@cumin1001: START - Cookbook sre.hadoop.change-distro
  • 07:27 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Config: Revert "Commons: Define entity sources configuration" (T256906, T256907, T256909, T254315, T257266) (duration: 00m 53s)
  • 07:27 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1079 and give more main weight to db1136 T257216', diff saved to https://phabricator.wikimedia.org/P11764 and previous config saved to /var/cache/conftool/dbconfig/20200707-072703-marostegui.json
  • 07:24 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/: Config: Revert "Wikidata client wikis: Define entity sources configuration (take 2)" (T254315, T257266) (duration: 00m 56s)
  • 07:24 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0)
  • 07:23 lucaswerkmeister-wmde@deploy1001: Synchronized dblists/wikidataclient.dblist: Config: Revert "Wikidata client wikis: Define entity sources configuration (take 2)" (T254315, T257266) (duration: 00m 56s)
  • 07:19 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/Wikibase.php: Config: Revert "Wikibase: stop using wmgUseEntitySourceBasedFederation" (T241975, T257266) (duration: 00m 55s)
  • 07:16 elukey@cumin1001: START - Cookbook sre.hadoop.stop-cluster
  • 07:15 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Config: Revert "Wikibase: Remove config option wmgUseEntitySourceBasedFederation" (T241975, T257266) (duration: 00m 57s)
  • 07:10 _joe_: restart restbase on restbase1025 to pick up the switch to https for cxserver
  • 06:37 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1079 and give more main weight to db1136 T257216', diff saved to https://phabricator.wikimedia.org/P11762 and previous config saved to /var/cache/conftool/dbconfig/20200707-063737-marostegui.json
  • 06:29 marostegui: Reimage es1023 to Buster T255755
  • 06:20 marostegui@cumin1001: dbctl commit (dc=all): 'Give db1136 some weight back into main traffic T257216', diff saved to https://phabricator.wikimedia.org/P11761 and previous config saved to /var/cache/conftool/dbconfig/20200707-062008-marostegui.json
  • 06:18 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1079 T257216', diff saved to https://phabricator.wikimedia.org/P11760 and previous config saved to /var/cache/conftool/dbconfig/20200707-061849-marostegui.json
  • 05:26 marostegui@deploy1001: Synchronized wmf-config/db-eqiad.php: Enable es5 writes T255755 (duration: 00m 56s)
  • 05:16 marostegui@cumin1001: dbctl commit (dc=all): 'Depool es1023 entirely T255755', diff saved to https://phabricator.wikimedia.org/P11759 and previous config saved to /var/cache/conftool/dbconfig/20200707-051620-marostegui.json
  • 05:12 marostegui@cumin1001: dbctl commit (dc=all): 'Promote es1024 to es5 master T255755', diff saved to https://phabricator.wikimedia.org/P11758 and previous config saved to /var/cache/conftool/dbconfig/20200707-051236-marostegui.json
  • 05:02 marostegui@deploy1001: Synchronized wmf-config/db-eqiad.php: Disable es5 writes T255755 (duration: 00m 56s)
  • 05:01 marostegui: "Starting es failover from es1023 to es1024 - https://phabricator.wikimedia.org/T255755"
  • 01:05 ejegg: turned on debug logging for Adyen SmashPig
  • 00:22 cstone: civicrm revision changed from a48caf0f37 to d73ee2e73f

2020-07-06

  • 23:32 reedy@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable sidebar instrumentation on test wikipedia (duration: 00m 56s)
  • 23:32 eileen: process-control config revision is 3fe6753e56
  • 23:22 reedy@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Change some zh canonical namespaces. Don't index NS_USER on hywiki (duration: 00m 58s)
  • 22:59 eileen: tools revision changed from e974147f27 to 73557b8038
  • 22:14 ryankemper@deploy1001: Finished deploy [wdqs/wdqs@65502b2]: 0.3.40 (duration: 18m 58s)
  • 21:55 ryankemper@deploy1001: Started deploy [wdqs/wdqs@65502b2]: 0.3.40
  • 21:52 hashar: Upgraded Jenkins on releases1002 and releases2002 # T256978
  • 21:41 mutante: upgrading jenkins on releases1001 and releases2001 (T256980)
  • 21:37 mutante: importing jenkins 2.235.1 into APT repo for both stretch and buster T256980
  • 20:08 eileen: tools revision is e974147f27
  • 19:41 qchris: Enabling puppet on gerrit1002 again to catch up with puppetmaster.
  • 18:56 addshore: backport / deploy window done
  • 18:55 addshore@deploy1001: Synchronized wmf-config: gerrit:569263 T241975 Wikibase: Remove config option wmgUseEntitySourceBasedFederation (duration: 00m 58s)
  • 18:54 addshore@deploy1001: sync-file aborted: gerrit:569263 (duration: 00m 00s)
  • 18:51 addshore@deploy1001: Synchronized wmf-config/Wikibase.php: gerrit:608944 T241975 Wikibase: stop using wmgUseEntitySourceBasedFederation (duration: 00m 56s)
  • 18:47 addshore@deploy1001: Synchronized dblists/wikidataclient.dblist: T254315 Wikidata client wikis: Define entity sources configuration (take 2) gerrit:608839 (duration: 00m 56s)
  • 18:45 addshore@deploy1001: Synchronized wmf-config: T254315 Wikidata client wikis: Define entity sources configuration (take 2) gerrit:608839 (duration: 00m 58s)
  • 18:38 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T256906 T256907 T256909 T254315 gerrit:569260 Commons: Define entity sources configuration (duration: 00m 56s)
  • 18:30 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: adffbe6: Enable validation of new signatures (T248632) (duration: 00m 57s)
  • 18:24 urbanecm@deploy1001: Synchronized wmf-config/abusefilter.php: 8878c60: Add `abusefilter-view` as a default right for the CU log user (T255506) (duration: 00m 55s)
  • 18:22 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 1398171: Add arbcom group to plwiki (T256572) (duration: 00m 56s)
  • 18:08 andrew@deploy1001: Finished deploy [horizon/deploy@bb176c2]: update proxy UI to support multiple pre-set domains (duration: 03m 39s)
  • 18:04 andrew@deploy1001: Started deploy [horizon/deploy@bb176c2]: update proxy UI to support multiple pre-set domains
  • 17:54 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Migrate SearchSatisfaction from EventLogging to EventGate on all wikis - T249261 - take 2 (duration: 00m 56s)
  • 17:50 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Migrate SearchSatisfaction from EventLogging to EventGate on all wikis - T249261 (duration: 00m 56s)
  • 16:09 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Migrate SearchSatisfaction from EventLogging to EventGate on group1 - T249261 (duration: 00m 58s)
  • 15:02 jynus: removing old snapshots for x1 on dbprov[12]002
  • 14:50 mholloway-shell@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
  • 14:46 mholloway-shell@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
  • 14:44 mholloway-shell@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
  • 14:42 moritzm: installing PHP 7.0 security updates
  • 14:37 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1074', diff saved to https://phabricator.wikimedia.org/P11753 and previous config saved to /var/cache/conftool/dbconfig/20200706-143754-marostegui.json
  • 14:36 godog: reboot ms-be2025 for hw raid software upgrade - T257214
  • 14:28 godog: powercycle ms-be2025, no ssh available - T257214
  • 14:14 cmjohnson@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 14:09 marostegui: Stop MySQL and poweroff db1079 T257216
  • 14:08 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
  • 14:02 jynus@cumin1001: dbctl commit (dc=all): 'depool db1136 from main traffic as it is the only s7 api host right now', diff saved to https://phabricator.wikimedia.org/P11752 and previous config saved to /var/cache/conftool/dbconfig/20200706-140217-jynus.json
  • 13:56 marostegui: Downtime and reboot db1079 after BBU crash
  • 13:54 jynus@cumin1001: dbctl commit (dc=all): 'depool db1079', diff saved to https://phabricator.wikimedia.org/P11751 and previous config saved to /var/cache/conftool/dbconfig/20200706-135430-jynus.json
  • 13:30 marostegui: Deploy schema change on s5 codfw master T253276
  • 13:26 marostegui@cumin1001: dbctl commit (dc=all): 'Reduce es1024 weight in preparation for tomorrow's switchover T255755', diff saved to https://phabricator.wikimedia.org/P11750 and previous config saved to /var/cache/conftool/dbconfig/20200706-132634-marostegui.json
  • 13:03 elukey: force umount/mount of /mnt/hdfs on an-airflow1001 to unblock dpkg checks (fuse misbehaving, all checks hanging)
  • 12:53 elukey: kill hanging lsof processes on an-airflow to reduce cpu load
  • 12:42 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1074', diff saved to https://phabricator.wikimedia.org/P11748 and previous config saved to /var/cache/conftool/dbconfig/20200706-124237-marostegui.json
  • 12:41 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1129', diff saved to https://phabricator.wikimedia.org/P11747 and previous config saved to /var/cache/conftool/dbconfig/20200706-124105-marostegui.json
  • 11:17 Urbanecm: EU B&C window was done
  • 11:16 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 5d971dc: GrowthExperiments: Remove overrides to welcome survey privacy policy URL (T252572) (duration: 00m 56s)
  • 11:12 marostegui: Deploy schema changes on db1129
  • 11:12 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1129', diff saved to https://phabricator.wikimedia.org/P11746 and previous config saved to /var/cache/conftool/dbconfig/20200706-111221-marostegui.json
  • 11:09 marostegui: Compress InnoDB on db1107 T254462
  • 11:09 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: f4b5001: Add arxiv.org to commonswiki wgCopyUploadsDomains (T257036) (duration: 00m 56s)
  • 11:07 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1107 T254462', diff saved to https://phabricator.wikimedia.org/P11745 and previous config saved to /var/cache/conftool/dbconfig/20200706-110723-marostegui.json
  • 11:05 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1076', diff saved to https://phabricator.wikimedia.org/P11744 and previous config saved to /var/cache/conftool/dbconfig/20200706-110544-marostegui.json
  • 11:05 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 3bc1b46: Remove "Create a book" link from sidebar on Finnish Wikipedia (T257073) (duration: 00m 56s)
  • 10:52 jdrewniak@deploy1001: Synchronized portals: Wikimedia Portals Update: Bumping portals to master (609762) (duration: 00m 57s)
  • 10:51 jdrewniak@deploy1001: Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: Bumping portals to master (609762) (duration: 00m 56s)
  • 10:31 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 10:28 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 10:28 moritzm: rebooting idp1001 for kernel update
  • 09:35 urbanecm@deploy1001: Synchronized private/PrivateSettings.php: Update T250887 mitigations (duration: 00m 58s)
  • 08:51 XioNoX: cr1-codfw> request vmhost snapshot routing-engine both - T257153
  • 08:44 XioNoX: cr3-ulsfo> request vmhost snapshot - T257153
  • 08:24 kormat: restarting all mariadb instances on sanitarium hosts T256545
  • 08:09 elukey: roll restart aqs on aqs100[4-9] to pick up new druid settings
  • 08:05 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1076', diff saved to https://phabricator.wikimedia.org/P11742 and previous config saved to /var/cache/conftool/dbconfig/20200706-080509-marostegui.json
  • 07:58 qchris: Disable puppet on gerrit1002 (gerrit-test) to deploy Gerrit UI updates there to gather more feedback
  • 07:51 elukey: enable binlog on matomo's database on matomo1002
  • 07:46 XioNoX: repool eqsin - T257154
  • 07:11 XioNoX: reboot cr3-eqsin - T257154
  • 06:55 XioNoX: depool eqsin for cr3-eqsin reboot/investigation - T257154
  • 06:54 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1089', diff saved to https://phabricator.wikimedia.org/P11740 and previous config saved to /var/cache/conftool/dbconfig/20200706-065437-marostegui.json
  • 06:54 elukey@cumin1001: END (FAIL) - Cookbook sre.hadoop.change-distro (exit_code=99)
  • 06:22 elukey@cumin1001: START - Cookbook sre.hadoop.change-distro
  • 06:21 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0)
  • 06:14 elukey@cumin1001: START - Cookbook sre.hadoop.stop-cluster
  • 05:45 kart_: Updated cxserver to 2020-07-01-044435-production (T254143)
  • 05:40 kartik@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 05:36 kartik@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 05:32 kartik@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
  • 05:13 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1089', diff saved to https://phabricator.wikimedia.org/P11739 and previous config saved to /var/cache/conftool/dbconfig/20200706-051333-marostegui.json
  • 05:03 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1089', diff saved to https://phabricator.wikimedia.org/P11738 and previous config saved to /var/cache/conftool/dbconfig/20200706-050347-marostegui.json
  • 04:49 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1089', diff saved to https://phabricator.wikimedia.org/P11737 and previous config saved to /var/cache/conftool/dbconfig/20200706-044908-marostegui.json

2020-07-05

  • 21:50 qchris: Restarting gerrit on gerrit1001 to pick up new war and jars.
  • 21:50 qchris@deploy1001: Finished deploy [gerrit/gerrit@fbd0684]: Bump gerrit to 3.2.2-102-g3bbb138e13, zuul plugin to master-0-g7accc67, and gitiles to v3.2.2-1-g00c5ca0-with-0e3b533 on gerrit1001 (duration: 00m 07s)
  • 21:50 qchris@deploy1001: Started deploy [gerrit/gerrit@fbd0684]: Bump gerrit to 3.2.2-102-g3bbb138e13, zuul plugin to master-0-g7accc67, and gitiles to v3.2.2-1-g00c5ca0-with-0e3b533 on gerrit1001
  • 21:46 qchris: Restarting gerrit on gerrit2001 to pick up new war and jars.
  • 21:45 qchris@deploy1001: Finished deploy [gerrit/gerrit@fbd0684]: Bump gerrit to 3.2.2-102-g3bbb138e13, zuul plugin to master-0-g7accc67, and gitiles to v3.2.2-1-g00c5ca0-with-0e3b533 on gerrit2001 (duration: 00m 10s)
  • 21:45 qchris@deploy1001: Started deploy [gerrit/gerrit@fbd0684]: Bump gerrit to 3.2.2-102-g3bbb138e13, zuul plugin to master-0-g7accc67, and gitiles to v3.2.2-1-g00c5ca0-with-0e3b533 on gerrit2001
  • 21:32 qchris: Restarting gerrit on gerrit1002 to pick up new wars and jars.
  • 21:32 qchris@deploy1001: Finished deploy [gerrit/gerrit@fbd0684]: Bump gerrit to 3.2.2-102-g3bbb138e13 and zuul plugin to master-0-g7accc67 (duration: 00m 08s)
  • 21:32 qchris@deploy1001: Started deploy [gerrit/gerrit@fbd0684]: Bump gerrit to 3.2.2-102-g3bbb138e13 and zuul plugin to master-0-g7accc67
  • 21:20 qchris: Enable puppet on gerrit1002 (gerrit-test) again to let it catch up
  • 16:01 gehel: restart elastic-psi on elastic1052 (high GC rate)
  • 15:56 gehel: restart blazegraph + updater on wdqs1007 and depool to allow catching up on lag

2020-07-04

  • 19:23 qchris@deploy1001: Finished deploy [gerrit/gerrit@b78914b]: Bump gitiles to v3.2.2-1-g00c5ca0-with-0e3b533 on gerrit1002 (duration: 00m 08s)
  • 19:23 qchris@deploy1001: Started deploy [gerrit/gerrit@b78914b]: Bump gitiles to v3.2.2-1-g00c5ca0-with-0e3b533 on gerrit1002
  • 14:05 qchris: Disable puppet on gerrit1002 (gerrit-test) to deploy Gerrit UI updates there to gather feedback
  • 12:42 reedy@deploy1001: Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 24s)
  • 02:28 reedy@deploy1001: Synchronized php-1.35.0-wmf.39/extensions/Score/includes/Score.php: Short circuit lilypond version check to allow usage of cached files T257066 (duration: 00m 55s)

2020-07-03

  • 21:49 reedy@deploy1001: Synchronized php-1.35.0-wmf.39/extensions/Score/: Sync maintenance script (duration: 00m 58s)
  • 18:47 cdanis: ✔️ cdanis@an-coord1001.eqiad.wmnet ~ 🕒☕ sudo systemctl restart hive-server2.service
  • 16:51 krinkle@deploy1001: Synchronized wmf-config/CommonSettings.php: Ifa929b2ad4 (duration: 00m 57s)
  • 16:02 reedy@deploy1001: Synchronized wmf-config/CommonSettings.php: Rename wgRestrictionMethod to wgShellRestrictionMethod (duration: 00m 58s)
  • 15:46 jayme@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 15:43 jayme@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 15:43 jynus@cumin1001: dbctl commit (dc=all): 'Reduce db1118 weight to spread load more evenly', diff saved to https://phabricator.wikimedia.org/P11730 and previous config saved to /var/cache/conftool/dbconfig/20200703-154337-jynus.json
  • 15:40 jayme@cumin1001: START - Cookbook sre.ganeti.makevm
  • 15:38 jayme@cumin1001: START - Cookbook sre.ganeti.makevm
  • 15:09 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0)
  • 15:02 elukey@cumin1001: START - Cookbook sre.hadoop.stop-cluster
  • 14:11 elukey@cumin1001: END (FAIL) - Cookbook sre.hadoop.stop-cluster (exit_code=99)
  • 14:11 _joe_: restarted php-fpm on wtp1033, stuck in sigill
  • 13:59 elukey@cumin1001: START - Cookbook sre.hadoop.stop-cluster
  • 12:41 hashar: Restarting Zuul / CI
  • 11:39 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 11:36 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 11:32 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 11:29 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 11:29 moritzm: rebooting urldownloader standby hosts for kernel updates (1002/2002)
  • 10:59 moritzm: installing json-c security updates on jessie
  • 10:51 moritzm: installing ruby-json security updates
  • 10:25 moritzm: installing nss security updates on jessie
  • 10:18 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 10:15 elukey: notebook1004 renamed to an-scheduler1001
  • 10:15 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 10:09 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:07 elukey@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:06 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 09:04 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 09:00 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 08:58 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 08:58 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 08:56 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 08:55 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 08:51 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 08:47 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 08:43 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 08:43 moritzm: rebooting netflow* hosts for kernel security update
  • 08:16 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 08:14 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 08:04 jayme: authdns-update for chartmuseum - T256970
  • 08:03 elukey@cumin1001: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
  • 07:55 moritzm: installing mutt security updates for jessie (stretch/buster already fixed)
  • 07:44 elukey@cumin1001: START - Cookbook sre.dns.netbox
  • 07:40 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 07:39 elukey@cumin1001: START - Cookbook sre.hosts.decommission
  • 07:00 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 07:00 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 06:47 moritzm: installing php5 security updates
  • 06:27 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 06:27 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 06:09 moritzm: rebooting mw1390-mw1419 for kernel security updates
  • 05:46 XioNoX: remove chassis redundancy failover from fasw-c-eqiad for consistency with all other VCs
  • 05:33 XioNoX: remove chassis redundancy failover from fasw-c-codfw for consistency with all other VCs

2020-07-02

  • 23:22 jhuneidi@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
  • 23:16 jhuneidi@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
  • 22:03 jhuneidi@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
  • 21:56 mutante: gerrit1001 (prod gerrit) - restarting gerrit service
  • 21:52 maryum: frwikibooks reindex successful, continuing on with remainder of french wikis
  • 21:32 mutante: gerrit - deleted gerrit db_pass from prod private repo, running puppet
  • 21:25 mutante: gerrit2001 - restarted gerrit
  • 21:14 mutante: gerrit1002 restarted gerrit
  • 20:20 maryum: reindexing frwikibooks to test https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/604221
  • 19:52 mutante: gerrit2001 - restarting gerrit after removing db_pass from config
  • 16:05 cmjohnson@cumin1001: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
  • 16:01 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
  • 15:40 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 15:37 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 15:23 cmjohnson@cumin1001: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
  • 15:19 cmjohnson@cumin1001: START - Cookbook sre.dns.netbox
  • 15:07 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:07 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 14:42 moritzm: rebooting mw1370-mw1389 for kernel security updates
  • 14:33 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:33 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 14:03 kormat: stopped mariadb@s8 on dbstore1005 for data restoration T256966
  • 12:43 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:43 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:36 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:36 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 12:32 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:32 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:31 moritzm: rebooting mw1349-mw1369 for kernel security updates
  • 12:28 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:27 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:27 vgutierrez: rolling restart of esams load balancers to catch up on kernel upgrades
  • 12:12 XioNoX: pre-configure asw2-b-eqiad<->cloudsw1-c8-eqiad - T251632
  • 12:10 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:10 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:53 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:53 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:40 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:40 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:35 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:35 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:33 vgutierrez: rolling restart of codfw load balancers to catch up on kernel upgrades
  • 11:18 akosiaris: proactively restart docker-registry on registry1001, registry1002 to force CA refresh
  • 11:16 akosiaris: restart docker-registry on registry2002 for CA refresh
  • 11:14 _joe_: restarting docker-registry on registry2001
  • 10:34 godog: move "cluster overview" dashboard to Thanos - T256954
  • 09:35 XioNoX: advertise codfw prefixes from eqord
  • 09:28 jayme: imported chartmuseum_0.12.0-2 to buster-wikimedia - T253843
  • 09:07 addshore: addshore@mwmaint1002:~$ mwscript maintenance/createAndPromote.php --wiki testwikidatawiki --force --custom-groups oversight "DCausse_(WMF)" # T256949
  • 09:07 addshore: addshore@mwmaint1002:~$ mwscript maintenance/createAndPromote.php --wiki testwikidatawiki --force --custom-groups oversight "Addshore" # T256949
  • 08:59 XioNoX: deploy flex flow for MX204s - T248394
  • 05:52 _joe_: removing all tags for envoy-tls-local-proxy
  • 05:46 _joe_: upload docker-report 0.0.4 on buster-wikimedia T242604
  • 04:32 eileen: process-control config revision is b4655897b5
  • 03:17 eileen: process-control config revision is 12fe6b5151
  • 03:15 eileen: tools revision changed from 4ea8567819 to e974147f27
  • 02:32 eileen: tools revision changed from e38f7a83d4 to 4ea8567819
  • 00:53 eileen: tools revision changed from 806e2b4412 to e38f7a83d4

2020-07-01

  • 23:53 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Set $wgForceUIAsContentMsg for zhwikibooks, zhwikinews, zhwikiquote, zhwikisource, zhwikiversity, zhwiktionary (T256521) (duration: 00m 55s)
  • 23:35 ejegg: updated fundraising CiviCRM from 391d0fdf75 to a48caf0f37
  • 23:32 catrope@deploy1001: Synchronized static/images/project-logos/: Change Simplified Chinese logo for zhwiki (T256839) (duration: 00m 55s)
  • 23:18 krinkle@deploy1001: Synchronized wmf-config/CommonSettings.php: Ibb42db7fd1ee (duration: 00m 55s)
  • 23:00 bstorm: set a short downtime on labstore1006/7 to prevent alert while disabling direct systemd monitoring
  • 22:37 krinkle@deploy1001: Synchronized php-1.35.0-wmf.39/includes/Title.php: I8d5bad (duration: 01m 00s)
  • 21:00 mholloway-shell@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
  • 20:58 mholloway-shell@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
  • 20:56 mholloway-shell@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
  • 20:56 Krinkle: krinkle@deploy1001 Ran `scap deploy --init` for /srv/deployment/performance/arc-lamp
  • 20:55 mholloway-shell@deploy1001: Finished deploy [mobileapps/deploy@d7476f5]: Update mobileapps to 953fc41a (duration: 04m 08s)
  • 20:51 mholloway-shell@deploy1001: Started deploy [mobileapps/deploy@d7476f5]: Update mobileapps to 953fc41a
  • 20:27 eileen: tools revision changed from 6f38c14fe3 to 806e2b4412
  • 20:11 eileen: tools revision changed from aab96444df to 6f38c14fe3
  • 19:23 twentyafterfour: 1.35.0-wmf.39 is now deployed to group2 wikis, everything appears to be normal. refs T254176
  • 19:18 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: group2 wikis to 1.35.0-wmf.39 refs T254176
  • 18:44 addshore@deploy1001: Synchronized wmf-config: REVERT T254315 Wikidata client wikis: Define entity sources configuration gerrit:569259 (duration: 01m 04s)
  • 18:41 addshore@deploy1001: sync-file aborted: T254315 Wikidata client wikis: Define entity sources configuration gerrit:569259 (duration: 00m 38s)
  • 18:38 joal@deploy1001: Finished deploy [analytics/refinery@8b7bddf] (thin): Regular analytics weekly train THIN [analytics/refinery@8b7bddf] (duration: 02m 19s)
  • 18:36 joal@deploy1001: Started deploy [analytics/refinery@8b7bddf] (thin): Regular analytics weekly train THIN [analytics/refinery@8b7bddf]
  • 18:35 joal@deploy1001: Finished deploy [analytics/refinery@8b7bddf]: Regular analytics weekly train [analytics/refinery@8b7bddf] (duration: 08m 09s)
  • 18:27 joal@deploy1001: Started deploy [analytics/refinery@8b7bddf]: Regular analytics weekly train [analytics/refinery@8b7bddf]
  • 18:25 joal@deploy1001: Finished deploy [analytics/refinery@114bfed]: Regular analytics weekly train [analytics/refinery@114bfed] (duration: 03m 41s)
  • 18:21 joal@deploy1001: Started deploy [analytics/refinery@114bfed]: Regular analytics weekly train [analytics/refinery@114bfed]
  • 18:18 ppchelko@deploy1001: Synchronized wmf-config/InitialiseSettings-labs.php: Enable kafka purges on wikitech gerrit:607590 IS-labs.php (duration: 01m 03s)
  • 18:07 ppchelko@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Deploy MediaModeration on all production wikis gerrit:608753 (duration: 01m 07s)
  • 17:14 XioNoX: set flex-flow-sizing to cr2-eqsin - T248394
  • 16:57 XioNoX: restart cr2-eqsin for software upgrade - T243080
  • 16:00 XioNoX: updating eqsin LVS BGP neighbors IPs - T255766
  • 15:16 XioNoX: re0.cr1-eqsin> request system power-off both-routing-engines - T255766
  • 15:15 XioNoX: disable BGP to pybal on cr1-eqsin - T255766
  • 15:13 XioNoX: disable cr1-eqsin transit/peering BGP - T255766
  • 15:09 XioNoX: bump eqsin-codfw ospf link cost - T255766
  • 15:03 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:03 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:03 XioNoX: move vrrp master to cr2-eqsin - T255766
  • 15:00 XioNoX: depool eqsin for routers work - T255766
  • 14:28 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:28 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:04 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:04 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 13:37 hashar: contint1001 stopped zuul-merger for a test. started it again
  • 13:35 hashar: Restarting zuul-merger on contint2001 # T252310
  • 13:30 hashar@deploy1001: Finished deploy [zuul/deploy@00f69b3]: (no justification provided) (duration: 00m 08s)
  • 13:30 hashar@deploy1001: Started deploy [zuul/deploy@00f69b3]: (no justification provided)
  • 13:29 hashar@deploy1001: Finished deploy [zuul/deploy@00f69b3]: (no justification provided) (duration: 00m 32s)
  • 13:28 hashar@deploy1001: Started deploy [zuul/deploy@00f69b3]: (no justification provided)
  • 13:16 hashar@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.39 (duration: 01m 04s)
  • 13:15 hashar@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.39
  • 13:10 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:10 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 13:09 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:09 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 13:08 cdanis: ✔️ cdanis@netflow2001.codfw.wmnet ~ 🕘☕ sudo apt remove valgrind libc6-dbg
  • 13:03 cdanis: T256790 ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕘☕ sudo cumin 'netflow[3-5]001*' 'systemctl restart nfacctd'
  • 12:58 cdanis: T256790 ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕘☕ sudo debdeploy deploy -u 2020-07-01-pmacct.yaml -s netflow
  • 12:55 cdanis: T256790 ✔️ cdanis@apt1001.wikimedia.org ~ 🕘☕ sudo -E reprepro -C main include buster-wikimedia pmacct_1.7.2-3+wmf1_amd64.changes
  • 12:53 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:53 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:47 ema: A:cp upgrade librdkafka1 to 0.11.6-1.1wmf1 and restart purged, varnishkafka T256444
  • 11:46 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T254315 Wikidata: Define entity sources configuration gerrit:569258 (duration: 01m 06s)
  • 11:32 Lucas_WMDE: EU B&C window done
  • 11:24 lucaswerkmeister-wmde@deploy1001: Synchronized w/touch.php: Config: Fully set MW_NO_SESSION for browser metadata endpoints, 4/4 (duration: 01m 06s)
  • 11:22 lucaswerkmeister-wmde@deploy1001: Synchronized w/robots.php: Config: Fully set MW_NO_SESSION for browser metadata endpoints, 3/4 (duration: 01m 03s)
  • 11:21 lucaswerkmeister-wmde@deploy1001: Synchronized w/favicon.php: Config: Fully set MW_NO_SESSION for browser metadata endpoints, 2/4 (duration: 01m 04s)
  • 11:19 lucaswerkmeister-wmde@deploy1001: Synchronized w/extract2.php: Config: Fully set MW_NO_SESSION for browser metadata endpoints, 1/4 (duration: 01m 16s)
  • 11:07 Amir1: Changing datatype of several properties with mwscript extensions/Wikibase/repo/maintenance/changePropertyDataType.php (T255241)
  • 11:07 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
  • 11:02 ema: restbase2009 depooled T256863
  • 11:02 ema@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase2009.codfw.wmnet
  • 10:50 ema: power on restbase2009
  • 10:45 jayme: draining and docker restart (one at a time) kubernetes[1001-1004].eqiad.wmnet - T256786
  • 10:34 ema: power-cycle restbase2009
  • 10:17 XioNoX: renumber NTT transit links - T254877
  • 10:16 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:16 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:14 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:14 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:09 jayme: draining and docker restart (one at a time) kubernetes[2001-2004].codfw.wmnet
  • 09:52 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:52 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:46 jayme: cordoning kubernetes[2001-2004].codfw.wmnet,kubernetes[1001-1004].eqiad.wmnet - T256786
  • 09:42 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:42 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:34 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:34 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:23 jayme: restarting dockerd on kubestage1002.eqiad.wmnet - T256786
  • 09:15 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:15 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:08 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:08 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:53 jayme: draining kubernetes staging node kubestage1001.eqiad.wmnet - T256786
  • 08:52 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:52 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:44 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:44 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:29 XioNoX: disable BGP to nfacct in eqiad - T256790
  • 08:23 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:23 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:08 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
  • 08:05 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:05 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:01 vgutierrez: rolling restart of esams cache nodes to catch up on kernel upgrades
  • 07:42 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 07:42 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 07:40 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 07:40 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 07:39 ema: cp2041: restart purged, varnishkafka after librdkafka1 upgrade to 0.11.6-1.1wmf1 T256444
  • 05:47 _joe_: restarting nfacctd on netflow1001, it's segfaulting
  • 04:01 krinkle@deploy1001: Synchronized php-1.35.0-wmf.39/maintenance/findBadBlobs.php: I47c11190b665 (duration: 01m 08s)
  • 00:14 krinkle@deploy1001: Synchronized private/PrivateSettings.php: T254795 - Set $wmgXhguiDBuser and $wmgXhguiDBpasswor (duration: 01m 06s)

2020-06-30

  • 21:48 crusnov@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 21:46 crusnov@cumin1001: START - Cookbook sre.hosts.reboot-single
  • 21:45 crusnov@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 21:43 crusnov@cumin1001: START - Cookbook sre.hosts.reboot-single
  • 21:42 crusnov@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 21:40 crusnov@cumin1001: START - Cookbook sre.hosts.reboot-single
  • 21:40 crusnov@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 21:38 crusnov@cumin1001: START - Cookbook sre.hosts.reboot-single
  • 21:38 crusnov@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99)
  • 21:38 crusnov@cumin1001: START - Cookbook sre.hosts.reboot-single
  • 19:19 hashar@deploy1001: rebuilt and synchronized wikiversions files: group 0 wikis to 1.35.0-wmf.39 # T254176
  • 18:31 cdanis: T256790 ✔️ cdanis@netflow2001.codfw.wmnet ~ ☕ sudo apt install valgrind
  • 18:27 tgr: Morning deploys done
  • 18:23 tgr@deploy1001: Synchronized php-1.35.0-wmf.39/extensions/ElectronPdfService/src/ElectronPdfServiceHooks.php: Backport: Hotfix: "Undefined index: print" (T256761) (duration: 01m 05s)
  • 18:11 shdubsh: restart varnishmtail,atsmtail,ncredirmtail on ncredir,cp hosts in codfw and eqsin
  • 18:05 cdanis: installing libc6-dbg on netflow2001 T256790
  • 17:40 mdholloway: mobileapps deployments on k8s failing with timeouts; filed T256786
  • 17:37 cdanis: ✔️ cdanis@netflow2001.codfw.wmnet ~ 🕜☕ sudo systemctl restart nfacctd
  • 17:33 mholloway-shell@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
  • 17:18 mholloway-shell@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
  • 17:17 papaul: unplugging msw-c3 power to relocate port on PDU
  • 17:09 mholloway-shell@deploy1001: Finished deploy [mobileapps/deploy@f9df1af]: Update mobileapps to 5c7611b9 (duration: 03m 33s)
  • 17:05 mholloway-shell@deploy1001: Started deploy [mobileapps/deploy@f9df1af]: Update mobileapps to 5c7611b9
  • 16:57 cdanis: T256444 restarted purged on cp2030 and repooling
  • 16:48 cdanis: T256444 ✔️ cdanis@cp2030.codfw.wmnet ~ ☕ sudo depool
  • 15:54 otto@deploy1001: Finished deploy [analytics/refinery@d63944e]: Deploying new camus wmf10 jar to an-launcher1002 for T256370 - take 3 (duration: 00m 03s)
  • 15:54 otto@deploy1001: Started deploy [analytics/refinery@d63944e]: Deploying new camus wmf10 jar to an-launcher1002 for T256370 - take 3
  • 15:40 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 15:35 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 15:32 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 15:30 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 15:24 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 15:20 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 15:16 otto@deploy1001: Finished deploy [analytics/refinery@1112749]: roll back to 1112749 on an-launcher1002, git-fat not pulling artifacts (duration: 01m 21s)
  • 15:14 otto@deploy1001: Started deploy [analytics/refinery@1112749]: roll back to 1112749 on an-launcher1002, git-fat not pulling artifacts
  • 15:14 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 15:10 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 15:10 moritzm: rebooting mwdebug* hosts for kernel security update
  • 15:05 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 15:03 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 15:01 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 14:59 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 14:59 moritzm: rebooting failoid hosts for kernel update
  • 14:49 otto@deploy1001: Finished deploy [analytics/refinery@d63944e]: Deploying new camus wmf10 jar to an-launcher1002 for T256370 - take 3 (duration: 00m 03s)
  • 14:49 otto@deploy1001: Started deploy [analytics/refinery@d63944e]: Deploying new camus wmf10 jar to an-launcher1002 for T256370 - take 3
  • 14:47 otto@deploy1001: Finished deploy [analytics/refinery@d63944e]: Deploying new camus wmf10 jar to an-launcher1002 for T256370 - take 2 (duration: 00m 03s)
  • 14:47 otto@deploy1001: Started deploy [analytics/refinery@d63944e]: Deploying new camus wmf10 jar to an-launcher1002 for T256370 - take 2
  • 14:44 hashar: Train blocked on Flow being broken: T256761 # T254176
  • 14:38 hashar@deploy1001: rebuilt and synchronized wikiversions files: Revert "group0 wikis to 1.35.0-wmf.39" - T256759
  • 14:34 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 14:32 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 14:28 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 14:26 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 14:25 hashar@deploy1001: rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.39
  • 14:21 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:21 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:21 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 14:18 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 14:18 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 14:15 moritzm: rebooting miscweb servers for kernel security update
  • 14:15 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 14:13 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 14:11 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 14:10 otto@deploy1001: Finished deploy [analytics/refinery@d63944e]: Deploying new camus wmf10 jar to an-launcher1002 for T256370 (duration: 01m 56s)
  • 14:10 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 14:09 hashar@deploy1001: Finished scap: testwikis wikis to 1.35.0-wmf.39 (duration: 62m 30s)
  • 14:08 otto@deploy1001: Started deploy [analytics/refinery@d63944e]: Deploying new camus wmf10 jar to an-launcher1002 for T256370
  • 14:08 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 13:59 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:59 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 13:59 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 13:57 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 13:51 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 13:49 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 13:37 moritzm: rebooting LDAP replicas for kernel security update
  • 13:32 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:32 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 13:31 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:30 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 13:07 hashar@deploy1001: Started scap: testwikis wikis to 1.35.0-wmf.39
  • 12:03 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:03 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:35 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:35 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:33 awight: EU BACON cooked
  • 11:32 awight@deploy1001: Synchronized wmf-config/InitialiseSettings.php: BACON: Configure TeWü survey on dewiki (take 2) (T253112) (duration: 00m 58s)
  • 11:32 jayme: restarted docker-reporter-base-images and docker-reporter-releng-images on deneb - T253396
  • 11:31 jayme: pushed a scratch docker image as docker-registry.discovery.wmnet/envoy-tls-local-proxy:dontuseme - T253396
  • 11:28 awight@deploy1001: Synchronized php-1.35.0-wmf.38/extensions/QuickSurveys: BACON: Embedded surveys are hidden when no element is available (T256627) (duration: 00m 56s)
  • 11:26 awight@deploy1001: Synchronized php-1.35.0-wmf.38/extensions/FileImporter: BACON: Set Status error if permission check returns false. (T256428) (duration: 00m 58s)
  • 11:13 ema: deneb: systemctl restart docker-reporter-base-images.service
  • 10:59 ema: upload librdkafka 0.11.6-1.1wmf1 to buster-wikimedia https://phabricator.wikimedia.org/P11703 T256444
  • 10:59 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:59 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:52 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1076', diff saved to https://phabricator.wikimedia.org/P11710 and previous config saved to /var/cache/conftool/dbconfig/20200630-105254-marostegui.json
  • 10:45 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:45 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:41 ema: cp2040: restart purged and varnishkafka to use updated librdkafka1 T256444
  • 10:38 ema: cp2040: upgrade librdkafka1 to 0.11.6-1.1wmf1 https://phabricator.wikimedia.org/P11703 T256444
  • 10:37 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:37 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:30 hashar@deploy1001: Synchronized php-1.35.0-wmf.39/includes/specials/SpecialUndelete.php: Remove another use of PageArchive::getRevision - T249982 T254176 (duration: 00m 56s)
  • 10:09 marostegui: Deploy schema change on db1076
  • 10:09 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1076', diff saved to https://phabricator.wikimedia.org/P11708 and previous config saved to /var/cache/conftool/dbconfig/20200630-100912-marostegui.json
  • 10:04 vgutierrez: rolling restart of eqiad cache nodes to catch up on kernel upgrades
  • 10:03 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:03 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:02 volker-e@deploy1001: Finished deploy [design/style-guide@e3fda83]: Deploy design/style-guide: (duration: 00m 07s)
  • 10:02 volker-e@deploy1001: Started deploy [design/style-guide@e3fda83]: Deploy design/style-guide:
  • 09:47 hashar@deploy1001: Pruned MediaWiki: 1.35.0-wmf.37 (duration: 02m 20s)
  • 09:40 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:40 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:21 hashar@deploy1001: Pruned MediaWiki: 1.35.0-wmf.36 (duration: 28m 11s)
  • 08:54 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:53 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:53 hashar@deploy1001: clean aborted: Pruned MediaWiki: 1.35.0-wmf.36 (duration: 00m 00s)
  • 08:51 hashar: Applied security patches to wmf/1.35.0-wmf.39 # T254176
  • 08:51 vgutierrez: rolling restart of codfw cp nodes after "re-formatting" nvme devices - T256655
  • 08:23 vgutierrez: repool cp3053 - T256632
  • 08:10 hashar: 1.35.0-wmf.39 was branched at e169e3d T254176
  • 08:05 marostegui: Stop MySQL on db1117:3322 to clone db1080 (this will trigger haproxy alerts) - T256717
  • 08:05 vgutierrez: powercycle cp3053 (unresponsive after reboot) - T256632
  • 08:01 jbond42: disable puppet to restart puppetmasters front ends
  • 07:42 vgutierrez: reboot cp3053 - T256632
  • 05:51 jhuneidi@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
  • 05:13 marostegui: Deploy schema change on s8 codfw - T256680
  • 04:58 marostegui: remove pl_from index from db1141, db1121, db1148 - T256684
  • 04:57 jhuneidi@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
  • 04:56 marostegui: Remove plfrom from db1096:3316 and db1098:3316 - T256684

2020-06-29

  • 23:28 eileen: civicrm revision changed from 52a32f2d66 to 391d0fdf75, config revision is f1b4bdb7b7
  • 22:00 sbassett: Deployed patch for T256171
  • 21:56 sbassett: Deployed patch for T255918
  • 20:00 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1144:3315 T256679', diff saved to https://phabricator.wikimedia.org/P11699 and previous config saved to /var/cache/conftool/dbconfig/20200629-200002-marostegui.json
  • 19:43 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1144:3315 T256679', diff saved to https://phabricator.wikimedia.org/P11698 and previous config saved to /var/cache/conftool/dbconfig/20200629-194327-marostegui.json
  • 18:55 shdubsh: test mtail rc35+wmf2 on cp5001 - T255776
  • 18:15 Urbanecm: Morning B&C done
  • 18:15 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: c86fcd4: Add HTTP proxy to MediaModeration (T247943) (duration: 00m 58s)
  • 18:10 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: aeb7b52: Setup rollbacker and mover on lijwiki (T256109) (duration: 02m 05s)
  • 17:30 sukhe: LDAP - added datn to groups wmde, nda - T254442
  • 15:43 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:43 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:37 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:37 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:31 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1105:3312', diff saved to https://phabricator.wikimedia.org/P11696 and previous config saved to /var/cache/conftool/dbconfig/20200629-153140-marostegui.json
  • 15:20 gehel: repool wdqs1004 - caught up on lag
  • 14:50 hnowlan@deploy1001: Finished deploy [restbase/deploy@900bcf6]: Redeploy to fix transient error in gom wiktionary deploy (duration: 00m 06s)
  • 14:50 hnowlan@deploy1001: Started deploy [restbase/deploy@900bcf6]: Redeploy to fix transient error in gom wiktionary deploy
  • 14:48 hnowlan@deploy1001: Finished deploy [restbase/deploy@900bcf6]: Enable gom wiktionary (duration: 13m 40s)
  • 14:34 hnowlan@deploy1001: Started deploy [restbase/deploy@900bcf6]: Enable gom wiktionary
  • 14:33 hnowlan@deploy1001: Finished deploy [restbase/deploy@900bcf6]: Enable gom wiktionary (duration: 17m 49s)
  • 14:28 ema: A:cp rolling purged upgrade to 0.16 T256479
  • 14:22 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Config: Add "E" as an alias of EntitySchema namespace on wikidata (T245529) (duration: 00m 57s)
  • 14:20 ema: upload purged 0.16 to apt.wm.org T256479
  • 14:16 hnowlan@deploy1001: Started deploy [restbase/deploy@900bcf6]: Enable gom wiktionary
  • 14:14 hnowlan@deploy1001: Finished deploy [restbase/deploy@ce5177e]: Enable gom wiktionary (duration: 20m 44s)
  • 14:02 jforrester@deploy1001: Synchronized multiversion/MWConfigCacheGenerator.php: Fix 'closed-labs' reading as 'closed' for static config (duration: 00m 56s)
  • 13:54 jforrester@deploy1001: Synchronized dblists/: Drop nonbetafeatures dblist, unused (duration: 00m 57s)
  • 13:54 hnowlan@deploy1001: Started deploy [restbase/deploy@ce5177e]: Enable gom wiktionary
  • 13:50 jforrester@deploy1001: Synchronized multiversion/MWConfigCacheGenerator.php: Drop 'nonbetafeatures' dblist from production reads (duration: 00m 56s)
  • 13:49 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Switch uses from nonbetafeatures to lockeddown (duration: 00m 57s)
  • 13:47 jforrester@deploy1001: Synchronized multiversion/MWConfigCacheGenerator.php: Add 'lockeddown' dblist to production reads (duration: 00m 57s)
  • 13:43 jforrester@deploy1001: Synchronized dblists/lockeddown.dblist: Add lockeddown dblist (unused as yet) (duration: 00m 59s)
  • 13:35 vgutierrez: depool cp3053 due to nvme hardware issues
  • 13:02 XioNoX: test pfw3-codfw uplinks failover
  • 13:00 elukey: move archiva.wikimedia.org to archiva1002 (new buster vm); create archiva-old.wikimedia.org to archiva1001
  • 12:58 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1105:3312', diff saved to https://phabricator.wikimedia.org/P11693 and previous config saved to /var/cache/conftool/dbconfig/20200629-125824-marostegui.json
  • 12:56 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1085', diff saved to https://phabricator.wikimedia.org/P11692 and previous config saved to /var/cache/conftool/dbconfig/20200629-125630-marostegui.json
  • 12:41 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:38 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:32 jayme: deleted all tags for docker-registry.wikimedia.org/envoy-tls-local-proxy from docker registry - T253396
  • 12:20 marostegui: Stop MySQL on db2096 (codfw x1 master) for reimage T254871
  • 12:03 cdanis: re-pool eqiad T256512
  • 11:59 cdanis: deployed I132075ee on cr1-eqiad T256512
  • 11:58 cdanis: deployed I132075ee on cr2-eqiad T256512
  • 11:58 cdanis: deployed I132075ee on cr2-eqiad
  • 11:41 cdanis: depool eqiad T256512
  • 11:15 awight: EU BACON cooked
  • 11:08 marostegui: Deploy schema change on db1095:3312 (lag will show up)
  • 10:41 jdrewniak@deploy1001: Synchronized portals: Wikimedia Portals Update: Bumping portals to master (608284) (duration: 00m 57s)
  • 10:41 jdrewniak@deploy1001: Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: Bumping portals to master (608284) (duration: 00m 58s)
  • 10:29 gehel: restart blazegraph on wdqs1004 + depool to catch up on lag
  • 09:59 ema: cp2040: upgrade purged to 0.16 T256479
  • 09:59 jbond42: switch idp to memcached
  • 09:47 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:47 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 09:45 marostegui: Deploy schema change on dbstore1004:3312
  • 09:11 jbond42: deploying shellcheck CI https://gerrit.wikimedia.org/r/c/operations/puppet/+/602693
  • 08:59 marostegui: Compress InnoDB on db1089 (this will cause lag and will take a few days) - T254462
  • 08:58 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1089 for InnoDB compression T254462', diff saved to https://phabricator.wikimedia.org/P11690 and previous config saved to /var/cache/conftool/dbconfig/20200629-085854-marostegui.json
  • 08:48 marostegui@cumin1001: dbctl commit (dc=all): 'Fully pool db1135 into s1 T253217', diff saved to https://phabricator.wikimedia.org/P11688 and previous config saved to /var/cache/conftool/dbconfig/20200629-084827-marostegui.json
  • 08:40 ema: cp2034: restart purged T256444
  • 08:36 ema: cp4025: restart purged T256444
  • 08:36 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly pool db1135 into s1 T253217', diff saved to https://phabricator.wikimedia.org/P11687 and previous config saved to /var/cache/conftool/dbconfig/20200629-083631-marostegui.json
  • 08:33 ema: cp1087, cp2033, cp2037, cp2039: repool after spending (way) more than 24h depooled T256444
  • 08:26 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly pool db1135 into s1 T253217', diff saved to https://phabricator.wikimedia.org/P11686 and previous config saved to /var/cache/conftool/dbconfig/20200629-082635-marostegui.json
  • 08:24 marostegui: Deploy schema change on s2 codfw (lag will show up) T253276
  • 08:04 XioNoX: add term selected-paths to policy BGP_IXP_in on all routers
  • 08:03 godog: prometheus eqiad -- lvextend --resizefs --size +200G vg-ssd/prometheus-ops
  • 08:02 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly pool db1135 into s1 T253217', diff saved to https://phabricator.wikimedia.org/P11685 and previous config saved to /var/cache/conftool/dbconfig/20200629-080253-marostegui.json
  • 07:46 marostegui@cumin1001: dbctl commit (dc=all): 'Add db1135 (depooled) to s1 T253217', diff saved to https://phabricator.wikimedia.org/P11684 and previous config saved to /var/cache/conftool/dbconfig/20200629-074611-marostegui.json
  • 07:16 XioNoX: push new pfw firewall rules - T256170
  • 07:13 marostegui: Deploy schema change on db1085 with replication to labs T253276
  • 07:12 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1085', diff saved to https://phabricator.wikimedia.org/P11683 and previous config saved to /var/cache/conftool/dbconfig/20200629-071236-marostegui.json
  • 06:53 marostegui@cumin1001: dbctl commit (dc=all): 'Remove db1080 from MW', diff saved to https://phabricator.wikimedia.org/P11682 and previous config saved to /var/cache/conftool/dbconfig/20200629-065335-marostegui.json
  • 06:50 elukey: execute gnt-instance remove an-launcher1001.eqiad.wmnet on ganeti1011 - T256363
  • 06:47 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 06:46 elukey@cumin1001: START - Cookbook sre.hosts.decommission
  • 06:45 marostegui: Deploy MCR schema change on db1090:3312
  • 06:35 elukey: force puppet run on ores* to overcome celery OOMs on some nodes
  • 04:57 marostegui: Stop MySQL on db1080 to clone db1135 T253217
  • 04:56 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 04:53 marostegui@cumin1001: START - Cookbook sre.hosts.downtime

2020-06-28

  • 21:43 krinkle@deploy1001: Synchronized wmf-config/CommonSettings.php: no-op I56eb4a802 (duration: 00m 58s)
  • 21:38 krinkle@deploy1001: Synchronized wmf-config/InitialiseSettings-labs.php: beta-only I56eb4a802 (duration: 01m 00s)

2020-06-27

  • 20:22 qchris: Gerrit upgrade done.
  • 19:49 mutante: removed 2620:0:861:3:208:80:154:136 from /etc/network/interfaces on gerrit1001, rebooting
  • 19:27 mutante: rebooting gerrit1001 one more time
  • 19:24 mutante: restarted ferm on gerrit1001
  • 19:19 mutante: rebooting gerrit1001 one more time
  • 19:05 mutante: rebooting gerrit1001
  • 18:58 mutante: rebooting gerrit2001
  • 18:49 hashar: Enabling beta cluster update job (gerrit maintenance) https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/
  • 18:35 qchris@deploy1001: Finished deploy [gerrit/gerrit@da40615]: Gerrit to v3.2.2-98-g98d827eaa3 on gerrit2001 (duration: 00m 10s)
  • 18:34 qchris@deploy1001: Started deploy [gerrit/gerrit@da40615]: Gerrit to v3.2.2-98-g98d827eaa3 on gerrit2001
  • 18:27 qchris@deploy1001: Finished deploy [gerrit/gerrit@da40615]: Gerrit to v3.2.2-98-g98d827eaa3 on gerrit1001 (duration: 00m 08s)
  • 18:27 qchris@deploy1001: Started deploy [gerrit/gerrit@da40615]: Gerrit to v3.2.2-98-g98d827eaa3 on gerrit1001
  • 17:25 hashar: Disabled beta cluster update job (gerrit maintenance) https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/
  • 17:19 qchris: Stopping gerrit on gerrit1001 for the Gerrit upgrade
  • 17:14 qchris: Duplicating reviewdb changes so we get a cheap and quick rollback
  • 17:11 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 17:11 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:11 qchris: Disabling puppet on gerrit1001 for Gerrit upgrades + data migrations
  • 17:11 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 17:11 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:07 qchris: Starting Gerrit upgrade to v3.2.2-98-g98d827eaa3
  • 15:44 qchris@deploy1001: Finished deploy [gerrit/gerrit@da40615]: Gerrit to v3.2.2-98-g98d827eaa3 on gerrit1002 (gerrit-test) (duration: 00m 08s)
  • 15:44 qchris@deploy1001: Started deploy [gerrit/gerrit@da40615]: Gerrit to v3.2.2-98-g98d827eaa3 on gerrit1002 (gerrit-test)
  • 13:03 qchris@deploy1001: Finished deploy [gerrit/gerrit@460e439]: Gerrit to v3.2.2-97-gcaf5020db1 on gerrit1002 (gerrit-test) (duration: 00m 08s)
  • 13:03 qchris@deploy1001: Started deploy [gerrit/gerrit@460e439]: Gerrit to v3.2.2-97-gcaf5020db1 on gerrit1002 (gerrit-test)

2020-06-26

  • 18:42 robh: all ulsfo onsite work completed as of 30 minutes ago
  • 17:52 robh: msw2-ulsfo work done, all mgmt items confirmed back online and icinga alerts cleared, moving onto msw1-ulsfo (rack 22) and will lose all mgmt in that rack for next 10-20 minutes T256300
  • 17:52 robh: msw2-ulsfo work done, all mgmt items confirmed back online and icinga alerts cleared, moving onto msw1-ulsfo (rack 22) and will lose all mgmt in that rack for next 10-20 minutes
  • 17:11 robh: msw work in ulsfo via T256300
  • 10:24 ema: pool 5006 T256449
  • 10:22 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1085', diff saved to https://phabricator.wikimedia.org/P11677 and previous config saved to /var/cache/conftool/dbconfig/20200626-102248-marostegui.json
  • 10:22 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1093', diff saved to https://phabricator.wikimedia.org/P11676 and previous config saved to /var/cache/conftool/dbconfig/20200626-102201-marostegui.json
  • 10:03 ema: cp2039: restart purged T256444
  • 09:57 ema: cp2037: restart purged T256444
  • 09:55 ema: cp1087: restart purged T256444
  • 09:46 ema: cp2033: restart purged T256444
  • 09:38 akosiaris: move the sessionstore eqiad pods back to the dedicated sessionstore nodes
  • 09:37 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'sessionstore' for release 'production' .
  • 09:35 akosiaris: move the sessionstore codfw pods back to the dedicated sessionstore nodes
  • 09:35 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'sessionstore' for release 'production' .
  • 09:08 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1093 for schema change', diff saved to https://phabricator.wikimedia.org/P11675 and previous config saved to /var/cache/conftool/dbconfig/20200626-090813-marostegui.json
  • 08:58 jynus@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:56 jynus@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:33 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1088', diff saved to https://phabricator.wikimedia.org/P11674 and previous config saved to /var/cache/conftool/dbconfig/20200626-083319-marostegui.json
  • 08:25 ayounsi@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 08:22 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1088 for schema change', diff saved to https://phabricator.wikimedia.org/P11673 and previous config saved to /var/cache/conftool/dbconfig/20200626-082242-marostegui.json
  • 08:20 ayounsi@cumin1001: START - Cookbook sre.dns.netbox
  • 08:20 ayounsi@cumin1001: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
  • 08:05 akosiaris@cumin1001: conftool action : set/pooled=yes; selector: name=kubernetes.*.wmnet
  • 08:04 akosiaris@cumin1001: conftool action : set/weight=10; selector: name=kubernetes.*.wmnet
  • 08:04 akosiaris: pool all new kubernetes nodes in LVS T252185 T256236
  • 07:57 ayounsi@cumin1001: START - Cookbook sre.dns.netbox
  • 07:44 volans: force rebooted cp5006 that is unresponsive (after having depooled it) - T256449
  • 07:42 volans@cumin1001: conftool action : set/pooled=no; selector: name=cp5006.eqsin.wmnet
  • 06:40 tstarling@deploy1001: Synchronized wmf-config/InitialiseSettings.php: add cache-cookies log channel (duration: 00m 59s)
  • 05:13 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2088:3312, db2104', diff saved to https://phabricator.wikimedia.org/P11672 and previous config saved to /var/cache/conftool/dbconfig/20200626-051328-marostegui.json
  • 05:06 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 05:03 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 04:01 cdanis: re-enable puppet on cps
  • 03:54 cdanis: ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕛🍺 sudo cumin A:cp 'disable-puppet "I39e1c68a is broken"'
  • 03:54 cdanis: https://gerrit.wikimedia.org/r/c/operations/puppet/+/607917
  • 02:52 tstarling@deploy1001: Synchronized private/PrivateSettings.php: updating wgAuthenticationTokenVersion per my wikitech-l post (duration: 00m 57s)
  • 02:19 cdanis: three more hosts not processing purges for multiple days ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕥🍺 sudo cumin 'cp2033*,cp2037*,cp2039*' 'depool'
  • 02:17 cdanis: depooling cp1087 which has not been processing purges for 11.415 days
  • 01:53 cdanis: I6cc5f3e6 has been deployed to all cp text nodes T256395
  • 01:41 cdanis: ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕘🍺 sudo cumin A:cp 'enable-puppet "cdanis deploying I6cc5f3e6 T256395"'
  • 01:13 cdanis: ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕘🍺 sudo cumin A:cp 'disable-puppet "cdanis deploying I6cc5f3e6 T256395"'
  • 00:41 eileen: tools revision changed from c96813eda4 to aab96444df
  • 00:38 tstarling@deploy1001: Synchronized w/T256395-cookie-test.php: (no justification provided) (duration: 00m 56s)
  • 00:36 tstarling@deploy1001: Synchronized w/T256395-cookie-test.php: (no justification provided) (duration: 00m 58s)

2020-06-25

  • 23:37 mutante: puppetmaster - signing certs and initial puppet run for logstash1030/logstash1031 - no prod role yet
  • 22:25 mutante: puppetmaster - signing certs and initial run for logstash2030/2031 - no prod role yet
  • 20:57 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 20:31 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 19:30 dcausse: repooling wdqs1007.eqiad.wmnet
  • 19:05 brennen@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.38
  • 18:58 mutante: LDAP - added qchris to archiva-deployers (T256404)
  • 17:37 mutante: mwmaint1002 - restarted apache2 to add server_headers snippet for T255629 - but not working as expected yet
  • 16:40 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 16:31 sukhe@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 16:31 sukhe@cumin1001: START - Cookbook sre.hosts.downtime
  • 16:28 krinkle@deploy1001: Synchronized wmf-config/logging.php: Ia6ef7617d378 (duration: 01m 02s)
  • 16:20 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 16:16 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 16:16 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 16:16 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 16:15 Krinkle: I've deleted a "saved object" visualisation in logstash called "Production Errors & Deployments" which seemed to be corrupt and redirect random logstash dashboards to a management page. Backed up at https://phabricator.wikimedia.org/P11666 (NDA)
  • 16:15 moritzm: installing libxml2 security updates
  • 16:12 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 16:09 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 16:06 moritzm: installing 4.9.210-1+deb9u1~deb8u1 on jessie hosts (fixed kernel for recent cacheoutattack CPU leaks)
  • 16:04 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 16:03 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 15:59 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 15:55 krinkle@deploy1001: Synchronized wmf-config/logging.php: I4c519f (duration: 01m 05s)
  • 15:54 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 15:53 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 15:51 vgutierrez: upgrade ATS in eqiad to version 8.0.8
  • 15:42 ppchelko@deploy1001: Finished deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, more groups (duration: 05m 09s)
  • 15:37 ppchelko@deploy1001: Started deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, more groups
  • 15:37 ppchelko@deploy1001: Finished deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, more groups (duration: 03m 38s)
  • 15:33 ppchelko@deploy1001: Started deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, more groups
  • 15:33 ppchelko@deploy1001: Finished deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, more groups (duration: 03m 24s)
  • 15:30 vgutierrez: upgrade ATS in codfw to version 8.0.8
  • 15:30 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 15:30 ppchelko@deploy1001: Started deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, more groups
  • 15:29 ppchelko@deploy1001: Finished deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, take 2 (duration: 06m 38s)
  • 15:29 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 15:25 reedy@deploy1001: Synchronized wmf-config/CommonSettings.php: structured logging for xff log, stop logging jobrunner requests (duration: 01m 05s)
  • 15:23 ppchelko@deploy1001: Started deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, take 2
  • 15:20 ppchelko@deploy1001: Finished deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358 (duration: 01m 37s)
  • 15:19 ppchelko@deploy1001: Started deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358
  • 14:51 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:48 elukey@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:43 vgutierrez: upgrade ATS in esams to version 8.0.8
  • 14:29 papaul: replacing mr1-codfw
  • 14:24 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 14:19 vgutierrez: upgrade ATS in eqsin to version 8.0.8
  • 14:19 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 14:17 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 14:12 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 14:05 marostegui: Stop MySQL on db2104 and db2088:3312
  • 14:05 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2104', diff saved to https://phabricator.wikimedia.org/P11664 and previous config saved to /var/cache/conftool/dbconfig/20200625-140519-marostegui.json
  • 14:04 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 14:04 marostegui@cumin2001: dbctl commit (dc=all): 'Depool db2088:3312', diff saved to https://phabricator.wikimedia.org/P11663 and previous config saved to /var/cache/conftool/dbconfig/20200625-140421-marostegui.json
  • 13:59 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 13:57 reedy@deploy1001: Synchronized wmf-config/CommonSettings.php: T254301 Remove OAuthReplaceMessage hook subscriber (duration: 01m 05s)
  • 13:56 vgutierrez: upgrade ATS in ulsfo to version 8.0.8
  • 13:51 vgutierrez: upload trafficserver 8.0.8 to apt.wm.o (buster)
  • 13:51 reedy@deploy1001: Synchronized wmf-config/CommonSettings-labs.php: Replace PasswordNotInLargeBlacklist with PasswordNotInCommonList (duration: 01m 05s)
  • 13:49 reedy@deploy1001: Synchronized wmf-config/CommonSettings.php: Replace PasswordNotInLargeBlacklist with PasswordNotInCommonList (duration: 01m 06s)
  • 13:40 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 13:36 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 13:30 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 13:28 godog: bounce logstash on logstash1007
  • 13:26 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 13:19 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 13:13 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 13:02 moritzm: installing 4.9.210-1+deb9u1~deb8u1 on jessie hosts (fixed kernel for recent cacheoutattack CPU leaks)
  • 12:55 elukey: rename notebook1003 to an-launcher1002 - T256363
  • 12:51 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 12:47 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 12:45 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 12:44 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 12:44 elukey@cumin1001: START - Cookbook sre.hosts.decommission
  • 12:42 moritzm: installing libmspack security updates
  • 12:39 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 12:34 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 12:32 moritzm: installing libssh2 security updates
  • 12:30 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 12:30 elukey@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 12:27 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:26 moritzm: installing libjpeg-turbo security updates
  • 12:25 elukey@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:25 elukey@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:21 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 12:17 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 12:13 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 12:09 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 11:55 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 11:55 moritzm: installing python3.4 security updates
  • 11:55 awight: EU BACON is cooked
  • 11:51 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 11:50 awight@deploy1001: Synchronized wmf-config/InitialiseSettings.php: BACON: Enable QuickSurveys on metawiki (T253112) (duration: 01m 05s)
  • 11:48 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 11:45 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 11:41 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 11:38 awight@deploy1001: Synchronized wmf-config/InitialiseSettings.php: BACON: Enable WMDE Tech Wishes survey configuration (T253112) (duration: 01m 09s)
  • 11:36 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 11:27 moritzm: rolling reboot of ms-be[1044-1059].eqiad.wmnet
  • 11:25 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 11:21 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 11:06 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 11:01 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 11:00 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 10:56 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 10:53 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 10:50 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 10:45 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 10:45 moritzm: rolling reboot of ms-be[2044-2056]
  • 10:41 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 10:38 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 10:33 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 10:32 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 10:27 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 10:25 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 10:22 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 10:18 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 10:17 akosiaris@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 10:17 akosiaris@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 10:13 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 10:12 akosiaris@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 10:12 akosiaris@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 10:07 akosiaris@cumin1001: START - Cookbook sre.ganeti.makevm
  • 10:07 akosiaris@cumin1001: START - Cookbook sre.ganeti.makevm
  • 10:07 akosiaris@cumin1001: START - Cookbook sre.ganeti.makevm
  • 10:07 akosiaris@cumin1001: START - Cookbook sre.ganeti.makevm
  • 10:04 akosiaris: poweroff kubestagetcd1004 and ganeti1005 for T244530
  • 10:04 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 10:00 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:00 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:00 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:00 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:59 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 09:58 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 09:57 akosiaris@cumin1001: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
  • 09:57 akosiaris@cumin1001: START - Cookbook sre.ganeti.makevm
  • 09:53 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 09:37 volans@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 09:34 volans@cumin1001: START - Cookbook sre.dns.netbox
  • 09:28 akosiaris: schedule downtime for eqiad wikifeeds as it's flapping too much without yet knowing why. T256358
  • 09:28 godog: extend lv on thanos-fe2001 and restart thanos-compact
  • 09:21 vgutierrez: rolling restart of ncredir instances to catch up on kernel updates
  • 09:13 joal@deploy1001: Finished deploy [analytics/refinery@4aba370] (thin): Analytics fix over weekly train THIN [analytics/refinery@4aba370] (duration: 00m 10s)
  • 09:13 joal@deploy1001: Started deploy [analytics/refinery@4aba370] (thin): Analytics fix over weekly train THIN [analytics/refinery@4aba370]
  • 09:13 joal@deploy1001: Finished deploy [analytics/refinery@4aba370]: Analytics fix over weekly train [analytics/refinery@4aba370] (duration: 16m 27s)
  • 09:01 vgutierrez: restarting acme-chief instances to catch up on kernel updates
  • 08:56 joal@deploy1001: Started deploy [analytics/refinery@4aba370]: Analytics fix over weekly train [analytics/refinery@4aba370]
  • 08:42 hashar: releases2002: restarted bacula-fd to take into account the puppet provided configuration # T247652
  • 08:14 jynus: restarting bacula-dir on backup1001
  • 08:09 akosiaris: restart etherpad-lite on etherpad1002
  • 08:03 marostegui: Failover m1 from db1135 to db1097 - T254556
  • 07:52 jynus: stop bacula-director on backup1001 for db maintenance T254556
  • 07:49 akosiaris@cumin1001: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
  • 07:49 akosiaris@cumin1001: START - Cookbook sre.ganeti.makevm
  • 07:49 akosiaris@cumin1001: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
  • 07:49 akosiaris@cumin1001: START - Cookbook sre.ganeti.makevm
  • 07:49 akosiaris@cumin1001: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
  • 07:48 akosiaris@cumin1001: START - Cookbook sre.ganeti.makevm
  • 07:48 akosiaris@cumin1001: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
  • 07:47 akosiaris@cumin1001: START - Cookbook sre.ganeti.makevm
  • 07:36 elukey: reboot an-launcher1001 for kernel upgrades
  • 07:18 elukey: reboot kafkamon* vms for kernel upgrades
  • 07:08 marostegui: Start pre switchover steps on m1 T254556
  • 06:40 elukey: reboot matomo1002 for kernel upgrades
  • 06:35 elukey: reboot archiva1002 (new vm, not yet in service) for kernel upgrades
  • 06:34 elukey: reboot archiva for kernel upgrades
  • 06:31 elukey: force puppet run on ores1003/1005 to restore celery (killed by the oom)
  • 06:24 elukey: reboot an-tool* vms for kernel upgrades
  • 06:23 elukey: reboot analytics-tool1004 for kernel upgrades (Superset host)
  • 06:22 elukey: reboot analytics-tool1001 for kernel upgrades
  • 06:19 elukey: execute ip addr flush ens5 on an-airflow1001 to clear RTNETLINK answers: File exists (error from ifup@ens5.service)
  • 06:03 elukey: reboot an-airflow1001 for kernel upgrades
  • 04:26 marostegui: Remove triggers from db2095:3312 - T238966
  • 04:25 marostegui: Deploy schema change on s2 codfw - T238966
  • 00:48 twentyafterfour: restart php-fpm on phab1001 to fix T256343
  • 00:12 twentyafterfour: phabricator updated, all seems normal
  • 00:11 twentyafterfour: updating phabricator to release/2020-06-25/1, momentary (<1 minute) downtime expected.

2020-06-24

  • 23:44 mutante: releases2002 - systemctl stop jenkins, kill 15244 (rogue jenkins process), start jenkins with systemctl start jenkins (T247652)
  • 23:43 mutante: releases1002 - kill rogue jenkins process, start jenkins with systemctl start jenkins (T247652)
  • 23:02 mutante: releases1002/2002 - disabling puppet, removing failing cron job to pull deployment_charts (because /srv/deployment-charts does not exist yet)
  • 21:45 shdubsh: install mtail 3.0.0~rc35+wmf2 on logstash1007 - T255776
  • 20:42 brennen@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.38 (duration: 01m 06s)
  • 20:41 brennen@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.38
  • 20:41 brennen: train 1.35.0-wmf.38: attempting to roll forward to group1 after php-fpm restart on mw1287 (T256305, T254175)
  • 20:32 cdanis: restarting php-fpm on mw1287 T256305
  • 20:32 bsitzmann@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
  • 20:30 bsitzmann@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
  • 20:28 bsitzmann@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
  • 20:14 halfak@deploy1001: Finished deploy [ores/deploy@1b87365]: T254505 (duration: 14m 08s)
  • 20:09 bsitzmann@deploy1001: Finished deploy [mobileapps/deploy@80c763d]: Update mobileapps to a413db4f (duration: 03m 37s)
  • 20:06 bsitzmann@deploy1001: Started deploy [mobileapps/deploy@80c763d]: Update mobileapps to a413db4f
  • 20:00 halfak@deploy1001: Started deploy [ores/deploy@1b87365]: T254505
  • 19:38 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Revert Migrate SearchSatisfaction from EventLogging to EventGate on group1 - T249261 (duration: 01m 06s)
  • 19:17 brennen@deploy1001: rebuilt and synchronized wikiversions files: Revert group1 wikis to 1.35.0-wmf.37
  • 19:11 brennen@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.38 (duration: 01m 04s)
  • 19:10 brennen@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.38
  • 19:01 brennen: train 1.35.0-wmf.38: finished triage meeting, clear to proceed to group 1 (T254175)
  • 18:53 joal@deploy1001: Finished deploy [analytics/refinery@1112749] (thin): Regular analytics weekly train THIN [analytics/refinery@1112749] (duration: 00m 09s)
  • 18:53 joal@deploy1001: Started deploy [analytics/refinery@1112749] (thin): Regular analytics weekly train THIN [analytics/refinery@1112749]
  • 18:53 joal@deploy1001: Finished deploy [analytics/refinery@1112749]: Regular analytics weekly train [analytics/refinery@1112749] (duration: 05m 50s)
  • 18:49 Urbanecm: Morning B&C deploy window is done
  • 18:48 cstone: payments-wiki revision changed from 28ad76dcd7 to 91852dbc9b
  • 18:47 Urbanecm: mwscript namespaceDupes.php --wiki=guwiki --fix (T255358)
  • 18:47 joal@deploy1001: Started deploy [analytics/refinery@1112749]: Regular analytics weekly train [analytics/refinery@1112749]
  • 18:46 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 2a1dfc5: Set namespace aliases for guwiki (T255358) (duration: 01m 05s)
  • 18:42 Urbanecm: mwscript namespaceDupes.php --wiki=banwiki --add-prefix=T255941 --fix (T255941)
  • 18:41 Urbanecm: Run mwscript namespaceDupes.php --wiki=banwiki --fix (T255941)
  • 18:41 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: c6d6c85: Set WP as a NS_PROJECT alias for banwiki (T255941) (duration: 01m 06s)
  • 18:38 Urbanecm: Run mwscript namespaceDupes.php dewiktionary --fix (T256242)
  • 18:38 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 2b93e0f: Define Rekonstruktion NS for dewiktionary (T256242) (duration: 01m 05s)
  • 18:29 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: dea9214: Revert "IS: Cleanup some redundant rows." (T256279) (duration: 01m 05s)
  • 18:25 ppchelko@deploy1001: Synchronized wmf-config/InitialiseSettings.php: EventBus: Emit kafka purges for everything gerrit:607298 (duration: 01m 05s)
  • 18:19 ppchelko@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable MediaModeration on group0 gerrit:607327 (duration: 01m 04s)
  • 18:08 ppchelko@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable click tracking in Vector on beta cluster gerrit:607136 IS.php (duration: 01m 05s)
  • 18:06 ppchelko@deploy1001: Synchronized wmf-config/InitialiseSettings-labs.php: Enable click tracking in Vector on beta cluster gerrit:607136 IS-labs.php (duration: 01m 07s)
  • 17:31 elukey: update archiva-ci user's password in Jenkins credentials plugin
  • 16:56 elukey: update archiva-deploy user's password in Jenkins credentials plugin
  • 16:46 ppchelko@deploy1001: Finished deploy [restbase/deploy@5f08f32]: Release PCS endpoints updates, feeds timed out, redo (duration: 05m 11s)
  • 16:41 ppchelko@deploy1001: Started deploy [restbase/deploy@5f08f32]: Release PCS endpoints updates, feeds timed out, redo
  • 16:40 ppchelko@deploy1001: Finished deploy [restbase/deploy@5f08f32]: Release PCS endpoints updates, take 2 (duration: 14m 11s)
  • 16:34 brennen@deploy1001: Finished scap: (no justification provided) (duration: 60m 22s)
  • 16:30 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 16:27 elukey@cumin1001: START - Cookbook sre.hosts.downtime
  • 16:26 ppchelko@deploy1001: Started deploy [restbase/deploy@5f08f32]: Release PCS endpoints updates, take 2
  • 16:17 elukey: reimage db1108 to debian Buster - T234826
  • 15:53 ppchelko@deploy1001: Finished deploy [restbase/deploy@386b736]: Revert (duration: 27m 21s)
  • 15:38 brennen: previous scap sync for T256151 - gerrit:607379 and gerrit:607380
  • 15:36 kormat@cumin1001: dbctl commit (dc=all): 'Pool db1088 @ 100% into s6 T255927', diff saved to https://phabricator.wikimedia.org/P11652 and previous config saved to /var/cache/conftool/dbconfig/20200624-153604-kormat.json
  • 15:34 brennen@deploy1001: Started scap: (no justification provided)
  • 15:25 ppchelko@deploy1001: Started deploy [restbase/deploy@386b736]: Revert
  • 15:24 ppchelko@deploy1001: deploy aborted: Release updates to PCS endpoints (duration: 05m 04s)
  • 15:20 jayme: rolling restart of swift-proxy on thanos-fe[2001-2003].codfw.wmnet,thanos-fe[1001-1003].eqiad.wmnet - T256020
  • 15:19 ppchelko@deploy1001: Started deploy [restbase/deploy@9686627]: Release updates to PCS endpoints
  • 15:06 brennen: merging backports and running a full scap sync for UBN at T256151
  • 15:00 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 14:57 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 14:57 moritzm: rebooting deneb for kernel update
  • 14:57 ema: rmlist teampractices T255525
  • 14:42 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Migrate SearchSatisfaction from EventLogging to EventGate on group0 - T249261 (duration: 01m 06s)
  • 13:28 nikerabbit@deploy1001: Synchronized wmf-config/CommonSettings.php: [config] 603167 Remove TranslationNotifications user settings 1/2 (2nd attempt, now with correct file) (duration: 01m 06s)
  • 13:23 marostegui: Deploy schema change on s6 eqiad primary master - T238966
  • 12:59 jbond42: update metamonitoring to use icinga-extmon.wikimedia.org
  • 12:23 akosiaris@cumin1001: conftool action : set/pooled=yes; selector: name=kubernetes1005.eqiad.wmnet
  • 12:23 akosiaris@cumin1001: conftool action : set/pooled=yes; selector: name=kubernetes1006.eqiad.wmnet
  • 12:19 akosiaris@cumin1001: conftool action : set/pooled=no; selector: name=kubernetes1006.eqiad.wmnet
  • 12:19 akosiaris@cumin1001: conftool action : set/pooled=no; selector: name=kubernetes1005.eqiad.wmnet
  • 12:19 akosiaris@cumin1001: conftool action : set/pooled=yes; selector: name=kubernetes2005.codfw.wmnet
  • 12:19 akosiaris@cumin1001: conftool action : set/pooled=yes; selector: name=kubernetes2006.codfw.wmnet
  • 12:17 akosiaris: depool/drain/reboot/pool kubernetes1005,6 for CPU capacity increase T256236
  • 12:14 akosiaris: reboot kubernetes2005,6 for CPU capacity increase T256236
  • 12:11 akosiaris: depool kubernetes2005,kubernetes2006 for CPU capacity increase T256236
  • 12:10 akosiaris: depool kubernetes2005,kubernetes2006 for CPU capacity increase
  • 12:05 akosiaris@cumin1001: conftool action : set/pooled=no; selector: name=kubernetes2006.codfw.wmnet
  • 12:05 akosiaris@cumin1001: conftool action : set/pooled=no; selector: name=kubernetes2005.codfw.wmnet
  • 12:04 awight: EU vegan BACON cooked
  • 12:03 awight@deploy1001: Synchronized php-1.35.0-wmf.38/extensions/GrowthExperiments: BACON: Help panel home screen menu item fixes (T255254) (duration: 01m 06s)
  • 11:40 nikerabbit@deploy1001: Synchronized private/PrivateSettings.php: Remove TranslationNotifications user settings 3/2 (duration: 01m 06s)
  • 11:35 nikerabbit@deploy1001: Synchronized private/readme.php: [config] 607414 Remove TranslationNotifications user settings 2/2 (duration: 01m 04s)
  • 11:28 nikerabbit@deploy1001: Synchronized wmf-config/InitialiseSettings.php: [config] 603167 Remove TranslationNotifications user settings 1/2 (duration: 01m 03s)
  • 11:09 awight@deploy1001: Synchronized wmf-config/CommonSettings.php: BACON: TwoColConflict: Talk page small deployment CommonSettings.php (T254458) (duration: 01m 17s)
  • 10:45 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 10:39 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 10:38 marostegui: Stop haproxy on dbproxy1003 T256216
  • 10:36 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:36 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 10:01 volans: Production management IP allocation must be done from Netbox from now on, see https://wikitech.wikimedia.org/wiki/DNS/Netbox#Cutoff_dates
  • 09:55 volans@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 09:53 kormat@cumin1001: dbctl commit (dc=all): 'Pool db1088 @ 75% into s6 T255927', diff saved to https://phabricator.wikimedia.org/P11648 and previous config saved to /var/cache/conftool/dbconfig/20200624-095338-kormat.json
  • 09:50 volans@cumin1001: START - Cookbook sre.dns.netbox
  • 09:36 kormat@cumin1001: dbctl commit (dc=all): 'Pool db1088 @ 50% into s6 T255927', diff saved to https://phabricator.wikimedia.org/P11647 and previous config saved to /var/cache/conftool/dbconfig/20200624-093624-kormat.json
  • 09:13 marostegui@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:10 marostegui@cumin2001: START - Cookbook sre.hosts.downtime
  • 08:40 moritzm: prune remaining nginx packages on mw* servers T255565
  • 08:31 kormat@cumin1001: dbctl commit (dc=all): 'Pool db1088 @ 20% into s6 T255927', diff saved to https://phabricator.wikimedia.org/P11645 and previous config saved to /var/cache/conftool/dbconfig/20200624-083120-kormat.json
  • 08:06 moritzm: re-enable puppet in eqiad
  • 08:04 marostegui@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 08:04 marostegui@cumin2001: START - Cookbook sre.hosts.downtime
  • 08:00 moritzm: disable puppet in eqiad to unblock puppetdb1002 VM migration
  • 07:22 gehel: restarting blazegraph on wdqs1007
  • 06:53 moritzm: draining ganeti1009 for eventual reboot
  • 06:28 XioNoX: enable peering BGP sessions on AMS-IX - T253970
  • 05:59 XioNoX: disable peering BGP sessions on AMS-IX - T253970
  • 05:34 marostegui@cumin2001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 05:33 marostegui@cumin2001: START - Cookbook sre.hosts.decommission
  • 05:14 marostegui: Remove grants from dbproxy1008 - T231280 T255406
  • 05:03 marostegui: Remove revision triggers from db1125:3316
  • 05:02 marostegui@cumin2001: dbctl commit (dc=all): 'Depool db1085 for MCR schema change', diff saved to https://phabricator.wikimedia.org/P11643 and previous config saved to /var/cache/conftool/dbconfig/20200624-050235-marostegui.json
  • 04:53 marostegui: Reload haproxy on dbproxy1012 and dbproxy1014
  • 00:35 ejegg: restarted fundraising jobs on main CiviCRM box
  • 00:33 ejegg: updated Fundraising CiviCRM from f01b036128 to 52a32f2d66

2020-06-23

  • 23:16 wkandek: releases1002 is back after being moved to row D (T255590)
  • 23:11 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 22:35 ejegg: disabled fundraising jobs on civi1001 for testing on civi2001
  • 22:24 wkandek@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 22:13 AndyRussG: updated payments-wiki from 5fd4eb1519 to 28ad76dcd7
  • 22:06 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 21:23 wkandek@cumin1001: START - Cookbook sre.ganeti.makevm
  • 21:23 dzahn@cumin1001: END (ERROR) - Cookbook sre.ganeti.makevm (exit_code=97)
  • 21:23 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 21:22 wkandek@cumin1001: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
  • 21:22 wkandek@cumin1001: START - Cookbook sre.ganeti.makevm
  • 21:22 dzahn@cumin1001: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
  • 21:22 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 21:15 wkandek@cumin1001: END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97)
  • 21:14 wkandek@cumin1001: START - Cookbook sre.hosts.decommission
  • 20:31 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Migrate TemplateWizard from EventLogging to EventGate on all wikis - take 2 - T238230 (duration: 01m 06s)
  • 19:16 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Migrate TemplateWizard from EventLogging to EventGate on all wikis - T238230 (duration: 01m 05s)
  • 19:06 brennen@deploy1001: rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.38
  • 18:55 mutante: gerrit1001 (prod) - restarting gerrit service to verify config changes
  • 18:53 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Migrate TemplateWizard from EventLogging to EventGate on group0 - T238230 (duration: 01m 06s)
  • 18:24 reedy@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T254925 T246489 (duration: 01m 06s)
  • 18:04 brennen@deploy1001: Finished scap: testwikis wikis to 1.35.0-wmf.38 (duration: 85m 53s)
  • 16:39 brennen@deploy1001: Started scap: testwikis wikis to 1.35.0-wmf.38
  • 16:01 brennen: 1.35.0-wmf.38 was branched at a35f7318 for https://phabricator.wikimedia.org/T254175
  • 15:47 moritzm: prune nginx packages on mwdebug hosts T255565
  • 15:37 moritzm: prune nginx packages on mw1380-mw1412 T255565
  • 15:28 moritzm: installing libvpx security updates
  • 15:27 mutante: removing ganeti VM xhgui1001 from eqiad row_A, will recreate in another row for rebalancing VMs between rows (T180761 T238098)
  • 15:26 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 15:18 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 15:12 mutante: removing ganeti VM releases1002 in eqiad row_A - will recreate in another row to re-balance (T255590)
  • 15:12 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 15:10 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 14:56 moritzm: failover ganeti master in eqiad to ganeti1011
  • 14:55 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 14:50 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 14:48 urbanecm@deploy1001: Synchronized private/PrivateSettings.php: T250887 (duration: 00m 58s)
  • 14:08 mholloway-shell@deploy1001: Finished deploy [recommendation-api/deploy@db7fd80]: Update recommendation-api to 7e00177 (duration: 03m 13s)
  • 14:05 mholloway-shell@deploy1001: Started deploy [recommendation-api/deploy@db7fd80]: Update recommendation-api to 7e00177
  • 13:54 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:54 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 13:54 jmm@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 13:54 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 13:34 moritzm: draining ganeti1012 for eventual reboot
  • 13:34 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 13:28 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 12:56 jynus@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:54 jynus@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:45 moritzm: draining ganeti1011 for eventual reboot
  • 12:45 marostegui: Deploy schema change on s6 codfw master (lag will appear on codfw) - T253276
  • 12:36 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 12:31 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 12:00 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 11:56 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 11:35 awight: EU BACON cooked
  • 11:34 awight@deploy1001: Synchronized php-1.35.0-wmf.37/extensions/TwoColConflict/: BACON: Fix broken copy link in JS mode (T253724) (duration: 00m 57s)
  • 11:07 mlitn@deploy1001: Synchronized wmf-config/InitialiseSettings.php: test commons: Use the database name in the Wikibase entity source config (duration: 00m 59s)
  • 11:04 moritzm: draining ganeti1008 for eventual reboot
  • 10:58 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 10:55 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 10:42 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:42 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 10:40 kormat@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:38 moritzm: temporarily shutdown xhgui1001/releases1002 to reshuffle Ganeti instances for reboots
  • 10:38 kormat@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:35 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:35 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 10:22 kormat: reimaging db1088 to buster T250666
  • 10:03 jynus@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:01 jynus@cumin2001: START - Cookbook sre.hosts.downtime
  • 09:48 jbond42: add new CI check for cloud yaml data https://gerrit.wikimedia.org/r/c/operations/puppet/+/606444/
  • 09:46 jynus: stopping and reimaging db2101 into buster T254871
  • 09:32 marostegui: Reload haproxy on dbproxy1012 and dbproxy1014 to test db1097 as secondary for 24h T254556
  • 08:46 ema: mwmaint1002: add uid=abban,ou=people,dc=wikimedia,dc=org to group 'nda' T255775
  • 08:38 XioNoX: re-enable peering BGP sessions on AMS-IX - T253970
  • 08:03 moritzm: draining ganeti1007 for eventual reboot
  • 07:58 XioNoX: restart scs-a8-eqiad - T256101
  • 07:51 marostegui@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 07:49 marostegui@cumin2001: START - Cookbook sre.hosts.downtime
  • 07:42 marostegui: Deploy schema change on db1088
  • 07:30 marostegui: Reimage db2133 (m2 codfw master) to Buster (this will trigger haproxy IRC alert) T250666
  • 07:01 marostegui@cumin2001: dbctl commit (dc=all): 'Fully repool db1118', diff saved to https://phabricator.wikimedia.org/P11637 and previous config saved to /var/cache/conftool/dbconfig/20200623-070120-marostegui.json
  • 06:06 XioNoX: disable peering BGP sessions on AMS-IX - T253970
  • 05:24 marostegui: Compress InnoDB on db1080 T254462
  • 05:23 marostegui@cumin2001: dbctl commit (dc=all): 'Depool db1080 for InnoDB compression', diff saved to https://phabricator.wikimedia.org/P11636 and previous config saved to /var/cache/conftool/dbconfig/20200623-052350-marostegui.json
  • 05:22 marostegui@cumin2001: dbctl commit (dc=all): 'Slowly repool db1118', diff saved to https://phabricator.wikimedia.org/P11635 and previous config saved to /var/cache/conftool/dbconfig/20200623-052254-marostegui.json
  • 05:12 marostegui@cumin2001: dbctl commit (dc=all): 'Slowly repool db1118', diff saved to https://phabricator.wikimedia.org/P11634 and previous config saved to /var/cache/conftool/dbconfig/20200623-051159-marostegui.json
  • 05:03 marostegui@cumin2001: dbctl commit (dc=all): 'Slowly repool db1118', diff saved to https://phabricator.wikimedia.org/P11633 and previous config saved to /var/cache/conftool/dbconfig/20200623-050314-marostegui.json

2020-06-22

  • 23:41 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: touch for T247330 (duration: 00m 56s)
  • 23:36 catrope@deploy1001: Synchronized dblists/: Close trwikinews (T247330) (duration: 00m 58s)
  • 23:28 RoanKattouw: Synchronized wmf-config/InitialiseSettings.php: Create rollbacker group on elwiktionary (T255569) (typoed the task number before)
  • 23:26 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Create rollbacker group on elwiktionary (T225569) (duration: 00m 56s)
  • 23:21 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Add localized sitename for bewikibooks (T253962) (duration: 00m 57s)
  • 23:16 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Add domains to wgCopyUploadsDomains (T255336, T255363, T255386, T255313) (duration: 01m 01s)
  • 22:39 bstorm_: downtimed labstore1005 to prevent an alert during puppet merge T253353
  • 22:38 volans@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 22:35 volans@cumin1001: START - Cookbook sre.dns.netbox
  • 22:16 ebernhardson@deploy1001: Finished deploy [wikimedia/discovery/analytics@f2002c8]: bump glent jar to 0.2.2 (duration: 00m 56s)
  • 22:15 ebernhardson@deploy1001: Started deploy [wikimedia/discovery/analytics@f2002c8]: bump glent jar to 0.2.2
  • 22:12 volans: cleanup interfaces and addresses in Netbox for offline servers - T233183
  • 21:59 ebernhardson@deploy1001: Finished deploy [wikimedia/discovery/analytics@6e7f9f7]: bump glent jar to 0.2.2 (duration: 00m 18s)
  • 21:58 ebernhardson@deploy1001: Started deploy [wikimedia/discovery/analytics@6e7f9f7]: bump glent jar to 0.2.2
  • 17:19 mutante: gerrit1002 - let puppet remove [database] section from config; restart gerrit another time
  • 17:14 mutante: gerrit1002 (gerrit-test): re-enabled puppet, restarted gerrit service
  • 16:58 volans@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 16:49 volans@cumin1001: START - Cookbook sre.dns.netbox
  • 15:05 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 15:01 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 14:48 moritzm: installing mutt security updates
  • 14:47 Amir1: creating shnwiktionary is done
  • 14:44 ladsgroup@deploy1001: Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 58s)
  • 14:42 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 14:41 ladsgroup@deploy1001: Synchronized static/images/project-logos/: Creating shnwiktionary (T253029) (duration: 00m 56s)
  • 14:40 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Creating shnwiktionary (T253029) (duration: 00m 56s)
  • 14:39 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 14:37 ladsgroup@deploy1001: rebuilt and synchronized wikiversions files: Creating shnwiktionary (T253029)
  • 14:36 ladsgroup@deploy1001: Synchronized dblists: Creating shnwiktionary (T253029) (duration: 00m 58s)
  • 14:16 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 14:10 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 14:05 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 13:59 moritzm: re-enabling Puppet in codfw
  • 13:55 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 13:51 moritzm: disable Puppet in codfw to reduce puppetdb2002 memory activity, unblocking the migration of the Ganeti instance for a reboot
  • 13:19 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Bump eventlogging_Test schema version to 1.1.0 to pick up client_dt and set wgEventLoggingServiceUri for all wikis - T238230 (duration: 00m 58s)
  • 13:11 marostegui: Stop MySQL on db2078 instances
  • 12:53 vgutierrez: upgrade to trafficserver 8.0.8~rc0-1wm1 on cp5006 and cp5012
  • 12:45 moritzm: draining ganeti2007 for eventual reboot
  • 12:42 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 12:38 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 12:31 akosiaris: failover logstash2023 from ganeti2007->ganeti2023 for migration_downtime change to apply
  • 12:26 volans@deploy1001: Finished deploy [homer/deploy@e9acec8]: Release v0.2.3 on cumin1001 now on buster (duration: 01m 25s)
  • 12:24 volans@deploy1001: Started deploy [homer/deploy@e9acec8]: Release v0.2.3 on cumin1001 now on buster
  • 12:22 volans@deploy1001: Finished deploy [homer/deploy@e9acec8]: Release v0.2.3 on cumin1001 now on buster (duration: 00m 03s)
  • 12:22 volans@deploy1001: Started deploy [homer/deploy@e9acec8]: Release v0.2.3 on cumin1001 now on buster
  • 11:53 Urbanecm: EU B&C window done
  • 11:50 urbanecm@deploy1001: Synchronized php-1.35.0-wmf.37/extensions/VisualEditor/modules/: Backport: 0a08066: Revert "Allow generic params to be passed to getWikitextFragment" (T255785) (duration: 00m 58s)
  • 11:45 marostegui@cumin2001: dbctl commit (dc=all): 'Fully repool db1094', diff saved to https://phabricator.wikimedia.org/P11627 and previous config saved to /var/cache/conftool/dbconfig/20200622-114554-marostegui.json
  • 11:40 moritzm: draining ganeti2008 for eventual reboot
  • 11:37 volans@deploy1001: Finished deploy [homer/deploy@e9acec8]: Release v0.2.3 on cumin1001 now on buster (duration: 00m 28s)
  • 11:37 volans@deploy1001: Started deploy [homer/deploy@e9acec8]: Release v0.2.3 on cumin1001 now on buster
  • 11:34 marostegui@cumin2001: dbctl commit (dc=all): 'Slowly repool db1094', diff saved to https://phabricator.wikimedia.org/P11625 and previous config saved to /var/cache/conftool/dbconfig/20200622-113401-marostegui.json
  • 11:30 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 74e8295: IS: Cleanup some redundant rows (duration: 00m 56s)
  • 11:29 Urbanecm: Run namespaceDupes.php for zh* projects (T165593)
  • 11:24 marostegui@cumin2001: dbctl commit (dc=all): 'Slowly repool db1094', diff saved to https://phabricator.wikimedia.org/P11623 and previous config saved to /var/cache/conftool/dbconfig/20200622-112451-marostegui.json
  • 11:24 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: db952ba: Add zh-hans and zh-hant translation of Module and Module_talk aliases for all Zh Projects (T165593) (duration: 00m 56s)
  • 11:16 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 1301fd4: Add import sources for gomwiktionary (T255098) (duration: 00m 57s)
  • 11:08 marostegui@cumin2001: dbctl commit (dc=all): 'Slowly repool db1094', diff saved to https://phabricator.wikimedia.org/P11622 and previous config saved to /var/cache/conftool/dbconfig/20200622-110806-marostegui.json
  • 11:07 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: defa81e: Disable NS_USER(_TALK) search engine indexing on trwiki (T255538) (duration: 00m 58s)
  • 10:35 jdrewniak@deploy1001: Synchronized portals: Wikimedia Portals Update: Bumping portals to master (606985) (duration: 00m 56s)
  • 10:34 jdrewniak@deploy1001: Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: Bumping portals to master (606985) (duration: 01m 12s)
  • 09:58 marostegui@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:56 marostegui@cumin2001: START - Cookbook sre.hosts.downtime
  • 09:33 marostegui@cumin2001: dbctl commit (dc=all): 'Depool db1094 for reimage', diff saved to https://phabricator.wikimedia.org/P11621 and previous config saved to /var/cache/conftool/dbconfig/20200622-093323-marostegui.json
  • 09:31 godog: roll-restart logstash in codfw/eqiad to apply configuration change
  • 08:59 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:56 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 08:33 moritzm: reimaging cumin1001 to buster T245114
  • 08:13 godog: extend prometheus codfw ops filesystem to 1TB
  • 08:02 vgutierrez: upgrade to trafficserver 8.0.8~rc0-1wm1 on cp4026 and cp4032
  • 08:02 vgutierrez: upload trafficserver 8.0.8~rc0-1wm1 to apt.wm.o (buster)
  • 07:33 marostegui@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 07:30 marostegui@cumin2001: START - Cookbook sre.hosts.downtime
  • 07:16 marostegui: Reimage db1117 (irc haproxy alerts will be triggered)
  • 06:26 marostegui@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 06:24 marostegui@cumin2001: START - Cookbook sre.hosts.downtime
  • 06:06 marostegui: Stop MySQL on dbstore1005 for reimage to Buster - T254870
  • 05:58 marostegui: Compress InnoDB on db1118 T254462
  • 05:51 marostegui@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 05:49 marostegui@cumin2001: START - Cookbook sre.hosts.downtime
  • 05:43 marostegui: Stop haproxy on dbproxy1008 - T255406
  • 05:33 marostegui@cumin2001: dbctl commit (dc=all): 'Depool db1118 for reimage and InnoDB compression', diff saved to https://phabricator.wikimedia.org/P11617 and previous config saved to /var/cache/conftool/dbconfig/20200622-053334-marostegui.json
  • 05:31 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1134', diff saved to https://phabricator.wikimedia.org/P11616 and previous config saved to /var/cache/conftool/dbconfig/20200622-053104-marostegui.json
  • 05:17 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1134', diff saved to https://phabricator.wikimedia.org/P11615 and previous config saved to /var/cache/conftool/dbconfig/20200622-051730-marostegui.json
  • 05:17 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1134', diff saved to https://phabricator.wikimedia.org/P11614 and previous config saved to /var/cache/conftool/dbconfig/20200622-051720-marostegui.json
  • 05:03 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1134', diff saved to https://phabricator.wikimedia.org/P11613 and previous config saved to /var/cache/conftool/dbconfig/20200622-050259-marostegui.json
  • 04:50 marostegui: Deploy schema change on s3 primary master with a big sleep between wikis - T250066
  • 04:48 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1134', diff saved to https://phabricator.wikimedia.org/P11612 and previous config saved to /var/cache/conftool/dbconfig/20200622-044853-marostegui.json

2020-06-20

  • 22:56 cdanis@cumin2001: dbctl commit (dc=all): 'db1088 seems to have crashed', diff saved to https://phabricator.wikimedia.org/P11611 and previous config saved to /var/cache/conftool/dbconfig/20200620-225624-cdanis.json
  • 07:42 elukey: powercycle an-worker1093 - soft lockup CPU bug shown in mgmt console
  • 07:36 elukey: powercycle an-worker1091 - soft lockup CPU bug shown in mgmt console

2020-06-19

  • 18:10 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Bump eventlogging_Test schema version to 1.1.0 to pick up client_dt - T238230 (duration: 00m 59s)
  • 16:07 mutante: ganeti4003 - rebooting install4001 - trying to bootstrap OS install from install2003
  • 15:47 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 15:28 godog: roll-restart kibana to apply new settings
  • 13:01 moritzm: installing cups security updates (client side libs/tools)
  • 12:31 qchris: Disabling puppet on gerrit1002 (test instance) to do some more testing
  • 12:14 godog: delete march indices from logstash 5 eqiad to free up space
  • 12:12 marostegui@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:10 marostegui@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 12:08 marostegui@cumin2001: START - Cookbook sre.hosts.downtime
  • 12:07 marostegui@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:06 marostegui@cumin2001: START - Cookbook sre.hosts.downtime
  • 12:05 marostegui@cumin2001: START - Cookbook sre.hosts.downtime
  • 11:39 marostegui: Reimage db2116 db2119 db2130
  • 10:55 moritzm: installing mesa security updates
  • 10:49 godog: close april logstash indices on logstash 5 eqiad
  • 10:45 moritzm: installing tomcat8 security updates
  • 10:38 jayme: imported chartmuseum_0.12.0-1 to buster-wikimedia
  • 10:24 marostegui@cumin2001: dbctl commit (dc=all): 'Repool db1093', diff saved to https://phabricator.wikimedia.org/P11604 and previous config saved to /var/cache/conftool/dbconfig/20200619-102447-marostegui.json
  • 10:21 godog: start closing logstash indices for 2020.03 in elastic 5 eqiad
  • 09:22 godog: restart elasticsearch on logstash1010
  • 09:14 apergos: rsync from dumpsdata1003 as root to labstore1007 of dumps output files to catch up, with --bwlimit=160000 up from 80000
  • 08:45 volans: backup netbox and run one-time script to reserve first IPs on all infra prefixes on Netbox - T233183
  • 08:45 godog: roll restart elasticsearch_5@production-logstash-eqiad
  • 08:26 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 08:21 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 08:18 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 08:15 godog: roll-restart logstash elk5 for "JVM GC Old generation-s runs" alert
  • 08:12 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 08:00 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 07:59 marostegui@cumin2001: dbctl commit (dc=all): 'Depool db1093', diff saved to https://phabricator.wikimedia.org/P11601 and previous config saved to /var/cache/conftool/dbconfig/20200619-075907-marostegui.json
  • 07:54 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 07:52 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 07:47 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 07:47 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 07:44 marostegui@cumin2001: dbctl commit (dc=all): 'Repool db1098:3316', diff saved to https://phabricator.wikimedia.org/P11600 and previous config saved to /var/cache/conftool/dbconfig/20200619-074420-marostegui.json
  • 07:39 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 07:28 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 07:23 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 07:22 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 07:16 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 07:15 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 07:10 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 07:09 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 07:03 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 07:02 moritzm: rebooting ganeti nodes in eqiad for kernel security updates
  • 06:57 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 06:51 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 06:47 moritzm: force reinstall of memcached 1.6 deb packages to ensure that the override is used in addition to the unmodified systemd unit from the deb T233933
  • 06:39 marostegui@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 06:36 marostegui@cumin2001: START - Cookbook sre.hosts.downtime
  • 06:20 marostegui: Stop mysql on db2132 to reimage m1 codfw master - T254556
  • 06:19 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2075 db2111', diff saved to https://phabricator.wikimedia.org/P11599 and previous config saved to /var/cache/conftool/dbconfig/20200619-061922-marostegui.json
  • 06:05 marostegui@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 06:02 marostegui@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 06:01 marostegui@cumin2001: START - Cookbook sre.hosts.downtime
  • 06:00 marostegui@cumin2001: START - Cookbook sre.hosts.downtime
  • 05:54 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1112', diff saved to https://phabricator.wikimedia.org/P11598 and previous config saved to /var/cache/conftool/dbconfig/20200619-055430-marostegui.json
  • 05:41 marostegui@cumin2001: dbctl commit (dc=all): 'Depool db2075 and db2111 for reimage', diff saved to https://phabricator.wikimedia.org/P11597 and previous config saved to /var/cache/conftool/dbconfig/20200619-054118-marostegui.json
  • 05:34 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2108', diff saved to https://phabricator.wikimedia.org/P11596 and previous config saved to /var/cache/conftool/dbconfig/20200619-053402-marostegui.json
  • 05:25 marostegui@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 05:23 marostegui@cumin2001: START - Cookbook sre.hosts.downtime
  • 04:44 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2108 for reimage', diff saved to https://phabricator.wikimedia.org/P11595 and previous config saved to /var/cache/conftool/dbconfig/20200619-044440-marostegui.json
  • 04:39 marostegui@cumin2001: dbctl commit (dc=all): 'Depool db1098:3316', diff saved to https://phabricator.wikimedia.org/P11594 and previous config saved to /var/cache/conftool/dbconfig/20200619-043956-marostegui.json
  • 04:35 marostegui@cumin2001: dbctl commit (dc=all): 'Depool db1112', diff saved to https://phabricator.wikimedia.org/P11593 and previous config saved to /var/cache/conftool/dbconfig/20200619-043554-marostegui.json

2020-06-18

  • 22:30 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: EventLogging to EventGate: - SearchSatisfaction on all wikis - T249261 (duration: 00m 56s)
  • 21:14 volans: start check-homer-diff.service on cumin2001 after merging the fix r/606526
  • 20:17 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: EventLogging to EventGate: - SearchSatisfaction on all wikis - T249261 (duration: 00m 57s)
  • 19:44 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: EventLogging to EventGate: - SearchSatisfaction on group1 wikis - T249261 (duration: 00m 57s)
  • 18:53 wkandek@cumin1001: conftool action : set/pooled=yes; selector: name=mw2339.codfw.wmnet
  • 18:35 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 18:32 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:16 wkandek@cumin1001: conftool action : set/pooled=no; selector: name=mw2339.codfw.wmnet
  • 17:14 wkandek@cumin1001: conftool action : set/pooled=yes; selector: name=mw2339.codfw.wmnet
  • 17:12 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw2339.codfw.wmnet
  • 16:51 maryum: reindex suspended until deployment of code
  • 16:49 hnowlan: Shut off non-dockerised deployment-prep instance of changeprop
  • 16:15 maryum: reindexing French wiki in Elasticsearch
  • 15:37 Reedy: created bot_passwords tables on officeiwki and otrs_wikiwiki T254925 T246489
  • 15:34 moritzm: installing harfbuzz security updates
  • 15:23 moritzm: installing Ruby 2.1 security updates
  • 15:15 moritzm: installing python-django security updates (packaged buster version)
  • 15:04 moritzm: installing bind updates on jessie (client side tools/libs)
  • 14:19 marostegui@cumin2001: dbctl commit (dc=all): 'Repool db1078', diff saved to https://phabricator.wikimedia.org/P11591 and previous config saved to /var/cache/conftool/dbconfig/20200618-141941-marostegui.json
  • 14:14 moritzm: failover ganeti master in codfw to ganeti2021
  • 14:03 marostegui@cumin2001: dbctl commit (dc=all): 'Depool db1078 for schema change', diff saved to https://phabricator.wikimedia.org/P11590 and previous config saved to /var/cache/conftool/dbconfig/20200618-140352-marostegui.json
  • 14:02 marostegui@cumin2001: dbctl commit (dc=all): 'Repool db1075', diff saved to https://phabricator.wikimedia.org/P11589 and previous config saved to /var/cache/conftool/dbconfig/20200618-140203-marostegui.json
  • 13:53 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:53 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 13:52 akosiaris: restart logstash2005 for applying an increased ganeti migration_downtime of 10k
  • 13:47 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 13:43 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 13:40 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 13:34 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 13:11 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 13:05 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 12:52 marostegui@cumin2001: dbctl commit (dc=all): 'Depool db1075 for schema change', diff saved to https://phabricator.wikimedia.org/P11586 and previous config saved to /var/cache/conftool/dbconfig/20200618-125216-marostegui.json
  • 12:48 marostegui@cumin1001: dbctl commit (dc=all): 'Remove weight from es5 master as es1024 is fully repooled now', diff saved to https://phabricator.wikimedia.org/P11585 and previous config saved to /var/cache/conftool/dbconfig/20200618-124801-marostegui.json
  • 12:23 kormat@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:20 kormat@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:05 kormat: reimaging db1077 for final test T251768
  • 11:51 jbond@deploy1001: Synchronized wmf-config/CommonSettings-labs.php: (no justification provided) (duration: 01m 00s)
  • 11:06 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 11:00 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 10:34 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:34 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 09:54 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 09:48 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 09:40 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2076', diff saved to https://phabricator.wikimedia.org/P11583 and previous config saved to /var/cache/conftool/dbconfig/20200618-094001-marostegui.json
  • 09:39 akosiaris: update wikifeeds to latest chart version in codfw
  • 09:39 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
  • 09:38 marostegui@cumin2001: dbctl commit (dc=all): 'Repool es2022', diff saved to https://phabricator.wikimedia.org/P11582 and previous config saved to /var/cache/conftool/dbconfig/20200618-093803-marostegui.json
  • 09:38 akosiaris: uncordon kubernetes20{07..14} and kubernetes10{07..14}. Nodes are now fully put in rotation and ready to receive production traffic
  • 09:34 marostegui: Deploy schema change on s3 codfw master (this will create lag on codfw) - T250066
  • 09:31 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 09:30 godog: temp stop logstash on elk7 to test 8 pipeline workers - T255243
  • 09:25 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 09:15 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 09:09 marostegui@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:08 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 09:06 marostegui@cumin2001: START - Cookbook sre.hosts.downtime
  • 09:04 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 08:59 marostegui@cumin2001: dbctl commit (dc=all): 'Fully repool es1025', diff saved to https://phabricator.wikimedia.org/P11581 and previous config saved to /var/cache/conftool/dbconfig/20200618-085927-marostegui.json
  • 08:59 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 08:50 ayounsi@cumin2001: END (FAIL) - Cookbook sre.network.prepare-upgrade (exit_code=1)
  • 08:49 ayounsi@cumin2001: START - Cookbook sre.network.prepare-upgrade
  • 08:49 ayounsi@cumin2001: END (FAIL) - Cookbook sre.network.prepare-upgrade (exit_code=99)
  • 08:49 ayounsi@cumin2001: START - Cookbook sre.network.prepare-upgrade
  • 08:49 marostegui@cumin2001: dbctl commit (dc=all): 'Slowly repool es1025', diff saved to https://phabricator.wikimedia.org/P11580 and previous config saved to /var/cache/conftool/dbconfig/20200618-084929-marostegui.json
  • 08:47 marostegui@cumin2001: dbctl commit (dc=all): 'Depool es2022 for reimage', diff saved to https://phabricator.wikimedia.org/P11578 and previous config saved to /var/cache/conftool/dbconfig/20200618-084720-marostegui.json
  • 08:47 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 08:40 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 08:37 marostegui@cumin2001: dbctl commit (dc=all): 'Slowly repool es1025', diff saved to https://phabricator.wikimedia.org/P11577 and previous config saved to /var/cache/conftool/dbconfig/20200618-083749-marostegui.json
  • 08:25 elukey: change archiva-ci password in archiva
  • 08:24 marostegui@cumin2001: dbctl commit (dc=all): 'Slowly repool es1025', diff saved to https://phabricator.wikimedia.org/P11576 and previous config saved to /var/cache/conftool/dbconfig/20200618-082432-marostegui.json
  • 08:12 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 08:10 marostegui@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:08 marostegui@cumin2001: START - Cookbook sre.hosts.downtime
  • 08:06 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 08:05 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 08:00 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 07:57 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 07:51 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 07:50 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 07:45 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 07:41 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 07:41 marostegui: Reimage es1025
  • 07:35 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 07:34 marostegui@cumin2001: dbctl commit (dc=all): 'Repool db1136', diff saved to https://phabricator.wikimedia.org/P11574 and previous config saved to /var/cache/conftool/dbconfig/20200618-073414-marostegui.json
  • 07:33 ayounsi@cumin2001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 07:31 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 07:25 ayounsi@cumin2001: START - Cookbook sre.dns.netbox
  • 07:24 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 07:22 moritzm: rolling reboot of ganeti servers in codfw
  • 07:10 ayounsi@cumin1001: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
  • 07:07 ayounsi@cumin1001: START - Cookbook sre.dns.netbox
  • 04:50 marostegui@cumin2001: dbctl commit (dc=all): 'Depool db1136', diff saved to https://phabricator.wikimedia.org/P11573 and previous config saved to /var/cache/conftool/dbconfig/20200618-045047-marostegui.json

2020-06-17

  • 23:25 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 0e7079d: Install DiscussionTools on all wikis (attempt 2) (T252264; T253943) (duration: 00m 56s)
  • 23:23 urbanecm@deploy1001: Synchronized php-1.35.0-wmf.36/extensions/DiscussionTools/includes/Hooks.php: ff01083: Use $wgLocaltimezone global instead of request context (T255704) (duration: 00m 57s)
  • 23:21 urbanecm@deploy1001: Synchronized php-1.35.0-wmf.37/extensions/DiscussionTools/includes/Hooks.php: 4551d29: Use $wgLocaltimezone global instead of request context (T252264; T253943; T255704) (duration: 00m 58s)
  • 23:01 ryankemper@deploy1001: Finished deploy [wdqs/wdqs@79fb82f]: 0.3.39 (duration: 14m 38s)
  • 22:47 ryankemper@deploy1001: Started deploy [wdqs/wdqs@79fb82f]: 0.3.39
  • 21:01 ryankemper@cumin2001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 20:32 hashar: Fixed up zuul-merger on contint1001 due to some faulty hotfix
  • 20:08 hashar: Stopped zuul-merger on contint1001
  • 19:21 marostegui: Deploy schema change on s6 codfw master T238966
  • 19:17 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1094', diff saved to https://phabricator.wikimedia.org/P11572 and previous config saved to /var/cache/conftool/dbconfig/20200617-191723-marostegui.json
  • 19:11 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-transfer
  • 19:08 ryankemper@cumin2001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 19:05 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-transfer
  • 18:57 milimetric@deploy1001: Finished deploy [analytics/refinery@6640d6f] (thin): Quick fix for data quality bundles (THIN) (duration: 00m 10s)
  • 18:57 milimetric@deploy1001: Started deploy [analytics/refinery@6640d6f] (thin): Quick fix for data quality bundles (THIN)
  • 18:52 ryankemper@cumin1001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 18:49 ryankemper@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 18:44 milimetric@deploy1001: Finished deploy [analytics/refinery@6640d6f]: Quick fix for data quality bundles (duration: 27m 55s)
  • 18:41 Urbanecm: Morning B&C window done
  • 18:31 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 96153f9: Add temporary logging for mediamoderation (T247943) (duration: 00m 56s)
  • 18:24 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: REVERT: ae76450: Install DiscussionTools on all wikis (T252264; T253943) (duration: 00m 34s)
  • 18:22 urbanecm@deploy1001: scap failed: average error rate on 3/9 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/e474f13ffac6b8c3bf919c4aeafc8c9b for details)
  • 18:21 urbanecm@deploy1001: scap failed: average error rate on 9/9 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/e474f13ffac6b8c3bf919c4aeafc8c9b for details)
  • 18:16 milimetric@deploy1001: Started deploy [analytics/refinery@6640d6f]: Quick fix for data quality bundles
  • 18:14 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: c9f6452: Set DiscussionToolsEnableVisual to true by default (T251654) (duration: 00m 56s)
  • 18:05 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 18:04 elukey@cumin1001: START - Cookbook sre.hosts.decommission
  • 16:57 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: EventLogging to EventGate: - SearchSatisfaction on group0 wikis - T249261 (duration: 00m 56s)
  • 16:00 marostegui@cumin2001: dbctl commit (dc=all): 'Depool db1094', diff saved to https://phabricator.wikimedia.org/P11571 and previous config saved to /var/cache/conftool/dbconfig/20200617-160013-marostegui.json
  • 15:28 godog: temp bump logstash7 workers to 8 and temp stop logstash - T255243
  • 15:17 jforrester@deploy1001: Synchronized private/PrivateSettings.php: T247943 Add API key and recipient config for MediaModeration (duration: 00m 55s)
  • 15:17 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw2338.codfw.wmnet
  • 15:11 dzahn@cumin1001: conftool action : set/weight=15; selector: name=mw233[5-9].codfw.wmnet
  • 15:11 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: T247943 Install MediaModeration extension - III: Install where enabled (duration: 00m 56s)
  • 15:10 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw2335.codfw.wmnet
  • 15:09 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw2336.codfw.wmnet
  • 15:09 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw2337.codfw.wmnet
  • 15:09 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw2339.codfw.wmnet
  • 15:08 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw233[5-9].codfw.wmnet
  • 14:58 jforrester@deploy1001: Synchronized php-1.35.0-wmf.37/extensions/GrowthExperiments/modules/help/ext.growthExperiments.HelpPanelProcessDialog.js: T255607 Fix help panel sizing logic (duration: 00m 56s)
  • 14:54 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 14:52 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 14:52 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:50 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:49 mdholloway: rolled back recommendation-api deployment due to canary endpoint check failure (T255683)
  • 14:44 mholloway-shell@deploy1001: Finished deploy [recommendation-api/deploy@c39d567]: Update recommendation-api to db97742 (duration: 01m 16s)
  • 14:43 mholloway-shell@deploy1001: Started deploy [recommendation-api/deploy@c39d567]: Update recommendation-api to db97742
  • 14:30 akosiaris: redrain kubernetes1007-14
  • 14:27 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:27 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:27 mutante: disabling puppet on icinga to avoid alert spam when adding new appservers
  • 14:25 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
  • 14:22 akosiaris: uncordon kubernetes10{07..14} again
  • 14:13 mutante: generating new mcrouter certs for mw2335 - mw2339 (T247021)
  • 14:02 mutante: rebooting mw2335 through mw2339 (not in service)
  • 13:51 XioNoX: cleanup msw1-codfw interfaces
  • 13:44 akosiaris: redrain kubernetes1007-14
  • 13:37 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'mathoid' for release 'production' .
  • 13:35 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
  • 13:31 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: EventLogging to EventGate: - SearchSatisfaction on testwiki version 1.1.0 - T249261 (duration: 00m 58s)
  • 13:30 moritzm: upgrade remaining parsoid nodes to PHP 7.2.31
  • 13:21 jbond42: re-enable puppet on C:memcached nodes
  • 13:04 marostegui@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:04 marostegui: The above db1129 depool was meant to be a repool, wrong commit message
  • 13:03 liw@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.37
  • 13:03 jbond42: disable puppet on C:memcache to deploy a new change
  • 13:02 marostegui@cumin2001: dbctl commit (dc=all): 'Depool db1129', diff saved to https://phabricator.wikimedia.org/P11567 and previous config saved to /var/cache/conftool/dbconfig/20200617-130236-marostegui.json
  • 13:02 akosiaris@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 13:00 marostegui@cumin2001: START - Cookbook sre.hosts.downtime
  • 13:00 akosiaris@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 13:00 akosiaris@cumin2001: START - Cookbook sre.hosts.downtime
  • 13:00 akosiaris@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 13:00 akosiaris@cumin2001: START - Cookbook sre.hosts.downtime
  • 13:00 akosiaris@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 12:59 akosiaris@cumin2001: START - Cookbook sre.hosts.downtime
  • 12:59 akosiaris@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 12:59 akosiaris@cumin2001: START - Cookbook sre.hosts.downtime
  • 12:59 akosiaris@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 12:59 akosiaris@cumin2001: START - Cookbook sre.hosts.downtime
  • 12:59 akosiaris@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 12:59 akosiaris@cumin2001: START - Cookbook sre.hosts.downtime
  • 12:59 akosiaris@cumin2001: START - Cookbook sre.hosts.downtime
  • 12:54 hnowlan: upgraded cpjobqueue to newer container image, rolled back
  • 12:40 marostegui@cumin2001: dbctl commit (dc=all): 'Add db2091 to s8 T253217', diff saved to https://phabricator.wikimedia.org/P11566 and previous config saved to /var/cache/conftool/dbconfig/20200617-124034-marostegui.json
  • 12:32 hnowlan: Removed remaining changeprop systemd components from scb
  • 12:06 marostegui@cumin2001: dbctl commit (dc=all): 'Depool db2076 to remove triggers from sanitarium T238966', diff saved to https://phabricator.wikimedia.org/P11565 and previous config saved to /var/cache/conftool/dbconfig/20200617-120622-marostegui.json
  • 11:59 Amir1: not today, just EU noon
  • 11:59 Amir1: B&C is done for today
  • 11:58 ladsgroup@deploy1001: Synchronized wmf-config/config/trwikisource.yaml: Change sidebar upload link destination for tr.wikisource (T253490) (duration: 01m 03s)
  • 11:55 ladsgroup@deploy1001: Synchronized dblists/commonsuploads.dblist: Change sidebar upload link destination for tr.wikisource (T253490) (duration: 01m 04s)
  • 11:48 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 11:47 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Add extended-confirmed group and restriction level for rowiki (T254471) (duration: 01m 04s)
  • 11:30 marostegui@cumin1001: dbctl commit (dc=all): 'Depool es1025 for reimage, give weight to es1023 (es5 master)', diff saved to https://phabricator.wikimedia.org/P11563 and previous config saved to /var/cache/conftool/dbconfig/20200617-113026-marostegui.json
  • 11:23 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.37/extensions/GrowthExperiments/extension.json: Fix NewcomerTask schema (T255597) (duration: 01m 04s)
  • 11:18 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.36/extensions/GrowthExperiments/extension.json: Fix NewcomerTask schema (T255597) (duration: 01m 06s)
  • 11:07 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Set hiwiktionary timezone to Asia/Kolkata (T255531) (duration: 01m 05s)
  • 10:48 marostegui@cumin2001: dbctl commit (dc=all): 'Remove db2091 from dbctl in s2 and s4', diff saved to https://phabricator.wikimedia.org/P11562 and previous config saved to /var/cache/conftool/dbconfig/20200617-104816-marostegui.json
  • 10:40 marostegui@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:38 marostegui@cumin2001: START - Cookbook sre.hosts.downtime
  • 10:31 liw@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.37 (duration: 01m 04s)
  • 10:30 liw@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.37
  • 09:44 marostegui@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:42 marostegui@cumin2001: START - Cookbook sre.hosts.downtime
  • 09:40 hnowlan: killing stale changeprop instances running on scb hosts
  • 09:16 jforrester@deploy1001: Synchronized php-1.35.0-wmf.37/extensions/Flow/: T255608 Revert 'Hooks: Use PageMoveComplete instead of TitleMoveCompleting' (duration: 01m 05s)
  • 09:15 marostegui@cumin2001: dbctl commit (dc=all): 'Fully repool db1113:3315, db1113:3316', diff saved to https://phabricator.wikimedia.org/P11558 and previous config saved to /var/cache/conftool/dbconfig/20200617-091509-marostegui.json
  • 09:11 jforrester@deploy1001: Synchronized php-1.35.0-wmf.37/includes/HookContainer/DeprecatedHooks.php: T255608 Revert 'Hard deprecate the hook' (duration: 01m 05s)
  • 09:02 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T247943 Install MediaModeration extension - II: Add flag to IS (duration: 01m 05s)
  • 08:56 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:56 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 08:52 marostegui@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:49 marostegui@cumin2001: START - Cookbook sre.hosts.downtime
  • 08:47 marostegui@cumin2001: dbctl commit (dc=all): 'Slowly repool db1113:3315, db1113:3316', diff saved to https://phabricator.wikimedia.org/P11557 and previous config saved to /var/cache/conftool/dbconfig/20200617-084751-marostegui.json
  • 08:44 marostegui@cumin2001: dbctl commit (dc=all): 'Slowly repool db1113:3315, db1113:3316', diff saved to https://phabricator.wikimedia.org/P11556 and previous config saved to /var/cache/conftool/dbconfig/20200617-084402-marostegui.json
  • 08:43 jforrester@deploy1001: Synchronized php-1.35.0-wmf.37/includes/EditPage.php: T255177 T255614 Do not return internal edit status from EditPage (duration: 01m 08s)
  • 08:31 marostegui@cumin2001: dbctl commit (dc=all): 'Slowly repool db1113:3315, db1113:3316', diff saved to https://phabricator.wikimedia.org/P11554 and previous config saved to /var/cache/conftool/dbconfig/20200617-083120-marostegui.json
  • 08:30 godog: start logstash on logstash7 - T255243
  • 08:29 moritzm: prune nginx from remaining mw* servers in codfw T255565
  • 08:23 marostegui@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:20 marostegui@cumin2001: START - Cookbook sre.hosts.downtime
  • 08:10 godog: stop logstash temporarily on logstash7 hosts to test increased es shards - T255243
  • 08:05 marostegui@cumin2001: dbctl commit (dc=all): 'Depool db1113:3315 db1113:3316', diff saved to https://phabricator.wikimedia.org/P11553 and previous config saved to /var/cache/conftool/dbconfig/20200617-080511-marostegui.json
  • 07:53 elukey: reboot kafka-jumbo1009 for kernel upgrades
  • 06:40 elukey: reboot krb1001 for kernel upgrades
  • 06:24 elukey: reboot an-master100[1,2] for kernel upgrades
  • 06:23 XioNoX: set lacp active on cr2-esams:ae2 - T253970
  • 06:15 tstarling@deploy1001: Synchronized wmf-config/PoolCounterSettings.php: test fast stale mode on testwiki T250248 (duration: 01m 17s)
  • 06:03 elukey: reboot an-conf100[1-3] for kernel upgrades
  • 05:45 elukey: reboot stat1007/8 for kernel upgrades
  • 05:45 elukey: clean up old systemd timer config on an-coord1001 (came up after the last reboot)
  • 05:42 volker-e@deploy1001: Finished deploy [design/style-guide@37c67dd]: Deploy design/style-guide: (duration: 00m 05s)
  • 05:42 volker-e@deploy1001: Started deploy [design/style-guide@37c67dd]: Deploy design/style-guide:
  • 05:34 marostegui@cumin2001: dbctl commit (dc=all): 'Fully repool db1090:3312, db1090:3317', diff saved to https://phabricator.wikimedia.org/P11552 and previous config saved to /var/cache/conftool/dbconfig/20200617-053421-marostegui.json
  • 05:29 marostegui: Deploy schema change on s7 codfw (lag will appear) - T250066
  • 05:28 marostegui@cumin2001: dbctl commit (dc=all): 'Slowly repool db1090:3312, db1090:3317', diff saved to https://phabricator.wikimedia.org/P11551 and previous config saved to /var/cache/conftool/dbconfig/20200617-052809-marostegui.json
  • 05:22 marostegui@cumin2001: dbctl commit (dc=all): 'Slowly repool db1090:3312, db1090:3317', diff saved to https://phabricator.wikimedia.org/P11550 and previous config saved to /var/cache/conftool/dbconfig/20200617-052202-marostegui.json
  • 05:19 marostegui@cumin2001: dbctl commit (dc=all): 'Slowly repool db1090:3312, db1090:3317', diff saved to https://phabricator.wikimedia.org/P11549 and previous config saved to /var/cache/conftool/dbconfig/20200617-051916-marostegui.json
  • 05:10 marostegui@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 05:08 marostegui@cumin2001: START - Cookbook sre.hosts.downtime
  • 04:51 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1090:3312, db1090:3317 for reimage', diff saved to https://phabricator.wikimedia.org/P11548 and previous config saved to /var/cache/conftool/dbconfig/20200617-045105-marostegui.json
  • 04:44 marostegui: Reload pt-kill on labsdb analytics host to pick up new config
  • 04:38 marostegui@cumin2001: dbctl commit (dc=all): 'Depool db1129', diff saved to https://phabricator.wikimedia.org/P11547 and previous config saved to /var/cache/conftool/dbconfig/20200617-043826-marostegui.json
  • 01:43 shdubsh: restart elasticsearch on logstash1011

2020-06-16

  • 23:43 crusnov@deploy1001: Finished deploy [netbox/deploy@5251cf1]: Deploying Netbox to netbox-dev T253140 (duration: 00m 05s)
  • 23:43 crusnov@deploy1001: Started deploy [netbox/deploy@5251cf1]: Deploying Netbox to netbox-dev T253140
  • 23:35 ebernhardson@deploy1001: Synchronized wmf-config/InitialiseSettings.php: cirrus: update ML models for ko and zh, drop ja (duration: 01m 00s)
  • 23:34 ebernhardson@deploy1001: sync-file aborted: cirrus: update ML models for ko and zh, drop ja (duration: 00m 04s)
  • 22:40 krinkle@deploy1001: Synchronized src/Noc/: (no justification provided) (duration: 01m 04s)
  • 22:31 krinkle@deploy1001: Synchronized docroot/noc: (no justification provided) (duration: 01m 05s)
  • 21:12 krinkle@deploy1001: Synchronized php-1.35.0-wmf.37/extensions/WikimediaEvents/modules/: I67794c (duration: 01m 04s)
  • 20:42 brennen@deploy1001: rebuilt and synchronized wikiversions files: Revert group1 wikis to 1.35.0-wmf.37
  • 20:41 foks: reset email and pw for CactusJack
  • 20:32 brennen: rolling 1.35.0-wmf.37 back to group0
  • 20:29 mutante: signing puppet cert requests for releases1002 and releases2002 - T255590
  • 19:24 brennen@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.37 (duration: 01m 04s)
  • 19:23 brennen@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.37
  • 19:18 otto@deploy1001: Started deploy [analytics/refinery@8b8ce6e]: deploying refinery source 0.0.127 for eventlogging -> eventgate migration - T249261
  • 19:15 brennen@deploy1001: Synchronized php-1.35.0-wmf.37/skins/Vector/resources/skins.vector.styles/: Restore Watchlist star (duration: 01m 05s)
  • 19:03 brennen: CORRECTION: holding _1.35.0-wmf.37_ deploy to group1 for a few minutes while merging & testing fix for T255574
  • 19:01 brennen: holding 1.35.0-wmf.27 deploy to group1 for a few minutes while merging & testing fix for T255574
  • 18:59 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 18:52 qchris: Turning on puppet again on gerrit1002 to avoid having it lag too far behind.
  • 18:32 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 18:18 mutante: mw2293 - scap pull (because Icinga reports mismatched MW versions)
  • 18:01 crusnov@cumin2001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 17:55 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 17:52 crusnov@cumin2001: START - Cookbook sre.ganeti.makevm
  • 17:44 ebernhardson@deploy1001: Finished deploy [wikimedia/discovery/analytics@f4f5d7b]: airflow: adjust glent legal cutoff (duration: 01m 35s)
  • 17:42 ebernhardson@deploy1001: Started deploy [wikimedia/discovery/analytics@f4f5d7b]: airflow: adjust glent legal cutoff
  • 17:32 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 17:03 herron: performing rolling reboots of kafka-main hosts for security updates T254990
  • 16:27 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 16:26 hnowlan: Updating changeprop to new container version with updated dependencies
  • 16:07 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 16:04 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 16:02 elukey: reboot kafka-jumbo1008 for kernel upgrades
  • 15:58 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 15:49 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1076', diff saved to https://phabricator.wikimedia.org/P11543 and previous config saved to /var/cache/conftool/dbconfig/20200616-154924-marostegui.json
  • 15:45 ebernhardson@deploy1001: Finished deploy [wikimedia/discovery/analytics@7d4458c]: Reduce glent maximum yarn resource usage to reasonable levels (duration: 00m 41s)
  • 15:44 ebernhardson@deploy1001: Started deploy [wikimedia/discovery/analytics@7d4458c]: Reduce glent maximum yarn resource usage to reasonable levels
  • 15:26 milimetric@deploy1001: Finished deploy [analytics/refinery@c652f62] (thin): Regular analytics weekly THIN train [analytics/refinery@c652f62] (duration: 00m 08s)
  • 15:25 milimetric@deploy1001: Started deploy [analytics/refinery@c652f62] (thin): Regular analytics weekly THIN train [analytics/refinery@c652f62]
  • 15:23 milimetric@deploy1001: Finished deploy [analytics/refinery@c652f62]: Regular analytics weekly train [analytics/refinery@c652f62] (duration: 07m 56s)
  • 15:20 elukey: reboot kafka-jumbo1007 for kernel upgrades
  • 15:15 moritzm: upgrading intel-microcode on jessie hosts
  • 15:15 milimetric@deploy1001: Started deploy [analytics/refinery@c652f62]: Regular analytics weekly train [analytics/refinery@c652f62]
  • 15:06 elukey: reboot an-coord1001 for kernel upgrades
  • 14:49 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 14:49 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 14:45 moritzm: rebooting scandium for kernel security update
  • 14:45 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 14:43 cdanis: repool eqiad T243080
  • 14:40 papaul: power off ms-be2018 for BBU replacement
  • 14:33 cdanis: eqiad router upgrades completed! 🎉 T243080
  • 14:33 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 14:31 elukey: reboot druid100[7,8] for kernel upgrades
  • 14:28 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 14:25 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 14:22 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 14:15 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1076', diff saved to https://phabricator.wikimedia.org/P11541 and previous config saved to /var/cache/conftool/dbconfig/20200616-141540-marostegui.json
  • 14:14 cdanis: T243080 cdanis@re1.cr2-eqiad> request chassis routing-engine master switch
  • 14:11 moritzm: removing stray nginx packages from mw canaries (mw1261-mw1265 and mw1276-mw1283) T255565
  • 14:06 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 14:03 akosiaris@cumin1001: START - Cookbook sre.hosts.decommission
  • 14:03 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
  • 14:03 akosiaris@cumin1001: START - Cookbook sre.hosts.decommission
  • 14:03 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99)
  • 14:03 akosiaris@cumin1001: START - Cookbook sre.hosts.decommission
  • 13:56 cdanis: T243080 cdanis@re0.cr2-eqiad> request chassis routing-engine master switch
  • 13:50 cdanis: cr2-eqiad: rebooting RE1 [backup] with new junos version T243080
  • 13:39 cdanis: cr2-eqiad: disable transit/peering BGP & bump fr MED T243080
  • 13:32 marostegui@cumin2001: dbctl commit (dc=all): 'Repool db2092 T254462', diff saved to https://phabricator.wikimedia.org/P11535 and previous config saved to /var/cache/conftool/dbconfig/20200616-133241-marostegui.json
  • 13:17 XioNoX: pfw3-eqiad rollback MED to cr1 to 0 - T243080
  • 13:12 XioNoX: add graceful-switchover to cr1-eqiad
  • 13:09 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 13:08 liw@deploy1001: rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.37
  • 13:06 akosiaris@cumin1001: START - Cookbook sre.hosts.decommission
  • 13:03 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:03 cdanis: T243080 cdanis@re1.cr1-eqiad> request chassis routing-engine master switch
  • 13:03 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 13:03 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:03 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 13:01 moritzm: rebooting mw2291-mw2334
  • 12:54 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 12:51 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 12:47 jbond42: upload new memcache package with TLS to component/memcached16 in buster-wikimedia
  • 12:42 XioNoX: pfw3-eqiad set MED to cr1 to 300 - T243080
  • 12:38 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 12:31 cdanis: T243080 cr1-eqiad: request chassis routing-engine master switch
  • 12:31 cdanis: cr1-eqiad: request chassis routing-engine master switch
  • 12:25 cdanis: cr1-eqiad: rebooting RE1 [backup] with new junos version T243080
  • 12:15 cdanis: cdanis@re0.cr1-eqiad# commit confirmed 2 comment "force VRRP failover T243080"
  • 12:14 cdanis: disable transit/peering & increase frack MED on cr1-eqiad T243080
  • 12:09 hnowlan@cumin2001: END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0)
  • 11:48 cdanis: depooling eqiad for router upgrade T243080
  • 11:42 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:42 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:42 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:42 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:42 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:42 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:42 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:41 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:41 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:41 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:41 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:41 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:41 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:41 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:41 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:41 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:41 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:41 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:41 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:41 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:40 hnowlan: roll-restarting restbase201[0-2] for cert updates
  • 11:40 hnowlan@cumin2001: START - Cookbook sre.cassandra.roll-restart
  • 11:39 jmm@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:39 jmm@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:38 hnowlan@cumin2001: END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0)
  • 11:35 elukey: reboot an-druid100[1,2] for kernel upgrades
  • 11:27 hnowlan: roll-restart restbase2009 for cert update
  • 11:26 hnowlan@cumin2001: START - Cookbook sre.cassandra.roll-restart
  • 11:21 hnowlan@cumin1001: END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0)
  • 11:18 jforrester@deploy1001: Synchronized dblists/mobilemainpagelegacy.dblist: T32405 T254731 Drop mobile special casing of main page for simplewiki, itwikisource, vecwikisource (duration: 01m 05s)
  • 11:15 moritzm: updating perf on stretch hosts
  • 11:14 marostegui: Deploy MCR schema change on db2087:3316
  • 11:09 moritzm: updating perf on buster
  • 11:02 moritzm: rebooting mw2350-mw2376
  • 11:01 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Stop setting wgActorTableSchemaMigrationStage, no longer read in core (duration: 01m 05s)
  • 10:52 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Stop setting wgTagStatisticsNewTable, no longer read in core (duration: 01m 04s)
  • 10:51 hnowlan: roll-restarting restbase101[6-8].eqiad.wmnet for cert updates
  • 10:50 hnowlan@cumin1001: START - Cookbook sre.cassandra.roll-restart
  • 10:44 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Stop setting wgChangeTagsSchemaMigrationStage, no longer read in core (duration: 01m 06s)
  • 10:26 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Stop setting wgCommentTableSchemaMigrationStage, no longer read in core (duration: 01m 07s)
  • 09:54 volans: restarting netbox to pickup modified customscripts
  • 09:14 filippo@cumin1001: conftool action : set/pooled=true; selector: dnsdisc=thanos-swift,name=eqiad
  • 08:53 godog: roll restart prometheus eqiad ops to enable thanos upload
  • 08:48 marostegui: Upgrade db2132
  • 08:44 marostegui@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:42 marostegui@cumin2001: START - Cookbook sre.hosts.downtime
  • 08:39 liw@deploy1001: Finished scap: testwikis wikis to 1.35.0-wmf.37 (duration: 59m 05s)
  • 08:19 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:19 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:19 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:19 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:19 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:19 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:18 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:18 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:18 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:18 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:18 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:18 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:09 volans@deploy1001: Finished deploy [homer/deploy@e9acec8]: Release v0.2.3 on cumin2001 now on buster (take 3bis) (duration: 00m 12s)
  • 08:09 volans@deploy1001: Started deploy [homer/deploy@e9acec8]: Release v0.2.3 on cumin2001 now on buster (take 3bis)
  • 08:09 volans@deploy1001: Finished deploy [homer/deploy@e9acec8]: Release v0.2.3 on cumin2001 now on buster (take 3) (duration: 01m 37s)
  • 08:08 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:08 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:08 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:08 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:08 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:08 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:07 volans@deploy1001: Started deploy [homer/deploy@e9acec8]: Release v0.2.3 on cumin2001 now on buster (take 3)
  • 07:59 volans@deploy1001: Finished deploy [homer/deploy@85e92b8]: Release v0.2.3 on cumin2001 now on buster (take 2) (duration: 00m 57s)
  • 07:58 volans@deploy1001: Started deploy [homer/deploy@85e92b8]: Release v0.2.3 on cumin2001 now on buster (take 2)
  • 07:49 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 07:49 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 07:40 liw@deploy1001: Started scap: testwikis wikis to 1.35.0-wmf.37
  • 07:37 liw@deploy1001: Pruned MediaWiki: 1.35.0-wmf.35 (duration: 01m 47s)
  • 07:31 liw@deploy1001: Pruned MediaWiki: 1.35.0-wmf.34 (duration: 11m 52s)
  • 07:23 ayounsi@cumin1001: END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0)
  • 07:08 ayounsi@cumin1001: START - Cookbook sre.network.prepare-upgrade
  • 07:07 liw: 1.35.0-wmf.37 was branched at f856960 for T254174
  • 07:06 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1148', diff saved to https://phabricator.wikimedia.org/P11526 and previous config saved to /var/cache/conftool/dbconfig/20200616-070651-marostegui.json
  • 07:04 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1148', diff saved to https://phabricator.wikimedia.org/P11525 and previous config saved to /var/cache/conftool/dbconfig/20200616-070450-marostegui.json
  • 07:04 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1084', diff saved to https://phabricator.wikimedia.org/P11524 and previous config saved to /var/cache/conftool/dbconfig/20200616-070429-marostegui.json
  • 07:02 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1084', diff saved to https://phabricator.wikimedia.org/P11523 and previous config saved to /var/cache/conftool/dbconfig/20200616-070209-marostegui.json
  • 06:57 marostegui: Compress InnoDB on db1134 T254462
  • 06:56 marostegui@cumin2001: dbctl commit (dc=all): 'Depool db1134 for InnoDB compression T254462', diff saved to https://phabricator.wikimedia.org/P11522 and previous config saved to /var/cache/conftool/dbconfig/20200616-065600-marostegui.json
  • 06:55 ayounsi@cumin1001: END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0)
  • 06:54 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1093', diff saved to https://phabricator.wikimedia.org/P11521 and previous config saved to /var/cache/conftool/dbconfig/20200616-065412-marostegui.json
  • 06:40 ayounsi@cumin1001: START - Cookbook sre.network.prepare-upgrade
  • 06:25 elukey: roll restart memcached on mc-gp* (gutter pools) to pick up new slab size distribution setting - T252391
  • 06:04 hashar: Restarted Zuul scheduler and merger on contint2001 with a couple of hotfixes # T252310 T255424
  • 05:54 volker-e@deploy1001: Finished deploy [design/style-guide@37c67dd]: Deploy design/style-guide: (duration: 00m 05s)
  • 05:54 volker-e@deploy1001: Started deploy [design/style-guide@37c67dd]: Deploy design/style-guide:
  • 05:04 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 05:01 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 04:59 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1146:3314', diff saved to https://phabricator.wikimedia.org/P11520 and previous config saved to /var/cache/conftool/dbconfig/20200616-045958-marostegui.json
  • 04:57 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1146:3314', diff saved to https://phabricator.wikimedia.org/P11519 and previous config saved to /var/cache/conftool/dbconfig/20200616-045744-marostegui.json
  • 04:56 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1147', diff saved to https://phabricator.wikimedia.org/P11518 and previous config saved to /var/cache/conftool/dbconfig/20200616-045636-marostegui.json
  • 04:55 marostegui: Deploy schema change on db1147
  • 04:54 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1147', diff saved to https://phabricator.wikimedia.org/P11517 and previous config saved to /var/cache/conftool/dbconfig/20200616-045451-marostegui.json
  • 04:46 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1149', diff saved to https://phabricator.wikimedia.org/P11516 and previous config saved to /var/cache/conftool/dbconfig/20200616-044612-marostegui.json
  • 04:44 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1149', diff saved to https://phabricator.wikimedia.org/P11515 and previous config saved to /var/cache/conftool/dbconfig/20200616-044409-marostegui.json
  • 04:43 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1143', diff saved to https://phabricator.wikimedia.org/P11514 and previous config saved to /var/cache/conftool/dbconfig/20200616-044326-marostegui.json
  • 04:41 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1143', diff saved to https://phabricator.wikimedia.org/P11513 and previous config saved to /var/cache/conftool/dbconfig/20200616-044126-marostegui.json
  • 04:40 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1138', diff saved to https://phabricator.wikimedia.org/P11512 and previous config saved to /var/cache/conftool/dbconfig/20200616-044036-marostegui.json
  • 04:37 marostegui: Deploy schema change on db1138
  • 04:37 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1138', diff saved to https://phabricator.wikimedia.org/P11511 and previous config saved to /var/cache/conftool/dbconfig/20200616-043748-marostegui.json
  • 00:28 tstarling@deploy1001: Synchronized wmf-config/CommonSettings.php: limit HTTP client timeout T245170 (duration: 00m 56s)
  • 00:25 tstarling@deploy1001: Synchronized wmf-config/set-time-limit.php: expose excimer timeout as a global variable T245170 (duration: 00m 56s)
  • 00:16 ebernhardson@deploy1001: Finished deploy [wikimedia/discovery/analytics@17212bb]: airflow: migrate leven-dist to edit-dist (duration: 00m 45s)
  • 00:16 volker-e@deploy1001: Finished deploy [design/style-guide@37c67dd]: Deploy design/style-guide: (duration: 00m 04s)
  • 00:16 volker-e@deploy1001: Started deploy [design/style-guide@37c67dd]: Deploy design/style-guide:
  • 00:16 ebernhardson@deploy1001: Started deploy [wikimedia/discovery/analytics@17212bb]: airflow: migrate leven-dist to edit-dist

2020-06-15

  • 23:56 tstarling@deploy1001: Synchronized wmf-config/PoolCounterSettings.php: reducing connect timeout per T105378 (duration: 01m 00s)
  • 23:31 ebernhardson@deploy1001: Finished deploy [wikimedia/discovery/analytics@eb0ac12]: Ship templated table names in HivePartitionRangeSensor (duration: 00m 49s)
  • 23:30 ebernhardson@deploy1001: Started deploy [wikimedia/discovery/analytics@eb0ac12]: Ship templated table names in HivePartitionRangeSensor
  • 22:58 krinkle@deploy1001: Synchronized wmf-config/PhpAutoPrepend.php: If7e1613cbcf8 (duration: 00m 56s)
  • 22:57 krinkle@deploy1001: Synchronized wmf-config/profiler.php: If7e1613cbcf8 (duration: 00m 59s)
  • 22:02 bstorm_: downtimed puppet alerts for testing some changes on labstore1004/5
  • 20:59 ebernhardson@deploy1001: Finished deploy [search/airflow@62a024b]: Add pydruid to airflow (duration: 00m 50s)
  • 20:58 ebernhardson@deploy1001: Started deploy [search/airflow@62a024b]: Add pydruid to airflow
  • 20:55 shdubsh: update mtail to 3.0.0~rc35 on the rest of the hosts - eqiad and esams
  • 20:44 shdubsh: update mtail to 3.0.0~rc35 on cp nodes in eqiad and esams
  • 20:30 shdubsh: update mtail to 3.0.0~rc35 on wtp in eqiad
  • 19:35 shdubsh: update mtail to 3.0.0~rc35 on mw in eqiad
  • 18:50 ebernhardson@deploy1001: Finished deploy [wikimedia/discovery/analytics@41186c8]: port glent from oozie to airflow (duration: 00m 39s)
  • 18:50 ebernhardson@deploy1001: Started deploy [wikimedia/discovery/analytics@41186c8]: port glent from oozie to airflow
  • 18:28 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: gerrit:605584 T254315 test wikidata: Use the database name in the Wikibase entity source config (duration: 00m 58s)
  • 17:56 krinkle@deploy1001: Synchronized wmf-config: I7721f4 (duration: 00m 58s)
  • 17:55 krinkle@deploy1001: Synchronized wmf-config/ProductionServices.php: I7721f4 (duration: 00m 57s)
  • 17:52 krinkle@deploy1001: Synchronized lib/: I7721f4 (duration: 00m 58s)
  • 15:38 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1142', diff saved to https://phabricator.wikimedia.org/P11504 and previous config saved to /var/cache/conftool/dbconfig/20200615-153825-marostegui.json
  • 15:37 marostegui: Deploy schema change on db1142
  • 15:36 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1142', diff saved to https://phabricator.wikimedia.org/P11503 and previous config saved to /var/cache/conftool/dbconfig/20200615-153630-marostegui.json
  • 15:35 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1141', diff saved to https://phabricator.wikimedia.org/P11502 and previous config saved to /var/cache/conftool/dbconfig/20200615-153546-marostegui.json
  • 15:33 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1141', diff saved to https://phabricator.wikimedia.org/P11501 and previous config saved to /var/cache/conftool/dbconfig/20200615-153344-marostegui.json
  • 15:16 moritzm: upgrading wtp1025-wtp1027 to PHP 7.2.31
  • 15:09 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1121', diff saved to https://phabricator.wikimedia.org/P11499 and previous config saved to /var/cache/conftool/dbconfig/20200615-150908-marostegui.json
  • 15:07 marostegui: Deploy schema change on db1121 (and labs)
  • 15:06 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1121', diff saved to https://phabricator.wikimedia.org/P11498 and previous config saved to /var/cache/conftool/dbconfig/20200615-150639-marostegui.json
  • 15:01 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1144:3314', diff saved to https://phabricator.wikimedia.org/P11497 and previous config saved to /var/cache/conftool/dbconfig/20200615-150148-marostegui.json
  • 15:00 marostegui: Deploy schema change on db1144:3314
  • 14:59 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1144:3314', diff saved to https://phabricator.wikimedia.org/P11496 and previous config saved to /var/cache/conftool/dbconfig/20200615-145914-marostegui.json
  • 14:55 XioNoX: delete VCP from msw1-codfw
  • 14:24 marostegui: Deploy schema change on db2107 (s2 codfw master) - T250066
  • 14:16 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 14:11 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 14:09 elukey@cumin2001: END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0)
  • 13:54 marostegui: Deploy schema change on db1100 (s5 master) - T250066
  • 13:54 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 13:49 marostegui: Upgrade db2133
  • 13:49 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 13:44 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:44 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 13:41 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 13:40 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 13:38 elukey@cumin2001: START - Cookbook sre.hadoop.roll-restart-workers
  • 13:31 volans@deploy1001: Finished deploy [homer/deploy@ac7a4c6]: Release v0.2.3 on cumin2001 now on buster (duration: 01m 15s)
  • 13:30 moritzm: rolling reboot on the ganeti cluster in esams (for kernel security updates and to pick up the network changes to provide instances with a public IP)
  • 13:30 volans@deploy1001: Started deploy [homer/deploy@ac7a4c6]: Release v0.2.3 on cumin2001 now on buster
  • 13:26 hashar: Started zuul-merger on contint1001 with newer virtualenv # T255424
  • 13:26 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 13:21 filippo@cumin1001: conftool action : set/pooled=true; selector: dnsdisc=thanos-query,name=eqiad
  • 13:20 hashar: Stopping zuul-merger on contint1001 to rebuild the virtualenv # T255424
  • 13:19 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 13:04 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 12:59 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2091:3312, db2091:3314 - T253217', diff saved to https://phabricator.wikimedia.org/P11495 and previous config saved to /var/cache/conftool/dbconfig/20200615-125856-marostegui.json
  • 12:58 vgutierrez: upgrade acme-chief to version 0.26
  • 12:57 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 12:46 vgutierrez: upload acme-chief 0.26 to apt.wm.o (buster) - T255249
  • 12:43 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 12:38 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 12:34 moritzm: rolling reboot on the ganeti cluster in eqsin (for security updates and to pick up the network changes to provide instances with a public IP)
  • 12:12 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:11 marostegui: Upgrade db2134
  • 12:09 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 12:01 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:59 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:57 moritzm: reimaging sretest1002 to validate the reimage script on Buster
  • 11:43 marostegui: Reimage dbproxy2003 which points to m3-master.codfw.wmnet (not in use) - T255408
  • 11:40 tgr@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Config: GrowthExperiments: Switch on guidance feature (T239181) (duration: 00m 57s)
  • 11:10 sukhe@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:10 sukhe@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:07 hnowlan: regenerated certificates for restbase2009, restbase101[678], restbase201[012]. Did not roll-restart yet
  • 11:07 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 11:04 jmm@cumin2001: START - Cookbook sre.hosts.reboot-single
  • 11:03 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:03 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 10:54 moritzm: imported python-phabricator 0.7.0-2~wmf2 to apt.wikimedia.org/buster-wikimedia T245114
  • 10:39 jdrewniak@deploy1001: Synchronized portals: Wikimedia Portals Update: Bumping portals to master (605553) (duration: 00m 58s)
  • 10:38 hnowlan: regenerated restbase2009's cassandra certificates
  • 10:38 jdrewniak@deploy1001: Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: Bumping portals to master (605553) (duration: 00m 58s)
  • 10:16 jmm@cumin1001: END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97)
  • 10:16 jmm@cumin1001: START - Cookbook sre.hosts.reboot-single
  • 10:12 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T254820 [enwikivoyage] Undeploy the Listings extension (duration: 01m 00s)
  • 10:08 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:05 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:53 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:50 filippo@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:46 godog: run logstash benchmark on logstash1023
  • 09:42 volans: deploying esams mgmt DNS records automatically generated by Netbox ( operations/dns/+/604136/ ) - T233183
  • 09:41 volans@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 09:35 volans@cumin1001: START - Cookbook sre.dns.netbox
  • 09:29 elukey: update analytics-in4/6 filters on cr1-cr2 eqiad to update the Druid term (new nodes added)
  • 09:21 jbond42: offlining puppetmaster1003 and 2003 for reboot
  • 09:17 XioNoX: reduce ae device-count from 10 to 3 on asw2-a/b/c-eqiad
  • 09:14 jmm@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:11 jmm@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:55 marostegui: Deploy schema change on db2123 (s5 codfw master) - T250066
  • 08:50 kart_: Updated cxserver to 2020-06-10-044445-production (T246319, T254959)
  • 08:46 kartik@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 08:42 kartik@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 08:39 kartik@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
  • 08:34 moritzm: reimaging cumin2001 T245114
  • 08:22 marostegui: Switchover m3-master from dbproxy1008 to dbproxy1016 - T202367
  • 08:17 marostegui: Deploy schema change on db1131 (s6 master) - T250066
  • 08:09 moritzm: installing libexif security updates
  • 07:51 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 07:49 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 07:46 XioNoX: standardize ae device-count on all routers
  • 07:36 XioNoX: push new pfw firewall policies - T255185
  • 07:28 marostegui: Deploy schema change on db1093
  • 07:28 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1093 for schema change', diff saved to https://phabricator.wikimedia.org/P11492 and previous config saved to /var/cache/conftool/dbconfig/20200615-072835-marostegui.json
  • 07:27 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2092', diff saved to https://phabricator.wikimedia.org/P11491 and previous config saved to /var/cache/conftool/dbconfig/20200615-072742-marostegui.json
  • 06:21 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 06:19 marostegui@cumin1001: START - Cookbook sre.hosts.downtime

2020-06-14

  • 13:51 qchris: Disabling puppet on gerrit1002 (test instance) to do some more upgrade testing

2020-06-13

  • 21:12 qchris: Enabling puppet on gerrit1002 (test instance). Done with testing for today.
  • 12:51 herron: restarted logstash service on logstash1007, logstash1009
  • 12:34 qchris: Disabling puppet on gerrit1002 (test instance) to do some more upgrade testing
  • 12:33 godog: bounce logstash on logstash1008, GC death

2020-06-12

  • 17:44 herron: restarting logstash1011 elasticsearch instance
  • 16:49 elukey: restart php-fpm and pool mw1384 - T255282
  • 16:33 elukey: (correct) depool again mw1384 - investigation will follow up in a task
  • 16:32 elukey: depool again mw1348 - investigation will follow up in a task
  • 15:49 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:44 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:40 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 15:40 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:37 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 15:36 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 15:27 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:25 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 15:24 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 15:24 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 15:24 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 15:24 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:22 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 15:22 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 15:22 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:22 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:22 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:22 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:22 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:22 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:22 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:51 elukey: repool mw1384 as test
  • 14:31 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 14:30 akosiaris: bump cpu limits for changeprop another 50%
  • 14:30 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 13:36 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 13:34 akosiaris: update changeprop in eqiad+codfw for higher CPU limits
  • 13:34 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 13:12 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1088 after schema change', diff saved to https://phabricator.wikimedia.org/P11483 and previous config saved to /var/cache/conftool/dbconfig/20200612-131205-marostegui.json
  • 12:40 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1088 for schema change', diff saved to https://phabricator.wikimedia.org/P11482 and previous config saved to /var/cache/conftool/dbconfig/20200612-124015-marostegui.json
  • 12:18 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
  • 11:52 filippo@cumin1001: START - Cookbook sre.hosts.upgrade-and-reboot
  • 11:23 jmm@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 11:19 jmm@cumin1001: START - Cookbook sre.hosts.reboot-single
  • 11:15 moritzm: failover ganeti master in ulsfo to ganeti4003
  • 11:14 marostegui@cumin1001: dbctl commit (dc=all): 'Pool db2080 and db2084 into s8 T253217', diff saved to https://phabricator.wikimedia.org/P11481 and previous config saved to /var/cache/conftool/dbconfig/20200612-111422-marostegui.json
  • 11:11 jmm@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 11:07 jmm@cumin1001: START - Cookbook sre.hosts.reboot-single
  • 11:02 jmm@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 10:58 jmm@cumin1001: START - Cookbook sre.hosts.reboot-single
  • 10:39 jmm@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 10:36 jmm@cumin1001: START - Cookbook sre.hosts.reboot-single
  • 10:33 moritzm: rolling restart of the ulsfo ganeti cluster
  • 10:21 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
  • 10:02 filippo@cumin1001: START - Cookbook sre.hosts.upgrade-and-reboot
  • 10:01 jmm@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:01 filippo@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
  • 10:01 filippo@cumin1001: START - Cookbook sre.hosts.reboot-single
  • 10:01 jmm@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:58 marostegui@cumin1001: dbctl commit (dc=all): 'Include db2084 in dbctl, depooled', diff saved to https://phabricator.wikimedia.org/P11480 and previous config saved to /var/cache/conftool/dbconfig/20200612-095855-marostegui.json
  • 09:58 godog: roll-restart thanos-fe / thanos-be for microcode updates
  • 08:51 elukey: restart gerrit on gerrit1001
  • 08:48 elukey: update cr1/cr2 analytics filters for T252767 and T252675
  • 08:44 marostegui: Compress InnoDB on db2092 - T254462
  • 08:36 marostegui: Clone db2084 from db2080
  • 08:32 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2080 to clone db2084', diff saved to https://phabricator.wikimedia.org/P11478 and previous config saved to /var/cache/conftool/dbconfig/20200612-083231-marostegui.json
  • 08:24 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:22 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:14 marostegui@cumin1001: dbctl commit (dc=all): 'Remove db2084 from s4 and s5', diff saved to https://phabricator.wikimedia.org/P11477 and previous config saved to /var/cache/conftool/dbconfig/20200612-081455-marostegui.json
  • 07:56 elukey: depool mw1384
  • 07:52 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2084 from s4 and s5', diff saved to https://phabricator.wikimedia.org/P11476 and previous config saved to /var/cache/conftool/dbconfig/20200612-075202-marostegui.json
  • 07:26 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 07:24 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 07:08 marostegui: Reimage db2086
  • 07:07 elukey: depool/scap pull/pool mw1384
  • 07:05 moritzm: installing intel-microcode security updates (regressions have been sorted out)
  • 05:42 moritzm: installing stretch kernel security updates (no reboots yet)
  • 05:40 moritzm: installing buster kernel security updates (no reboots yet)
  • 04:54 marostegui: Deploy schema change on s6 codfw - T250066
  • 01:02 ejegg: updated payments-wiki from aceddff8b5 to 5fd4eb1519
  • 00:10 Amir1: BACON is done

2020-06-11

  • 23:54 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.36/extensions/Wikibase: Fix entity id lookup for interwiki special page links (T255078) (duration: 00m 38s)
  • 23:51 ladsgroup@deploy1001: scap failed: average error rate on 3/9 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/e474f13ffac6b8c3bf919c4aeafc8c9b for details)
  • 23:43 ladsgroup@deploy1001: Synchronized wmf-config/extension-list: Remove ContributionTracking extension (T255216), Part III (duration: 00m 57s)
  • 23:42 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Remove ContributionTracking extension (T255216), Part II (duration: 00m 58s)
  • 23:38 ladsgroup@deploy1001: Synchronized wmf-config/CommonSettings.php: Remove ContributionTracking extension (T255216), Part I (duration: 00m 59s)
  • 23:37 Reedy: create cn_notice_regions on metawiki and testwiki T252596
  • 20:34 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 20:31 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 20:15 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 20:13 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 20:00 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 19:59 jhuneidi@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.36
  • 19:58 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 19:33 akosiaris: apply emergency sessionstore fixes in codfw as well
  • 19:32 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'sessionstore' for release 'production' .
  • 19:25 gilles@deploy1001: Finished deploy [performance/asoranking@0a096c4]: T252424 (duration: 00m 47s)
  • 19:19 gilles@deploy1001: Started deploy [performance/asoranking@0a096c4]: T252424
  • 19:12 akosiaris: repool eqiad for sessionstore
  • 19:12 akosiaris@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=sessionstore
  • 19:10 akosiaris: remove the podaffinity restrictions for sessionstore in eqiad
  • 19:10 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'sessionstore' for release 'production' .
  • 19:07 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'sessionstore' for release 'production' .
  • 18:08 ppchelko@deploy1001: Synchronized wmf-config/reverse-proxy-staging.php: Beta: Switch from HTCP purging to kafka purging gerrit:603530, reverse-proxy-staging.php (duration: 01m 06s)
  • 18:06 ppchelko@deploy1001: Synchronized wmf-config/InitialiseSettings-labs.php: Beta: Switch from HTCP purging to kafka purging gerrit:603530, IS-labs.php (duration: 01m 06s)
  • 17:29 mbsantos@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'proton' for release 'production' .
  • 17:26 mbsantos@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
  • 17:22 mbsantos@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'proton' for release 'production' .
  • 17:19 mbsantos@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
  • 17:12 bstorm_: reboot for stretch upgrade on labstore1004 T224582
  • 16:49 bstorm_: doing stretch upgrade for labstore1004 T224582
  • 16:36 bstorm_: rebooting labstore1004 for upgrades T224582
  • 16:12 bstorm_: downtimed labstore1005 for upgrades on 1004 since that will alert as well T224582
  • 16:10 bstorm_: downtimed labstore1004 for upgrades T224582
  • 15:50 cstone: SmashPig revision changed from b9de3c7aac to 2246685626
  • 15:34 jmm@cumin1001: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
  • 15:31 jmm@cumin1001: START - Cookbook sre.hosts.reboot-single
  • 15:25 moritzm: installing buster kernel security updates (no reboots yet)
  • 15:04 jmm@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99)
  • 15:04 mforns@deploy1001: Finished deploy [analytics/refinery@c969b56]: Regular analytics weekly train [analytics/refinery@c969b56afae1b2532e07f0ff699c2ce161360966] (duration: 01m 39s)
  • 15:04 root@cumin1001: END (FAIL) - Cookbook sre.network.prepare-upgrade (exit_code=99)
  • 15:04 root@cumin1001: START - Cookbook sre.network.prepare-upgrade
  • 15:02 mforns@deploy1001: Started deploy [analytics/refinery@c969b56]: Regular analytics weekly train [analytics/refinery@c969b56afae1b2532e07f0ff699c2ce161360966]
  • 15:02 jmm@cumin1001: START - Cookbook sre.hosts.reboot-single
  • 14:56 herron: bounced elasticsearch on logstash1012
  • 14:41 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 14:40 akosiaris@cumin1001: START - Cookbook sre.hosts.decommission
  • 14:37 herron: enabled VO incident resolution notification in global settings
  • 14:34 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 14:31 akosiaris@cumin1001: START - Cookbook sre.hosts.decommission
  • 14:30 godog: bounce logstash on logstash1009, apparent GC death spiral
  • 14:03 jmm@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99)
  • 14:03 jmm@cumin1001: START - Cookbook sre.hosts.reboot-single
  • 14:03 jmm@cumin1001: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
  • 14:03 jmm@cumin1001: START - Cookbook sre.hosts.reboot-single
  • 13:35 filippo@cumin1001: conftool action : set/pooled=false; selector: dnsdisc=thanos-query,name=eqiad
  • 13:35 filippo@cumin1001: conftool action : set/pooled=false; selector: dnsdisc=thanos-swift,name=eqiad
  • 12:39 ayounsi@cumin1001: END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0)
  • 12:36 elukey: updated pcc facts
  • 12:28 jayme@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
  • 12:28 jayme@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 12:28 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:25 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:15 jayme@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
  • 12:15 jayme@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 12:04 jforrester@deploy1001: Synchronized php-1.35.0-wmf.36/includes/title/NamespaceInfo.php: T253098 NamespaceInfo::makeValidNamespace: Don't throw for -1 or -2 (duration: 01m 06s)
  • 12:03 marostegui: Reimage es2023 (es5 codfw master)
  • 11:54 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2075 T254139', diff saved to https://phabricator.wikimedia.org/P11469 and previous config saved to /var/cache/conftool/dbconfig/20200611-115430-marostegui.json
  • 11:46 marostegui: Deploy schema change on s6 codfw - T250066
  • 11:44 volans@deploy1001: Finished deploy [homer/deploy@df83901]: Release v0.2.3 (duration: 00m 25s)
  • 11:44 volans@deploy1001: Started deploy [homer/deploy@df83901]: Release v0.2.3
  • 11:36 ayounsi@cumin1001: START - Cookbook sre.network.prepare-upgrade
  • 11:36 matthiasmullie: EU BACON done
  • 11:35 mlitn@deploy1001: Synchronized php-1.35.0-wmf.36/extensions/GrowthExperiments: Help panel: Update guidance behavior rules (duration: 01m 06s)
  • 11:34 jayme@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 11:34 jayme@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 11:28 kartik@deploy1001: Synchronized php-1.35.0-wmf.36/extensions/ContentTranslation/modules/tools/mw.cx.tools.IssueTrackingTool.js: Backport: 604587|IssueTrackingTool: Fix js error in getCurrentNodeId method (T254965) (duration: 01m 07s)
  • 11:08 jayme@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
  • 11:04 mlitn@deploy1001: Synchronized php-1.35.0-wmf.36/extensions/MachineVision: $aliases should be an array of strings, not AliasGroup objects (duration: 01m 07s)
  • 10:47 moritzm: repooling mw1318,mw2139,mw2145,mw2147,mw2221,mw2219,mw2250,mw2350 (these were depooled, but seem all fine in Icinga and were probably just forgotten)
  • 10:41 filippo@cumin1001: conftool action : set/pooled=yes; selector: cluster=thanos,service=thanos-swift
  • 10:40 filippo@cumin1001: conftool action : set/pooled=yes; selector: cluster=thanos,service=thanos-query
  • 10:37 moritzm: installing buster kernel security updates (no reboots yet, on hold for regression-free microcode update)
  • 10:32 godog: roll-restart pybal in eqiad lvs low-traffic
  • 10:21 mutante: restarting gerrit on gerrit-replica (gerrit2001) - java.lang.OutOfMemoryError: Java heap space
  • 10:21 Urbanecm: Run scap pull at mwdebug1001 to revert temporary changes
  • 10:14 Urbanecm: Applying temporary changes on mwdebug1001
  • 09:58 moritzm: upgrading netmon* to PHP 7.2.31
  • 09:55 marostegui: Upgrade es2025
  • 09:54 moritzm: upgrading mwmaint* to PHP 7.2.31
  • 09:46 moritzm: upgrading labweb* to PHP 7.2.31
  • 09:36 elukey: switch piwik.wikimedia.org from matomo1001 to matomo1002 (new buster node)
  • 09:02 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:00 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:48 jayme@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 08:48 jayme@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
  • 08:42 moritzm: imported memcached 1.6.6-1~wmf10u1
  • 08:39 marostegui: Reimage es2024 to buster
  • 08:30 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:30 filippo@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:25 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:25 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:25 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:25 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:24 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:24 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:24 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 08:24 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:23 jayme@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 08:23 jayme@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
  • 08:22 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:20 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:18 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:18 filippo@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:01 jayme@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 08:01 jayme@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 07:59 moritzm: upgrading remaining job runners in eqiad to PHP 7.2.31
  • 07:59 hashar: Restarted Zuul on contint2001 for config change # T253263
  • 07:43 jayme@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
  • 07:34 moritzm: upgrading remaining app servers in eqiad to PHP 7.2.31
  • 07:26 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 07:24 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 07:07 marostegui: Stop MySQL on dbstore1003 for reimage - T254870
  • 06:38 XioNoX: make asw2-esams interfaces Homer like - T250429
  • 05:55 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1127 T253217', diff saved to https://phabricator.wikimedia.org/P11467 and previous config saved to /var/cache/conftool/dbconfig/20200611-055536-marostegui.json
  • 05:25 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1127 T253217', diff saved to https://phabricator.wikimedia.org/P11466 and previous config saved to /var/cache/conftool/dbconfig/20200611-052535-marostegui.json
  • 05:04 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1127 T253217', diff saved to https://phabricator.wikimedia.org/P11465 and previous config saved to /var/cache/conftool/dbconfig/20200611-050446-marostegui.json
  • 05:02 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1078', diff saved to https://phabricator.wikimedia.org/P11464 and previous config saved to /var/cache/conftool/dbconfig/20200611-050200-marostegui.json
  • 04:54 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1078', diff saved to https://phabricator.wikimedia.org/P11463 and previous config saved to /var/cache/conftool/dbconfig/20200611-045426-marostegui.json
  • 04:50 marostegui: Deploy schema change on testwiki - T254371
  • 04:47 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1084 and slowly repool db1127 T253217', diff saved to https://phabricator.wikimedia.org/P11462 and previous config saved to /var/cache/conftool/dbconfig/20200611-044725-marostegui.json
  • 03:13 shdubsh: removing WDQS-Streaming-Updater-POC metrics on graphite1004 - T255044
  • 02:43 tstarling@deploy1001: Synchronized php-1.35.0-wmf.36/extensions/Wikibase/lib/includes/Store/EntityLinkTargetEntityIdLookup.php: investigate UBN T255078 (duration: 01m 07s)

2020-06-10

  • 23:55 catrope@deploy1001: Synchronized php-1.35.0-wmf.36/includes/skins/SkinTemplate.php: T255073 (duration: 01m 07s)
  • 22:14 eileen: civicrm revision changed from 80a0d22350 to f01b036128, config revision is a26d023633
  • 21:23 akosiaris: increase memory/cpu limits for proton
  • 21:23 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'proton' for release 'production' .
  • 21:11 mbsantos@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'proton' for release 'production' .
  • 21:08 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
  • 21:06 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
  • 20:45 mbsantos@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'proton' for release 'production' .
  • 20:33 jhuneidi@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
  • 20:15 mbsantos@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
  • 20:04 mbsantos@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
  • 19:46 herron: bouncing elasticsearch on logstash1011
  • 19:01 ppchelko@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Use EventRelayerNull for wikitech, gerrit:604469 (duration: 01m 05s)
  • 18:54 urbanecm@deploy1001: Synchronized php-1.35.0-wmf.36/extensions/VisualEditor/: 8958860: Make VisualEditorDisableForAnons only hide the tabs, not disable the editor (T253941) (duration: 01m 07s)
  • 18:32 urbanecm@deploy1001: Synchronized php-1.35.0-wmf.35/extensions/VisualEditor/: 5f4c609: Make VisualEditorDisableForAnons only hide the tabs, not disable the editor (T253941) (duration: 01m 14s)
  • 16:40 godog: EDIT: in esams
  • 16:39 godog: restart prometheus@ops in eqiad
  • 16:31 ppchelko@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Disable HTCP purges everywhere, gerrit:603655 (duration: 01m 05s)
  • 16:27 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
  • 16:27 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 16:18 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 16:18 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
  • 16:13 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 16:13 ema: correction: restart purged on all *cache_upload* hosts to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604430/ T250781 T133821
  • 16:12 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 16:12 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 16:12 ema: restart purged on all cache hosts to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604430/ T250781 T133821
  • 16:11 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 16:06 ema: cp3051: restart purged to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604430/ T250781 T133821
  • 16:02 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 16:00 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 15:49 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 15:45 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 15:38 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
  • 15:37 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 15:36 ppchelko@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Send kafka purges everywhere, gerrit:603654 (duration: 01m 05s)
  • 15:35 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 15:32 ema: remaining-cp (non-ulsfo): rolling ats-tls-restart to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604305/ T255015
  • 15:29 ppchelko@deploy1001: Synchronized wmf-config/CommonSettings.php: Make kafka purges config more robust, gerrit:603649, CS.php (duration: 01m 05s)
  • 15:27 ppchelko@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Make kafka purges config more robust, gerrit:603649, IS.php (duration: 01m 08s)
  • 15:21 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 15:19 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 15:08 godog: roll-restart prometheus k8s to enable thanos upload
  • 15:02 ema: A:cp-ulsfo: rolling ats-tls-restart to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604305/ T255015
  • 14:43 ema: A:cp rolling systemctl restart trafficserver
  • 14:28 ema: systemctl restart trafficserver for instances critical in icinga
  • 14:21 ema: cp3056: ats-backend-restart
  • 14:09 ema: A:cp rolling ats-be/ats-tls restarts to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604305/ T255015
  • 14:08 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:06 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:02 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:59 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 13:57 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1094 into s7', diff saved to https://phabricator.wikimedia.org/P11458 and previous config saved to /var/cache/conftool/dbconfig/20200610-135753-marostegui.json
  • 13:50 ema: cp3050: ats-tls-restart to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604305/ T255015
  • 13:50 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1094 into s7', diff saved to https://phabricator.wikimedia.org/P11457 and previous config saved to /var/cache/conftool/dbconfig/20200610-135039-marostegui.json
  • 13:40 ema: cp3050: ats-backend-restart to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604305/ T255015
  • 13:36 ayounsi@cumin1001: END (FAIL) - Cookbook sre.network.prepare-upgrade (exit_code=99)
  • 13:06 liw@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.36 (duration: 01m 04s)
  • 13:05 liw@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.36
  • 12:33 ayounsi@cumin1001: START - Cookbook sre.network.prepare-upgrade
  • 12:32 ayounsi@cumin1001: END (FAIL) - Cookbook sre.network.prepare-upgrade (exit_code=99)
  • 12:32 ayounsi@cumin1001: START - Cookbook sre.network.prepare-upgrade
  • 12:13 akosiaris: pool thumbor2002, thumbor2001. T251570
  • 12:12 akosiaris@cumin1001: conftool action : set/pooled=yes; selector: name=thumbor2002.codfw.wmnet
  • 12:12 akosiaris@cumin1001: conftool action : set/pooled=yes; selector: name=thumbor2001.codfw.wmnet
  • 11:50 marostegui: Deploy schema change on commonswiki codfw T255003
  • 11:41 moritzm: upgrading remaining app servers in codfw to PHP 7.2.31
  • 11:38 marostegui: Deploy schema change on testcommonswiki T255003
  • 11:37 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 52091b8: Grant cswiki accountcreators tboverride-account and override-antispoof (T254927) (duration: 01m 06s)
  • 11:13 moritzm: upgrading remaining job runners in codfw to PHP 7.2.31
  • 11:02 marostegui: Stop MySQL on db1094 to clone db1127
  • 11:02 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1094 moving to clone db1127 T253217', diff saved to https://phabricator.wikimedia.org/P11453 and previous config saved to /var/cache/conftool/dbconfig/20200610-110204-marostegui.json
  • 10:57 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:54 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:37 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1127 moving it to s7 T253217', diff saved to https://phabricator.wikimedia.org/P11452 and previous config saved to /var/cache/conftool/dbconfig/20200610-103742-marostegui.json
  • 10:28 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1103,db1137 into x1', diff saved to https://phabricator.wikimedia.org/P11451 and previous config saved to /var/cache/conftool/dbconfig/20200610-102805-marostegui.json
  • 10:24 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T254036 Undeploy CollaborationKit: IV – Drop flag to load (duration: 01m 05s)
  • 10:23 jayme: T254581 re-enabled puppet on all mw, api and jobrunner servers
  • 10:20 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: T254036 Undeploy CollaborationKit: III – Drop ability to load (duration: 01m 05s)
  • 10:16 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T254036 Undeploy CollaborationKit: II – Disable on Test Wikipedia (duration: 01m 37s)
  • 10:14 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1103,db1137 into x1', diff saved to https://phabricator.wikimedia.org/P11450 and previous config saved to /var/cache/conftool/dbconfig/20200610-101407-marostegui.json
  • 10:12 moritzm: upgrading remaining API servers in codfw to PHP 7.2.31
  • 10:08 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1103,db1137 into x1', diff saved to https://phabricator.wikimedia.org/P11449 and previous config saved to /var/cache/conftool/dbconfig/20200610-100834-marostegui.json
  • 10:03 jynus: cloning reviewdb into reviewdb-test at db1132 with replication enabled T254516
  • 10:03 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1103 into x1', diff saved to https://phabricator.wikimedia.org/P11448 and previous config saved to /var/cache/conftool/dbconfig/20200610-100306-marostegui.json
  • 10:00 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1137 into x1', diff saved to https://phabricator.wikimedia.org/P11447 and previous config saved to /var/cache/conftool/dbconfig/20200610-100037-marostegui.json
  • 09:35 volans: imported 0.0.38-1+deb10u1 into buster-wikimedia APT - T245114
  • 09:35 marostegui: Stop MySQL on db1127 to clone db1103
  • 09:34 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1137 for cloning db1103 - T253217', diff saved to https://phabricator.wikimedia.org/P11443 and previous config saved to /var/cache/conftool/dbconfig/20200610-093440-marostegui.json
  • 09:31 elukey@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 09:31 godog: configure thanos-be1* HDDs as raid0 - T252186
  • 09:26 marostegui@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 09:26 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:26 marostegui@cumin1001: dbctl commit (dc=all): 'Add db1103 to dbctl, depooled T253217', diff saved to https://phabricator.wikimedia.org/P11442 and previous config saved to /var/cache/conftool/dbconfig/20200610-092603-marostegui.json
  • 09:24 marostegui@cumin1001: dbctl commit (dc=all): 'Remove db1103:3312 and db1103:3314', diff saved to https://phabricator.wikimedia.org/P11441 and previous config saved to /var/cache/conftool/dbconfig/20200610-092406-marostegui.json
  • 09:14 jayme: T254581 disabling puppet on all mw, api and jobrunner servers to move termbox envoy config to TLS
  • 09:08 kormat@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 09:08 kormat@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:50 XioNoX: make asw1-eqsin interfaces Homer like - T250429
  • 08:45 jayme@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
  • 08:45 jayme@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
  • 08:45 elukey@cumin1001: START - Cookbook sre.ganeti.makevm
  • 08:17 jayme@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
  • 08:15 kormat@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 08:14 kormat@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:13 jayme@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
  • 07:53 kormat: reimaging db1077 T252027
  • 07:36 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
  • 07:36 XioNoX: make asw2-ulsfo interfaces Homer like - T250429
  • 07:33 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
  • 07:31 moritzm: upgrade mw1298-mw1309 (job runners) to PHP 7.2.31
  • 07:26 XioNoX: trunk public vlan to esams ganeti hosts - T254157
  • 07:16 XioNoX: trunk public vlan to eqsin ganeti hosts - T254157
  • 07:15 moritzm: upgrade remaining API servers in eqiad to PHP 7.2.31
  • 07:08 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1103 for reimage - T253217', diff saved to https://phabricator.wikimedia.org/P11439 and previous config saved to /var/cache/conftool/dbconfig/20200610-070822-marostegui.json
  • 07:05 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2113 after on-site maintenance T251570', diff saved to https://phabricator.wikimedia.org/P11438 and previous config saved to /var/cache/conftool/dbconfig/20200610-070508-marostegui.json
  • 06:53 XioNoX: trunk public vlan to ulsfo ganeti hosts - T254157
  • 05:10 marostegui: Deploy schema change on s3 master with 2 minutes sleep between wikis - T206103

2020-06-09

  • 23:18 Reedy: run namespaceDupes.php --fix for hiwikibooks T254012
  • 23:10 reedy@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T254706 T254012 T241893 (duration: 01m 06s)
  • 23:03 Reedy: created wikilove_log on slwiki T254706
  • 20:00 jhuneidi@deploy1001: Pruned MediaWiki: 1.35.0-wmf.32 (duration: 05m 11s)
  • 19:51 jhuneidi@deploy1001: rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.36
  • 19:42 jhuneidi@deploy1001: Finished scap: testwikis wikis to 1.35.0-wmf.36 (duration: 57m 47s)
  • 19:29 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 19:26 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 19:26 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 19:23 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 19:10 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 19:07 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 19:07 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 19:05 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 18:45 jhuneidi@deploy1001: Started scap: testwikis wikis to 1.35.0-wmf.36
  • 18:41 jforrester@deploy1001: Synchronized php-1.35.0-wmf.36/extensions/TimedMediaHandler/includes/TimedMediaHandler.php: T254824 Avoid undefined index error (duration: 00m 57s)
  • 18:36 volans: migrated mgmt DNS records in eqsin to the Netbox-generated records - T233183
  • 18:13 jforrester@deploy1001: Synchronized php-1.35.0-wmf.36/extensions/CheckUser/: T234921 T254912 Use UserGroupManagerFactory with correct domain to fetch groups (duration: 02m 26s)
  • 18:12 volans: uploaded cumin_4.0.0rc1-1_amd64.deb to apt.wikimedia.org buster-wikimedia
  • 16:43 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 16:40 andrew@cumin1001: START - Cookbook sre.hosts.downtime
  • 16:06 longma: cutting the branch for 1.35.0-wmf.36 T254173
  • 15:26 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:26 aborrero@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:25 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:25 aborrero@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:06 volans: forcing a debmonitor GC to verify the fix of T254865
  • 14:59 mutante: gerrit2001 - delete gerrit logfiles older than 30 days, crons are now enabled to keep doing it in the future
  • 14:55 volans@deploy1001: Finished deploy [debmonitor/deploy@44aa1ee]: Release v0.2.5 (duration: 00m 43s)
  • 14:54 volans@deploy1001: Started deploy [debmonitor/deploy@44aa1ee]: Release v0.2.5
  • 14:49 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2131 after reimage', diff saved to https://phabricator.wikimedia.org/P11436 and previous config saved to /var/cache/conftool/dbconfig/20200609-144929-marostegui.json
  • 14:45 kormat@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:42 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:40 kormat@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:40 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:34 moritzm: rebooting auth1002
  • 14:33 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:32 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:32 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 14:30 andrew@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:12 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:09 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:00 elukey: update release repository's settings on Archiva - T254849
  • 14:00 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:00 aborrero@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:00 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:00 aborrero@cumin1001: START - Cookbook sre.hosts.downtime
  • 13:56 kormat@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:54 kormat@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:38 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2131 for reimage', diff saved to https://phabricator.wikimedia.org/P11434 and previous config saved to /var/cache/conftool/dbconfig/20200609-123817-marostegui.json
  • 12:22 kormat: reimaging sretest1002 T252027
  • 12:18 kartik@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 12:16 kartik@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 12:14 kartik@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
  • 12:00 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1141 into s4 T252512', diff saved to https://phabricator.wikimedia.org/P11433 and previous config saved to /var/cache/conftool/dbconfig/20200609-120009-marostegui.json
  • 11:50 kartik@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 11:50 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1141 into s4 T252512', diff saved to https://phabricator.wikimedia.org/P11432 and previous config saved to /var/cache/conftool/dbconfig/20200609-115016-marostegui.json
  • 11:46 kartik@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 11:46 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1148 into s4 T252512', diff saved to https://phabricator.wikimedia.org/P11431 and previous config saved to /var/cache/conftool/dbconfig/20200609-114615-marostegui.json
  • 11:44 kartik@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
  • 11:38 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1141 into s4 T252512', diff saved to https://phabricator.wikimedia.org/P11430 and previous config saved to /var/cache/conftool/dbconfig/20200609-113818-marostegui.json
  • 11:37 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1148 into s4 T252512', diff saved to https://phabricator.wikimedia.org/P11429 and previous config saved to /var/cache/conftool/dbconfig/20200609-113702-marostegui.json
  • 11:30 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1141 into s4 T252512', diff saved to https://phabricator.wikimedia.org/P11428 and previous config saved to /var/cache/conftool/dbconfig/20200609-113056-marostegui.json
  • 11:27 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1148 into s4 T252512', diff saved to https://phabricator.wikimedia.org/P11427 and previous config saved to /var/cache/conftool/dbconfig/20200609-112701-marostegui.json
  • 11:15 ladsgroup@deploy1001: Synchronized langlist: Add be-tarask to langlist (T111853) (duration: 00m 57s)
  • 11:14 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1148 into s4 T252512', diff saved to https://phabricator.wikimedia.org/P11426 and previous config saved to /var/cache/conftool/dbconfig/20200609-111443-marostegui.json
  • 10:49 elukey: update pcc facts
  • 10:48 moritzm: imported tqdm 4.23.4-1+wmf1 to buster-wikimedia/component/spicerack
  • 10:35 volans: installed spicerack 0.0.38 on cumin[12]001
  • 10:32 marostegui@cumin1001: dbctl commit (dc=all): 'Add db1141 depooled to s4 T252512', diff saved to https://phabricator.wikimedia.org/P11425 and previous config saved to /var/cache/conftool/dbconfig/20200609-103252-marostegui.json
  • 10:27 volans: uploaded spicerack_0.0.38-1_amd64.deb to apt.wikimedia.org stretch-wikimedia
  • 10:14 jayme: restarting pybal on lvs1015 and lvs2009 for T254581
  • 10:12 XioNoX: "Re-order some BGP transit neighbors terms"
  • 10:07 marostegui: Deploy schema change on s7 T206103
  • 10:00 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:00 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:00 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 10:00 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:57 jayme: restarting pybal on lvs1016 and lvs2010 for T254581
  • 09:57 akosiaris: correction: depool and set as inactive thumbor200{1,2} for T251570
  • 09:57 akosiaris: depool and set as inactive thumber200{1,2} for T251750
  • 09:56 vgutierrez: disable parent proxies on ats-tls
  • 09:55 akosiaris@cumin1001: conftool action : set/pooled=inactive; selector: name=thumbor2001.codfw.wmnet
  • 09:55 akosiaris@cumin1001: conftool action : set/pooled=inactive; selector: name=thumbor2002.codfw.wmnet
  • 09:41 marostegui: Compress InnoDB on db2072 T254462
  • 09:34 marostegui: Stop MySQL on db1148 to clone db1141 - T252512
  • 09:30 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:29 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1148 to clone db1141 - T252512', diff saved to https://phabricator.wikimedia.org/P11423 and previous config saved to /var/cache/conftool/dbconfig/20200609-092915-marostegui.json
  • 09:28 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:01 moritzm: rolling restart of cassandra on maps* to pick up Java security updates
  • 08:39 moritzm: upgrading snapshot servers to PHP 7.2.31
  • 08:28 moritzm: upgrading deployment servers to PHP 7.2.31
  • 08:01 marostegui: stop m1 on db1117 to clone db1097 (this will trigger an haproxy irc alert) - T254556
  • 07:36 marostegui@cumin1001: dbctl commit (dc=all): 'Remove db1097 from config', diff saved to https://phabricator.wikimedia.org/P11421 and previous config saved to /var/cache/conftool/dbconfig/20200609-073635-marostegui.json
  • 07:36 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 07:33 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 07:30 moritzm: upgrading mw1390-mw1413 to PHP 7.2.31
  • 07:11 ema: deployment-cache-text06: stop vhtcpd, start purged T254844
  • 07:09 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1097:3314, db1097:3315 T253217', diff saved to https://phabricator.wikimedia.org/P11420 and previous config saved to /var/cache/conftool/dbconfig/20200609-070917-marostegui.json
  • 06:53 marostegui: Stop MySQL on db2113 for maintenance - T251570
  • 06:51 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2113 for on-site maintenance T251570', diff saved to https://phabricator.wikimedia.org/P11419 and previous config saved to /var/cache/conftool/dbconfig/20200609-065125-marostegui.json
  • 06:48 marostegui@cumin1001: dbctl commit (dc=all): 'Fully pool db1091 into s1 T253217', diff saved to https://phabricator.wikimedia.org/P11418 and previous config saved to /var/cache/conftool/dbconfig/20200609-064829-marostegui.json
  • 06:40 marostegui: Deploy schema change on s2 T206103
  • 06:33 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly pool db1091 into s1 T253217', diff saved to https://phabricator.wikimedia.org/P11417 and previous config saved to /var/cache/conftool/dbconfig/20200609-063344-marostegui.json
  • 06:19 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly pool db1091 into s1 T253217', diff saved to https://phabricator.wikimedia.org/P11416 and previous config saved to /var/cache/conftool/dbconfig/20200609-061916-marostegui.json
  • 05:51 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly pool db1091 into s1 T253217', diff saved to https://phabricator.wikimedia.org/P11415 and previous config saved to /var/cache/conftool/dbconfig/20200609-055128-marostegui.json
  • 05:32 marostegui: Switch dbproxy1018 from "master" service to "replicas" - T249188
  • 01:02 eileen: civicrm revision changed from 4a19db672f to 80a0d22350, config revision is 386b9bc457
  • 00:39 ejegg: updated payments-wiki from c1d14a5db7 to aceddff8b5
  • 00:30 shdubsh: restart elasticsearch on logstash1010
  • 00:24 eileen: civicrm revision changed from be4c5a4951 to 4a19db672f, config revision is 386b9bc457

2020-06-08

  • 23:49 krinkle@deploy1001: Synchronized wmf-config/logging.php: If99192 (duration: 00m 57s)
  • 23:35 krinkle@deploy1001: Synchronized wmf-config/logging.php: I8c22a1a8fc402 (duration: 00m 58s)
  • 23:32 foks: removing one file for legal compliance
  • 23:25 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 23:23 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 23:02 ryankemper@cumin2001: END (FAIL) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=99)
  • 22:58 ryankemper@cumin2001: START - Cookbook sre.elasticsearch.rolling-upgrade
  • 22:53 ryankemper@cumin2001: END (FAIL) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=99)
  • 22:53 shdubsh: update mtail to 3.0.0~rc35 on mw and wtp hosts codfw
  • 22:49 eileen: civicrm revision changed from 11b0e7c7e5 to be4c5a4951, config revision is 386b9bc457
  • 22:49 ryankemper@cumin2001: START - Cookbook sre.elasticsearch.rolling-upgrade
  • 20:52 Amir1: applying the sql alter table on ipblocks on labswiki (T251188)
  • 20:27 RoanKattouw: Running initUserPreference.php -s growthexperiments-homepage-enable -t growthexperiments-help-panel-tog-help-panel on wikis that have GrowthExperiments installed (T240920)
  • 18:56 Urbanecm: Morning SWATconfig/backport window done
  • 18:56 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 1630a10: Set wgProofreadPagePageJoiner to __PAGEJOIN__ for zhwikisource (T205826) (duration: 00m 58s)
  • 18:55 urbanecm@deploy1001: sync-file aborted: SWAT: 1630a10: Set wgProofreadPagePageJoiner to __PAGEJOIN__ for zhwikisource (duration: 00m 00s)
  • 18:51 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 0e85203: Enable subpages in Page namespace on napwikisource (T252755) (duration: 00m 58s)
  • 18:44 cmjohnson@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 18:42 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 18:28 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: End GrowthExperiments homepage A/B test (T254413) (duration: 00m 57s)
  • 18:23 catrope@deploy1001: Synchronized wmf-config/CommonSettings.php: Disable HTCP purges for testwiki (T250781) (part 2) (duration: 00m 56s)
  • 18:20 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Disable HTCP purges for testwiki (T250781) (part 1) (duration: 00m 59s)
  • 17:50 elukey: restart prometheus burrow exporter for kafka main on kafkamon1001 - T254498
  • 17:43 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.35/resources/src/mediawiki.misc-authed-curate/rollback.js: Fix: Diff pages show rollback confirmation prompt if there is the "Mark as patrolled" link (T254538) (duration: 00m 59s)
  • 17:14 elukey@cumin1001: END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0)
  • 16:55 elukey@cumin1001: START - Cookbook sre.kafka.roll-restart-mirror-maker
  • 16:54 elukey@cumin1001: END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0)
  • 16:44 liw: testing upcoming Scap release on beta
  • 15:29 hnowlan: Migrated all cpjobqueue jobs from scb to Kubernetes
  • 15:29 hnowlan@deploy1001: Finished deploy [cpjobqueue/deploy@07d8c32]: Disabling jobs migrated to k8s (duration: 04m 34s)
  • 15:28 jynus@cumin2001: dbctl commit (dc=all): 'depool db2075 for mw maintenance T254139', diff saved to https://phabricator.wikimedia.org/P11411 and previous config saved to /var/cache/conftool/dbconfig/20200608-152811-jynus.json
  • 15:24 hnowlan@deploy1001: Started deploy [cpjobqueue/deploy@07d8c32]: Disabling jobs migrated to k8s
  • 15:12 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.35/extensions/Wikibase/client/includes/Store/Sql/DirectSqlStore.php: Wrap WAN-cached PropertyInfoLookup with an APCu cache, Part III out of III (T254536) (duration: 00m 57s)
  • 15:10 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.35/extensions/Wikibase/repo/includes/Store/Sql/SqlStore.php: Wrap WAN-cached PropertyInfoLookup with an APCu cache, Part II out of III (T254536) (duration: 00m 57s)
  • 15:09 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.35/extensions/Wikibase/lib/includes/Store/CachingPropertyInfoLookup.php: Wrap WAN-cached PropertyInfoLookup with an APCu cache, Part I out of III (T254536) (duration: 00m 59s)
  • 15:05 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 14:53 cdanis: ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕚☕ sudo cumin A:mw-canary 'enable-puppet "cdanis deploying I25ab44c1 T252605"'
  • 14:52 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 14:48 papaul: powering down ms-be2016 for BBU replacement
  • 14:47 cdanis: ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕚☕ sudo cumin A:mw-canary 'disable-puppet "cdanis deploying I25ab44c1 T252605"'
  • 14:41 moritzm: upgrading mw API servers in codfw to PHP 7.2.31
  • 14:00 jbond42: updating puppet-merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/602738/4
  • 13:58 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:58 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 13:50 urbanecm@deploy1001: Synchronized private/PrivateSettings.php: Update mitigations for T250887 (duration: 00m 57s)
  • 13:41 elukey@cumin1001: START - Cookbook sre.kafka.roll-restart-brokers
  • 12:23 XioNoX: repool codfw - T243080
  • 12:18 XioNoX: rollback cr2-codfw vrrp/ospf/bgp changes - T243080
  • 12:18 marostegui: Compress InnoDB on db2094:3311 T254462
  • 12:09 XioNoX: cr2-codfw> request chassis routing-engine master switch - T243080
  • 12:05 XioNoX: reboot cr2-codfw:re0 (backup) - T243080
  • 11:53 XioNoX: cr2-codfw> request chassis routing-engine master switch - T243080
  • 11:53 moritzm: restarting dnsdist on malmok
  • 11:53 marostegui: Deploy schema change on s3 - T251188
  • 11:49 XioNoX: reboot cr2-codfw:re1 (backup) - T243080
  • 11:45 moritzm: restarting slapd on ldap-corp* for GnuTLS security update
  • 11:43 moritzm: rolling restart of Apache on Kibana/7 host to pick up GnuTLS security update
  • 11:41 XioNoX: de-pref cr2-codfw OSPF - T243080
  • 11:39 XioNoX: deactivate cr2-codfw transit/peering - T243080
  • 11:38 XioNoX: fail vrrp master from cr2 to cr1 - T243080
  • 11:32 XioNoX: cr1-codfw set OSPF metrics back to normal - T243080
  • 11:30 XioNoX: cr1-codfw re-enable transit/peering - T243080
  • 11:29 XioNoX: cr1-codfw add graceful-restart - T243080
  • 11:28 XioNoX: cr1-codfw add graceful-switchover - T243080
  • 11:18 Lucas_WMDE: EU SWAT done
  • 11:16 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Remove Wikibase idBlacklist setting (T254686), part 2 (duration: 00m 56s)
  • 11:15 XioNoX: cr1-codfw> request chassis routing-engine master switch - T243080
  • 11:15 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/Wikibase.php: SWAT: Remove Wikibase idBlacklist setting (T254686), part 1 (duration: 00m 56s)
  • 11:11 XioNoX: reboot cr1-codfw:re0 (backup) - T243080
  • 11:09 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable GrowthExperiments guidance everywhere behind feature flag (T253794) (duration: 00m 57s)
  • 11:05 marostegui: Install events on es1 T254689
  • 11:05 XioNoX: install Junos on cr1-codfw:re0 (backup) - T243080
  • 10:56 XioNoX: do cr1-codfw RE mastership switch - T243080
  • 10:53 XioNoX: reboot cr1-codfw:re1 (backup) - T243080
  • 10:46 XioNoX: install Junos on cr1-codfw:re1 (backup) - T243080
  • 10:43 XioNoX: deactivate cr1-codfw transit/peering - T243080
  • 10:41 XioNoX: bump all cr1-codfw OSPF metrics - T243080
  • 10:41 jdrewniak@deploy1001: Synchronized portals: Wikimedia Portals Update: Bumping portals to master (603408) (duration: 00m 57s)
  • 10:40 jdrewniak@deploy1001: Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: Bumping portals to master (603408) (duration: 01m 09s)
  • 10:39 XioNoX: depool codfw - T243080
  • 09:46 moritzm: installing gnutls28 security updates on buster (older releases not affected)
  • 09:32 qchris: Turning on puppet on gerrit1002 again to avoid starting to lag too far behind
  • 08:17 XioNoX: push T250136 to eqsin - T250136
  • 08:09 XioNoX: push T250136 to eqiad - T250136
  • 08:07 moritzm: upgrading mw1349-mw1383 to PHP 7.2.31
  • 08:07 mutante: stat1006 moved broken jupyter-dedcode-singleuser.service out of /run/systemd/transient. systemctl reset-failed
  • 08:02 XioNoX: push T250136 to codfw - T250136
  • 07:58 XioNoX: push T250136 to eqord/eqdfw - T250136
  • 07:58 mutante: stat1006 bash[40607]: /bin/bash: line 0: exec: jupyterhub-singleuser: not found
  • 07:57 mutante: ran puppet on all stat* hosts for an access request (dcipoletti was added) - stat1006 systemd state broke right after, jupyter-dedcode-singleuser.service failed
  • 07:46 XioNoX: push T250136 to esams/knams - T250136
  • 07:42 XioNoX: cr4-ulsfo protocols bgp group Transit4 family inet any -> unicast - T250136
  • 07:39 XioNoX: cr3-ulsfo protocols bgp group Transit4 family inet any -> unicast - T250136
  • 07:37 moritzm: installing nodejs security updates
  • 07:05 marostegui: Stop MySQL on labsdb1012 to clone labsdb1011 T249188
  • 05:22 marostegui: Upgrade db1077 to 10.4.13 to test events memory leak
  • 04:45 _joe_: de-firewalling mc1029
  • 04:27 _joe_: firewalling off memcached on mc1029

2020-06-05

  • 16:45 elukey@deploy1001: Finished deploy [analytics/turnilo/deploy@f7e4f78]: Upgrade to 1.24.0 (duration: 00m 11s)
  • 16:45 elukey@deploy1001: Started deploy [analytics/turnilo/deploy@f7e4f78]: Upgrade to 1.24.0
  • 16:29 bd808: Testing stashbot following hard restart of service. It was having LDAP connection failure problems.
  • 16:00 AndyRussG: Turned off Fundraising job recurring_smashpig_charge
  • 15:54 cdanis: enabling & rerunning puppet on netflow* T254574
  • 15:39 cdanis: disabling puppet on netflow* and trying I6598d8f8 on netflow3001 first T254574
  • 15:39 cdanis: disabling puppet on netflow* and trying I6598d8f8 on netflow3001 first
  • 13:33 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'mathoid' for release 'staging' .
  • 13:19 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
  • 13:19 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:19 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'citoid' for release 'staging' .
  • 13:18 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 13:15 elukey@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:55 ladsgroup@deploy1001: Synchronized wmf-config/interwiki.php: Hotfix for be-tarask interwiki link being broken (T111853) (duration: 01m 00s)
  • 12:41 mutante: rebooting gerrit1002 to add more vCPUs, after [ganeti1009:~] $ sudo gnt-instance modify -B vcpus=8 gerrit1002.wikimedia.org T239151
  • 12:20 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'zotero' for release 'staging' .
  • 12:19 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
  • 12:19 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
  • 12:19 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'citoid' for release 'staging' .
  • 12:19 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 12:17 akosiaris: update blubberoid changeprop changeprop-jobqueue citoid cxserver wikifeeds zotero in staging to latest charts
  • 12:17 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 12:17 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
  • 12:17 akosiaris: fix typo in ganeti2016 /etc/network/interfaces and reboot
  • 11:28 akosiaris: master-failover from ganeti2001 to ganeti2019 for ganeti01.svc.codfw.wmnet
  • 11:25 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 11:25 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 11:25 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 11:14 mutante: running puppet on all ganeti nodes
  • 11:05 ladsgroup@deploy1001: Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 14s)
  • 10:32 elukey@cumin1001: END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0)
  • 10:11 elukey@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 10:02 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 09:49 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'mathoid' for release 'staging' .
  • 09:46 elukey@cumin1001: START - Cookbook sre.ganeti.makevm
  • 09:25 elukey@cumin1001: START - Cookbook sre.cassandra.roll-restart
  • 09:03 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:00 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:44 akosiaris: reimage ganeti2016 for stretch
  • 08:42 akosiaris: migrate mx2001.wikimedia.org to new ganeti nodes
  • 08:40 akosiaris: migrate acrab to new ganeti nodes
  • 08:38 akosiaris: failover master IP from ganeti1003 to ganeti1009
  • 08:37 akosiaris: empty ganeti100{1,2,3,4}. Move all VMs to new ganeti nodes
  • 08:28 akosiaris: migrate seaborgium.wikimedia.org to new ganeti nodes
  • 08:27 akosiaris: migrate etherpad1002 to new ganeti nodes
  • 08:11 marostegui: Upgrade db2075 to 10.1.45
  • 07:52 vgutierrez: rolling restart of ats-tls - T249335
  • 07:20 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 06:20 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 06:17 elukey@cumin1001: START - Cookbook sre.hosts.downtime

2020-06-04

  • 23:45 catrope@deploy1001: Synchronized wmf-config/mc.php: Set coalesceKeys=non-global for WANCache on enwiki (duration: 00m 59s)
  • 23:29 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable Minerva site notices on Wikivoyage wikis (T254391) (duration: 00m 58s)
  • 23:19 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Set guwiki timezone to Asia/Kolkata (T253827) (duration: 00m 57s)
  • 23:17 catrope@deploy1001: Synchronized static/images/: Change logo for zhwiki (T254467) (duration: 01m 00s)
  • 22:56 ryankemper: re-enabled puppet on `cloudelastic1006`. All `cloudelastic` instances now have puppet enabled and are in sync
  • 20:56 ryankemper: enabled puppet on `cloudelastic1005` in order to kick off a puppet run and verify that this new node joins the ES cluster properly
  • 20:39 ryankemper: disabled puppet on `cloudelastic100[5,6]` which are two racked nodes that we are now bringing into service. Will re-enable after successful puppet-merge / elasticsearch cluster join
  • 20:38 ryankemper: disabled puppet on `cloudelastic100[5,6]` which are two racked nodes that we are now bringing into service. Will re-enable after successful puppet-merge / elasticsearch cluster join
  • 19:04 jforrester@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.35
  • 15:12 elukey@puppetmaster1001: conftool action : set/pooled=yes; selector: name=druid1004.eqiad.wmnet
  • 15:11 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:10 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 15:08 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:07 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:06 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:37 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 14:36 moritzm: installing libexif security updates on jessie
  • 14:08 moritzm: installing clamav security updates on mendelevium (ticket.wikimedia.org)
  • 14:00 qchris: Stopping puppet on gerrit1002 (gerrit-test) to run tests for Gerrit upgrade
  • 13:41 moritzm: bounced ferm on ms-be1023
  • 13:35 moritzm: installing exim security updates on jessie (stretch/buster already done)
  • 12:54 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: c06e720: Revert "wgNamespaceRobotPolicies: thwiki: Add 100 NS to noindex" (T253574) (duration: 01m 06s)
  • 12:18 jayme@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 12:14 jayme@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'citoid' for release 'production' .
  • 12:02 moritzm: upgrading mw1276 to PHP 7.2.31
  • 11:59 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1107', diff saved to https://phabricator.wikimedia.org/P11396 and previous config saved to /var/cache/conftool/dbconfig/20200604-115933-marostegui.json
  • 11:53 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: ec07467: wgNamespaceRobotPolicies: thwiki: Add 100 NS to noindex (T253574) (duration: 01m 15s)
  • 11:49 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:49 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 338cb90: 1ade16f: Change $wgNamespaceRobotPolicies on Thai wikis (T253578; T253577; T253576; T253575; T253574) (duration: 01m 07s)
  • 11:46 elukey@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:41 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1107', diff saved to https://phabricator.wikimedia.org/P11395 and previous config saved to /var/cache/conftool/dbconfig/20200604-114149-marostegui.json
  • 11:29 marostegui: Compress InnoDB on db1091 before pooling it as new slave on s1 - T254462
  • 11:21 hashar@deploy1001: Synchronized wmf-config/InitialiseSettings.php: [metawiki] Add `centralauth-rename` to WMF OIT staff - T254372 (duration: 01m 08s)
  • 11:14 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:12 elukey@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:04 jayme@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
  • 10:59 jayme@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 10:53 jayme@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'citoid' for release 'production' .
  • 10:53 elukey@puppetmaster1001: conftool action : set/pooled=no; selector: name=druid1004.eqiad.wmnet
  • 10:46 marostegui: Deploy schema change on s3 (only testwiki) eqiad - T238966
  • 10:42 marostegui: Deploy schema change on s3 (only testwiki) codfw - T238966
  • 10:41 jbond42: deployed new version of puppet-merge revert is https://gerrit.wikimedia.org/r/c/operations/puppet/+/602329
  • 09:57 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
  • 09:56 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 09:56 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 09:55 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:55 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 09:55 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 09:54 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
  • 09:53 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:51 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:51 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:50 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 09:50 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 09:50 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 09:50 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 09:50 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:50 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:50 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:50 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:50 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:50 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:50 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:50 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:48 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 09:48 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 09:48 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:48 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:48 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:48 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:46 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'citoid' for release 'staging' .
  • 09:42 jmm@cumin2001: END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99)
  • 09:42 jmm@cumin2001: START - Cookbook sre.cassandra.roll-restart
  • 09:41 jmm@cumin2001: END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99)
  • 09:41 jmm@cumin2001: START - Cookbook sre.cassandra.roll-restart
  • 09:26 moritzm: rolling restart of cassandra on maps* to pick up Java security updates
  • 09:09 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 09:08 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 09:08 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:05 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 09:05 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 09:05 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 09:05 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 09:05 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 09:05 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 09:05 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 09:04 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:04 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 09:03 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:03 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:03 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:03 moritzm: deploying Java security updates on Elasticsearch nodes
  • 09:03 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:00 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 09:00 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:00 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:00 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:00 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:00 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:00 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:00 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:59 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:59 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:59 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:58 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:50 marostegui: Repool labsdb1009 after running maintain-views T252219
  • 08:42 moritzm: restarting archiva to pick up Java security updates
  • 08:15 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1107 to clone db1091 on s1 T253217', diff saved to https://phabricator.wikimedia.org/P11392 and previous config saved to /var/cache/conftool/dbconfig/20200604-081545-marostegui.json
  • 08:14 marostegui: Run sudo /usr/local/sbin/maintain-views --all-databases --replace-all on labsdb1009 - T252219
  • 07:49 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 07:45 marostegui: Depool labsdb1009 - T252219
  • 07:45 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 07:33 oblivian@puppetmaster1001: conftool action : set/weight=10; selector: dc=eqiad,cluster=labweb,service=labweb-ssl
  • 07:32 oblivian@puppetmaster1001: conftool action : set/pooled=yes:weight=10; selector: dc=eqiad,cluster=cloudceph,service=cloudceph
  • 06:52 mutante: mwmaint1002 started mediawiki_job_cirrus_build_completion_indices_eqiad.service
  • 06:06 oblivian@puppetmaster1001: conftool action : set/weight=10; selector: name=logstash200.*
  • 06:05 oblivian@puppetmaster1001: conftool action : set/weight=10; selector: name=logstash100.*
  • 06:04 oblivian@puppetmaster1001: conftool action : set/weight=10; selector: cluster=eventschemas,service=eventschemas
  • 06:02 oblivian@puppetmaster1001: conftool action : set/weight=10; selector: dc=codfw,cluster=elasticsearch,service=elasticsearch.*
  • 06:01 oblivian@puppetmaster1001: conftool action : set/weight=10; selector: dc=codfw,cluster=elasticsearch,service=elasticsearch
  • 06:00 _joe_: fixing weights of cp2040 T245594
  • 05:31 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 05:28 elukey@cumin1001: START - Cookbook sre.hosts.downtime
  • 00:36 reedy@deploy1001: Synchronized php-1.35.0-wmf.35/includes/specials/SpecialUserrights.php: T254417 T251534 (duration: 01m 06s)
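
The many `sre.hosts.downtime` START/END entries above are easier to read as a tally of outcomes per cookbook. A minimal Python sketch, assuming the `END (STATUS) - Cookbook NAME (exit_code=N)` shape seen in this log. The function name `tally_cookbook_results` and the pattern are illustrative and not part of the cookbook tooling.

```python
import re
from collections import Counter

# Pattern inferred from the "END (...) - Cookbook ..." entries in this log.
END_RE = re.compile(
    r"END \((?P<status>\w+)\) - Cookbook (?P<cookbook>[\w.-]+) "
    r"\(exit_code=(?P<code>\d+)\)"
)


def tally_cookbook_results(lines):
    """Count (cookbook, status) pairs across END entries; other lines are ignored."""
    counts = Counter()
    for line in lines:
        match = END_RE.search(line)
        if match:
            counts[(match["cookbook"], match["status"])] += 1
    return counts


sample = [
    "09:56 akosiaris@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)",
    "09:55 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)",
    "09:51 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime",
    "09:42 jmm@cumin2001: END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99)",
]
results = tally_cookbook_results(sample)
for (cookbook, status), count in sorted(results.items()):
    print(cookbook, status, count)
```

START entries are skipped on purpose: the log does not tie an END line to a specific START line, so only END lines carry a usable outcome.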

2020-06-03

  • 23:08 reedy@deploy1001: Synchronized wmf-config/CommonSettings-labs.php: T249834 (duration: 01m 06s)
  • 23:06 reedy@deploy1001: Synchronized wmf-config/InitialiseSettings-labs.php: T249834 (duration: 01m 06s)
  • 22:22 ryankemper@cumin2001: END (PASS) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=0)
  • 21:54 jforrester@deploy1001: rebuilt and synchronized wikiversions files: Re-rolling group1 to 1.35.0-wmf.35 for T253023
  • 21:49 jforrester@deploy1001: Synchronized php-1.35.0-wmf.35/extensions/EventStreamConfig/includes/ApiStreamConfigs.php: T254390 ApiStreamConfigs: If the 'constraints' parameter is unset, don't explode (duration: 01m 06s)
  • 21:43 cstone: civicrm revision changed from 63508b01b9 to 11b0e7c7e5
  • 21:16 ryankemper@cumin2001: START - Cookbook sre.elasticsearch.rolling-upgrade
  • 21:15 ryankemper: The previously run `_cluster/reroute?retry_failed=true` command worked as intended; the two shards in question have recovered and we're back to green cluster status. We're now in a known state and ready to proceed with the eqiad rolling upgrade
  • 21:13 ryankemper: Ran `curl -X POST "https://localhost:9243/_cluster/reroute?pretty&retry_failed=true&explain=true" -H 'Content-Type: application/json' -d '{}' --insecure` via the ssh tunnel `ssh bast4002.wikimedia.org -L 9243:search.svc.eqiad.wmnet:9243 -L 9443:search.svc.eqiad.wmnet:9443 -L 9643:search.svc.eqiad.wmnet:9643`, two unassigned shards are now initializing
  • 21:05 ryankemper: Elasticsearch Eqiad was in yellow cluster status before starting the above cookbook run (therefore the run was a no-op until I Ctrl+C'd), going to try unsticking the two unassigned shards via `/_cluster/reroute?retry_failed=true`
  • 21:03 ryankemper@cumin2001: END (ERROR) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=97)
  • 20:58 ryankemper@cumin2001: START - Cookbook sre.elasticsearch.rolling-upgrade
  • 20:52 ryankemper@cumin2001: END (PASS) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=0)
  • 20:49 eileen: civicrm revision changed from eb156dffa4 to 63508b01b9, config revision is 95dcdb0a8a
  • 20:47 ryankemper@cumin2001: START - Cookbook sre.elasticsearch.rolling-upgrade
  • 20:19 gehel: elasticsearch cluster restart stopped
  • 20:18 ryankemper@cumin2001: END (ERROR) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=97)
  • 19:35 ppchelko@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
  • 19:35 ppchelko@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 19:33 ppchelko@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 19:32 ppchelko@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
  • 19:30 ryankemper@cumin2001: START - Cookbook sre.elasticsearch.rolling-upgrade
  • 19:29 ppchelko@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 19:29 ppchelko@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
  • 19:20 jforrester@deploy1001: rebuilt and synchronized wikiversions files: Revert group1 wikis to wmf.34 T253023
  • 19:16 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 19:15 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 19:14 jforrester@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.35 (duration: 01m 05s)
  • 19:13 jforrester@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.35
  • 19:05 jforrester@deploy1001: Synchronized dblists/mobilemainpagelegacy.dblist: T32405 Stop special casing the main page on another 47 projects (duration: 01m 08s)
  • 19:01 ppchelko@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: gerrit 601843 Enable talk pages on Swedish Minerva (duration: 01m 08s)
  • 18:59 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 18:56 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 18:55 ppchelko@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: gerrit 601842 - Disable growth survey (duration: 01m 06s)
  • 18:49 ppchelko@deploy1001: Synchronized wmf-config/CommonSettings.php: SWAT: gerrit 596277 Use AddFooterLink hook for code of conduct and contact links (duration: 01m 05s)
  • 18:34 ppchelko@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: gerrit 599150 - enable kafka purges for group0 (duration: 01m 06s)
  • 18:19 ppchelko@deploy1001: Synchronized wmf-config/CommonSettings.php: SWAT: gerrit 570396 - enable kask-session everywhere. CS.php (duration: 01m 05s)
  • 18:14 ppchelko@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: gerrit 570396 - enable kask-session everywhere. IS.php (duration: 01m 06s)
  • 17:15 ejegg: updated payments-wiki from e46114d8b1 to c1d14a5db7
  • 17:08 elukey: ganeti: gnt-instance reboot an-launcher1001 to get new memory settings - T254125
  • 15:21 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 15:19 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 15:12 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 14:50 kormat@deploy1001: Synchronized wmf-config/db-eqiad.php: Repool pc1009 in pc3 after reimaging T252182 (duration: 01m 06s)
  • 14:47 moritzm: updated grafana on cloudmetrics* to 6.7.4
  • 14:26 kormat: stopping replication on pc1010
  • 14:20 kormat@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:17 kormat@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:16 gehel: cleaning commonsrdf-dumps cron entry manually on snapshot1008
  • 14:00 hashar: Restarted CI Jenkins for plugin update
  • 13:59 kormat@deploy1001: Synchronized wmf-config/db-eqiad.php: Replace pc1009 with pc1010 reimaging T252182 (duration: 01m 06s)
  • 13:47 kormat: reimaging *pc1009 (promise) to buster T252182
  • 13:44 kormat: reimaging pc1007 to buster, wish me luck T252182
  • 13:20 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 13:13 kormat@deploy1001: Synchronized wmf-config/db-codfw.php: Put pc2009 back into pc3 after reimaging T252182 (duration: 01m 05s)
  • 13:01 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 12:01 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1080', diff saved to https://phabricator.wikimedia.org/P11385 and previous config saved to /var/cache/conftool/dbconfig/20200603-120136-marostegui.json
  • 11:57 moritzm: updating linux-libc-dev on stretch and buster hosts
  • 11:56 XioNoX: configure management-instance on cr1/2-eqiad - T247073
  • 11:51 XioNoX: configure management-instance on cr2-codfw - T247073
  • 11:44 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2124 after MCR schema change', diff saved to https://phabricator.wikimedia.org/P11384 and previous config saved to /var/cache/conftool/dbconfig/20200603-114409-marostegui.json
  • 11:43 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1080', diff saved to https://phabricator.wikimedia.org/P11383 and previous config saved to /var/cache/conftool/dbconfig/20200603-114351-marostegui.json
  • 11:31 Lucas_WMDE: EU SWAT done
  • 11:30 Lucas_WMDE: lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes.php eswiki --fix | tee T254077.fix
  • 11:29 Lucas_WMDE: lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes.php eswiki | tee T254077.dry-run
  • 11:27 moritzm: installing rubygems-integration updates for Buster
  • 11:26 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: [eswiki] Normalize talk namespaces for Anexo, Portal and Wikiproyecto (T254077) (duration: 01m 03s)
  • 11:25 moritzm: install brltty updates on Buster
  • 11:23 XioNoX: configure management-instance on cr1-codfw - T247073
  • 11:19 XioNoX: configure management-instance on cr1-eqsin - T247073
  • 11:15 moritzm: installing python-oslo.utils security updates
  • 11:12 XioNoX: remove unused logical-systems from all MX204 routers - T247073
  • 11:10 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2124', diff saved to https://phabricator.wikimedia.org/P11382 and previous config saved to /var/cache/conftool/dbconfig/20200603-111055-marostegui.json
  • 11:08 marostegui: Add rev_id to revision table on db2124 - T238966
  • 11:05 moritzm: installing pango updates for buster
  • 10:42 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1091 - will be reimaged and moved to s1 T252512', diff saved to https://phabricator.wikimedia.org/P11381 and previous config saved to /var/cache/conftool/dbconfig/20200603-104251-marostegui.json
  • 10:14 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1080', diff saved to https://phabricator.wikimedia.org/P11380 and previous config saved to /var/cache/conftool/dbconfig/20200603-101426-marostegui.json
  • 09:38 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1080', diff saved to https://phabricator.wikimedia.org/P11378 and previous config saved to /var/cache/conftool/dbconfig/20200603-093810-marostegui.json
  • 09:28 kormat@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:25 kormat@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:21 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:18 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:10 moritzm: upgrading mw1262-1265 to PHP 7.2.31
  • 09:08 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:05 elukey@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:04 marostegui: Reimage db1080
  • 09:01 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1080 for reimage', diff saved to https://phabricator.wikimedia.org/P11376 and previous config saved to /var/cache/conftool/dbconfig/20200603-090143-marostegui.json
  • 08:42 kormat@deploy1001: Synchronized wmf-config/db-codfw.php: Replace pc2009 with pc2010 while reimaging (duration: 01m 16s)
  • 08:19 moritzm: upgrading mw1261 to PHP 7.2.31
  • 08:17 XioNoX: re-add ae2 physical interfaces to external group - T253970
  • 08:09 moritzm: upgrading remaining mwdebug* servers to PHP 7.2.31
  • 08:08 kormat: reimaging pc2009 to buster T252182
  • 08:08 XioNoX: remove ae2 physical interfaces from external group - T253970
  • 07:52 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 07:51 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 07:48 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 07:45 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 07:44 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw218[0-6].codfw.wmnet
  • 07:36 mutante: depooling mw2180 - mw2186
  • 07:36 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw218[0-6].codfw.wmnet
  • 07:35 moritzm: imported PHP 7.2.31 to apt.wikimedia.org/component/php72
  • 07:33 ema: cp: upgrade purged to 0.15
  • 07:28 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2071 after cloning it from db2130 to restore all the schema changes applied', diff saved to https://phabricator.wikimedia.org/P11375 and previous config saved to /var/cache/conftool/dbconfig/20200603-072841-marostegui.json
  • 07:15 XioNoX: repool esams - T254021
  • 07:09 XioNoX: re-activate peering/transit BGP on cr2-esams - T254021
  • 07:00 XioNoX: re0.cr2-esams> request system reboot both-routing-engines - T254021
  • 06:56 XioNoX: deactivate peering/transit BGP cr2-esams - T244497
  • 06:54 XioNoX: failover vrrp to cr3-esams - T244497
  • 06:37 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1138 T253808', diff saved to https://phabricator.wikimedia.org/P11374 and previous config saved to /var/cache/conftool/dbconfig/20200603-063752-marostegui.json
  • 06:18 XioNoX: cr3-esams> request chassis routing-engine master switch - T244497
  • 06:11 XioNoX: cr3-esams> request vmhost reboot re1 (backup re) - T244497
  • 06:08 XioNoX: re-activate transit BGP to cr3-knams - T254021
  • 06:01 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1138 T253808', diff saved to https://phabricator.wikimedia.org/P11373 and previous config saved to /var/cache/conftool/dbconfig/20200603-060124-marostegui.json
  • 05:58 XioNoX: reboot cr3-knams - T254021
  • 05:51 XioNoX: deactivate transit BGP on cr3-knams - T254021
  • 05:48 XioNoX: depool esams - T254021
  • 05:41 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2130 to clone db2071', diff saved to https://phabricator.wikimedia.org/P11371 and previous config saved to /var/cache/conftool/dbconfig/20200603-054117-marostegui.json
  • 05:40 marostegui: Stop MySQL on db2130 to clone db2071
  • 05:38 XioNoX: deactivate graceful-switchover on cr3-esams - T254021
  • 05:37 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1138 T253808', diff saved to https://phabricator.wikimedia.org/P11370 and previous config saved to /var/cache/conftool/dbconfig/20200603-053748-marostegui.json
  • 05:35 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 05:32 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 05:14 XioNoX: turn cr1-codfw:fpc0 online - T254110
  • 05:09 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1138 T253808', diff saved to https://phabricator.wikimedia.org/P11369 and previous config saved to /var/cache/conftool/dbconfig/20200603-050911-marostegui.json
  • 01:00 reedy@deploy1001: Synchronized wmf-config/CommonSettings-labs.php: wgLocalVirtualHosts (duration: 01m 06s)
  • 00:59 krinkle@deploy1001: Synchronized wmf-config/mc.php: Ic27b60 (duration: 01m 11s)

2020-06-02

  • 23:58 ejegg: updated fundraising CiviCRM from 657c4b9455 to eb156dffa4
  • 23:55 ejegg: updated payments-wiki from 1942a537ef to e46114d8b1
  • 22:48 cstone: civicrm revision changed from d1cd99166f to 657c4b9455
  • 21:48 reedy@deploy1001: Synchronized wmf-config/interwiki-labs.php: laaaaabs (duration: 01m 05s)
  • 21:23 reedy@deploy1001: Synchronized multiversion/MWMultiVersion.php: beta apiportalwiki T254185 (duration: 01m 06s)
  • 21:21 reedy@deploy1001: Synchronized wmf-config/config/apiportalwiki.yaml: beta apiportalwiki T254185 (duration: 01m 05s)
  • 21:20 reedy@deploy1001: Synchronized wmf-config/InitialiseSettings-labs.php: beta apiportalwiki T254185 (duration: 01m 05s)
  • 21:19 reedy@deploy1001: Synchronized wikiversions-labs.json: beta apiportalwiki T254185 (duration: 01m 05s)
  • 21:17 reedy@deploy1001: Synchronized dblists/all-labs.dblist: beta apiportalwiki T254185 (duration: 01m 06s)
  • 21:12 cdanis: repooled wtp1032 T254258
  • 20:24 reedy@deploy1001: Synchronized composer.lock: Update (duration: 01m 06s)
  • 20:02 jforrester@deploy1001: rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.35
  • 19:59 jforrester@deploy1001: Finished scap: testwikis wikis to 1.35.0-wmf.35 (duration: 93m 52s)
  • 18:25 jforrester@deploy1001: Started scap: testwikis wikis to 1.35.0-wmf.35
  • 18:22 jforrester@deploy1001: Pruned MediaWiki: 1.35.0-wmf.31 (duration: 19m 59s)
  • 18:05 cdanis: fixing g+w permissions of deploy1001 /srv/mediawiki-staging/php-*/.git/objects/*
  • 17:20 James_F: 1.35.0-wmf.35 was branched at 8d70150 for T253023
  • 16:55 cstone: SmashPig revision changed from 44690f761c to b9de3c7aac
  • 15:57 ejegg: updated payments-wiki from d11efeb1cf to 1942a537ef
  • 15:50 cdanis: thumbor1003 and thumbor1004 blipped, no obvious explanation, logs gathered at P11365 P11366 P11367
  • 15:49 XioNoX: push frack fw rules - T254260
  • 15:48 mutante: contint1001 - rm -rf /mnt/docker (T224591)
  • 15:45 mutante: contint1001 - restarting docker after changing data-root path (T224591)
  • 15:37 cdanis@cumin1001: conftool action : set/pooled=no; selector: name=wtp1032.*
  • 15:35 cdanis: power cycling wtp1032 which is bootlooping? https://phabricator.wikimedia.org/P11364
  • 15:31 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 15:24 rzl@cumin1001: conftool action : set/pooled=yes; selector: name=thumbor100[34].*
  • 15:23 XioNoX: repool codfw - T254216
  • 15:19 XioNoX: rollback ospf changes - T254216
  • 15:09 hnowlan@deploy1001: Finished deploy [cpjobqueue/deploy@8a53ff1]: (no justification provided) (duration: 02m 33s)
  • 15:07 XioNoX: reboot cr1-codfw:fpc5 - T254216
  • 15:06 hnowlan@deploy1001: Started deploy [cpjobqueue/deploy@8a53ff1]: (no justification provided)
  • 15:05 hnowlan: shifting all high traffic cpjobqueue rules to k8s
  • 14:57 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 14:57 XioNoX: depref ulsfo-codfw link - T254216
  • 14:51 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 14:50 jynus@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 14:49 jynus@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:49 XioNoX: prefer eqsin-ulsfo tunnel - T254216
  • 14:47 cdanis@cumin1001: conftool action : set/pooled=no; selector: name=thumbor100[34].*
  • 14:38 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 14:31 XioNoX: depool codfw - T254216
  • 14:09 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 13:45 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 13:42 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 13:42 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 13:38 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 13:37 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 13:28 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 13:19 dzahn@cumin1001: END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97)
  • 13:18 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 13:18 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
  • 13:18 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 13:05 cdanis@deploy1001: Synchronized php-1.35.0-wmf.31/includes/specials/pagers/ContribsPager.php: revert contribs limit to 5000 T234450 (duration: 00m 57s)
  • 13:04 cdanis@deploy1001: Synchronized php-1.35.0-wmf.32/includes/specials/pagers/ContribsPager.php: revert contribs limit to 5000 T234450 (duration: 00m 57s)
  • 13:03 cdanis@deploy1001: Synchronized php-1.35.0-wmf.34/includes/specials/pagers/ContribsPager.php: revert contribs limit to 5000 T234450 (duration: 00m 58s)
  • 12:59 marostegui@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 12:59 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:56 cdanis@deploy1001: Synchronized wmf-config/PoolCounterSettings.php: 5debc3223 limit per-user Special:Contributions concurrency to 2 T234450 (duration: 00m 58s)
  • 12:50 kormat@cumin1001: dbctl commit (dc=all): 'Pool db2140 into s4 T252985', diff saved to https://phabricator.wikimedia.org/P11363 and previous config saved to /var/cache/conftool/dbconfig/20200602-125012-kormat.json
  • 12:39 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 12:31 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw217[3-9].codfw.wmnet
  • 12:30 kormat@cumin1001: dbctl commit (dc=all): 'Repool db2110, copy to db2140 complete T252985', diff saved to https://phabricator.wikimedia.org/P11362 and previous config saved to /var/cache/conftool/dbconfig/20200602-123020-kormat.json
  • 12:28 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw217[3-9].codfw.wmnet
  • 11:10 kart_: Finished EU Mid-day SWAT.
  • 11:08 mutante: contint1001 - common issue after reinstalls again - a2dismod mpm_event ; systemctl restart apache2 ; puppet agent -tv (T196968) https://gerrit.wikimedia.org/r/c/operations/puppet/+/451206
  • 11:07 kartik@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 601174|Create URL campaign for African languages for COVID-19 translation project (T253305) (duration: 01m 00s)
  • 11:01 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 10:48 mutante: LDAP - added uid=lulu to group nda (T254121)
  • 10:29 akosiaris: switch over ores1XXX hosts to redis::misc from oresrdb hosts. T254226
  • 10:12 jynus: disable non-global root login to gerrit2001 T254162
  • 10:12 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:11 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1121, db1148 T252512', diff saved to https://phabricator.wikimedia.org/P11361 and previous config saved to /var/cache/conftool/dbconfig/20200602-101150-marostegui.json
  • 10:09 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:09 akosiaris: switch over ores2XXX hosts to redis::misc from oresrdb hosts. T254226
  • 10:02 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1121, db1148 T252512', diff saved to https://phabricator.wikimedia.org/P11360 and previous config saved to /var/cache/conftool/dbconfig/20200602-100246-marostegui.json
  • 09:53 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1121, db1148 T252512', diff saved to https://phabricator.wikimedia.org/P11359 and previous config saved to /var/cache/conftool/dbconfig/20200602-095321-marostegui.json
  • 09:49 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1138', diff saved to https://phabricator.wikimedia.org/P11358 and previous config saved to /var/cache/conftool/dbconfig/20200602-094914-marostegui.json
  • 09:44 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1121, db1148 T252512', diff saved to https://phabricator.wikimedia.org/P11357 and previous config saved to /var/cache/conftool/dbconfig/20200602-094441-marostegui.json
  • 09:38 marostegui@cumin1001: dbctl commit (dc=all): 'Add db1148 to dbctl depooled T252512', diff saved to https://phabricator.wikimedia.org/P11356 and previous config saved to /var/cache/conftool/dbconfig/20200602-093841-marostegui.json
  • 08:59 ema: upload purged 0.15 to buster-wikimedia
  • 08:09 mutante: re-imaging contint1001 with buster
  • 07:43 marostegui: Stop MySQL on db1121
  • 07:40 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1121 to clone db1148', diff saved to https://phabricator.wikimedia.org/P11353 and previous config saved to /var/cache/conftool/dbconfig/20200602-074027-marostegui.json
  • 07:32 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1079 after data check', diff saved to https://phabricator.wikimedia.org/P11351 and previous config saved to /var/cache/conftool/dbconfig/20200602-073245-marostegui.json
  • 07:22 marostegui: Stop slave on db1079 for data check
  • 07:22 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1079 for data check', diff saved to https://phabricator.wikimedia.org/P11350 and previous config saved to /var/cache/conftool/dbconfig/20200602-072214-marostegui.json
  • 07:06 marostegui: Stop MySQL and poweroff on db1138 for on-site maintenance - T253808
  • 05:01 marostegui: Stop mysql on db1141 to save a binary backup - T249188
  • 01:03 krinkle@deploy1001: Synchronized wmf-config/mc.php: I06897bcc92c5 (duration: 00m 59s)

2020-06-01

  • 20:14 shdubsh: downgrade mtail to rc5 in ulsfo -- T254192
  • 20:12 XioNoX: enable IX4/6 on cr4-ulsfo - T237575
  • 19:57 XioNoX: disable IX4/6 on cr4-ulsfo - T237575
  • 19:55 XioNoX: fail vrrp over cr3-ulsfo - T237575
  • 19:44 shdubsh: restart atsmtail in eqsin
  • 18:21 ppchelko@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable kask-transition for all wikis (duration: 01m 00s)
  • 17:59 XioNoX: offline cr1-codfw:fpc0 - T254110
  • 17:47 XioNoX: turn online cr1-codfw:fpc0 - T254110
  • 17:46 shdubsh: update mtail in ulsfo caching hosts. restarting atsmtail and varnishmtail
  • 17:31 mutante: backup1001 - queued job 42 - gerrit backup after renaming of the file set and addition of LFS data (T254155, T254162) it is incremental, the full one already ran
  • 16:49 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: EventLogging - fix searchsatisfaction schema URI - testwiki only - T249261 (duration: 00m 59s)
  • 16:48 otto@deploy1001: sync-file aborted: EventLogging - fix searchsatisfaction schema URI - testwiki only - T249261 (duration: 00m 02s)
  • 16:39 bstorm_: running view updates on db1141 T252219
  • 14:53 elukey: ganeti: increase memory available for an-launcher1001 from 8g to 12g - T254125
  • 14:44 volans: deploying ulsfo mgmt DNS records automatically generated by Netbox ( operations/dns/+/585545/ ) - T233183
  • 12:00 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1142, db1147 T252512', diff saved to https://phabricator.wikimedia.org/P11345 and previous config saved to /var/cache/conftool/dbconfig/20200601-120000-marostegui.json
  • 11:44 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1142, db1147 T252512', diff saved to https://phabricator.wikimedia.org/P11344 and previous config saved to /var/cache/conftool/dbconfig/20200601-114440-marostegui.json
  • 11:30 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1142, db1147 T252512', diff saved to https://phabricator.wikimedia.org/P11343 and previous config saved to /var/cache/conftool/dbconfig/20200601-113032-marostegui.json
  • 10:49 jdrewniak@deploy1001: Synchronized portals: Wikimedia Portals Update: Bumping portals to master (601328) (duration: 00m 59s)
  • 10:48 jdrewniak@deploy1001: Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: Bumping portals to master (601328) (duration: 01m 03s)
  • 09:37 volans@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 09:30 volans@cumin1001: START - Cookbook sre.dns.netbox
  • 09:26 jynus: reenabling puppet on all db/es/pc hosts after deploy of gerrit:599596
  • 09:22 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1142, db1147 T252512', diff saved to https://phabricator.wikimedia.org/P11342 and previous config saved to /var/cache/conftool/dbconfig/20200601-092220-marostegui.json
  • 09:18 marostegui@cumin1001: dbctl commit (dc=all): 'Add db1147 to dbctl, depooled T252512', diff saved to https://phabricator.wikimedia.org/P11341 and previous config saved to /var/cache/conftool/dbconfig/20200601-091809-marostegui.json
  • 09:06 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 09:05 filippo@cumin1001: START - Cookbook sre.hosts.decommission
  • 09:05 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 09:05 XioNoX: offline cr1-codfw:fpc0 - T254110
  • 09:05 filippo@cumin1001: START - Cookbook sre.hosts.decommission
  • 09:04 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 09:03 filippo@cumin1001: START - Cookbook sre.hosts.decommission
  • 08:58 godog: prometheus eqiad lvextend --resizefs --size +100G vg-ssd/prometheus-ops
  • 08:43 mutante: deneb - apt-get remove --purge apt-listchanges (package was in status "rc" causing DPKG alert, should be removed but config was not purged)
  • 08:41 mutante: deneb - apt-get remove python3-debconf (package was in status "ri" causing DPKG icinga alert. ri means it should be removed but is not)
  • 08:33 XioNoX: restart cr1-codfw:fpc0 - T254110
  • 08:22 mutante: mw1331 re-enabled puppet (SAL told me about an experiment a little while ago)
  • 08:19 jynus: disabling puppet on all db/es/pc hosts for deploy of gerrit:599596
  • 07:05 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1142 to clone db1147 T252512', diff saved to https://phabricator.wikimedia.org/P11339 and previous config saved to /var/cache/conftool/dbconfig/20200601-070519-marostegui.json
  • 05:03 marostegui@cumin1001: dbctl commit (dc=all): 'Depool enwiki db2071 slave to test new index - T238966', diff saved to https://phabricator.wikimedia.org/P11338 and previous config saved to /var/cache/conftool/dbconfig/20200601-050354-marostegui.json
  • 04:54 marostegui: Drop testreduce_0715 from m5 master T245408
  • 04:44 marostegui: Depool db1141 from Analytics role - T249188

2020-05-31

  • 09:56 Urbanecm: mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=commonswiki --logwiki=metawiki 'Vox Golf' 'Colonel Chicken' (T254068)

2020-05-29

  • 22:32 bstorm_: updated views on labsdb1010 T252219
  • 20:55 bstorm_: updating views on labsdb1011 T252219
  • 19:27 ryankemper: Successfully finished a rolling restart of the `cloudelastic` clusters (chi, psi, omega) as part of elasticsearch plugins upgrade. Host and service checks re-enabled.
  • 17:28 bstorm_: updating views on labsdb1009 T252219
  • 16:50 ryankemper: Performing a rolling restart of the `cloudelastic` clusters (chi, psi, omega) as part of elasticsearch plugins upgrade. Host and service checks disabled.
  • 16:00 bstorm_: Updating views on labsdb1012 T252219
  • 15:59 ryankemper: Concluded rolling restart of the `relforge` clusters as part of elasticsearch plugins upgrade. Both hosts `relforge1001` and `relforge1002` are back up. Downtime lifted.
  • 15:29 ryankemper: Performing a rolling restart of the `relforge` clusters as part of elasticsearch plugins upgrade
  • 14:59 cdanis: disabling puppet on netflow* to deploy Ic71e96f0 T253128
  • 14:47 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 14:47 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'coredns' .
  • 14:41 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 14:41 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'coredns' .
  • 14:35 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 14:35 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'coredns' .
  • 14:27 hnowlan@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:24 hnowlan@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:15 mdholloway: ran extensions/MachineVision/maintenance/removeBlacklistedSuggestions.php on commonswiki (T253821)
  • 12:49 hnowlan: reimaging restbase2009 after disk replacement
  • 12:37 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:35 elukey@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:15 godog: roll-restart to upgrade thanos to 0.13.0rc0 - T252186 T233956
  • 11:32 moritzm: installing cups security updates (client-side libs/tools)
  • 11:01 ema: upload prometheus-rdkafka-exporter 0.2 to buster-wikimedia T253551
  • 10:53 moritzm: updating mwdebug2002 to 7.2.31
  • 10:02 marostegui: Compress InnoDB on db1138 T232446
  • 08:30 godog: update swift uid/gid on thanos hosts - T123918
  • 08:04 mutante: phabricator - restarted apache2 - back for me now
  • 08:03 XioNoX: add new AMS-IX link to LACP bundle
  • 08:01 mutante: phabricator - broken due to "PhabricatorRepositoryMirrorEngine::pushToGitRepository" starting git process that uses 100% CPU, stopped phd service
  • 07:56 mutante: phabricator - killed pid 25070 (git) which used 100% of CPU, restarted phd service
  • 07:25 moritzm: updating perf on buster systems to new version from 10.4 point release
  • 07:15 moritzm: installing el-api update from latest Buster point release
  • 07:12 moritzm: installing xdg-utils update from latest Buster point release
  • 07:11 mutante: mw1293 (canary jobrunner) replace apache2.conf with version from mwdebug1001, restart apache, to debug for T190111
  • 07:00 moritzm: installing rake security updates
  • 06:36 mutante: deneb - systemctl start docker-reporter-releng-images
  • 05:20 marostegui: Deploy schema change on db1138 (no longer s4 master) - T250055
  • 05:02 marostegui@cumin1001: dbctl commit (dc=all): 'Promote db1081 to s4 master and remove read-only from s4 T253808', diff saved to https://phabricator.wikimedia.org/P11334 and previous config saved to /var/cache/conftool/dbconfig/20200529-050224-marostegui.json
  • 05:01 marostegui@cumin1001: dbctl commit (dc=all): 'Set s4 as read-only for maintenance T253808', diff saved to https://phabricator.wikimedia.org/P11333 and previous config saved to /var/cache/conftool/dbconfig/20200529-050153-marostegui.json
  • 05:00 marostegui: Starting s4 failover from db1138 to db1081 - T253808
  • 04:25 marostegui: Start topology changes in s4 - T253808

2020-05-28

  • 23:48 jforrester@deploy1001: Synchronized php-1.35.0-wmf.34/skins/Vector/resources/skins.vector.styles/Menu.less: T253912 Hotfix: Cannot rename emptyPortlet to empty-portlet yet (duration: 00m 59s)
  • 22:41 jforrester@deploy1001: Synchronized php-1.35.0-wmf.34/extensions/WikibaseMediaInfo/src/Services/FilePageLookup.php: T253792 Follow-up 1827c7a: Ensure inNamespace() is called only on Title object (duration: 00m 58s)
  • 22:24 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T253821 Update MachineVision block list for 2020-05-27 (duration: 00m 57s)
  • 22:09 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: Move one CheckUser right change next to the other (duration: 00m 57s)
  • 22:06 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: Remove version wrapper around wgOverrideUcfirstCharacters; always true (duration: 00m 59s)
  • 21:48 jforrester@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.34
  • 21:26 jforrester@deploy1001: Synchronized php-1.35.0-wmf.34/includes/filerepo/FileRepo.php: T253922 Mark two FileRepo functions public (duration: 01m 07s)
  • 21:12 jforrester@deploy1001: Synchronized php-1.35.0-wmf.34/includes/specials/SpecialUserrights.php: T253909 Restore visibility (previously implicitly public) (duration: 01m 06s)
  • 20:38 jforrester@deploy1001: Synchronized php-1.35.0-wmf.32/skins/Vector/resources/skins.vector.styles: T253905 HOTFIX: Do not apply p-personal absolute positioning to all menus (duration: 01m 07s)
  • 20:22 shdubsh: restart varnishmtail and atsmtail eqsin
  • 20:11 shdubsh: restart ncredirmtail on ncredir5001
  • 19:20 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: roll back the train due to T253905
  • 19:20 twentyafterfour: group2 back to wmf.32 due to T253905
  • 19:20 milimetric@deploy1001: Finished deploy [analytics/refinery@f6d73c8] (thin): Hotfix #2 today (thin): forgot jars [analytics/refinery@f6d73c8] (duration: 00m 09s)
  • 19:20 milimetric@deploy1001: Started deploy [analytics/refinery@f6d73c8] (thin): Hotfix #2 today (thin): forgot jars [analytics/refinery@f6d73c8]
  • 19:17 milimetric@deploy1001: Finished deploy [analytics/refinery@f6d73c8]: Hotfix #2 today: forgot jars [analytics/refinery@f6d73c8] (duration: 16m 54s)
  • 19:14 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.34 refs T253022
  • 19:01 shdubsh: restart varnishmtail and atsmtail on cp5001.eqsin.wmnet
  • 19:00 milimetric@deploy1001: Started deploy [analytics/refinery@f6d73c8]: Hotfix #2 today: forgot jars [analytics/refinery@f6d73c8]
  • 17:03 twentyafterfour@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.34 refs T253022 (duration: 01m 06s)
  • 17:02 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.34 refs T253022
  • 16:32 jforrester@deploy1001: Synchronized php-1.35.0-wmf.34/extensions/Wikibase: T253804 Use ThrowingEntityTermStoreWriter when writers shouldn't be called (duration: 01m 15s)
  • 15:37 milimetric@deploy1001: Finished deploy [analytics/refinery@203d182] (thin): Three hotfixes (THIN) [analytics/refinery@203d182] (duration: 00m 10s)
  • 15:37 milimetric@deploy1001: Started deploy [analytics/refinery@203d182] (thin): Three hotfixes (THIN) [analytics/refinery@203d182]
  • 15:05 milimetric@deploy1001: Finished deploy [analytics/refinery@203d182]: Three hotfixes [analytics/refinery@203d182] (duration: 25m 59s)
  • 15:02 moritzm: installing exim4 security updates on jessie (stretch/buster already fixed)
  • 14:39 milimetric@deploy1001: Started deploy [analytics/refinery@203d182]: Three hotfixes [analytics/refinery@203d182]
  • 14:33 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:30 andrew@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:01 ema: atskafka 0.8 uploaded to buster-wikimedia T253551
  • 13:49 godog: roll-restart prometheus k8s-staging to enable thanos upload - T252186
  • 13:36 hashar: Restarting CI Jenkins for plugin rollback
  • 11:49 moritzm: installing unbound security updates
  • 11:03 kormat@cumin1001: dbctl commit (dc=all): 'Add db2138 to s2+s4 T252985', diff saved to https://phabricator.wikimedia.org/P11330 and previous config saved to /var/cache/conftool/dbconfig/20200528-110333-kormat.json
  • 10:36 jayme@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
  • 10:34 jayme@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
  • 10:30 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
  • 10:02 mutante: gerrit1002 (test server) - chown -R gerrit2:gerrit2 /var/lib/gerrit/review_site ; restarted gerrit service, now the service is not in restart loop anymore, gerrit-ssh is listening too, just not accepting publickey (T239151)
  • 09:51 XioNoX: failover VRRP in ulsfo
  • 09:41 XioNoX: re-activate peering/transit on cr2-eqdfw - T243080
  • 09:35 mutante: restarting gerrit on gerrit1002 after fixing db_pass to the readonly one (T243800)
  • 09:33 XioNoX: restart cr2-eqdfw for upgrade - T243080
  • 09:30 XioNoX: deactivate peering/transit on cr2-eqdfw - T243080
  • 09:25 _joe_: updating ACLs on all etcd servers
  • 09:22 XioNoX: install new Junos on cr2-eqdfw - T243080
  • 09:16 XioNoX: rollback cr2-eqord ospf/bgp - T243080
  • 09:07 XioNoX: restart cr2-eqord for upgrade - T243080
  • 09:05 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
  • 08:50 _joe_: upgrading etcd ACLs (adding new users) to conf1004
  • 08:50 XioNoX: install new Junos on cr2-eqord - T243080
  • 08:46 XioNoX: deactivate peering/transit on cr2-eqord - T243080
  • 08:45 XioNoX: de-pref all OSPF links to cr2-eqord - T243080
  • 08:13 marostegui: Pool db1141 into labsdb analytics role - T249188
  • 07:33 gilles@deploy1001: Synchronized static/images: T252108 Deploying optimised static PNGs (duration: 01m 39s)
  • 07:31 gilles@deploy1001: Synchronized static/apple-touch: T252108 Deploying optimised static PNGs (duration: 01m 12s)
  • 06:30 marostegui@cumin1001: dbctl commit (dc=all): 'Remove db1081 from API and set its weight to 0 on main traffic - preparation for tomorrow's failover T253808', diff saved to https://phabricator.wikimedia.org/P11329 and previous config saved to /var/cache/conftool/dbconfig/20200528-063037-marostegui.json
  • 04:44 marostegui: Run check_private data on db1141 - T249188
  • 04:22 marostegui: Stop MySQL on db1141 - T249188

2020-05-27

  • 23:20 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Add autoreviewrestore right to rollbacker group on hiwiki (T252986) (duration: 01m 05s)
  • 23:16 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Add thwiki Draft namespace to wmgExemptFromUserRobotsControlExtra and enable VE there (T252959) (duration: 01m 06s)
  • 22:58 gehel@cumin1001: END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0)
  • 22:02 crusnov@deploy1001: Finished deploy [netbox/deploy@5251cf1]: Netbox Upgrade to 2.8.4 (part4) (duration: 00m 10s)
  • 22:02 crusnov@deploy1001: Started deploy [netbox/deploy@5251cf1]: Netbox Upgrade to 2.8.4 (part4)
  • 22:01 crusnov@deploy1001: Finished deploy [netbox/deploy@5251cf1]: Netbox Upgrade to 2.8.4 (part3) (duration: 01m 29s)
  • 22:00 crusnov@deploy1001: Started deploy [netbox/deploy@5251cf1]: Netbox Upgrade to 2.8.4 (part3)
  • 22:00 crusnov@deploy1001: deploy aborted: Netbox Upgrade to 2.8.4 (part2) (duration: 01m 31s)
  • 21:58 crusnov@deploy1001: Started deploy [netbox/deploy@5251cf1]: Netbox Upgrade to 2.8.4 (part2)
  • 21:58 crusnov@deploy1001: Finished deploy [netbox/deploy@5251cf1]: Netbox Upgrade to 2.8.1 (part1) (duration: 01m 01s)
  • 21:57 crusnov@deploy1001: Started deploy [netbox/deploy@5251cf1]: Netbox Upgrade to 2.8.1 (part1)
  • 20:43 gehel@cumin1001: START - Cookbook sre.postgresql.postgres-init
  • 20:28 marostegui: Decrease innodb poolsize on s4 master and restart mysql
  • 20:11 mbsantos@deploy1001: Finished deploy [mobileapps/deploy@9dc827f]: Update mobileapps to b3b9214c (T253648) (duration: 03m 31s)
  • 20:08 mbsantos@deploy1001: Started deploy [mobileapps/deploy@9dc827f]: Update mobileapps to b3b9214c (T253648)
  • 20:04 twentyafterfour@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.32 refs T253022 (duration: 01m 04s)
  • 20:03 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.32 refs T253022
  • 20:00 gehel@cumin1001: END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0)
  • 19:56 twentyafterfour@deploy1001: scap failed: average error rate on 4/9 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/e474f13ffac6b8c3bf919c4aeafc8c9b for details)
  • 19:46 jforrester@deploy1001: Synchronized php-1.35.0-wmf.34/includes/parser/CoreParserFunctions.php: T253725 Partially revert 'Fix impedance mismatch with Parser::getRevisionRecordObject()' (duration: 01m 05s)
  • 19:12 joal@deploy1001: Finished deploy [analytics/refinery@8a3dcb3]: Analytics regular weekly train (an-launcher1001 only) [8a3dcb3] (duration: 06m 07s)
  • 19:09 jforrester@deploy1001: Synchronized dblists/mobilemainpagelegacy.dblist: T32405 Stop special casing the main page on mobile for twelve wikis (duration: 01m 05s)
  • 19:06 joal@deploy1001: Started deploy [analytics/refinery@8a3dcb3]: Analytics regular weekly train (an-launcher1001 only) [8a3dcb3]
  • 19:03 joal@deploy1001: Finished deploy [analytics/refinery@8a3dcb3] (thin): Analytics regular weekly train THIN [8a3dcb3] (duration: 00m 08s)
  • 19:03 joal@deploy1001: Started deploy [analytics/refinery@8a3dcb3] (thin): Analytics regular weekly train THIN [8a3dcb3]
  • 19:03 joal@deploy1001: Finished deploy [analytics/refinery@8a3dcb3]: Analytics regular weekly train [8a3dcb3] (duration: 21m 20s)
  • 18:41 joal@deploy1001: Started deploy [analytics/refinery@8a3dcb3]: Analytics regular weekly train [8a3dcb3]
  • 18:06 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable DiscussionTools as beta on mediawiki.org, part II T251208 (duration: 01m 05s)
  • 17:56 jayme: updated tiller to 2.16.7-wmf1 for all services in kubernetes cluster: eqiad
  • 17:53 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable DiscussionTools as beta on mediawiki.org T251208 (duration: 01m 05s)
  • 17:42 gehel@cumin1001: START - Cookbook sre.postgresql.postgres-init
  • 17:40 gehel: repool maps2003
  • 17:32 twentyafterfour@deploy1001: Synchronized php-1.35.0-wmf.34/extensions/Translate/: Deploy https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Translate/+/599027/ to wmf.34 refs T253748 and T253022 (duration: 01m 07s)
  • 16:55 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 16:53 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 16:52 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 16:26 hnowlan@deploy1001: Finished deploy [cpjobqueue/deploy@c8c653e]: Disabling ThumbnailRender as a test of k8s cpjobqueue (duration: 01m 57s)
  • 16:24 hnowlan@deploy1001: Started deploy [cpjobqueue/deploy@c8c653e]: Disabling ThumbnailRender as a test of k8s cpjobqueue
  • 16:10 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 16:09 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 16:06 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 15:52 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Stop defining wmgUsePerformanceInspector, unread T253689 (duration: 01m 04s)
  • 15:52 gehel@cumin1001: END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0)
  • 15:52 jayme: updated tiller to 2.16.7-wmf1 for all services in kubernetes cluster: codfw
  • 15:49 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: Stop loading PerformanceInspector on any wiki T253689 (duration: 01m 06s)
  • 15:18 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 15:02 godog: eqiad-prod: decom ms-be101[678] - T252008
  • 14:58 jayme: updated tiller to 2.16.7-wmf1 for all services in cluster: staging
  • 14:58 volans@cumin1001: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
  • 14:56 otto@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 14:56 otto@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
  • 14:54 volans@cumin1001: START - Cookbook sre.dns.netbox
  • 14:52 otto@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
  • 14:51 cdanis: cumin1001: upgrading python3-conftool and python3-conftool-dbctl
  • 14:50 otto@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 14:50 otto@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 14:46 cdanis: cumin2001: upgrading python3-conftool and python3-conftool-dbctl
  • 14:43 otto@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 14:43 otto@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
  • 14:40 cdanis: reprepro: upload conftool_1.3.1-1{,+deb10u1} to {stretch,buster}-wikimedia
  • 14:36 otto@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 14:36 otto@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
  • 14:32 otto@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 14:32 otto@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 14:30 otto@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
  • 14:16 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1146:3312, db1146:3314 and db1103:3312, db1103:3314', diff saved to https://phabricator.wikimedia.org/P11318 and previous config saved to /var/cache/conftool/dbconfig/20200527-141635-marostegui.json
  • 14:13 otto@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
  • 14:13 otto@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 14:07 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 14:07 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
  • 14:04 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
  • 14:04 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1146:3312, db1146:3314 and db1103:3312, db1103:3314', diff saved to https://phabricator.wikimedia.org/P11317 and previous config saved to /var/cache/conftool/dbconfig/20200527-140442-marostegui.json
  • 14:03 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 14:03 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 13:58 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 13:51 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 13:51 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
  • 13:48 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 13:47 dzahn@cumin1001: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
  • 13:47 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 13:47 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1146:3312, db1146:3314 and db1103:3312, db1103:3314', diff saved to https://phabricator.wikimedia.org/P11316 and previous config saved to /var/cache/conftool/dbconfig/20200527-134704-marostegui.json
  • 13:45 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
  • 13:42 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 13:36 gehel@cumin1001: START - Cookbook sre.postgresql.postgres-init
  • 13:34 gehel@cumin1001: END (FAIL) - Cookbook sre.postgresql.postgres-init (exit_code=99)
  • 13:34 gehel@cumin1001: START - Cookbook sre.postgresql.postgres-init
  • 13:21 gehel: repool maps2004 / depool maps2003
  • 13:21 ema: cp: upgrade purged to 0.14
  • 13:19 marostegui: Kill /usr/local/bin/mwscriptwikiset updateSpecialPages.php s8.dblist --override --only=Fewestrevisions T238199
  • 13:16 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 13:15 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1146:3312, db1146:3314 and db1103:3312, db1103:3314', diff saved to https://phabricator.wikimedia.org/P11313 and previous config saved to /var/cache/conftool/dbconfig/20200527-131515-marostegui.json
  • 13:08 marostegui@cumin1001: dbctl commit (dc=all): 'Add db1146:3312 and db1146:3314 to dbctl T252512', diff saved to https://phabricator.wikimedia.org/P11312 and previous config saved to /var/cache/conftool/dbconfig/20200527-130820-marostegui.json
  • 13:06 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 11:34 Urbanecm: EU SWAT done
  • 11:29 urbanecm@deploy1001: Synchronized php-1.35.0-wmf.34/extensions/GrowthExperiments/: SWAT: 983eda5: Mentorship dialog: Swap panel to ask-help on open (T253692) (duration: 01m 06s)
  • 11:18 ema: cp2027: upgrade purged to 0.14
  • 11:17 ema: purged 0.14 uploaded to buster-wikimedia
  • 11:09 kartik@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 598678|Enable ContentTranslation in Galician Wikipedia as a default tool (T250355) (duration: 01m 18s)
  • 10:15 hashar: contint2001: starting zuul
  • 10:15 hashar: contint2001: started jenkins
  • 10:03 mutante: contint2001 - find /var/lib/jenkins -user statsite -exec chown -h jenkins:jenkins {} \;
  • 10:02 mutante: repeated rsync of /var/lib/jenkins with -p ; find /var/lib/jenkins -group bacula -user statsite -exec chown -h jenkins:jenkins {} \;
  • 09:55 hashar: contint2001: starting jenkins
  • 09:54 hashar: contint1001 / contint2001 : deleted obsolete files /var/lib/jenkins/.git and /var/lib/jenkins/jobs/_shared/
  • 09:52 mutante: contint2001 - find /var/lib/jenkins -user statsite -exec chown -h jenkins:jenkins {} \;
  • 09:51 godog: roll restart prometheus on the fleet to apply I0e2fe8af
  • 09:49 mutante: contint2001 - find /var/lib/jenkins -group bacula -user statsite -exec chown jenkins:jenkins {} \;
  • 09:48 hashar: contint2001: unmasked jenkins and started it
  • 09:42 filippo@cumin1001: conftool action : set/pooled=yes; selector: name=prometheus2003.codfw.wmnet
  • 09:42 mutante: switching CI backend from contint1001 to contint2001
  • 09:40 mutante: repeated rsync -avp --delete /var/lib/zuul/ rsync://contint2001.wikimedia.org/ci--var-lib-zuul-
  • 09:40 hashar: contint1001: masked jenkins and zuul
  • 09:39 mutante: repeated rsync -avp --delete /var/lib/jenkins/ rsync://contint2001.wikimedia.org/ci--var-lib-jenkins-
  • 09:39 hashar: Stopping Zuul and Jenkins CI for scheduled maintenance # T224591
  • 09:35 filippo@cumin1001: conftool action : set/pooled=no; selector: name=prometheus2003.codfw.wmnet
  • 08:52 hashar: contint1001: find /srv/jenkins/builds/operations-puppet-wmf-style-guide -type f -name '*.tmp' -delete # T253729
  • 08:48 marostegui: Stop MySQL on db1103
  • 08:47 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1103:3312 db1103:3314 to clone db1146 T252512', diff saved to https://phabricator.wikimedia.org/P11308 and previous config saved to /var/cache/conftool/dbconfig/20200527-084713-marostegui.json
  • 08:46 arturo: removing more old packages in labstore1006 (all packages in 'rc' state)
  • 08:43 arturo: running apt-get autoremove on labstore1006
  • 08:42 jynus: starting again db2097 db instances T252492
  • 08:11 jayme: updated admin tiller (namespace: kube-system) to 2.16.7-wmf1 in clusters: staging, codfw, eqiad
  • 08:08 hashar: contint1001 / contint2001 : deleted unused /var/lib/zuul/git (the real one is /srv/zuul/git )
  • 08:02 mutante: contint2001 - chown root:root /var/lib/zuul/git
  • 07:54 XioNoX: test new bird conf on dns4001 - T253666
  • 07:45 hashar: contint2001 also fixing symlink permissions: sudo find /var/lib/jenkins -not -user jenkins -exec chown -h jenkins:jenkins {} +
  • 07:35 mutante: contint2001 - find /var/lib/jenkins -group bacula -user jenkins -exec chown jenkins:jenkins {} \;
  • 07:30 mutante: contint2001 - find /var/lib/jenkins -user statsite -exec chown jenkins {} \;
  • 07:26 mutante: contint2001 - chown -R zuul:zuul /var/lib/zuul/
  • 07:26 mutante: contint1001:~# rsync -avpz --delete /srv/jenkins/ rsync://contint2001.wikimedia.org/ci--srv-/jenkins/
  • 07:25 mutante: contint1001:~# rsync -avp --delete /var/lib/jenkins/ rsync://contint2001.wikimedia.org/ci--var-lib-jenkins-
  • 07:25 mutante: contint1001:~# rsync -avp --delete /var/lib/zuul/ rsync://contint2001.wikimedia.org/ci--var-lib-zuul-
  • 07:18 moritzm: installing bind security updates (only client-side tools/libraries in use)
  • 07:04 elukey: matomo upgraded to 3.13.5 on matomo1001 - T252741
  • 06:57 elukey: update matomo on stretch-wikimedia to 3.13.5
  • 06:10 elukey@deploy1001: Finished deploy [analytics/superset/deploy@369a2dd]: Upgrade Superset to 0.36 - second attempt (duration: 00m 57s)
  • 06:09 elukey@deploy1001: Started deploy [analytics/superset/deploy@369a2dd]: Upgrade Superset to 0.36 - second attempt
  • 05:17 marostegui: Remove tmp_3 key from enwiki.recentchanges on db1099:3311 - T206103
  • 04:41 _joe_: cassandra cannot start on restbase2009, one of the disks has failed.
  • 04:39 _joe_: restarting cassandra instances on restbase2009, has a broken disk
  • 04:20 marostegui: Depool labsdb1011 - T249188
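
The dbctl entries above note that the previous config was saved under a name such as `20200527-141635-marostegui.json`. A minimal sketch, assuming that name follows a `YYYYMMDD-HHMMSS-user.json` layout (inferred from these log lines only, not from dbctl documentation), of recovering the save time and user from such a path. The helper `parse_backup_name` is hypothetical:

```python
from datetime import datetime
from pathlib import PurePosixPath


def parse_backup_name(path):
    """Split a saved-config path into (timestamp, user).

    Assumes the file name looks like 'YYYYMMDD-HHMMSS-user.json';
    this layout is inferred from the log entries, not confirmed.
    """
    stem = PurePosixPath(path).stem  # e.g. '20200527-141635-marostegui'
    date_part, time_part, user = stem.split("-", 2)
    saved_at = datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")
    return saved_at, user


# Path copied from the 14:16 dbctl commit entry.
saved_at, user = parse_backup_name(
    "/var/cache/conftool/dbconfig/20200527-141635-marostegui.json"
)
print(saved_at.isoformat(), user)
```

The parsed time (14:16:35) lines up with the 14:16 timestamp of the log entry itself, which is what suggests the layout assumed here.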

2020-05-26

  • 21:34 krinkle@deploy1001: Synchronized wmf-config/mc.php: I0fb124b3593 (duration: 01m 05s)
  • 21:30 krinkle@deploy1001: Synchronized wmf-config/InitialiseSettings.php: I2714e2ae26404 (duration: 01m 06s)
  • 21:18 krinkle@deploy1001: Synchronized wmf-config/profiler.php: Ib0bf8d97b10b, T253674 (duration: 01m 06s)
  • 20:29 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.34 refs T253022
  • 20:08 twentyafterfour@deploy1001: Finished scap: testwikis wikis to 1.35.0-wmf.34 refs T253022 (duration: 70m 02s)
  • 18:58 twentyafterfour@deploy1001: Started scap: testwikis wikis to 1.35.0-wmf.34 refs T253022
  • 18:07 jforrester@deploy1001: Pruned MediaWiki: 1.35.0-wmf.30 (duration: 20m 45s)
  • 18:02 bblack: cr[12]-eqiad: re-route ns0.wikimedia.org to authdns1001 - T241770
  • 18:02 ejegg: restarted fundraising jobs: recurring charge, audit processing, deduplication
  • 17:57 moritzm: installing bind security updates for stretch (only client-side tools/libraries in use)
  • 17:47 cdanis: netflow3001: disabling puppet and testing some pmacct/librdkafka config tweaks T253128
  • 17:16 James_F: 1.35.0-wmf.34 was branched at b5012a1 for T253022
  • 16:45 moritzm: installing jsp-api bugfix update from Buster point release
  • 15:22 akosiaris: sync kubernetes eqiad namespaces configuration with helmfile
  • 15:15 akosiaris: sync kubernetes codfw namespaces configuration with helmfile
  • 15:08 arturo: delete/re-import docker/containerd.io packages in the right version in buster-wikimedia/thirdparty/kubeadm-k8s-1-{15,16} (T250866)
  • 15:08 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: Add lazy-loading to Wikimedia Foundation powered-by icon T239377 (duration: 00m 57s)
  • 15:01 jforrester@deploy1001: Synchronized dblists/mobilemainpagelegacy.dblist: Drop enwiki mobile mainpage special casing T32405 (duration: 00m 59s)
  • 14:58 mholloway-shell@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
  • 14:57 akosiaris: sync staging namespaces configuration
  • 14:57 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 14:57 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'coredns' .
  • 14:57 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' .
  • 14:56 jforrester@deploy1001: Synchronized docroot/noc/: Clear out symlink to mobile.php, now removed (duration: 00m 55s)
  • 14:56 mholloway-shell@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
  • 14:54 mholloway-shell@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
  • 14:53 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: Move mobile.php into CommonSettings.php (duration: 00m 57s)
  • 14:44 arturo: upgrade packages in buster-wikimedia/thirdparty/kubeadm-k8s-1-16 (T246122)
  • 14:44 jforrester@deploy1001: Synchronized docroot/noc/: Clear out symlink to mobile-labs.php, now removed (duration: 00m 58s)
  • 14:43 moritzm: installing rails security updates
  • 14:41 jforrester@deploy1001: Synchronized wmf-config/mobile.php: Don't try to load mobile-labs.php from mobile.php (duration: 00m 57s)
  • 14:38 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: CommonSettings.php: Move uncondition/no-sideeffect includes up (duration: 00m 57s)
  • 14:35 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: Clean up MWMultiVersion check in CommonSettings.php (duration: 00m 59s)
  • 14:33 XioNoX: test bgp med on dns4002
  • 14:31 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: SpecialVersionVersionUrl: Don't use confusing local variable name (duration: 00m 58s)
  • 14:30 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: ExtensionDistributor: Remove EOL REL1_32 (duration: 00m 58s)
  • 13:54 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'zotero' for release 'staging' .
  • 13:04 hashar@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.32
  • 12:43 godog: swift eqiad-prod: decom ms-be101[678] - T252008
  • 12:21 XioNoX: repool ulsfo - T243080
  • 12:11 XioNoX: cr4-ulsfo re-activate transit/ix/4/6 - T243080
  • 12:03 XioNoX: cr4-ulsfo> request vmhost reboot - T243080
  • 12:01 XioNoX: cr4-ulsfo deactivate transit/ix/4/6 - T243080
  • 11:49 XioNoX: cr3-ulsfo> request vmhost reboot - T243080
  • 11:42 XioNoX: cr4-ulsfo> request vmhost software add ... - T243080
  • 11:28 XioNoX: cr3-ulsfo> request vmhost software add ... - T243080
  • 11:27 awight: nnwiki updateCollation.php script has finished.
  • 11:26 XioNoX: depool ulsfo for routers upgrade - T243080
  • 11:16 awight: EU SWAT done (pending a maintenance script to updateCollation)
  • 11:14 awight@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Add 'deletedtext' permission to researcher group (T253420) (duration: 01m 06s)
  • 11:06 awight@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: [nnwiki] Change category collation to (T253559) (duration: 01m 10s)
  • 10:46 marostegui: Stop tendril's event scheduler
  • 10:18 jynus: stop db2097 for hw maintenance T252492
  • 09:48 vgutierrez: rolling upgrade to ats 8.0.7-1wm11
  • 09:41 _joe_: all jobrunners converted to use envoy for TLS termination
  • 09:38 oblivian@cumin1001: conftool action : set/weight=10; selector: name=mw131[0-1].eqiad.wmnet
  • 09:38 oblivian@cumin1001: conftool action : set/weight=10; selector: name=mw133[4-8].eqiad.wmnet
  • 09:37 oblivian@cumin1001: conftool action : set/weight=10; selector: name=mw130[0-9].eqiad.wmnet
  • 09:37 oblivian@cumin1001: conftool action : set/weight=10; selector: name=mw130[0-3].eqiad.wmnet
  • 09:36 oblivian@cumin1001: conftool action : set/weight=10:pooled=yes; selector: name=mw129[3-9].eqiad.wmnet
  • 09:31 oblivian@cumin1001: conftool action : set/weight=1:pooled=yes; selector: name=mw130[0-3].eqiad.wmnet
  • 09:27 oblivian@cumin1001: conftool action : set/weight=1:pooled=yes; selector: name=mw130[4-7].eqiad.wmnet
  • 09:22 gehel: repool wdqs1007, caught up on lag
  • 09:09 oblivian@cumin1001: conftool action : set/weight=1:pooled=yes; selector: name=mw13(0[89]|1[01]).eqiad.wmnet
  • 09:02 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 09:02 mutante: decom'ing people1001 - replaced by people1002
  • 09:01 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 09:01 oblivian@cumin1001: conftool action : set/weight=1:pooled=yes; selector: name=mw13(1|3)8.eqiad.wmnet
  • 08:57 oblivian@cumin1001: conftool action : set/weight=1:pooled=yes; selector: name=mw133[4-7].eqiad.wmnet
  • 08:55 _joe_: progressively converting jobrunners to envoy
  • 08:41 oblivian@cumin1001: conftool action : set/weight=1:pooled=yes; selector: name=mw1337.eqiad.wmnet
  • 07:20 moritzm: installing libssh security updates
  • 07:03 vgutierrez: upgrade to ats 8.0.7-1wm11 on cp3064 and cp3065
  • 06:49 marostegui: Deploy schema change on s3 directly on the master with 1 minute sleep in between wikis T253342
  • 06:47 marostegui: Deploy schema change on s1 directly on the master T253342
  • 06:44 marostegui: Deploy schema change on s4 directly on the master T253342
  • 06:35 XioNoX: reboot scs-ulsfo - T253609
  • 06:29 marostegui: Deploy schema change on s7 directly on the master T253342
  • 06:24 marostegui: Deploy schema change on s8 directly on the master T253342
  • 06:01 marostegui: Deploy schema change on s2 directly on the master T253342
  • 04:35 marostegui: Repool labsdb1011 - T249188
  • 04:14 marostegui: Stop slaves and stop mysql on labsdb1011 T249188
  • 03:55 tstarling@deploy1001: Synchronized php-1.35.0-wmf.31/includes/export/XmlDumpWriter.php: T253468 (duration: 01m 06s)
  • 03:53 tstarling@deploy1001: Synchronized php-1.35.0-wmf.32/includes/export/XmlDumpWriter.php: T253468 (duration: 01m 07s)
  • 03:20 tstarling@deploy1001: Synchronized php-1.35.0-wmf.32/includes/specials/SpecialChangeContentModel.php: for UBN T252963 (duration: 01m 07s)
  • 03:18 tstarling@deploy1001: sync-file aborted: (no justification provided) (duration: 00m 32s)
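
The 09:09 conftool entry above uses the selector `name=mw13(0[89]|1[01]).eqiad.wmnet`. As a rough illustration, assuming the selector value behaves like a regular expression matched against the whole hostname (the log does not state this), the following Python sketch shows which hosts from a hypothetical candidate range such a pattern would cover. The helper `matching_hosts` and the candidate list are made up for the example:

```python
import re

# Selector value copied from the 09:09 log entry. Treating it as a regex
# that must match the full hostname is an assumption for illustration.
SELECTOR = r"mw13(0[89]|1[01]).eqiad.wmnet"


def matching_hosts(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [host for host in candidates if compiled.fullmatch(host)]


# Hypothetical candidate list: mw1300 through mw1319 in eqiad.
candidates = [f"mw{n}.eqiad.wmnet" for n in range(1300, 1320)]
selected = matching_hosts(SELECTOR, candidates)
print(selected)
```

Under that assumption the alternation `0[89]|1[01]` picks out mw1308, mw1309, mw1310 and mw1311, a range that a single bracket expression such as `mw130[0-9]` could not express.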

2020-05-25

  • 23:34 ejegg: re-enabled fundraising queue consumers and job runners, except audits, dedupe, and recurring
  • 21:38 eileen: civicrm revision changed from 5428c5c449 to d1cd99166f, config revision is 6b05d6bb25
  • 21:18 eileen: civicrm revision is 7380e0e8ce, config revision is 6b05d6bb25
  • 21:01 ejegg: updated fundraising CiviCRM from 737d88a5ee to 7380e0e8ce
  • 17:17 ejegg: updated fundraising CiviCRM from 6b1d5902dd to 737d88a5ee
  • 17:09 ejegg: enabled contribution tracking queue on payments-wiki
  • 16:24 ejegg: updated standalone SmashPig from 2702b04329 to 44690f761c
  • 16:17 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'mathoid' for release 'staging' .
  • 16:16 XioNoX: enable IX4/6 BGP group on cr4-ulsfo - T237575
  • 16:00 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 15:55 XioNoX: disable IX4/6 BGP group on cr4-ulsfo - T237575
  • 15:17 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 15:15 ejegg: updated payments-wiki from 3c465cb11c to d11efeb1cf, put it into maintenance mode
  • 15:15 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 14:53 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 14:39 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 14:06 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 14:00 jayme@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'zotero' for release 'production' .
  • 13:46 _joe_: uploaded doxygen 1.8.17-1 to wikimedia-buster component/ci
  • 13:43 filippo@cumin1001: conftool action : set/pooled=true; selector: dnsdisc=thanos-swift
  • 13:40 jayme@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'zotero' for release 'production' .
  • 13:18 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'zotero' for release 'staging' .
  • 13:10 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 13:09 vgutierrez: upgrade ATS to version 8.0.7-1wm11 on cp4026 and cp4032
  • 12:52 godog: roll-restart pybal in low-traffic codfw
  • 12:44 ema: upload atskafka 0.7 to buster-wikimedia, upgrade cp3050 T253551
  • 12:37 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 12:30 marostegui: Deploy schema change on s5 directly on the master T253342
  • 12:14 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 12:09 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 12:01 _joe_: converting the remaining appservers to use envoy for TLS termination
  • 11:57 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 11:54 marostegui: Install a new tendril_purge_global_status_log event on db1115 (tendril) T252331
  • 11:52 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 11:51 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 11:48 marostegui: Stop event scheduler on db1115 (tendril) - T252331
  • 11:46 moritzm: uploaded CAS 6.1.5-1 to apt.wikimedia.org T233947
  • 11:36 _joe_: switch mw[1349-1355,1364-1373].eqiad.wmnet to envoy
  • 11:27 marostegui: Extend /srv 1100G on db213[6-9] T252985
  • 11:23 marostegui: Extend /srv 1100G on db114[1-9] T252512
  • 11:21 marostegui: Extend db1141's (temporary labsdb test host) /srv 1TB extra - T249188
  • 11:09 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 11:09 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 11:01 ema: upload prometheus-rdkafka-exporter to buster-wikimedia T253197
  • 10:34 jdrewniak@deploy1001: Synchronized portals: Wikimedia Portals Update: Bumping portals to master (598439) (duration: 01m 05s)
  • 10:33 jdrewniak@deploy1001: Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: Bumping portals to master (598439) (duration: 01m 06s)
  • 10:20 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'zotero' for release 'staging' .
  • 09:56 _joe_: transition done
  • 09:49 _joe_: depooled mw1337, it was getting all traffic supposed to go to the jobrunners
  • 09:45 vgutierrez: upload trafficserver 8.0.7-1wm10 to apt.wm.o (buster)
  • 09:42 _joe_: converting mw1319-1333 to use envoy for TLS termination
  • 09:17 _joe_: migrated mw1337 to use envoy for TLS termination T247389
  • 09:10 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'zotero' for release 'staging' .
  • 09:04 godog: turn on sni by default for check_http --ssl icinga invocations - T253292
  • 08:52 jayme@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'termbox' for release 'production' .
  • 08:39 jayme@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'termbox' for release 'production' .
  • 08:21 filippo@cumin1001: conftool action : set/pooled=yes:weight=100; selector: service=thanos-swift
  • 08:05 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'termbox' for release 'staging' .
  • 07:36 moritzm: installed linux-image-amd64 on labstore1005 (current meta package for kernels following the Stretch update) T224582
  • 07:36 moritzm: installed linux-image-amd64 on labstore (current meta package for kernels following the Stretch update) T224582
  • 07:02 marostegui: Stop event scheduler on tendril T252331
  • 05:11 marostegui: Deploy schema change on s6, directly on the master - T253342
  • 04:54 marostegui: Depool labsdb1011 - T249188
  • 04:11 kart_: Updated cxserver to 2020-05-22-083137-production (T246317, T252871)
  • 04:07 kartik@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 04:04 kartik@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 04:02 kartik@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'cxserver' for release 'staging' .

2020-05-24

  • 17:36 gehel: restarting elasticsearch psi on elastic1052
  • 16:44 gehel: depool wdqs1007 to catch up on lag
  • 16:43 gehel: restart blazegraph on wdqs1007

2020-05-23

  • 19:04 krinkle@deploy1001: Synchronized php-1.35.0-wmf.31/includes/filerepo/file/LocalFile.php: I0f7e885997d60 (duration: 01m 06s)
  • 18:58 krinkle@deploy1001: Synchronized php-1.35.0-wmf.32/includes/filerepo/file/LocalFile.php: I0f7e885997d60 (duration: 01m 08s)
  • 18:06 krinkle@deploy1001: Synchronized php-1.35.0-wmf.32/includes/filerepo/: I31a9bb6672 (duration: 01m 06s)
  • 18:05 krinkle@deploy1001: Synchronized php-1.35.0-wmf.31/includes/filerepo/: I31a9bb6672 (duration: 01m 10s)
  • 15:44 krinkle@deploy1001: Synchronized wmf-config/mc.php: I5ad8fe - Disable coalesceKeys on commonswiki (duration: 01m 09s)
  • 14:58 Krinkle: scap-pull to reset state on mwdebug1002
  • 14:50 Krinkle: Testing mc.php changes on mwdebug1002
  • 08:04 elukey: powercycle an-presto1004 - unresponsive, racadm getsel shows CPU overheating alerts

2020-05-22

  • 22:42 krinkle@deploy1001: Synchronized php-1.35.0-wmf.31/includes/filerepo/: Ie19613ef7643a (duration: 01m 06s)
  • 22:40 krinkle@deploy1001: Synchronized php-1.35.0-wmf.32/includes/filerepo/: Ie19613ef7643a (duration: 01m 08s)
  • 15:58 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 15:58 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 15:57 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 15:53 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 15:47 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 15:45 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 15:45 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 15:30 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 15:25 cdanis: fixing prometheus-nic-firmware-textfile.service wherever it is broken T253374
  • 15:25 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 15:24 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
  • 15:06 marostegui: Decrease tendril_purge_global_status_log_5m storing rows time from 2 days to 1 day T252331
  • 15:01 kormat@cumin1001: dbctl commit (dc=all): 'Pool db2137 into s4+s5 T252985', diff saved to https://phabricator.wikimedia.org/P11292 and previous config saved to /var/cache/conftool/dbconfig/20200522-150120-kormat.json
  • 14:53 reedy@deploy1001: Synchronized php-1.35.0-wmf.31/maintenance/blockUsers.php: (no justification provided) (duration: 01m 08s)
  • 14:51 reedy@deploy1001: Synchronized php-1.35.0-wmf.32/maintenance/blockUsers.php: (no justification provided) (duration: 01m 09s)
  • 14:35 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1144:331[45] and db1097:331[45]', diff saved to https://phabricator.wikimedia.org/P11290 and previous config saved to /var/cache/conftool/dbconfig/20200522-143541-marostegui.json
  • 14:24 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 14:22 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 14:15 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1144:331[45] and db1097:331[45]', diff saved to https://phabricator.wikimedia.org/P11289 and previous config saved to /var/cache/conftool/dbconfig/20200522-141513-marostegui.json
  • 14:13 sukhe: upload dnsdist_1.4.0-1~deb10u1 to apt.wm.o (buster) - T252132
  • 14:08 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1144:331[45] and db1097:331[45]', diff saved to https://phabricator.wikimedia.org/P11288 and previous config saved to /var/cache/conftool/dbconfig/20200522-140847-marostegui.json
  • 13:14 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1144:331[45] and db1097:331[45]', diff saved to https://phabricator.wikimedia.org/P11286 and previous config saved to /var/cache/conftool/dbconfig/20200522-131452-marostegui.json
  • 13:10 jayme@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'termbox' for release 'production' .
  • 13:10 jayme@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'termbox' for release 'production' .
  • 13:09 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'termbox' for release 'staging' .
  • 13:08 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'termbox' for release 'test' .
  • 13:07 marostegui@cumin1001: dbctl commit (dc=all): 'Add db1144:3314 and db1144:3315 to the list of hosts', diff saved to https://phabricator.wikimedia.org/P11284 and previous config saved to /var/cache/conftool/dbconfig/20200522-130707-marostegui.json
  • 12:56 vgutierrez: depool cp4032 for some ats tests
  • 12:13 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 12:12 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 12:06 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 12:04 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 12:03 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
  • 12:03 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 10:48 marostegui: Stop MySQL on db1097:3314, db1097:3315 to clone db1144 - T252512
  • 10:44 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1097:3314, db1097:3315 - T252512', diff saved to https://phabricator.wikimedia.org/P11281 and previous config saved to /var/cache/conftool/dbconfig/20200522-104437-marostegui.json
  • 10:37 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 10:35 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 10:34 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 10:32 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 10:10 marostegui: Stop event_scheduler on db1115 - T252331
  • 10:09 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 10:05 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 10:05 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
  • 10:05 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 10:00 jbond42: update pdns-recursor on dns recursors
  • 09:43 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 09:41 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 09:22 jayme@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'mathoid' for release 'production' .
  • 09:09 elukey@deploy1001: Finished deploy [analytics/superset/deploy@be203c8]: Rollback superset to 0.35.2 (duration: 00m 43s)
  • 09:09 elukey@deploy1001: Started deploy [analytics/superset/deploy@be203c8]: Rollback superset to 0.35.2
  • 08:41 vgutierrez: reverting hugepages experiment on cp2041
  • 08:27 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1149 and db1081', diff saved to https://phabricator.wikimedia.org/P11278 and previous config saved to /var/cache/conftool/dbconfig/20200522-082700-marostegui.json
  • 08:18 elukey@deploy1001: Finished deploy [analytics/superset/deploy@59ba01d]: Upgrade Superset to 0.36 (duration: 01m 01s)
  • 08:17 elukey@deploy1001: Started deploy [analytics/superset/deploy@59ba01d]: Upgrade Superset to 0.36
  • 08:13 vgutierrez: test hugepages allocator on ATS in cp2041
  • 08:06 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1149 and db1081', diff saved to https://phabricator.wikimedia.org/P11277 and previous config saved to /var/cache/conftool/dbconfig/20200522-080629-marostegui.json
  • 07:48 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1149 and db1081', diff saved to https://phabricator.wikimedia.org/P11276 and previous config saved to /var/cache/conftool/dbconfig/20200522-074853-marostegui.json
  • 07:20 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1149 and db1081', diff saved to https://phabricator.wikimedia.org/P11275 and previous config saved to /var/cache/conftool/dbconfig/20200522-072000-marostegui.json
  • 07:07 elukey@puppetmaster1001: conftool action : set/pooled=yes:weight=10; selector: name=druid1008.eqiad.wmnet
  • 07:04 elukey@puppetmaster1001: conftool action : set/pooled=yes; selector: name=druid1007.eqiad.wmnet
  • 07:04 elukey@puppetmaster1001: conftool action : set/pooled=yes; selector: name=druid1007.eqiad.wmnet
  • 04:34 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1081 - T252512', diff saved to https://phabricator.wikimedia.org/P11272 and previous config saved to /var/cache/conftool/dbconfig/20200522-043418-marostegui.json
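
Cookbook runs in the entries above are logged as START / END (PASS|FAIL) pairs. The following is a minimal sketch, assuming exactly that line format, of how the END entries could be tallied per cookbook and outcome. The regular expression and the function name tally_cookbook_results are hypothetical illustrations and not part of any Wikimedia tooling.

```python
import re
from collections import Counter

# Matches END entries such as:
#   12:13 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
# The pattern is an assumption based on the entries in this log.
END_RE = re.compile(
    r"(?P<time>\d{2}:\d{2}) (?P<user>\S+?)@(?P<host>\S+): "
    r"END \((?P<result>PASS|FAIL)\) - Cookbook (?P<cookbook>\S+) "
    r"\(exit_code=(?P<code>\d+)\)"
)


def tally_cookbook_results(lines):
    """Count END entries per (cookbook, result); other lines are ignored."""
    counts = Counter()
    for line in lines:
        match = END_RE.search(line)
        if match:
            counts[(match["cookbook"], match["result"])] += 1
    return counts


# Three entries copied from the 2020-05-22 section above.
sample = [
    "  • 12:13 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)",
    "  • 12:12 dzahn@cumin1001: START - Cookbook sre.hosts.decommission",
    "  • 12:03 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)",
]
counts = tally_cookbook_results(sample)
print(counts)
```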

2020-05-21

  • 23:58 ejegg: updated civicrm from b658fd8233 to 6b1d5902dd
  • 23:54 krinkle@deploy1001: Synchronized php-1.35.0-wmf.32/includes/content/ContentHandlerFactory.php: If578893f5689 (duration: 01m 06s)
  • 23:47 krinkle@deploy1001: Synchronized php-1.35.0-wmf.32/extensions/LiquidThreads/classes/Thread.php: If3418cba06e (duration: 01m 07s)
  • 23:41 krinkle@deploy1001: Synchronized wmf-config/mc.php: I222457729a5b (duration: 01m 08s)
  • 21:46 eileen: civicrm revision changed from ed4c9522ac to b658fd8233, config revision is 9babae3954
  • 21:10 foks: removing two files for legal compliance
  • 20:44 bstorm_: labstore1005 is now running stretch and drbd devices are resyncing after several reboots and some significant effort T224582
  • 18:24 twentyafterfour: restarting phabricator on phab1001 to deploy https://phabricator.wikimedia.org/rPHEX2687d08786a9dadcbaa96709de991f471f239830
  • 17:24 bblack: anycast experiment done, all back to normal
  • 17:20 bblack: anycast experimentation commencing in ulsfo (test route withdrawal)...
  • 17:04 bstorm_: starting labstore1005 upgrades T224582
  • 16:14 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 16:12 andrew@cumin1001: START - Cookbook sre.hosts.downtime
  • 16:04 sbassett@deploy1001: Synchronized private/PrivateSettings.php: Update mitigations for T250887 (duration: 01m 08s)
  • 15:48 andrewbogott: rebuilding cloudnet1003.eqiad.wmnet with Debian Buster for T253124
  • 15:22 XioNoX: Add BGP between cr1/2-eqiad and authdns1001 - T253196
  • 15:09 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:09 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:08 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 15:08 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:07 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 15:07 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:59 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw217[0-2].codfw.wmnet
  • 14:59 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw216[0-9].codfw.wmnet
  • 14:58 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw215[8-9].codfw.wmnet
  • 14:50 bblack@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:47 bblack@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:44 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'mathoid' for release 'canary' .
  • 14:33 akosiaris: upload helmfile 0.109.0 to apt.wikimedia.org/buster-wikimedia and stretch-wikimedia, component main
  • 13:51 vgutierrez: depool cp4032 for some ats tests
  • 13:22 mutante: cloudnet1004 - reboot to test PXE boot
  • 12:44 andrewbogott: reimaging cloudnet1004.eqiad.wmnet for T253124
  • 12:29 elukey: roll restart druid-public cluster (druid100[4-6], backend for the AQS API) to apply new settings + openjdk upgrade - T252771
  • 12:13 mutante: depooled mw2158 through mw2172 to make room again in C3 as planned (T247018)
  • 12:12 marostegui: Repool labsdb1011 into the analytics role 🤞 - T249188
  • 12:12 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw217[0-2].codfw.wmnet
  • 12:10 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw216[0-9].codfw.wmnet
  • 12:05 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1143 and db1091', diff saved to https://phabricator.wikimedia.org/P11270 and previous config saved to /var/cache/conftool/dbconfig/20200521-120555-marostegui.json
  • 12:05 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw215[8-9].codfw.wmnet
  • 11:18 hnowlan: Removed changeprop from scb hosts
  • 11:04 vgutierrez: rolling restart of ncredir servers for kernel update
  • 10:17 vgutierrez: restart of acme-chief servers for kernel update
  • 10:13 jbond42: deploy CI for puppet private repo
  • 10:11 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1143 and db1091', diff saved to https://phabricator.wikimedia.org/P11268 and previous config saved to /var/cache/conftool/dbconfig/20200521-101100-marostegui.json
  • 10:07 mutante: replaced backend of people.wikimedia.org - people1001 will be inaccessible, replaced with people1002 on buster. all home dirs have been synced over, there should be no difference except you have to use people1002 now for uploads (T247649)
  • 10:06 godog: test adding --sni to check_http -S on icinga2001 - T253292
  • 09:51 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1143 and db1091', diff saved to https://phabricator.wikimedia.org/P11267 and previous config saved to /var/cache/conftool/dbconfig/20200521-095100-marostegui.json
  • 09:28 mutante: deneb - sudo systemctl reset-failed to clear Icinga alerts about systemd degraded state
  • 09:12 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1143 and db1091', diff saved to https://phabricator.wikimedia.org/P11266 and previous config saved to /var/cache/conftool/dbconfig/20200521-091245-marostegui.json
  • 09:01 mutante: LDAP - added lmata to wmf group (T253277)
  • 08:55 XioNoX: Advertise Anycast 198.35.27.0/24 from esams - T253196
  • 08:52 XioNoX: Advertise Anycast 198.35.27.0/24 from eqsin - T253196
  • 08:49 marostegui@cumin1001: dbctl commit (dc=all): 'Pool db1143 with minimal weight for the first time T252512', diff saved to https://phabricator.wikimedia.org/P11265 and previous config saved to /var/cache/conftool/dbconfig/20200521-084933-marostegui.json
  • 08:47 XioNoX: Advertise Anycast 198.35.27.0/24 from eqiad/eqord - T253196
  • 08:42 marostegui@cumin1001: dbctl commit (dc=all): 'Add db1143 to the list of s4 hosts, depooled - T252512', diff saved to https://phabricator.wikimedia.org/P11264 and previous config saved to /var/cache/conftool/dbconfig/20200521-084226-marostegui.json
  • 08:34 XioNoX: Advertise Anycast 198.35.27.0/24 from dfw - T253196
  • 08:27 XioNoX: Advertise Anycast 198.35.27.0/24 from ulsfo - T253196
  • 08:20 XioNoX: Delete ARIN route object for 198.35.26.0/23 - T253196
  • 08:13 XioNoX: Delete ROA for 198.35.26.0/23 - T253196
  • 08:10 XioNoX: repool ulsfo - T253196
  • 08:03 XioNoX: Shrink ulsfo's 198.35.26.0/23 to 198.35.26.0/24 - T253196
  • 07:29 XioNoX: depool ulsfo - T253196
  • 07:22 marostegui: Purge events from tendril.global_status_log older than 24h - T252331
  • 07:03 jynus@cumin1001: dbctl commit (dc=all): 'Repool es1019 fully', diff saved to https://phabricator.wikimedia.org/P11263 and previous config saved to /var/cache/conftool/dbconfig/20200521-070335-jynus.json
  • 06:59 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1091 - T252512', diff saved to https://phabricator.wikimedia.org/P11261 and previous config saved to /var/cache/conftool/dbconfig/20200521-065858-marostegui.json
  • 06:28 jynus@cumin1001: dbctl commit (dc=all): 'Repool es1019 with 50% weight', diff saved to https://phabricator.wikimedia.org/P11260 and previous config saved to /var/cache/conftool/dbconfig/20200521-062823-jynus.json
  • 06:04 vgutierrez: pool cp5012 - T251219
  • 05:42 jynus@cumin1001: dbctl commit (dc=all): 'Repool es1019 with low weight', diff saved to https://phabricator.wikimedia.org/P11259 and previous config saved to /var/cache/conftool/dbconfig/20200521-054231-jynus.json
  • 05:03 marostegui@cumin1001: dbctl commit (dc=all): 'Set enwiki as read-only=off after maintenance T251982', diff saved to https://phabricator.wikimedia.org/P11258 and previous config saved to /var/cache/conftool/dbconfig/20200521-050328-marostegui.json
  • 05:00 marostegui@cumin1001: dbctl commit (dc=all): 'Set enwiki as read-only for maintenance T251982', diff saved to https://phabricator.wikimedia.org/P11257 and previous config saved to /var/cache/conftool/dbconfig/20200521-050029-marostegui.json
  • 01:03 krinkle@deploy1001: Synchronized wmf-config/mc.php: Ic9efa98312b (duration: 01m 08s)
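
The 08:03 to 08:55 entries above shrink ulsfo's 198.35.26.0/23 to 198.35.26.0/24 and advertise 198.35.27.0/24 as Anycast. As an illustration only (this is not one of the logged commands), a short sketch with Python's standard ipaddress module checks that these two /24 prefixes are exactly the two halves of the original /23. The variable names are chosen here for the example.

```python
import ipaddress

# Prefixes taken from the T253196 entries above.
original = ipaddress.ip_network("198.35.26.0/23")
ulsfo = ipaddress.ip_network("198.35.26.0/24")
anycast = ipaddress.ip_network("198.35.27.0/24")

# Splitting the /23 one bit further yields its two /24 halves.
halves = list(original.subnets(new_prefix=24))

print(halves)
print(anycast.subnet_of(original), ulsfo.overlaps(anycast))
```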

2020-05-20

  • 20:16 herron: logstash1011:~# kafka-preferred-replica-election --zookeeper conf1004.eqiad.wmnet,conf1005.eqiad.wmnet,conf1006.eqiad.wmnet/kafka/logging-eqiad
  • 19:27 robh: cp5012 still offline for mem tests, "fast" testing complete without errors and extended testing in progress. system firmware was updated before testing. T251219
  • 18:10 XioNoX: accept 198.35.27.0/24 from Anycast peers on all routers - T253196
  • 18:01 XioNoX: add BGP between authdns2001 and cr1-codfw - T253196
  • 17:57 XioNoX: accept 198.35.27.0/24 from Anycast peers on cr3-ulsfo - T253196
  • 17:44 robh: cp5012 rebooting for troubleshooting
  • 17:02 bblack: dns* + authdns* - disabling puppet to test https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/597311/
  • 16:53 bblack: kraz.wikimedia.org ( https://wikitech.wikimedia.org/wiki/IRCD ) - stopping ircecho then ircd, then restarting them in reverse order - T239993
  • 16:01 jayme@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'mathoid' for release 'production' .
  • 16:01 jayme@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'mathoid' for release 'canary' .
  • 15:42 elukey: update puppet compiler's facts
  • 15:21 moritzm: installing libssh security updates
  • 15:15 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'mathoid' for release 'staging' .
  • 15:00 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T253096 [itwikivoyage] Undeploy Insider and Listings extensions (duration: 01m 08s)
  • 14:43 marostegui: Replace tendril_purge_global_status_log_5m event with the new one (purging every 2d of data and with a higher limit of rows) - T252331
  • 14:34 hnowlan@deploy1001: Finished deploy [restbase/deploy@6d2f88c]: Add awa.wikipedia.org to wikipedia list (duration: 19m 49s)
  • 14:15 hnowlan@deploy1001: Started deploy [restbase/deploy@6d2f88c]: Add awa.wikipedia.org to wikipedia list
  • 14:06 XioNoX: special-ranges6, remove 4000::/2 and 8000::/1
  • 14:03 bblack: authdns1001 - poweroff for T241770
  • 14:00 bblack: cr2-eqiad - re-routing ns[01] public IPs from authdns1001 (going offline for hw work) to dns1002 - T241770 (redo from earlier, commit didn't take for whatever reason)
  • 13:52 bblack: cr[12]-eqiad - re-routing ns[01] public IPs from authdns1001 (going offline for hw work) to dns1002 - T241770
  • 13:51 bblack: authdns1001 - downtimed for physical work - T241770
  • 13:24 milimetric@deploy1001: Finished deploy [analytics/refinery@a891999] (thin): Regular analytics weekly train THIN [analytics/refinery@a891999] (duration: 00m 10s)
  • 13:23 milimetric@deploy1001: Started deploy [analytics/refinery@a891999] (thin): Regular analytics weekly train THIN [analytics/refinery@a891999]
  • 13:23 milimetric@deploy1001: Finished deploy [analytics/refinery@a891999]: Regular analytics weekly train [analytics/refinery@a891999] (duration: 38m 33s)
  • 13:23 godog: remove stale tcp service on lvs codfw low-traffic 10.2.1.53:10902
  • 13:00 Amir1: creating two wikis is done
  • 12:52 ladsgroup@deploy1001: Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 10m 49s)
  • 12:45 milimetric@deploy1001: Started deploy [analytics/refinery@a891999]: Regular analytics weekly train [analytics/refinery@a891999]
  • 12:41 ladsgroup@deploy1001: Synchronized static/images/project-logos/: Creating Wiktionary Konkani (gomwiktionary) - T249506 (duration: 01m 06s)
  • 12:40 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 12:38 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Creating Wiktionary Konkani (gomwiktionary) - T249506 (duration: 01m 05s)
  • 12:35 ladsgroup@deploy1001: rebuilt and synchronized wikiversions files: Creating Wiktionary Konkani (gomwiktionary) - T249506
  • 12:33 ladsgroup@deploy1001: Synchronized dblists: Creating Wiktionary Konkani (gomwiktionary) - T249506 (duration: 01m 06s)
  • 12:28 godog: roll-restart pybal on codfw low-traffic - T233956
  • 12:26 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 12:22 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
  • 12:22 ladsgroup@deploy1001: Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 03m 01s)
  • 12:21 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 12:18 ladsgroup@deploy1001: Synchronized langlist: Create Awadhi Wikipedia (awawiki) - T251371 (duration: 01m 06s)
  • 12:16 ladsgroup@deploy1001: Synchronized static/images/project-logos: Create Awadhi Wikipedia (awawiki) - T251371 (duration: 01m 06s)
  • 12:14 ladsgroup@deploy1001: Synchronized multiversion/MWMultiVersion.php: Create Awadhi Wikipedia (awawiki) - T251371 (duration: 01m 06s)
  • 12:12 ladsgroup@deploy1001: rebuilt and synchronized wikiversions files: Create Awadhi Wikipedia (awawiki) - T251371
  • 12:07 ladsgroup@deploy1001: Synchronized dblists: (no justification provided) (duration: 01m 08s)
  • 11:37 mutante: rebooting ganeti1009 and ganeti1011 to hopefully clear icinga alerts about microcode mitigations
  • 11:10 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool new host db1142 and db1084', diff saved to https://phabricator.wikimedia.org/P11253 and previous config saved to /var/cache/conftool/dbconfig/20200520-111013-marostegui.json
  • 11:07 jynus@cumin1001: dbctl commit (dc=all): 'Repool es1018, es1015 fully', diff saved to https://phabricator.wikimedia.org/P11252 and previous config saved to /var/cache/conftool/dbconfig/20200520-110732-jynus.json
  • 11:04 jbond42: roll out update of exim4
  • 10:46 moritzm: installing 4.19.118 Linux packages on Buster hosts
  • 10:28 vgutierrez: rolling restart of ats-tls in text@esams - T249335
  • 10:19 marostegui@cumin1001: dbctl commit (dc=all): 'Increase weight for db1142 and db1084 on s4', diff saved to https://phabricator.wikimedia.org/P11250 and previous config saved to /var/cache/conftool/dbconfig/20200520-101928-marostegui.json
  • 10:07 jynus@cumin1001: dbctl commit (dc=all): 'Repool es1018, es1015 at 50% weight', diff saved to https://phabricator.wikimedia.org/P11249 and previous config saved to /var/cache/conftool/dbconfig/20200520-100726-jynus.json
  • 09:43 vgutierrez: disable KA for POST/PUT requests on esams - T249335
  • 09:36 XioNoX: create ROAs for 198.35.26.0/24 and 198.35.27.0/24 - T253196
  • 09:33 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 09:31 marostegui@cumin1001: dbctl commit (dc=all): 'Increase weight for db1142 and db1084 on s4', diff saved to https://phabricator.wikimedia.org/P11247 and previous config saved to /var/cache/conftool/dbconfig/20200520-093141-marostegui.json
  • 09:30 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 09:29 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:28 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:28 XioNoX: create ARIN inetnum 198.35.27.0/24 and route 198.35.26.0/24 + 198.35.27.0/24 - T253196
  • 09:26 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:26 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:26 marostegui: Upgrade db1083 (s1 master) to 10.1.43-2 without restarting T251982
  • 09:25 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:24 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:15 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:12 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:11 marostegui@cumin1001: dbctl commit (dc=all): 'Increase weight for new host db1142 and start to repool db1084', diff saved to https://phabricator.wikimedia.org/P11246 and previous config saved to /var/cache/conftool/dbconfig/20200520-091153-marostegui.json
  • 09:08 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 09:08 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:05 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:02 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:01 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'mathoid' for release 'staging' .
  • 09:00 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:00 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:57 marostegui@cumin1001: dbctl commit (dc=all): 'Pool db1142 with minimum weight for the first time T252512', diff saved to https://phabricator.wikimedia.org/P11245 and previous config saved to /var/cache/conftool/dbconfig/20200520-085757-marostegui.json
  • 08:57 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:55 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:52 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:52 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 08:49 _joe_: converting mw1266-1275 to use envoy T247389
  • 08:46 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:44 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:42 XioNoX: Remove bogons4 for policy options on all routers - gerrit 597272
  • 08:36 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:34 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:33 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:33 _joe_: disabling puppet on mw1266-1275 for migration to envoy
  • 08:31 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:21 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:19 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 07:41 marostegui: alter table categorylinks engine=Innodb ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8,force on all labsdb1011 wikis - T249188
  • 07:24 moritzm: install systemd security updates
  • 07:10 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1084 to clone db1142 T252512', diff saved to https://phabricator.wikimedia.org/P11241 and previous config saved to /var/cache/conftool/dbconfig/20200520-071010-marostegui.json
  • 00:05 RoanKattouw: Ran namespaceDupes.php on tiwiki and tiwiktionary for T251287
  • 00:03 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Set sitename and meta namespace localizations for tiwiki and tiwiktionary (T251287) (duration: 01m 06s)

2020-05-19

  • 23:59 RoanKattouw: Ran namespaceDupes.php on jvwiki and jvwiktionary for T252754
  • 23:57 jforrester@deploy1001: Synchronized php-1.35.0-wmf.32/extensions/Insider/includes/InsiderHooks.php: T252846 Use SidebarBeforeOutput hook with correct format (duration: 01m 06s)
  • 23:55 catrope@deploy1001: Finished scap: i18n scap for namespace localizations (T251287, T252754) (duration: 62m 26s)
  • 22:53 catrope@deploy1001: Started scap: i18n scap for namespace localizations (T251287, T252754)
  • 18:46 herron: performing rolling restarts of codfw/eqiad ELK clusters for java updates
  • 18:41 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Grant template editors editcontentmodel on enwiki (T253081) (duration: 01m 06s)
  • 18:35 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable GrowthExperiments features on frwiki (T252420) (duration: 01m 08s)
  • 17:09 arturo: added tesseract suite to stretch-wikimedia component/tesseract-410-bpo (T247422)
  • 16:24 godog: power cycle thanos-fe* / thanos-be*
  • 15:23 kormat@cumin1001: dbctl commit (dc=all): 'Repool db2073 into s4 T252985', diff saved to https://phabricator.wikimedia.org/P11236 and previous config saved to /var/cache/conftool/dbconfig/20200519-152340-kormat.json
  • 15:20 sukhe@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:20 sukhe@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:16 cdanis: canary on ~150 hosts looks great, re-enabling puppet on all physical hosts ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕥☕ sudo cumin 'F:virtual = physical' 'enable-puppet "cdanis deploying I68c97d5"'
  • 15:04 sukhe@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:04 sukhe@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:59 moritzm: installing fuse update from Buster point release
  • 14:47 cdanis: disabling puppet on all physical hosts ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕥☕ sudo cumin 'F:virtual = physical' 'disable-puppet "cdanis deploying I68c97d5"'
  • 14:38 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
  • 14:26 XioNoX: Set minimum-links 2 to AMS-IX LACP - T253122
  • 13:53 XioNoX: configure new AMS-IX port as quarantine - T251121
  • 13:18 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'mathoid' for release 'staging' .
  • 13:09 elukey@cumin1001: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
  • 13:09 jayme: updated helm: 2.16.7-1 -> 2.16.7-2 on deploy[1,2]001 and contint[1,2]001
  • 13:09 elukey@cumin1001: START - Cookbook sre.ganeti.makevm
  • 13:03 kormat@cumin1001: dbctl commit (dc=all): 'Pool db2136 into s4 T252985', diff saved to https://phabricator.wikimedia.org/P11233 and previous config saved to /var/cache/conftool/dbconfig/20200519-130313-kormat.json
  • 12:40 ariel@deploy1001: Finished deploy [dumps/dumps@a329605]: make page content fixup script move inprog files into place if good (duration: 00m 04s)
  • 12:40 ariel@deploy1001: Started deploy [dumps/dumps@a329605]: make page content fixup script move inprog files into place if good
  • 12:37 jayme: imported helm 2.16.7-2 to main for buster-wikimedia, stretch-wikimedia, jessie-wikimedia
  • 12:17 hnowlan@cumin1001: END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0)
  • 11:51 jynus: starting backups of es1, es2, es3 on eqiad into backup1002
  • 11:41 jynus@cumin1001: dbctl commit (dc=all): 'Depool es1018, es1015, es1019', diff saved to https://phabricator.wikimedia.org/P11232 and previous config saved to /var/cache/conftool/dbconfig/20200519-114148-jynus.json
  • 11:12 marostegui: Deploy schema change on db2124 (frwiki, jawiki, ruwiki) T238966
  • 10:34 mutante: releases2001 - restarted failed jenkins
  • 10:33 mutante: releases2001 - Failed to restart jenkins.service: The name org.freedesktop.PolicyKit1 was not provided by any .service files
  • 10:32 volans: flushed all Netbox caches (manage.py invalidate all) - T253091
  • 10:29 volans: start Netbox restore - T253091
  • 10:18 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'mathoid' for release 'staging' .
  • 10:13 akosiaris: upgrade etherpad-lite to 1.8.4 on etherpad1002
  • 09:58 hnowlan: roll-restart of eqiad restbase hosts for java security updates
  • 09:58 hnowlan@cumin1001: START - Cookbook sre.cassandra.roll-restart
  • 09:55 jayme@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'mathoid' for release 'production' .
  • 09:55 jayme@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'mathoid' for release 'canary' .
  • 09:55 jayme@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'mathoid' for release 'production' .
  • 09:54 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'mathoid' for release 'staging' .
  • 09:10 godog: eqiad-prod: decom ms-be101[678] - T252008
  • 08:07 XioNoX: Push 596597: BGP: standardize fixed part of IX4/IX6 groups - eqsin
  • 08:04 XioNoX: Push 596597: BGP: standardize fixed part of IX4/IX6 groups - esams
  • 08:01 XioNoX: Push 596597: BGP: standardize fixed part of IX4/IX6 groups - eqiad
  • 07:55 volker-e@deploy1001: Finished deploy [design/style-guide@37c67dd]: Deploy design/style-guide: (duration: 00m 06s)
  • 07:54 volker-e@deploy1001: Started deploy [design/style-guide@37c67dd]: Deploy design/style-guide:
  • 07:52 XioNoX: Push 596597: BGP: standardize fixed part of IX4/IX6 groups - *dfw
  • 07:49 XioNoX: Push 596597: BGP: standardize fixed part of IX4/IX6 groups - ulsfo
  • 07:45 vgutierrez: rolling upgrade to trafficserver 8.0.7-1wm10 with puppet disabled on cp hosts
  • 07:09 jynus: starting es4 & es5 eqiad backups with low concurrency
  • 06:35 elukey@cumin1001: END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0)
  • 06:29 elukey@cumin1001: START - Cookbook sre.zookeeper.roll-restart-zookeeper
  • 06:24 elukey@cumin1001: END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0)
  • 06:17 elukey@cumin1001: START - Cookbook sre.zookeeper.roll-restart-zookeeper
  • 05:57 volker-e@deploy1001: Finished deploy [design/style-guide@7bfbd2a]: Deploy design/style-guide: (duration: 00m 06s)
  • 05:57 volker-e@deploy1001: Started deploy [design/style-guide@7bfbd2a]: Deploy design/style-guide:
  • 05:03 marostegui@cumin1001: dbctl commit (dc=all): 'Set s2 and s8 as read-only=off for maintenance T251981', diff saved to https://phabricator.wikimedia.org/P11227 and previous config saved to /var/cache/conftool/dbconfig/20200519-050346-marostegui.json
  • 05:00 marostegui@cumin1001: dbctl commit (dc=all): 'Set s2 and s8 as read-only for maintenance T251981', diff saved to https://phabricator.wikimedia.org/P11226 and previous config saved to /var/cache/conftool/dbconfig/20200519-050043-marostegui.json
  • 04:27 marostegui: Repool labsdb1011 T249188
  • 03:29 volker-e@deploy1001: Finished deploy [design/style-guide@4b4bc51]: Deploy design/style-guide: (duration: 00m 07s)
  • 03:28 volker-e@deploy1001: Started deploy [design/style-guide@4b4bc51]: Deploy design/style-guide:

2020-05-18

  • 23:50 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 23:47 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 23:25 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 23:23 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 23:12 ryankemper: Restarted `wdqs-updater` across all wdqs nodes and restarted `wdqs-categories` across all nodes except 1010 (test wdqs server) and 1009 (automated deployment server)
  • 22:55 Krinkle: Clear module_deps on dewiki (group2, old mw version, s5) to monitor regeneration
  • 22:48 Krinkle: Clear module_deps on group0 (mostly s3) to monitor regeneration
  • 22:35 Krinkle: Clear module_deps on commonswiki (group1, s4) to monitor regeneration
  • 22:33 ryankemper@deploy1001: Finished deploy [wdqs/wdqs@4886dc3]: 0.3.32 (duration: 17m 12s)
  • 22:19 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 22:18 Krinkle: Clear module_deps on s2 wikis to monitor regeneration
  • 22:16 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 22:15 ryankemper@deploy1001: Started deploy [wdqs/wdqs@4886dc3]: 0.3.32
  • 22:02 Krinkle: Clear module_deps on hewiki (group1, s7) to monitor regeneration, ref T247028
  • 21:40 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 21:37 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 21:23 krinkle@deploy1001: Synchronized php-1.35.0-wmf.32/includes/resourceloader/dependencystore/: I015fa5885, I972a93806006 (duration: 01m 07s)
  • 21:14 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 21:11 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 20:27 bsitzmann@deploy1001: Finished deploy [mobileapps/deploy@12efc14]: Update mobileapps to c960b349 (duration: 03m 31s)
  • 20:24 bsitzmann@deploy1001: Started deploy [mobileapps/deploy@12efc14]: Update mobileapps to c960b349
  • 19:07 herron: performing rolling maintenance on kafka-main to pick up java security updates
  • 19:00 krinkle@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Ic005093778d (duration: 01m 08s)
  • 18:58 krinkle@deploy1001: Synchronized wmf-config/CommonSettings-labs.php: Ic005093778d (duration: 01m 06s)
  • 18:49 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 18:46 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 18:38 volans: upgraded spicerack to 0.0.37-1 on cumin[12]001
  • 18:24 hnowlan@cumin1001: END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0)
  • 18:13 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Fix English Wikipedia wordmark dimensions (T252143) (duration: 01m 06s)
  • 17:14 XioNoX: update domain object for 56.15.185.in-addr.arpa - T247972
  • 17:06 bblack: dns1001 - removing downtimes, back in service - T241770
  • 16:45 bstorm_: updated views on labsdb1011 for the wb_terms changes T251598
  • 16:32 bblack@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 16:30 bblack@cumin1001: START - Cookbook sre.hosts.downtime
  • 16:17 bblack: dns1001 - reimaging for new NIC - T241770
  • 16:10 volans: uploaded spicerack_0.0.37-1_amd64.deb to apt.wikimedia.org stretch-wikimedia
  • 15:52 hnowlan: rolling codfw cassandra for java security updates
  • 15:51 hnowlan@cumin1001: START - Cookbook sre.cassandra.roll-restart
  • 15:30 hnowlan@cumin1001: END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0)
  • 15:22 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0)
  • 15:11 Krinkle: krinkle@mc1021 Pruning the old `echo:seen:` Redis keys that didn't have a ttl yet
  • 14:57 jayme@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'termbox' for release 'production' .
  • 14:56 hnowlan: roll-restart of sessionstore cassandra hosts for java security update
  • 14:55 hnowlan@cumin1001: START - Cookbook sre.cassandra.roll-restart
  • 14:53 jayme@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'termbox' for release 'production' .
  • 14:50 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'termbox' for release 'staging' .
  • 14:50 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'termbox' for release 'test' .
  • 14:35 hnowlan@deploy1001: Finished deploy [changeprop/deploy@16bf19f]: Stop consuming purges topic, purged is now doing this (duration: 01m 22s)
  • 14:34 hnowlan@deploy1001: Started deploy [changeprop/deploy@16bf19f]: Stop consuming purges topic, purged is now doing this
  • 14:33 _joe_: start consuming $dc.resource-purge kafka topic from purged in all of esams T133821
  • 14:29 _joe_: start consuming $dc.resource-purge kafka topic from purged in all of eqiad T133821
  • 14:23 _joe_: start consuming $dc.resource-purge kafka topic from purged in all of eqsin, ulsfo T133821
  • 14:19 _joe_: start consuming $dc.resource-purge kafka topic from purged in all of codfw T133821
  • 14:15 kormat@cumin1001: dbctl commit (dc=all): 'Depool db2073 while replacing it T252985', diff saved to https://phabricator.wikimedia.org/P11216 and previous config saved to /var/cache/conftool/dbconfig/20200518-141505-kormat.json
  • 14:12 bblack: dns1001 - shutting down for T241770
  • 14:09 volans: uploaded spicerack_0.0.36-1_amd64.deb to apt.wikimedia.org stretch-wikimedia
  • 14:07 bblack: authdns - ns[01] static routes on cr[12]-eqiad switching back to authdns1001 (oops, that's not the server we're taking offline today!)
  • 14:06 vgutierrez: upload trafficserver 8.0.7-1wm9 to apt.wm.o (buster)
  • 14:02 elukey@cumin1001: START - Cookbook sre.hadoop.roll-restart-workers
  • 14:00 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0)
  • 13:57 bblack: authdns - ns[01] static routes on cr[12]-eqiad switching from authdns1001 to dns1002 for T241770
  • 13:29 elukey@cumin1001: START - Cookbook sre.hadoop.roll-restart-workers
  • 13:00 hashar@deploy1001: Synchronized php-1.35.0-wmf.32/skins/Vector/includes/VectorTemplate.php: VectorTemplate: SkinTemplateToolboxEnd hook isn't deprecated - T252906 (duration: 01m 07s)
  • 11:52 marostegui: Install 10.1.43-2 on db1122 and db1109 - T251981
  • 11:27 Lucas_WMDE: EU SWAT done
  • 11:25 lucaswerkmeister-wmde@deploy1001: Synchronized php-1.35.0-wmf.32/extensions/Wikibase/: SWAT: Fix core's TitleFactory not being used correctly (T252803) (duration: 01m 12s)
  • 11:20 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Update GrowthExperiments mentor list page for viwiki (duration: 01m 06s)
  • 11:10 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Make the threshold for Chinese WP to prevent publishing 5% more strict (T252786) (duration: 01m 06s)
  • 10:38 jdrewniak@deploy1001: Synchronized portals: Wikimedia Portals Update: Bumping portals to master (597033) (duration: 01m 06s)
  • 10:37 jdrewniak@deploy1001: Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: Bumping portals to master (597033) (duration: 01m 32s)
  • 10:37 elukey: copy prometheus-druid-exporter 0.8-1 from stretch to buster wikimedia
  • 10:20 _joe_: upgrading purged in the remaining datacenters
  • 10:07 elukey: upload druid 0.12.3-1.1 to stretch|buster-wikimedia
  • 10:02 vgutierrez: upload trafficserver 8.0.7-1wm8 to apt.wm.o (buster)
  • 09:53 _joe_: upgrading purged in codfw, ulsfo
  • 09:46 mutante: contint2001 - apt-get remove --purge openjdk-11-* - T224591
  • 09:43 _joe_: upload purged 0.13 to buster-wikimedia
  • 08:44 jayme@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'termbox' for release 'production' .
  • 08:43 jayme@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'termbox' for release 'production' .
  • 08:25 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'termbox' for release 'staging' .
  • 08:25 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'termbox' for release 'test' .
  • 08:13 godog: set weight to 0 for all but objects in ms-be10[678] - T252008
  • 07:57 mutante: replacing apache module with httpd module on deployment servers
  • 07:47 moritzm: installing apt security updates on jessie systems
  • 07:36 marostegui: Remove and add pc2007 from tendril as the Act is frozen after reimage - T250666
  • 07:22 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2088 after upgrade', diff saved to https://phabricator.wikimedia.org/P11214 and previous config saved to /var/cache/conftool/dbconfig/20200518-072234-marostegui.json
  • 07:20 marostegui: Upload MariaDB 10.4.13 to the buster repo - T250666
  • 07:00 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 06:58 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 06:41 marostegui: Stop MySQL on db2088
  • 06:24 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2088 for upgrade', diff saved to https://phabricator.wikimedia.org/P11213 and previous config saved to /var/cache/conftool/dbconfig/20200518-062452-marostegui.json
  • 05:55 _joe_: installing purged 0.12 on cp2027
  • 05:54 _joe_: uploaded purged 0.12 to apt.w.o
  • 05:00 marostegui: Stop MySQL on labsdb1011 to copy its content to backup1001 T249188

2020-05-16

  • 22:04 Krinkle: krinkle@mc1022 Pruning the old `echo:seen:` Redis keys that didn't have a ttl yet
  • 21:56 Krinkle: krinkle@mc1019 Pruning the old `echo:seen:` Redis keys that didn't have a ttl yet
  • 20:23 Krinkle: krinkle@mc1034,mc1035,mc1036 Pruning the old `echo:seen:` Redis keys that didn't have a ttl yet
  • 20:04 Krinkle: krinkle@mc1033 Pruning the old `echo:seen:` Redis keys that didn't have a ttl yet
  • 19:57 Krinkle: krinkle@mc1032 Pruning the old `echo:seen:` Redis keys that didn't have a ttl yet
  • 19:51 Krinkle: krinkle@mc1031 Pruning the old `echo:seen:` Redis keys that didn't have a ttl yet
  • 19:42 Krinkle: krinkle@mc1030 Pruning the old `echo:seen:` Redis keys that didn't have a ttl yet
  • 19:25 Krinkle: krinkle@mc1029 Pruning the old `echo:seen:` Redis keys that didn't have a ttl yet
  • 19:10 Krinkle: krinkle@mc1028 Pruning the old `echo:seen:` Redis keys that didn't have a ttl yet
  • 18:58 Krinkle: krinkle@mc1027 Pruning the old `echo:seen:` Redis keys that didn't have a ttl yet, ref T252945
  • 18:54 Krinkle: krinkle@mc1026 Pruning the old `echo:seen:` Redis keys that didn't have a ttl yet, ref T252945
  • 18:30 Krinkle: krinkle@mc1024 Pruning the old `echo:seen:` Redis keys that didn't have a ttl yet, ref T252945
  • 18:24 Krinkle: krinkle@mc1025 Pruning the old `echo:seen:` Redis keys that didn't have a ttl yet, ref T252945
  • 17:56 Krinkle: krinkle@mc1023 Pruning old echo:seen: Redis keys that didn't use a ttl yet, ref T252945
  • 17:49 Krinkle: krinkle@mwmaint1002: Running cleanupRemovedModules.php to prune old module_deps rows T113916
  • 17:24 Krinkle: krinkle@mc1020 Prune old echo:seen: keys that have ttl:-1 from Redis main stash, ref T252945
  • 15:16 Krinkle: krinkle@mc1020 Looking at why there are still over 2M echo:seen keys in redis main stash
  • 00:55 krinkle@deploy1001: Synchronized wmf-config/logging.php: I046868190b472 (duration: 01m 13s)
  • 00:24 ryankemper@cumin1001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 00:21 ryankemper@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 00:21 ryankemper@cumin1001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 00:18 ryankemper@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 00:18 ryankemper@cumin1001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 00:16 ryankemper@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 00:16 ryankemper@cumin1001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 00:13 ryankemper@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 00:13 ryankemper@cumin1001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 00:10 ryankemper@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 00:10 ryankemper@cumin1001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 00:08 ryankemper@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 00:06 ryankemper@cumin2001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
  • 00:06 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-transfer
  • 00:05 ryankemper@cumin2001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
  • 00:05 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-transfer

2020-05-15

  • 23:50 ryankemper@cumin2001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 23:47 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-transfer
  • 23:46 ryankemper@cumin2001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
  • 23:46 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-transfer
  • 23:46 ryankemper@cumin2001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 23:43 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-transfer
  • 23:43 ryankemper@cumin2001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 23:37 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-transfer
  • 23:35 ryankemper: Pooled wdqs2007 following successful query tests (all data transfers are done now)
  • 22:53 krinkle@deploy1001: Synchronized wmf-config/InitialiseSettings.php: I1b1578a57ef5 (duration: 01m 07s)
  • 22:51 krinkle@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Iaa240eb8cf9 (duration: 01m 06s)
  • 21:41 ryankemper: depooled wdqs2007 while it catches up on lag
  • 21:40 ryankemper@cumin2001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 20:36 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-transfer
  • 20:33 ryankemper: pooled wdqs2003 and wdqs1007 following successful query tests
  • 19:46 krinkle@deploy1001: Synchronized wmf-config/CommonSettings.php: If0fd1b51 (duration: 01m 08s)
  • 18:53 ryankemper@cumin1001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 18:34 ryankemper: depooled wdqs2003 while lag catches up
  • 18:32 ryankemper@cumin2001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 17:55 vgutierrez: upload acme-chief 0.25 to apt.wm.o (buster) - T252881
  • 17:27 XioNoX: renumber cr2-eqord:xe-0/1/1 to xe-0/1/3 - T221259
  • 17:02 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-transfer
  • 17:01 ryankemper@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 17:00 ryankemper: depooled wdqs1007 in preparation for impending wdqs data xfer
  • 16:58 ryankemper@cumin1001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 16:53 ryankemper@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 16:52 ryankemper@cumin1001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 16:49 ryankemper@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 16:02 gehel@cumin2001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 15:57 gehel@cumin2001: START - Cookbook sre.wdqs.data-transfer
  • 15:56 gehel@cumin2001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 15:52 gehel@cumin2001: START - Cookbook sre.wdqs.data-transfer
  • 15:49 gehel@cumin2001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 15:45 gehel@cumin2001: START - Cookbook sre.wdqs.data-transfer
  • 15:44 gehel@cumin2001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 15:40 gehel@cumin2001: START - Cookbook sre.wdqs.data-transfer
  • 15:36 gehel@cumin2001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 15:32 gehel@cumin2001: START - Cookbook sre.wdqs.data-transfer
  • 15:31 gehel@cumin2001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 15:27 gehel@cumin2001: START - Cookbook sre.wdqs.data-transfer
  • 14:19 cdanis: reverting sysctl net.ipv4.udp_mem to original on netflow3001
  • 14:18 cdanis: re-enable puppet on netflow*
  • 14:14 cdanis: disable puppet on netflow*
  • 14:04 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 14:01 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 13:47 ema: cp2029, cp3050: varnish-fe-restart to clear 'child restarted' alerts
  • 13:47 vgutierrez: downgrade ats to version 8.0.7-1wm7 on cp4032
  • 13:42 vgutierrez: upgrade ats to version 8.0.7-1wm8 on cp4032
  • 13:37 mutante: rsyncing gerrit git data from gerrit1001 to gerrit1002 (T200739)
  • 13:13 cdanis: increase samplicator recvbuf on netflow3001 & restart samplicator
  • 13:01 cdanis: increasing sysctl net.ipv4.udp_mem on netflow3001
  • 09:57 vgutierrez: upload trafficserver 8.0.7-1wm7 to apt.wm.o (buster)
  • 09:21 ema: cp2029: attempt forced discard of stuck VCL T236754
  • 09:09 elukey: restart druid brokers on druid100[4-6] - locked up due to datasources dropped - T226035
  • 08:51 ema: cp2029: try out varnish 5.1.3-1wm15 T236754
  • 07:36 XioNoX: bumps prefix limit for AS16735 in eqiad
  • 05:35 jynus: stop replication on pc2009, pc2010 for benchmarking T252761
  • 04:53 volker-e@deploy1001: Finished deploy [design/style-guide@dc956a3]: Deploy design/style-guide: (duration: 00m 10s)
  • 04:52 volker-e@deploy1001: Started deploy [design/style-guide@dc956a3]: Deploy design/style-guide:
  • 04:42 vgutierrez: repool cp5006
  • 04:28 vgutierrez: depool and reboot cp5006

2020-05-14

  • 23:24 catrope@deploy1001: Synchronized static/images/project-logos/: Revert temporary 20k logo for vecwiki (T252770) (duration: 01m 06s)
  • 23:23 RoanKattouw: Ran namespaceDupes.php for T252343
  • 23:20 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Create Gapura (Portal) namespace on jvwiki (T252343) (duration: 01m 06s)
  • 23:09 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Add *.ub.uni-heidelberg.de and hq.eso.org to $wgCopyUploadsDomains (T252600, T252726) (duration: 01m 07s)
  • 21:43 ryankemper: depooled wdqs2006 while lag recovers
  • 21:42 ryankemper@cumin1001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 21:08 ryankemper@cumin2001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 20:16 volans: moved codereview.tar.gz and with_r.tar.gz from miscweb1002 to cumin1001 to free space
  • 20:15 hashar@deploy1001: Synchronized php-1.35.0-wmf.32/skins/Vector/includes/VectorTemplate.php: Allow plain text labels in side bar - T252727 (duration: 01m 06s)
  • 19:51 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-transfer
  • 19:50 ryankemper@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 19:49 ryankemper: Depooled wdqs1006 in preparation for impending wdqs data xfer
  • 18:36 Urbanecm: Morning SWAT done
  • 18:35 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 15adbbc: [thwikisource] Set ProofReadPage separator to an empty string (T252610) (duration: 01m 06s)
  • 18:26 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 4b8399c: Undeploy graphoid from mediawikiwiki (T242855) (duration: 01m 05s)
  • 18:16 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: f03a45c: Adding import to test wikis from mediawikiwiki (T242855) (duration: 01m 07s)
  • 17:03 XioNoX: asw2-d-eqiad> request virtual-chassis vc-port delete pic-slot 1 port 1 member 1 - T252797
  • 16:55 XioNoX: asw2-d-eqiad> request virtual-chassis vc-port delete pic-slot 1 port 3 member 1 - T252797
  • 16:51 XioNoX: asw2-d-eqiad> request virtual-chassis vc-port set pic-slot 0 port 48 member 2 - T252797
  • 16:50 XioNoX: request virtual-chassis vc-port set pic-slot 1 port 2 member 1 - T252797
  • 16:42 XioNoX: request virtual-chassis vc-port delete pic-slot 1 port 2 member 1 - T252797
  • 16:36 XioNoX: asw2-d-eqiad> request virtual-chassis vc-port delete pic-slot 0 port 48 member 2 - T252797
  • 15:59 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 15:57 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 15:56 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 15:25 XioNoX: disable asw2-d1-eqiad:et-1/1/0 - T251663
  • 14:39 mutante: kuai kuai is https://twitter.com/Arlieth/status/1257714333133357056 | https://en.wikipedia.org/wiki/Kuai_Kuai_culture
  • 13:31 _joe_: updating purged to 0.11 in eqiad,eqsin,esams
  • 12:47 vgutierrez: rolling upgrade ats to version 8.0.7-1wm7
  • 12:46 elukey@cumin1001: END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0)
  • 12:43 elukey@cumin1001: START - Cookbook sre.aqs.roll-restart
  • 12:22 kormat: reverted iosched on pc1010 to `mq-deadline` T252761
  • 11:47 kormat: changed iosched on pc1010 to `none` as a test T252761
  • 11:07 matthiasmullie: EU swat done
  • 11:05 mlitn@deploy1001: Synchronized php-1.35.0-wmf.32/extensions/WikibaseMediaInfo/: [MediaInfo] Enable media search for all users by default (duration: 01m 12s)
  • 11:04 vgutierrez: upgrade ats to version 8.0.7-1wm7 on cp3064
  • 10:31 fdans@deploy1001: Finished deploy [analytics/refinery@6f13979]: Regular analytics weekly train (duration: 17m 14s)
  • 10:14 fdans@deploy1001: Started deploy [analytics/refinery@6f13979]: Regular analytics weekly train
  • 09:58 elukey: remove matomo 3.11 from the main component of stretch-wikimedia
  • 09:56 elukey: upgrade matomo on matomo1001 to 3.13.3 (latest upstream) - T252741
  • 09:30 jayme@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'zotero' for release 'production' .
  • 09:29 elukey: upload matomo-3.13.3 to thirdparty/matomo on stretch|buster-wikimedia
  • 09:22 jayme@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'zotero' for release 'production' .
  • 08:57 elukey: imported gpg key 1FD752571FE36FF23F78F91B81E2E78B66FED89E in apt1001 (Matomo public debian repo)
  • 08:56 moritzm: installing Java security updates on Presto
  • 08:43 jayme: updated helm: 2.12.2-1 -> 2.16.7-1 on deploy[1,2]001 and contint1001. 2.12.2-4 -> 2.16.7-1 on contint2001
  • 08:39 jayme: imported helm 2.16.7-1 to main for jessie-wikimedia
  • 08:32 moritzm: installing Java security updates on Hadoop/AQS/Druid
  • 08:20 jayme@deploy2001: helmfile [STAGING] Ran 'sync' command on namespace 'zotero' for release 'staging' .
  • 08:00 vgutierrez: upgrade ats to version 8.0.7-1wm7 on cp5011
  • 07:03 moritzm: installing apt security updates
  • 06:33 ryankemper: Pooled wdqs2005 following successful test queries
  • 04:46 ryankemper@cumin1001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 04:02 ryankemper@cumin2001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 02:59 ryankemper@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 02:59 ryankemper: wdqs1005 has been de-pooled pending wdqs data xfer
  • 02:57 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-transfer
  • 02:57 ryankemper: wdqs1004 was repooled after successful test queries
  • 02:55 ryankemper: wdqs2006 was repooled after successful test queries
  • 01:32 ryankemper: depooled wdqs2006 while waiting for lag to recover
  • 00:54 foks: change password for "Python eggs"
  • 00:37 ryankemper@cumin1001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 00:31 ryankemper@cumin2001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 00:08 twentyafterfour: phabricator update appears to be stable.
  • 00:05 twentyafterfour: updating phabricator. 1 patch + new translations. Expect only brief downtime.

2020-05-13

  • 23:46 cstone: SmashPig revision changed from cd1a49da5f to 2702b04329
  • 23:43 ejegg: updated payments-wiki from dabba1804c to 3c465cb11c
  • 23:36 ejegg: rolled back payments-wiki to dabba1804c
  • 23:29 ejegg: updated payments-wiki from dabba1804c to 3c465cb11c
  • 22:40 ryankemper@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 22:39 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-transfer
  • 22:36 ryankemper: Depooled wdqs1004 for subsequent wdqs data xfer
  • 22:29 ryankemper: Pooled wdqs2005 given that lag has returned to normal levels and the instance is responding to queries correctly
  • 22:26 ryankemper: Pooled wdqs1008 given that lag has returned to normal levels and the instance is responding to queries correctly
  • 21:30 elukey: powercycle analytics1055
  • 21:05 eileen: civicrm revision changed from cfb6101e39 to ed4c9522ac, config revision is 2eb75f8dff
  • 20:16 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: T242430 Stop loading the ParsoidBatchAPI extension (duration: 01m 08s)
  • 19:09 hashar@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.32 (duration: 01m 05s)
  • 19:08 hashar@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.32
  • 18:54 twentyafterfour: restarted php-fpm on phab1001
  • 18:53 thcipriani: restarting gerrit
  • 18:52 twentyafterfour: restarting apache on phab1001 for lack of a better idea
  • 18:50 herron: restarted kafka broker on kafka-main1001 for java security updates
  • 18:22 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 38db3e0: Update production wordmarks (T252143) (duration: 01m 07s)
  • 18:17 urbanecm@deploy1001: Synchronized static/images/mobile/copyright/: SWAT: 38db3e0: Update production wordmarks (T252143) (duration: 01m 09s)
  • 17:55 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 17:53 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 17:52 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 17:51 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 17:24 ryankemper: Manually depooled wdqs2005 while lag catches up following the data xfer
  • 17:21 ryankemper@cumin1001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 17:18 ryankemper@cumin2001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 17:12 urandom: restarted cassandra-c, restbase2017
  • 17:04 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 16:57 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 16:54 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 16:11 James_F: Running AbuseFilter updateVarDumps on group0 on mwmaint1002 T246539
  • 16:00 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 15:58 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 15:41 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 15:38 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 15:34 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 15:32 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 15:32 vgutierrez: upgrade ats to version 8.0.7-1wm7 on cp4032
  • 15:30 ryankemper@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 15:30 jayme: imported scap 3.14.0-1 to main for buster-wikimedia
  • 15:30 jayme: imported scap 3.14.0-1 to main for jessie-wikimedia
  • 15:29 ryankemper: Manually de-pooling `wdqs1008.eqiad.wmnet` in preparation for wdqs data transfer
  • 15:29 jayme: imported scap 3.14.0-1 to main for stretch-wikimedia
  • 15:26 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-transfer
  • 15:23 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
  • 15:08 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 15:06 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 14:55 _joe_: upgrading + restarting purged across ulsfo and codfw T133821
  • 14:50 filippo@deploy1001: Finished deploy [librenms/librenms@0a88d64]: Upgrade LibreNMS to 1.63 T251222 (duration: 00m 10s)
  • 14:50 filippo@deploy1001: Started deploy [librenms/librenms@0a88d64]: Upgrade LibreNMS to 1.63 T251222
  • 14:35 vgutierrez: upload trafficserver 8.0.7-1wm6 to apt.wm.o (buster) - T249335 T251537
  • 13:59 mholloway-shell@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
  • 13:57 mholloway-shell@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
  • 13:55 mholloway-shell@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
  • 11:39 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Add *.deutsche-digitale-bibliothek.de to the wgCopyUploadsDomains (T252296) (duration: 01m 06s)
  • 11:17 Amir1: EU SWAT is done
  • 11:14 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Disable wgLegacyJavaScriptGlobals on fawiki and wikidatawiki (T72470) (duration: 01m 06s)
  • 11:09 jayme@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'mathoid' for release 'production' .
  • 11:06 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: Anchor RegExp for Data Bridge in Beta (BETA-ONLY) (duration: 01m 06s)
  • 11:00 jayme@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'mathoid' for release 'production' .
  • 11:00 jayme@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'mathoid' for release 'canary' .
  • 10:55 volans: imported tqdm 4.11.2-1 packages into buster-wikimedia component/spicerack
  • 10:34 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'mathoid' for release 'staging' .
  • 10:09 kormat@deploy1001: Synchronized wmf-config/db-eqiad.php: Repool pc1007 as pc1 master T252182 (duration: 01m 05s)
  • 09:55 jbond42: deployed a fix to ferm-status script. unmanaged ferm rules may get removed
  • 09:40 kormat@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:38 kormat@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:37 marostegui: Upgrade db2102 to the new 10.4.13 - T250666
  • 09:32 _joe_: installing purged 0.11 on cp2027 T133821
  • 09:21 _joe_: installing purged 0.11 on cp2028 T133821
  • 09:11 moritzm: re-enabling puppet
  • 09:08 mutante: rsyncing /home dirs from people.wikimedia.org to new backend people1002
  • 09:00 moritzm: disabling puppet temporarily
  • 08:53 _joe_: uploaded purged 0.11
  • 08:52 kormat@deploy1001: Synchronized wmf-config/db-eqiad.php: Pool pc1010 as pc1 master T252182 (duration: 01m 17s)
  • 07:42 jayme: imported helm 2.16.7-1 to main for stretch-wikimedia
  • 07:41 jayme: imported helm 2.16.7-1 to main for buster-wikimedia
  • 07:29 godog: roll-restart logstash in codfw/eqiad for configuration change
  • 07:14 elukey: upload spark2_2.4.4-bin-hadoop2.6-2 for buster/stretch on apt1001
  • 05:33 ryankemper: wdqs2004 was depooled ~3 hours ago and was re-pooled ~10 mins ago after verifying the wdqs service was healthy
  • 05:32 ryankemper: wdqs1003 was depooled ~6 hours ago and was re-pooled ~10 mins ago after verifying the wdqs service was healthy
  • 05:27 _joe_: restarting php-fpm on mw1374, children dying with SIGILL
  • 05:11 root@cumin1001: END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0)
  • 05:11 root@cumin1001: Updating IPMI password on 1 hosts - root@cumin1001
  • 05:10 root@cumin1001: START - Cookbook sre.hosts.ipmi-password-reset
  • 05:10 root@cumin1001: END (FAIL) - Cookbook sre.hosts.ipmi-password-reset (exit_code=99)
  • 05:10 root@cumin1001: START - Cookbook sre.hosts.ipmi-password-reset
  • 04:52 kart_: Updated cxserver to 2020-05-11-082207-production (T250004)
  • 04:47 kartik@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 04:44 kartik@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 04:42 kartik@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
  • 02:27 ryankemper@cumin2001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 02:10 ryankemper@cumin1001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 00:43 ryankemper@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 00:33 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-transfer

2020-05-12

  • 23:09 ryankemper@cumin2001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
  • 23:06 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-transfer
  • 20:15 hashar@deploy1001: Synchronized php-1.35.0-wmf.32/includes/revisionlist/RevisionItemBase.php: Fix RevisionItemBase::getId to actually return an int, as intended - T252076 (duration: 01m 06s)
  • 19:55 dpifke@deploy1001: Finished deploy [performance/navtiming@48110b9]: Fixes swapped dc/host labels - T238086 (duration: 00m 05s)
  • 19:55 dpifke@deploy1001: Started deploy [performance/navtiming@48110b9]: Fixes swapped dc/host labels - T238086
  • 19:05 hashar@deploy1001: rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.32
  • 18:41 legoktm: started codereview-archiver script in screen on mwmaint1002
  • 18:23 otto@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 18:23 otto@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 18:17 otto@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 18:17 otto@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 18:14 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 18:14 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 17:49 bblack: 'gdnsdctl replace' on all authdns to load new maxmind data
  • 17:43 bblack: updating maxmind database on puppetmasters (usually automated weekly; we're mid-cycle)
  • 17:10 James_F: Running AbuseFilter updateVarDumps on testwikis on mwmaint1002 T246539
  • 16:55 James_F: Running AbuseFilter updateVarDumps on closed wikis on mwmaint1002 T246539
  • 16:55 mstyles@deploy1001: Finished deploy [wdqs/wdqs@f617307]: v0.3.31 (duration: 14m 53s)
  • 16:40 mstyles@deploy1001: Started deploy [wdqs/wdqs@f617307]: v0.3.31
  • 16:35 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 15:48 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:48 andrew@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:48 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:48 andrew@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:34 filippo@cumin1001: conftool action : set/pooled=true; selector: dnsdisc=thanos-query
  • 15:15 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 15:15 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 15:14 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 15:13 sukhe@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:13 sukhe@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:12 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 15:12 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:09 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:09 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 15:07 kormat@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:05 kormat@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:04 moritzm: installing 4.19.118 Linux updates on Buster nodes (reboots happening later)
  • 15:02 moritzm: upgrading contint2001 to openjdk-8 u252
  • 15:01 godog: bounce pybal on lvs2010 and lvs2009 - T252186
  • 14:40 moritzm: imported openjdk-8 u252 forward port for buster-wikimedia component/jdk8
  • 14:40 ema: rolling thumbor upgrade to 2.8-1+deb10u1 T252509 T219569 T236240
  • 14:39 andrewbogott: rebuilding cloudcontrol1003 and 1004
  • 14:38 hashar: 1.35.0-wmf.32 is on test wikis. Will be pushed to group0 later today during the American window (19:00 - 21:00 UTC) # T249964
  • 14:34 ema: thumbor2001: repool
  • 14:33 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: EventLogging to EventGate: - Test everywhere, SearchSatisfaction on testwiki only - T249261 (duration: 01m 06s)
  • 14:33 ema: thumbor2001: upgrade python-thumbor-wikimedia to 2.8-1+deb10u1 T252509 T219569 T236240
  • 14:23 moritzm: installing Java security updates on WDQS hosts
  • 14:20 hashar@deploy1001: Finished scap: testwikis wikis to 1.35.0-wmf.32 (duration: 72m 04s)
  • 14:05 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:05 aborrero@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:05 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:05 aborrero@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:02 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 14:00 ema: thumbor2001: depool due to minor bug in 2.7-1+deb10u1 T252509 T219569 T236240
  • 13:54 ema: thumbor2001: pool thumbor 2.7-1+deb10u1 for prod traffic T252509 T219569 T236240
  • 13:50 ema: thumbor2001: upgrade python-thumbor-wikimedia to 2.7-1+deb10u1 T252509 T219569 T236240
  • 13:42 jbond42: disable puppet on all CP hosts to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/583342
  • 13:36 kormat: reimaging pc2007 to buster T252182
  • 13:36 moritzm: rebooting netflow* hosts for kernel update
  • 13:36 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:36 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 13:33 vgutierrez: rolling upgrade of ATS to version 8.0.7-1wm5 - T249335
  • 13:31 moritzm: rebooting deneb for kernel update
  • 13:30 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:30 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 13:24 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 13:24 dzahn@cumin1001: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
  • 13:24 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 13:08 hashar@deploy1001: Started scap: testwikis wikis to 1.35.0-wmf.32
  • 13:05 hashar@deploy1001: Pruned MediaWiki: 1.35.0-wmf.28 (duration: 23m 47s)
  • 12:37 moritzm: installing iputils update from Buster point release
  • 12:08 hashar: Cutting branch 1.35.0-wmf.32 # T249964
  • 12:08 gehel: restart blazegraph + updater on wdqs2002 - JVM upgrade
  • 11:56 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 11:20 kormat@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:17 kormat@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:55 vgutierrez: upgrade trafficserver to version 8.0.7-1wm5 on cp5011 - T249335
  • 10:54 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
  • 10:53 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 10:53 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
  • 10:52 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 10:43 kormat: reimaging pc2010 to buster T252182
  • 10:30 vgutierrez: upgrade trafficserver to version 8.0.7-1wm5 on cp4032 - T249335
  • 10:30 ema: rolling thumbor upgrade to 2.6-1+deb10u1 T226707
  • 10:19 ema: repool thumbor2001 with upgraded python-thumbor-wikimedia
  • 10:13 ema: thumbor2001: upgrade python-thumbor-wikimedia to 2.6-1+deb10u1
  • 10:04 godog: update compiler facts
  • 09:35 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:35 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 09:34 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99)
  • 09:34 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 09:31 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99)
  • 09:31 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 09:31 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99)
  • 09:31 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 09:29 filippo@cumin1001: conftool action : set/pooled=yes:weight=100; selector: cluster=thanos
  • 09:07 moritzm: rebooting contint2001 for kernel update
  • 09:07 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:06 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 07:46 godog: reboot thanos hosts for kernel upgrade
  • 07:41 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 07:41 filippo@cumin1001: START - Cookbook sre.hosts.downtime
  • 07:29 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 07:29 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 07:12 moritzm: rebooting the IDP hosts, SSO sessions will need to be renewed
  • 07:04 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 07:04 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 06:56 vgutierrez: upload trafficserver 8.0.7-1wm4 to apt.wm.o (buster) - T242767 T249335
  • 05:29 marostegui: Restart docker-report-releng on deneb
  • 05:03 marostegui@cumin1001: dbctl commit (dc=all): 'Set s4 as read-only=off for maintenance T251502', diff saved to https://phabricator.wikimedia.org/P11180 and previous config saved to /var/cache/conftool/dbconfig/20200512-050339-marostegui.json
  • 05:00 marostegui@cumin1001: dbctl commit (dc=all): 'Set s4 as read-only for maintenance T251502', diff saved to https://phabricator.wikimedia.org/P11179 and previous config saved to /var/cache/conftool/dbconfig/20200512-050054-marostegui.json
  • 04:46 marostegui: Stop mysql on labsdb1011 to transfer its content - T249188
  • 02:14 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 02:12 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 01:45 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 01:43 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 01:16 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 01:14 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 00:37 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 00:34 pt1979@cumin2001: START - Cookbook sre.hosts.downtime

2020-05-11

  • 21:00 ayounsi@cumin1001: END (PASS) - Cookbook sre.network.cf (exit_code=0)
  • 21:00 ayounsi@cumin1001: START - Cookbook sre.network.cf
  • 20:19 cdanis@cumin1001: END (PASS) - Cookbook sre.network.cf (exit_code=0)
  • 20:19 cdanis@cumin1001: START - Cookbook sre.network.cf
  • 19:08 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 19:05 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 19:03 Zoranzoki21: T235414 is wrong task number, T235415 is correct
  • 19:02 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Add *.bollywoodhungama.in and *.britishmuseum.org to $wgCopyUploadDomains (T235414, T251882) (duration: 00m 57s)
  • 18:51 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Remove "Create a book" link on enwiki (T241683) (duration: 00m 57s)
  • 18:44 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable modern Vector on officewiki, reveal preference on testwiki (T251285) (duration: 00m 58s)
  • 18:42 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 18:40 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Add tw-photometa.de to $wgCopyUploadsDomains (T252141) (duration: 00m 58s)
  • 18:39 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 18:38 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 18:37 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 18:32 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 18:30 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 18:28 catrope@deploy1001: Synchronized dblists/mobilemainpagelegacy.dblist: Drop mainpage special casing for scowiki and itwiki (T252048, T252065) (duration: 00m 58s)
  • 18:27 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 18:25 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 18:20 cmjohnson@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 18:19 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 18:16 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 18:15 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 18:15 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 18:14 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 18:11 jforrester@deploy1001: Synchronized php-1.35.0-wmf.31/includes/Revision/RevisionStore.php: T252156 T212428 RevisionStore: fall back to master db if main slot is missing (duration: 00m 58s)
  • 18:11 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 18:08 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:51 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 17:49 elukey@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:30 jforrester@deploy1001: Synchronized php-1.35.0-wmf.31/extensions/AbuseFilter/maintenance/updateVarDumps.php: updateVarDumps: wait for replication after each batch (duration: 00m 58s)
  • 17:27 jforrester@deploy1001: Synchronized php-1.35.0-wmf.30/skins/Vector/includes/VectorTemplate.php: T251521 Correctly populate the language variants drop-down rather than breaking early (duration: 00m 59s)
  • 17:24 jforrester@deploy1001: Synchronized php-1.35.0-wmf.31/skins/Vector/includes/VectorTemplate.php: T251521 Correctly populate the language variants drop-down rather than breaking early (duration: 00m 59s)
  • 17:16 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 17:14 elukey@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:04 brennen@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.31
  • 16:47 brennen@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.31 (duration: 04m 43s)
  • 16:42 mholloway-shell@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
  • 16:42 brennen@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.31
  • 16:40 mholloway-shell@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
  • 16:34 brennen@deploy1001: rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.31
  • 16:17 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
  • 16:13 brennen@deploy1001: rebuilt and synchronized wikiversions files: mediawikiwiki to 1.35.0-wmf.31 (T249963) for testing T252179
  • 16:10 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
  • 16:06 brennen@deploy1001: Synchronized php-1.35.0-wmf.31/extensions/WikimediaMaintenance: Revert "Remove use of WikiPage::doEditContent" (duration: 01m 06s)
  • 16:05 brennen@deploy1001: Synchronized php-1.35.0-wmf.31/extensions/UploadWizard: Revert "Remove use of WikiPage::doEditContent" (duration: 01m 06s)
  • 16:04 hnowlan@deploy1001: Finished deploy [changeprop/deploy@82276cb]: Enabling consumption of purges topic (duration: 01m 58s)
  • 16:04 brennen@deploy1001: Synchronized php-1.35.0-wmf.31/extensions/Babel: Revert "Remove use of WikiPage::doEditContent" (duration: 01m 07s)
  • 16:03 brennen@deploy1001: Synchronized php-1.35.0-wmf.31/extensions/Translate: Revert "Remove uses of WikiPage::doEditContent" (duration: 01m 08s)
  • 16:02 hnowlan@deploy1001: Started deploy [changeprop/deploy@82276cb]: Enabling consumption of purges topic
  • 15:56 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:56 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 15:54 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 15:52 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 15:52 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 15:49 cdanis@cumin1001: conftool action : set/ttl=300; selector: dnsdisc=eventgate-analytics.*
  • 15:45 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 15:42 brennen: syncing backports to 1.35.0-wmf.31 (T249963) for T252179
  • 15:42 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 15:01 moritzm: installing puma security updates
  • 14:29 mholloway-shell@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
  • 13:44 vgutierrez: upgrade ATS to 8.0.7-1wm4 in cp4032 - T249335
  • 13:36 hashar: Rolling back CI system switch to previous known state # T224591
  • 13:20 marostegui: Upgrade mysql package on s4 master in preparation for tomorrow's maintenance T251502
  • 12:50 hashar: Pointing CI Jenkins to contint2001 Gearman server T224591
  • 12:46 mutante: contint2001 - chown -R jenkins-slave:jenkins-slave /srv/.git
  • 12:45 mutante: contint1001 - rsync -avz --delete /srv/.git/ rsync://contint2001.wikimedia.org/ci--srv/.git/
  • 12:43 mutante: contint1001 - rsync -avz --delete /srv/.git/ rsync://contint2001.wikimedia.org/ci--srv-/org/.git/
  • 12:40 mutante: contint1001 - rsync -avz --delete /srv/org/wikimedia/integration/ rsync://contint2001.wikimedia.org/ci--srv-/org/wikimedia/integration/
  • 12:24 mutante: contint2001 - find /var/lib/jenkins/ -group bacula -exec chown jenkins:jenkins {} \;
  • 12:21 mutante: contint2001 - find /var/lib/jenkins/ -user statsite -exec chown jenkins {} \;
  • 12:19 mutante: contint2001 - chown -R jenkins:jenkins /srv/jenkins/*
  • 12:19 mutante: contint1001 - rsync -avz --delete /srv/jenkins/ rsync://contint2001.wikimedia.org/ci--srv-/jenkins/
  • 12:17 mutante: contint1001 - rsync -avz --delete /var/lib/jenkins/ rsync://contint2001.wikimedia.org/ci--var-lib-jenkins-
  • 12:14 hashar: shutting down Zuul and Jenkins for system switch # T224591
  • 12:02 jynus@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 11:59 jynus@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:47 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:45 cmjohnson@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 11:45 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:43 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:43 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:42 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:32 Lucas_WMDE: EU SWAT done
  • 11:30 lucaswerkmeister-wmde@deploy1001: Synchronized php-1.35.0-wmf.30/extensions/WikimediaEvents/: SWAT: Update Banner Interaction Schema (T250791, wmf.30) (duration: 01m 08s)
  • 11:23 lucaswerkmeister-wmde@deploy1001: Synchronized php-1.35.0-wmf.31/extensions/WikimediaEvents/: SWAT: Update Banner Interaction Schema (T250791, wmf.31) (duration: 01m 07s)
  • 11:14 kartik@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 595478|Revert limit adjustment for Chinese translation with ContentTranslation (T252371) (duration: 01m 09s)
  • 10:58 jdrewniak@deploy1001: Synchronized portals: Wikimedia Portals Update: Bumping portals to master (595498) (duration: 01m 06s)
  • 10:56 jdrewniak@deploy1001: Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: Bumping portals to master (595498) (duration: 01m 07s)
  • 10:15 vgutierrez: upload trafficserver 8.0.7-1wm3 to apt.wm.o (buster) - T242767 T249335
  • 09:44 mutante: contint2001 - find /var/lib/jenkins -user statsite -exec chown jenkins:jenkins {} \;
  • 09:31 hashar: contint2001 started zuul-merger again (had permission issues in /var/lib/zuul )
  • 09:07 mutante: contint1001 - rsync -avpz --delete /srv/jenkins/ rsync://contint2001.wikimedia.org/ci--srv-/jenkins/ (T224591)
  • 09:05 mutante: contint2001 - mkdir /srv/jenkins
  • 08:55 hashar: contint2001 stopping zuul-merger , permission problem
  • 08:46 godog: bounce ferm on kubernetes1007 to resolve icinga UNKNOWN
  • 08:40 mutante: rsyncing /var/lib/jenkins from contint1001 to contint2001 with --delete
  • 08:32 mutante: rsynced data from contint1001 to contint2001 - paths per T224591#6039192 for the migration later today
  • 08:30 ema: cp3050: upgrade atskafka to 0.6 T237993
  • 08:30 _joe_: removing the iptables DROP rule on mc1020 T251378
  • 07:54 moritzm: installing squid security updates
  • 07:21 moritzm: updated buster netboot images to 10.4 (updated to latest point release)
  • 07:09 _joe_: dropping requests to mc1020 via a firewall rule T251378
  • 06:04 elukey: restart wikimedia-discovery-golden on stat1007 - apparently killed because no memory was left to allocate on the system

2020-05-10

  • 12:18 marostegui: Start event scheduler on db1115 after a massive delete - T252324
  • 11:05 marostegui: Stop event scheduler on db1115 to perform a massive delete - T252324
  • 10:27 dcausse: restarting blazegraph on wdqs1004: T242453
  • 09:56 marostegui: Change scaling_governor from powersave to performance on db1115 - T252324
  • 09:25 marostegui: Stop MySQL and restart db1115 - T252324
  • 08:50 marostegui: Restart mysql on db1115 to change buffer pool size from 20GB to 40GB T252324
  • 08:44 elukey: Power cycle analytics1052 after eno1 issue
  • 08:01 marostegui: Disable unused events like %_schema T252324 T231185
  • 07:11 marostegui: Restart mysql on db1115 T231185
  • 07:11 marostegui: Truncate tendril.processlist_query_log T231185

2020-05-08

  • 21:45 bstorm_: cleaned up wb_terms_no_longer_updated view for testwikidatawiki and testcommonswiki on labsdb1010 T251598
  • 21:45 bstorm_: cleaned up wb_terms_no_longer_updated view on labsdb1012 T251598
  • 21:33 bstorm_: cleaning up wb_terms_no_longer_updated view on labsdb1009 T251598
  • 21:06 ottomata: running preferred replica election for kafka-jumbo to get preferred leaders back after reboot of broker earlier today - T252203
  • 19:16 jhuneidi@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
  • 19:12 jhuneidi@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
  • 19:07 jhuneidi@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
  • 18:12 andrewbogott: reprepro copy buster-wikimedia stretch-wikimedia prometheus-openstack-exporter for T252121
  • 17:59 marostegui: Extend /srv by 500G on labsdb1011 T249188
  • 16:55 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 16:53 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 16:51 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 16:48 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 16:39 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 16:37 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 16:14 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 16:12 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 15:43 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 15:41 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 15:36 ottomata: starting kafka broker on kafka-jumbo1006, same issue on other brokers when they are leaders of offending partitions - T252203
  • 15:31 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 15:28 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 15:27 ottomata: stopping kafka broker on kafka-jumbo1006 to investigate camus import failures - T252203
  • 14:50 otto@deploy1001: Finished deploy [analytics/refinery@4a2c530]: fix for camus wrapper, deploy to an-launcher1001 only (duration: 00m 03s)
  • 14:50 otto@deploy1001: Started deploy [analytics/refinery@4a2c530]: fix for camus wrapper, deploy to an-launcher1001 only
  • 14:05 akosiaris: T243106 undo experiment with DROP iptable rules this time around. Use mw1331, mw1348
  • 13:22 vgutierrez: rolling restart of ats-tls on eqiad, codfw, ulsfo and eqsin - T249335
  • 13:20 akosiaris: T243106 redo experiment with DROP iptable rules this time around. Use mw1331, mw1348
  • 13:16 akosiaris: T243106 undo experiment with REJECT, DROP iptable rules now that we have envoy in the middle. Use mw1331, mw1348. Experiment done successfully, no issues to the infrastructure.
  • 12:49 akosiaris: T243106 redo experiment with REJECT, DROP iptable rules now that we have envoy in the middle. Use mw1331, mw1348
  • 12:49 akosiaris: T243106 redo experiment with REJECT, DROP iptable rules now that we have envoy in the middle
  • 11:49 hnowlan: restarting cassandra on restbase2009 for java updates
  • 11:28 cmjohnson@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 11:25 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:08 akosiaris: repool eqiad eventgate-analytics. Test concluded
  • 11:08 akosiaris@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=eventgate-analytics
  • 09:54 mutante: disabling puppet on puppetmasters temporarily to switch them carefully to use httpd module and not apache module which we want to get rid of
  • 09:52 akosiaris: depool eqiad eventgate-analytics for a test involving reinitializing the eqiad kubernetes cluster
  • 09:52 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventgate-analytics
  • 09:51 akosiaris@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=eventgate-analytics
  • 09:45 oblivian@puppetmaster1001: conftool action : set/ttl=10; selector: dnsdisc=eventgate-analytics.*
  • 08:20 vgutierrez: rolling restart of ats-tls on esams - T249335
  • 07:19 vgutierrez: ats-tls restart on cp3050 and cp3052 (max_connections_active_in experiment) - T249335
  • 07:07 mutante: phabricator rmdir /var/run/phd/pid - empty and now unused
  • 07:01 moritzm: installing php5 security updates
  • 05:27 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 05:24 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 05:10 marostegui: Upgrade pc1010
  • 00:30 brennen@deploy1001: rebuilt and synchronized wikiversions files: Revert all wikis except test to 1.35.0-wmf.30 for T252179
  • 00:19 brennen: rolling 1.35.0-wmf.31 train back to group0 for T252179

2020-05-07

  • 22:36 brennen@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.31
  • 22:31 brennen@deploy1001: Synchronized php-1.35.0-wmf.31/extensions/Scribunto/includes/engines/LuaCommon/TitleLibrary.php: Handle RevisionAccessException with try-catch (T252156) (duration: 01m 08s)
  • 20:40 ryankemper@cumin2001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
  • 20:37 ryankemper@cumin2001: START - Cookbook sre.wdqs.data-transfer
  • 20:10 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: wgEventLoggingStreamNames: set initial stream names, as yet unused - T238230 (duration: 01m 07s)
  • 19:12 brennen@deploy1001: rebuilt and synchronized wikiversions files: Revert group2 wikis to 1.35.0-wmf.30
  • 19:09 brennen: rolling 1.35.0-wmf.31 back to group1
  • 19:09 XioNoX: Upgrade Routinator 3000 to 0.7.0 on rpki1001 - T252010
  • 19:05 brennen@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.31
  • 18:25 ppchelko@deploy1001: Finished deploy [changeprop/deploy@383fba5]: Enable both purging types T252142 (duration: 01m 17s)
  • 18:23 ppchelko@deploy1001: Started deploy [changeprop/deploy@383fba5]: Enable both purging types T252142
  • 18:15 Urbanecm: Morning SWAT done
  • 18:14 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 899c175: Update project icons to refreshed SVGs (T249047; part 2/2) (duration: 01m 06s)
  • 18:13 urbanecm@deploy1001: Synchronized static/images/mobile/copyright/: SWAT: 899c175: Update project icons to refreshed SVGs (T249047; part 1/2) (duration: 01m 08s)
  • 18:08 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 54bd2f1: Add the investigate right to the checkuser group on testwiki (T251932) (duration: 01m 08s)
  • 17:50 bsitzmann@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
  • 17:46 bsitzmann@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
  • 17:44 bsitzmann@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
  • 17:44 otto@deploy1001: Finished deploy [analytics/refinery@4a2c530]: (no justification provided) (duration: 05m 31s)
  • 17:38 otto@deploy1001: Started deploy [analytics/refinery@4a2c530]: (no justification provided)
  • 17:18 ejegg: updated payments-wiki from afb84cc391 to dabba1804c
  • 16:46 hnowlan@deploy1001: Finished deploy [changeprop/deploy@cd1386e]: Rollback varnish consumption (duration: 01m 05s)
  • 16:45 hnowlan@deploy1001: Started deploy [changeprop/deploy@cd1386e]: Rollback varnish consumption
  • 16:42 mvolz@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'citoid' for release 'production' .
  • 16:36 mvolz@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'citoid' for release 'production' .
  • 16:32 cmjohnson@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 16:30 cmjohnson@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 16:29 mvolz@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'citoid' for release 'staging' .
  • 16:27 cmjohnson@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 16:27 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 16:26 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 16:26 hnowlan@deploy1001: Finished deploy [changeprop/deploy@cd1386e]: Enabling consumption of purges topic (duration: 01m 45s)
  • 16:25 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 16:24 hnowlan@deploy1001: Started deploy [changeprop/deploy@cd1386e]: Enabling consumption of purges topic
  • 16:23 hnowlan@deploy1001: Finished deploy [changeprop/deploy@6c65779]: Enabling consumption of purges topic (duration: 00m 24s)
  • 16:23 hnowlan@deploy1001: Started deploy [changeprop/deploy@6c65779]: Enabling consumption of purges topic
  • 15:59 mvolz@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'citoid' for release 'production' .
  • 15:51 mvolz@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'citoid' for release 'production' .
  • 15:36 jforrester@deploy1001: Synchronized php-1.35.0-wmf.31/extensions/Collection/includes/Specials/SpecialCollection.php: T251460 Set skin on BaseTemplates if you are using getSkin (duration: 01m 08s)
  • 15:28 mvolz@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'citoid' for release 'staging' .
  • 15:27 vgutierrez: rolling restart of ats-tls on text@esams - T249335
  • 15:26 jayme@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
  • 15:19 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:17 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:15 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:14 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:12 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 15:09 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 15:03 jayme@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
  • 14:59 moritzm: imported component/facter3 for stretch-wikimedia into "main"
  • 14:57 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:55 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:54 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:51 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:50 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:50 moritzm: imported component/puppet5 for stretch-wikimedia into "main"
  • 14:49 jayme@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
  • 14:47 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:42 cmjohnson@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 14:41 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:40 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 14:39 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:38 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:37 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:36 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:30 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 14:17 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 14:07 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 14:06 moritzm: imported component/facter3 for jessie-wikimedia into "main"
  • 13:33 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:30 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:30 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 13:28 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 13:21 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:19 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 13:12 hashar@deploy1001: rebuilt and synchronized wikiversions files: (no justification provided)
  • 13:04 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:04 jynus: disabling puppet on all db hosts to control deployment of new paging alert T172489
  • 13:02 zpapierski@deploy1001: Finished deploy [wdqs/wdqs@94906d0]: Deploy WDQS 0.3.28 + GUI - new servers (duration: 02m 43s)
  • 13:01 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:59 zpapierski@deploy1001: Started deploy [wdqs/wdqs@94906d0]: Deploy WDQS 0.3.28 + GUI - new servers
  • 12:50 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:48 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:43 zpapierski@deploy1001: Finished deploy [wdqs/wdqs@94906d0]: Deploy WDQS 0.3.28 + GUI (duration: 16m 20s)
  • 12:27 zpapierski@deploy1001: Started deploy [wdqs/wdqs@94906d0]: Deploy WDQS 0.3.28 + GUI
  • 12:13 addshore@deploy1001: Synchronized php-1.35.0-wmf.31/extensions/Wikibase: gerrit:594920 T252079 Revert "Move prefetching-term-lookup-callback service wiring" (duration: 01m 12s)
  • 12:12 cmjohnson@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 12:10 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:55 cmjohnson@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 11:53 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:33 moritzm: imported component/puppet5 for jessie-wikimedia into "main"
  • 11:31 jbond42: enable ferm-status script https://gerrit.wikimedia.org/r/c/operations/puppet/+/576102
  • 11:10 matthiasmullie: EU swat done
  • 11:07 mlitn@deploy1001: Synchronized php-1.35.0-wmf.31/extensions/WikibaseMediaInfo/: [MediaInfo] Add dummy concept chips without thumbnail (duration: 01m 09s)
  • 10:07 moritzm: installing Java security updates on restbase/sessionstore
  • 09:11 elukey: roll restart cassandra on aqs1005 to pick up new openjdk upgrades (canary)
  • 08:32 moritzm: upgrading restbase-dev to latest OpenJDK security update
  • 08:06 jynus: setting pc2007, pc2009 as read-write
  • 07:44 godog: further decrease weight for ms-be10[678] - T252008
  • 05:49 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 05:45 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 05:33 elukey: restart hadoop yarn nodemanager on analytics1071
  • 05:22 marostegui: Reimage db2078
  • 05:04 marostegui@cumin1001: dbctl commit (dc=all): 'Set s3 and s7 as read-only=off for maintenance T251158', diff saved to https://phabricator.wikimedia.org/P11167 and previous config saved to /var/cache/conftool/dbconfig/20200507-050419-marostegui.json
  • 05:00 marostegui@cumin1001: dbctl commit (dc=all): 'Set s3 and s7 as read-only for maintenance T251158', diff saved to https://phabricator.wikimedia.org/P11166 and previous config saved to /var/cache/conftool/dbconfig/20200507-050046-marostegui.json
  • 02:56 brennen@deploy1001: rebuilt and synchronized wikiversions files: Revert group1 wikis to 1.35.0-wmf.30 for T252079
  • 02:55 brennen: reverting group1 to 1.35.0-wmf.30 for T252079
  • 00:12 ryankemper@cumin1001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)

2020-05-06

  • 23:59 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Disable GrowthExperiments guidance on testwiki (duration: 01m 07s)
  • 23:18 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable password-reset-update on Wikipedias (T245791) (duration: 01m 07s)
  • 22:22 brennen@deploy1001: Synchronized php-1.35.0-wmf.31/includes/revisionlist/RevisionItem.php: RevisionItem: Fix providing timestamp in getRevisionLink (duration: 01m 09s)
  • 21:45 andrewbogott: updating puppet compiler facts
  • 21:07 ryankemper@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 21:05 ryankemper@cumin1001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
  • 21:04 ryankemper@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 20:35 ejegg: updated Fundraising CiviCRM from b15b2cfbb5 to cfb6101e39
  • 19:08 brennen@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.31 (duration: 01m 08s)
  • 19:07 brennen@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.31
  • 19:03 brennen: CORRECTION: 1.35.0-wmf.31 train unblocked (T249963), rolling forward to group1
  • 19:03 brennen: 1.35.0-wmf.31 train unblocked (T249963), rolling forward to group0
  • 18:58 twentyafterfour@deploy1001: Synchronized php-1.35.0-wmf.31/includes/specials/pagers/DeletedContribsPager.php: deploy https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/594778/ fixes UBN T252052 (duration: 01m 09s)
  • 18:54 volans: upgraded spicerack to spicerack_0.0.34-1_amd64.deb on cumin[12]001
  • 18:45 volans: uploaded spicerack_0.0.34-1_amd64.deb to apt.wikimedia.org stretch-wikimedia
  • 18:44 volans@deploy1001: Finished deploy [homer/deploy@8224f0a]: Release v0.2.2 (duration: 00m 18s)
  • 18:44 volans@deploy1001: Started deploy [homer/deploy@8224f0a]: Release v0.2.2
  • 18:28 twentyafterfour@deploy1001: Synchronized php-1.35.0-wmf.31/includes/specials/pagers/DeletedContribsPager.php: sync https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/594768/ fixes T252043 (duration: 01m 08s)
  • 17:34 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 17:31 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 17:12 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 17:06 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 17:05 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 16:21 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 15:41 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 15:27 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 13:36 mutante: puppetmaster - revoking cert for webserver-misc-apps, recreating it with static-codereview.wikimedia.org as additional SAN (T243056)
  • 13:32 hashar: Restarting CI Jenkins
  • 13:27 mutante: puppetmaster - revoking cert for webserver-misc-static, not used anymore, merged into webserver-misc-apps
  • 13:27 moritzm: installing graphicsmagick security updates
  • 13:26 XioNoX: Upgrade Routinator 3000 to 0.7.0 on rpki2001 - T252010
  • 13:25 XioNoX: add routinator 3000 0.7.0 to buster-wikimedia - T252010
  • 13:19 ema: cp: upgrade purged to v0.10
  • 13:08 godog: start swift decom ms-be101[678] - T252008
  • 11:22 kart_: EU SWAT done.
  • 11:13 kartik@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 594668|Enable ContentTranslation in Armenian WP as a default tool (T249229) (duration: 01m 08s)
  • 10:27 ema: cp2027: test purged v0.10
  • 10:20 moritzm: restarting apache on dbmonitor/grafana/miscweb/graphite/netmon to pick up openldap update
  • 10:00 moritzm: installing remaining openldap security updates (client-side libs, tools)
  • 09:52 jbond42: enable remember me feature of CAS
  • 09:39 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1121 and remove db1103:3314 from vslow in s4', diff saved to https://phabricator.wikimedia.org/P11159 and previous config saved to /var/cache/conftool/dbconfig/20200506-093940-marostegui.json
  • 09:12 marostegui: Upgrade package on s3 and s7 master (db1123 and db1086) in preparation for tomorrow's restart - T251158
  • 08:56 jbond42: restarting ps1-a4-eqiad.mgmt.eqiad.wmnet.
  • 08:53 jynus: kill FTWRL on db2101
  • 08:43 oblivian@deploy1001: Synchronized wmf-config/CommonSettings.php: Reverting change on mw1407 T99740 (duration: 01m 16s)
  • 08:02 _joe_: restarted php-fpm with tweaked parameters on mw1407, now briefly pooling for traffic (T99740)
  • 07:38 kormat@cumin1001: dbctl commit (dc=all): 'Set es1023 (es5 master) to 0 weight after reimaging es1024 T250666', diff saved to https://phabricator.wikimedia.org/P11158 and previous config saved to /var/cache/conftool/dbconfig/20200506-073856-kormat.json
  • 07:32 vgutierrez: downgrade to ATS 8.0.7-1wm3 on cp4026, cp4031, cp5006 and cp5011
  • 06:00 elukey: powercycle analytics1060 - host stuck - T251973
  • 05:03 marostegui@cumin1001: dbctl commit (dc=all): 'Pool db1103:3314 in vslow on s4 while db1121 is out T250055', diff saved to https://phabricator.wikimedia.org/P11157 and previous config saved to /var/cache/conftool/dbconfig/20200506-050340-marostegui.json
  • 05:02 marostegui: Deploy schema change on db1121

2020-05-05

  • 23:44 catrope@deploy1001: Synchronized wmf-config/flaggedrevs.php: Restore the reviewer group on fawiki (T249643) (duration: 01m 06s)
  • 23:22 crusnov@deploy1001: Finished deploy [netbox/deploy@03cc2dd]: Netbox upgrade to 2.8.1 (part3) (duration: 00m 11s)
  • 23:22 crusnov@deploy1001: Started deploy [netbox/deploy@03cc2dd]: Netbox upgrade to 2.8.1 (part3)
  • 23:22 crusnov@deploy1001: Finished deploy [netbox/deploy@03cc2dd]: Netbox upgrade to 2.8.1 (part1) (duration: 01m 14s)
  • 23:21 crusnov@deploy1001: Started deploy [netbox/deploy@03cc2dd]: Netbox upgrade to 2.8.1 (part1)
  • 23:21 crusnov@deploy1001: Finished deploy [netbox/deploy@03cc2dd]: Netbox upgrade to 2.8.1 (part1) (duration: 01m 20s)
  • 23:20 crusnov@deploy1001: Started deploy [netbox/deploy@03cc2dd]: Netbox upgrade to 2.8.1 (part1)
  • 22:00 reedy@deploy1001: Synchronized php-1.35.0-wmf.31/includes/parser/CoreParserFunctions.php: T251952 take 2 (duration: 01m 06s)
  • 21:57 reedy@deploy1001: Synchronized php-1.35.0-wmf.31/includes/parser/CoreParserFunctions.php: T251952 (duration: 01m 05s)
  • 21:55 reedy@deploy1001: Synchronized php-1.35.0-wmf.31/includes/specials/SpecialNewpages.php: T251950 (duration: 01m 06s)
  • 20:02 herron: added ryankemper to wmf and ops ldap groups T251572
  • 19:38 mforns@deploy1001: Finished deploy [analytics/refinery@6868fc0] (thin): Regular analytics weekly train THIN [analytics/refinery@ebd624a5e4c88ac6983387d4603971f8a326ee7c] (duration: 00m 08s)
  • 19:38 mforns@deploy1001: Started deploy [analytics/refinery@6868fc0] (thin): Regular analytics weekly train THIN [analytics/refinery@ebd624a5e4c88ac6983387d4603971f8a326ee7c]
  • 19:38 mforns@deploy1001: Finished deploy [analytics/refinery@6868fc0]: Regular analytics weekly train (2nd try) [analytics/refinery@ebd624a5e4c88ac6983387d4603971f8a326ee7c] (duration: 25m 18s)
  • 19:19 brennen@deploy1001: rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.31
  • 19:13 mforns@deploy1001: Started deploy [analytics/refinery@6868fc0]: Regular analytics weekly train (2nd try) [analytics/refinery@ebd624a5e4c88ac6983387d4603971f8a326ee7c]
  • 19:12 brennen@deploy1001: Finished scap: testwikis wikis to 1.35.0-wmf.31 (duration: 97m 23s)
  • 19:02 brennen: train status: 1.35.0-wmf.31: presently pressing enter through scap-cdb-rebuild; at 8% (T249963, T223287)
  • 18:39 cdanis: depool mw2221 for some manual testing
  • 18:35 mforns@deploy1001: Finished deploy [analytics/refinery@ebd624a] (thin): Regular analytics weekly train THIN [analytics/refinery@ebd624a5e4c88ac6983387d4603971f8a326ee7c] (duration: 00m 09s)
  • 18:35 mforns@deploy1001: Started deploy [analytics/refinery@ebd624a] (thin): Regular analytics weekly train THIN [analytics/refinery@ebd624a5e4c88ac6983387d4603971f8a326ee7c]
  • 18:34 mforns@deploy1001: Finished deploy [analytics/refinery@ebd624a]: Regular analytics weekly train [analytics/refinery@ebd624a5e4c88ac6983387d4603971f8a326ee7c] (duration: 18m 54s)
  • 18:15 mforns@deploy1001: Started deploy [analytics/refinery@ebd624a]: Regular analytics weekly train [analytics/refinery@ebd624a5e4c88ac6983387d4603971f8a326ee7c]
  • 17:35 brennen@deploy1001: Started scap: testwikis wikis to 1.35.0-wmf.31
  • 16:48 brennen: 1.35.0-wmf.31 was branched at 4d3fed3 for T249963
  • 16:34 brennen: triggering branch cut for 1.35.0-wmf.31 (T249963) via https://releases-jenkins.wikimedia.org/job/MediaWiki%20Train%20Branch%20Cut/build?delay=0sec
  • 16:18 brennen: notice: planning branch cut for 1.35.0-wmf.31 (T249963) at 16:30 UTC
  • 15:47 cstone: SmashPig revision changed from 8c30ed7fe5 to cd1a49da5f
  • 15:38 kormat@cumin1001: dbctl commit (dc=all): 'Repool es1024 to 100% after reimaging T250666', diff saved to https://phabricator.wikimedia.org/P11153 and previous config saved to /var/cache/conftool/dbconfig/20200505-153843-kormat.json
  • 15:26 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:24 elukey@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:03 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:00 elukey@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:58 hnowlan@deploy1001: Finished deploy [changeprop/deploy@6c65779]: Enabling on_transclusion_update on k8s, disabling on scb (duration: 01m 31s)
  • 14:56 hnowlan@deploy1001: Started deploy [changeprop/deploy@6c65779]: Enabling on_transclusion_update on k8s, disabling on scb
  • 14:45 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 14:43 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 14:32 kormat@cumin1001: dbctl commit (dc=all): 'Repool es1024 to 75% after reimaging T250666', diff saved to https://phabricator.wikimedia.org/P11149 and previous config saved to /var/cache/conftool/dbconfig/20200505-143158-kormat.json
  • 13:46 akosiaris: deploy cxserver chart 0.0.15 to staging, codfw, eqiad. T219921
  • 13:45 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 13:41 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 13:41 hashar: Updated Jenkins job https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler to have it defined in JJB # T97513
  • 13:36 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
  • 13:18 vgutierrez: upgrade ATS to version 8.1 () on cp4026, cp4032, cp5006 and cp5011
  • 13:15 kormat@cumin1001: dbctl commit (dc=all): 'Repool es1024 to 50% after reimaging T250666', diff saved to https://phabricator.wikimedia.org/P11147 and previous config saved to /var/cache/conftool/dbconfig/20200505-131520-kormat.json
  • 12:52 kormat@cumin1001: dbctl commit (dc=all): 'Repool es1024 at 25% after reimaging T250666', diff saved to https://phabricator.wikimedia.org/P11145 and previous config saved to /var/cache/conftool/dbconfig/20200505-125254-kormat.json
  • 12:37 XioNoX: push pfw policy - T251769
  • 12:07 jbond42: updating cas login page
  • 12:07 kormat@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:05 kormat@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:03 moritzm: rolling restart of apache on puppetboard* to pick up OpenLDAP update
  • 11:47 moritzm: rolling restart of apache on kibana hosts
  • 11:41 mutante: LDAP - added eamedia to wmf group (T251358)
  • 11:31 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1087 T248086', diff saved to https://phabricator.wikimedia.org/P11144 and previous config saved to /var/cache/conftool/dbconfig/20200505-113152-marostegui.json
  • 11:31 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1087 T248086', diff saved to https://phabricator.wikimedia.org/P11143 and previous config saved to /var/cache/conftool/dbconfig/20200505-113100-marostegui.json
  • 11:30 marostegui: Drop T248086_wb_terms table on labsdb hosts - T248086
  • 11:26 moritzm: rolling restart of apache/FPM on mw1261-mw1265
  • 11:22 kart_: EU SWAT done.
  • 11:09 kartik@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 592479|Adjust ContentTranslation MT threshold for Chinese WP to 70% (T246383) (duration: 01m 01s)
  • 11:01 moritzm: installing remaining openldap security updates (client-side libs, tools)
  • 11:00 kormat@cumin1001: dbctl commit (dc=all): 'Depool es1024 for reimaging, add es1023 (master) for reading in the meantime T250666', diff saved to https://phabricator.wikimedia.org/P11141 and previous config saved to /var/cache/conftool/dbconfig/20200505-110031-kormat.json
  • 10:45 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1126 T248086', diff saved to https://phabricator.wikimedia.org/P11140 and previous config saved to /var/cache/conftool/dbconfig/20200505-104540-marostegui.json
  • 10:44 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1126 T248086', diff saved to https://phabricator.wikimedia.org/P11139 and previous config saved to /var/cache/conftool/dbconfig/20200505-104441-marostegui.json
  • 10:33 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 10:23 arturo: copy prometheus-rabbitmq-exporter v0.4 from stretch-wikimedia to buster-wikimedia in apt1001 (T251660)
  • 10:18 arturo: copy prometheus-pdns-exporter v0.5.1 from stretch-wikimedia to buster-wikimedia in apt1001 (T251575)
  • 10:16 mutante: temp disabling puppet on all ganeti hosts to carefully deploy change related to rapi cert location
  • 09:37 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 09:36 moritzm: removing boron.eqiad.wmnet
  • 09:36 jmm@cumin2001: START - Cookbook sre.hosts.decommission
  • 09:03 gehel: restarting wdqs updater on all servers
  • 08:53 moritzm: installing Java security updates on releases*
  • 08:44 kormat: reimaging es1024 to buster T250666
  • 08:27 ema: cp2028 and cp2030 (both upload): varnish-fe restart to clear cache and evaluate 'exp' admission policy T144187 T249809
  • 08:26 moritzm: upgrading slapd on serpens/seaborgium
  • 08:19 ema: cp2027 and cp2029 (both text): varnish-fe restart to clear cache and evaluate 'exp' admission policy T144187 T249809
  • 08:08 moritzm: installing Java security updates on notebook/stat hosts
  • 07:54 gehel@deploy1001: Finished deploy [wdqs/wdqs@d37a059]: rollback wdqs to v 0.3.22 (duration: 04m 18s)
  • 07:50 gehel@deploy1001: Started deploy [wdqs/wdqs@d37a059]: rollback wdqs to v 0.3.22
  • 07:36 zpapierski@deploy1001: Started deploy [wdqs/wdqs@d37a059]: fix for the duplicated jars
  • 06:59 addshore: depool wdqs1006 heavy lag
  • 05:23 marostegui@cumin1001: dbctl commit (dc=all): 'Set s5 and s6 as read-only=off for maintenance T251154', diff saved to https://phabricator.wikimedia.org/P11133 and previous config saved to /var/cache/conftool/dbconfig/20200505-052334-marostegui.json
  • 05:21 marostegui@cumin1001: dbctl commit (dc=all): 'Set s5 and s6 as read-only for maintenance T251154', diff saved to https://phabricator.wikimedia.org/P11132 and previous config saved to /var/cache/conftool/dbconfig/20200505-052058-marostegui.json
  • 05:19 marostegui: Start s5 and s6 maintenance - T251154
  • 04:39 marostegui: Restart mysql on tendril host: db1115 - T231769

2020-05-04

  • 23:38 mstyles@deploy1001: Finished deploy [wdqs/wdqs@6518a8d]: v.0.3.26 (duration: 14m 39s)
  • 23:37 reedy@deploy1001: Synchronized wmf-config/CommonSettings.php: Use namespaced EventBus classes (duration: 00m 57s)
  • 23:35 reedy@deploy1001: Synchronized wmf-config/logging.php: Use namespaced EventBus classes (duration: 00m 56s)
  • 23:33 reedy@deploy1001: Synchronized rpc/RunSingleJob.php: Use namespaced EventBus classes (duration: 00m 58s)
  • 23:29 reedy@deploy1001: Synchronized wmf-config/logging.php: Replace AuthManagerStatsdHandler with WikimediaEventsAuthManagerStatsdHandler::class (duration: 00m 57s)
  • 23:23 mstyles@deploy1001: Started deploy [wdqs/wdqs@6518a8d]: v.0.3.26
  • 22:42 sbassett@deploy1001: Synchronized private/PrivateSettings.php: T251835: Restore dc752af (duration: 00m 57s)
  • 22:16 eileen: process-control config revision is 2eb75f8dff
  • 22:06 sbassett@deploy1001: Synchronized private/PrivateSettings.php: Partial mitigation for T250887 (duration: 00m 57s)
  • 21:45 sbassett@deploy1001: Synchronized private/PrivateSettings.php: Revert partial mitigation for T250887 (duration: 00m 57s)
  • 21:41 sbassett@deploy1001: Synchronized private/PrivateSettings.php: Deploy partial mitigation for T250887 (duration: 00m 57s)
  • 18:20 dpifke@deploy1001: Finished deploy [performance/navtiming@239d359]: Deploy navtiming with new/updated Prometheus metrics - T249822, T238086 (duration: 00m 05s)
  • 18:19 dpifke@deploy1001: Started deploy [performance/navtiming@239d359]: Deploy navtiming with new/updated Prometheus metrics - T249822, T238086
  • 18:16 Urbanecm: Morning SWAT done
  • 18:15 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: c04fbdd: Adding upload_by_url user right to all registered users on Commons (T251474) (duration: 00m 57s)
  • 18:11 urbanecm@deploy1001: Synchronized php-1.35.0-wmf.30/extensions/DiscussionTools/includes/DiscussionToolsHooks.php: SWAT: b85fc16: Enable on all ExtraSignaturesNamespaces (T249036) (duration: 01m 00s)
  • 18:07 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 18c1efb: Load DiscussionTools on en.wiki (T249376) (duration: 00m 58s)
  • 17:57 XioNoX: configure singtel interface on cr1-eqsin
  • 17:36 volans: upgraded spicerack on cumin[12]001 to 0.0.33-1
  • 17:02 joal@deploy1001: Finished deploy [analytics/refinery@2252f9a] (thin): Analytics hotfix deploy 2 THIN (sqoop) [2252f9a] (duration: 00m 09s)
  • 17:02 joal@deploy1001: Started deploy [analytics/refinery@2252f9a] (thin): Analytics hotfix deploy 2 THIN (sqoop) [2252f9a]
  • 17:01 joal@deploy1001: Finished deploy [analytics/refinery@2252f9a]: Analytics hotfix deploy 2 (sqoop) [2252f9a] (duration: 16m 45s)
  • 16:44 joal@deploy1001: Started deploy [analytics/refinery@2252f9a]: Analytics hotfix deploy 2 (sqoop) [2252f9a]
  • 16:08 liw@deploy1001: rebuilt and synchronized wikiversions files: group2 wikis to 1.35.0-wmf.30
  • 15:59 liw@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.30 (duration: 01m 05s)
  • 15:58 liw@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.30
  • 15:53 root@cumin1001: END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0)
  • 15:53 root@cumin1001: Updating IPMI password on 1 hosts - root@cumin1001
  • 15:53 root@cumin1001: START - Cookbook sre.hosts.ipmi-password-reset
  • 15:52 root@cumin1001: END (FAIL) - Cookbook sre.hosts.ipmi-password-reset (exit_code=99)
  • 15:52 root@cumin1001: START - Cookbook sre.hosts.ipmi-password-reset
  • 15:47 kormat@cumin1001: dbctl commit (dc=all): 'Repool es2025 after reimaging T250666', diff saved to https://phabricator.wikimedia.org/P11128 and previous config saved to /var/cache/conftool/dbconfig/20200504-154747-kormat.json
  • 15:45 jforrester@deploy1001: Synchronized php-1.35.0-wmf.30/includes/libs/rdbms/database/DatabaseMysqlBase.php: T251457 rdbms: don't treat lock() as a write operation (duration: 01m 04s)
  • 15:43 jforrester@deploy1001: Synchronized php-1.35.0-wmf.30/resources/src/mediawiki.diff.styles/diff.less: T250393 Follow-up I07dd6f7: Fix font size in diff (duration: 01m 05s)
  • 15:34 volans: uploaded spicerack_0.0.33-1_amd64.deb to apt.wikimedia.org stretch-wikimedia
  • 15:26 volans: deploy1001: deleted old .hhvm.hhbc files (/home/*/.hhvm.hhbc) https://phabricator.wikimedia.org/P11127
  • 15:23 volans: deploy1001: deleted old .hhvm.hhbc files moved from tin (/home/*/home-tin/.hhvm.hhbc) https://phabricator.wikimedia.org/P11126
  • 15:12 kormat@cumin1001: dbctl commit (dc=all): 'Repool db1101:3318 fully after reimaging T250666', diff saved to https://phabricator.wikimedia.org/P11125 and previous config saved to /var/cache/conftool/dbconfig/20200504-151243-kormat.json
  • 15:11 ppchelko@deploy1001: Finished deploy [restbase/deploy@74db57e]: Enable greek community wiki, fix analytics endpoints (duration: 14m 36s)
  • 15:05 joal@deploy1001: Finished deploy [analytics/refinery@3396279] (thin): Analytics hotfix deploy (sqoop) THIN [3396279] (duration: 00m 10s)
  • 15:05 joal@deploy1001: Started deploy [analytics/refinery@3396279] (thin): Analytics hotfix deploy (sqoop) THIN [3396279]
  • 15:05 joal@deploy1001: Finished deploy [analytics/refinery@3396279]: Analytics hotfix deploy (sqoop) [3396279] (duration: 15m 07s)
  • 15:05 kormat@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:02 kormat@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:57 ppchelko@deploy1001: Started deploy [restbase/deploy@74db57e]: Enable greek community wiki, fix analytics endpoints
  • 14:50 joal@deploy1001: Started deploy [analytics/refinery@3396279]: Analytics hotfix deploy (sqoop) [3396279]
  • 14:19 kormat@cumin1001: dbctl commit (dc=all): 'Repool db1101:3317 fully and db1101:3318 to 75% after reimaging T250666', diff saved to https://phabricator.wikimedia.org/P11123 and previous config saved to /var/cache/conftool/dbconfig/20200504-141919-kormat.json
  • 14:15 XioNoX: add static nat for fran1001 - T251763
  • 13:50 kormat@cumin1001: dbctl commit (dc=all): 'Depool es2025 for reimaging T250666', diff saved to https://phabricator.wikimedia.org/P11122 and previous config saved to /var/cache/conftool/dbconfig/20200504-135039-kormat.json
  • 13:34 kormat: reimaging es2025 to buster T250666
  • 13:27 kormat@cumin1001: dbctl commit (dc=all): 'Repool db1101:3317 and db1101:3318 some more after reimaging T250666', diff saved to https://phabricator.wikimedia.org/P11121 and previous config saved to /var/cache/conftool/dbconfig/20200504-132744-kormat.json
  • 13:02 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T248664 Stop setting legacy wmgWikibase(Repo/Client)Repositories for TEST wikis (duration: 01m 06s)
  • 12:47 kormat@cumin1001: dbctl commit (dc=all): 'Repool db1101:3317 and db1101:3318 after reimaging T250666', diff saved to https://phabricator.wikimedia.org/P11120 and previous config saved to /var/cache/conftool/dbconfig/20200504-124659-kormat.json
  • 12:10 marostegui: Temporarily enable slow query log on db1099:3311 - T206103
  • 12:09 Amir1: EU SWAT is done
  • 11:53 tgr@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Increase wmgMemoryLimit from 660MB to 666MB (duration: 01m 06s)
  • 11:47 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1099:3311 T206103 after removing tmp_2 index', diff saved to https://phabricator.wikimedia.org/P11119 and previous config saved to /var/cache/conftool/dbconfig/20200504-114727-marostegui.json
  • 11:46 tgr@deploy1001: Synchronized php-1.35.0-wmf.30/extensions/GrowthExperiments/modules/helppanel/ext.growthExperiments.HelpPanel.cta.js: SWAT: Help panel: Check if guidance feature flag is set before loading mobile peek (T251589) (duration: 01m 06s)
  • 11:46 marostegui: Remove index tmp_2 from recentchanges on db1099:3311 T206103
  • 11:45 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1099:3311 T206103 to remove tmp_2 index', diff saved to https://phabricator.wikimedia.org/P11118 and previous config saved to /var/cache/conftool/dbconfig/20200504-114539-marostegui.json
  • 11:43 tgr@deploy1001: Synchronized php-1.35.0-wmf.28/extensions/GrowthExperiments/modules/helppanel/ext.growthExperiments.HelpPanel.cta.js: SWAT: Help panel: Check if guidance feature flag is set before loading mobile peek (T251589) (duration: 01m 10s)
  • 11:38 jbond42: rebooting ps1-a7-codfw.mgmt.eqiad.wmnet.
  • 11:30 jbond42: rebooting ps1-a7-codfw.mgmt.eqiad.wmnet.
  • 11:30 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 4d00236: Enable cross-project search on frwikibooks (T251683) (duration: 01m 05s)
  • 11:25 Urbanecm: Purge https://en.wikipedia.org/static/images/project-logos/elwikiversity*.png (T251050)
  • 11:24 urbanecm@deploy1001: Synchronized static/images/project-logos/: SWAT: 64556ba: Correct typo in Greek Wikiversity logo (T248391) (duration: 01m 06s)
  • 11:20 Urbanecm: Purge https://en.wikipedia.org/static/images/project-logos/jvwiki*.png (T251050)
  • 11:20 urbanecm@deploy1001: Synchronized static/images/project-logos/: SWAT: 3b8c618: Update jvwiki logos (T251050) (duration: 01m 05s)
  • 11:18 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: cc94ea7: Enable VisualEditor for more namespaces on vecwiki (T250419) (duration: 01m 07s)
  • 10:49 arturo: update packages in buster-wikimedia | thirdparty/kubeadm-k8s-1-15 and thirdparty/kubeadm-k8s-1-16 (T250866)
  • 10:44 jdrewniak@deploy1001: Synchronized portals: Wikimedia Portals Update: Bumping portals to master (563985) (duration: 01m 05s)
  • 10:43 jdrewniak@deploy1001: Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: Bumping portals to master (563985) (duration: 01m 29s)
  • 10:39 vgutierrez: rolling upgrade of ATS to version 8.0.7-1wm3
  • 10:36 kormat@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:33 kormat@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:30 arturo: running `aborrero@apt1001:~ $ sudo -i reprepro --delete clearvanished` to cleanup buster-wikimedia|thirdparty/kubeadm-k8s (T250866)
  • 09:46 vgutierrez: upload trafficserver 8.0.7-1wm2 to apt.wm.o (buster)
  • 09:22 kormat: reimaging db1101 to buster T250666
  • 08:50 XioNoX: configure BGP peering with AS132203
  • 08:20 godog: add 50G to prometheus-ops on prometheus100[34]
  • 08:17 marostegui: Deploy schema change on s5 codfw - T251188
  • 07:51 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1101:3317 and db1101:3318 for reimage', diff saved to https://phabricator.wikimedia.org/P11113 and previous config saved to /var/cache/conftool/dbconfig/20200504-075148-marostegui.json
  • 07:31 marostegui: Drop unused flagged* tables from mediawikiwiki - T248298
  • 07:26 moritzm: removed jmorgan from cn=wmf
  • 07:24 marostegui: Install 10.1.43-2 on s5 (db110) and s6 (db1131) masters in preparation for tomorrow's restart - T251154
  • 07:24 moritzm: removed Kerberos principal for lexnasser and jmorgan
  • 07:23 moritzm: removed lexnasser from cn=nda
  • 07:07 elukey: execute ifdown eno1; ifup eno1 on analytics1052 - interface neg speed flapping
  • 06:41 elukey: upload prometheus-druid-exporter 0.8-1 to stretch-wikimedia

2020-05-03

  • 22:52 Krinkle: scap pull mwmaint1002 and mw2001 for noc.wm.o. - https://gerrit.wikimedia.org/r/593929
  • 22:42 Krinkle: scap pull mwmaint1002 and mw2001 for noc.wm.o. - https://gerrit.wikimedia.org/r/591459
  • 21:37 bmansurov@deploy1001: Finished deploy [recommendation-api/deploy@0c68d62]: Update the recommendation API service (duration: 04m 22s)
  • 21:32 bmansurov@deploy1001: Started deploy [recommendation-api/deploy@0c68d62]: Update the recommendation API service

2020-05-02

  • 07:49 oblivian@cumin1001: conftool action : set/pooled=yes; selector: name=mw13(49|5[0-9]|6[0-2])\.eqiad\.wmnet
  • 07:08 XioNoX: asw2-d-eqiad> request virtual-chassis vc-port delete pic-slot 1 port 0 member 1
  • 02:36 volker-e@deploy1001: Finished deploy [design/style-guide@f0d467b]: Deploy design/style-guide: (duration: 00m 07s)
  • 02:36 volker-e@deploy1001: Started deploy [design/style-guide@f0d467b]: Deploy design/style-guide:

2020-05-01

  • 19:56 rzl@cumin1001: conftool action : set/pooled=no; selector: name=mw13(5[6-9]|6[0-2]).eqiad.wmnet
  • 18:57 gehel: restart blazegraph on wdqs1006 - T242453
  • 14:23 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1104 - T232446', diff saved to https://phabricator.wikimedia.org/P11110 and previous config saved to /var/cache/conftool/dbconfig/20200501-142354-marostegui.json
  • 14:18 hknust: holger@mwmaint1002 finished renameInvalidUsernames.php (fail) as part of T219279
  • 14:06 marostegui@cumin1001: dbctl commit (dc=all): 'More traffic to db1104 - T232446', diff saved to https://phabricator.wikimedia.org/P11109 and previous config saved to /var/cache/conftool/dbconfig/20200501-140603-marostegui.json
  • 13:47 marostegui@cumin1001: dbctl commit (dc=all): 'More traffic to db1104 - T232446', diff saved to https://phabricator.wikimedia.org/P11108 and previous config saved to /var/cache/conftool/dbconfig/20200501-134707-marostegui.json
  • 13:28 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly warm up db1104 - T232446', diff saved to https://phabricator.wikimedia.org/P11107 and previous config saved to /var/cache/conftool/dbconfig/20200501-132804-marostegui.json
  • 13:06 hknust: holger@mwmaint1002 Starting renameInvalidUsernames.php as part of T219279
  • 13:01 vgutierrez: rolling restart of ats-tls in text@esams - T249335
  • 12:24 mutante: mw230* - rolling restart of php-fpm - icinga warnings about opcache health in codfw
  • 12:20 mutante: mw2376 - restarting php-fpm - icinga warnings about opcache health in codfw
  • 12:07 mutante: notebook1004 - puppet failed due to removal of jmorgan while one of his processes was still running. "change to absent failed.. user jmorgan currently used by process 29038". killing 29038, running puppet T251560
  • 12:05 mutante: notebook1003 - puppet failed due to removal of jmorgan while one of his processes was still running. "change to absent failed.. user jmorgan currently used by process 3288". killing 3288, running puppet T251560
  • 11:52 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 11:51 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 11:50 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 11:50 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 11:31 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:31 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:31 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:31 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:54 _joe_: depooled all servers in the app pool in rack D1
  • 08:54 oblivian@cumin1001: conftool action : set/pooled=no:weight=30; selector: name=mw13(49|5[0-5])\.eqiad\.wmnet
  • 08:50 oblivian@cumin1001: conftool action : set/weight=10; selector: name=mw13(49|5[0-5])\.eqiad\.wmnet
  • 08:48 _joe_: repooling mw1407 with LCStoreStaticArray, increased opcache, puppet disabled
  • 08:45 _joe_: repooling mw1409
  • 08:39 _joe_: repool mw1352
  • 08:37 _joe_: depooling mw1352
  • 07:44 marostegui: Copy wikireplica dump from labsdb1009 to labsdb1011 - T249188
  • 01:36 bmansurov@deploy1001: Finished deploy [recommendation-api/deploy@5f47cd7]: Update the recommendation API service (duration: 04m 33s)
  • 01:32 bmansurov@deploy1001: Started deploy [recommendation-api/deploy@5f47cd7]: Update the recommendation API service

Archives

See Server Admin Log/Archives.