Nova Resource:Admin/SAL

2021-10-24

  • 00:47 andrewbogott: deploying a change so that openstack clients use tls endpoints: https://gerrit.wikimedia.org/r/c/operations/puppet/+/732738

2021-10-21

  • 10:19 arturo: drop firewall exception on core routers for wiki replicas legacy setup (T293897)
  • 10:12 arturo: drop NAT exception for wiki replicas legacy setup (T293897)

2021-10-20

  • 21:06 andrewbogott: creating cloudinfra-nfs project T293936

2021-10-18

  • 19:21 andrewbogott: also ticked the 'admin' box on wikitech for majavah T292827
  • 18:58 andrewbogott: granting majavah 'admin' role in the 'admin' project and also in the default domain. T292827

2021-10-14

  • 12:28 arturo: [codfw1dev] add DB grants for cloudbackup2002.codfw.wmnet IP address to the cinder DB (T292546)

2021-10-13

  • 10:46 arturo: updating python3-neutron across the fleet (T292936)

2021-10-12

  • 09:06 dcaro: upgrading eqiad cloudnet hosts neutron packages (T292936)
  • 08:57 dcaro: upgrading codfw cloudnet hosts neutron packages (T292936)

2021-10-05

  • 09:39 arturo: [codfw1dev] cleaning up manila stuff from openstack (db, endpoints, tenant, VMs, and such) T291257

2021-09-30

  • 14:50 andrewbogott: sudo cumin "cloud*" "ps -ef | grep nslcd && service nslcd restart" and sudo cumin "lab*" "ps -ef | grep nslcd && service nslcd restart" T292202
  • 14:43 andrewbogott: ran sudo cumin --force --timeout 500 -o json "A:all" "ps -ef | grep nslcd && service nslcd restart" to get nslcd happy again T292202
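
For reference, a minimal sketch of the fan-out pattern in the two entries above: Cumin runs a shell command on every host matching an alias or glob. The host patterns and the nslcd command are verbatim from the log; only the comments are added.

    # Restart nslcd wherever it is running; the grep guard skips hosts without it.
    sudo cumin --force --timeout 500 -o json "A:all" "ps -ef | grep nslcd && service nslcd restart"
    # The same command scoped to hosts named cloud* or lab*:
    sudo cumin "cloud*" "ps -ef | grep nslcd && service nslcd restart"
    sudo cumin "lab*" "ps -ef | grep nslcd && service nslcd restart"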

2021-09-29

  • 09:41 arturo: [codfw1dev] cleanup manila shares definitions for a clean start now that the manila-sharecontroller VM is apparently well configured (T291257)

2021-09-28

  • 16:23 bstorm: downtime for clouddb1020 to reduce re-pages in case this goes badly T291963
  • 16:21 bstorm: powering on clouddb1020 via remote console T291963
  • 15:58 bstorm: depooled clouddb1020 for repair T291961
  • 12:40 dcaro: Merged change on sssd for bullseye cloud hosts (T291585)
  • 11:30 arturo: [codfw1dev] create floating IP 185.15.57.5 for manila-sharecontroller.cloudinfra-codfw1dev.codfw1dev.wmcloud.org (T291257)

2021-09-27

  • 10:07 arturo: cloudcontrol1004 apparently healthy T291446
  • 09:25 arturo: rebooting cloudcontrol1004 for T291446

2021-09-24

  • 13:02 arturo: [codfw1dev] create VM manila-share-controller-01 on cloudinfra-codfw1dev
  • 13:00 arturo: [codfw1dev] rebase labs/private.git on cloudinfra-puppetmaster-01, had merge conflict

2021-09-21

  • 12:13 arturo: [codfw1dev] trying to create a manila service image (T291257)
  • 11:45 arturo: [codfw1dev] created rabbitmq user (T291257)
  • 11:32 arturo: [codfw1dev] populated manila DB & created service endpoints (T291257)
  • 11:06 arturo: [codfw1dev] give manila user admin role @ manila project (T291257)
  • 11:06 arturo: [codfw1dev] created manila project (T291257)
  • 10:57 arturo: [codfw1dev] created manila user @ labtestwikitech (T291257)
  • 10:49 arturo: [codfw1dev] create manila database on cloudcontrol-dev nodes (galera) T291257

2021-09-20

  • 23:08 bstorm: ran `echo check > /sys/block/md0/md/sync_action` on cloudcontrol1004 to check raid
  • 22:48 andrewbogott: stopped puppet & mariadb on cloudcontrol1004; it was flapping
  • 22:44 andrewbogott: sudo touch /tmp/galera.disabled on cloudcontrol1004, the service seems troubled there
  • 21:57 andrewbogott: moving cloudvirt1043 into the 'nfs' aggregate for T291405
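
A short sketch of the software-RAID verify used in the 23:08 entry above; mdraid exposes the check through sysfs and reports progress in /proc/mdstat (array name md0 as in the log).

    # Start a non-destructive read scrub of the array:
    echo check > /sys/block/md0/md/sync_action
    # Watch progress and the mismatch counter:
    cat /proc/mdstat
    cat /sys/block/md0/md/mismatch_cnt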

2021-09-17

  • 11:35 arturo: [codfw1dev] install manila on cloudcontrol2001-dev (T291257)

2021-09-16

  • 15:56 bstorm: removing downtime for labstore1005 so we'll know if it has another issue T290318

2021-09-09

  • 22:03 bstorm: restarted the prometheus-mysqld-exporter@s1 service as it was not working T290630
  • 03:15 bstorm: resetting swap on clouddb1017 T290630
  • 03:08 andrewbogott: stopping maintain-dbusers on labstore1004 for help diagnosing T290630

2021-09-03

  • 15:34 bstorm: rebooting labstore1005 to disconnect the drives from labstore1004 T290318
  • 15:24 bstorm: stopping puppet and disabling backup syncs to labstore1005 on cloudbackup2002 T290318
  • 15:20 bstorm: stopping puppet and disabling backup syncs to labstore1005 on cloudbackup2001 T290318

2021-08-30

  • 16:16 wm-bot: Added 1 new OSDs ['cloudcephosd1018.eqiad.wmnet'] - cookbook ran by andrew@buster
  • 16:16 wm-bot: Added OSD cloudcephosd1018.eqiad.wmnet... (1/1) - cookbook ran by andrew@buster
  • 16:13 wm-bot: Adding OSD cloudcephosd1018.eqiad.wmnet... (1/1) - cookbook ran by andrew@buster
  • 16:13 wm-bot: Adding new OSDs ['cloudcephosd1018.eqiad.wmnet'] to the cluster - cookbook ran by andrew@buster
  • 16:10 wm-bot: Finished rebooting node cloudcephosd1018.eqiad.wmnet - cookbook ran by andrew@buster
  • 16:07 wm-bot: Rebooting node cloudcephosd1018.eqiad.wmnet - cookbook ran by andrew@buster
  • 16:07 wm-bot: Adding OSD cloudcephosd1018.eqiad.wmnet... (1/1) - cookbook ran by andrew@buster
  • 16:07 wm-bot: Adding new OSDs ['cloudcephosd1018.eqiad.wmnet'] to the cluster - cookbook ran by andrew@buster

2021-08-27

  • 18:57 andrewbogott: raising toolsbeta ram/core/instances quotas so majavah can experiment with bullseye

2021-08-25

  • 14:45 wm-bot: Finished rebooting node cloudcephosd1018.eqiad.wmnet - cookbook ran by andrew@buster
  • 14:42 wm-bot: Rebooting node cloudcephosd1018.eqiad.wmnet - cookbook ran by andrew@buster
  • 14:42 wm-bot: Adding OSD cloudcephosd1018.eqiad.wmnet... (1/1) - cookbook ran by andrew@buster
  • 14:42 wm-bot: Adding new OSDs ['cloudcephosd1018.eqiad.wmnet'] to the cluster - cookbook ran by andrew@buster
  • 14:41 wm-bot: Adding new OSDs ['cloudcephosd1018.eqiad.wmnet'] to the cluster - cookbook ran by andrew@buster

2021-08-19

  • 17:39 bstorm: restarting glance image backup to try and clear the page

2021-08-18

  • 16:21 wm-bot: Rebooting node cloudcephosd1018.eqiad.wmnet - cookbook ran by andrew@buster
  • 16:21 wm-bot: Adding OSD cloudcephosd1018.eqiad.wmnet... (1/1) - cookbook ran by andrew@buster
  • 16:21 wm-bot: Adding new OSDs ['cloudcephosd1018.eqiad.wmnet'] to the cluster - cookbook ran by andrew@buster
  • 16:17 wm-bot: Adding new OSDs ['cloudcephosd1018.eqiad.wmnet'] to the cluster - cookbook ran by andrew@buster
  • 16:16 wm-bot: Adding new OSDs ['cloudcephosd1018.eqiad.wmnet'] to the cluster - cookbook ran by andrew@buster
  • 16:15 wm-bot: Adding new OSDs ['cloudcephosd1018.eqiad.wmnet'] to the cluster - cookbook ran by andrew@buster
  • 16:13 wm-bot: Adding new OSDs ['cloudcephosd1018.eqiad.wmnet'] to the cluster - cookbook ran by andrew@buster
  • 14:47 andrewbogott: adding cloudvirt1038 to the ceph aggregate, removing from the maintenance aggregate T276922

2021-08-17

  • 15:11 andrewbogott: rebooting cloudcephosd1008 to force raid rebuild -- T287838

2021-08-11

  • 13:51 wm-bot: Finished rebooting node cloudcephosd1018.eqiad.wmnet - cookbook ran by dcaro@vulcanus
  • 13:48 wm-bot: Rebooting node cloudcephosd1018.eqiad.wmnet - cookbook ran by dcaro@vulcanus
  • 13:47 wm-bot: Adding OSD cloudcephosd1018.eqiad.wmnet... (1/1) (T285858) - cookbook ran by dcaro@vulcanus
  • 13:47 wm-bot: Adding new OSDs ['cloudcephosd1018.eqiad.wmnet'] to the cluster (T285858) - cookbook ran by dcaro@vulcanus

2021-08-10

  • 15:15 andrewbogott: restarting all designate services in eqiad1
  • 15:04 andrewbogott: restarting designate-sink in eqiad1; it's complaining about rabbit but I don't want to restart rabbit yet

2021-08-05

  • 09:37 dcaro: Taking one osd daemon down on the codfw cluster (T288203)

2021-08-04

  • 19:20 bd808: Running deleteBatch.php on cloudweb2001-dev to remove legacy Heira: pages from labtestwiki

2021-08-03

  • 17:40 bstorm: rerunning the glance backup script after failure

2021-07-31

  • 00:10 andrewbogott: "systemctl reset-failed cloud-init.service" on all VMs for T287309
  • 00:08 andrewbogott: "systemctl reset-failed cloud-final.service" on all VMs for T287309
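
The two entries above clear systemd's "failed" unit state so monitoring stops flagging cloud-init; a sketch assuming a Cumin fan-out like the one in the 2021-09-30 entries (how the VMs were actually targeted is not recorded here).

    # Clear the failed state of the cloud-init units on every matched host:
    sudo cumin "A:all" "systemctl reset-failed cloud-final.service"
    sudo cumin "A:all" "systemctl reset-failed cloud-init.service"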

2021-07-27

  • 21:32 andrewbogott: putting cloudvirt1012 back into service T286748
  • 20:52 andrewbogott: draining VMs off of cloudvirt1012 so we can replace the battery for T286748
  • 15:15 andrewbogott: "rm /etc/apt/sources.list.d/openstack-mitaka-jessie.list" cloud-wide
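
A sketch of the cloud-wide cleanup in the 15:15 entry, again assuming a Cumin-style fan-out (the exact invocation is not recorded):

    # Drop the obsolete Mitaka/jessie apt source everywhere, then refresh indexes:
    sudo cumin "A:all" "rm -f /etc/apt/sources.list.d/openstack-mitaka-jessie.list"
    sudo cumin "A:all" "apt-get update"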

2021-07-23

  • 15:22 bstorm: update wikireplicas-dns for s7 fix for web replicas

2021-07-20

  • 17:07 andrewbogott: reloading haproxy on dbproxy1018 for T286598
  • 15:45 arturo: failback from labstore1006 to labstore1007 (dumps NFS) https://gerrit.wikimedia.org/r/c/operations/puppet/+/705417
  • 00:10 bstorm: restarting nova-api on cloudcontrol1003 to try and recover whatever it's doing with designate_floating_ip_ptr_records_updater

2021-07-19

  • 22:05 bstorm: set downtime scheduled for tomorrow from 1300 to 1600 UTC for cloudstore1008 and 1009 T286599
  • 20:40 andrewbogott: reloading haproxy on dbproxy1018 for T286598
  • 13:50 andrewbogott: upgrading mariadb to 10.3.29 on all cloudcontrols

2021-07-16

  • 09:55 dcaro: checking HP raid issues on cloudvirt1012 (T286766)

2021-07-14

  • 21:08 andrewbogott: restarting lots of openstack services while trying to resolve T286675
  • 12:17 dcaro: doing ceph outage tests on codfw1 (fyi)

2021-07-13

  • 10:57 dcaro: enabled autoscaling on codfw1 ceph cluster, setting a minimum of pgs on codfw1dev-compute to 128

2021-07-02

  • 10:12 wm-bot: The cluster has not rebalanced after adding the new OSDs ['cloudcephosd1019.eqiad.wmnet', 'cloudcephosd1020.eqiad.wmnet'] (T285858) - cookbook ran by dcaro@vulcanus
  • 10:12 wm-bot: Added 2 new OSDs ['cloudcephosd1019.eqiad.wmnet', 'cloudcephosd1020.eqiad.wmnet'] (T285858) - cookbook ran by dcaro@vulcanus
  • 10:12 wm-bot: Added OSD cloudcephosd1020.eqiad.wmnet... (2/2) (T285858) - cookbook ran by dcaro@vulcanus
  • 10:10 wm-bot: Finished rebooting node cloudcephosd1020.eqiad.wmnet - cookbook ran by dcaro@vulcanus
  • 10:07 wm-bot: Rebooting node cloudcephosd1020.eqiad.wmnet - cookbook ran by dcaro@vulcanus
  • 10:07 wm-bot: Adding OSD cloudcephosd1020.eqiad.wmnet... (2/2) (T285858) - cookbook ran by dcaro@vulcanus
  • 10:07 wm-bot: Added OSD cloudcephosd1019.eqiad.wmnet... (1/2) (T285858) - cookbook ran by dcaro@vulcanus
  • 10:05 wm-bot: Finished rebooting node cloudcephosd1019.eqiad.wmnet - cookbook ran by dcaro@vulcanus
  • 10:02 wm-bot: Rebooting node cloudcephosd1019.eqiad.wmnet - cookbook ran by dcaro@vulcanus
  • 10:02 wm-bot: Adding OSD cloudcephosd1019.eqiad.wmnet... (1/2) (T285858) - cookbook ran by dcaro@vulcanus
  • 10:01 wm-bot: Adding new OSDs ['cloudcephosd1019.eqiad.wmnet', 'cloudcephosd1020.eqiad.wmnet'] to the cluster (T285858) - cookbook ran by dcaro@vulcanus
  • 09:13 wm-bot: Adding OSD cloudcephosd1019.eqiad.wmnet... (1/2) (T285858) - cookbook ran by dcaro@vulcanus
  • 09:13 wm-bot: Adding new OSDs ['cloudcephosd1019.eqiad.wmnet', 'cloudcephosd1020.eqiad.wmnet'] to the cluster (T285858) - cookbook ran by dcaro@vulcanus

2021-07-01

  • 16:27 bstorm: failed over cloudstore1009 to cloudstore1008 T224747
  • 16:18 bstorm: downtimed cloudstore1008 and cloudstore1009 to fail over T224747
  • 14:25 wm-bot: Adding OSD cloudcephosd1019.eqiad.wmnet... (2/3) (T285858) - cookbook ran by dcaro@vulcanus
  • 14:25 wm-bot: Added OSD cloudcephosd1017.eqiad.wmnet... (1/3) (T285858) - cookbook ran by dcaro@vulcanus
  • 14:24 wm-bot: Finished rebooting node cloudcephosd1017.eqiad.wmnet - cookbook ran by dcaro@vulcanus
  • 14:21 wm-bot: Rebooting node cloudcephosd1017.eqiad.wmnet - cookbook ran by dcaro@vulcanus
  • 14:20 wm-bot: Adding OSD cloudcephosd1017.eqiad.wmnet... (1/3) (T285858) - cookbook ran by dcaro@vulcanus
  • 14:20 wm-bot: Adding new OSDs ['cloudcephosd1017.eqiad.wmnet', 'cloudcephosd1019.eqiad.wmnet', 'cloudcephosd1020.eqiad.wmnet'] to the cluster (T285858) - cookbook ran by dcaro@vulcanus
  • 14:18 wm-bot: Rebooting node cloudcephosd1017.eqiad.wmnet - cookbook ran by dcaro@vulcanus
  • 14:17 wm-bot: Adding OSD cloudcephosd1017.eqiad.wmnet... (1/3) (T285858) - cookbook ran by dcaro@vulcanus
  • 14:17 wm-bot: Adding new OSDs ['cloudcephosd1017.eqiad.wmnet', 'cloudcephosd1019.eqiad.wmnet', 'cloudcephosd1020.eqiad.wmnet'] to the cluster (T285858) - cookbook ran by dcaro@vulcanus
  • 11:16 wm-bot: Added new OSD node cloudcephosd1016.eqiad.wmnet (T285858) - cookbook ran by dcaro@vulcanus
  • 11:13 wm-bot: Adding new OSD cloudcephosd1016.eqiad.wmnet to the cluster (T285858) - cookbook ran by dcaro@vulcanus
  • 10:58 dcaro: rebooting cloudcephosd1016 (T285858)
  • 10:47 wm-bot: Adding new OSD cloudcephosd1016.eqiad.wmnet to the cluster (T285858) - cookbook ran by dcaro@vulcanus
  • 10:44 wm-bot: Adding new OSD cloudcephosd1016.eqiad.wmnet to the cluster (T285858) - cookbook ran by dcaro@vulcanus
  • 10:42 wm-bot: Adding new OSD cloudcephosd1016.eqiad.wmnet to the cluster (T285858) - cookbook ran by dcaro@vulcanus
  • 10:41 wm-bot: Adding new OSD cloudcephosd1016.eqiad.wmnet to the cluster (T285858) - cookbook ran by dcaro@vulcanus
  • 10:40 wm-bot: Adding new OSD cloudcephosd1016.eqiad.wmnet to the cluster (T285858) - cookbook ran by dcaro@vulcanus

2021-06-30

  • 21:48 bstorm: downtimed space alerts for scratch on cloudstore1008 until after the migration

2021-06-25

  • 15:28 andrewbogott: restarting openstack services on cloudcontrol1005
  • 09:16 arturo: icinga downtime cloudcontrols for 2h
  • 08:20 dcaro: restarting rabbitmq on cloudcontrol100{3,4}

2021-06-21

  • 13:54 dcaro: puppet fix merged and deployed, servers are back to normal
  • 13:20 dcaro: merged broken puppet patch, downtimed all cloudvirts for 2h while fixing (nothing big, just added a bad systemd timer)

2021-06-20

  • 22:21 andrewbogott: clearing admin-monitoring VMs; puppet has been failing lately due to a full drive on the puppetmaster

2021-06-15

  • 01:18 bstorm: running a modified version of the prometheus dir size cron in screen T284964

2021-06-14

  • 10:13 dcaro: setting sssd to debug mode on tools-sgeexec-0917 (T284130)

2021-06-10

  • 10:58 wm-bot: Finished rebooting the nodes ['cloudcephmon2002-dev', 'cloudcephmon2003-dev', 'cloudcephmon2004-dev'] (T281248) - cookbook ran by dcaro@vulcanus
  • 10:58 wm-bot: Finished rebooting node cloudcephmon2004-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 10:55 wm-bot: Rebooting node cloudcephmon2004-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 10:55 wm-bot: Finished rebooting node cloudcephmon2003-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 10:52 wm-bot: Rebooting node cloudcephmon2003-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 10:52 wm-bot: Finished rebooting node cloudcephmon2002-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 10:49 wm-bot: Rebooting node cloudcephmon2002-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 10:49 wm-bot: Rebooting the nodes cloudcephmon2002-dev,cloudcephmon2003-dev,cloudcephmon2004-dev (T281248) - cookbook ran by dcaro@vulcanus
  • 10:48 wm-bot: Finished rebooting the nodes ['cloudcephosd2001-dev', 'cloudcephosd2002-dev', 'cloudcephosd2003-dev'] (T281248) - cookbook ran by dcaro@vulcanus
  • 10:48 wm-bot: Finished rebooting node cloudcephosd2003-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 10:45 wm-bot: Rebooting node cloudcephosd2003-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 10:45 wm-bot: Finished rebooting node cloudcephosd2002-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 10:42 wm-bot: Rebooting node cloudcephosd2002-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 10:42 wm-bot: Finished rebooting node cloudcephosd2001-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 10:39 wm-bot: Rebooting node cloudcephosd2001-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 10:39 wm-bot: Rebooting the nodes cloudcephosd2001-dev,cloudcephosd2002-dev,cloudcephosd2003-dev (T281248) - cookbook ran by dcaro@vulcanus
  • 09:39 wm-bot: Finished rebooting the nodes ['cloudcephosd2001-dev', 'cloudcephosd2002-dev', 'cloudcephosd2003-dev'] (T281248) - cookbook ran by dcaro@vulcanus
  • 09:38 wm-bot: Finished rebooting node cloudcephosd2003-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 09:35 wm-bot: Rebooting node cloudcephosd2003-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 09:35 wm-bot: Finished rebooting node cloudcephosd2002-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 09:32 wm-bot: Rebooting node cloudcephosd2002-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 09:32 wm-bot: Finished rebooting node cloudcephosd2001-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 09:29 wm-bot: Rebooting node cloudcephosd2001-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 09:29 wm-bot: Rebooting the nodes cloudcephosd2001-dev,cloudcephosd2002-dev,cloudcephosd2003-dev (T281248) - cookbook ran by dcaro@vulcanus
  • 09:26 wm-bot: Rebooting node cloudcephosd2001-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 09:26 wm-bot: Rebooting the nodes cloudcephosd2001-dev,cloudcephosd2002-dev,cloudcephosd2003-dev (T281248) - cookbook ran by dcaro@vulcanus
  • 09:24 wm-bot: Rebooting node cloudcephosd2001-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 09:24 wm-bot: Rebooting the nodes cloudcephosd2001-dev,cloudcephosd2002-dev,cloudcephosd2003-dev (T281248) - cookbook ran by dcaro@vulcanus

2021-06-09

  • 17:33 arturo: removed icinga downtime for cloudmetrics1002 -- to see if hardware is healthy (T281881)
  • 13:30 wm-bot: Finished rebooting the nodes ['cloudcephmon2002-dev', 'cloudcephmon2003-dev', 'cloudcephmon2004-dev'] (T281248) - cookbook ran by dcaro@vulcanus
  • 13:30 wm-bot: Finished rebooting node cloudcephmon2004-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 13:27 wm-bot: Rebooting node cloudcephmon2004-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 13:27 wm-bot: Finished rebooting node cloudcephmon2003-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 13:24 wm-bot: Rebooting node cloudcephmon2003-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 13:24 wm-bot: Finished rebooting node cloudcephmon2002-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 13:21 wm-bot: Rebooting node cloudcephmon2002-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 13:21 wm-bot: Rebooting the nodes cloudcephmon2002-dev,cloudcephmon2003-dev,cloudcephmon2004-dev (T281248) - cookbook ran by dcaro@vulcanus
  • 13:01 wm-bot: Rebooting node cloudcephmon2002-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 13:01 wm-bot: Rebooting the nodes cloudcephmon2002-dev,cloudcephmon2003-dev,cloudcephmon2004-dev (T281248) - cookbook ran by dcaro@vulcanus
  • 12:53 wm-bot: Rebooting node cloudcephmon2002-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 12:53 wm-bot: Rebooting the nodes cloudcephmon2002-dev,cloudcephmon2003-dev,cloudcephmon2004-dev (T281248) - cookbook ran by dcaro@vulcanus

2021-06-08

  • 23:19 bd808: Downtimed cloudmetrics1002 in icinga until 2021-06-30 23:59:01 (T281881)
  • 21:08 bstorm: downtiming grafana-labs for maintenance
  • 16:28 wm-bot: Finished rebooting the nodes ['cloudcephosd2001-dev', 'cloudcephosd2002-dev', 'cloudcephosd2003-dev'] (T281248) - cookbook ran by dcaro@vulcanus
  • 16:27 wm-bot: Finished rebooting node cloudcephosd2003-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 16:24 wm-bot: Rebooting node cloudcephosd2003-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 16:24 wm-bot: Finished rebooting node cloudcephosd2002-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 16:22 wm-bot: Rebooting node cloudcephosd2002-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 16:21 wm-bot: Finished rebooting node cloudcephosd2001-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 16:18 wm-bot: Rebooting node cloudcephosd2001-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 16:18 wm-bot: Rebooting the nodes ['cloudcephosd2001-dev', 'cloudcephosd2002-dev', 'cloudcephosd2003-dev'] (T281248) - cookbook ran by dcaro@vulcanus
  • 16:17 wm-bot: Rebooting the nodes ['cloudcephosd2001-dev', 'cloudcephosd2002-dev', 'cloudcephosd2003-dev'] (T281248) - cookbook ran by dcaro@vulcanus
  • 15:03 wm-bot: Finished rebooting node cloudcephosd2001-dev.codfw.wmnet - cookbook ran by dcaro@vulcanus
  • 14:59 wm-bot: Rebooting node cloudcephosd2001-dev.codfw.wmnet - cookbook ran by dcaro@vulcanus
  • 14:59 wm-bot: Rebooting node cloudcephosd2001-dev.codfw.wmnet - cookbook ran by dcaro@vulcanus
  • 14:57 wm-bot: Rebooting node cloudcephosd2001-dev.codfw.wmnet - cookbook ran by dcaro@vulcanus
  • 14:57 wm-bot: Rebooting node cloudcephosd2001-dev.codfw.wmnet - cookbook ran by dcaro@vulcanus
  • 14:29 wm-bot: Rebooting node cloudcephosd2001-dev.codfw.wmnet - cookbook ran by dcaro@vulcanus
  • 14:23 wm-bot: Rebooting node cloudcephosd2001-dev.codfw.wmnet - cookbook ran by dcaro@vulcanus
  • 14:18 wm-bot: Rebooting node cloudcephosd2001-dev.codfw.wmnet - cookbook ran by dcaro@vulcanus

2021-06-07

  • 14:27 andrewbogott: moving cloudvirt1040 from 'maintenance' aggregate to 'ceph' aggregate T281399
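
A hedged sketch of a host-aggregate move like the one above, using the standard OpenStack CLI with admin credentials (whether the hypervisor is registered by short name or FQDN is an assumption here):

    # Take the hypervisor out of the old aggregate and put it in the new one:
    openstack aggregate remove host maintenance cloudvirt1040.eqiad.wmnet
    openstack aggregate add host ceph cloudvirt1040.eqiad.wmnet
    # Confirm the membership change:
    openstack aggregate show ceph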

2021-06-01

  • 13:12 dcaro: Changed the ceph osd_memory_target on eqiad pool to 6Gi (we were reaching the limit, swapping at some points)
  • 09:57 arturo: fix PTR record for 185.15.56.1 (T284025)
  • 09:56 arturo: fix PTR record for 185.15.56.1 (T248025)
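
A minimal sketch of the memory-target change in the 13:12 entry, assuming it was applied through Ceph's central config store (the value is in bytes; 6 GiB = 6442450944):

    # Raise the per-OSD memory target for all OSDs in the cluster:
    ceph config set osd osd_memory_target 6442450944
    # Verify the new value:
    ceph config get osd osd_memory_target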

2021-05-27

  • 14:58 wm-bot: Testing - cookbook ran by dcaro@vulcanus

2021-05-26

  • 19:10 andrewbogott: reimaging cloudvirt1018 to support local VM storage
  • 18:07 andrewbogott: draining cloudvirt1018, converting it to a local-storage host like cloudvirt1019 and 1020 -- T283296
  • 14:36 dcaro: Enabled syslog logging for osd.55 on eqiad ceph cluster for testing (T281247)
  • 14:36 dcaro: Enabled syslog logging on codfw ceph cluster (mon/osd/mgr) (T281247)
  • 11:26 arturo: [codfw1dev] purge old kernel packages in cloudvirt200[12]-dev
  • 11:03 arturo: created public flavor `g3.cores16.ram36.disk20` (requested as private in T283293, but it may be useful for others)

2021-05-25

  • 16:14 bd808: Closed #wikimedia-cloud-admin on f***node
  • 16:11 bd808: Closed #wikimedia-cloud-feed on f***node
  • 15:19 dcaro: rebooted cloudvirt1020, starting VMs (T275893)
  • 15:13 dcaro: rebooting cloudvirt1020 (T275893)
  • 14:42 dcaro: taking cloudvirt1020 out for maintenance (openstack wise) so no new VMs are scheduled on it (T275893)

2021-05-24

  • 22:32 andrewbogott: changing the default ttl for eqiad1.wikimedia.cloud. from 3600 to 60; this should help us avoid madness when re-using hostnames.
  • 11:20 arturo: created `g3.cores2.ram80.disk40.private` for the wmf-research-tools project, to allow resizing a 40G disk instance
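
A hedged sketch of the TTL change in the 22:32 entry, assuming the zone is managed through Designate's OpenStack CLI plugin (the zone ID lookup step is illustrative):

    # Find the zone ID, then lower the zone's default TTL from 3600 to 60 seconds:
    openstack zone list | grep eqiad1.wikimedia.cloud.
    openstack zone set --ttl 60 <zone-id>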

2021-05-22

  • 02:14 bstorm: downtiming SMART alerts on dumps server labstore1007 for the weekend because it has been flapping T281045

2021-05-13

  • 21:25 bstorm: converted the maps and scratch volumes on cloudstore1008 (standby) to drbd T224747
  • 15:45 bstorm: re-running wikireplicas-dns after refactor of config to make sure it doesn't change anything

2021-05-12

  • 14:23 arturo: [codfw1dev] cleanup old unused agents (bgp, ovs)
  • 11:37 arturo: [codfw1dev] replacing cloudnet2003-dev with cloudnet2004-dev (T281381)

2021-05-11

  • 18:00 andrewbogott: adding 'trove' service project in advance of deploying trove in eqiad1
  • 10:22 arturo: rebooted cloudgw1002 (active) thus causing a failover to cloudgw1001

2021-05-09

  • 10:53 arturo: icinga-downtime cloudmetrics1002 for 3 months (T275605)

2021-05-07

  • 13:51 andrewbogott: add inherited 'admin' right to novaadmin user throughout eqiad1. I was trying to narrow down the rights here but lack of admin breaks some workflows, e.g. T281894 and T282235

2021-05-06

  • 15:31 arturo: about to migrate the CloudVPS network to the cloudgw architecture T270704
  • 11:14 dcaro: restarting cinder-volume on the eqiad control nodes to refresh the ceph libraries (T282109)

2021-05-05

  • 16:07 dcaro: disallowing insecure global ids on the eqiad ceph cluster (T280641; see the sketch at the end of this section)
  • 15:15 wm-bot: Safe reboot of 'cloudvirt1046.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 15:11 wm-bot: Safe rebooting 'cloudvirt1046.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 15:11 wm-bot: Safe reboot of 'cloudvirt1045.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 15:07 wm-bot: Safe rebooting 'cloudvirt1045.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 15:07 wm-bot: Safe reboot of 'cloudvirt1044.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 15:03 wm-bot: Safe rebooting 'cloudvirt1044.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 15:03 wm-bot: Safe reboot of 'cloudvirt1043.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 14:59 wm-bot: Safe rebooting 'cloudvirt1043.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 14:59 wm-bot: Safe reboot of 'cloudvirt1042.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 14:40 wm-bot: Safe rebooting 'cloudvirt1042.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 14:39 wm-bot: Safe reboot of 'cloudvirt1041.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 14:14 wm-bot: Safe rebooting 'cloudvirt1041.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 14:14 wm-bot: Safe reboot of 'cloudvirt1039.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 14:10 wm-bot: Safe rebooting 'cloudvirt1039.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 12:35 wm-bot: Safe rebooting 'cloudvirt1039.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 11:56 wm-bot: Safe rebooting 'cloudvirt1038.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 11:56 wm-bot: Safe reboot of 'cloudvirt1037.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 11:31 wm-bot: Safe rebooting 'cloudvirt1037.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 11:31 wm-bot: Safe reboot of 'cloudvirt1036.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 11:08 wm-bot: Safe rebooting 'cloudvirt1036.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 11:08 wm-bot: Safe reboot of 'cloudvirt1035.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 10:39 wm-bot: Safe rebooting 'cloudvirt1035.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 10:39 wm-bot: Safe reboot of 'cloudvirt1034.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 10:13 wm-bot: Safe rebooting 'cloudvirt1034.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 10:13 wm-bot: Safe reboot of 'cloudvirt1033.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 09:47 wm-bot: Safe rebooting 'cloudvirt1033.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 09:47 wm-bot: Safe reboot of 'cloudvirt1032.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 09:21 wm-bot: Safe rebooting 'cloudvirt1032.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 09:21 wm-bot: Safe reboot of 'cloudvirt1031.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 08:45 wm-bot: Safe rebooting 'cloudvirt1031.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 08:45 wm-bot: Safe reboot of 'cloudvirt1030.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 08:19 wm-bot: Safe rebooting 'cloudvirt1030.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 08:19 wm-bot: Safe reboot of 'cloudvirt1029.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 08:02 wm-bot: Safe rebooting 'cloudvirt1029.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
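
The 16:07 entry at the top of this section matches the standard mitigation for CVE-2021-20288; a sketch, assuming the documented mon option was the mechanism used:

    # Reject clients that still reclaim insecure global IDs. Only do this after
    # all clients have been upgraded, or they will lose access:
    ceph config set mon auth_allow_insecure_global_id_reclaim false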

2021-05-04

  • 16:05 wm-bot: Safe reboot of 'cloudvirt1028.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 15:45 wm-bot: Safe rebooting 'cloudvirt1028.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 15:44 wm-bot: Safe reboot of 'cloudvirt1027.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 15:22 wm-bot: Safe rebooting 'cloudvirt1027.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 15:19 wm-bot: Safe reboot of 'cloudvirt1026.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 15:15 wm-bot: Safe rebooting 'cloudvirt1026.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 13:19 dcaro: rebooting cloudmetrics1002, got stuck again (T275605)
  • 10:04 wm-bot: Safe rebooting 'cloudvirt1026.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 09:10 wm-bot: Safe rebooting 'cloudvirt1026.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 09:10 wm-bot: Safe reboot of 'cloudvirt1025.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 08:34 wm-bot: Safe rebooting 'cloudvirt1025.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 08:20 wm-bot: Safe reboot of 'cloudvirt1024.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 08:03 wm-bot: Safe rebooting 'cloudvirt1024.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus

2021-05-03

  • 23:53 bstorm: running `maintain-dbusers harvest-replicas` on labstore1004 T281287
  • 23:51 bstorm: running `maintain-dbusers harvest-replicas` on labstore1004
  • 16:34 wm-bot: Safe reboot of 'cloudvirt1023.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 16:29 wm-bot: Safe rebooting 'cloudvirt1023.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 15:41 wm-bot: Safe rebooting 'cloudvirt1023.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 15:41 wm-bot: Safe reboot of 'cloudvirt1022.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 15:13 wm-bot: Safe rebooting 'cloudvirt1022.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 10:31 wm-bot: Safe rebooting 'cloudvirt1021.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 10:23 wm-bot: (from a cookbook)
  • 09:12 dcaro: draining and rebooting cloudvirt1021 (T280641)
  • 08:26 dcaro: draining and rebooting cloudvirt1018 (T280641)

2021-04-30

  • 11:16 dcaro: draining and rebooting cloudvirt1017, last one today (T280641)
  • 10:37 dcaro: draining cloudvirt1016 for reboot (T280641)
  • 09:48 dcaro: draining cloudvirt1013 for reboot (T280641)

2021-04-29

  • 15:11 dcaro: hard rebooting cloudmetrics1002, got hung again (T275605)
  • 07:53 dcaro: Upgrading ceph libraries on cloudcontrol1005 to octopus (T274566)
  • 07:51 dcaro: Upgrading ceph libraries on cloudcontrol1003 to octopus (T274566)
  • 07:50 dcaro: Upgrading ceph libraries on cloudcontrol1004 to octopus (T274566)

2021-04-28

  • 21:11 andrewbogott: cleaning up more references to deleted hypervisors with delete from services where topic='compute' and version != 53;
  • 20:48 andrewbogott: cleaning up references to deleted hypervisors with mysql:root@localhost [nova_eqiad1]> delete from compute_nodes where hypervisor_version != '5002000';
  • 19:40 andrewbogott: putting cloudvirt1040 into the maintenance aggregate pending more info about T281399
  • 18:11 andrewbogott: adding cloudvirt1040, 1041 and 1042 to the 'ceph' host aggregate -- T275081
  • 11:06 dcaro: All ceph server side upgraded to Octopus! \o/ (T280641)
  • 10:57 dcaro: A PG got stuck on 'remapping' after the OSD came up; had to unset norebalance and then set it again to get it unstuck (T280641)
  • 10:34 dcaro: Slow/blocked ops from cloudcephmon03, "osd_failure(failed timeout osd.32..." (cloudcephosd1005); unset the cluster noout/norebalance and it went away in a few secs, setting it again and continuing (see the sketch at the end of this section)... (T280641)
  • 09:03 dcaro: Waiting for slow heartbeats from osd.58 (cloudcephosd1002) to recover... (T280641)
  • 08:59 dcaro: During the upgrade, started getting the warning 'slow osd heartbeats on the back', meaning that pings between osds are really slow (up to 190s), all from osd.58, currently on cloudcephosd1002 (T280641)
  • 08:58 dcaro: During the upgrade, started getting the warning 'slow osd heartbeats on the back', meaning that pings between osds are really slow (up to 190s), all from osd.58 (T280641)
  • 08:58 dcaro: During the upgrade, started getting the warning 'slow osd heartbeats on the back', meaning that pings between osds are really slow (up to 190s) (T280641)
  • 08:21 dcaro: Upgrading all the ceph osds on eqiad (T280641)
  • 08:21 dcaro: The clock skew seems intermittent, there's another task to follow it T275860 (T280641)
  • 08:18 dcaro: All eqiad ceph mons and mgrs upgraded (T280641)
  • 08:18 dcaro: During the upgrade, ceph detected a clock skew on cloudcephmon1002, cloudcephmon1001, they are back (T280641)
  • 08:15 dcaro: During the upgrade, ceph detected a clock skew on cloudcephmon1002, it went away, I'm guessing systemd-timesyncd fixed it (T280641)
  • 08:14 dcaro: During the upgrade, ceph detected a clock skew on cloudcephmon1002, looking (T280641)
  • 07:58 dcaro: Upgrading ceph services on eqiad, starting with mons/managers (T280641)
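
Several entries above (e.g. 10:34) mention setting and unsetting noout/norebalance around the OSD upgrade; a minimal sketch of that standard Ceph pattern:

    # Keep the cluster from reacting while OSDs restart during the upgrade:
    ceph osd set noout
    ceph osd set norebalance
    # ... upgrade packages and restart the OSD daemons ...
    # Then let recovery and rebalancing resume:
    ceph osd unset norebalance
    ceph osd unset noout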

2021-04-27

  • 14:10 dcaro: codfw.openstack upgraded ceph libraries to 15.2.11 (T280641)
  • 13:07 dcaro: codfw.openstack cloudvirt2002-dev done, taking cloudvirt2003-dev out to upgrade ceph libraries (T280641)
  • 13:00 dcaro: codfw.openstack cloudvirt2001-dev back online, taking cloudvirt2002-dev out to upgrade ceph libraries (T280641)
  • 10:51 dcaro: ceph.eqiad: cinder pool got its pg_num increased to 1024, re-shuffle started (T273783)
  • 10:48 dcaro: ceph.eqiad: Tweaked the target_size_ratio of all the pools, enabling autoscaler (it will increase cinder pool only) (T273783)
  • 09:14 dcaro: manually force stopping the server puppetmaster-01 to unblock migration (in codfw1)
  • 09:14 dcaro: manually force stopping the server puppetmaster-01 to unblock migration
  • 08:59 dcaro: manually force stopping the server exploding-head on codfw, to try cold migration
  • 08:47 dcaro: restarting nova-compute on cloudvirt2001-dev after upgrading ceph libraries to 15.2.11

2021-04-26

  • 20:56 andrewbogott: deleting spurious 'codfw1dev' and 'codw1dev-4' regions in the dallas deployment; regions without endpoints break a bunch of things
  • 09:45 dcaro: draining cloudvirt2001-dev with the new cookbooks (T280641)

2021-04-23

  • 13:49 dcaro: testing the drain_cloudvirt cookbook on codfw1 openstack cluster, draining cloudvirt2001 (T280641)
  • 11:12 dcaro: testing the drain_cloudvirt cookbook on codfw1 openstack cluster (T280641)
  • 09:32 dcaro: finished upgrade of ceph cluster on codfw1 using exclusively cookbooks (T280641)
  • 09:17 dcaro: testing the upgrade_osds cookbook on codfw1 ceph cluster (T280641)
  • 08:17 dcaro: testing the upgrade_mons cookbook on codfw1 ceph cluster (T280641)

2021-04-21

  • 17:59 dcaro: all monitors upgraded on codfw1 with one cookbook `cookbook --verbose -c ~/.config/spicerack/cookbook.yaml wmcs.ceph.upgrade_mons --monitor-node-fqdn cloudcephmon2002-dev.codfw.wmnet` (T280641)
  • 17:47 dcaro: upgrading monitors and mgr nodes on codfw ceph cluster (T280641)
  • 13:26 dcaro: testing ceph upgrade cookbook on cloudcephmon2002-dev (T280641)

2021-04-20

  • 20:21 andrewbogott: reboot cloudservices1003
  • 20:13 andrewbogott: reboot cloudservices1004

2021-04-19

  • 08:40 dcaro: enabling puppet on labstore1004 after mysql restart (T279657)
  • 08:09 dcaro: downtiming labstore1004 and stopping puppet for mysql restart (T279657)

2021-04-14

  • 10:48 dcaro: Upgrade of codfw ceph to octopus 15.2.20 done, will run some performance tests now (T274566)
  • 10:41 dcaro: Upgrade of codfw ceph to octopus 15.2.20, mgrs upgraded, osds next (T274566)
  • 10:37 dcaro: Upgrade of codfw ceph to octopus 15.2.20, mons upgraded, mgrs next (T274566)
  • 10:15 dcaro: starting the upgrade of codfw ceph to octopus 15.2.20 (T274566)
  • 10:07 dcaro: Merged the ceph 15 (Octopus) repo deployment to codfw, only the repo, not the packages (T274566)

2021-04-13

  • 16:42 dcaro: Ceph balancer got the cluster to eval 0.014916, that is 88-77% usage for compute pool, and 28-19% usage for the cinder one \o/ (T274573)
  • 15:08 dcaro: Activating continuous upmap balancer, keeping a close eye (T274573)
  • 15:03 dcaro: Executing a second pass, there's still movements to improve the eval of 0.030075 (T274573)
  • 15:02 dcaro: First pass finished, improved eval to 0.030075 (T274573)
  • 14:49 dcaro: Running the first_pass balancing plan on ceph eqiad, current eval 0.030622 (T274573)
  • 14:43 dcaro: enabling ceph upmap pg balancer on eqiad (T274573)
  • 14:36 andrewbogott: upgrading codfw1dev to version Victoria, T261137
  • 13:11 andrewbogott: upgrading eqiad1 designate to version Victoria, T261137
  • 10:44 dcaro: enabled ceph upmap balancer on codfw (T274573)
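
A sketch of the balancer workflow behind the 14:43-15:08 entries, using the standard ceph balancer commands (plan name first_pass as in the log; lower eval scores are better):

    # Switch to upmap mode and score the current data distribution:
    ceph balancer mode upmap
    ceph balancer eval
    # Build and run a one-off plan, then enable continuous balancing:
    ceph balancer optimize first_pass
    ceph balancer execute first_pass
    ceph balancer on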

2021-04-07

  • 21:33 andrewbogott: upgrading codfw1dev designate to Victoria

2021-04-04

  • 17:36 andrewbogott: upgrading eqiad1 designate to Ussuri

2021-04-02

  • 14:12 andrewbogott: upgrading codfw1dev to OpenStack version Ussuri

2021-04-01

  • 12:15 dcaro: Restoring the 4.9 kernel on cloudcephosd2003-dev and upgrading (T274565)
  • 10:29 dcaro: Done restoring the 4.9 kernel on cloudcephosd2001-dev and upgrading, requires logging into console to boot from the older kernel before removing the newer one (T274565)
  • 10:10 dcaro: Restoring the 4.9 kernel on cloudcephosd2001-dev and upgrading (T274565)

2021-03-31

  • 08:47 dcaro: upgrading cinder on codfw cloudcontrol2* nodes (T278845)

2021-03-30

  • 09:53 arturo: rebooting cloudnet1003 to clean up the conntrack table, it wouldn't clean up by hand ...

2021-03-28

  • 15:42 andrewbogott: updated debian-10.0-buster base image

2021-03-27

  • 09:54 arturo: cleanup conntrack table in qrouter netns in cloudnet1003 (backup)

2021-03-25

  • 19:03 andrewbogott: deleting all unused (per wmcs-imageusage) Jessie base images from Glance
  • 17:15 andrewbogott: refreshing puppet compiler facts for tools project
  • 10:31 dcaro: kernel upgrade on osds on codfw done, running performance tests (T274565)
  • 10:24 dcaro: upgrading kernel on cloudcephosd2003-dev and reboot (T274565)
  • 10:18 dcaro: upgrading kernel on cloudcephosd2002-dev and reboot (T274565)
  • 10:08 dcaro: upgrading kernel on cloudcephmon2003-dev and reboot (T274565)

2021-03-24

  • 09:19 dcaro: restarted wmcs-backup on cloudvirt1024 as it failed due to an image being removed while running (T276892)

2021-03-23

  • 11:33 arturo: root@cloudcontrol1005:~# wmcs-novastats-dnsleaks --delete

2021-03-22

  • 10:10 arturo: cleanup conntrack table in standby node: aborrero@cloudnet1003:~ $ sudo ip netns exec qrouter-d93771ba-2711-4f88-804a-8df6fd03978a conntrack -F
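
The entry above flushes the connection-tracking table inside the Neutron router's network namespace on the standby node; a sketch adding an inspection step first (namespace ID copied from the entry):

    # Peek at current entries in the router namespace, then flush them all:
    sudo ip netns exec qrouter-d93771ba-2711-4f88-804a-8df6fd03978a conntrack -L | head
    sudo ip netns exec qrouter-d93771ba-2711-4f88-804a-8df6fd03978a conntrack -F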

2021-03-19

  • 17:18 bstorm: running `ALTER TABLE account MODIFY COLUMN type ENUM('user','tool','paws');` against the labsdbaccounts database on m5 T276284
  • 14:29 andrewbogott: switching admin-monitoring project to use an upstream debian image; I want to see how this affects performance
  • 00:30 bstorm: downtimed labstore1004 to check some things in debug mode
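
A sketch of the schema change in the 17:18 entry, with the ALTER statement verbatim from the log; running it through the mysql client against the labsdbaccounts database is an assumption (connection details omitted):

    # Extend the account-type enum so 'paws' accounts can be stored:
    mysql labsdbaccounts -e "ALTER TABLE account MODIFY COLUMN type ENUM('user','tool','paws');"
    # Confirm the new column definition:
    mysql labsdbaccounts -e "SHOW COLUMNS FROM account LIKE 'type';"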

2021-03-17

  • 17:28 bstorm: restarted the backup-glance-images job to clear errors in systemd T271782
  • 17:16 andrewbogott: set default cinder quota for projects to 80Gb with "update quota_classes set hard_limit=80 where resource='gigabytes';" on database 'cinder'
  • 16:58 andrewbogott: disabling all flavors with >20Gb root storage with "update flavors set disabled=1 where root_gb>20;" in nova_eqiad1_api
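
The 17:16 and 16:58 entries above apply their changes directly in the cinder and nova_eqiad1_api databases; a hedged sketch with the statements verbatim from the log, wrapped in the mysql client (connection details omitted):

    # Default every project's cinder storage quota to 80 GB:
    mysql cinder -e "update quota_classes set hard_limit=80 where resource='gigabytes';"
    # Disable all flavors with more than 20 GB of root disk:
    mysql nova_eqiad1_api -e "update flavors set disabled=1 where root_gb>20;"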

2021-03-10

  • 16:51 arturo: rebooting cloudvirt1030 for T275753
  • 13:14 dcaro: starting manually the canary VM for cloudvirt1029 (nova start 349830f6-3b39-4a8c-ada4-a7439f65cffe) (T275753)
  • 12:51 arturo: draining cloudvirt1030 for T275753
  • 12:47 arturo: rebooting cloudvirt1029 for T275753
  • 11:56 arturo: [codfw1dev] restart rabbitmq-server in all 3 cloudcontrol servers for T276964
  • 11:53 arturo: [codfw1dev] restart nova-conductor in all 3 cloudcontrol servers for T276964
  • 11:31 arturo: draining cloudvirt1029 for T275753
  • 11:29 arturo: rebooting cloudvirt1013 for T275753
  • 11:05 arturo: draining cloudvirt1013 for T275753
  • 11:00 arturo: rebooting cloudvirt1028 for T275753
  • 10:33 arturo: draining cloudvirt1028 for T275753
  • 10:29 arturo: rebooting cloudvirt1023 for T275753
  • 09:37 arturo: draining cloudvirt1023 for T275753
  • 09:07 arturo: [codfw1dev] reimaging cloudvirt2003-dev (T276964)

2021-03-09

  • 16:27 arturo: rebooting cloudvirt1027 (T275753)
  • 13:39 arturo: draining cloudvirt1027 for T275753
  • 13:35 arturo: icinga-downtime cloudvirt1038 for 30 days for T276922
  • 13:21 arturo: add cloudvirt1039 to the ceph host aggregate (no longer a spare, we have cloudvirt1038 with HW failures)
  • 12:52 arturo: cloudvirt1038 hard powerdown / powerup for T276922
  • 12:33 arturo: rebooting cloudvirt1038 (T275753)
  • 10:58 arturo: draining cloudvirt1038 (T275753)
  • 10:54 arturo: rebooting cloudvirt1037 (T275753)
  • 09:59 arturo: draining cloudvirt1037 (T275753)
  • 09:12 dcaro: restarted the wmcs-backup service on cloudvirt1024 to retry the backups (failed because a VM was removed in-between, T276892)

2021-03-05

  • 21:40 andrewbogott: replacing 'observer' role with 'reader' role in eqiad1 T276018
  • 21:21 andrewbogott: replacing 'observer' role with 'reader' role in eqiad1
  • 16:23 arturo: rebooting cloudvirt1036 for T275753
  • 12:30 arturo: draining cloudvirt1036 for T275753
  • 12:25 arturo: rebooting cloudvirt1035 for T275753
  • 10:49 arturo: rebooting cloudvirt1035 for T275753
  • 10:47 arturo: rebooting cloudvirt1034 for T275753
  • 10:26 arturo: draining cloudvirt1034 for T275753
  • 10:25 arturo: rebooting cloudvirt1033 for T275753
  • 09:18 arturo: draining cloudvirt1033 for T275753

2021-03-04

  • 18:36 andrewbogott: rebooting cloudmetrics1002; the console is hanging
  • 16:59 arturo: rebooting cloudvirt1032 for T275753
  • 16:34 arturo: draining cloudvirt1032 for T275753
  • 16:33 arturo: rebooting cloudvirt1031 for T275753
  • 16:11 arturo: draining cloudvirt1031 for T275753
  • 16:09 arturo: rebooting cloudvirt1026 for T275753
  • 15:57 arturo: draining cloudvirt1026 for T275753
  • 15:55 arturo: rebooting cloudvirt1025 for T275753
  • 15:41 arturo: draining cloudvirt1025 for T275753
  • 15:12 arturo: rebooting cloudvirt1024 for T275753
  • 11:29 arturo: draining cloudvirt1024 for T275753
  • 11:24 dcaro: rebooted cloudvirt1022, re-adding to ceph and removing from maintenance host aggregate for T275753
  • 11:01 dcaro: rebooting cloudvirt1022 for T275753
  • 09:12 dcaro: draining cloudvirt1022 for T275753

2021-03-03

  • 17:16 andrewbogott: restarting rabbitmq-server on cloudcontrol1003,1004,1005; trying to explain amqp errors in scheduler logs
  • 16:03 dcaro: draining cloudvirt1022 for T275753
  • 16:00 arturo: move cloudvirt1013 into the 'toobusy' host aggregate, it has 221% cpu subscription and 82% MEM subscription
  • 15:34 arturo: rebooting cloudvirt1021 for T275753
  • 14:31 arturo: draining cloudvirt1021 for T275753
  • 13:59 arturo: rebooting cloudvirt1018 for T275753
  • 13:28 arturo: draining cloudvirt1018 for T275753
  • 12:49 arturo: rebooting cloudvirt1017 for T275753
  • 12:22 arturo: draining cloudvirt1017 for T275753
  • 12:20 arturo: rebooting cloudvirt1016 for T275753
  • 12:01 arturo: draining cloudvirt1016 for T275753
  • 11:59 arturo: cloudvirt1014 now in the ceph host aggregate
  • 11:58 arturo: rebooting cloudvirt1014 for T275753
  • 11:50 arturo: moved cloudvirt1023 away from the maintenance host aggregate, leaving it in the ceph aggregate (it was in both)
  • 11:47 arturo: moved cloudvirt1014 to the 'maintenance' host aggregate, drain it for T275753
  • 10:01 arturo: icinga-downtime cloudnet1003 for 14 days bc potential alerting storm due to firmware issues (T271058)
  • 10:01 arturo: rebooting again cloudnet1003 (no network failover) (T271058)
  • 09:59 arturo: update firmware-bnx2x from 20190114-2 to 20200918-1~bpo10+1 on cloudnet1003 (T271058)
  • 09:30 arturo: installing linux kernel 5.10.13-1~bpo10+1 in cloudnet1003 and rebooting it (network failover) (T271058)

2021-03-02

  • 17:16 andrewbogott: rebooting cloudvirt1039 to see if I can trigger T276208
  • 16:10 arturo: [codfw1dev] restart nova-compute on cloudvirt2002-dev
  • 11:59 arturo: moved cloudvirt1012 to 'maintenance' host aggregate. Drain it with `wmcs-drain-hypervisor` to reboot it for T275753
  • 11:59 arturo: cloudvirt1023 is affected by T276208 and cannot be rebooted. Put it back into the ceph host aggregate
  • 10:43 arturo: moved cloudvirt1013 cloudvirt1032 cloudvirt1037 back into the 'ceph' host aggregate
  • 10:13 arturo: moved cloudvirt1023 to 'maintenance' host aggregate. Drain it with `wmcs-drain-hypervisor` to reboot it for T275753
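  The drain/reboot cycle logged here and over the surrounding days follows one pattern; a hedged sketch, using the aggregate names from this log (wmcs-drain-hypervisor is an internal WMCS tool; its exact invocation is not recorded here, so the bare hostname argument is an assumption):

      openstack aggregate remove host ceph cloudvirt1023        # stop scheduling VMs here
      openstack aggregate add host maintenance cloudvirt1023
      wmcs-drain-hypervisor cloudvirt1023                       # live-migrate all VMs away
      # reboot the host, then reverse the two aggregate moves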

2021-03-01

  • 20:12 andrewbogott: removing novaadmin from all projects save 'admin' for T274385
  • 19:51 andrewbogott: removing novaobserver from all projects save 'observer' for T274385
  • 19:50 andrewbogott: adding inherited domain-wide roles to novaadmin and novaobserver as per T274385

2021-02-28

  • 04:54 andrewbogott: restarted redis-server on tools-redis-1003 and tools-redis-1004 in an attempt to reduce replag, no real change detected

2021-02-27

  • 00:33 andrewbogott: sudo cumin --timeout 500 "A:all and not O{project:clouddb-services}" 'lsb_release -c | grep -i buster && uname -r | grep -v 4.19.0-14-amd64 && reboot'
  • 00:28 andrewbogott: sudo cumin --timeout 500 "A:all and not O{project:clouddb-services}" 'lsb_release -c | grep -i buster && uname -r | grep -v 4.19.0-14-amd64 && echo reboot'
  • 00:09 andrewbogott: sudo cumin "A:all and not O{project:clouddb-services}" 'lsb_release -c | grep -i stretch && uname -r | grep -v 4.19.0-0.bpo.14-amd64 && reboot'

2021-02-26

  • 14:58 dcaro: [eqiad] rebooting cloudcephosd1015 (last osd \o/) for kernel upgrade (T275753); the rolling pattern is sketched after this day's entries
  • 14:51 dcaro: [eqiad] rebooting cloudcephosd1014 for kernel upgrade (T275753)
  • 14:44 dcaro: [eqiad] rebooting cloudcephosd1013 for kernel upgrade (T275753)
  • 14:38 dcaro: [eqiad] rebooting cloudcephosd1012 for kernel upgrade (T275753)
  • 14:31 dcaro: [eqiad] rebooting cloudcephosd1011 for kernel upgrade (T275753)
  • 14:25 dcaro: [eqiad] rebooting cloudcephosd1010 for kernel upgrade (T275753)
  • 14:17 dcaro: [eqiad] rebooting cloudcephosd1009 for kernel upgrade (T275753)
  • 13:54 dcaro: [eqiad] downtimed the "Ceph OSDs down" alert on alert1001 until 18:00 GMT+1, as that alert is not tied to the single host being rebooted (T275753)
  • 13:51 dcaro: [eqiad] rebooting cloudcephosd1008 for kernel upgrade (T275753)
  • 13:45 dcaro: [eqiad] rebooting cloudcephosd1007 for kernel upgrade (T275753)
  • 13:38 dcaro: [eqiad] rebooting cloudcephosd1006 for kernel upgrade (T275753)
  • 12:07 dcaro: [eqiad] rebooting cloudcephosd1005 for kernel upgrade (T275753)
  • 12:00 arturo: rebooting cloudcontrol1003 for kernel upgrade (T275753)
  • 11:42 arturo: rebooting cloudcontrol1004 for kernel upgrade (T275753)
  • 11:41 dcaro: [eqiad] rebooting cloudcephosd1004 for kernel upgrade (T275753)
  • 11:32 dcaro: [eqiad] rebooting cloudcephosd1003 for kernel upgrade (T275753)
  • 11:30 arturo: rebooting cloudcontrol1005 for kernel upgrade (T275753)
  • 11:26 dcaro: [eqiad] rebooting cloudcephosd1002 for kernel upgrade (T275753)
  • 11:16 dcaro: [eqiad] rebooting cloudcephosd1001 for kernel upgrade (T275753)
  • 11:11 dcaro: [eqiad] rebooting cloudcephmon1003 for kernel upgrade (T275753)
  • 11:05 dcaro: [eqiad] rebooting cloudcephmon1002 for kernel upgrade (T275753)
  • 10:59 dcaro: [eqiad] rebooting cloudcephmon1001 for kernel upgrade (T275753)
  • 10:45 arturo: rebooting cloudvirt1039 into a new kernel (T275753) --- spare
  • 10:43 dcaro: [codfw1dev] rebooting cloudcephmon2003-dev for kernel upgrade (T275753)
  • 10:38 dcaro: [codfw1dev] rebooting cloudcephmon2002-dev for kernel upgrade (T275753)
  • 10:29 dcaro: [codfw1dev] rebooting cloudcephmon2001-dev for kernel upgrade (T275753)
  • 10:24 arturo: [codfw1dev] purge old kernel packages on cloudvirt2003-dev to force boot into a new kernel (T275753)
  • 10:11 arturo: [codfw1dev] manually creating /boot/grub/ on cloudvirt2003-dev to allow update-grub2 to run (so it can reboot into a new kernel) (T275753)
  • 10:11 dcaro: [codfw1dev] rebooting cloudcephosd2003-dev for kernel upgrade (T275753)
  • 10:05 dcaro: [codfw1dev] rebooting cloudcephosd2002-dev for kernel upgrade (T275753)
  • 10:01 arturo: [codfw1dev] rebooting cloudvirt200X-dev for kernel upgrade (T275753)
  • 09:59 arturo: [codfw1dev] rebooting cloudweb2001-dev for kernel upgrade (T275753)
  • 09:53 arturo: [codfw1dev] rebooting cloudservices2003-dev for kernel upgrade (T275753)
  • 09:51 arturo: [codfw1dev] rebooting cloudservices2002-dev for kernel upgrade (T275753)
  • 09:45 arturo: [codfw1dev] rebooting cloudcontrol2004-dev for kernel upgrade (T275753)
  • 09:44 arturo: [codfw1dev] rebooting cloudbackup[2001-2002].codfw.wmnet for kernel upgrade (T275753)
  • 09:43 dcaro: [codfw1dev] rebooting cloudcephosd2001-dev for kernel upgrade (T275753)
  • 09:41 arturo: [codfw1dev] rebooting cloudcontrol2003-dev for kernel upgrade (T275753)
  • 09:33 arturo: [codfw1dev] rebooting cloudcontrol2001-dev for kernel upgrade (T275753)
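  A hedged sketch of the rolling pattern behind this day's entries: one ceph host at a time, waiting for the cluster to recover before the next (whether noout was set is not recorded in the log):

      ceph osd set noout      # optional: avoid rebalancing during the short reboot window
      # for each cloudcephmon/cloudcephosd host, one at a time: reboot, then
      ceph -s                 # proceed only once all OSDs are back up and health is OK
      ceph osd unset noout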

2021-02-25

  • 14:56 arturo: deployed wmcs-netns-events daemon to all cloudnet servers (T275483)

2021-02-24

  • 11:07 arturo: force-reboot cloudmetrics1002, add icinga downtime for 2 hours. Investigating some server issue
  • 00:17 bstorm: set --property hw_scsi_model=virtio-scsi and --property hw_disk_bus=scsi on the main stretch image in glance on eqiad1 T275430

2021-02-23

  • 22:43 bstorm: set --property hw_scsi_model=virtio-scsi and --property hw_disk_bus=scsi on the main buster image in glance on eqiad1 T275430 (command form sketched below)
  • 20:36 andrewbogott: adding r/o access to the eqiad1-glance-images ceph pool for the client.eqiad1-compute for T275430
  • 10:49 arturo: rebooting cloudnet1004 into the new kernel from buster-bpo (T271058)
  • 10:49 arturo: installing linux-image-amd64 from buster-bpo 5.10.13-1~bpo10+1 in cloudnet1004 (T271058)
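  A sketch of the command form behind the two property changes above, assuming the stock OpenStack CLI (the image ID is a placeholder):

      openstack image set \
          --property hw_scsi_model=virtio-scsi \
          --property hw_disk_bus=scsi \
          <image-id>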

2021-02-22

  • 17:15 bstorm: restarting nova-compute on cloudvirt1016 and cloudvirt1036 in case it helps T275411
  • 15:02 dcaro: Re-uploaded the debian buster 10.0 image from rbd to glance; that worked. Re-spawning all the broken instances (T275378)
  • 11:12 dcaro: Refreshing all the canary instances (T275354)

2021-02-18

  • 14:50 arturo: rebooting cloudnet1004 for T271058
  • 10:25 dcaro: Rebooting cloudmetrics1001 to apply new kernel (T275116)
  • 10:16 dcaro: Rebooting cloudmetrics1002 to apply new kernel (T275116)
  • 10:14 dcaro: Upgrading grafana on cloudmetrics1002 (T275116)
  • 10:12 dcaro: Upgrading grafana on cloudmetrics1001 (T275116)

2021-02-17

2021-02-15

  • 16:25 arturo: [codfw1dev] rebooting all cloudgw200x-dev / cloudnet200x-dev servers (T272963)
  • 15:45 arturo: [codfw1dev] drop subnet definition for cloud-instances-transport1-b-codfw (T272963)
  • 15:45 arturo: [codfw1dev] connect virtual router cloudinstances2b-gw to vlan cloud-gw-transport-codfw (185.15.57.10) (T272963)

2021-02-11

  • 12:01 arturo: [codfw1dev] drop instance `tools-codfw1dev-bastion-1` in `tools-codfw1dev` (was buster, cannot use it yet)
  • 11:59 arturo: [codfw1dev] create instance `tools-codfw1dev-bastion-2` (stretch) in `tools-codfw1dev` to test stuff related to T272397
  • 11:45 arturo: [codfw1dev] create instance `tools-codfw1dev-bastion-1` in `tools-codfw1dev` to test stuff related to T272397
  • 11:42 arturo: [codfw1dev] drop `tools` project, create `tools-codfw1dev`
  • 11:38 arturo: [codfw1dev] drop `cloudinfra` project (we are using `cloudinfra-codfw1dev` there)
  • 05:37 bstorm: downtimed cloudnet1004 for another week T271058

2021-02-09

  • 15:23 arturo: icinga-downtime for 2h everything *labs *cloud for openstack upgrades
  • 11:14 dcaro: Merged the osd scheduler change for all osds, applying on all cloudcephosd* (T273791)

2021-02-08

  • 18:50 bstorm: enabled puppet on cloudvirt1023 for now T274144
  • 18:44 bstorm: restarted the backup_vms.service on cloudvirt1027 T274144
  • 17:51 bstorm: deleted project pki T273175

2021-02-05

  • 10:59 arturo: icinga-downtime labstore1004 tools share space check for 1 week (T272247)
  • 10:21 dcaro: This was affecting maps and several other projects; maps and project-proxy have been fixed (T273956)
  • 09:19 dcaro: Some certs around the infra are expired (T273956)

2021-02-04

  • 10:12 dcaro: Increasing the memory limit of osds in eqiad from 8589934592 (8G) to 12884901888 (12G) (T273851)
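  A minimal sketch, assuming the limit in question is osd_memory_target applied through the ceph config store (it may equally have been rolled out via ceph.conf and puppet):

      ceph config set osd osd_memory_target 12884901888    # 12G
      ceph config get osd.0 osd_memory_target              # spot-check one daemon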

2021-02-03

  • 09:59 dcaro: Doing a full vm backup on cloudvirt1024 with the new script (T260692)
  • 01:50 bstorm: icinga-downtime cloudnet1004 for a week T271058

2021-02-02

  • 17:14 dcaro: Changed osd memory limit from 4G to 8G (T273649)
  • 11:00 arturo: icinga-downtime cloudvirt-wdqs1001 for 1 week (T273579)
  • 03:12 andrewbogott: running /usr/local/sbin/wmcs-purge-backups and /usr/local/sbin/wmcs-backup-instances on cloudvirt1024 to see why the backup job paged

2021-01-29

  • 15:36 andrewbogott: disabling puppet and some services on eqiad1 cloudcontrol nodes; replacing nova-placement-api with placement-api

2021-01-28

  • 19:44 andrewbogott: shutting down cloudcontrol2001-dev because it's in a partially upgraded state; will revive when it's time for Train

2021-01-27

  • 00:50 bstorm: icinga-downtime cloudnet1004 for a week T271058

2021-01-22

  • 16:44 andrewbogott: upgrading designate on cloudservices1003/1004 to OpenStack 'train'
  • 11:29 dcaro: Doing some tests, removed the cloudcontrol1003 puppet cert; regenerating...

2021-01-21

2021-01-20

  • 10:49 arturo: merging core router firewall change https://gerrit.wikimedia.org/r/c/operations/homer/public/+/657302 (T209082)
  • 10:05 dcaro: Everything looks ok: created a new vm with a volume in ceph without issues, and no warnings/errors in ceph status; closing (T272303)
  • 09:55 dcaro: Eqiad ceph cluster upgraded, doing sanity checks (T272303)
  • 09:46 dcaro: 75% of the eqiad cluster upgraded... continuing (T272303)
  • 09:37 dcaro: 25% of the eqiad cluster upgraded... continuing (T272303)
  • 09:24 dcaro: Mgr daemons upgraded and running, upgrading osd daemons on servers cloudcephosd1*, this may take a bit longer (T272303)
  • 09:22 dcaro: Mon daemons upgraded and running, upgrading mgr daemons on servers cloudcephmon1* (T272303)
  • 09:16 dcaro: Starting eqiad ceph upgrade, upgrading the mon servers cloudcephmon1* (T272303)
  • 09:01 dcaro: Will start the ceph upgrade in 15 min, no downtime nor performance impact is expected (T272303)

2021-01-19

  • 10:17 arturo: icinga-downtime cloudnet1004 for 1 week (T271058)

2021-01-18

  • 16:00 dcaro: Codfw1 ceph cluster upgraded, will wait until tomorrow to see if there's any instability, but everything looks fine (T272303)
  • 15:38 dcaro: Upgraded mgr services on codfw ceph cluster, starting with osd ones (T272303)
  • 15:35 dcaro: Upgraded mon services on codfw ceph cluster, starting with mgr ones (T272303)
  • 15:21 dcaro: Starting upgrade of ceph mon nodes on codfw (T272303)
  • 15:06 dcaro: re-enabling puppet on cloudcephosd2* hosts
  • 13:53 dcaro: disabling puppet on cloudcephosd2* to resume perf tests
  • 10:50 dcaro: re-enabling puppet on cloudcephosd2* (codfw)
  • 10:07 dcaro: disabling puppet on cloudcephosd2* (codfw) to do some performance tests
  • 09:00 dcaro: Enabling custom application 'cinder' on pool codfw1dev-cinder to get rid of health warnings

2021-01-17

  • 16:53 arturo: icinga downtime labstore1004 /srv/tools space check for 3 days (T272247)

2021-01-15

  • 13:41 arturo: icinga downtime labstore1004 maintain-dbuser alert until 2021-01-19 (T272125)
  • 09:47 arturo: labstore1004 maintain-dbusers affected by T272127 and T272125
  • 09:22 arturo: restart maintain-dbusers.service in labstore1004
  • 08:19 dcaro: Merging the patch to disable write caches on ceph osds (T271527)

2021-01-13

  • 17:03 arturo: move cloudvirt1013, cloudvirt1032 and cloudvirt1037 to the 'toobusy' host aggregate to prevent further CPU oversubscription
  • 12:40 arturo: try increasing systemd watchdog timeout for conntrackd in cloudnet1004 (T268335)
  • 11:45 dcaro: https://gerrit.wikimedia.org/r/c/operations/puppet/+/654419 merged and deployed (and tested) (T268877)
  • 11:40 dcaro: merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/654419 that might affect the encapi service (puppet on cloud environment), no downtime expected though (T268877)
  • 10:56 arturo: trying to cleanup dpkg package mess in cloudnet2002-dev
  • 10:02 arturo: prevent floating IP allocation from neutron transport subnet: root@cloudcontrol1005:~# neutron subnet-update --allocation-pool start=185.15.56.244,end=185.15.56.244 cloud-instances-transport1-b-eqiad1 (T271867)

2021-01-12

  • 10:33 arturo: reboot cloudnet1004
  • 10:32 arturo: update firmware-bnx2x from 20190114-2 to 20200918-1~bpo10+1 on cloudnet1004 (T271058)

2021-01-11

  • 10:22 arturo: doubling size of conntrack table in cloudnet servers https://gerrit.wikimedia.org/r/c/operations/puppet/+/655407 (T271058)
  • 10:07 arturo: manually cleanup conntrack table in cloudnet1004 (T271058)
  • 09:19 dcaro: cleaned up ~1800 snapshots; only 109 remain, one for each host x image combination (plus some ephemeral ones while doing backups); closing the task (T270478)
  • 08:39 dcaro: cleaning up dangling snapshots now that we have the new suffixed ones (T270478)
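  A sketch of the per-image cleanup, assuming the eqiad1-compute pool named elsewhere in this log (image and snapshot names are placeholders):

      rbd -p eqiad1-compute snap ls <image>               # list snapshots for one image
      rbd -p eqiad1-compute snap rm <image>@<snapshot>    # drop a dangling one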

2021-01-10

  • 16:02 andrewbogott: restarting rabbitmq-server on all eqiad1 cloudcontrols
  • 15:54 andrewbogott: restarting neutron-metadata-agent on cloudnet1004 due to many syslog complaints

2021-01-08

  • 11:25 arturo: rebooting both cloudnet2002-dev/cloudnet2003-dev to make sure interfaces are set up correctly (T271517)
  • 11:22 arturo: connecting cloudnet2002-dev cloudnet2003-dev back to vlan 2120 (T271517)
  • 11:06 arturo: root@cloudcontrol2001-dev:~# openstack router set --external-gateway wan-transport-codfw --fixed-ip subnet=cloud-instances-transport1-b-codfw,ip-address=208.80.153.190 cloudinstances2b-gw (T271517)
  • 11:02 arturo: root@cloudcontrol2001-dev:~# openstack router set --enable-snat cloudinstances2b-gw --external-gateway wan-transport-codfw (T271517)
  • 11:01 arturo: enabling neutron hacks in codfw1dev (cloudnet2002-dev, cloudnet2003-dev) (T271517)
  • 10:55 arturo: aborrero@labtestvirt2003:~ $ sudo ifdown eno2.2107 (T271517)
  • 10:55 arturo: aborrero@labtestvirt2003:~ $ sudo ifdown eno2.2120 (T271517)
  • 10:53 arturo: root@cloudcontrol2001-dev:~# openstack subnet create --network wan-transport-codfw --gateway 208.80.153.185 --ip-version 4 --network wan-transport-codfw --no-dhcp --subnet-range 208.80.153.184/29 cloud-instances-transport1-b-codfw (T271517)
  • 10:40 dcaro: Finished tests, bringing the osd (osd.48) back online in the eqiad ceph cluster (T271417)
  • 09:59 dcaro: Started performance tests on sdc (osd.48) in the eqiad ceph cluster (T271417)
  • 09:41 dcaro: Taking osd.48 from eqiad ceph cluster out to do performance tests (T271417)

2021-01-07

  • 15:19 dcaro: Finished speed tests on cloudcephosd2001-dev, reprovisioning the osd.0 sdc (T271417)
  • 14:39 dcaro: Starting speed tests on cloudcephosd2001-dev sdc (T271417)
  • 12:54 dcaro: Taking osd.0 down on codfw ceph cluster to try the disk performance testing process (T271417)
  • 11:35 arturo: merging dmz_cidr change (T209082, T267779)

2021-01-05

  • 10:40 dcaro: removing dumps-[1..*] backups from cloudvirt1024 as they are not needed (T271094)

2021-01-03

  • 07:06 dcaro: Got a network hiccup on cloudnet1004, keeping track here T271058

2020-12-28

2020-12-23

  • 15:38 andrewbogott: restarting rabbitmq on cloudcontrol1004; suspected leaks
  • 15:33 andrewbogott: restarting each cloudcontrol galera node in turn to see if that quiets down the syncing warnings
  • 12:08 arturo: move memory out of the swap in cloudcontrol1004 by disabling/enabling it (1Gb swap was being used)

2020-12-22

  • 15:30 dcaro: cleaning up 6778 dangling snapshots for glance images in eqiad (T270478)
  • 13:51 dcaro: merged patch to move wikidumpparse backups to cloudvirt1025 to free space on cloudvirt1026

2020-12-19

  • 16:18 dcaro: gzipped a bunch of logs on cloudvirt1004 due to / being out of space
  • 00:14 bstorm: truncated /var/log/debug.1 on cloudcontrol1003 which appears to be the exact same content as the user.log files anyway
  • 00:10 bstorm: truncated /var/log/daemon.log.1 and the haproxy log
  • 00:02 bstorm: truncated /var/log/messages.1 on cloudcontrol1003

2020-12-18

  • 23:53 bstorm: truncated haproxy.log.1 on cloudcontrol1003
  • 20:46 andrewbogott: setting pg and pgp number to 4096 for eqiad1-compute as joachim thinks 8192 might be too much T270305
  • 17:09 dcaro: finished cleaning up the dangling snapshots from cloudvirt1026 (T270478)
  • 17:08 dcaro: removing dangling rbd snapshots (for backups on cloudvirt1026) (T270478)
  • 17:06 dcaro: finished cleaning up the dangling snapshots from cloudvirt1025 (T270478)
  • 17:05 dcaro: removing dangling rbd snapshots (for backups on cloudvirt1025) (T270478)
  • 17:00 dcaro: finished cleaning up the dangling snapshots from cloudvirt1021 (T270478)
  • 16:58 dcaro: removing dangling rbd snapshots (for backups on cloudvirt1021) (T270478)
  • 16:56 dcaro: finished cleaning up the dangling snapshots from cloudvirt1022 (T270478)
  • 16:55 dcaro: removing dangling rbd snapshots (for backups on cloudvirt1022) (T270478)
  • 16:54 dcaro: finished cleaning up the dangling snapshots from cloudvirt1023 (T270478)
  • 16:51 dcaro: removing dangling rbd snapshots (for backups on cloudvirt1023) (T270478)
  • 16:47 dcaro: finished cleaning up the dangling snapshots from cloudvirt1024, freed ~12% of the capacity (T270478)
  • 16:21 dcaro: removing dangling rbd snapshots (for backups on cloudvirt1024) (T270478)
  • 16:13 andrewbogott: setting autoscale to 'off' for both ceph pools (eqiad1-compute and eqiad1-glance-images) because we like how things are set and the autoscaler does not
  • 10:33 dcaro: purging rbd snapshots for image fc6fb78b-4515-4dcc-8254-591b9fe01762 (T270478)

2020-12-17

  • 22:17 andrewbogott: correction to above, set the pg and pgp to 1024 for eqiad1-glance-images
  • 22:16 andrewbogott: setting pgp number to 8192 for eqiad1-compute (a 4x increase) and 2048 for eqiad1-glance-images (also a 4x increase) T270305 (same as pg)
  • 22:14 andrewbogott: setting pg number to 8192 for eqiad1-compute (a 4x increase) and 2048 for eqiad1-glance-images (also a 4x increase) T270305
  • 22:10 andrewbogott: setting autoscale to 'warn' for both ceph pools (eqiad1-compute and eqiad1-glance-images)

2020-12-16

  • 09:31 dcaro: removing invalid backups from cloudvirt1024 (196 in total) (T269419)

2020-12-14

  • 17:42 dcaro: The removal freed ~12GB (still 100% usage :S) (T269419)
  • 17:36 dcaro: removing invalid backups that have a valid copy (T269419)
  • 15:43 dcaro: Merging the tagging for vm backups (T267195)
  • 09:45 arturo: icinga downtime cloudvirt1024 for 6 days (T269419)

2020-12-13

  • 09:11 _dcaro: running backup purge script on cloudvirt1024 (T269419)

2020-12-10

  • 23:36 bstorm: cleaned up the logs for haproxy on cloudcontrol1003 by deleting all the gzipped ones and truncating the .1 file
  • 11:56 dcaro: Freed some space on cloudvirt1024 by running the purge script (T269419)
  • 09:17 dcaro: removing leaked dns record discordwiki.eqiad.wmflabs (clinic duty)

2020-12-08

  • 18:01 dcaro: Host cloudvirt1030 up and running (T216195)
  • 15:59 dcaro: Re-imaging host cloudvirt1030 (T216195)
  • 14:18 dcaro: Host online cloudvirt1029 (T216195)
  • 14:13 dcaro: Host re-imaged, doing tests cloudvirt1029 (T216195)
  • 12:14 dcaro: Re-imaging cloudvirt1029 (T216195)

2020-12-07

  • 18:33 andrewbogott: putting cloudvirt1023 back into service T269467
  • 15:55 andrewbogott: reimaging cloudvirt1028 for T216195
  • 14:49 dcaro: Re-imaging cloudvirt1027 (T216195)

2020-12-05

  • 00:35 andrewbogott: moving cloudvirt1023 back into maintenance because T269467 continues to puzzle

2020-12-04

  • 22:33 andrewbogott: moving cloudvirt1023 back into the ceph aggregate; it doesn't need upgrades after all T269467
  • 22:24 andrewbogott: moving cloudvirt1023 out of the ceph aggregate and into maintenance for T269467
  • 21:06 andrewbogott: putting cloudvirt1025 and 1026 back into service because I'm pretty sure they're fixed. T269313
  • 12:12 arturo: manually running `wmcs-purge-backups` again on cloudvirt1024 (T269419)
  • 11:25 arturo: icinga downtime cloudvirt1024 for 6 days, to avoid paging noises (T269419)
  • 11:25 arturo: last log line referencing cloudvirt1024 is a mistake (T269313)
  • 11:24 arturo: icinga downtime cloudvirt1024 for 6 days, to avoid paging noises (T269313)
  • 10:28 arturo: manually running `wmcs-purge-backups` on cloudvirt1024 (T269419)
  • 10:23 arturo: setting expiration to 2020-12-03 to the oldest backy snapshot of every VM in cloudvirt1024 (T269419)
  • 09:54 arturo: icinga downtime cloudvirt1025 for 6 days (T269313)

2020-12-03

  • 23:21 andrewbogott: removing all osds on cloudcephosd1004 for rebuild, T268746
  • 21:45 andrewbogott: removing all osds on cloudcephosd1005 for rebuild, T268746
  • 19:51 andrewbogott: removing all osds on cloudcephosd1006 for rebuild, T268746
  • 17:01 arturo: icinga downtime cloudvirt1025 for 48h to debug network issue T269313
  • 16:56 arturo: rebooting cloudvirt1025 to debug network issue T269313
  • 16:38 dcaro: Reimaging cloudvirt1026 (T216195)
  • 13:24 andrewbogott: removing all osds on cloudcephosd1008 for rebuild, T268746
  • 02:55 andrewbogott: removing all osds on cloudcephosd1009 for rebuild, T268746

2020-12-02

  • 20:04 andrewbogott: removing all osds on cloudcephosd1010 for rebuild, T268746
  • 17:25 arturo: [15:51] failing over the neutron virtual router in eqiad1 (T268335)
  • 15:36 arturo: conntrackd is now up and running in cloudnet1003/1004 nodes (T268335)
  • 15:33 arturo: [codfw1dev] conntrackd is now up and running in cloudnet200x-dev nodes (T268335)
  • 15:08 andrewbogott: removing all osds on cloudcephosd1012 for rebuild, T268746
  • 12:41 arturo: disable puppet in all cloudnet servers to merge conntrackd change T268335
  • 11:12 dcaro: Reset the properties for the flavor g2.cores8.ram16.disk1120 to correct quotes (T269172)
  • 09:57 arturo: moved cloudvirts 1030, 1029, 1028, 1027, 1026, 1025 away from the 'standard' host aggregate to 'maintenance' (T269172)

2020-12-01

2020-11-30

  • 18:12 andrewbogott: removing all osds from cloudcephosd1015 in order to investigate T268746

2020-11-29

  • 17:18 andrewbogott: cleaning up some logfiles in tools-sgecron-01 — drive is full

2020-11-26

  • 22:58 andrewbogott: deleting /var/log/haproxy logs older than 7 days in cloudcontrol100x. We need log rotation here, it seems (sketch below).
  • 15:53 dcaro: Created private flavor g2.cores8.ram16.disk1120 for wikidumpparse (T268190)
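  A sketch of the age-based deletion described above (the exact path pattern is an assumption):

      find /var/log -name 'haproxy.log*' -mtime +7 -delete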

2020-11-25

  • 19:35 bstorm: repairing ceph pg `instructing pg 6.91 on osd.117 to repair`
  • 09:31 _dcaro: The OSD actually seems to be up and running, though there's that misleading log; will leave it and see if the cluster becomes fully healthy (T268722)
  • 08:54 _dcaro: Unsetting noup/nodown to allow re-shuffling of the pgs that osd.44 had, will try to rebuild it (T268722)
  • 08:45 _dcaro: Tried resetting the class for osd.44 to ssd (root@cloudcephosd1005:/var/lib/ceph/osd/ceph-44# ceph osd crush set-device-class ssd osd.44), no luck; the cluster is in noout/norebalance to avoid data shuffling (opened T268722)
  • 08:19 _dcaro: Restarting the osd.44 service resulted in osd.44 being unable to start due to some config inconsistency (cannot reset class to hdd)
  • 08:16 _dcaro: After enabling auto pg scaling on ceph eqiad cluster, osd.44 (cloudcephosd1005) got stuck, trying to restart the osd service

2020-11-22

  • 17:40 andrewbogott: apt-get upgrade on cloudservices1003/1004
  • 17:32 andrewbogott: upgrading Designate on cloudservices1003/1004 to Stein

2020-11-20

  • 12:44 arturo: [codfw1dev] install conntrackd in cloudnet2003-dev/cloudnet2002-dev to research l3 agent HA reliability
  • 09:26 arturo: icinga downtime labstore1006 RAID checks for 10 days (T268281)

2020-11-17

  • 19:21 andrewbogott: draining cloudvirt1012 to experiment with libvirt/cpu things

2020-11-15

  • 11:21 arturo: icinga downtime cloudbackup2002 for 48h (T267865)

2020-11-10

  • 16:38 arturo: icinga downtime toolschecker for 2h because of toolsdb maintenance (T266587)
  • 11:24 arturo: [codfw1dev] enable puppet in puppetmaster01.cloudinfra-codfw1dev (disabled for unspecified reasons)

2020-11-09

  • 12:42 arturo: restarted neutron l3 agent in cloudnet1003 bc it still had the old default route (T265288)
  • 12:41 arturo: `root@cloudcontrol1005:~# neutron subnet-delete dcbb0f98-5e9d-4a93-8dfc-4e3ec3c44dcc` (T265288)
  • 12:41 arturo: `root@cloudcontrol1005:~# neutron router-gateway-set --fixed-ip subnet_id=7c6bcc12-212f-44c2-9954-5c55002ee371,ip_address=185.15.56.244 cloudinstances2b-gw wan-transport-eqiad` (T265288)
  • 12:19 arturo: subnet 185.15.56.240/29 has id 7c6bcc12-212f-44c2-9954-5c55002ee371 in neutron (T265288)
  • 12:19 arturo: `root@cloudcontrol1005:~# neutron subnet-create --gateway 185.15.56.241 --name cloud-instances-transport1-b-eqiad1 --ip-version 4 --disable-dhcp wan-transport-eqiad 185.15.56.240/29` (T265288)
  • 12:15 arturo: icinga-downtime toolschecker for 2h (T265288)

2020-11-02

  • 13:36 arturo: (typo: dcaro)
  • 13:35 arturo: added dcar as projectadmin & user (T266068)

2020-10-29

  • 16:57 bstorm: silenced deployment-prep project alerts for 60 days since the downtime expired
  • 08:12 arturo: force-powercycling cloudcephosd1006

2020-10-25

  • 16:20 andrewbogott: adding cloudvirt1038 to the 'ceph' aggregate and removing from the 'spare' aggregate. We need this space while waiting on network upgrades for empty cloudvirts (T216195)

2020-10-23

  • 11:30 arturo: [codfw1dev] openstack --os-project-id cloudinfra-codfw1dev recordset create --type PTR --record nat.cloudgw.codfw1dev.wikimediacloud.org. --description "created by hand" 0-29.57.15.185.in-addr.arpa. 1.0-29.57.15.185.in-addr.arpa. (T261724)
  • 10:09 arturo: [codfw1dev] doing DNS changes for the cloudgw PoC, including designate and https://gerrit.wikimedia.org/r/c/operations/dns/+/635965 (T261724)

2020-10-22

  • 10:46 arturo: [codfw1dev] rebooting cloudinfra-internal-puppetmaster-01.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud to try fixing some DNS weirdness
  • 09:43 arturo: enabling puppet in cloudcontrol1003 (message said "please re-enable after 2020-10-22 06:00UTC")

2020-10-21

  • 14:36 andrewbogott: running apt-get update && apt-get install -y facter on all cloud-vps instances
  • 10:31 arturo: [codfw1dev] reimaging labtestvirt2003 (cloudgw) to test puppet code (T261724)
  • 08:56 arturo: [codfw1dev] reimaging labtestvirt2003 (cloudgw) to test puppet code (T261724)

2020-10-20

2020-10-19

  • 01:41 andrewbogott: deleting all Precise base images
  • 01:36 andrewbogott: deleting all unused Jessie base images

2020-10-18

  • 23:26 andrewbogott: deleting all Trusty base images
  • 21:50 andrewbogott: migrating all currently used ceph images to rbd

2020-10-16

  • 09:29 arturo: [codfw1dev] still some DNS weirdness, investigating
  • 09:25 arturo: [codfw1dev] hard-rebooting bastion-codfw1dev-02, seems in bad shape, doesn't even wake up in the virsh console
  • 09:18 arturo: [codfw1dev] live-hacked cloudservices2002-dev /etc/powerdns/recursor.conf file to include cloud-codfw1dev-floating CIDR (185.15.57.0/29) while https://gerrit.wikimedia.org/r/c/operations/puppet/+/634050 is in review, so VMs with a floating IP can query the DNS recursor (T261724)
  • 09:01 arturo: [codfw1dev] basic network connectivity seems stable after cleaning up everything related to address scopes (T261724)

2020-10-15

  • 15:17 arturo: [codfw1dev] try cleaning up anything related to address scopes in the neutron database (T261724)
  • 13:56 arturo: [codfw1dev] drop neutron l3 agent hacks in cloudnet2002/2003-dev (T261724)

2020-10-13

  • 17:54 andrewbogott: rebuilding cloudvirt1021 for backy support
  • 15:22 andrewbogott: draining cloudvirt1021 so I can rebuild it with backy support
  • 14:19 andrewbogott: rebuilding cloudvirt1022 with backy support
  • 14:03 andrewbogott: draining cloudvirt1022 so I can rebuild it with backy support
  • 11:19 arturo: [codfw1dev] rebooting labtestvirt2003

2020-10-09

  • 10:15 arturo: [codfw1dev] root@cloudcontrol2001-dev:~# openstack router set --disable-snat cloudinstances2b-gw --external-gateway wan-transport-codfw (T261724)
  • 09:22 arturo: [codfw1dev] rebooting cloudnet boxes for bridge and vlan changes (T261724)
  • 09:12 arturo: [codfw1dev] root@cloudcontrol2001-dev:~# openstack subnet delete 31214392-9ca5-4256-bff5-1e19a35661de (cloud-instances-transport1-b-codfw - 208.80.153.184/29) (T261724)
  • 09:10 arturo: [codfw1dev] root@cloudcontrol2001-dev:~# openstack router set --external-gateway wan-transport-codfw --fixed-ip subnet=cloud-gw-transport-codfw,ip-address=185.15.57.10 cloudinstances2b-gw (T261724)
  • 08:49 arturo: [codfw1dev] root@cloudcontrol2001-dev:~# openstack subnet create --network wan-transport-codfw --gateway 185.15.57.9 --no-dhcp --subnet-range 185.15.57.8/30 cloud-gw-transport-codfw (T261724)
  • 08:47 arturo: [codfw1dev] root@cloudcontrol2001-dev:~# openstack subnet delete a5ab5362-4ffb-4059-9ff7-391e22dcf3bc (T261724)

2020-10-08

  • 16:17 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# openstack subnet create --network wan-transport-codfw --gateway 185.15.57.8 --no-dhcp --subnet-range 185.15.57.8/31 cloud-gw-transport-codfw` (with a hack -- see task) (T263622)
  • 16:03 arturo: [codfw1dev] briefly live-hacked python3-neutron source code in all 3 cloudcontrol2xxx-dev servers to workaround /31 network definition issue (T263622)
  • 10:28 arturo: [codfw1dev] reimaging labtestvirt2003 (cloudgw) T261724

2020-10-06

  • 21:30 andrewbogott: moved cloudvirt1013 out of the 'ceph' aggregate and into the 'maintenance' aggregate for T243414
  • 21:29 andrewbogott: draining cloudvirt1013 for upgrade to 10G networking
  • 14:45 arturo: icinga downtime every cloud* lab* host for 60 minutes for keystone maintenance

2020-10-05

  • 17:40 bd808: `service uwsgi-labspuppetbackend restart` on cloud-puppetmaster-03 (T264649)

2020-10-02

  • 11:05 arturo: [codfw1dev] restarting rabbitmq-server in all 3 control nodes, the l3 agent was misbehaving
  • 09:16 arturo: [codfw1dev] trying the labtestvirt2003 (cloudgw) reimage again (T261724)

2020-10-01

  • 16:06 arturo: rebooting cloudvirt1024 to validate changes to /etc/network/interfaces file
  • 15:36 arturo: [codfw1dev] reimaging labtestvirt2003

2020-09-30

  • 16:47 andrewbogott: rebooting cloudvirt1032, 1033, 1034 for T262979
  • 13:28 arturo: enable puppet, reboot and pool back cloudvirt1031
  • 13:27 arturo: extend icinga downtimes for another 120 mins
  • 13:15 arturo: `aborrero@cloudcontrol1003:~$ sudo nova-manage placement sync_aggregates` after reading a hint in nova-api.log
  • 13:02 arturo: rebooting cloudvirt1016 and moving it to the ceph host aggregate
  • 12:55 arturo: rebooting cloudvirt1014 and moving it to the ceph host aggregate
  • 12:51 arturo: rebooting cloudvirt1013 and moving it to the ceph host aggregate
  • 12:39 arturo: root@cloudcontrol1005:~# openstack aggregate add host maintenance cloudvirt1031
  • 12:36 arturo: rebooted cloudnet1003 (active) a couple of minutes ago
  • 12:36 arturo: move cloudvirt1012 and cloudvirt1039 to the ceph aggregate
  • 11:49 arturo: rebooting cloudvirt1039
  • 11:46 arturo: rebooting cloudvirt1012
  • 11:40 arturo: rebooting cloudnet1004 (standby) to pick up https://gerrit.wikimedia.org/r/c/operations/puppet/+/631167 (T262979)
  • 11:38 arturo: [codfw1dev] rebooting cloudnet2002-dev to pick up https://gerrit.wikimedia.org/r/c/operations/puppet/+/631167
  • 11:36 arturo: [codfw1dev] rebooting cloudnet2003-dev to pick up https://gerrit.wikimedia.org/r/c/operations/puppet/+/631167
  • 11:33 arturo: disabling puppet and downtiming every virt/net server in the fleet in preparation for merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/631167 (T262979)
  • 09:32 arturo: rebooting cloudvirt1012 to investigate linuxbridge agent issues

2020-09-29

  • 15:40 arturo: downgrade linux kernel from linux-image-4.19.0-11-amd64 to linux-image-4.19.0-10-amd64 on cloudvirt1012
  • 14:47 arturo: rebooting cloudvirt1012, chasing config weirdness in the linuxbridge agent
  • 14:05 andrewbogott: reimaging 1014 over and over in an attempt to get partman right
  • 13:51 arturo: rebooting cloudvirt1012

2020-09-28

  • 14:55 arturo: [jbond42] upgraded facter to v3 across the VM fleet
  • 13:54 andrewbogott: moving cloudvirt1035 from aggregate 'spare' to 'ceph'. We're going to need all the capacity we can get while converting older cloudvirts to ceph

2020-09-24

  • 15:47 arturo: stopping/restarting rabbitmq-server in all cloudcontrol servers
  • 15:45 arturo: restarting rabbitmq-server in cloudcontrol1003
  • 15:15 arturo: restarting floating_ip_ptr_records_updater.service in all 3 cloudcontrol servers to reset state after a DNS failure

2020-09-18

  • 10:16 arturo: cloudvirt1039 libvirtd service issues were fixed with a reboot
  • 09:56 arturo: rebooting cloudvirt1039 (spare) to try to fix some weird libvirtd failure
  • 09:50 arturo: enabling puppet in cloudvirts and effectively merging patches from T262979
  • 08:59 arturo: disable puppet in all buster cloudvirts (cloudvirt[1024,1031-1039].eqiad.wmnet) to merge a patch for T263205 and T262979
  • 08:50 arturo: installing iptables from buster-bpo in cloudvirt1036 (T263205 and T262979)

2020-09-15

  • 20:32 andrewbogott: rebooting cloudvirt1038 to see if it resolves T262979
  • 13:58 andrewbogott: draining cloudvirt1002 with wmcs-ceph-migrate

2020-09-14

  • 14:21 andrewbogott: draining cloudvirt1001, migrating all VMs with wmcs-ceph-migrate
  • 10:41 arturo: [codfw1dev] trying to get the bonding working for labtestvirt2003 (T261724)
  • 09:47 arturo: installed qemu security update in eqiad1 cloudvirts (T262386)
  • 09:43 arturo: [codfw1dev] installed qemu security update in codfw1dev cloudvirts (T262386)

2020-09-09

2020-09-08

  • 21:48 bd808: Renamed FQDN prefixes to wikimedia.cloud scheme in cloudinfra-db01's labspuppet db (T260614)
  • 14:29 andrewbogott: restarting nova-compute on all cloudvirts (everyone is upset from the reset switch failure)
  • 14:18 arturo: restarting nova-fullstack service in cloudcontrol1003
  • 14:17 andrewbogott: stopping apache2 on labweb1001 to make sure the Horizon outage is total

2020-09-03

  • 09:31 arturo: icinga downtime cloud* servers for 30 mins (T261866)

2020-09-02

  • 08:46 arturo: [codfw1dev] reimaging spare server labtestvirt2003 as debian buster (T261724)

2020-09-01

  • 18:18 andrewbogott: adding drives on cloudcephosd100[3-5] to ceph osd pool
  • 13:40 andrewbogott: adding drives on cloudcephosd101[0-2] to ceph osd pool
  • 13:35 andrewbogott: adding drives on cloudcephosd100[1-3] to ceph osd pool
  • 11:27 arturo: [codfw1dev] rebooting again cloudnet2002-dev after some network tests, to reset initial state (T261724)
  • 11:09 arturo: [codfw1dev] rebooting cloudnet2002-dev after some network tests, to reset initial state (T261724)
  • 10:49 arturo: disable puppet in cloudnet servers to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/623569/

2020-08-31

2020-08-28

  • 20:12 bd808: Running `wmcs-novastats-dnsleaks --delete` from cloudcontrol1003

2020-08-26

  • 17:12 bstorm: Running 'ionice -c 3 nice -19 find /srv/tools -type f -size +100M -printf "%k KB %p\n" > tools_large_files_20200826.txt' on labstore1004 T261336

2020-08-21

  • 21:34 andrewbogott: restarting nova-compute on cloudvirt1033; it seems stuck

2020-08-19

  • 14:21 andrewbogott: rebooting cloudweb2001-dev, labweb1001, labweb1002 to address mediawiki-induced memleak

2020-08-06

  • 21:02 andrewbogott: removing cloudvirt1004/1006 from nova's list of hypervisors; rebuilding them to use as backup test hosts
  • 20:06 bstorm: manually stopped the RAID check on cloudcontrol1003 T259760

2020-08-04

  • 18:54 bstorm: restarting mariadb on cloudcontrol1004 to setup parallel replication

2020-08-03

  • 17:02 bstorm: increased the db connection limit to 800 across the galera cluster because we were clearly hovering at the limit
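  A minimal sketch; SET GLOBAL is not replicated by galera, so it would be run on each node (and mirrored in the server config to survive restarts):

      sudo mysql -e 'SET GLOBAL max_connections = 800;'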

2020-07-31

  • 19:28 bd808: wmcs-novastats-dnsleaks --delete (lots of leaked fullstack-monitoring records to clean up)

2020-07-27

  • 22:17 andrewbogott: ceph osd pool set compute pg_num 2048
  • 22:14 andrewbogott: ceph osd pool set compute pg_autoscale_mode off

2020-07-24

  • 19:15 andrewbogott: ceph mgr module enable pg_autoscaler
  • 19:15 andrewbogott: ceph osd pool set compute pg_autoscale_mode on

2020-07-22

  • 08:55 jbond42: [codfw1dev] upgrading hiera to version5
  • 08:48 arturo: [codfw1dev] add jbond as user in the bastion-codfw1dev and cloudinfra-codfw1dev projects
  • 08:45 arturo: [codfw1dev] enabled account creation in labtestwiki briefly for jbond42 to create an account

2020-07-16

2020-07-15

  • 23:15 bd808: Removed Merlijn van Deen from toollabs-trusted Gerrit group (T255697)
  • 11:48 arturo: [codfw1dev] created DNS records (A and PTR) for bastion.bastioninfra-codfw1dev.codfw1dev.wmcloud.org <-> 185.15.57.2
  • 11:41 arturo: [codfw1dev] add myself as projectadmin to the `bastioninfra-codfw1dev` project
  • 11:39 arturo: [codfw1dev] created DNS zone `bastioninfra-codfw1dev.codfw1dev.wmcloud.org.` in the cloudinfra-codfw1dev project and then transfer ownership to the bastioninfra-codfw1dev project

2020-07-14

  • 15:19 arturo: briefly set root@cloudnet1003:~ # sysctl net.ipv4.conf.all.accept_local=1 (in neutron qrouter netns) (T257534)
  • 10:43 arturo: icinga downtime cloudnet* hosts for 30 mins to introduce new check https://gerrit.wikimedia.org/r/c/operations/puppet/+/612390 (T257552)
  • 04:01 andrewbogott: added a wildcard *.wmflabs.org domain pointing at the domain proxy in project-proxy
  • 04:00 andrewbogott: shortened the ttl on .wmflabs.org. to 300

2020-07-13

  • 16:17 arturo: icinga downtime cloudcontrol[1003-1005].wikimedia.org for 1h for galera database movements

2020-07-12

  • 17:39 andrewbogott: switched eqiad1 keystone from m5 to cloudcontrol galera

2020-07-10

  • 20:26 andrewbogott: disabling nova api to move database to galera

2020-07-09

  • 11:23 arturo: [codfw1dev] rebooting cloudnet2003-dev again for testing sysctl/puppet behavior (T257552)
  • 11:11 arturo: [codfw1dev] rebooting cloudnet2003-dev for testing sysctl/puppet behavior (T257552)
  • 09:16 arturo: manually increasing the sysctl value of net.nf_conntrack_max on cloudnet servers (T257552)
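  A sketch of the manual bump (the value is illustrative; net.nf_conntrack_max is the legacy alias of net.netfilter.nf_conntrack_max):

      sysctl net.netfilter.nf_conntrack_max                     # current value
      sudo sysctl -w net.netfilter.nf_conntrack_max=1048576     # raise it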

2020-07-06

  • 15:16 arturo: installing 'aptitude' in all cloudvirts

2020-07-03

  • 12:51 arturo: [codfw1dev] galera cluster should be up and running, openstack happy (T256283)
  • 11:44 arturo: [codfw1dev] restoring glance database backup from bacula into cloudcontrol2001-dev (T256283)
  • 11:39 arturo: [codfw1dev] stopped mysql database in the galera cluster T256283
  • 11:36 arturo: [codfw1dev] dropped glance database in the galera cluster T256283

2020-07-02

  • 15:41 arturo: `sudo wmcs-openstack --os-compute-api-version 2.55 flavor create --private --vcpus 8 --disk 300 --ram 16384 --property aggregate_instance_extra_specs:ceph=true --description "for packaging envoy" bigdisk-ceph` (T256983)

2020-06-29

  • 14:24 arturo: starting rabbitmq-server in all 3 cloudcontrol servers
  • 14:23 arturo: stopping rabbitmq-server in all 3 cloudcontrol servers

2020-06-18

  • 20:38 andrewbogott: rebooting cloudservices2003-dev due to a mysterious 'host down' alert on a secondary ip

2020-06-16

  • 15:38 arturo: created by hand neutron port 9c0a9a13-e409-49de-9ba3-bc8ec4801dbf `paws-haproxy-vip` (T195217)

2020-06-12

  • 13:23 arturo: DNS zone `paws.wmcloud.org` transferred to the PAWS project (T195217)
  • 13:20 arturo: created DNS zone `paws.wmcloud.org` (T195217)

2020-06-11

  • 19:19 bstorm_: proceeding with failback to labstore1004 now that DRBD devices are consistent T224582
  • 17:22 bstorm_: delaying failback labstore1004 for drive syncs T224582
  • 17:17 bstorm_: failing NFS back to labstore1004 to complete the upgrade process T224582
  • 16:15 bstorm_: failing over NFS for labstore1004 to labstore1005 T224582

2020-06-10

  • 16:09 andrewbogott: deleting all old cloud-ns0.wikimedia.org and cloud-ns1.wikimedia.org ns records in designate database T254496

2020-06-09

  • 15:25 arturo: icinga downtime everything cloud* lab* for 2h more (T253780)
  • 14:09 andrewbogott: stopping puppet, all designate services and all pdns services on cloudservices1004 for T253780
  • 14:01 arturo: icinga downtime everything cloud* lab* for 2h (T253780)

2020-06-05

2020-06-04

  • 14:24 andrewbogott: disabling puppet on all instances for /labs/private recovery
  • 14:23 arturo: disabling puppet on all instances for /labs/private recovery

2020-05-28

  • 23:02 bd808: `/usr/local/sbin/maintain-dbusers --debug harvest-replicas` (T253930)
  • 13:36 andrewbogott: rebuilding cloudservices2002-dev with Buster
  • 00:33 andrewbogott: shutting down cloudservices2002-dev to see if we can live without it. This is in anticipation of rebuilding it entirely for T253780

2020-05-27

  • 23:29 andrewbogott: disabling the backup job on cloudbackup2001 (just like last week) so the backup doesn't start while Brooke is rebuilding labstore1004 tomorrow.
  • 06:03 bd808: `systemctl start mariadb` on clouddb1001 following reboot (take 2)
  • 05:58 bd808: `systemctl start mariadb` on clouddb1001 following reboot
  • 05:53 bd808: Hard reboot of clouddb1001 via Horizon. Console unresponsive.

2020-05-25

  • 16:35 arturo: [codfw1dev] created zone `0-29.57.15.185.in-addr.arpa.` (T247972)

2020-05-21

  • 19:23 andrewbogott: disabling puppet on cloudbackup2001 to prevent the backup job from starting during maintenance
  • 19:16 andrewbogott: systemctl disable block_sync-tools-project.service on cloudbackup2001.codfw.wmnet to avoid stepping on current upgrade
  • 15:48 andrewbogott: re-imaging cloudnet1003 with Buster

2020-05-19

  • 22:59 bd808: `apt-get install mariadb-client` on cloudcontrol1003
  • 21:12 bd808: Migrating wcdo.wcdo.eqiad.wmflabs to cloudvirt1023 (T251065)

2020-05-18

  • 21:37 andrewbogott: rebuilding cloudnet2003-dev with Buster

2020-05-15

  • 22:10 bd808: Added reedy as projectadmin in cloudinfra project (T249774)
  • 22:05 bd808: Added reedy as projectadmin in admin project (T249774)
  • 18:44 bstorm_: rebooting cloudvirt-wdqs1003 T252831
  • 15:47 bd808: Manually running wmcs-novastats-dnsleaks from cloudcontrol1003 (T252889)

2020-05-14

  • 23:28 bstorm_: downtimed cloudvirt1004/6 and cloudvirt-wdqs1003 until tomorrow around this time T252831
  • 22:21 bstorm_: upgrading qemu-system-x86 on cloudvirt1006 to backports version T252831
  • 22:15 bstorm_: changing /etc/libvirt/qemu.conf and restarting libvirtd on cloudvirt1006 T252831
  • 21:12 andrewbogott: rebuilding cloudvirt-wdqs1003 as part of T252831
  • 15:47 andrewbogott: moving cloudvirt1004 and cloudvirt1006 to the 'ceph' aggregate for T252784
  • 15:02 andrewbogott: moving all of cloudvirt100[1-9] into the 'toobusy' host aggregate. These are slower, have spinning disks, and are due for replacement.

2020-05-12

  • 20:33 andrewbogott: moving cloudvirt1023 to the 'standard' pool and out of the 'spare' pool
  • 19:10 jeh: disable neutron-openvswitch-agent service on cloudvirt2001-dev.codfw T248881
  • 19:09 jeh: Shutdown the unused eno2 network interface on cloudvirt2001-dev.codfw to clear up monitoring errors T248425
  • 18:20 andrewbogott: moving cloudvirt1024 out of the 'maintenance' aggregate and into 'spare'
  • 16:45 andrewbogott: restarting neutron-l3-agent on cloudnet1004 so it knows about all three cloudcontrols. Leaving cloudnet1003 since restarting it there will cause network interruptions
  • 14:06 arturo: icinga downtime everything for 2h for Debian Buster migration in some cloud components

2020-05-09

  • 16:53 andrewbogott: rebuilding cloudcontrol2001-dev and 2003-dev with buster for T252121

2020-05-08

  • 19:02 bstorm_: moving tools-k8s-haproxy-2 from cloudvirt1021 to cloudvirt1017 to improve spread

2020-05-05

  • 13:58 andrewbogott: rebuilding cloudcontrol2004-dev to test new puppet changes

2020-05-04

  • 09:04 arturo: [codfw1dev] manually modify iptables ruleset to only allow SSH from WMF bastions on cloudservices2003-dev and cloudcontrol2004-dev (T251604)

2020-04-21

  • 22:12 andrewbogott: moving cloudvirt1004 out of the 'standard' aggregate and into the 'maintenance' aggregate
  • 16:01 jeh: restart cloudceph mon and osd services for openssl upgrades

2020-04-15

  • 18:44 jeh: create indexes and views for grwikimedia T245912

2020-04-13

  • 15:07 jeh: restart memcached on labwebs to increase cache size T145703

2020-04-09

  • 19:57 andrewbogott: upgrading eqiad1 designate to rocky
  • 16:52 andrewbogott: cleaned up a bunch of leaked .eqiad.wmflabs dns records

2020-04-08

  • 19:20 andrewbogott: rotated password and api token for pdns servers on cloudservices1003 and cloudservices1004
  • 14:54 arturo: `root@cloudcontrol1003:~# cp /etc/inputrc .inputrc` to solve some bash shortcut weirdness

2020-04-07

  • 20:57 andrewbogott: service sssd stop; rm -rf /var/lib/sss/db*; service sssd start on tools-sgebastion-08

2020-04-06

  • 22:39 andrewbogott: deleting bogus groups cn=b'project-bastion',ou=groups,dc=wikimedia,dc=org and cn=b'project-tools',ou=groups,dc=wikimedia,dc=org from ldap (sketched after this day's entries)
  • 17:42 arturo: [codfw1dev] transferred DNS zone 57.15.185.in-addr.arpa. to the cloudinfra-codfw1dev project (T247972)
  • 17:39 arturo: [codfw1dev] `openstack zone create --email root@wmflabs.org --type PRIMARY --ttl 3600 --description "floating IPs subnet" 57.15.185.in-addr.arpa.` (T247972)
  • 16:23 arturo: restarting apache2 in cloudcontrol1003/1004 to pick up latest wmfkeystonehooks changes T249494
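  A hedged sketch of the 22:39 group deletion above; the LDAP host and bind DN are illustrative, the group DNs are as logged:

      ldapdelete -x -H ldap://<ldap-host> -D '<bind-dn>' -W \
          "cn=b'project-bastion',ou=groups,dc=wikimedia,dc=org" \
          "cn=b'project-tools',ou=groups,dc=wikimedia,dc=org"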

2020-04-02

  • 20:59 jeh: codfw1dev clear VM error states and start bastions, puppet master and database

2020-04-01

  • 16:27 arturo: [codfw1dev] enable puppet across the fleet to clean up vxlan changes (T248881)

2020-03-31

  • 12:35 arturo: [codfw1dev] restarting VMs: designaterockytest14, bastion-codfw1dev-0[1,2] (T248881)
  • 12:34 arturo: [codfw1dev] installing neutron-openvswitch-agent on cloudvirt2001-dev (T248881)
  • 12:25 arturo: [codfw1dev] installing neutron-openvswitch-agent on cloudnet200[2,3]-dev (T248881)
  • 11:45 arturo: [codfw1dev] rebooting cloudvirt2003-dev to pick up latest kernel update. Otherwise modprobe is confused trying to load modules and openvswitch won't start (T248881)
  • 10:40 arturo: [codfw1dev] installing neutron-openvswitch-agent on cloudvirt2003-dev (T248881)
  • 10:09 arturo: [codfw1dev] reboot cloudnet2003-dev into linux 4.9 (was using 4.14 from a testing operation in 2020-03-10)

2020-03-30

2020-03-27

  • 21:28 bd808: Created huggle.wmcloud.org Designate zone and allocated it to the huggle project
  • 19:51 jeh: start haproxy on cloudcontrol2003-dev.wikimedia.org

2020-03-26

  • 15:01 arturo: icinga downtime cloudvirt* cloudcontrol* cloudnet* lab* cloudstore*
  • 15:01 andrewbogott: beginning openstack upgrade window for T242766
  • 12:32 arturo: [codfw1dev] downgraded systemd, libsystemd0, udev and friends to the non-backports versions (T247013)

2020-03-25

  • 19:29 andrewbogott: dumping a bunch of VMs on cloudvirt1015 to see if it still crashes
  • 17:56 jeh: add labweb1002 back into the pool - completed horizon testing T240852
  • 17:09 jeh: depool labweb1002 for horizon testing T240852

2020-03-24

  • 19:41 jeh: switch cloudvirt1016 from maintenance to standard host aggregate T243327
  • 15:31 andrewbogott: restarting nova-conductor and nova-api on cloudcontrol1003 and cloudcontrol1004

2020-03-23

  • 21:41 jeh: restart neutron-l3-agent on cloudnet100[3,4] to pickup policy.yaml changes
  • 13:28 jeh: disable puppet on labweb100[1,2] to enable horizon event traces T240852
  • 10:26 arturo: restarting apache in both labweb1001/labweb1002 upon reports of returning 500s

2020-03-21

  • 14:23 andrewbogott: restarting apache2 on labweb1001 and 1002

2020-03-18

  • 19:17 andrewbogott: deleted a bunch of records from the pdns database on cloudservices1003/1004 which had a record name but whose content (where an IP address should be) was NULL, e.g. m.wikidata.beta.wmflabs.org. (SQL sketched below)
  • 10:55 arturo: [codfw1dev] deleting BGP agent, undoing changes we did for T245606
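  A sketch of the 19:17 cleanup above, assuming the stock pdns gmysql schema (a records table with name/content columns; the database name is an assumption), selecting before deleting:

      sudo mysql pdns -e "SELECT name, type FROM records WHERE content IS NULL;"
      sudo mysql pdns -e "DELETE FROM records WHERE content IS NULL;"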

2020-03-14

  • 17:40 jeh: restart maintain-dbusers on labstore1004 T247654

2020-03-13

2020-03-12

  • 22:29 bstorm_: running puppet across all dumps mounts to make sure active links are shifted to labstore1006

2020-03-11

2020-03-10

  • 17:02 arturo: [codfw1dev] deleting address scopes, bad interaction with our custom NAT setup T247135
  • 13:55 arturo: [codfw1dev] rebooting cloudnet2003-dev into linux kernel 4.14 for testing stuff related to T247135

2020-03-09

  • 18:09 arturo: enabling puppet in cloudvirt1006, all services have been restored
  • 17:59 arturo: deleted the neutron bridge on cloudvirt1006, for testing stuff related to the queens upgrade
  • 17:58 arturo: stopped neutron-linuxbridge-agent and nova-compute in cloudvirt1006 for testing stuff related to the queens upgrade

2020-03-06

  • 14:54 andrewbogott: draining all instances off of cloudvirt1006 for T246908

2020-03-05

  • 14:24 arturo: [codfw1dev] we just enabled BGP session between cloudnet2xxx-dev and cr1-codfw (T245606)
  • 13:07 arturo: [codfw1dev] move the extra IP address for BGP in cloudnet200x-dev servers from eno2.2120 to the br-external bridge device (T245606)
  • 13:06 arturo: [codfw1dev] upgrade neutron-dynamic-routing packages in cloudnet200X-dev and cloudcontrol200X-dev servers to 11.0.0-2~bpo9+1 (T245606)

2020-03-04

  • 22:22 andrewbogott: upgrading designate on cloudservices1003/1004 to Queens
  • 22:09 andrewbogott: moving cloudvirt1006 into the maintenance aggregate for T246908
  • 21:37 bd808: Running wmcs-wikireplica-dns to add service names for ngwikimedia.*.db.svc.eqiad.wmflabs (T240772)
  • 21:14 bd808: Running `sudo maintain-meta_p --all-databases --purge` on labsdb1009 (T246056)
  • 21:11 bd808: Running `sudo maintain-meta_p --all-databases --purge` on labsdb1010 (T246056)
  • 21:08 bd808: Running `sudo maintain-meta_p --all-databases --purge` on labsdb1011 (T246056)
  • 21:05 bd808: Running `sudo maintain-meta_p --all-databases --purge` on labsdb1002 (T246056)

2020-03-02

  • 16:54 arturo: [codfw1dev] deleted the python3-os-ken debian package on cloudnet2003-dev, which was installed by hand and had dependency issues

2020-02-29

  • 16:32 bstorm_: downtimed the smart alert on cloudvirt1009 until Monday since apparently predictive failures flap T244986

2020-02-26

  • 22:03 jeh: powering down cloudvirt1014 for hardware maintenance

2020-02-25

  • 16:08 andrewbogott: changing neutron's rabbitmq password because oslo is having trouble parsing some of the characters in the password
  • 15:26 andrewbogott: updated the cell_mapping record in the nova_api database to add the second rabbitmq server to the transport_url field
  • 15:26 andrewbogott: updated the cell_mapping record in the nova_api database to set the db uri to 'mysql+pymysql' -- this in response to a deprecation notice

2020-02-24

  • 12:16 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# neutron bgp-speaker-peer-add bgpspeaker cr2-codfw` (T245606)
  • 12:16 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# neutron bgp-speaker-peer-add bgpspeaker cr1-codfw` (T245606)
  • 12:09 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# neutron bgp-peer-create --peer-ip 208.80.153.187 --remote-as 65002 cr2-codfw` (T245606)
  • 12:09 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# neutron bgp-peer-create --peer-ip 208.80.153.186 --remote-as 65002 cr1-codfw` (T245606)
  • 12:06 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# neutron bgp-peer-delete 17b8c2a3-f0ce-4d50-a265-18ccac703c61` (T245606)
  • 10:59 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# neutron bgp-speaker-peer-add bgpspeaker bgppeer` (T245606)
  • 10:56 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# neutron bgp-peer-create --peer-ip 208.80.153.185 --remote-as 65002 bgppeer` (T245606)

2020-02-21

  • 12:48 arturo: [codfw1dev] running `root@cloudcontrol2001-dev:~# neutron bgp-speaker-network-add bgpspeaker wan-transport-codfw` (T245606)
  • 12:46 arturo: [codfw1dev] created bgpspeaker for AS64711 (T245606)
  • 12:42 arturo: [codfw1dev] run `sudo neutron-db-manage upgrade head` to upgrade the db schema for neutron bgp tables
  • 11:51 arturo: [codfw1dev] create a neutron subnet pool per each subnet objects we have and manually update DB to inter-associate them (T245606)
  • 11:49 arturo: [codfw1dev] rename neutron address scope `no-nat` to `bgp` (T245606)
  • 11:37 arturo: [codfw1dev] cleanup unused neutron subnet pools from previous address scope tests (T244851)

2020-02-20

  • 19:22 andrewbogott: updating designate pool config for https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/572213/
  • 15:33 andrewbogott: migrating all VMs on cloudvirt1014 to cloudvirt1022
  • 13:35 arturo: [codfw1dev] disable puppet in cloudcontrol servers to hack neutron.conf for tests related to T245606
  • 13:33 arturo: [codfw1dev] disable puppet in cloudnet servers to hack neutron.conf for tests related to T245606

2020-02-18

  • 22:19 andrewbogott: transferred the tools.wmcloud.org. zone to the tools project
  • 22:16 andrewbogott: moved wmcloud.org dns domain to the cloud-infra project
  • 21:02 andrewbogott: adding .eqiad1.wikimedia.cloud records to all existing eqiad1 VMs, updating all eqiad1 internal pointer records to reference the new eqiad1.wikimedia.cloud fqdns.
  • 09:44 arturo: deleted DNS zone wmcloud.org and tried re-creating it

2020-02-14

  • 10:35 arturo: running `root@cloudcontrol2001-dev:~# designate server-create --name ns1.openstack.codfw1dev.wikimediacloud.org.` (T243766)
  • 10:32 arturo: running `root@cloudcontrol1004:~# designate server-create --name ns1.openstack.eqiad1.wikimediacloud.org.` (T243766)
  • 10:32 arturo: running `root@cloudcontrol1004:~# designate server-create --name ns0.openstack.eqiad1.wikimediacloud.org.` (T243766)

2020-02-12

  • 13:38 arturo: [codfw1dev] add reference to subnetpool to the instance subnet `MariaDB [neutron]> update subnets set subnetpool_id='d129650d-d4be-4fe1-b13e-6edb5565cb4a' where id = '7adfcebe-b3d0-4315-92fe-e8365cc80668';` (T244851)

2020-02-11

  • 13:46 arturo: [codfw1dev] creating some neutron objects to investigate T244851 (subnets, subnet pools, address scopes, ...)
  • 12:40 arturo: [codfw1dev] delete unknown address scope 'wmcs-v4-scope': `root@cloudcontrol2001-dev:~# openstack address scope delete 078cfd71-117b-4aac-9197-6ebbbb7dd3de` (T244851)
  • 12:40 arturo: [codfw1dev] delete unknown subnet pool 'cloudinstancesb-v4-pool0': `root@cloudcontrol2001-dev:~# openstack subnet pool delete d23a9b88-5c3d-4a53-ab88-053233a75365` (T244851)

2020-02-07

  • 18:11 jeh: shutdown cloudvirt1016 for hardware maintenance T241882

2020-02-06

  • 14:44 jeh: update apt packages on cloudvirt1015 T220853
  • 14:28 jeh: run hardware tests on cloudvirt1015 T220853

2020-01-28

  • 17:24 arturo: [codfw1dev] root@cloudcontrol2001-dev:~# designate server-create --name ns0.openstack.codfw1dev.wikimediacloud.org. (T243766)
  • 10:18 arturo: [codfw1dev] created DNS record `bastion-codfw1dev-01.codfw1dev.wmcloud.org A 185.15.57.2` (T242976, T229441)
  • 10:13 arturo: [codfw1dev] the zone `codfw1dev.wmcloud.org` belongs now to the `cloudinfra-codfw1dev` project (T242976)
  • 10:11 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# openstack zone create --description "main DNS domain for public addresses" --email "root@wmflabs.org" --type PRIMARY --ttl 3600 codfw1dev.wmcloud.org.` (T242976 and T243766)
  • 09:53 arturo: restart apache2 in labweb1001/1002 because horizon errors
  • 09:47 arturo: created DNS zone wmcloud.org in eqiad1 and transferred it to the cloudinfra project (T242976); right now its only use is delegating the codfw1dev.wmcloud.org subdomain to designate in the other deployment

2020-01-27

  • 12:45 arturo: [codfw1dev] manually move the new domain to the `cloudinfra-codfw1dev` project clouddb2001-dev: `[designate]> update zones set tenant_id='cloudinfra-codfw1dev' where id = '4c75410017904858a5839de93c9e8b3d';` T243556
  • 12:44 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# openstack zone create --description "main DNS domain for VMs" --email "root@wmflabs.org" --type PRIMARY --ttl 3600 codfw1dev.wikimedia.cloud.` T243556

2020-01-24

  • 15:10 jeh: remove icinga downtime for cloudvirt1013 T241313
  • 12:52 arturo: repooling cloudvirt1013 after HW got fixed (T241313)

2020-01-21

  • 17:43 bstorm_: remounting /mnt/nfs/dumps-labstore1007.wikimedia.org/ on all dumps-mounting projects
  • 10:24 arturo: running `sudo systemctl restart apache2.service` in both labweb servers to try mitigating T240852

2020-01-15

  • 16:59 bd808: Changed the config for the cloud-announce mailing list so that list admins do not get bounce unsubscribe notices

2020-01-14

  • 14:03 arturo: icinga downtime all cloudvirts for another 2h for fixing some icinga checks
  • 12:04 arturo: icinga downtime toolchecker for 2 hours for openstack upgrades T241347
  • 12:02 arturo: icinga downtime cloud* labs* hosts for 2 hours for openstack upgrades T241347
  • 04:26 andrewbogott: upgrading designate on cloudservices1003/1004

2020-01-13

  • 13:46 arturo: [codfw1dev] prevent neutron from allocating floating IPs from the wrong subnet by doing `neutron subnet-update --allocation-pool start=208.80.153.190,end=208.80.153.190 cloud-instances-transport1-b-codfw` (T242594)

2020-01-10

  • 13:27 arturo: cloudvirt1009: virsh undefine i-000069b6. This is tools-elastic-01 which is running on cloudvirt1008 (so, leaked on cloudvirt1009)

2020-01-09

  • 11:12 arturo: running `MariaDB [nova_eqiad1]> update quota_usages set in_use='0' where project_id='etytree';` (T242332)
  • 11:11 arturo: running `MariaDB [nova_eqiad1]> select * from quota_usages where project_id = 'etytree';` (T242332)
  • 10:32 arturo: ran `root@cloudcontrol1004:~# nova-manage project quota_usage_refresh --project etytree`

2020-01-08

  • 10:53 arturo: icinga downtime all cloudvirts for 30 minutes to re-create all canary VMs

2020-01-07

  • 11:12 arturo: icinga-downtime everything cloud* for 30 minutes to merge nova scheduler changes
  • 10:02 arturo: icinga downtime cloudvirt1009 for 30 minutes to re-create canary VM (T242078)

2020-01-06

  • 13:45 andrewbogott: restarting nova-api and nova-conductor on cloudcontrol1003 and 1004

2020-01-04

  • 16:34 arturo: icinga downtime cloudvirt1024 for 2 months because of hardware errors (T241884)

2019-12-31

  • 11:46 andrewbogott: I couldn't!
  • 11:40 andrewbogott: restarting cloudservices2002-dev to see if I can reproduce an issue I saw earlier

2019-12-24

  • 15:13 arturo: icinga downtime all the lab* fleet for nova password change for 1h
  • 14:39 arturo: icinga downtime all the cloud* fleet for nova password change for 1h

2019-12-23

  • 11:13 arturo: enable puppet in cloudcontrol1003/1004
  • 10:40 arturo: disable puppet in cloudcontrol1003/1004 while doing changes related to python-ldap

2019-12-22

  • 23:48 andrewbogott: restarting nova-conductor and nova-api on cloudcontrol1003 and 1004
  • 09:45 arturo: cloudvirt1013 is back (it recovered on its own) T241313
  • 09:37 arturo: cloudvirt1013 is down for good. Apparently powered off. I can't even reach it via iLO

2019-12-20

  • 12:43 arturo: icinga downtime cloudmetrics1001 for 128 hours

2019-12-18

  • 12:55 arturo: [codfw1dev] created a new subnet neutron object to hold the new CIDR for floating IPs (cloud-codfw1dev-floating - 185.15.57.0/29) T239347

2019-12-17

  • 07:21 andrewbogott: deploying horizon/train to labweb1001/1002

2019-12-12

  • 06:11 arturo: schedule 4h downtime for labstores
  • 05:57 arturo: schedule 4h downtime for cloudvirts and other openstack components due to upgrade ops

2019-12-02

  • 06:28 andrewbogott: running nova-manage db sync on eqiad1
  • 06:27 andrewbogott: running nova-manage cell_v2 map_cell0 on eqiad1

2019-11-21

  • 16:07 jeh: created replica indexes and views for szywiki T237373
  • 15:48 jeh: creating replica indexes and views for shywiktionary T238115
  • 15:48 jeh: creating replica indexes and views for gcrwiki T238114
  • 15:46 jeh: creating replica indexes and views for minwiktionary T238522
  • 15:36 jeh: creating replica indexes and views for gewikimedia T236404
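
For each newly created wiki, the replica indexes and views are materialized by the maintenance scripts on the wiki-replicas hosts. A sketch per the usual runbook; the script names and flags here are assumptions rather than quotes from this log:

    # On each wiki-replicas host, for a newly created wiki (illustrative flags)
    sudo maintain-replica-indexes --database szywiki
    sudo maintain-views --databases szywiki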

2019-11-18

  • 19:27 andrewbogott: repooling labsdb1011
  • 18:54 andrewbogott: running maintain-views --all-databases --replace-all --clean on labsdb1011 T238480
  • 18:44 andrewbogott: depooling labsdb1011 and killing remaining user queries T238480
  • 18:42 andrewbogott: repooled labsdb1009 and 1010 T238480
  • 18:19 andrewbogott: running maintain-views --all-databases --replace-all --clean on labsdb1010 T238480
  • 18:18 andrewbogott: depooling labsdb1010, killing remaining user queries
  • 17:46 andrewbogott: running maintain-views --all-databases --replace-all --clean on labsdb1009 T238480
  • 17:38 andrewbogott: depooling labsdb1009, killing remaining user queries
  • 16:54 andrewbogott: running maintain-views --all-databases --replace-all --clean on labsdb1012 T237509
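
The pattern in this block is a rolling rebuild: depool one replica, regenerate the views, repool, then move to the next host. A sketch of that loop; the depool/repool steps are deployment-specific placeholders, and only the maintain-views flags come from the log:

    for host in labsdb1009 labsdb1010 labsdb1011; do
        depool "$host"    # placeholder: remove the host from service first
        ssh "$host" sudo maintain-views --all-databases --replace-all --clean
        repool "$host"    # placeholder: return the host to service
    done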

2019-11-15

  • 20:04 andrewbogott: repool labsdb1011 (T237509)
  • 19:29 andrewbogott: running maintain-views --all-databases --replace-all --clean on labsdb1011
  • 19:25 andrewbogott: depooling labsdb1011, killing remaining queries
  • 19:25 andrewbogott: repooling labsdb1010
  • 18:59 andrewbogott: running maintain-views --all-databases --replace-all --clean on labsdb1012
  • 18:57 andrewbogott: running maintain-views --all-databases --replace-all --clean on labsdb1010
  • 18:54 andrewbogott: depooling labsdb1010, killing remaining user queries
  • 18:54 andrewbogott: depooled labsdb1009, ran maintain-views --clean --all-databases --replace-all, repooled

2019-11-11

  • 13:10 arturo: cloudweb2001-dev: disable puppet and redirect stderr in the loadExitNodes.php cron script to prevent cronspam while we investigate the cause of the issue (T237971)
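
Silencing cronspam this way simply discards the job's stderr until the root cause is found. An illustrative crontab line (the path and schedule are placeholders, not the actual cron entry):

    # /etc/cron.d/loadexitnodes (illustrative): drop stderr so transient
    # errors stop generating cron mail while the issue is investigated
    */15 * * * * root /usr/bin/php /srv/app/loadExitNodes.php 2>/dev/null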

2019-11-05

  • 11:59 arturo: icinga downtime for 1h cloudcontrol1004, cloudnet1003, cloudvirt1017/1020/1022 for PDU operations in the rack T227542

2019-11-04

  • 21:55 andrewbogott: deleting a ton of wikitech hiera pages that were either no-ops or refer to nonexistent VMs or prefixes

2019-10-31

  • 11:01 arturo: icinga-downtimed cloudvirt1030 and cloudservices1003 for 1h due to PDU upgrade operations T227543

2019-10-30

  • 22:43 jeh: reboot cloud-bootstrapvz-stretch to resolve bad bootstrapvz build

2019-10-29

  • 10:52 arturo: icinga downtime cloudvirt1001/1002/1024/1018/1012/1009/1015/1008 for 1h T227538

2019-10-25

  • 10:45 arturo: icinga downtime toolschecker for 1h to upgrade clouddb1002 mariadb (toolsdb secondary) (T236384, T236420)

2019-10-24

  • 12:30 arturo: starting cloudvirt1019, PDU operations ended (T227540)
  • 11:58 arturo: icinga downtime for 2h (T227540) cloudvirt1019
  • 11:15 arturo: poweroff cloudvirt1019 during the PDU operations (T227540)
  • 11:10 arturo: icinga downtime for 2h (T227540) toolschecker
  • 10:58 arturo: icinga downtime for 1h (T227540) cloudvirt100[3-7], cloudvirt1019, cloudvirt1016, cloudvirt1021, cloudvirt1013, cloudnet1004

2019-10-23

  • 09:23 arturo: cloudvirt1026 reboot ended OK
  • 09:12 arturo: rebooting cloudvirt1026 for kernel upgrade
  • 09:09 arturo: cloudvirt1025 reboot ended OK
  • 09:00 arturo: rebooting cloudvirt1025 for kernel upgrade
  • 08:51 arturo: icinga downtime cloudvirt1025/1026 for reboots

2019-10-18

  • 16:01 arturo: created the `eqiad1.wikimedia.cloud` DNS zone (T235846)
  • 14:27 andrewbogott: deleted a bunch of leaked VMS from earlier today from the admin-monitoring project. Fullstack leaks due to an api outage, maybe?
  • 10:44 arturo: doubled max_message_size from 40KB to 80KB on the cloud-admin mailing list; a simple email with a couple of quoted replies can go over the 40KB limit
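
In Mailman 2, max_message_size is expressed in KB and can be changed offline with config_list. A sketch (the list name comes from the entry above; the file path is illustrative):

    config_list -o /tmp/cloud-admin.py cloud-admin    # dump current settings
    sed -i 's/^max_message_size = 40/max_message_size = 80/' /tmp/cloud-admin.py
    config_list -i /tmp/cloud-admin.py cloud-admin    # load them back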

2019-10-16

  • 21:59 jeh: resync wiki replica tool and user accounts T235697
  • 09:40 arturo: reboot of cloudvirt1030 went fine
  • 09:28 arturo: reboot of cloudvirt1029 went fine
  • 09:28 arturo: rebooting cloudvirt1030 for kernel updates
  • 09:12 arturo: rebooting cloudvirt1029 for kernel updates
  • 09:11 arturo: reboot of cloudvirt1028 went fine
  • 09:00 arturo: rebooting cloudvirt1028 for kernel updates
  • 08:56 arturo: icinga downtime cloudvirt[1028-1030].eqiad.wmnet for 1h for reboots

2019-10-15

  • 13:30 jeh: creating indexes and views for banwiki T234770

2019-10-10

  • 18:55 bd808: Created indexes and views for nqowiki (T230543)
  • 11:59 arturo: network switch hardware is down, affecting cloudvirt1025/1026 (T227536); VMs are supposedly online but unreachable

2019-10-09

  • 10:44 arturo: cloudvirt1013 rebooted well
  • 10:32 arturo: cloudvirt1013 is rebooting
  • 10:32 arturo: cloudvirt1012 rebooted just fine (very slow, 35 VMs)
  • 10:21 arturo: cloudvirt1012 is rebooting
  • 10:19 arturo: cloudvirt1009 rebooted just fine (very slow though)
  • 10:07 arturo: cloudvirt1009 is rebooting
  • 10:06 arturo: cloudvirt1008 rebooted just fine (very slow though)
  • 09:58 arturo: cloudvirt1008 is rebooting
  • 09:52 arturo: icinga downtime toolschecker, paws, etc. for 2h because of cloudvirt reboots

2019-10-07

  • 14:07 arturo: horizon is disabled for maintenance (T212302)
  • 14:00 arturo: starting scheduled maintenance: upgrading eqiad1 from openstack mitaka to newton

2019-10-02

  • 15:23 arturo: codfw1dev renaming net/subnet objects to a more modern naming scheme T233665
  • 12:49 arturo: codfw1dev delete all floating ip allocations in the deployment for mangling the network config for testing T233665
  • 12:47 arturo: codfw1dev deleting all VMs in the deployment for mangling the network config for testing T233665
  • 11:08 arturo: codfw1dev rebooting cloudnet2002-dev and cloudnet2003-dev for testing T233665
  • 10:31 arturo: codfw1dev: add cloudinstances2b-gw router to the l3 agent in cloudnet2003-dev
  • 09:59 arturo: codfw1dev: cleanup leftover "HA port tenant admin" in neutron (ports from missing servers)
  • 09:46 arturo: codfw1dev: cleanup leftover neutron agents

2019-09-30

  • 10:21 arturo: we installed ferm in every VM by mistake. Deleting it and forcing a puppet agent run to try to go back to a clean state.
  • 09:38 arturo: downtime toolschecker for 24h
  • 09:33 arturo: force update ferm cloud-wide (in all VMs) for T153468

2019-08-18

  • 10:39 arturo: rebooting cloudvirt1023 for new interface names configuration
  • 10:34 arturo: downtimed cloudvirt1023 for 2 days

2019-08-05

  • 17:17 bd808: Set downtime on gridengine and kubernetes webservice checks in icinga until 2019-09-02 (flaky tests)

2019-07-29

  • 20:14 bd808: Restarted maintain-kubeusers on tools-k8s-master-01 (T194859)

2019-07-25

  • 12:32 arturo: eqiad1/glance: debian-9.9-stretch image deprecates debian-9.8-stretch (T228983)
  • 09:59 arturo: (codfw1dev) drop missing glance images (T228972)
  • 09:32 arturo: (codfw1dev) deleting a bunch of VMs that were running in now missing hypervisors
  • 09:31 arturo: (codfw1dev) deleting a bunch of VMs in ERROR and SHUTDOWN state
  • 09:27 arturo: last log entry refers to the codfw1dev deployment
  • 09:27 arturo: cleanup `nova service-list` from old hypervisors (labtest*)
  • 09:23 arturo: refreshed nova DB grants in clouddb2001-dev for the codfw1dev deployment
  • 08:47 arturo: cleanup the cloud-announce pending emails (spam)

2019-07-23

  • 19:43 andrewbogott: restarting rabbitmq-server on cloudcontrol1003 and 1004

2019-07-22

  • 23:44 bd808: Restarted maintain-kubeusers on tools-k8s-master-01 (T228529)

2019-07-11

  • 22:07 bd808: Ran `sudo systemctl stop designate_floating_ip_ptr_records_updater.service` on cloudcontrol1003
  • 22:01 bd808: `sudo apt-get install python2.7-dbg` on cloudcontrol1003 to debug a hung python process (sketch below)
  • 21:48 bd808: Ran `sudo systemctl stop designate_floating_ip_ptr_records_updater.service` on cloudcontrol1004
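
With python2.7-dbg installed, a hung CPython process can be inspected in place through gdb's Python macros. A sketch (PID is a placeholder for the stuck updater's process id):

    gdb /usr/bin/python2.7 -p "$PID"   # attach to the hung process
    # then at the (gdb) prompt:
    #   py-bt                  # Python-level backtrace (needs the dbg symbols/macros)
    #   thread apply all bt    # C-level backtraces for every thread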

2019-06-25

  • 16:05 bstorm_: updated python3.4 to update4 wherever it was installed on Jessie VMs to prevent issues with broken update3.
  • 14:56 bstorm_: Updated python 3.4 on the labs-puppetmaster server

2019-06-03

  • 15:55 arturo: T221769 rebooting cloudservices1003 after bootstrapping is apparently completed

2019-05-28

  • 21:42 bstorm_: unmounting labstore1003-scratch on all cloud clients
  • 18:14 bstorm_: T209527 switched mounts from labstore1003 to cloudstore1008 for scratch

2019-05-20

  • 17:25 arturo: T223923 dropped compat-network config from /etc/network/interfaces in eqiad1/codfw1dev neutron nodes
  • 17:22 arturo: T223923 dropped br-compat bridges and vlan interfaces (1102 and 2102) in eqiad1/codfw1dev neutron nodes
  • 17:07 arturo: T223923 dropped compat-network configuration from the neutron database in eqiad1
  • 16:55 arturo: T223923 dropped compat-network configuration from the neutron database in codfw1dev

2019-05-15

  • 17:00 andrewbogott: touching /root/firstboot_done on all VMs that cumin can reach. This will prevent firstboot.sh from running a second time if/when any of these are rebooted. T223370
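
A sketch of that fleet-wide touch via cumin (the 'A:all' alias is an assumption about the local cumin config; any alias covering all reachable VMs would do):

    sudo cumin 'A:all' 'touch /root/firstboot_done'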

2019-04-26

  • 15:51 arturo: andrew updated dns servers for the cloud-instances2-b-eqiad subnet in neutron: 208.80.154.143 and 208.80.154.24

2019-04-25

  • 11:14 arturo: T221760 increased size of conntrack table
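
The conntrack table size is a sysctl knob; an illustrative version of the change (the value actually deployed is not recorded in this entry):

    sysctl -w net.netfilter.nf_conntrack_max=1048576
    # persist it, e.g. in /etc/sysctl.d/90-conntrack.conf:
    #   net.netfilter.nf_conntrack_max = 1048576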

2019-04-24

  • 12:54 arturo: T220051 puppet broken in every VM in Cloud VPS, fixing right now

2019-04-22

  • 11:14 arturo: create by hand /var/cache/labsaliaser/labs-ip-aliases.json in cloudservices2002-dev (T218575)

2019-04-16

  • 22:55 bd808: cloudcontrol2003-dev: added `exit 0` to /etc/cron.hourly/keystone to stop cron spam on partially configured cluster
  • 12:08 arturo: rebooting cloudvirt200[123]-dev because deep changes in config
  • 11:27 arturo: T219626 add DB grants for neutron and glance to clouddb2001-dev (codfw1dev)
  • 10:37 arturo: T219626 replace 208.80.153.75 with 208.80.153.59 in the clouddb2001-dev database (codfw1dev deployment)
  • 10:30 arturo: T219626 replace labtestcontrol2003 with cloudcontrol2001-dev in the clouddb2001-dev database (codfw1dev deployment)

2019-04-15

  • 13:08 arturo: T219626 add DB grants for keystone/nova/nova_api to clouddb2001-dev (codfw1dev)

2019-04-13

  • 18:25 bd808: Restarted nova-compute service on cloudvirt1015 (T220853)

2019-04-11

  • 12:00 arturo: T151704 deploying oidentd to cloudnet1xxx servers

2019-04-02

  • 19:52 andrewbogott: installed a new base Stretch image; it has updated packages and runs apt-get dist-upgrade on first boot

2019-03-29

  • 14:34 andrewbogott: moving tools-static.wmflabs.org to point to tools-static-13 in eqiad1-r
  • 00:00 bstorm_: T193264 Added osm.db.svc.eqiad.wmflabs to cloud DNS

2019-03-25

  • 00:40 bd808: Restarted maintain-dbusers on labstore1004. Process hung up on failed LDAP connection.

2019-03-21

  • 19:32 andrewbogott: restarting keystone on cloudcontrol1003

2019-03-15

  • 16:00 gtirloni: increased nscd cache size (T217280)
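
nscd's per-database cache sizes live in /etc/nscd.conf; an illustrative tweak (the values here are placeholders, not the ones deployed for T217280):

    # /etc/nscd.conf (excerpt; illustrative values)
    suggested-size  passwd  1009       # hash table size, ideally a prime
    max-db-size     passwd  67108864   # cache size cap in bytes
    # apply with: systemctl restart nscd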

2019-03-14

  • 19:04 gtirloni: bstorm started nfsd on labstore1006 (T218341)
  • 16:42 gtirloni: published new debian-9.8 image (T218314)

2019-03-04

  • 19:37 bstorm_: umounted /mnt/nfs/dumps-labstore1006.wikimedia.org across all VPS projects for T217473

2019-02-26

  • 12:46 gtirloni: shut down toolsbeta-sgegrid-master (cronspam)

2019-02-25

  • 10:32 gtirloni: restarted nfsd on labstore1004

2019-02-21

  • 09:09 gtirloni: restarted uwsgi-labspuppetbackend.service on labpuppetmaster1001
  • 07:42 gtirloni: created project cloudstore
  • 07:36 gtirloni: deleted wmcs-nfs project

2019-02-20

  • 21:58 andrewbogott: silencing shinken and disabling puppet on shinken-02 for now

2019-02-19

  • 12:00 gtirloni: added nagios@icinga2001.wikimedia.org to cloud-admin-feed@ allowed senders

2019-02-18

  • 20:21 gtirloni: downtimed cloudvirt1020
  • 20:12 gtirloni: ran `labs-ip-alias-dump.py` on cloudservices/labservices servers

2019-02-15

  • 13:10 arturo: T216239 labvirt1019 has been drained
  • 12:22 arturo: T216239 draining labvirt1009 with a command like this: `root@cloudcontrol1004:~# wmcs-cold-migrate --region eqiad --nova-db nova 2c0cf363-c7c3-42ad-94bd-e586f2492321 labvirt1001`
  • 12:02 arturo: more nova service cleanups in the database (labvirts that were reallocated to eqiad1)
  • 11:34 arturo: T216190 cleanup from nova database `nova service-delete 35`
  • 03:50 andrewbogott: updated VPS base images for Jessie and Stretch, now featuring Stretch 9.7

2019-02-11

  • 18:13 gtirloni: cleaned old metrics data in labmon1001 T215417
  • 15:28 gtirloni: running `maintain-views --all-databases --replace-all` on labsdb1011
  • 14:18 gtirloni: running `maintain-views --all-databases --replace-all` on labsdb1010

2019-02-08

  • 14:56 gtirloni: running `maintain-views --all-databases --replace-all` on labsdb1009

2019-02-06

  • 11:47 gtirloni: downtimed labmon100{1,2} T215399
  • 00:17 bstorm_: T214106 deleted bstorm-test2 project to clean up

2019-02-05

  • 10:48 arturo: labmon1001 is now part of the 'eqiad1-r' region

2019-02-01

  • 09:54 arturo: moving canary1015-01 VM instance from cloudvirt1024 back to cloudvirt1015

2019-01-31

  • 12:44 arturo: T215012 depooling cloudvirt1015 and migrating all VMs to cloudvirt1024

2019-01-25

  • 20:11 gtirloni: deleted project yandex-proxy T212306

2019-01-24

  • 11:50 arturo: T213925 modify subnet cloud-instances-transport1-b-eqiad1 to avoid floating IP allocations from here
  • 11:07 arturo: T214299 failover cloudnet1003 to cloudnet1004
  • 10:03 arturo: T214299 reimage cloudnet1004 to debian stretch
  • 09:51 arturo: T214299 failover cloudnet1004 to cloudnet1003

2019-01-22

  • 19:19 arturo: T214299 stretch cloudnet1003 is apparently all set
  • 18:40 arturo: T214299 manually delete cloudnet1003's agents from neutron (they must be added again after reimage, with new uuids)
  • 18:37 arturo: T214299 reimaging cloudnet1003 as debian stretch
  • 17:35 jbond42: starting roll out of apt package updates to
  • 14:41 gtirloni: T214369 deployed new jessie and stretch VM images

2019-01-21

  • 18:29 gtirloni: installed libguestfs-tools on cloudvirt1021

2019-01-16

  • 14:21 andrewbogott: stopping old VPS proxies in eqiad — T213540

2019-01-15

  • 14:20 andrewbogott: changing tools.wmflabs.org to point to tools-proxy-03 in eqiad1

2019-01-13

  • 20:00 andrewbogott: VPS proxies are now running in eqiad1 on proxy-01. Old VMs will wait a bit for deletion. T213540
  • 19:12 andrewbogott: moving the VPS proxy API backend to proxy-01.project-proxy.eqiad.wmflabs, as per T213540
  • 17:11 andrewbogott: moving all VPS dynamic proxies to proxy-eqiad1.wmflabs.org aka proxy-01.project-proxy.eqiad.wmflabs, as per T213540

2019-01-09

  • 22:21 bd808: neutron quota-update --tenant-id tools --port 256

2019-01-08

  • 18:59 bd808: Definitely did NOT delete uid=novaadmin,ou=people,dc=wikimedia,dc=org
  • 18:59 bd808: Deleted LDAP user uid=neutron,ou=people,dc=wikimedia,dc=org
  • 18:58 bd808: Deleted LDAP user uid=novaadmin,ou=people,dc=wikimedia,dc=org

2019-01-06

  • 22:03 bd808: Set floatingip quota of 60 for tools project in eqiad1-r region (T212360)

2018-12-20

  • 17:10 arturo: T207663 renumbered transport network in eqiad1

2018-12-05

  • 17:59 arturo: T207663 changed labtestn transport network addressing from private to public

2018-12-03

  • 13:25 arturo: T202886 create again PTR records after dnsleak.py fix

2018-11-30

  • 14:08 arturo: running dns leaks cleanup `root@cloudcontrol1003:~# /root/novastats/dnsleaks.py --delete`

2018-11-28

  • 17:33 gtirloni: deleted contintcloud project (T209644)

2018-11-27

  • 13:32 gtirloni: enabled DRBD stats collection on labstore100[4-5] T208446

2018-11-22

  • 07:12 gtirloni: deployed new debian-9.6-stretch image

2018-11-21

  • 10:48 arturo: re-created compat-net as not shared in labtestn to test stuff related to T209954

2018-11-16

  • 12:43 gtirloni: armed keyholder on labpuppetmaster1001/1002 after reboots
  • 12:08 gtirloni: rebooted labpuppetmaster1001 (T207377)
  • 11:57 gtirloni: rebooted labpuppetmaster1002 (T207377)

2018-11-14

  • 17:19 gtirloni: added cloudvirt1016 to scheduler pool (T209426)
  • 15:41 gtirloni: reimaging labvirt1016 as cloudvirt1016
  • 15:14 gtirloni: reset-failed systemd unit nova-scheduler on cloudcontrol1004
  • 13:52 gtirloni: rebooted labservices1002 after package upgrades (T207377)
  • 13:23 gtirloni: rebooted labstore2004 after package upgrades (T207377)
  • 13:20 gtirloni: rebooted labstore2003 after package upgrades (T207377)
  • 13:20 gtirloni: rebooted labstore2001/labstore2003 after package upgrades (T207377)
  • 12:08 gtirloni: rebooted labnet1002 after package upgrades
  • 12:01 gtirloni: rebooted labmon1002 after package upgrades
  • 11:41 gtirloni: rebooted labcontrol1002 after package upgrades
  • 11:15 gtirloni: rebooted cloudcontrol1004 after package upgrades

2018-11-09

  • 18:17 gtirloni: restarted neutron-linuxbridge-agent on cloudvirt1018/1023

2018-11-08

  • 11:00 gtirloni: Added novaproxy-02 to $CACHES
  • 10:50 gtirloni: Added cloudvirt1017 to eqiad1 region

2018-11-07

  • 13:49 arturo: T208733 moving labvirt1017 from main deployment to eqiad1 and renaming it to cloudvirt1017

2018-10-22

  • 16:24 arturo: T206261 another update to dmz_cidr in eqiad1
  • 10:26 arturo: change again in dmz_cidr in eqiad1: VMs will connect between them without NAT even when using floating IPs (T206261)

2018-10-19

  • 12:02 arturo: revert change in dmz_cidr in eqiad1 for now (T206261)
  • 11:16 arturo: change in dmz_cidr in eqiad1: VMs will connect between them without NAT even when using floating IPs (T206261)
  • 10:14 arturo: we have new virt servers in the eqiad1 deployment since last week and this week: cloudvirt1018, cloudvirt1023, cloudvirt1024

2018-09-26

  • 10:40 arturo: T205524 all sorts of restarts in all neutron daemons
  • 10:20 arturo: T205524 stop/start all neutron agents in cloudnet1003.eqiad.wmnet
  • 10:13 arturo: T205524 restart all agents in cloudnet1004.eqiad.wmnet
  • 10:10 arturo: restart neutron-server in cloudcontrol1003, investigating T205524

2018-09-24

  • 10:57 arturo: trying to increase the floating IP allocation pool in eqiad1: of 185.15.56.0/25 we were using only 185.15.56.10-185.15.56.31 (unclear why); expanding to 185.15.56.2-185.15.56.126
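
Widening the pool uses the same subnet-update form that appears in other entries of this log; a sketch with the range from the entry above (the subnet name is a placeholder):

    neutron subnet-update \
        --allocation-pool start=185.15.56.2,end=185.15.56.126 \
        cloud-instances-floating-v4    # placeholder subnet name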

2018-09-21

  • 17:18 bd808: Running `sudo maintain-meta_p --all-databases --purge` across labsdb10(09|10|11) for T201890

2018-09-17

  • 22:08 bd808: Granted gtirloni project roles of admin, projectadmin, and user

2018-09-12

  • 11:20 arturo: T202636 distributing default routes using classless-static-route for all VMs in main/labtest (dnsmasq/nova-network)
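
Distributing default routes via classless-static-route means DHCP option 121, which dnsmasq can serve directly. An illustrative dnsmasq line (the addresses are placeholders, not the deployment's):

    # dnsmasq: advertise a default route via option 121 (clients that request
    # option 121 ignore the plain router option 3)
    dhcp-option=121,0.0.0.0/0,10.68.16.1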

2018-09-11

  • 16:52 arturo: again, restarted nova-network after killing all dnsmasq procs in labnet1001 for T202636
  • 16:08 arturo: restarted nova-network after killing all dnsmasq procs in labnet1001 for T202636
  • 10:53 arturo: T202636 creating all the compat-network configuration in neutron
  • 10:36 arturo: T202636 creating br-compat bridge in eqiad1 for the compat network
  • 10:33 arturo: T202636 manually reserve 10.68.23.253 (in nova-network)

2018-09-10

  • 22:46 andrewbogott: deleting all VMs on labvirt1019 and 1020 as prep for T204003

2018-08-30

  • 15:46 andrewbogott: restarting rabbitmq-server on cloudcontrol1003
  • 13:07 arturo: T202636 internal network routing now exists in labtest/labtestn for VM to communicate with each other

2018-08-28

  • 11:04 arturo: T202549 eqiad1 databases are all now running on m5-master; MySQL has been cleaned off cloudcontrol100[3,4]

2018-08-23

  • 16:17 arturo: T188589 bstorm_ merged patch to reduce nova DB connection usage
  • 13:15 arturo: T202115 `root@cloudcontrol1003:~# neutron subnet-update --allocation-pool start=10.64.22.4,end=10.64.22.4 e4fb2771-a361-4add-ac4e-280cc300c59f`
  • 13:10 arturo: T202115 (was `{"start": "10.64.22.2", "end": "10.64.22.254"}` )
  • 13:08 arturo: T202115 `root@cloudcontrol1003:~# neutron subnet-update --allocation-pool start=10.64.22.254,end=10.64.22.254 e4fb2771-a361-4add-ac4e-280cc300c59f`

2018-08-22

  • 15:28 arturo: cleanup local glance and keystone databases in cloudcontrol1003.wikimedia.org (already in m5-master)
  • 15:27 arturo: cleanup local keystone database in cloudcontrol1003.wikimedia.org (already in m5-master)

2018-08-21

  • 15:39 andrewbogott: initial test message
  • 10:31 arturo: eqiad1 remove leftover port for HA on labnet1004
  • 10:15 arturo: test

2018-05-07

  • 18:07 bstorm_: stopped the toolhistory job because it is totally broken and fills /tmp.

2018-02-09

  • 00:55 bd808: Added Arturo Borrero Gonzalez and Bstorm as project members
  • 00:54 bd808: Removed Yuvipanda at user request (T186289)