Nova Resource:Admin/SAL


2021-11-28

  • 17:48 andrewbogott: moved cloudvirt1018 out of the 'localstorage' aggregate and into 'maintenance' for T296592. It will need to be moved back after the raid is rebuilt.
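
The entry above moves a hypervisor between Nova host aggregates. The exact commands are not recorded in the log; a minimal sketch with the standard OpenStack CLI (aggregate names from the entry, FQDN assumed) would be:

    # park the hypervisor in 'maintenance' while the raid rebuilds
    openstack aggregate remove host localstorage cloudvirt1018.eqiad.wmnet
    openstack aggregate add host maintenance cloudvirt1018.eqiad.wmnet
    # confirm the move
    openstack aggregate show maintenance -c hosts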

2021-11-21

  • 07:19 dcaro_away: restarting designate-sink with some extra logs in it (T296144)

2021-11-17

  • 15:48 andrewbogott: upgrading mariadb packages on eqiad1 cloudcontrols
  • 15:39 andrewbogott: sudo cumin "cloud*" 'apt-get update -y --allow-releaseinfo-change'
  • 15:26 andrewbogott: updated mariadb packages on codfw1dev cloudcontrols to 1:10.3.31-0+deb10u1

2021-11-12

  • 13:31 arturo: restarting glance-api services to make sure they work with new ceph auth creds (T293752)

2021-11-08

2021-11-05

  • 11:18 wm-bot: Added 1 new OSDs ['cloudcephosd1024.eqiad.wmnet'] (T295012) - cookbook ran by arturo@endurance
  • 11:17 wm-bot: Added OSD cloudcephosd1024.eqiad.wmnet... (1/1) (T295012) - cookbook ran by arturo@endurance
  • 11:15 wm-bot: Finished rebooting node cloudcephosd1024.eqiad.wmnet - cookbook ran by arturo@endurance
  • 11:12 wm-bot: Rebooting node cloudcephosd1024.eqiad.wmnet - cookbook ran by arturo@endurance
  • 11:12 wm-bot: Adding OSD cloudcephosd1024.eqiad.wmnet... (1/1) (T295012) - cookbook ran by arturo@endurance
  • 11:12 wm-bot: Adding new OSDs ['cloudcephosd1024.eqiad.wmnet'] to the cluster (T295012) - cookbook ran by arturo@endurance
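
The wm-bot lines above are emitted by the OSD-addition cookbook. After such a run, the new OSD and the resulting rebalance can be verified with standard Ceph commands (a sketch, run on any mon node):

    ceph osd tree | grep cloudcephosd1024   # the new OSD should show up and in under its host
    ceph osd df                             # per-OSD weight and utilisation
    ceph -s                                 # overall health and backfill/recovery progress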

2021-11-04

  • 16:39 wm-bot: Added 1 new OSDs ['cloudcephosd1023.eqiad.wmnet'] (T295012) - cookbook ran by arturo@endurance
  • 16:39 wm-bot: Added OSD cloudcephosd1023.eqiad.wmnet... (1/1) (T295012) - cookbook ran by arturo@endurance
  • 16:37 wm-bot: Finished rebooting node cloudcephosd1023.eqiad.wmnet - cookbook ran by arturo@endurance
  • 16:34 wm-bot: Rebooting node cloudcephosd1023.eqiad.wmnet - cookbook ran by arturo@endurance
  • 16:33 wm-bot: Adding OSD cloudcephosd1023.eqiad.wmnet... (1/1) (T295012) - cookbook ran by arturo@endurance
  • 16:33 wm-bot: Adding new OSDs ['cloudcephosd1023.eqiad.wmnet'] to the cluster (T295012) - cookbook ran by arturo@endurance
  • 16:17 wm-bot: Added 1 new OSDs ['cloudcephosd1022.eqiad.wmnet'] (T295012) - cookbook ran by arturo@endurance
  • 16:17 wm-bot: Added OSD cloudcephosd1022.eqiad.wmnet... (1/1) (T295012) - cookbook ran by arturo@endurance
  • 16:16 wm-bot: Finished rebooting node cloudcephosd1022.eqiad.wmnet - cookbook ran by arturo@endurance
  • 16:13 wm-bot: Rebooting node cloudcephosd1022.eqiad.wmnet - cookbook ran by arturo@endurance
  • 16:12 wm-bot: Adding OSD cloudcephosd1022.eqiad.wmnet... (1/1) (T295012) - cookbook ran by arturo@endurance
  • 16:12 wm-bot: Adding new OSDs ['cloudcephosd1022.eqiad.wmnet'] to the cluster (T295012) - cookbook ran by arturo@endurance
  • 16:00 wm-bot: Adding OSD cloudcephosd1022.eqiad.wmnet... (1/1) (T295012) - cookbook ran by arturo@endurance
  • 16:00 wm-bot: Adding new OSDs ['cloudcephosd1022.eqiad.wmnet'] to the cluster (T295012) - cookbook ran by arturo@endurance
  • 11:26 wm-bot: Added 1 new OSDs ['cloudcephosd1021.eqiad.wmnet'] (T295012) - cookbook ran by arturo@endurance
  • 11:26 wm-bot: Added OSD cloudcephosd1021.eqiad.wmnet... (1/1) (T295012) - cookbook ran by arturo@endurance
  • 11:23 wm-bot: Finished rebooting node cloudcephosd1021.eqiad.wmnet - cookbook ran by arturo@endurance
  • 11:20 wm-bot: Rebooting node cloudcephosd1021.eqiad.wmnet - cookbook ran by arturo@endurance
  • 11:19 wm-bot: Adding OSD cloudcephosd1021.eqiad.wmnet... (1/1) (T295012) - cookbook ran by arturo@endurance
  • 11:19 wm-bot: Adding new OSDs ['cloudcephosd1021.eqiad.wmnet'] to the cluster (T295012) - cookbook ran by arturo@endurance
  • 11:16 wm-bot: Adding new OSDs ['cloudcephosd1021.eqiad.wmnet'] to the cluster (T295012) - cookbook ran by arturo@endurance

2021-11-03

  • 17:22 arturo: [codfw1dev] installing keepalived 2.1.5 from buster-backports on cloudgw2001-dev/2002-dev (T294956)
  • 11:45 arturo: [codfw1dev] downgrade kernel on cloudgw2001-dev/2002-dev (T294853, T291813)

2021-11-02

  • 10:54 arturo: rebooting cloudnet1004/1003 for T291813
  • 10:43 arturo: [codfw1dev] rebooting cloudgw200[12]-dev for T291813

2021-10-24

2021-10-21

  • 10:19 arturo: drop firewall exception on core routers for wiki replicas legacy setup (T293897)
  • 10:12 arturo: drop NAT exception for wiki replicas legacy setup (T293897)

2021-10-20

  • 21:06 andrewbogott: creating cloudinfra-nfs project T293936

2021-10-18

  • 19:21 andrewbogott: also ticked the 'admin' box on wikitech for majavah T292827
  • 18:58 andrewbogott: granting majavah 'admin' role in the 'admin' project and also in the default domain. T292827
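
The two entries above grant Keystone roles. The commands themselves are not logged; with the standard OpenStack CLI they would look roughly like this (names taken from the entries; --inherited for the domain-level grant is an assumption):

    # 'admin' role on the 'admin' project
    openstack role add --user majavah --project admin admin
    # 'admin' role on the default domain, inherited by its projects (assumed)
    openstack role add --user majavah --domain default --inherited admin
    # verify
    openstack role assignment list --user majavah --names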

2021-10-14

  • 12:28 arturo: [codfw1dev] add DB grants for cloudbackup2002.codfw.wmnet IP address to the cinder DB (T292546)
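
The grant above lets the backup host's IP address reach the cinder database on the galera cluster. The exact statement is not logged; a sketch of the usual MariaDB grant (IP and password are placeholders):

    sudo mysql -e "GRANT ALL PRIVILEGES ON cinder.* TO 'cinder'@'<cloudbackup2002-ip>' IDENTIFIED BY '<password>';"
    sudo mysql -e "FLUSH PRIVILEGES;"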

2021-10-13

  • 10:46 arturo: updating python3-neutron across the fleet (T292936)

2021-10-12

  • 09:06 dcaro: upgrading eqiad cloudnet hosts neutron packages (T292936)
  • 08:57 dcaro: upgrading codfw cloudnet hosts neutron packages (T292936)

2021-10-05

  • 09:39 arturo: [codfw1dev] cleaning up manila stuff from openstack (db, endpoints, tenant, VMs, and such) T291257

2021-09-30

  • 14:50 andrewbogott: sudo cumin "cloud*" "ps -ef | grep nslcd && service nslcd restart" and sudo cumin "lab*" "ps -ef | grep nslcd && service nslcd restart" T292202
  • 14:43 andrewbogott: ran sudo cumin --force --timeout 500 -o json "A:all" "ps -ef | grep nslcd && service nslcd restart" to get nslcd happy again T292202

2021-09-29

  • 09:41 arturo: [codfw1dev] cleanup manila shares definitions for a clean start now that the manila-sharecontroller VM is apparently well configured (T291257)

2021-09-28

  • 16:23 bstorm: downtime for clouddb1020 to reduce re-pages in case this goes badly T291963
  • 16:21 bstorm: powering on clouddb1020 via remote console T291963
  • 15:58 bstorm: depooled clouddb1020 for repair T291961
  • 12:40 dcaro: Merged change on sssd for bullseye cloud hosts (T291585)
  • 11:30 arturo: [codfw1dev] create floating IP 185.15.57.5 for manila-sharecontroller.cloudinfra-codfw1dev.codfw1dev.wmcloud.org (T291257)

2021-09-27

  • 10:07 arturo: cloudcontrol1004 apparently healthy T291446
  • 09:25 arturo: rebooting cloudcontrol1004 for T291446

2021-09-24

  • 13:02 arturo: [codfw1dev] create VM manila-share-controller-01 on cloudinfra-codfw1dev
  • 13:00 arturo: [codfw1dev] rebase labs/private.git on cloudinfra-puppetmaster-01, had merge conflict

2021-09-21

  • 12:13 arturo: [codfw1dev] trying to create a manila service image (T291257)
  • 11:45 arturo: [codfw1dev] created rabbitmq user (T291257)
  • 11:32 arturo: [codfw1dev] populated manila DB & created service endpoints (T291257)
  • 11:06 arturo: [codfw1dev] give manila user admin role @ manila project (T291257)
  • 11:06 arturo: [codfw1dev] created manila project (T291257)
  • 10:57 arturo: [codfw1dev] created manila user @ labtestwikitech (T291257)
  • 10:49 arturo: [codfw1dev] create manila database on cloudcontrol-dev nodes (galera) T291257
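
The 2021-09-21 entries follow the usual OpenStack service bootstrap sequence for Manila. The precise commands are not in the log; a hedged sketch (region, host, and passwords are placeholders) is:

    # database on the galera cluster
    sudo mysql -e "CREATE DATABASE manila;"
    # rabbitmq credentials for the service
    sudo rabbitmqctl add_user manila '<rabbit-password>'
    # keystone project, user, role
    openstack project create manila
    openstack user create --password '<secret>' manila
    openstack role add --project manila --user manila admin
    # service and endpoint registration
    openstack service create --name manila share
    openstack endpoint create --region '<region>' manila public 'http://<api-host>:8786/v1/%(tenant_id)s'
    # populate the schema
    manila-manage db sync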

2021-09-20

  • 23:08 bstorm: ran `echo check > /sys/block/md0/md/sync_action` on cloudcontrol1004 to check raid
  • 22:48 andrewbogott: stopped puppet & mariadb on cloudcontrol1004; it was flapping
  • 22:44 andrewbogott: sudo touch /tmp/galera.disabled on cloudcontrol1004, the service seems troubled there
  • 21:57 andrewbogott: moving cloudvirt1043 into the 'nfs' aggregate for T291405
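
The 23:08 entry kicks off a consistency check of the md0 software RAID on cloudcontrol1004. Progress and results can be followed through the same sysfs/procfs interfaces (a sketch):

    echo check | sudo tee /sys/block/md0/md/sync_action   # start the check, as in the entry
    cat /proc/mdstat                                       # check progress
    cat /sys/block/md0/md/mismatch_cnt                     # non-zero may indicate inconsistent blocks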

2021-09-17

  • 11:35 arturo: [codfw1dev] install manila on cloudcontrol2001-dev (T291257)

2021-09-16

  • 15:56 bstorm: removing downtime for labstore1005 so we'll know if it has another issue T290318

2021-09-09

  • 22:03 bstorm: restarted the prometheus-mysqld-exporter@s1 service as it was not working T290630
  • 03:15 bstorm: resetting swap on clouddb1017 T290630
  • 03:08 andrewbogott: stopping maintain-dbusers on labstore1004 for help diagnosing T290630

2021-09-03

  • 15:34 bstorm: rebooting labstore1005 to disconnect the drives from labstore1004 T290318
  • 15:24 bstorm: stopping puppet and disabling backup syncs to labstore1005 on cloudbackup2002 T290318
  • 15:20 bstorm: stopping puppet and disabling backup syncs to labstore1005 on cloudbackup2001 T290318

2021-08-30

  • 16:16 wm-bot: Added 1 new OSDs ['cloudcephosd1018.eqiad.wmnet'] - cookbook ran by andrew@buster
  • 16:16 wm-bot: Added OSD cloudcephosd1018.eqiad.wmnet... (1/1) - cookbook ran by andrew@buster
  • 16:13 wm-bot: Adding OSD cloudcephosd1018.eqiad.wmnet... (1/1) - cookbook ran by andrew@buster
  • 16:13 wm-bot: Adding new OSDs ['cloudcephosd1018.eqiad.wmnet'] to the cluster - cookbook ran by andrew@buster
  • 16:10 wm-bot: Finished rebooting node cloudcephosd1018.eqiad.wmnet - cookbook ran by andrew@buster
  • 16:07 wm-bot: Rebooting node cloudcephosd1018.eqiad.wmnet - cookbook ran by andrew@buster
  • 16:07 wm-bot: Adding OSD cloudcephosd1018.eqiad.wmnet... (1/1) - cookbook ran by andrew@buster
  • 16:07 wm-bot: Adding new OSDs ['cloudcephosd1018.eqiad.wmnet'] to the cluster - cookbook ran by andrew@buster

2021-08-27

  • 18:57 andrewbogott: raising toolsbeta ram/core/instances quotas so majavah can experiment with bullseye

2021-08-25

  • 14:45 wm-bot: Finished rebooting node cloudcephosd1018.eqiad.wmnet - cookbook ran by andrew@buster
  • 14:42 wm-bot: Rebooting node cloudcephosd1018.eqiad.wmnet - cookbook ran by andrew@buster
  • 14:42 wm-bot: Adding OSD cloudcephosd1018.eqiad.wmnet... (1/1) - cookbook ran by andrew@buster
  • 14:42 wm-bot: Adding new OSDs ['cloudcephosd1018.eqiad.wmnet'] to the cluster - cookbook ran by andrew@buster
  • 14:41 wm-bot: Adding new OSDs ['cloudcephosd1018.eqiad.wmnet'] to the cluster - cookbook ran by andrew@buster

2021-08-19

  • 17:39 bstorm: restarting glance image backup to try and clear the page

2021-08-18

  • 16:21 wm-bot: Rebooting node cloudcephosd1018.eqiad.wmnet - cookbook ran by andrew@buster
  • 16:21 wm-bot: Adding OSD cloudcephosd1018.eqiad.wmnet... (1/1) - cookbook ran by andrew@buster
  • 16:21 wm-bot: Adding new OSDs ['cloudcephosd1018.eqiad.wmnet'] to the cluster - cookbook ran by andrew@buster
  • 16:17 wm-bot: Adding new OSDs ['cloudcephosd1018.eqiad.wmnet'] to the cluster - cookbook ran by andrew@buster
  • 16:16 wm-bot: Adding new OSDs ['cloudcephosd1018.eqiad.wmnet'] to the cluster - cookbook ran by andrew@buster
  • 16:15 wm-bot: Adding new OSDs ['cloudcephosd1018.eqiad.wmnet'] to the cluster - cookbook ran by andrew@buster
  • 16:13 wm-bot: Adding new OSDs ['cloudcephosd1018.eqiad.wmnet'] to the cluster - cookbook ran by andrew@buster
  • 14:47 andrewbogott: adding cloudvirt1038 to the ceph aggregate, removing from the maintenance aggregate T276922

2021-08-17

  • 15:11 andrewbogott: rebooting cloudcephosd1008 to force raid rebuild -- T287838

2021-08-11

  • 13:51 wm-bot: Finished rebooting node cloudcephosd1018.eqiad.wmnet - cookbook ran by dcaro@vulcanus
  • 13:48 wm-bot: Rebooting node cloudcephosd1018.eqiad.wmnet - cookbook ran by dcaro@vulcanus
  • 13:47 wm-bot: Adding OSD cloudcephosd1018.eqiad.wmnet... (1/1) (T285858) - cookbook ran by dcaro@vulcanus
  • 13:47 wm-bot: Adding new OSDs ['cloudcephosd1018.eqiad.wmnet'] to the cluster (T285858) - cookbook ran by dcaro@vulcanus

2021-08-10

  • 15:15 andrewbogott: restarting all designate services in eqiad1
  • 15:04 andrewbogott: restarting designate-sink in eqiad1; it's complaining about rabbit but I don't want to restart rabbit yet

2021-08-05

  • 09:37 dcaro: Taking one osd daemon down on the codfw cluster (T288203)

2021-08-04

  • 19:20 bd808: Running deleteBatch.php on cloudweb2001-dev to remove legacy Hiera: pages from labtestwiki
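
deleteBatch.php is a stock MediaWiki maintenance script that deletes pages listed one per line in a file. The invocation is not recorded; a hedged sketch (wiki id, reason, and file name are assumptions) is:

    # pages.txt contains one title per line, e.g. "Hiera:Some-project"
    php maintenance/deleteBatch.php --wiki=labtestwiki -r "remove legacy Hiera: pages" pages.txt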

2021-08-03

  • 17:40 bstorm: rerunning the glance backup script after failure

2021-07-31

  • 00:10 andrewbogott: "systemctl reset-failed cloud-init.service" on all VMs for T287309
  • 00:08 andrewbogott: "systemctl reset-failed cloud-final.service" on all VMs for T287309

2021-07-27

  • 21:32 andrewbogott: putting cloudvirt1012 back into service T286748
  • 20:52 andrewbogott: draining VMs off of cloudvirt1012 so we can replace the battery for T286748
  • 15:15 andrewbogott: "rm /etc/apt/sources.list.d/openstack-mitaka-jessie.list" cloud-wide

2021-07-23

  • 15:22 bstorm: update wikireplicas-dns for s7 fix for web replicas

2021-07-20

  • 17:07 andrewbogott: reloading haproxy on dbproxy1018 for T286598
  • 15:45 arturo: failback from labstore1006 to labstore1007 (dumps NFS) https://gerrit.wikimedia.org/r/c/operations/puppet/+/705417
  • 00:10 bstorm: restarting nova-api on cloudcontrol1003 to try and recover whatever it's doing with designate_floating_ip_ptr_records_updater

2021-07-19

  • 22:05 bstorm: set downtime scheduled for tomorrow from 1300 to 1600 UTC for cloudstore1008 and 1009 T286599
  • 20:40 andrewbogott: reloading haproxy on dbproxy1018 for T286598
  • 13:50 andrewbogott: upgrading mariadb to 10.3.29 on all cloudcontrols

2021-07-16

  • 09:55 dcaro: checking HP raid issues on cloudvirt1012 (T286766)

2021-07-14

  • 21:08 andrewbogott: restarting lots of openstack services while trying to resolve T286675
  • 12:17 dcaro: doing ceph outage tests on codfw1 (fyi)

2021-07-13

  • 10:57 dcaro: enabled autoscaling on the codfw1 ceph cluster, setting a minimum of 128 PGs on codfw1dev-compute

2021-07-02

  • 10:12 wm-bot: The cluster has not rebalanced after adding the new OSDs ['cloudcephosd1019.eqiad.wmnet', 'cloudcephosd1020.eqiad.wmnet'] (T285858) - cookbook ran by dcaro@vulcanus
  • 10:12 wm-bot: Added 2 new OSDs ['cloudcephosd1019.eqiad.wmnet', 'cloudcephosd1020.eqiad.wmnet'] (T285858) - cookbook ran by dcaro@vulcanus
  • 10:12 wm-bot: Added OSD cloudcephosd1020.eqiad.wmnet... (2/2) (T285858) - cookbook ran by dcaro@vulcanus
  • 10:10 wm-bot: Finished rebooting node cloudcephosd1020.eqiad.wmnet - cookbook ran by dcaro@vulcanus
  • 10:07 wm-bot: Rebooting node cloudcephosd1020.eqiad.wmnet - cookbook ran by dcaro@vulcanus
  • 10:07 wm-bot: Adding OSD cloudcephosd1020.eqiad.wmnet... (2/2) (T285858) - cookbook ran by dcaro@vulcanus
  • 10:07 wm-bot: Added OSD cloudcephosd1019.eqiad.wmnet... (1/2) (T285858) - cookbook ran by dcaro@vulcanus
  • 10:05 wm-bot: Finished rebooting node cloudcephosd1019.eqiad.wmnet - cookbook ran by dcaro@vulcanus
  • 10:02 wm-bot: Rebooting node cloudcephosd1019.eqiad.wmnet - cookbook ran by dcaro@vulcanus
  • 10:02 wm-bot: Adding OSD cloudcephosd1019.eqiad.wmnet... (1/2) (T285858) - cookbook ran by dcaro@vulcanus
  • 10:01 wm-bot: Adding new OSDs ['cloudcephosd1019.eqiad.wmnet', 'cloudcephosd1020.eqiad.wmnet'] to the cluster (T285858) - cookbook ran by dcaro@vulcanus
  • 09:13 wm-bot: Adding OSD cloudcephosd1019.eqiad.wmnet... (1/2) (T285858) - cookbook ran by dcaro@vulcanus
  • 09:13 wm-bot: Adding new OSDs ['cloudcephosd1019.eqiad.wmnet', 'cloudcephosd1020.eqiad.wmnet'] to the cluster (T285858) - cookbook ran by dcaro@vulcanus

2021-07-01

  • 16:27 bstorm: failed over cloudstore1009 to cloudstore1008 T224747
  • 16:18 bstorm: downtimed cloudstore1008 and cloudstore1009 to fail over T224747
  • 14:25 wm-bot: Adding OSD cloudcephosd1019.eqiad.wmnet... (2/3) (T285858) - cookbook ran by dcaro@vulcanus
  • 14:25 wm-bot: Added OSD cloudcephosd1017.eqiad.wmnet... (1/3) (T285858) - cookbook ran by dcaro@vulcanus
  • 14:24 wm-bot: Finished rebooting node cloudcephosd1017.eqiad.wmnet - cookbook ran by dcaro@vulcanus
  • 14:21 wm-bot: Rebooting node cloudcephosd1017.eqiad.wmnet - cookbook ran by dcaro@vulcanus
  • 14:20 wm-bot: Adding OSD cloudcephosd1017.eqiad.wmnet... (1/3) (T285858) - cookbook ran by dcaro@vulcanus
  • 14:20 wm-bot: Adding new OSDs ['cloudcephosd1017.eqiad.wmnet', 'cloudcephosd1019.eqiad.wmnet', 'cloudcephosd1020.eqiad.wmnet'] to the cluster (T285858) - cookbook ran by dcaro@vulcanus
  • 14:18 wm-bot: Rebooting node cloudcephosd1017.eqiad.wmnet - cookbook ran by dcaro@vulcanus
  • 14:17 wm-bot: Adding OSD cloudcephosd1017.eqiad.wmnet... (1/3) (T285858) - cookbook ran by dcaro@vulcanus
  • 14:17 wm-bot: Adding new OSDs ['cloudcephosd1017.eqiad.wmnet', 'cloudcephosd1019.eqiad.wmnet', 'cloudcephosd1020.eqiad.wmnet'] to the cluster (T285858) - cookbook ran by dcaro@vulcanus
  • 11:16 wm-bot: Added new OSD node cloudcephosd1016.eqiad.wmnet (T285858) - cookbook ran by dcaro@vulcanus
  • 11:13 wm-bot: Adding new OSD cloudcephosd1016.eqiad.wmnet to the cluster (T285858) - cookbook ran by dcaro@vulcanus
  • 10:58 dcaro: rebooting cloudcephosd1016 (T285858)
  • 10:47 wm-bot: Adding new OSD cloudcephosd1016.eqiad.wmnet to the cluster (T285858) - cookbook ran by dcaro@vulcanus
  • 10:44 wm-bot: Adding new OSD cloudcephosd1016.eqiad.wmnet to the cluster (T285858) - cookbook ran by dcaro@vulcanus
  • 10:42 wm-bot: Adding new OSD cloudcephosd1016.eqiad.wmnet to the cluster (T285858) - cookbook ran by dcaro@vulcanus
  • 10:41 wm-bot: Adding new OSD cloudcephosd1016.eqiad.wmnet to the cluster (T285858) - cookbook ran by dcaro@vulcanus
  • 10:40 wm-bot: Adding new OSD cloudcephosd1016.eqiad.wmnet to the cluster (T285858) - cookbook ran by dcaro@vulcanus

2021-06-30

  • 21:48 bstorm: downtimed space alerts for scratch on cloudstore1008 until after the migration

2021-06-25

  • 15:28 andrewbogott: restarting openstack services on cloudcontrol1005
  • 09:16 arturo: icinga downtime cloudcontrols for 2h
  • 08:20 dcaro: restarting rabbitmq on cloudcontrol100{3,4}

2021-06-21

  • 13:54 dcaro: puppet fix merged and deployed, servers are back to normal
  • 13:20 dcaro: merged broken puppet patch, downtimed all cloudvirts for 2h while fixing (nothing big, just added a bad systemd timer)

2021-06-20

  • 22:21 andrewbogott: clearing admin-monitoring VMs; puppet has been failing lately due to a full drive on the puppetmaster

2021-06-15

  • 01:18 bstorm: running a modified version of the prometheus dir size cron in screen T284964

2021-06-14

  • 10:13 dcaro: setting sssd to debug mode on tools-sgeexec-0917 (T284130)

2021-06-10

  • 10:58 wm-bot: Finished rebooting the nodes ['cloudcephmon2002-dev', 'cloudcephmon2003-dev', 'cloudcephmon2004-dev'] (T281248) - cookbook ran by dcaro@vulcanus
  • 10:58 wm-bot: Finished rebooting node cloudcephmon2004-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 10:55 wm-bot: Rebooting node cloudcephmon2004-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 10:55 wm-bot: Finished rebooting node cloudcephmon2003-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 10:52 wm-bot: Rebooting node cloudcephmon2003-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 10:52 wm-bot: Finished rebooting node cloudcephmon2002-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 10:49 wm-bot: Rebooting node cloudcephmon2002-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 10:49 wm-bot: Rebooting the nodes cloudcephmon2002-dev,cloudcephmon2003-dev,cloudcephmon2004-dev (T281248) - cookbook ran by dcaro@vulcanus
  • 10:48 wm-bot: Finished rebooting the nodes ['cloudcephosd2001-dev', 'cloudcephosd2002-dev', 'cloudcephosd2003-dev'] (T281248) - cookbook ran by dcaro@vulcanus
  • 10:48 wm-bot: Finished rebooting node cloudcephosd2003-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 10:45 wm-bot: Rebooting node cloudcephosd2003-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 10:45 wm-bot: Finished rebooting node cloudcephosd2002-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 10:42 wm-bot: Rebooting node cloudcephosd2002-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 10:42 wm-bot: Finished rebooting node cloudcephosd2001-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 10:39 wm-bot: Rebooting node cloudcephosd2001-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 10:39 wm-bot: Rebooting the nodes cloudcephosd2001-dev,cloudcephosd2002-dev,cloudcephosd2003-dev (T281248) - cookbook ran by dcaro@vulcanus
  • 09:39 wm-bot: Finished rebooting the nodes ['cloudcephosd2001-dev', 'cloudcephosd2002-dev', 'cloudcephosd2003-dev'] (T281248) - cookbook ran by dcaro@vulcanus
  • 09:38 wm-bot: Finished rebooting node cloudcephosd2003-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 09:35 wm-bot: Rebooting node cloudcephosd2003-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 09:35 wm-bot: Finished rebooting node cloudcephosd2002-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 09:32 wm-bot: Rebooting node cloudcephosd2002-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 09:32 wm-bot: Finished rebooting node cloudcephosd2001-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 09:29 wm-bot: Rebooting node cloudcephosd2001-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 09:29 wm-bot: Rebooting the nodes cloudcephosd2001-dev,cloudcephosd2002-dev,cloudcephosd2003-dev (T281248) - cookbook ran by dcaro@vulcanus
  • 09:26 wm-bot: Rebooting node cloudcephosd2001-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 09:26 wm-bot: Rebooting the nodes cloudcephosd2001-dev,cloudcephosd2002-dev,cloudcephosd2003-dev (T281248) - cookbook ran by dcaro@vulcanus
  • 09:24 wm-bot: Rebooting node cloudcephosd2001-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 09:24 wm-bot: Rebooting the nodes cloudcephosd2001-dev,cloudcephosd2002-dev,cloudcephosd2003-dev (T281248) - cookbook ran by dcaro@vulcanus

2021-06-09

  • 17:33 arturo: removed icinga downtime for cloudmetrics1002 -- to see if hardware is healthy (T281881)
  • 13:30 wm-bot: Finished rebooting the nodes ['cloudcephmon2002-dev', 'cloudcephmon2003-dev', 'cloudcephmon2004-dev'] (T281248) - cookbook ran by dcaro@vulcanus
  • 13:30 wm-bot: Finished rebooting node cloudcephmon2004-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 13:27 wm-bot: Rebooting node cloudcephmon2004-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 13:27 wm-bot: Finished rebooting node cloudcephmon2003-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 13:24 wm-bot: Rebooting node cloudcephmon2003-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 13:24 wm-bot: Finished rebooting node cloudcephmon2002-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 13:21 wm-bot: Rebooting node cloudcephmon2002-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 13:21 wm-bot: Rebooting the nodes cloudcephmon2002-dev,cloudcephmon2003-dev,cloudcephmon2004-dev (T281248) - cookbook ran by dcaro@vulcanus
  • 13:01 wm-bot: Rebooting node cloudcephmon2002-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 13:01 wm-bot: Rebooting the nodes cloudcephmon2002-dev,cloudcephmon2003-dev,cloudcephmon2004-dev (T281248) - cookbook ran by dcaro@vulcanus
  • 12:53 wm-bot: Rebooting node cloudcephmon2002-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 12:53 wm-bot: Rebooting the nodes cloudcephmon2002-dev,cloudcephmon2003-dev,cloudcephmon2004-dev (T281248) - cookbook ran by dcaro@vulcanus

2021-06-08

  • 23:19 bd808: Downtimed cloudmetrics1002 in icinga until 2021-06-30 23:59:01 (T281881)
  • 21:08 bstorm: downtiming grafana-labs for maintenance
  • 16:28 wm-bot: Finished rebooting the nodes ['cloudcephosd2001-dev', 'cloudcephosd2002-dev', 'cloudcephosd2003-dev'] (T281248) - cookbook ran by dcaro@vulcanus
  • 16:27 wm-bot: Finished rebooting node cloudcephosd2003-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 16:24 wm-bot: Rebooting node cloudcephosd2003-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 16:24 wm-bot: Finished rebooting node cloudcephosd2002-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 16:22 wm-bot: Rebooting node cloudcephosd2002-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 16:21 wm-bot: Finished rebooting node cloudcephosd2001-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 16:18 wm-bot: Rebooting node cloudcephosd2001-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
  • 16:18 wm-bot: Rebooting the nodes ['cloudcephosd2001-dev', 'cloudcephosd2002-dev', 'cloudcephosd2003-dev'] (T281248) - cookbook ran by dcaro@vulcanus
  • 16:17 wm-bot: Rebooting the nodes ['cloudcephosd2001-dev', 'cloudcephosd2002-dev', 'cloudcephosd2003-dev'] (T281248) - cookbook ran by dcaro@vulcanus
  • 15:03 wm-bot: Finished rebooting node cloudcephosd2001-dev.codfw.wmnet - cookbook ran by dcaro@vulcanus
  • 14:59 wm-bot: Rebooting node cloudcephosd2001-dev.codfw.wmnet - cookbook ran by dcaro@vulcanus
  • 14:59 wm-bot: Rebooting node cloudcephosd2001-dev.codfw.wmnet - cookbook ran by dcaro@vulcanus
  • 14:57 wm-bot: Rebooting node cloudcephosd2001-dev.codfw.wmnet - cookbook ran by dcaro@vulcanus
  • 14:57 wm-bot: Rebooting node cloudcephosd2001-dev.codfw.wmnet - cookbook ran by dcaro@vulcanus
  • 14:29 wm-bot: Rebooting node cloudcephosd2001-dev.codfw.wmnet - cookbook ran by dcaro@vulcanus
  • 14:23 wm-bot: Rebooting node cloudcephosd2001-dev.codfw.wmnet - cookbook ran by dcaro@vulcanus
  • 14:18 wm-bot: Rebooting node cloudcephosd2001-dev.codfw.wmnet - cookbook ran by dcaro@vulcanus

2021-06-07

  • 14:27 andrewbogott: moving cloudvirt1040 from 'maintenance' aggregate to 'ceph' aggregate T281399

2021-06-01

  • 13:12 dcaro: Changed the ceph osd_memory_target on eqiad pool to 6Gi (we were reaching the limit, swapping at some points)
  • 09:57 arturo: fix PTR record for 185.15.56.1 (T284025)
  • 09:56 arturo: fix PTR record for 185.15.56.1 (T248025)

2021-05-27

  • 14:58 wm-bot: Testing - cookbook ran by dcaro@vulcanus

2021-05-26

  • 19:10 andrewbogott: reimaging cloudvirt1018 to support local VM storage
  • 18:07 andrewbogott: draining cloudvirt1018, converting it to a local-storage host like cloudvirt1019 and 1020 -- T283296
  • 14:36 dcaro: Enabled syslog logging for osd.55 on eqiad ceph cluster for testing (T281247)
  • 14:36 dcaro: Enabled syslog logging on codfw ceph cluster (mon/osd/mgr) (T281247)
  • 11:26 arturo: [codfw1dev] purge old kernel packages in cloudvirt200[12]-dev
  • 11:03 arturo: created public flavor `g3.cores16.ram36.disk20` (it was requested as private in T283293, but it may be useful for others)

2021-05-25

  • 16:14 bd808: Closed #wikimedia-cloud-admin on f***node
  • 16:11 bd808: Closed #wikimedia-cloud-feed on f***node
  • 15:19 dcaro: rebooted cloudvirt1020, starting VMs (T275893)
  • 15:13 dcaro: rebooting cloudvirt1020 (T275893)
  • 14:42 dcaro: taking cloudvirt1020 out for maintenance (openstack wise) so no new VMs are scheduled on it (T275893)

2021-05-24

  • 22:32 andrewbogott: changing the default ttl for eqiad1.wikimedia.cloud. from 3600 to 60; this should help us avoid madness when re-using hostnames.
  • 11:20 arturo: created `g3.cores2.ram80.disk40.private` for the wmf-research-tools project, to allow resizing a 40G-disk instance
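
Both entries above are routine OpenStack admin operations: lowering the default TTL on the eqiad1.wikimedia.cloud. zone in Designate, and creating a project-scoped flavor. The commands are not recorded; a sketch with the standard clients (zone id is a placeholder, flavor sizes inferred from its name, RAM in MiB):

    # Designate: find the zone and lower its default TTL
    openstack zone list
    openstack zone set --ttl 60 <zone-id>   # zone id of eqiad1.wikimedia.cloud.
    # Nova: a private flavor matching g3.cores2.ram80.disk40.private, granted to one project
    openstack flavor create --vcpus 2 --ram 81920 --disk 40 --private g3.cores2.ram80.disk40.private
    openstack flavor set --project wmf-research-tools g3.cores2.ram80.disk40.private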

2021-05-22

  • 02:14 bstorm: downtiming SMART alerts on dumps server labstore1007 for the weekend because it has been flapping T281045

2021-05-13

  • 21:25 bstorm: converted the maps and scratch volumes on cloudstore1008 (standby) to drbd T224747
  • 15:45 bstorm: re-running wikireplicas-dns after refactor of config to make sure it doesn't change anything

2021-05-12

  • 14:23 arturo: [codfw1dev] cleanup old unused agents (bgp, ovs)
  • 11:37 arturo: [codfw1dev] replacing cloudnet2003-dev with cloudnet2004-dev (T281381)

2021-05-11

  • 18:00 andrewbogott: adding 'trove' service project in advance of deploying trove in eqiad1
  • 10:22 arturo: rebooted cloudgw1002 (active) thus causing a failover to cloudgw1001

2021-05-09

  • 10:53 arturo: icinga-downtime cloudmetrics1002 for 3 months (T275605)

2021-05-07

  • 13:51 andrewbogott: add inherited 'admin' right to novaadmin user throughout eqiad1. I was trying to narrow down the rights here but lack of admin breaks some workflows, e.g. T281894 and T282235

2021-05-06

  • 15:31 arturo: about to migrate the CloudVPS network to the cloudgw architecture T270704
  • 11:14 dcaro: restarting cinder-volume on the eqiad control nodes to refresh the ceph libraries (T282109)

2021-05-05

  • 16:07 dcaro: disallowing insecure global ids on the eqiad ceph cluster (T280641)
  • 15:15 wm-bot: Safe reboot of 'cloudvirt1046.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 15:11 wm-bot: Safe rebooting 'cloudvirt1046.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 15:11 wm-bot: Safe reboot of 'cloudvirt1045.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 15:07 wm-bot: Safe rebooting 'cloudvirt1045.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 15:07 wm-bot: Safe reboot of 'cloudvirt1044.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 15:03 wm-bot: Safe rebooting 'cloudvirt1044.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 15:03 wm-bot: Safe reboot of 'cloudvirt1043.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 14:59 wm-bot: Safe rebooting 'cloudvirt1043.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 14:59 wm-bot: Safe reboot of 'cloudvirt1042.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 14:40 wm-bot: Safe rebooting 'cloudvirt1042.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 14:39 wm-bot: Safe reboot of 'cloudvirt1041.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 14:14 wm-bot: Safe rebooting 'cloudvirt1041.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 14:14 wm-bot: Safe reboot of 'cloudvirt1039.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 14:10 wm-bot: Safe rebooting 'cloudvirt1039.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 12:35 wm-bot: Safe rebooting 'cloudvirt1039.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 11:56 wm-bot: Safe rebooting 'cloudvirt1038.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 11:56 wm-bot: Safe reboot of 'cloudvirt1037.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 11:31 wm-bot: Safe rebooting 'cloudvirt1037.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 11:31 wm-bot: Safe reboot of 'cloudvirt1036.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 11:08 wm-bot: Safe rebooting 'cloudvirt1036.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 11:08 wm-bot: Safe reboot of 'cloudvirt1035.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 10:39 wm-bot: Safe rebooting 'cloudvirt1035.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 10:39 wm-bot: Safe reboot of 'cloudvirt1034.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 10:13 wm-bot: Safe rebooting 'cloudvirt1034.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 10:13 wm-bot: Safe reboot of 'cloudvirt1033.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 09:47 wm-bot: Safe rebooting 'cloudvirt1033.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 09:47 wm-bot: Safe reboot of 'cloudvirt1032.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 09:21 wm-bot: Safe rebooting 'cloudvirt1032.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 09:21 wm-bot: Safe reboot of 'cloudvirt1031.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 08:45 wm-bot: Safe rebooting 'cloudvirt1031.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 08:45 wm-bot: Safe reboot of 'cloudvirt1030.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 08:19 wm-bot: Safe rebooting 'cloudvirt1030.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 08:19 wm-bot: Safe reboot of 'cloudvirt1029.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 08:02 wm-bot: Safe rebooting 'cloudvirt1029.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
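
The 16:07 entry above ("disallowing insecure global ids") refers to the standard mitigation for CVE-2021-20288, applied once all Ceph clients and daemons run fixed versions; a sketch:

    ceph config set mon auth_allow_insecure_global_id_reclaim false
    ceph health detail   # the AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED warning should clear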

2021-05-04

  • 16:05 wm-bot: Safe reboot of 'cloudvirt1028.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 15:45 wm-bot: Safe rebooting 'cloudvirt1028.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 15:44 wm-bot: Safe reboot of 'cloudvirt1027.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 15:22 wm-bot: Safe rebooting 'cloudvirt1027.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 15:19 wm-bot: Safe reboot of 'cloudvirt1026.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 15:15 wm-bot: Safe rebooting 'cloudvirt1026.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 13:19 dcaro: rebooting cloudmetrics1002, got stuck again (T275605)
  • 10:04 wm-bot: Safe rebooting 'cloudvirt1026.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 09:10 wm-bot: Safe rebooting 'cloudvirt1026.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 09:10 wm-bot: Safe reboot of 'cloudvirt1025.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 08:34 wm-bot: Safe rebooting 'cloudvirt1025.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 08:20 wm-bot: Safe reboot of 'cloudvirt1024.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 08:03 wm-bot: Safe rebooting 'cloudvirt1024.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus

2021-05-03

  • 23:53 bstorm: running `maintain-dbusers harvest-replicas` on labstore1004 T281287
  • 23:51 bstorm: running `maintain-dbusers harvest-replicas` on labstore1004
  • 16:34 wm-bot: Safe reboot of 'cloudvirt1023.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 16:29 wm-bot: Safe rebooting 'cloudvirt1023.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 15:41 wm-bot: Safe rebooting 'cloudvirt1023.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 15:41 wm-bot: Safe reboot of 'cloudvirt1022.eqiad.wmnet' finished successfully. (T280641) - cookbook ran by dcaro@vulcanus
  • 15:13 wm-bot: Safe rebooting 'cloudvirt1022.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 10:31 wm-bot: Safe rebooting 'cloudvirt1021.eqiad.wmnet'. (T280641) - cookbook ran by dcaro@vulcanus
  • 10:23 wm-bot: (from a cookbook)
  • 09:12 dcaro: draining and rebooting cloudvirt1021 (T280641)
  • 08:26 dcaro: draining and rebooting cloudvirt1018 (T280641)

2021-04-30

  • 11:16 dcaro: draining and rebooting cloudvirt1017, last one today (T280641)
  • 10:37 dcaro: draining cloudvirt1016 for reboot (T280641)
  • 09:48 dcaro: draining cloudvirt1013 for reboot (T280641)

2021-04-29

  • 15:11 dcaro: hard rebooting cloudmetrics1002, got hung again (T275605)
  • 07:53 dcaro: Upgrading ceph libraries on cloudcontrol1005 to octopus (T274566)
  • 07:51 dcaro: Upgrading ceph libraries on cloudcontrol1003 to octopus (T274566)
  • 07:50 dcaro: Upgrading ceph libraries on cloudcontrol1004 to octopus (T274566)

2021-04-28

  • 21:11 andrewbogott: cleaning up more references to deleted hypervisors with delete from services where topic='compute' and version != 53;
  • 20:48 andrewbogott: cleaning up references to deleted hypervisors with mysql:root@localhost [nova_eqiad1]> delete from compute_nodes where hypervisor_version != '5002000';
  • 19:40 andrewbogott: putting cloudvirt1040 into the maintenance aggregate pending more info about T281399
  • 18:11 andrewbogott: adding cloudvirt1040, 1041 and 1042 to the 'ceph' host aggregate -- T275081
  • 11:06 dcaro: All ceph server side upgraded to Octopus! \o/ (T280641)
  • 10:57 dcaro: Got a PG getting stuck on 'remapping' after the OSD came up, had to unset the norebalance and then set it again to get it unstuck (T280641)
  • 10:34 dcaro: Slow/blocked ops from cloudcephmon03, "osd_failure(failed timeout osd.32..." (cloudcephosd1005), unset the cluster noout/norebalance and went away in a few secs, setting it again and continuing... (T280641)
  • 09:03 dcaro: Waiting for slow heartbeats from osd.58(cloudcephosd1002) to recover... (T280641)
  • 08:59 dcaro: During the upgrade, started getting warning 'slow osd heartbeats in the back', meaning that pings between osds are really slow (up to 190s) all from osd.58, currently on cloudcephosd1002 (T280641)
  • 08:58 dcaro: During the upgrade, started getting warning 'slow osd heartbeats in the back', meaning that pings between osds are really slow (up to 190s) all from osd.58 (T280641)
  • 08:58 dcaro: During the upgrade, started getting warning 'slow osd heartbeats in the back', meaning that pings between osds are really slow (up to 190s) (T280641)
  • 08:21 dcaro: Upgrading all the ceph osds on eqiad (T280641)
  • 08:21 dcaro: The clock skew seems intermittent, there's another task to follow it T275860 (T280641)
  • 08:18 dcaro: All eqiad ceph mons and mgrs upgraded (T280641)
  • 08:18 dcaro: During the upgrade, ceph detected a clock skew on cloudcephmon1002, cloudcephmon1001, they are back (T280641)
  • 08:15 dcaro: During the upgrade, ceph detected a clock skew on cloudcephmon1002, it went away, I'm guessing systemd-timesyncd fixed it (T280641)
  • 08:14 dcaro: During the upgrade, ceph detected a clock skew on cloudcephmon1002, looking (T280641)
  • 07:58 dcaro: Upgrading ceph services on eqiad, starting with mons/managers (T280641)

2021-04-27

  • 14:10 dcaro: codfw.openstack upgraded ceph libraries to 15.2.11 (T280641)
  • 13:07 dcaro: codfw.openstack cloudvirt2002-dev done, taking cloudvirt2003-dev out to upgrade ceph libraries (T280641)
  • 13:00 dcaro: codfw.openstack cloudvirt2001-dev back online, taking cloudvirt2002-dev out to upgrade ceph libraries (T280641)
  • 10:51 dcaro: ceph.eqiad: cinder pool got its pg_num increased to 1024, re-shuffle started (T273783)
  • 10:48 dcaro: ceph.eqiad: Tweaked the target_size_ratio of all the pools, enabling autoscaler (it will increase cinder pool only) (T273783)
  • 09:14 dcaro: manually force stopping the server puppetmaster-01 to unblock migration (in codfw1)
  • 09:14 dcaro: manually force stopping the server puppetmaster-01 to unblock migration
  • 08:59 dcaro: manually force stopping the server exploding-head on codfw, to try cold migration
  • 08:47 dcaro: restarting nova-compute on cloudvirt2001-dev after upgrading ceph libraries to 15.2.11
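
The 10:48 and 10:51 entries adjust pool sizing hints and let the PG autoscaler grow the cinder pool to 1024 PGs. A sketch of the corresponding Ceph commands (pool name and ratio are placeholders):

    ceph osd pool set <cinder-pool> target_size_ratio 0.2   # sizing hint for the autoscaler
    ceph osd pool set <cinder-pool> pg_autoscale_mode on
    ceph osd pool autoscale-status                           # shows the target pg_num per pool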

2021-04-26

  • 20:56 andrewbogott: deleting spurious 'codfw1dev' and 'codw1dev-4' regions in the dallas deployment; regions without endpoints break a bunch of things
  • 09:45 dcaro: draining cloudvirt2001-dev with the new cookbooks (T280641)

2021-04-23

  • 13:49 dcaro: testing the drain_cloudvirt cookbook on codfw1 openstack cluster, draining cloudvirt2001 (T280641)
  • 11:12 dcaro: testing the drain_cloudvirt cookbook on codfw1 openstack cluster (T280641)
  • 09:32 dcaro: finished upgrade of ceph cluster on codfw1 using exclusively cookbooks (T280641)
  • 09:17 dcaro: testing the upgrade_osds cookbook on codfw1 ceph cluster (T280641)
  • 08:17 dcaro: testing the upgrade_mons cookbook on codfw1 ceph cluster (T280641)

2021-04-21

  • 17:59 dcaro: all monitors upgraded on codfw1 with one cookbook `cookbook --verbose -c ~/.config/spicerack/cookbook.yaml wmcs.ceph.upgrade_mons --monitor-node-fqdn cloudcephmon2002-dev.codfw.wmnet` (T280641)
  • 17:47 dcaro: upgrading monitors and mgr nodes on codfw ceph cluster (T280641)
  • 13:26 dcaro: testing ceph upgrade cookbook on cloudcephmon2002-dev (T280641)

2021-04-20

  • 20:21 andrewbogott: reboot cloudservices1003
  • 20:13 andrewbogott: reboot cloudservices1004

2021-04-19

  • 08:40 dcaro: enabling puppet on labstore1004 after mysql restart (T279657)
  • 08:09 dcaro: downtiming labstore1004 and stopping puppet for mysql restart (T279657)

2021-04-14

  • 10:48 dcaro: Upgrade of codfw ceph to octopus 15.2.20 done, will run some performance tests now (T274566)
  • 10:41 dcaro: Upgrade of codfw ceph to octopus 15.2.20, mgrs upgraded, osds next (T274566)
  • 10:37 dcaro: Upgrade of codfw ceph to octopus 15.2.20, mons upgraded, mgrs next (T274566)
  • 10:15 dcaro: starting the upgrade of codfw ceph to octopus 15.2.20 (T274566)
  • 10:07 dcaro: Merged the ceph 15 (Octopus) repo deployment to codfw, only the repo, not the packages (T274566)

2021-04-13

  • 16:42 dcaro: Ceph balancer got the cluster to eval 0.014916, that is 88-77% usage for compute pool, and 28-19% usage for the cinder one \o/ (T274573)
  • 15:08 dcaro: Activating continuous upmap balancer, keeping a close eye (T274573)
  • 15:03 dcaro: Executing a second pass, there's still movements to improve the eval of 0.030075 (T274573)
  • 15:02 dcaro: First pass finished, improved eval to 0.030075 (T274573)
  • 14:49 dcaro: Running the first_pass balancing plan on ceph eqiad, current eval 0.030622 (T274573)
  • 14:43 dcaro: enabling ceph upmap pg balancer on eqiad (T274573)
  • 14:36 andrewbogott: upgrading codfw1dev to version Victoria, T261137
  • 13:11 andrewbogott: upgrading eqiad1 designate to version Victoria, T261137
  • 10:44 dcaro: enabled ceph upmap balancer on codfw (T274573,T274573)
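
The balancer entries above follow Ceph's upmap workflow: evaluate the current distribution, run one-off plans, then switch on continuous balancing. A sketch (plan name taken from the 14:49 entry):

    ceph balancer mode upmap
    ceph balancer eval                   # current distribution score (lower is better)
    ceph balancer optimize first_pass    # build a one-off plan
    ceph balancer execute first_pass
    ceph balancer on                     # continuous balancing, as in the 15:08 entry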

2021-04-07

  • 21:33 andrewbogott: upgrading codfw1dev designate to Victoria

2021-04-04

  • 17:36 andrewbogott: upgrading eqiad1 designate to Ussuri

2021-04-02

  • 14:12 andrewbogott: upgrading codfw1dev to OpenStack version Ussuri

2021-04-01

  • 12:15 dcaro: Restoring the 4.9 kernel on cloudcephosd2003-dev and upgrading (T274565)
  • 10:29 dcaro: Done restoring the 4.9 kernel on cloudcephosd2001-dev and upgrading, requires logging into console to boot from the older kernel before removing the newer one (T274565)
  • 10:10 dcaro: Restoring the 4.9 kernel on cloudcephosd2001-dev and upgrading (T274565)

2021-03-31

  • 08:47 dcaro: upgrading cinder on codfw cloudcontrol2* nodes (T278845)

2021-03-30

  • 09:53 arturo: rebooting cloudnet1003 to clean up its conntrack table, it wouldn't clean up by hand ...

2021-03-28

  • 15:42 andrewbogott: updated debian-10.0-buster base image

2021-03-27

  • 09:54 arturo: cleanup conntrack table in qrouter netns in cloudnet1003 (backup)

2021-03-25

  • 19:03 andrewbogott: deleting all unused (per wmcs-imageusage) Jessie base images from Glance
  • 17:15 andrewbogott: refreshing puppet compiler facts for tools project
  • 10:31 dcaro: kernel upgrade on osds on codfw done, running performance tests (T274565)
  • 10:24 dcaro: upgrading kernel on cloudcephosd2003-dev and reboot (T274565)
  • 10:18 dcaro: upgrading kernel on cloudcephosd2002-dev and reboot (T274565)
  • 10:08 dcaro: upgrading kernel on cloudcephmon2003-dev and reboot (T274565)

2021-03-24

  • 09:19 dcaro: restarted wmcs-backup on cloudvirt1024 as it failed due to an image being removed while running (T276892)

2021-03-23

  • 11:33 arturo: root@cloudcontrol1005:~# wmcs-novastats-dnsleaks --delete

2021-03-22

  • 10:10 arturo: cleanup conntrack table in standby node: aborrero@cloudnet1003:~ $ sudo ip netns exec qrouter-d93771ba-2711-4f88-804a-8df6fd03978a conntrack -F

2021-03-19

  • 17:18 bstorm: running `ALTER TABLE account MODIFY COLUMN type ENUM('user','tool','paws');` against the labsdbaccounts database on m5 T276284
  • 14:29 andrewbogott: switching admin-monitoring project to use an upstream debian image; I want to see how this affects performance
  • 00:30 bstorm: downtimed labstore1004 to check some things in debug mode

2021-03-17

  • 17:28 bstorm: restarted the backup-glance-images job to clear errors in systemd T271782
  • 17:16 andrewbogott: set default cinder quota for projects to 80Gb with "update quota_classes set hard_limit=80 where resource='gigabytes';" on database 'cinder'
  • 16:58 andrewbogott: disabling all flavors with >20Gb root storage with "update flavors set disabled=1 where root_gb>20;" in nova_eqiad1_api

2021-03-10

  • 16:51 arturo: rebooting cloudvirt1030 for T275753
  • 13:14 dcaro: starting manually the canary VM for cloudvirt1029 (nova start 349830f6-3b39-4a8c-ada4-a7439f65cffe) (T275753)
  • 12:51 arturo: draining cloudvirt1030 for T275753
  • 12:47 arturo: rebooting cloudvirt1029 for T275753
  • 11:56 arturo: [codfw1dev] restart rabbitmq-server in all 3 cloudcontrol servers for T276964
  • 11:53 arturo: [codfw1dev] restart nova-conductor in all 3 cloudcontrol servers for T276964
  • 11:31 arturo: draining cloudvirt1029 for T275753
  • 11:29 arturo: rebooting cloudvirt1013 for T275753
  • 11:05 arturo: draining cloudvirt1013 for T275753
  • 11:00 arturo: rebooting cloudvirt1028 for T275753
  • 10:33 arturo: draining cloudvirt1028 for T275753
  • 10:29 arturo: rebooting cloudvirt1023 for T275753
  • 09:37 arturo: draining cloudvirt1023 for T275753
  • 09:07 arturo: [codfw1dev] reimaging cloudvirt2003-dev (T276964)

2021-03-09

  • 16:27 arturo: rebooting cloudvirt1027 (T275753)
  • 13:39 arturo: draining cloudvirt1027 for T275753
  • 13:35 arturo: icinga-downtime cloudvirt1038 for 30 days for T276922
  • 13:21 arturo: add cloudvirt1039 to the ceph host aggregate (no longer a spare, we have cloudvirt1038 with HW failures)
  • 12:52 arturo: cloudvirt1038 hard powerdown / powerup for T276922
  • 12:33 arturo: rebooting cloudvirt1038 (T275753)
  • 10:58 arturo: draining cloudvirt1038 (T275753)
  • 10:54 arturo: rebooting cloudvirt1037 (T275753)
  • 09:59 arturo: draining cloudvirt1037 (T275753)
  • 09:12 dcaro: restarted the wmcs-backup service on cloudvirt1024 to retry the backups (failed because a VM was removed in-between, T276892)

2021-03-05

  • 21:40 andrewbogott: replacing 'observer' role with 'reader' role in eqiad1 T276018
  • 21:21 andrewbogott: replacing 'observer' role with 'reader' role in eqiad1
  • 16:23 arturo: rebooting cloudvirt1036 for T275753
  • 12:30 arturo: draining cloudvirt1036 for T275753
  • 12:25 arturo: rebooting cloudvirt1035 for T275753
  • 10:49 arturo: rebooting cloudvirt1035 for T275753
  • 10:47 arturo: rebooting cloudvirt1034 for T275753
  • 10:26 arturo: draining cloudvirt1034 for T275753
  • 10:25 arturo: rebooting cloudvirt1033 for T275753
  • 09:18 arturo: draining cloudvirt1033 for T275753

2021-03-04

  • 18:36 andrewbogott: rebooting cloudmetrics1002; the console is hanging
  • 16:59 arturo: rebooting cloudvirt1032 for T275753
  • 16:34 arturo: draining cloudvirt1032 for T275753
  • 16:33 arturo: rebooting cloudvirt1031 for T275753
  • 16:11 arturo: draining cloudvirt1031 for T275753
  • 16:09 arturo: rebooting cloudvirt1026 for T275753
  • 15:57 arturo: draining cloudvirt1026 for T275753
  • 15:55 arturo: rebooting cloudvirt1025 for T275753
  • 15:41 arturo: draining cloudvirt1025 for T275753
  • 15:12 arturo: rebooting cloudvirt1024 for T275753
  • 11:29 arturo: draining cloudvirt1024 for T275753
  • 11:24 dcaro: rebooted cloudvirt1022, re-adding to ceph and removing from maintenance host aggregate for T275753
  • 11:01 dcaro: rebooting cloudvirt1022 for T275753
  • 09:12 dcaro: draining cloudvirt1022 for T275753

2021-03-03

  • 17:16 andrewbogott: restarting rabbitmq-server on cloudcontrol1003,1004,1005; trying to explain amqp errors in scheduler logs
  • 16:03 dcaro: draining cloudvirt1022 for T275753
  • 16:00 arturo: move cloudvirt1013 into the 'toobusy' host aggregate, it has 221% cpu subscription and 82% MEM subscription
  • 15:34 arturo: rebooting cloudvirt1021 for T275753
  • 14:31 arturo: draining cloudvirt1021 for T275753
  • 13:59 arturo: rebooting cloudvirt1018 for T275753
  • 13:28 arturo: draining cloudvirt1018 for T275753
  • 12:49 arturo: rebooting cloudvirt1017 for T275753
  • 12:22 arturo: draining cloudvirt1017 for T275753
  • 12:20 arturo: rebooting cloudvirt1016 for T275753
  • 12:01 arturo: draining cloudvirt1016 for T275753
  • 11:59 arturo: cloudvirt1014 now in the ceph host aggregate
  • 11:58 arturo: rebooting cloudvirt1014 for T275753
  • 11:50 arturo: moved cloudvirt1023 away from the maintenance host aggregate, leaving it in the ceph aggregate (it was in both)
  • 11:47 arturo: moved cloudvirt1014 to the 'maintenance' host aggregate, drain it for T275753
  • 10:01 arturo: icinga-downtime cloudnet1003 for 14 days bc potential alerting storm due to firmware issues (T271058)
  • 10:01 arturo: rebooting again cloudnet1003 (no network failover) (T271058)
  • 09:59 arturo: update firmware-bnx2x from 20190114-2 to 20200918-1~bpo10+1 on cloudnet1003 (T271058)
  • 09:30 arturo: installing linux kernel 5.10.13-1~bpo10+1 in cloudnet1003 and rebooting it (network failover) (T271058)

2021-03-02

  • 17:16 andrewbogott: rebooting cloudvirt1039 to see if I can trigger T276208
  • 16:10 arturo: [codfw1dev] restart nova-compute on cloudvirt2002-dev
  • 11:59 arturo: moved cloudvirt1012 to 'maintenance' host aggregate. Drain it with `wmcs-drain-hypervisor` to reboot it for T275753
  • 11:59 arturo: cloudvirt1023 is affected by T276208 and cannot be rebooted. Put it back into the ceph host aggregate
  • 10:43 arturo: moved cloudvirt1013 cloudvirt1032 cloudvirt1037 back into the 'ceph' host aggregate
  • 10:13 arturo: moved cloudvirt1023 to 'maintenance' host aggregate. Drain it with `wmcs-drain-hypervisor` to reboot it for T275753

2021-03-01

  • 20:12 andrewbogott: removing novaadmin from all projects save 'admin' for T274385
  • 19:51 andrewbogott: removing novaobserver from all projects save 'observer' for T274385
  • 19:50 andrewbogott: adding inherited domain-wide roles to novaadmin and novaobserver as per T274385

2021-02-28

  • 04:54 andrewbogott: restarted redis-server on tools-redis-1003 and tools-redis-1004 in an attempt to reduce replag, no real change detected

2021-02-27

  • 00:33 andrewbogott: sudo cumin --timeout 500 "A:all and not O{project:clouddb-services}" 'lsb_release -c | grep -i buster && uname -r | grep -v 4.19.0-14-amd64 && reboot'
  • 00:28 andrewbogott: sudo cumin --timeout 500 "A:all and not O{project:clouddb-services}" 'lsb_release -c | grep -i buster && uname -r | grep -v 4.19.0-14-amd64 && echo reboot'
  • 00:09 andrewbogott: sudo cumin "A:all and not O{project:clouddb-services}" 'lsb_release -c | grep -i stretch && uname -r | grep -v 4.19.0-0.bpo.14-amd64 && reboot'

2021-02-26

  • 14:58 dcaro: [eqiad] rebooting cloudcephosd1015 (last osd \o/) for kernel upgrade (T275753)
  • 14:51 dcaro: [eqiad] rebooting cloudcephosd1014 for kernel upgrade (T275753)
  • 14:44 dcaro: [eqiad] rebooting cloudcephosd1013 for kernel upgrade (T275753)
  • 14:38 dcaro: [eqiad] rebooting cloudcephosd1012 for kernel upgrade (T275753)
  • 14:31 dcaro: [eqiad] rebooting cloudcephosd1011 for kernel upgrade (T275753)
  • 14:25 dcaro: [eqiad] rebooting cloudcephosd1010 for kernel upgrade (T275753)
  • 14:17 dcaro: [eqiad] rebooting cloudcephosd1009 for kernel upgrade (T275753)
  • 13:54 dcaro: [eqiad] downtimed alert1001 Ceph OSDs down alert until 18:00 GMT+1 as that is not under the host being rebooted (T275753)
  • 13:51 dcaro: [eqiad] rebooting cloudcephosd1008 for kernel upgrade (T275753)
  • 13:45 dcaro: [eqiad] rebooting cloudcephosd1007 for kernel upgrade (T275753)
  • 13:38 dcaro: [eqiad] rebooting cloudcephosd1006 for kernel upgrade (T275753)
  • 12:07 dcaro: [eqiad] rebooting cloudcephosd1005 for kernel upgrade (T275753)
  • 12:00 arturo: rebooting cloudcontrol1003 for kernel upgrade (T275753)
  • 11:42 arturo: rebooting cloudcontrol1004 for kernel upgrade (T275753)
  • 11:41 dcaro: [eqiad] rebooting cloudcephosd1004 for kernel upgrade (T275753)
  • 11:32 dcaro: [eqiad] rebooting cloudcephosd1003 for kernel upgrade (T275753)
  • 11:30 arturo: rebooting cloudcontrol1005 for kernel upgrade (T2
  • 11:26 dcaro: [eqiad] rebooting cloudcephosd1002 for kernel upgrade (T275753)
  • 11:16 dcaro: [eqiad] rebooting cloudcephosd1001 for kernel upgrade (T275753)
  • 11:11 dcaro: [eqiad] rebooting cloudcephmon1003 for kernel upgrade (T275753)
  • 11:05 dcaro: [eqiad] rebooting cloudcephmon1002 for kernel upgrade (T275753)
  • 10:59 dcaro: [eqiad] rebooting cloudcephmon1001 for kernel upgrade (T275753)
  • 10:45 arturo: rebooting cloudvirt1039 into a new kernel (T275753) --- spare
  • 10:43 dcaro: [codfw1dev] rebooting cloudcephmon2003-dev for kernel upgrade (T275753)
  • 10:38 dcaro: [codfw1dev] rebooting cloudcephmon2002-dev for kernel upgrade (T275753)
  • 10:29 dcaro: [codfw1dev] rebooting cloudcephmon2001-dev for kernel upgrade (T275753)
  • 10:24 arturo: [codfw1dev] purge old kernel packages on cloudvirt2003-dev to force boot into a new kernel (T275753)
  • 10:11 arturo: [codfw1dev] manually creating /boot/grub/ on cloudvirt2003-dev to allow update-grub2 to run (so it can reboot into a new kernel) (T275753)
  • 10:11 dcaro: [codfw1dev] rebooting cloudcephosd2003-dev for kernel upgrade (T275753)
  • 10:05 dcaro: [codfw1dev] rebooting cloudcephosd2002-dev for kernel upgrade (T275753)
  • 10:01 arturo: [codfw1dev] rebooting cloudvirt200X-dev for kernel upgrade (T275753)
  • 09:59 arturo: [codfw1dev] rebooting cloudweb2001-dev for kernel upgrade (T275753)
  • 09:53 arturo: [codfw1dev] rebooting cloudservices2003-dev for kernel upgrade (T275753)
  • 09:51 arturo: [codfw1dev] rebooting cloudservices2002-dev for kernel upgrade (T275753)
  • 09:45 arturo: [codfw1dev] rebooting cloudcontrol2004-dev for kernel upgrade (T275753)
  • 09:44 arturo: [codfw1dev] rebooting cloudbackup[2001-2002].codfw.wmnet for kernel upgrade (T275753)
  • 09:43 dcaro: [codfw1dev] rebooting cloudcephosd2001-dev for kernel upgrade (T275753)
  • 09:41 arturo: [codfw1dev] rebooting cloudcontrol2003-dev for kernel upgrade (T275753)
  • 09:33 arturo: [codfw1dev] rebooting cloudcontrol2001-dev for kernel upgrade (T275753)

2021-02-25

  • 14:56 arturo: deployed wmcs-netns-events daemon to all cloudnet servers (T275483)

2021-02-24

  • 11:07 arturo: force-reboot cloudmetrics1002, add icinga downtime for 2 hours. Investigating some server issue
  • 00:17 bstorm: set --property hw_scsi_model=virtio-scsi and --property hw_disk_bus=scsi on the main stretch image in glance on eqiad1 T275430
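
Setting hw_scsi_model and hw_disk_bus on a Glance image makes new VMs booted from it use the virtio-scsi bus. The log gives the properties but not the command; with the OpenStack CLI it is roughly (image name is a placeholder):

    openstack image set --property hw_scsi_model=virtio-scsi --property hw_disk_bus=scsi <stretch-image>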

2021-02-23

  • 22:43 bstorm: set --property hw_scsi_model=virtio-scsi and --property hw_disk_bus=scsi on the main buster image in glance on eqiad1 T275430
  • 20:36 andrewbogott: adding r/o access to the eqiad1-glance-images ceph pool for the client.eqiad1-compute for T275430
  • 10:49 arturo: rebooting cloudnet1004 into new kernel from buster-bpo (T271058)
  • 10:49 arturo: installing linux-image-amd64 from buster-bpo 5.10.13-1~bpo10+1 in cloudnet1004 (T271058)

2021-02-22

  • 17:15 bstorm: restarting nova-compute on cloudvirt1016 and cloudvirt1036 in case it helps T275411
  • 15:02 dcaro: Re-uploaded the debian buster 10.0 image from rbd to glance, that worked, re-spawning all the broken instances (T275378)
  • 11:12 dcaro: Refreshing all the canary instances (T275354)

2021-02-18

  • 14:50 arturo: rebooting cloudnet1004 for T271058
  • 10:25 dcaro: Rebooting cloudmetrics1001 to apply new kernel (T275116)
  • 10:16 dcaro: Rebooting cloudmetrics1002 to apply new kernel (T275116)
  • 10:14 dcaro: Upgrading grafana on cloudmetrics1002 (T275116)
  • 10:12 dcaro: Upgrading grafana on cloudmetrics1001 (T275116)

2021-02-17

2021-02-15

  • 16:25 arturo: [codfw1dev] rebooting all cloudgw200x-dev / cloudnet200x-dev servers (T272963)
  • 15:45 arturo: [codfw1dev] drop subnet definition for cloud-instances-transport1-b-codfw (T272963)
  • 15:45 arturo: [codfw1dev] connect virtual router cloudinstances2b-gw to vlan cloud-gw-transport-codfw (185.15.57.10) (T272963)

2021-02-11

  • 12:01 arturo: [codfw1dev] drop instance `tools-codfw1dev-bastion-1` in `tools-codfw1dev` (was buster, cannot use it yet)
  • 11:59 arturo: [codfw1dev] create instance `tools-codfw1dev-bastion-2` (stretch) in `tools-codfw1dev` to test stuff related to T272397
  • 11:45 arturo: [codfw1dev] create instance `tools-codfw1dev-bastion-1` in `tools-codfw1dev` to test stuff related to T272397
  • 11:42 arturo: [codfw1dev] drop `tools` project, create `tools-codfw1dev`
  • 11:38 arturo: [codfw1dev] drop `coudinfra` project (we are using `cloudinfra-codfw1dev` there)
  • 05:37 bstorm: downtimed cloudnet1004 for another week T271058

2021-02-09

  • 15:23 arturo: icinga-downtime for 2h everything *labs *cloud for openstack upgrades
  • 11:14 dcaro: Merged the osd scheduler change for all osds, applying on all cloudcephosd* (T273791)

2021-02-08

  • 18:50 bstorm: enabled puppet on cloudvirt1023 for now T274144
  • 18:44 bstorm: restarted the backup_vms.service on cloudvirt1027 T274144
  • 17:51 bstorm: deleted project pki T273175

2021-02-05

  • 10:59 arturo: icinga-downtime labstore1004 tools share space check for 1 week (T272247)
  • 10:21 dcaro: This was affecting maps and several others; maps and project-proxy have been fixed (T273956)
  • 09:19 dcaro: Some certs around the infra are expired (T273956)

2021-02-04

  • 10:12 dcaro: Increasing the memory limit of osds in eqiad from 8589934592(8G) to 12884901888(12G) (T273851)
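
(How the limit change above was applied is not recorded in the entry; a minimal sketch of doing it by hand with the Ceph CLI, assuming the osd_memory_target setting and the centralized config store, values in bytes:)
 # raise the OSD memory target for all OSDs from 8 GiB to 12 GiB
 ceph config set osd osd_memory_target 12884901888
 # confirm the value that one daemon now resolves
 ceph config get osd.0 osd_memory_target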

2021-02-03

  • 09:59 dcaro: Doing a full vm backup on cloudvirt1024 with the new script (T260692)
  • 01:50 bstorm: icinga-downtime cloudnet1004 for a week T271058

2021-02-02

  • 17:14 dcaro: Changed osd memory limit from 4G to 8G (T273649)
  • 11:00 arturo: icinga-downtime cloudvirt-wdqs1001 for 1 week (T273579)
  • 03:12 andrewbogott: running /usr/local/sbin/wmcs-purge-backups and /usr/local/sbin/wmcs-backup-instances on cloudvirt1024 to see why the backup job paged

2021-01-29

  • 15:36 andrewbogott: disabling puppet and some services on eqiad1 cloudcontrol nodes; replacing nova-placement-api with placement-api

2021-01-28

  • 19:44 andrewbogott: shutting down cloudcontrol2001-dev because it's in a partially upgraded state; will revive when it's time for Train

2021-01-27

  • 00:50 bstorm: icinga-downtime cloudnet1004 for a week T271058

2021-01-22

  • 16:44 andrewbogott: upgrading designate on cloudvirt1003/1004 to OpenStack 'train'
  • 11:29 dcaro: While doing some tests, removed the cloudcontrol1003 puppet cert; regenerating...

2021-01-21

2021-01-20

  • 10:49 arturo: merging core router firewall change https://gerrit.wikimedia.org/r/c/operations/homer/public/+/657302 (T209082)
  • 10:05 dcaro: Everything looks ok: created a new vm with a volume in ceph without issues, and no warnings/errors on ceph status; closing (T272303)
  • 09:55 dcaro: Eqiad ceph cluster upgraded, doing sanity checks (T272303)
  • 09:46 dcaro: 75% of the eqiad cluster upgraded... continuing (T272303)
  • 09:37 dcaro: 25% of the eqiad cluster upgraded... continuing (T272303)
  • 09:24 dcaro: Mgr daemons upgraded and running, upgrading osd daemons on servers cloudcephosd1*, this may take a bit longer (T272303)
  • 09:22 dcaro: Mon daemons upgraded and running, upgrading mgr daemons on servers cloudcephmon1* (T272303)
  • 09:16 dcaro: Starting eqiad ceph upgrade, upgrading the mon servers cloudcephmon1* (T272303)
  • 09:01 dcaro: Will start the ceph upgrade in 15 min, no downtime nor performance impact is expected (T272303)

2021-01-19

  • 10:17 arturo: icinga-downtime cloudnet1004 for 1 week (T271058)

2021-01-18

  • 16:00 dcaro: Codfw1 ceph cluster upgraded, will wait until tomorrow to see if there's any instability, but everything looks fine (T272303)
  • 15:38 dcaro: Upgraded mgr services on codfw ceph cluster, starting with osd ones (T272303)
  • 15:35 dcaro: Upgraded mon services on codfw ceph cluster, starting with mgr ones (T272303)
  • 15:21 dcaro: Starting upgrade of ceph mon nodes on codfw (T272303)
  • 15:06 dcaro: re-enabling puppet on cloudcephosd2* hosts
  • 13:53 dcaro: disabling puppet on cloudcephosd2* to resume perf tests
  • 10:50 dcaro: re-enabling puppet on cloudcephosd2* (codfw)
  • 10:07 dcaro: disabling puppet on cloudcephosd2* (codfw) to do some performance tests
  • 09:00 dcaro: Enabling custom application 'cinder' on pool codfw1dev-cinder to get rid of health warnings

2021-01-17

  • 16:53 arturo: icinga downtime labstore1004 /srv/tools space check for 3 days (T272247)

2021-01-15

  • 13:41 arturo: icinga downtime labstore1004 maintain-dbuser alert until 2021-01-19 (T272125)
  • 09:47 arturo: labstore1004 maintain-dbusers affected by T272127 and T272125
  • 09:22 arturo: restart maintain-dbusers.service in labstore1004
  • 08:19 dcaro: Merging the patch to disable write caches on ceph osds (T271527)

2021-01-13

  • 17:03 arturo: move cloudvirt1013 cloudvirt1032 cloudvirt1037 to the 'toobusy' host aggregate to prevent further CPU oversubscription
  • 12:40 arturo: try increasing systemd watchdog timeout for conntrackd in cloudnet1004 (T268335)
  • 11:45 dcaro: https://gerrit.wikimedia.org/r/c/operations/puppet/+/654419 merged and deployed (and tested) (T268877)
  • 11:40 dcaro: merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/654419 that might affect the encapi service (puppet on cloud environment), no downtime expected though (T268877)
  • 10:56 arturo: trying to cleanup dpkg package mess in cloudnet2002-dev
  • 10:02 arturo: prevent floating IP allocation from neutron transport subnet: root@cloudcontrol1005:~# neutron subnet-update --allocation-pool start=185.15.56.244,end=185.15.56.244 cloud-instances-transport1-b-eqiad1 (T271867)

2021-01-12

  • 10:33 arturo: reboot cloudnet1004
  • 10:32 arturo: update firmware-bnx2x from 20190114-2 to 20200918-1~bpo10+1 on cloudnet1004 (T271058)

2021-01-11

  • 10:22 arturo: doubling size of conntrack table in cloudnet servers https://gerrit.wikimedia.org/r/c/operations/puppet/+/655407 (T271058)
  • 10:07 arturo: manually cleanup conntrack table in cloudnet1004 (T271058)
  • 09:19 dcaro: cleaned up ~1800 snapshots, only 109 remaining, one for each host x image combination (plus some ephemeral ones while doing backups), closing the task (T270478)
  • 08:39 dcaro: cleaning up dangling snapshots now that we have the new suffixed ones (T270478)
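
(A minimal sketch of how dangling snapshots like the ones cleaned up above can be inspected and removed with the rbd CLI; the image and snapshot names are placeholders, and the exact procedure used is not recorded in these entries.)
 # list the snapshots attached to one VM disk image in the compute pool
 rbd snap ls eqiad1-compute/<image-name>
 # remove a single leftover snapshot by name
 rbd snap rm eqiad1-compute/<image-name>@<snapshot-name>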

2021-01-10

  • 16:02 andrewbogott: restarting rabbitmq-server on all eqiad1 cloudcontrols
  • 15:54 andrewbogott: restarting neutron-metadata-agent on cloudnet1004 due to many syslog complaints

2021-01-08

  • 11:25 arturo: rebooting both cloudnet2002-dev/cloudnet2003-dev to make sure interfaces are set up correctly (T271517)
  • 11:22 arturo: connecting cloudnet2002-dev cloudnet2003-dev back to vlan 2120 (T271517)
  • 11:06 arturo: root@cloudcontrol2001-dev:~# openstack router set --external-gateway wan-transport-codfw --fixed-ip subnet=cloud-instances-transport1-b-codfw,ip-address=208.80.153.190 cloudinstances2b-gw (T271517)
  • 11:02 arturo: root@cloudcontrol2001-dev:~# openstack router set --enable-snat cloudinstances2b-gw --external-gateway wan-transport-codfw (T271517)
  • 11:01 arturo: enabling neutron hacks in codfw1dev (cloudnet2002-dev, cloudnet2003-dev) (T271517)
  • 10:55 arturo: aborrero@labtestvirt2003:~ $ sudo ifdown eno2.2107 (T271517)
  • 10:55 arturo: aborrero@labtestvirt2003:~ $ sudo ifdown eno2.2120 (T271517)
  • 10:53 arturo: root@cloudcontrol2001-dev:~# openstack subnet create --network wan-transport-codfw --gateway 208.80.153.185 --ip-version 4 --network wan-transport-codfw --no-dhcp --subnet-range 208.80.153.184/29 cloud-instances-transport1-b-codfw (T271517)
  • 10:40 dcaro: Finished tests, bringing osd online (osd.48) for eqiad ceph cluster (T271417)
  • 09:59 dcaro: Started performance tests on sdc (osd.48) for eqiad ceph cluster (T271417)
  • 09:41 dcaro: Taking osd.48 from eqiad ceph cluster out to do performance tests (T271417)

2021-01-07

  • 15:19 dcaro: Finished speed tests on cloudcephosd2001-dev, reprovisioning the osd.0 sdc (T271417)
  • 14:39 dcaro: Starting speed tests on cloudcephosd2001-dev sdc (T271417)
  • 12:54 dcaro: Taking osd.0 down on codfw ceph cluster to try the disk performance testing process (T271417)
  • 11:35 arturo: merging dmz_cidr change (T209082, T267779)

2021-01-05

  • 10:40 dcaro: removing dumps-[1..*] backups from cloudvirt1024 as they are not needed (T271094)

2021-01-03

  • 07:06 dcaro: Got a network hiccup on cloudnet1004, keeping track here T271058

2020-12-28

2020-12-23

  • 15:38 andrewbogott: restarting rabbitmq on cloudcontrol1004; suspected leaks
  • 15:33 andrewbogott: restarting each cloudcontrol galera node in turn to see if that quiets down the syncing warnings
  • 12:08 arturo: move memory out of the swap in cloudcontrol1004 by disabling/enabling it (1Gb swap was being used)
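
(The swap trick in the last entry is presumably the usual swapoff/swapon cycle; a minimal sketch, as an assumption about how it was done:)
 # see how much swap is currently in use
 free -h
 # force pages back into RAM by disabling swap, then re-enable it
 sudo swapoff -a && sudo swapon -a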

2020-12-22

  • 15:30 dcaro: cleaning up 6778 dangling snapshots for glance images in eqiad (T270478)
  • 13:51 dcaro: merged patch to move wikidumpparse backups to cloudvirt1025 to free space on cloudvirt1026

2020-12-19

  • 16:18 dcaro: gzipped a bunch of logs on cloudvirt1004 due to / being out of space
  • 00:14 bstorm: truncated /var/log/debug.1 on cloudcontrol1003 which appears to be the exact same content as the user.log files anyway
  • 00:10 bstorm: truncated /var/log/daemon.log.1 and the haproxy log
  • 00:02 bstorm: truncated /var/log/messages.1 on cloudcontrol1003

2020-12-18

  • 23:53 bstorm: truncated haproxy.log.1 on cloudcontrol1003
  • 20:46 andrewbogott: setting pg and pgp number to 4096 for eqiad1-compute as joachim thinks 8192 might be too much T270305
  • 17:09 dcaro: finished cleaning up the dangling snapshots from cloudvirt1026 (T270478)
  • 17:08 dcaro: removing dangling rbd snapshots (for backups on cloudvirt1026) (T270478)
  • 17:06 dcaro: finished cleaning up the dangling snapshots from cloudvirt1025 (T270478)
  • 17:05 dcaro: removing dangling rbd snapshots (for backups on cloudvirt1025) (T270478)
  • 17:00 dcaro: finished cleaning up the dangling snapshots from cloudvirt1021 (T270478)
  • 16:58 dcaro: removing dangling rbd snapshots (for backups on cloudvirt1021) (T270478)
  • 16:56 dcaro: finished cleaning up the dangling snapshots from cloudvirt1022 (T270478)
  • 16:55 dcaro: removing dangling rbd snapshots (for backups on cloudvirt1022) (T270478)
  • 16:54 dcaro: finished cleaning up the dangling snapshots from cloudvirt1023 (T270478)
  • 16:51 dcaro: removing dangling rbd snapshots (for backups on cloudvirt1023) (T270478)
  • 16:47 dcaro: finished cleaning up the dangling snapshots from cloudvirt1024, freed ~12% of the capacity (T270478)
  • 16:21 dcaro: removing dangling rbd snapshots (for backups on cloudvirt1024) (T270478)
  • 16:13 andrewbogott: setting autoscale to 'off' for both ceph pools (eqiad1-compute and eqiad1-glance-images) because we like how things are set and the autoscaler does not
  • 10:33 dcaro: purging rbd snapshots for image fc6fb78b-4515-4dcc-8254-591b9fe01762 (T270478)

2020-12-17

  • 22:17 andrewbogott: correction to above, set the pg and pgp to 1024 for eqiad1-glance-images
  • 22:16 andrewbogott: setting pgp number to 8192 for eqiad1-compute (a 4x increase) and 2048 for eqiad1-glance-images (also a 4x increase) T270305 (same as pg)
  • 22:14 andrewbogott: setting pg number to 8192 for eqiad1-compute (a 4x increase) and 2048 for eqiad1-glance-images (also a 4x increase) T270305
  • 22:10 andrewbogott: setting autoscale to 'warn' for both ceph pools (eqiad1-compute and eqiad1-glance-images)
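
(A minimal sketch of the pool adjustments described in this block, using the standard Ceph pool commands; pool names and final counts are taken from the entries above, and the exact invocations used at the time are not recorded here.)
 # stop the autoscaler from acting on its own (it will only warn)
 ceph osd pool set eqiad1-compute pg_autoscale_mode warn
 ceph osd pool set eqiad1-glance-images pg_autoscale_mode warn
 # raise placement group counts; pgp_num is kept equal to pg_num
 ceph osd pool set eqiad1-compute pg_num 8192
 ceph osd pool set eqiad1-compute pgp_num 8192
 ceph osd pool set eqiad1-glance-images pg_num 1024
 ceph osd pool set eqiad1-glance-images pgp_num 1024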

2020-12-16

  • 09:31 dcaro: removing invalid backups from cloudvirt1024 (196 in total) (T269419)

2020-12-14

  • 17:42 dcaro: The removal freed ~12GB (still 100% usage :S) (T269419)
  • 17:36 dcaro: removing invalid backups that have a valid copy (T269419)
  • 15:43 dcaro: Merging the tagging for vm backups (T267195)
  • 09:45 arturo: icinga downtime cloudvirt1024 for 6 days (T269419)

2020-12-13

  • 09:11 _dcaro: running backup purge script on cloudvirt1024 (T269419)

2020-12-10

  • 23:36 bstorm: cleaned up the logs for haproxy on cloudcontrol1003 by deleting all the gzipped ones and truncating the .1 file
  • 11:56 dcaro: Freed some space on cloudvirt1024 by running the purge script (T269419)
  • 09:17 dcaro: removing leaked dns record discordwiki.eqiad.wmflabs (clinic duty)

2020-12-08

  • 18:01 dcaro: Host cloudvirt1030 up and running (T216195)
  • 15:59 dcaro: Re-imaging host cloudvirt1030 (T216195)
  • 14:18 dcaro: Host online cloudvirt1029 (T216195)
  • 14:13 dcaro: Host re-imaged, doing tests cloudvirt1029 (T216195)
  • 12:14 dcaro: Re-imaging cloudvirt1029 (T216195)

2020-12-07

  • 18:33 andrewbogott: putting cloudvirt1023 back into service T269467
  • 15:55 andrewbogott: reimaging cloudvirt1028 for T216195
  • 14:49 dcaro: Re-imaging cloudvirt1027 (T216195)

2020-12-05

  • 00:35 andrewbogott: moving cloudvirt1023 back into maintenance because T269467 continues to puzzle

2020-12-04

  • 22:33 andrewbogott: moving cloudvirt1023 back into the ceph aggregate; it doesn't need upgrades after all T269467
  • 22:24 andrewbogott: moving cloudvirt1023 out of the ceph aggregate and into maintenance for T269467
  • 21:06 andrewbogott: putting cloudvirt1025 and 1026 back into service because I'm pretty sure they're fixed. T269313
  • 12:12 arturo: manually running `wmcs-purge-backups` again on cloudvirt1024 (T269419)
  • 11:25 arturo: icinga downtime cloudvirt1024 for 6 days, to avoid paging noises (T269419)
  • 11:25 arturo: last log line referencing cloudvirt1024 is a mistake (T269313)
  • 11:24 arturo: icinga downtime cloudvirt1024 for 6 days, to avoid paging noises (T269313)
  • 10:28 arturo: manually running `wmcs-purge-backups` on cloudvirt1024 (T269419)
  • 10:23 arturo: setting expiration to 2020-12-03 to the oldest backy snapshot of every VM in cloudvirt1024 (T269419)
  • 09:54 arturo: icinga downtime cloudvirt1025 for 6 days (T269313)

2020-12-03

  • 23:21 andrewbogott: removing all osds on cloudcephosd1004 for rebuild, T268746
  • 21:45 andrewbogott: removing all osds on cloudcephosd1005 for rebuild, T268746
  • 19:51 andrewbogott: removing all osds on cloudcephosd1006 for rebuild, T268746
  • 17:01 arturo: icinga downtime cloudvirt1025 for 48h to debug network issue T269313
  • 16:56 arturo: rebooting cloudvirt1025 to debug network issue T269313
  • 16:38 dcaro: Reimaging cloudvirt1026 (T216195)
  • 13:24 andrewbogott: removing all osds on cloudcephosd1008 for rebuild, T268746
  • 02:55 andrewbogott: removing all osds on cloudcephosd1009 for rebuild, T268746

2020-12-02

  • 20:04 andrewbogott: removing all osds on cloudcephosd1010 for rebuild, T268746
  • 17:25 arturo: [15:51] failing over the neutron virtual router in eqiad1 (T268335)
  • 15:36 arturo: conntrackd is now up and running in cloudnet1003/1004 nodes (T268335)
  • 15:33 arturo: [codfw1dev] conntrackd is now up and running in cloudnet200x-dev nodes (T268335)
  • 15:08 andrewbogott: removing all osds on cloudcephosd1012 for rebuild, T268746
  • 12:41 arturo: disable puppet in all cloudnet servers to merge conntrackd change T268335
  • 11:12 dcaro: Reset the properties for the flavor g2.cores8.ram16.disk1120 to correct quotes (T269172)
  • 09:57 arturo: moved cloudvirts 1030, 1029, 1028, 1027, 1026, 1025 away from the 'standard' host aggregate to 'maintenance' (T269172)

2020-12-01

2020-11-30

  • 18:12 andrewbogott: removing all osds from cloudcephosd1015 in order to investigate T268746

2020-11-29

  • 17:18 andrewbogott: cleaning up some logfiles in tools-sgecron-01 — drive is full

2020-11-26

  • 22:58 andrewbogott: deleting /var/log/haproxy logs older than 7 days in cloudcontrol100x. We need log rotation here it seems.
  • 15:53 dcaro: Created private flavor g2.cores8.ram16.disk1120 for wikidumpparse (T268190)

2020-11-25

  • 19:35 bstorm: repairing ceph pg `instructing pg 6.91 on osd.117 to repair`
  • 09:31 _dcaro: The OSD seems to be up and running actually, though there's that misleading log; will leave it and see if the cluster comes back fully healthy (T268722)
  • 08:54 _dcaro: Unsetting noup/nodown to allow re-shuffling of the pgs that osd.44 had, will try to rebuild it (T268722)
  • 08:45 _dcaro: Tried resetting the class for osd.44 to ssd, no luck, the cluster is in noout/norebalance to avoid data shuffling (opened T268722)
  • 08:45 _dcaro: Tried resetting the class for osd.44 to ssd, no luck, the cluster is in noout/norebalance to avoid data shuffling (opened root@cloudcephosd1005:/var/lib/ceph/osd/ceph-44# ceph osd crush set-device-class ssd osd.44)
  • 08:19 _dcaro: Restarting service osd.44 resulted in osd.44 being unable to start due to some config inconsistency (can not reset class to hdd)
  • 08:16 _dcaro: After enabling auto pg scaling on ceph eqiad cluster, osd.44 (cloudcephosd1005) got stuck, trying to restart the osd service
  • 08:16 _dcaro: After enabling auto pg scaling on ceph eqiad cluster, osd.44 (cloudcephosd1005) got stuck, trying to restart

2020-11-22

  • 17:40 andrewbogott: apt-get upgrade on cloudservices1003/1004
  • 17:32 andrewbogott: upgrading Designate on cloudservices1003/1004 to Stein

2020-11-20

  • 12:44 arturo: [codfw1dev] install conntrackd in cloudnet2003-dev/cloudnet2002-dev to research l3 agent HA reliability
  • 09:26 arturo: icinga downtime labstore1006 RAID checks for 10 days (T268281)

2020-11-17

  • 19:21 andrewbogott: draining cloudvirt1012 to experiment with libvirt/cpu things

2020-11-15

  • 11:21 arturo: icinga downtime cloudbackup2002 for 48h (T267865)

2020-11-10

  • 16:38 arturo: icinga downtime toolschecker for 2h because of toolsdb maintenance (T266587)
  • 11:24 arturo: [codfw1dev] enable puppet in puppetmaster01.cloudinfra-codfw1dev (disabled for unspecified reasons)

2020-11-09

  • 12:42 arturo: restarted neutron l3 agent in cloudnet1003 because it still had the old default route (T265288)
  • 12:41 arturo: `root@cloudcontrol1005:~# neutron subnet-delete dcbb0f98-5e9d-4a93-8dfc-4e3ec3c44dcc` (T265288)
  • 12:41 arturo: `root@cloudcontrol1005:~# neutron router-gateway-set --fixed-ip subnet_id=7c6bcc12-212f-44c2-9954-5c55002ee371,ip_address=185.15.56.244 cloudinstances2b-gw wan-transport-eqiad` (T265288)
  • 12:19 arturo: subnet 185.15.56.240/29 has id 7c6bcc12-212f-44c2-9954-5c55002ee371 in neutron (T265288)
  • 12:19 arturo: `root@cloudcontrol1005:~# neutron subnet-create --gateway 185.15.56.241 --name cloud-instances-transport1-b-eqiad1 --ip-version 4 --disable-dhcp wan-transport-eqiad 185.15.56.240/29` (T265288)
  • 12:15 arturo: icinga-downtime toolschecker for 2h (T265288)

2020-11-02

  • 13:36 arturo: (typo: dcaro)
  • 13:35 arturo: added dcar as projectadmin & user (T266068)

2020-10-29

  • 16:57 bstorm: silenced deployment-prep project alerts for 60 days since the downtime expired
  • 08:12 arturo: force-powercycling cloudcephosd1006

2020-10-25

  • 16:20 andrewbogott: adding cloudvirt1038 to the 'ceph' aggregate and removing from the 'spare' aggregate. We need this space while waiting on network upgrades for empty cloudvirts (T216195)

2020-10-23

  • 11:30 arturo: [codfw1dev] openstack --os-project-id cloudinfra-codfw1dev recordset create --type PTR --record nat.cloudgw.codfw1dev.wikimediacloud.org. --description "created by hand" 0-29.57.15.185.in-addr.arpa. 1.0-29.57.15.185.in-addr.arpa. (T261724)
  • 10:09 arturo: [codfw1dev] doing DNS changes for the cloudgw PoC, including designate and https://gerrit.wikimedia.org/r/c/operations/dns/+/635965 (T261724)

2020-10-22

  • 10:46 arturo: [codfw1dev] rebooting cloudinfra-internal-puppetmaster-01.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud to try fixing some DNS weirdness
  • 09:43 arturo: enabling puppet in cloudcontrol1003 (message said "please re-enable after 2020-10-22 06:00UTC")

2020-10-21

  • 14:36 andrewbogott: running apt-get update && apt-get install -y facter on all cloud-vps instances
  • 10:31 arturo: [codfw1dev] reimaging labtestvirt2003 (cloudgw) to test puppet code (T261724)
  • 08:56 arturo: [codfw1dev] reimaging labtestvirt2003 (cloudgw) to test puppet code (T261724)

2020-10-20

2020-10-19

  • 01:41 andrewbogott: deleting all Precise base images
  • 01:36 andrewbogott: deleting all unused Jessie base images

2020-10-18

  • 23:26 andrewbogott: deleting all Trusty base images
  • 21:50 andrewbogott: migrating all currently used ceph images to rbd

2020-10-16

  • 09:29 arturo: [codfw1dev] still some DNS weirdness, investigating
  • 09:25 arturo: [codfw1dev] hard-rebooting bastion-codfw1dev-02, seems in bad shape, doesn't even wake up in the virsh console
  • 09:18 arturo: [codfw1dev] live-hacked cloudservices2002-dev /etc/powerdns/recursor.conf file to include cloud-codfw1dev-floating CIDR (185.15.57.0/29) while https://gerrit.wikimedia.org/r/c/operations/puppet/+/634050 is in review, so VMs with a floating IP can query the DNS recursor (T261724)
  • 09:01 arturo: [codfw1dev] basic network connectivity seems stable after cleaning up everything related to address scopes (T261724)

2020-10-15

  • 15:17 arturo: [codfw1dev] try cleaning up anything related to address scopes in the neutron database (T261724)
  • 13:56 arturo: [codfw1dev] drop neutron l3 agent hacks in cloudnet2002/2003-dev (T261724)

2020-10-13

  • 17:54 andrewbogott: rebuilding cloudvirt1021 for backy support
  • 15:22 andrewbogott: draining cloudvirt1021 so I can rebuild it with backy support
  • 14:19 andrewbogott: rebuilding cloudvirt1022 with backy support
  • 14:03 andrewbogott: draining cloudvirt1022 so I can rebuild it with backy support
  • 11:19 arturo: [codfw1dev] rebooting labtestvirt2003

2020-10-09

  • 10:15 arturo: [codfw1dev] root@cloudcontrol2001-dev:~# openstack router set --disable-snat cloudinstances2b-gw --external-gateway wan-transport-codfw (T261724)
  • 09:22 arturo: [codfw1dev] rebooting cloudnet boxes for bridge and vlan changes (T261724)
  • 09:12 arturo: [codfw1dev] root@cloudcontrol2001-dev:~# openstack subnet delete 31214392-9ca5-4256-bff5-1e19a35661de (cloud-instances-transport1-b-codfw - 208.80.153.184/29) (T261724)
  • 09:10 arturo: [codfw1dev] root@cloudcontrol2001-dev:~# openstack router set --external-gateway wan-transport-codfw --fixed-ip subnet=cloud-gw-transport-codfw,ip-address=185.15.57.10 cloudinstances2b-gw (T261724)
  • 08:49 arturo: [codfw1dev] root@cloudcontrol2001-dev:~# openstack subnet create --network wan-transport-codfw --gateway 185.15.57.9 --no-dhcp --subnet-range 185.15.57.8/30 cloud-gw-transport-codfw (T261724)
  • 08:47 arturo: [codfw1dev] root@cloudcontrol2001-dev:~# openstack subnet delete a5ab5362-4ffb-4059-9ff7-391e22dcf3bc (T261724)

2020-10-08

  • 16:17 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# openstack subnet create --network wan-transport-codfw --gateway 185.15.57.8 --no-dhcp --subnet-range 185.15.57.8/31 cloud-gw-transport-codfw` (with a hack -- see task) (T263622)
  • 16:03 arturo: [codfw1dev] briefly live-hacked python3-neutron source code in all 3 cloudcontrol2xxx-dev servers to workaround /31 network definition issue (T263622)
  • 10:28 arturo: [codfw1dev] reimaging labtestvirt2003 (cloudgw) T261724

2020-10-06

  • 21:30 andrewbogott: moved cloudvirt1013 out of the 'ceph' aggregate and into the 'maintenance' aggregate for T243414
  • 21:29 andrewbogott: draining cloudvirt1013 for upgrade to 10G networking
  • 14:45 arturo: icinga downtime every cloud* lab* host for 60 minutes for keystone maintenance

2020-10-05

  • 17:40 bd808: `service uwsgi-labspuppetbackend restart` on cloud-puppetmaster-03 (T264649)

2020-10-02

  • 11:05 arturo: [codfw1dev] restarting rabbitmq-server in all 3 control nodes, the l3 agent was misbehaving
  • 09:16 arturo: [codfw1dev] trying the labtestvirt2003 (cloudgw) reimage again (T261724)

2020-10-01

  • 16:06 arturo: rebooting cloudvirt1024 to validate changes to /etc/network/interfaces file
  • 15:36 arturo: [codfw1dev] reimaging labtestvirt2003

2020-09-30

  • 16:47 andrewbogott: rebooting cloudvirt1032, 1033, 1034 for T262979
  • 13:28 arturo: enable puppet, reboot and pool back cloudvirt1031
  • 13:27 arturo: extend icinga downtimes for another 120 mins
  • 13:15 arturo: `aborrero@cloudcontrol1003:~$ sudo nova-manage placement sync_aggregates` after reading a hint in nova-api.log
  • 13:02 arturo: rebooting cloudvirt1016 and moving it to the ceph host aggregate
  • 12:55 arturo: rebooting cloudvirt1014 and moving it to the ceph host aggregate
  • 12:51 arturo: rebooting cloudvirt1013 and moving it to the ceph host aggregate
  • 12:39 arturo: root@cloudcontrol1005:~# openstack aggregate add host maintenance cloudvirt1031
  • 12:36 arturo: rebooted cloudnet1003 (active) a couple of minutes ago
  • 12:36 arturo: move cloudvirt1012 and cloudvirt1039 to the ceph aggregate
  • 11:49 arturo: rebooting cloudvirt1039
  • 11:46 arturo: rebooting cloudvirt1012
  • 11:40 arturo: rebooting cloudnet1004 (standby) to pick up https://gerrit.wikimedia.org/r/c/operations/puppet/+/631167 (T262979)
  • 11:38 arturo: [codfw1dev] rebooting cloudnet2002-dev to pick up https://gerrit.wikimedia.org/r/c/operations/puppet/+/631167
  • 11:36 arturo: [codfw1dev] rebooting cloudnet2003-dev to pick up https://gerrit.wikimedia.org/r/c/operations/puppet/+/631167
  • 11:33 arturo: disabling puppet and downtiming every virt/net server in the fleet in preparation for merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/631167 (T262979)
  • 09:32 arturo: rebooting cloudvirt1012 to investigate linuxbridge agent issues

2020-09-29

  • 15:40 arturo: downgrade linux kernel from linux-image-4.19.0-11-amd64 to linux-image-4.19.0-10-amd64 on cloudvirt1012
  • 14:47 arturo: rebooting cloudvirt1012, chasing config weirdness in the linuxbridge agent
  • 14:05 andrewbogott: reimaging 1014 over and over in an attempt to get partman right
  • 13:51 arturo: rebooting cloudvirt1012

2020-09-28

  • 14:55 arturo: [jbond42] upgraded facter to v3 across the VM fleet
  • 13:54 andrewbogott: moving cloudvirt1035 from aggregate 'spare' to 'ceph'. We're going to need all the capacity we can get while converting older cloudvirts to ceph

2020-09-24

  • 15:47 arturo: stopping/restarting rabbitmq-server in all cloudcontrol servers
  • 15:45 arturo: restarting rabbitmq-server in cloudcontrol1003
  • 15:15 arturo: restarting floating_ip_ptr_records_updater.service in all 3 cloudcontrol servers to reset state after a DNS failure

2020-09-18

  • 10:16 arturo: cloudvirt1039 libvirtd service issues were fixed with a reboot
  • 09:56 arturo: rebooting cloudvirt1039 (spare) to try to fix some weird libvirtd failure
  • 09:50 arturo: enabling puppet in cloudvirts and effectively merging patches from T262979
  • 08:59 arturo: disable puppet in all buster cloudvirts (cloudvirt[1024,1031-1039].eqiad.wmnet) to merge a patch for T263205 and T262979
  • 08:50 arturo: installing iptables from buster-bpo in cloudvirt1036 (T263205 and T262979)

2020-09-15

  • 20:32 andrewbogott: rebooting cloudvirt1038 to see if it resolves T262979
  • 13:58 andrewbogott: draining cloudvirt1002 with wmcs-ceph-migrate

2020-09-14

  • 14:21 andrewbogott: draining cloudvirt1001, migrating all VMs with wmcs-ceph-migrate
  • 10:41 arturo: [codfw1dev] trying to get the bonding working for labtestvirt2003 (T261724)
  • 09:47 arturo: installed qemu security update in eqiad1 cloudvirts (T262386)
  • 09:43 arturo: [codfw1dev] installed qemu security update in codfw1dev cloudvirts (T262386)

2020-09-09

2020-09-08

  • 21:48 bd808: Renamed FQDN prefixes to wikimedia.cloud scheme in cloudinfra-db01's labspuppet db (T260614)
  • 14:29 andrewbogott: restarting nova-compute on all cloudvirts (everyone is upset from the reset switch failure)
  • 14:18 arturo: restarting nova-fullstack service in cloudcontrol1003
  • 14:17 andrewbogott: stopping apache2 on labweb1001 to make sure the Horizon outage is total

2020-09-03

  • 09:31 arturo: icinga downtime cloud* servers for 30 mins (T261866)

2020-09-02

  • 08:46 arturo: [codfw1dev] reimaging spare server labtestvirt2003 as debian buster (T261724)

2020-09-01

  • 18:18 andrewbogott: adding drives on cloudcephosd100[3-5] to ceph osd pool
  • 13:40 andrewbogott: adding drives on cloudcephosd101[0-2] to ceph osd pool
  • 13:35 andrewbogott: adding drives on cloudcephosd100[1-3] to ceph osd pool
  • 11:27 arturo: [codfw1dev] rebooting again cloudnet2002-dev after some network tests, to reset initial state (T261724)
  • 11:09 arturo: [codfw1dev] rebooting cloudnet2002-dev after some network tests, to reset initial state (T261724)
  • 10:49 arturo: disable puppet in cloudnet servers to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/623569/

2020-08-31

2020-08-28

  • 20:12 bd808: Running `wmcs-novastats-dnsleaks --delete` from cloudcontrol1003

2020-08-26

  • 17:12 bstorm: Running 'ionice -c 3 nice -19 find /srv/tools -type f -size +100M -printf "%k KB %p\n" > tools_large_files_20200826.txt' on labstore1004 T261336

2020-08-21

  • 21:34 andrewbogott: restarting nova-compute on cloudvirt1033; it seems stuck

2020-08-19

  • 14:21 andrewbogott: rebooting cloudweb2001-dev, labweb1001, labweb1002 to address mediawiki-induced memleak

2020-08-06

  • 21:02 andrewbogott: removing cloudvirt1004/1006 from nova's list of hypervisors; rebuilding them to use as backup test hosts
  • 20:06 bstorm: manually stopped the RAID check on cloudcontrol1003 T259760

2020-08-04

  • 18:54 bstorm: restarting mariadb on cloudcontrol1004 to setup parallel replication

2020-08-03

  • 17:02 bstorm: increased db connection limit to 800 across galera cluster because we were clearly hovering at limit
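
(Assuming the limit was raised at runtime through the MariaDB client — the entry does not say how — a minimal sketch, to be repeated on each galera node:)
 # raise the connection limit for the running server
 sudo mysql -e "SET GLOBAL max_connections = 800;"
 # confirm the new value
 sudo mysql -e "SHOW VARIABLES LIKE 'max_connections';"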

2020-07-31

  • 19:28 bd808: wmcs-novastats-dnsleaks --delete (lots of leaked fullstack-monitoring records to clean up)

2020-07-27

  • 22:17 andrewbogott: ceph osd pool set compute pg_num 2048
  • 22:14 andrewbogott: ceph osd pool set compute pg_autoscale_mode off

2020-07-24

  • 19:15 andrewbogott: ceph mgr module enable pg_autoscaler
  • 19:15 andrewbogott: ceph osd pool set compute pg_autoscale_mode on

2020-07-22

  • 08:55 jbond42: [codfw1dev] upgrading hiera to version5
  • 08:48 arturo: [codfw1dev] add jbond as user in the bastion-codfw1dev and cloudinfra-codfw1dev projects
  • 08:45 arturo: [codfw1dev] enabled account creation in labtestwiki briefly for jbond42 to create an account

2020-07-16

2020-07-15

  • 23:15 bd808: Removed Merlijn van Deen from toollabs-trusted Gerrit group (T255697)
  • 11:48 arturo: [codfw1dev] created DNS records (A and PTR) for bastion.bastioninfra-codfw1dev.codfw1dev.wmcloud.org <-> 185.15.57.2
  • 11:41 arturo: [codfw1dev] add myself as projectadmin to the `bastioninfra-codfw1dev` project
  • 11:39 arturo: [codfw1dev] created DNS zone `bastioninfra-codfw1dev.codfw1dev.wmcloud.org.` in the cloudinfra-codfw1dev project and then transfer ownership to the bastioninfra-codfw1dev project

2020-07-14

  • 15:19 arturo: briefly set root@cloudnet1003:~ # sysctl net.ipv4.conf.all.accept_local=1 (in neutron qrouter netns) (T257534)
  • 10:43 arturo: icinga downtime cloudnet* hosts for 30 mins to introduce new check https://gerrit.wikimedia.org/r/c/operations/puppet/+/612390 (T257552)
  • 04:01 andrewbogott: added a wildcard *.wmflabs.org domain pointing at the domain proxy in project-proxy
  • 04:00 andrewbogott: shortened the ttl on .wmflabs.org. to 300

2020-07-13

  • 16:17 arturo: icinga downtime cloudcontrol[1003-1005].wikimedia.org for 1h for galera database movements

2020-07-12

  • 17:39 andrewbogott: switched eqiad1 keystone from m5 to cloudcontrol galera

2020-07-10

  • 20:26 andrewbogott: disabling nova api to move database to galera

2020-07-09

  • 11:23 arturo: [codfw1dev] rebooting cloudnet2003-dev again for testing sysctl/puppet behavior (T257552)
  • 11:11 arturo: [codfw1dev] rebooting cloudnet2003-dev for testing sysctl/puppet behavior (T257552)
  • 09:16 arturo: manually increasing sysctl value of net.nf_conntrack_max in cloudnet servers (T257552)
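
(A minimal sketch of the manual sysctl bump, assuming the net.netfilter.nf_conntrack_max key; the value shown is illustrative, as the entry does not record the one actually used.)
 # check the current limit and how full the table is
 sysctl net.netfilter.nf_conntrack_max
 cat /proc/sys/net/netfilter/nf_conntrack_count
 # raise the limit on the running kernel (value is illustrative)
 sudo sysctl -w net.netfilter.nf_conntrack_max=524288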

2020-07-06

  • 15:16 arturo: installing 'aptitude' in all cloudvirts

2020-07-03

  • 12:51 arturo: [codfw1dev] galera cluster should be up and running, openstack happy (T256283)
  • 11:44 arturo: [codfw1dev] restoring glance database backup from bacula into cloudcontrol2001-dev (T256283)
  • 11:39 arturo: [codfw1dev] stopped mysql database in the galera cluster T256283
  • 11:36 arturo: [codfw1dev] dropped glance database in the galera cluster T256283

2020-07-02

  • 15:41 arturo: `sudo wmcs-openstack --os-compute-api-version 2.55 flavor create --private --vcpus 8 --disk 300 --ram 16384 --property aggregate_instance_extra_specs:ceph=true --description "for packaging envoy" bigdisk-ceph` (T256983)

2020-06-29

  • 14:24 arturo: starting rabbitmq-server in all 3 cloudcontrol servers
  • 14:23 arturo: stopping rabbitmq-server in all 3 cloudcontrol servers

2020-06-18

  • 20:38 andrewbogott: rebooting cloudservices2003-dev due to a mysterious 'host down' alert on a secondary ip

2020-06-16

  • 15:38 arturo: created by hand neutron port 9c0a9a13-e409-49de-9ba3-bc8ec4801dbf `paws-haproxy-vip` (T295217)

2020-06-12

  • 13:23 arturo: DNS zone `paws.wmcloud.org` transferred to the PAWS project (T195217)
  • 13:20 arturo: created DNS zone `paws.wmcloud.org` (T195217)

2020-06-11

  • 19:19 bstorm_: proceeding with failback to labstore1004 now that DRBD devices are consistent T224582
  • 17:22 bstorm_: delaying failback labstore1004 for drive syncs T224582
  • 17:17 bstorm_: failing NFS back to labstore1004 to complete the upgrade process T224582
  • 16:15 bstorm_: failing over NFS for labstore1004 to labstore1005 T224582

2020-06-10

  • 16:09 andrewbogott: deleting all old cloud-ns0.wikimedia.org and cloud-ns1.wikimedia.org ns records in designate database T254496

2020-06-09

  • 15:25 arturo: icinga downtime everything cloud* lab* for 2h more (T253780)
  • 14:09 andrewbogott: stopping puppet, all designate services and all pdns services on cloudservices1004 for T253780
  • 14:01 arturo: icinga downtime everything cloud* lab* for 2h (T253780)

2020-06-05

2020-06-04

  • 14:24 andrewbogott: disabling puppet on all instances for /labs/private recovery
  • 14:23 arturo: disabling puppet on all instances for /labs/private recovery

2020-05-28

  • 23:02 bd808: `/usr/local/sbin/maintain-dbusers --debug harvest-replicas` (T253930)
  • 13:36 andrewbogott: rebuilding cloudservices2002-dev with Buster
  • 00:33 andrewbogott: shutting down cloudservices2002-dev to see if we can live without it. This is in anticipation of rebuilding it entirely for T253780

2020-05-27

  • 23:29 andrewbogott: disabling the backup job on cloudbackup2001 (just like last week) so the backup doesn't start while Brooke is rebuilding labstore1004 tomorrow.
  • 06:03 bd808: `systemctl start mariadb` on clouddb1001 following reboot (take 2)
  • 05:58 bd808: `systemctl start mariadb` on clouddb1001 following reboot
  • 05:53 bd808: Hard reboot of clouddb1001 via Horizon. Console unresponsive.

2020-05-25

  • 16:35 arturo: [codfw1dev] created zone `0-29.57.15.185.in-addr.arpa.` (T247972)

2020-05-21

  • 19:23 andrewbogott: disabling puppet on cloudbackup2001 to prevent the backup job from starting during maintenance
  • 19:16 andrewbogott: systemctl disable block_sync-tools-project.service on cloudbackup2001.codfw.wmnet to avoid stepping on current upgrade
  • 15:48 andrewbogott: re-imaging cloudnet1003 with Buster

2020-05-19

  • 22:59 bd808: `apt-get install mariadb-client` on cloudcontrol1003
  • 21:12 bd808: Migrating wcdo.wcdo.eqiad.wmflabs to cloudvirt1023 (T251065)

2020-05-18

  • 21:37 andrewbogott: rebuilding cloudnet2003-dev with Buster

2020-05-15

  • 22:10 bd808: Added reedy as projectadmin in cloudinfra project (T249774)
  • 22:05 bd808: Added reedy as projectadmin in admin project (T249774)
  • 18:44 bstorm_: rebooting cloudvirt-wdqs1003 T252831
  • 15:47 bd808: Manually running wmcs-novastats-dnsleaks from cloudcontrol1003 (T252889)

2020-05-14

  • 23:28 bstorm_: downtimed cloudvirt1004/6 and cloudvirt-wdqs1003 until tomorrow around this time T252831
  • 22:21 bstorm_: upgrading qemu-system-x86 on cloudvirt1006 to backports version T252831
  • 22:15 bstorm_: changing /etc/libvirt/qemu.conf and restarting libvirtd on cloudvirt1006 T252831
  • 21:12 andrewbogott: rebuilding cloudvirt1003-wdqs as part of T252831
  • 15:47 andrewbogott: moving cloudvirt1004 and cloudvirt1006 to the 'ceph' aggregate for T252784
  • 15:02 andrewbogott: moving all of cloudvirt100[1-9] into the 'toobusy' host aggregate. These are slower, have spinning disks, and are due for replacement.

2020-05-12

  • 20:33 andrewbogott: moving cloudvirt1023 to the 'standard' pool and out of the 'spare' pool
  • 19:10 jeh: disable neutron-openvswitch-agent service on cloudvirt2001-dev.codfw T248881
  • 19:09 jeh: Shutdown the unused eno2 network interface on cloudvirt2001-dev.codfw to clear up monitoring errors T248425
  • 18:20 andrewbogott: moving cloudvirt1024 out of the 'maintenance' aggregate and into 'spare'
  • 16:45 andrewbogott: restarting neutron-l3-agent on cloudnet1004 so it knows about all three cloudcontrols. Leaving cloudnet1003 since restarting it there will cause network interruptions
  • 14:06 arturo: icinga downtime everything for 2h for Debian Buster migration in some cloud components

2020-05-09

  • 16:53 andrewbogott: rebuilding cloudcontrol2001-dev and 2003-dev with buster for T252121

2020-05-08

  • 19:02 bstorm_: moving tools-k8s-haproxy-2 from cloudvirt1021 to cloudvirt1017 to improve spread

2020-05-05

  • 13:58 andrewbogott: rebuilding cloudcontrol2004-dev to test new puppet changes

2020-05-04

  • 09:04 arturo: [codfw1dev] manually modify iptables ruleset to only allow SSH from WMF bastions on cloudservices2003-dev and cloudcontrol2004-dev (T251604)

2020-04-21

  • 22:12 andrewbogott: moving cloudvirt1004 out of the 'standard' aggregate and into the 'maintenance' aggregate
  • 16:01 jeh: restart cloudceph mon and osd services for openssl upgrades

2020-04-15

  • 18:44 jeh: create indexes and views for grwikimedia T245912

2020-04-13

  • 15:07 jeh: restart memcached on labwebs to increase cache size T145703

2020-04-09

  • 19:57 andrewbogott: upgrading eqiad1 designate to rocky
  • 16:52 andrewbogott: cleaned up a bunch of leaked .eqiad.wmflabs dns records

2020-04-08

  • 19:20 andrewbogott: rotated password and api token for pdns servers on cloudservices1003 and cloudservices1004
  • 14:54 arturo: `root@cloudcontrol1003:~# cp /etc/inputrc .inputrc` to solve some bash shortcut weirdness

2020-04-07

  • 20:57 andrewbogott: service sssd stop; rm -rf /var/lib/sss/db*; service sssd start on tools-sgebastion-08

2020-04-06

  • 22:39 andrewbogott: deleting bogus groups cn=b'project-bastion',ou=groups,dc=wikimedia,dc=org and cn=b'project-tools',ou=groups,dc=wikimedia,dc=org from ldap
  • 17:42 arturo: [codfw1dev] transferred DNS zone 57.15.185.in-addr.arpa. to the cloudinfra-codfw1dev project (T247972)
  • 17:39 arturo: [codfw1dev] `openstack zone create --email root@wmflabs.org --type PRIMARY --ttl 3600 --description "floating IPs subnet" 57.15.185.in-addr.arpa.` (T247972)
  • 16:23 arturo: restarting apache2 in cloudcontrol1003/1004 to pick up latest wmfkeystonehooks changes T249494

2020-04-02

  • 20:59 jeh: codfw1dev clear VM error states and start bastions, puppet master and database

2020-04-01

  • 16:27 arturo: [codfw1dev] enable puppet across the fleet to clean up the vxlan changes (T248881)

2020-03-31

  • 12:35 arturo: [codfw1dev] restarting VMs: designaterockytest14, bastion-codfw1dev-0[1,2] (T248881)
  • 12:34 arturo: [codfw1dev] installing neutron-openvswitch-agent on cloudvirt2001-dev (T248881)
  • 12:25 arturo: [codfw1dev] installing neutron-openvswitch-agent on cloudnet200[2,3]-dev (T248881)
  • 11:45 arturo: [codfw1dev] rebooting cloudvirt2003-dev to pick up latest kernel update. Otherwise modprobe is confused trying to load modules and openvswitch won't start (T248881)
  • 10:40 arturo: [codfw1dev] installing neutron-openvswitch-agent on cloudvirt2003-dev (T248881)
  • 10:09 arturo: [codfw1dev] reboot cloudnet2003-dev into linux 4.9 (was using 4.14 from a testing operation in 2020-03-10)

2020-03-30

2020-03-27

  • 21:28 bd808: Created huggle.wmcloud.org Designate zone and allocated it to the huggle project
  • 19:51 jeh: start haproxy on cloudcontrol2003-dev.wikimedia.org

2020-03-26

  • 15:01 arturo: icinga downtime cloudvirt* cloudcontrol* cloudnet* lab* cloudstore*
  • 15:01 andrewbogott: beginning openstack upgrade window for T242766
  • 12:32 arturo: [codfw1dev] downgraded systemd, libsystemd0, udev and friends to the non-backports versions (T247013)

2020-03-25

  • 19:29 andrewbogott: dumping a bunch of VMs on cloudvirt1015 to see if it still crashes
  • 17:56 jeh: add labweb1002 back into the pool - completed horizon testing T240852
  • 17:09 jeh: depool labweb1002 for horizon testing T240852

2020-03-24

  • 19:41 jeh: switch cloudvirt1016 from maintenance to standard host aggregate T243327
  • 15:31 andrewbogott: restarting nova-conductor and nova-api on cloudcontrol1003 and cloudcontrol1004

2020-03-23

  • 21:41 jeh: restart neutron-l3-agent on cloudnet100[3,4] to pickup policy.yaml changes
  • 13:28 jeh: disable puppet on labweb100[1,2] to enable horizon event traces T240852
  • 10:26 arturo: restarting apache in both labweb1001/labweb1002 upon reports of returning 500s

2020-03-21

  • 14:23 andrewbogott: restarting apache2 on labweb1001 and 1002

2020-03-18

  • 19:17 andrewbogott: deleted a bunch of records from the pdns database on cloudservices1003/1004 which had a record name but the content (where an IP address should be) was NULL, e.g. m.wikidata.beta.wmflabs.org.
  • 10:55 arturo: [codfw1dev] deleting BGP agent, undoing changes we did for T245606

2020-03-14

  • 17:40 jeh: restart maintain-dbusers on labstore1004 T247654

2020-03-13

2020-03-12

  • 22:29 bstorm_: running puppet across all dumps mounts to make sure active links are shifted to labstore1006

2020-03-11

2020-03-10

  • 17:02 arturo: [codfw1dev] deleting address scopes, bad interaction with our custom NAT setup T247135
  • 13:55 arturo: [codfw1dev] rebooting cloudnet2003-dev into linux kernel 4.14 for testing stuff related to T247135

2020-03-09

  • 18:09 arturo: enabling puppet in cloudvirt1006, all services have been restored
  • 17:59 arturo: deleted the neutron bridge on cloudvirt1006, for testing stuff related to the queens upgrade
  • 17:58 arturo: stopped neutron-linuxbridge-agent and nova-compute in cloudvirt1006 for testing stuff related to the queens upgrade

2020-03-06

  • 14:54 andrewbogott: draining all instances off of cloudvirt1006 for T246908

2020-03-05

  • 14:24 arturo: [codfw1dev] we just enabled BGP session between cloudnet2xxx-dev and cr1-codfw (T245606)
  • 13:07 arturo: [codfw1dev] move the extra IP address for BGP in cloudnet200x-dev servers from eno2.2120 to the br-external bridge device (T245606)
  • 13:06 arturo: [codfw1dev] upgrade neutron-dynamic-routing packages in cloudnet200X-dev and cloudcontrol200X-dev servers to 11.0.0-2~bpo9+1 (T245606)

2020-03-04

  • 22:22 andrewbogott: upgrading designate on cloudservices1003/1004 to Queens
  • 22:09 andrewbogott: moving cloudvirt1006 into the maintenance aggregate for T246908
  • 21:37 bd808: Running wmcs-wikireplica-dns to add service names for ngwikimedia.*.db.svc.eqiad.wmflabs (T240772)
  • 21:14 bd808: Running `sudo maintain-meta_p --all-databases --purge` on labsdb1009 (T246056)
  • 21:11 bd808: Running `sudo maintain-meta_p --all-databases --purge` on labsdb1010 (T246056)
  • 21:08 bd808: Running `sudo maintain-meta_p --all-databases --purge` on labsdb1011 (T246056)
  • 21:05 bd808: Running `sudo maintain-meta_p --all-databases --purge` on labsdb1002 (T246056)

2020-03-02

  • 16:54 arturo: [codfw1dev] deleted python3-os-ken debian package in cloudnet2003-dev which was installed by hand and had dependency issues

2020-02-29

  • 16:32 bstorm_: downtimed the smart alert on cloudvirt1009 until Monday since apparently predictive failures flap T244986

2020-02-26

  • 22:03 jeh: powering down cloudvirt1014 for hardware maintenance

2020-02-25

  • 16:08 andrewbogott: changing neutron's rabbitmq password because oslo is having trouble parsing some of the characters in the password
  • 15:26 andrewbogott: updated the cell_mapping record in the nova_api database to add the second rabbitmq server to the transport_url field
  • 15:26 andrewbogott: updated the cell_mapping record in the nova_api database to set the db uri to 'mysql+pymysql' -- this in response to a deprecation notice

2020-02-24

  • 12:16 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# neutron bgp-speaker-peer-add bgpspeaker cr2-codfw` (T245606)
  • 12:16 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# neutron bgp-speaker-peer-add bgpspeaker cr1-codfw` (T245606)
  • 12:09 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# neutron bgp-peer-create --peer-ip 208.80.153.187 --remote-as 65002 cr2-codfw` (T245606)
  • 12:09 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# neutron bgp-peer-create --peer-ip 208.80.153.186 --remote-as 65002 cr1-codfw` (T245606)
  • 12:06 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# neutron bgp-peer-delete 17b8c2a3-f0ce-4d50-a265-18ccac703c61` (T245606)
  • 10:59 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# neutron bgp-speaker-peer-add bgpspeaker bgppeer` (T245606)
  • 10:56 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# neutron bgp-peer-create --peer-ip 208.80.153.185 --remote-as 65002 bgppeer` (T245606)

2020-02-21

  • 12:48 arturo: [codfw1dev] running `root@cloudcontrol2001-dev:~# neutron bgp-speaker-network-add bgpspeaker wan-transport-codfw` (T245606)
  • 12:46 arturo: [codfw1dev] created bgpspeaker for AS64711 (T245606)
  • 12:42 arturo: [codfw1dev] run `sudo neutron-db-manage upgrade head` to upgrade the db schema for neutron bgp tables
  • 11:51 arturo: [codfw1dev] create a neutron subnet pool per each subnet objects we have and manually update DB to inter-associate them (T245606)
  • 11:49 arturo: [codfw1dev] rename neutron address scope `no-nat` to `bgp` (T245606)
  • 11:37 arturo: [codfw1dev] cleanup unused neutron subnet pools from previous address scope tests (T244851)

2020-02-20

  • 19:22 andrewbogott: updating designate pool config for https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/572213/
  • 15:33 andrewbogott: migrating all VMs on cloudvirt1014 to cloudvirt1022
  • 13:35 arturo: [codfw1dev] disable puppet in cloudcontrol servers to hack neutron.conf for tests related to T245606
  • 13:33 arturo: [codfw1dev] disable puppet in cloudnet servers to hack neutron.conf for tests related to T245606

2020-02-18

  • 22:19 andrewbogott: transferred the tools.wmcloud.org. zone to the tools project
  • 22:16 andrewbogott: moved wmcloud.org dns domain to the cloud-infra project
  • 21:02 andrewbogott: adding .eqiad1.wikimedia.cloud records to all existing eqiad1 VMs, updating all eqiad1 internal pointer records to reference the new eqiad1.wikimedia.cloud fqdns.
  • 09:44 arturo: deleted DNS zone wmcloud.org and try re-creating it

2020-02-14

  • 10:35 arturo: running `root@cloudcontrol2001-dev:~# designate server-create --name ns1.openstack.codfw1dev.wikimediacloud.org.` (T243766)
  • 10:32 arturo: running `root@cloudcontrol1004:~# designate server-create --name ns1.openstack.eqiad1.wikimediacloud.org.` (T243766)
  • 10:32 arturo: running `root@cloudcontrol1004:~# designate server-create --name ns0.openstack.eqiad1.wikimediacloud.org.` (T243766)

2020-02-12

  • 13:38 arturo: [codfw1dev] add reference to subnetpool to the instance subnet `MariaDB [neutron]> update subnets set subnetpool_id='d129650d-d4be-4fe1-b13e-6edb5565cb4a' where id = '7adfcebe-b3d0-4315-92fe-e8365cc80668';` (T244851)

2020-02-11

  • 13:46 arturo: [codfw1dev] creating some neutron objects to investigate T244851 (subnets, subnet pools, address scopes, ...)
  • 12:40 arturo: [codfw1dev] delete unknown address scope 'wmcs-v4-scope': `root@cloudcontrol2001-dev:~# openstack address scope delete 078cfd71-117b-4aac-9197-6ebbbb7dd3de` (T244851)
  • 12:40 arturo: [codfw1dev] delete unknown subnet pool 'cloudinstancesb-v4-pool0': `root@cloudcontrol2001-dev:~# openstack subnet pool delete d23a9b88-5c3d-4a53-ab88-053233a75365` (T244851)

2020-02-07

  • 18:11 jeh: shutdown cloudvirt1016 for hardware maintenance T241882

2020-02-06

  • 14:44 jeh: update apt packages on cloudvirt1015 T220853
  • 14:28 jeh: run hardware tests on cloudvirt1015 T220853

2020-01-28

  • 17:24 arturo: [codfw1dev] root@cloudcontrol2001-dev:~# designate server-create --name ns0.openstack.codfw1dev.wikimediacloud.org. (T243766)
  • 10:18 arturo: [codfw1dev] created DNS record `bastion-codfw1dev-01.codfw1dev.wmcloud.org A 185.15.57.2` (T242976, T229441)
  • 10:13 arturo: [codfw1dev] the zone `codfw1dev.wmcloud.org` belongs now to the `cloudinfra-codfw1dev` project (T242976)
  • 10:11 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# openstack zone create --description "main DNS domain for public addresses" --email "root@wmflabs.org" --type PRIMARY --ttl 3600 codfw1dev.wmcloud.org.` (T242976 and T243766)
  • 09:53 arturo: restart apache2 in labweb1001/1002 because of horizon errors
  • 09:47 arturo: created DNS zone wmcloud.org in eqiad1 and transferred it to the cloudinfra project (T242976); right now its only use is to delegate the codfw1dev.wmcloud.org subdomain to designate in the other deployment

2020-01-27

  • 12:45 arturo: [codfw1dev] manually move the new domain to the `cloudinfra-codfw1dev` project clouddb2001-dev: `[designate]> update zones set tenant_id='cloudinfra-codfw1dev' where id = '4c75410017904858a5839de93c9e8b3d';` T243556
  • 12:44 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# openstack zone create --description "main DNS domain for VMs" --email "root@wmflabs.org" --type PRIMARY --ttl 3600 codfw1dev.wikimedia.cloud.` T243556

2020-01-24

  • 15:10 jeh: remove icinga downtime for cloudvirt1013 T241313
  • 12:52 arturo: repooling cloudvirt1013 after HW got fixed (T241313)

2020-01-21

  • 17:43 bstorm_: remounting /mnt/nfs/dumps-labstore1007.wikimedia.org/ on all dumps-mounting projects
  • 10:24 arturo: running `sudo systemctl restart apache2.service` in both labweb servers to try mitigating T240852

2020-01-15

  • 16:59 bd808: Changed the config for cloud-announce mailing list so that list admins do not get bounce unsubscribe notices

2020-01-14

  • 14:03 arturo: icinga downtime all cloudvirts for another 2h for fixing some icinga checks
  • 12:04 arturo: icinga downtime toolchecker for 2 hours for openstack upgrades T241347
  • 12:02 arturo: icinga downtime cloud* labs* hosts for 2 hours for openstack upgrades T241347
  • 04:26 andrewbogott: upgrading designate on cloudservices1003/1004

2020-01-13

  • 13:34 arturo: [codfw1dev] prevent neutron from allocating floating IPs from the wrong subnet by doing `neutron subnet-update --allocation-pool start=208.80.153.190,end=208.80.153.190 cloud-instances-transport1-b-codfw` (T242594)

2020-01-10

  • 13:27 arturo: cloudvirt1009: virsh undefine i-000069b6. This is tools-elastic-01 which is running on cloudvirt1008 (so, leaked on cloudvirt1009)

2020-01-09

  • 11:12 arturo: running `MariaDB [nova_eqiad1]> update quota_usages set in_use='0' where project_id='etytree';` (T242332)
  • 11:11 arturo: running `MariaDB [nova_eqiad1]> select * from quota_usages where project_id = 'etytree';` (T242332)
  • 10:32 arturo: ran `root@cloudcontrol1004:~# nova-manage project quota_usage_refresh --project etytree`

2020-01-08

  • 10:53 arturo: icinga downtime all cloudvirts for 30 minutes to re-create all canary VMs

2020-01-07

  • 11:12 arturo: icinga-downtime everything cloud* for 30 minutes to merge nova scheduler changes
  • 10:02 arturo: icinga downtime cloudvirt1009 for 30 minutes to re-create canary VM (T242078)

2020-01-06

  • 13:45 andrewbogott: restarting nova-api and nova-conductor on cloudcontrol1003 and 1004

2020-01-04

  • 16:34 arturo: icinga downtime cloudvirt1024 for 2 months because of hardware errors (T241884)

2019-12-31

  • 11:46 andrewbogott: I couldn't!
  • 11:40 andrewbogott: restarting cloudservices2002-dev to see if I can reproduce an issue I saw earlier

2019-12-25

2019-12-24

  • 15:13 arturo: icinga downtime all the lab* fleet for nova password change for 1h
  • 14:39 arturo: icinga downtime all the cloud* fleet for nova password change for 1h

2019-12-23

  • 11:13 arturo: enable puppet in cloudcontrol1003/1004
  • 10:40 arturo: disable puppet in cloudcontrol1003/1004 while doing changes related to python-ldap

2019-12-22

  • 23:48 andrewbogott: restarting nova-conductor and nova-api on cloudcontrol1003 and 1004
  • 09:45 arturo: cloudvirt1013 is back (it came back on its own) T241313
  • 09:37 arturo: cloudvirt1013 is down for good. Apparently powered off. I can't even reach it via iLO

2019-12-20

  • 12:43 arturo: icinga downtime cloudmetrics1001 for 128 hours

2019-12-18

  • 12:55 arturo: [codfw1dev] created a new subnet neutron object to hold the new CIDR for floating IPs (cloud-codfw1dev-floating - 185.15.57.0/29) T239347

2019-12-17

  • 07:21 andrewbogott: deploying horizon/train to labweb1001/1002

2019-12-12

  • 06:11 arturo: schedule 4h downtime for labstores
  • 05:57 arturo: schedule 4h downtime for cloudvirts and other openstack components due to upgrade ops

2019-12-02

  • 06:28 andrewbogott: running nova-manage db sync on eqiad1
  • 06:27 andrewbogott: running nova-manage cell_v2 map_cell0 on eqiad1

2019-11-21

  • 16:07 jeh: created replica indexes and views for szywiki T237373
  • 15:48 jeh: creating replica indexes and views for shywiktionary T238115
  • 15:48 jeh: creating replica indexes and views for gcrwiki T238114
  • 15:46 jeh: creating replica indexes and views for minwiktionary T238522
  • 15:36 jeh: creating replica indexes and views for gewikimedia T236404

2019-11-18

  • 19:27 andrewbogott: repooling labsdb1011
  • 18:54 andrewbogott: running maintain-views --all-databases --replace-all --clean on labsdb1011 T238480
  • 18:44 andrewbogott: depooling labsdb1011 and killing remaining user queries T238480
  • 18:42 andrewbogott: repooled labsdb1009 and 1010 T238480
  • 18:19 andrewbogott: running maintain-views --all-databases --replace-all --clean on labsdb1010 T238480
  • 18:18 andrewbogott: depooling labsdb1010, killing remaining user queries
  • 17:46 andrewbogott: running maintain-views --all-databases --replace-all --clean on labsdb1009 T238480
  • 17:38 andrewbogott: depooling labsdb1009, killing remaining user queries
  • 16:54 andrewbogott: running maintain-views --all-databases --replace-all --clean on labsdb1012 T237509

2019-11-15

  • 20:04 andrewbogott: repool labsdb1011 (T237509)
  • 19:29 andrewbogott: running maintain-views --all-databases --replace-all --clean on labsdb1011
  • 19:25 andrewbogott: depooling labsdb1011, killing remaining queries
  • 19:25 andrewbogott: repooling labsdb1010
  • 18:59 andrewbogott: running maintain-views --all-databases --replace-all --clean on labsdb1012
  • 18:57 andrewbogott: running maintain-views --all-databases --replace-all --clean on labsdb1010
  • 18:54 andrewbogott: depooling labsdb1010, killing remaining user queries
  • 18:54 andrewbogott: depooled labsdb1009, ran maintain-views --clean --all-databases --replace-all, repooled

2019-11-11

  • 13:10 arturo: cloudweb2001-dev: disable puppet and redirect stderr in the loadExitNodes.php cron script to prevent cronspam while we investigate the cause of the issue (T237971)

2019-11-05

  • 11:59 arturo: icinga downtime for 1h cloudcontrol1004, cloudnet1003, cloudvirt1017/1020/1022 for PDU operations in the rack T227542

2019-11-04

  • 21:55 andrewbogott: deleting a ton of wikitech hiera pages that were either no-ops or refer to nonexistent VMs or prefixes

2019-10-31

  • 11:01 arturo: icinga-downtimed cloudvirt1030 and cloudservices1003 for 1h due to PDU upgrade operations T227543

2019-10-30

  • 22:43 jeh: reboot cloud-bootstrapvz-stretch to resolve bad bootstrapvz build

2019-10-29

  • 10:52 arturo: icinga downtime cloudvirt1001/1002/1024/1018/1012/1009/1015/1008 for 1h T227538

2019-10-25

  • 10:45 arturo: icinga downtime toolschecker for 1 to upgrade clouddb1002 mariadb (toolsdb secondary) (T236384, T236420)

2019-10-24

  • 12:30 arturo: starting cloudvirt1019, PDU operations ended (T227540)
  • 11:58 arturo: icinga downtime for 2h (T227540) cloudvirt1019
  • 11:15 arturo: poweroff cloudvirt1019 during the PDU operations (T227540)
  • 11:10 arturo: icinga downtime for 2h (T227540) toolschecker
  • 10:58 arturo: icinga downtime for 1h (T227540) cloudvirt100[3-7], cloudvirt1019, cloudvirt1016, cloudvirt1021, cloudvirt1013, cloudnet1004

2019-10-23

  • 09:23 arturo: cloudvirt1026 reboot ended OK
  • 09:12 arturo: rebooting cloudvirt1026 for kernel upgrade
  • 09:09 arturo: cloudvirt1025 reboot ended OK
  • 09:00 arturo: rebooting cloudvirt1025 for kernel upgrade
  • 08:51 arturo: icinga downtime cloudvirt1025/1026 for reboots

2019-10-18

  • 16:01 arturo: created the `eqiad1.wikimedia.cloud` DNS zone (T235846)
  • 14:27 andrewbogott: deleted a bunch of leaked VMS from earlier today from the admin-monitoring project. Fullstack leaks due to an api outage, maybe?
  • 10:44 arturo: double max_message_size from 40KB to 80KB in the cloud-admin mailing list. A simple email with a couple of quotes can go over the 40KB limit.

2019-10-16

  • 21:59 jeh: resync wiki replica tool and user accounts T235697
  • 09:40 arturo: reboot of cloudvirt1030 went fine
  • 09:28 arturo: reboot of cloudvirt1029 went fine
  • 09:28 arturo: rebooting cloudvirt1030 for kernel updates
  • 09:12 arturo: rebooting cloudvirt1029 for kernel updates
  • 09:11 arturo: reboot of cloudvirt1028 went fine
  • 09:00 arturo: rebooting cloudvirt1028 for kernel updates
  • 08:56 arturo: icinga downtime cloudvirt[1028-1030].eqiad.wmnet for 1h for reboots

2019-10-15

  • 13:30 jeh: creating indexes and views for banwiki T234770

2019-10-10

  • 18:55 bd808: Created indexes and views for nqowiki (T230543)
  • 11:59 arturo: network switch hardware is down, affecting cloudvirt1025/1026 (T227536); VMs are supposed to be online but unreachable

2019-10-09

  • 10:44 arturo: cloudvirt1013 rebooted well
  • 10:32 arturo: cloudvirt1013 is rebooting
  • 10:32 arturo: cloudvirt1012 rebooted just fine (very slow, 35 VMs)
  • 10:21 arturo: cloudvirt1012 is rebooting
  • 10:19 arturo: cloudvirt1009 rebooted just fine (very slow though)
  • 10:07 arturo: cloudvirt1009 is rebooting
  • 10:06 arturo: cloudvirt1008 rebooted just fine (very slow though)
  • 09:58 arturo: cloudvirt1008 is rebooting
  • 09:52 arturo: icinga downtime toolschecker, paws, etc. for 2h because of cloudvirt reboots

2019-10-07

  • 14:07 arturo: horizon is disabled for maintenance (T212302)
  • 14:00 arturo: starting scheduled maintenance: upgrading eqiad1 from openstack mitaka to newton

2019-10-02

  • 15:23 arturo: codfw1dev renaming net/subnet objects to a more modern naming scheme T233665
  • 12:49 arturo: codfw1dev deleting all floating IP allocations in the deployment in preparation for mangling the network config for testing T233665
  • 12:47 arturo: codfw1dev deleting all VMs in the deployment in preparation for mangling the network config for testing T233665
  • 11:08 arturo: codfw1dev rebooting cloudnet2002-dev and cloudnet2003-dev for testing T233665
  • 10:31 arturo: codfw1dev: add cloudinstances2b-gw router to the l3 agent in cloudnet2003-dev
  • 09:59 arturo: codfw1dev: cleanup leftover "HA port tenant admin" in neutron (ports from missing servers)
  • 09:46 arturo: codfw1dev: cleanup leftover neutron agents

2019-09-30

  • 10:21 arturo: we installed ferm on every VM by mistake. Removing it and forcing a puppet agent run to get back to a clean state (see the sketch after this list).
  • 09:38 arturo: downtime toolschecker for 24h
  • 09:33 arturo: force update ferm cloud-wide (in all VMs) for T153468
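
A rough sketch of the cloud-wide cleanup described at 10:21 above, assuming a cumin run; the host query is a placeholder and the exact commands actually used may have differed:

  # hypothetical cumin run; <all-vms> stands for whatever query matches every Cloud VPS instance
  sudo cumin '<all-vms>' 'apt-get -y purge ferm'
  sudo cumin '<all-vms>' 'puppet agent --test'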

2019-08-18

  • 10:39 arturo: rebooting cloudvirt1023 for new interface names configuration
  • 10:34 arturo: downtimed cloudvirt1023 for 2 days

2019-08-05

  • 17:17 bd808: Set downtime on gridengine and kubernetes webservice checks in icinga until 2019-09-02 (flaky tests)

2019-07-29

  • 20:14 bd808: Restarted maintain-kubeusers on tools-k8s-master-01 (T194859)

2019-07-25

  • 12:32 arturo: eqiad1/glance: debian-9.9-stretch image deprecates debian-9.8-stretch (T228983)
  • 09:59 arturo: (codfw1dev) drop missing glance images (T228972)
  • 09:32 arturo: (codfw1dev) deleting a bunch of VMs that were running in now missing hypervisors
  • 09:31 arturo: (codfw1dev) deleting a bunch of VMs in ERROR and SHUTDOWN state
  • 09:27 arturo: last log entry refers to the codfw1dev deployment
  • 09:27 arturo: cleanup `nova service-list` from old hypervisors (labtest*)
  • 09:23 arturo: refreshed nova DB grants in clouddb2001-dev for the codfw1dev deployment
  • 08:47 arturo: cleanup the cloud-announce pending emails (spam)

2019-07-23

  • 19:43 andrewbogott: restarting rabbitmq-server on cloudcontrol1003 and 1004

2019-07-22

  • 23:44 bd808: Restarted maintain-kubeusers on tools-k8s-master-01 (T228529)

2019-07-11

  • 22:07 bd808: Ran `sudo systemctl stop designate_floating_ip_ptr_records_updater.service` on cloudcontrol1003
  • 22:01 bd808: `sudo apt-get install python2.7-dbg` on cloudcontrol1003 to debug hung python process
  • 21:48 bd808: Ran `sudo systemctl stop designate_floating_ip_ptr_records_updater.service` on cloudcontrol1004

2019-06-25

  • 16:05 bstorm_: updated python3.4 to update4 wherever it was installed on Jessie VMs, to prevent issues with the broken update3.
  • 14:56 bstorm_: Updated python 3.4 on the labs-puppetmaster server

2019-06-03

  • 15:55 arturo: T221769 rebooting cloudservices1003 after bootstrapping is apparently completed

2019-05-28

  • 21:42 bstorm_: unmounting labstore1003-scratch on all cloud clients
  • 18:14 bstorm_: T209527 switched mounts from labstore1003 to cloudstore1008 for scratch

2019-05-20

  • 17:25 arturo: T223923 dropped compat-network config from /etc/network/interfaces in eqiad1/codfw1dev neutron nodes
  • 17:22 arturo: T223923 dropped br-compat bridges and vlan interfaces (1102 and 2102) in eqiad1/codfw1dev neutron nodes
  • 17:07 arturo: T223923 dropped compat-network configuration from the neutron database in eqiad1
  • 16:55 arturo: T223923 dropped compat-network configuration from the neutron database in codfw1dev

2019-05-15

  • 17:00 andrewbogott: touching /root/firstboot_done on all VMs that cumin can reach. This will prevent firstboot.sh from running a second time if/when any of these are rebooted. T223370
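
A minimal sketch of the step above, again assuming cumin; the host query is a placeholder:

  # create the marker file so firstboot.sh does not run a second time on reboot
  sudo cumin '<all-reachable-vms>' 'touch /root/firstboot_done'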

2019-04-26

  • 15:51 arturo: andrew updated dns servers for the cloud-instances2-b-eqiad subnet in neutron: 208.80.154.143 and 208.80.154.24

2019-04-25

  • 11:14 arturo: T221760 increased size of conntrack table

2019-04-24

  • 12:54 arturo: T220051 puppet broken in every VM in Cloud VPS, fixing right now

2019-04-22

  • 11:14 arturo: create by hand /var/cache/labsaliaser/labs-ip-aliases.json in cloudservices2002-dev (T218575)

2019-04-16

  • 22:55 bd808: cloudcontrol2003-dev: added `exit 0` to /etc/cron.hourly/keystone to stop cron spam on partially configured cluster
  • 12:08 arturo: rebooting cloudvirt200[123]-dev because of deep config changes
  • 11:27 arturo: T219626 add DB grants for neutron and glance to clouddb2001-dev (codfw1dev)
  • 10:37 arturo: T219626 replace 208.80.153.75 with 208.80.153.59 in the clouddb2001-dev database (codfw1dev deployment)
  • 10:30 arturo: T219626 replace labtestcontrol2003 with cloudcontrol2001-dev in the clouddb2001-dev database (codfw1dev deployment)

2019-04-15

  • 13:08 arturo: T219626 add DB grants for keystone/nova/nova_api to clouddb2001-dev (codfw1dev)

2019-04-13

  • 18:25 bd808: Restarted nova-compute service on cloudvirt1015 (T220853)

2019-04-11

  • 12:00 arturo: T151704 deploying oidentd to cloudnet1xxx servers

2019-04-02

  • 19:52 andrewbogott: installed a new base Stretch image with updated packages; it runs apt-get dist-upgrade on first boot.

2019-03-29

  • 14:34 andrewbogott: moving tools-static.wmflabs.org to point to tools-static-13 in eqiad1-r
  • 00:00 bstorm_: T193264 Added osm.db.svc.eqiad.wmflabs to cloud DNS

2019-03-25

  • 00:40 bd808: Restarted maintain-dbusers on labstore1004. Process hung up on failed LDAP connection.

2019-03-21

  • 19:32 andrewbogott: restarting keystone on cloudcontrol1003

2019-03-15

  • 16:00 gtirloni: increased nscd cache size (T217280)

2019-03-14

  • 19:04 gtirloni: bstorm started nfsd on labstore1006 (T218341)
  • 16:42 gtirloni: published new debian-9.8 image (T218314)
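
A hedged sketch of how a base image like this could be published to Glance; the file name and properties are illustrative, and the actual build/upload tooling may have differed:

  # upload a new Stretch base image to Glance (file name is a placeholder)
  openstack image create --file debian-9.8-stretch.qcow2 \
      --disk-format qcow2 --container-format bare --public debian-9.8-stretch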

2019-03-04

  • 19:37 bstorm_: unmounted /mnt/nfs/dumps-labstore1006.wikimedia.org across all VPS projects for T217473

2019-02-26

  • 12:46 gtirloni: shutdown toolsbeta-sgegrid-master (cronspam)

2019-02-25

  • 10:32 gtirloni: restarted nfsd on labstore1004

2019-02-21

  • 09:09 gtirloni: restarted uwsgi-labspuppetbackend.service on labpuppetmaster1001
  • 07:42 gtirloni: created project cloudstore
  • 07:36 gtirloni: deleted wmcs-nfs project

2019-02-20

  • 21:58 andrewbogott: silencing shinken and disabling puppet on shinken-02 for now

2019-02-19

  • 12:00 gtirloni: added nagios@icinga2001.wikimedia.org to cloud-admin-feed@ allowed senders

2019-02-18

  • 20:21 gtirloni: downtimed cloudvirt1020
  • 20:12 gtirloni: ran `labs-ip-alias-dump.py` on cloudservices/labservices servers

2019-02-15

  • 13:10 arturo: T216239 labvirt1019 has been drained
  • 12:22 arturo: T216239 draining labvirt1009 with a command like this: `root@cloudcontrol1004:~# wmcs-cold-migrate --region eqiad --nova-db nova 2c0cf363-c7c3-42ad-94bd-e586f2492321 labvirt1001`
  • 12:02 arturo: more nova service cleanups in the database (labvirts that were reallocated to eqiad1)
  • 11:34 arturo: T216190 cleanup from nova database `nova service-delete 35`
  • 03:50 andrewbogott: updated VPS base images for Jessie and Stretch, now featuring Stretch 9.7

2019-02-11

  • 18:13 gtirloni: cleaned old metrics data in labmon1001 T215417
  • 15:28 gtirloni: running `maintain-views --all-databases --replace-all` on labsdb1011
  • 14:18 gtirloni: running `maintain-views --all-databases --replace-all` on labsdb1010

2019-02-08

  • 14:56 gtirloni: running `maintain-views --all-databases --replace-all` on labsdb1009

2019-02-06

  • 11:47 gtirloni: downtimed labmon100{1,2} T215399
  • 00:17 bstorm_: T214106 deleted bstorm-test2 project to clean up

2019-02-05

  • 10:48 arturo: labmon1001 is now part of the 'eqiad1-r' region

2019-02-01

  • 09:54 arturo: moving canary1015-01 VM instance from cloudvirt1024 back to cloudvirt1015

2019-01-31

  • 12:44 arturo: T215012 depooling cloudvirt1015 and migrating all VMs to cloudvirt1024

2019-01-25

  • 20:11 gtirloni: deleted project yandex-proxy T212306
  • 20:11 gtirloni: deleted project T212306

2019-01-24

  • 11:50 arturo: T213925 modify subnet cloud-instances-transport1-b-eqiad1 to avoid floating IP allocations from here
  • 11:07 arturo: T214299 failover cloudnet1003 to cloudnet1004
  • 10:03 arturo: T214299 reimage cloudnet1004 to debian stretch
  • 09:51 arturo: T214299 failover cloudnet1004 to cloudnet1003

2019-01-22

  • 19:19 arturo: T214299 stretch cloudnet1003 is apparently all set
  • 18:40 arturo: T214299 manually deleted the neutron agents from cloudnet1003 (they must be added again after reimage, with new uuids)
  • 18:37 arturo: T214299 reimaging cloudnet1003 as debian stretch
  • 17:35 jbond42: starting roll out of apt package updates to
  • 14:41 gtirloni: T214369 deployed new jessie and stretch VM images

2019-01-21

  • 18:29 gtirloni: installed libguestfs-tools on cloudvirt1021

2019-01-16

  • 14:21 andrewbogott: stopping old VPS proxies in eqiad (T213540)

2019-01-15

  • 14:20 andrewbogott: changing tools.wmflabs.org to point to tools-proxy-03 in eqiad1

2019-01-13

  • 20:00 andrewbogott: VPS proxies are now running in eqiad1 on proxy-01. Old VMs will wait a bit for deletion. T213540
  • 19:12 andrewbogott: moving the VPS proxy API backend to proxy-01.project-proxy.eqiad.wmflabs, as per T213540
  • 17:11 andrewbogott: moving all VPS dynamic proxies to proxy-eqiad1.wmflabs.org aka proxy-01.project-proxy.eqiad.wmflabs, as per T213540

2019-01-09

  • 22:21 bd808: neutron quota-update --tenant-id tools --port 256

2019-01-08

  • 18:59 bd808: Definitely did NOT delete uid=novaadmin,ou=people,dc=wikimedia,dc=org
  • 18:59 bd808: Deleted LDAP user uid=neutron,ou=people,dc=wikimedia,dc=org
  • 18:58 bd808: Deleted LDAP user uid=novaadmin,ou=people,dc=wikimedia,dc=org

2019-01-06

  • 22:03 bd808: Set floatingip quota of 60 for tools project in eqiad1-r region (T212360)
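
A sketch of the quota change above, modeled on the neutron quota-update invocation recorded elsewhere in this log; --floatingip is assumed to be the relevant quota key:

  # raise the floating IP quota for the tools project to 60
  neutron quota-update --tenant-id tools --floatingip 60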

2018-12-20

  • 17:10 arturo: T207663 renumbered transport network in eqiad1

2018-12-05

  • 17:59 arturo: T207663 changed labtestn transport network addressing from private to public

2018-12-03

  • 13:25 arturo: T202886 create again PTR records after dnsleak.py fix

2018-11-30

  • 14:08 arturo: running dns leaks cleanup `root@cloudcontrol1003:~# /root/novastats/dnsleaks.py --delete`

2018-11-28

  • 17:33 gtirloni: deleted contintcloud project (T209644)

2018-11-27

  • 13:32 gtirloni: enabled DRBD stats collection on labstore100[4-5] T208446

2018-11-22

  • 07:12 gtirloni: deployed new debian-9.6-stretch image

2018-11-21

  • 10:48 arturo: re-created compat-net as not shared in labtestn to test stuff related to T209954

2018-11-16

  • 12:43 gtirloni: armed keyholder on labpuppetmaster1001/1002 after reboots
  • 12:08 gtirloni: rebooted labpuppetmaster1001 (T207377)
  • 11:57 gtirloni: rebooted labpuppetmaster1002 (T207377)

2018-11-14

  • 17:19 gtirloni: added cloudvirt1016 to scheduler pool (T209426)
  • 15:41 gtirloni: reimaging labvirt1016 as cloudvirt1016
  • 15:14 gtirloni: reset-failed systemd unit nova-scheduler on cloudcontrol1004
  • 13:52 gtirloni: rebooted labservices1002 after package upgrades (T207377)
  • 13:23 gtirloni: rebooted labstore2004 after package upgrades (T207377)
  • 13:20 gtirloni: rebooted labstore2003 after package upgrades (T207377)
  • 13:20 gtirloni: rebooted labstore2001/labstore2003 after package upgrades (T207377)
  • 12:08 gtirloni: rebooted labnet1002 after package upgrades
  • 12:01 gtirloni: rebooted labmon1002 after package upgrades
  • 11:41 gtirloni: rebooted labcontrol1002 after package upgrades
  • 11:15 gtirloni: rebooted cloudcontrol1004 after package upgrades

2018-11-09

  • 18:17 gtirloni: restarted neutron-linuxbridge-agent on cloudvirt1018/1023

2018-11-08

  • 11:00 gtirloni: Added novaproxy-02 to $CACHES
  • 10:50 gtirloni: Added cloudvirt1017 to eqiad1 region

2018-11-07

  • 13:49 arturo: T208733 moving labvirt1017 from main deployment to eqiad1 and renaming it to cloudvirt1017

2018-10-22

  • 16:24 arturo: T206261 another update to dmz_cidr in eqiad1
  • 10:26 arturo: change again in dmz_cidr in eqiad1: VMs will connect between them without NAT even when using floating IPs (T206261)

2018-10-19

  • 12:02 arturo: revert change in dmz_cidr in eqiad1 for now (T206261)
  • 11:16 arturo: change in dmz_cidr in eqiad1: VMs will connect between them without NAT even when using floating IPs (T206261)
  • 10:14 arturo: we have new virt servers in the eqiad1 deployment since last week and this week: cloudvirt1018, cloudvirt1023, cloudvirt1024

2018-09-26

  • 10:40 arturo: T205524 all sorts of restarts in all neutron daemons
  • 10:20 arturo: T205524 stop/start all neutron agents in cloudnet1003.eqiad.wmnet
  • 10:13 arturo: T205524 restart all agents in cloudnet1004.eqiad.wmnet
  • 10:10 arturo: restart neutron-server in cloudcontrol1003, investigating T205524

2018-09-24

  • 10:57 arturo: trying to increase the floating IP allocation pool in eqiad1. Of 185.15.56.0/25 we are using only 185.15.56.10-185.15.56.31; I don't know why. Let's use 185.15.56.2-185.15.56.126 (see the sketch below)
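
A sketch of the pool change, following the neutron subnet-update syntax used elsewhere in this log; the subnet ID is a placeholder:

  # widen the floating IP allocation pool (subnet ID is a placeholder)
  neutron subnet-update --allocation-pool start=185.15.56.2,end=185.15.56.126 <subnet-id>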

2018-09-21

  • 17:18 bd808: Running `sudo maintain-meta_p --all-databases --purge` across labsdb10(09|10|11) for T201890

2018-09-17

  • 22:08 bd808: Granted gtirloni project roles of admin, projectadmin, and user

2018-09-12

  • 11:20 arturo: T202636 distributing default routes using classless-static-route for all VMs in main/labtest (dnsmasq/nova-network)

2018-09-11

  • 16:52 arturo: again, restarted nova-network after killing all dnsmasq procs in labnet1001 for T202636
  • 16:08 arturo: restarted nova-network after killing all dnsmasq procs in labnet1001 for T202636
  • 10:53 arturo: T202636 creating all the compat-network configuration in neutron
  • 10:36 arturo: T202636 creating br-compat bridge in eqiad1 for the compat network
  • 10:33 arturo: T202636 manually reserve 10.68.23.253 (in nova-network)

2018-09-10

  • 22:46 andrewbogott: deleting all VMs on labvirt1019 and 1020 as prep for T204003
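
A hedged sketch of clearing out a hypervisor, assuming admin credentials and the OpenStack CLI; the deletions may actually have been done differently:

  # list every instance scheduled on labvirt1019 and delete it
  openstack server list --all-projects --host labvirt1019 -f value -c ID \
      | xargs -r -n1 openstack server delete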

2018-08-30

  • 15:46 andrewbogott: restarting rabbitmq-server on cloudcontrol1003
  • 13:07 arturo: T202636 internal network routing now exists in labtest/labtestn for VM to communicate with each other

2018-08-28

  • 11:04 arturo: T202549 eqiad1 databases are all now running in m5-master. Mysql has been cleaned from cloudcontrol100[3,4]

2018-08-23

  • 16:17 arturo: T188589 bstorm_ merged patch to reduce nova DB connection usage
  • 13:15 arturo: T202115 `root@cloudcontrol1003:~# neutron subnet-update --allocation-pool start=10.64.22.4,end=10.64.22.4 e4fb2771-a361-4add-ac4e-280cc300c59f`
  • 13:10 arturo: T202115 (was `{"start": "10.64.22.2", "end": "10.64.22.254"}` )
  • 13:08 arturo: T202115 `root@cloudcontrol1003:~# neutron subnet-update --allocation-pool start=10.64.22.254,end=10.64.22.254 e4fb2771-a361-4add-ac4e-280cc300c59f`

2018-08-22

  • 15:28 arturo: cleanup local glance,keystone databases in cloudcontrol1003.wikimedia.org (already in m5-master)
  • 15:27 arturo: cleanup local keystone database in cloudcontrol1003.wikimedia.org (already in m5-master)

2018-08-21

  • 15:39 andrewbogott: initial test message
  • 10:31 arturo: eqiad1 remove leftover port for HA on labnet1004
  • 10:15 arturo: test

2018-05-07

  • 18:07 bstorm_: stopped the toolhistory job because it is totally broken and fills /tmp.

2018-02-09

  • 00:55 bd808: Added Arturo Borrero Gonzalez and Bstorm as project members
  • 00:54 bd808: Removed Yuvipanda at user request (T186289)