
Nova Resource:Admin/SAL

Revision as of 16:20, 25 October 2020 by imported>Stashbot (andrewbogott: adding cloudvirt1038 to the 'ceph' aggregate and removing from the 'spare' aggregate. We need this space while waiting on network upgrades for empty cloudvirts (T216195))

2020-10-25

  • 16:20 andrewbogott: adding cloudvirt1038 to the 'ceph' aggregate and removing from the 'spare' aggregate. We need this space while waiting on network upgrades for empty cloudvirts (T216195)

2020-10-23

  • 11:30 arturo: [codfw1dev] openstack --os-project-id cloudinfra-codfw1dev recordset create --type PTR --record nat.cloudgw.codfw1dev.wikimediacloud.org. --description "created by hand" 0-29.57.15.185.in-addr.arpa. 1.0-29.57.15.185.in-addr.arpa. (T261724)
  • 10:09 arturo: [codfw1dev] doing DNS changes for the cloudgw PoC, including designate and https://gerrit.wikimedia.org/r/c/operations/dns/+/635965 (T261724)
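
The recordset above lives in a classless (RFC 2317) reverse zone, 0-29.57.15.185.in-addr.arpa., into which the /29's canonical reverse names are delegated. The canonical in-addr.arpa name for an address is just its octets reversed; a minimal sketch of deriving it (IP taken from the range above):

```shell
ip="185.15.57.1"
# Reverse the four octets to form the standard in-addr.arpa name.
echo "$ip" | awk -F. '{print $4"."$3"."$2"."$1".in-addr.arpa."}'
# → 1.57.15.185.in-addr.arpa.
```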

2020-10-22

  • 10:46 arturo: [codfw1dev] rebooting cloudinfra-internal-puppetmaster-01.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud to try fixing some DNS weirdness
  • 09:43 arturo: enabling puppet in cloudcontrol1003 (message said "please re-enable after 2020-10-22 06:00UTC")

2020-10-21

  • 14:36 andrewbogott: running apt-get update && apt-get install -y facter on all cloud-vps instances
  • 10:31 arturo: [codfw1dev] reimaging labtestvirt2003 (cloudgw) to test puppet code (T261724)
  • 08:56 arturo: [codfw1dev] reimaging labtestvirt2003 (cloudgw) to test puppet code (T261724)

2020-10-20

2020-10-19

  • 01:41 andrewbogott: deleting all Precise base images
  • 01:36 andrewbogott: deleting all unused Jessie base images

2020-10-18

  • 23:26 andrewbogott: deleting all Trusty base images
  • 21:50 andrewbogott: migrating all currently used ceph images to rbd

2020-10-16

  • 09:29 arturo: [codfw1dev] still some DNS weirdness, investigating
  • 09:25 arturo: [codfw1dev] hard-rebooting bastion-codfw1dev-02, seems in bad shape, doesn't even wake up in the virsh console
  • 09:18 arturo: [codfw1dev] live-hacked cloudservices2002-dev /etc/powerdns/recursor.conf file to include cloud-codfw1dev-floating CIDR (185.15.57.0/29) while https://gerrit.wikimedia.org/r/c/operations/puppet/+/634050 is in review, so VMs with a floating IP can query the DNS recursor (T261724)
  • 09:01 arturo: [codfw1dev] basic network connectivity seems stable after cleaning up everything related to address scopes (T261724)
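
The live hack above added the floating-IP range to the recursor's query ACL; in PowerDNS Recursor that is the allow-from setting. A sketch of the intended end state (the pre-existing ranges shown are placeholders, not the real file):

```
# /etc/powerdns/recursor.conf (sketch; existing ranges are placeholders)
allow-from=172.16.128.0/24, 185.15.57.0/29
```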

2020-10-15

  • 15:17 arturo: [codfw1dev] try cleaning up anything related to address scopes in the neutron database (T261724)
  • 13:56 arturo: [codfw1dev] drop neutron l3 agent hacks in cloudnet2002/2003-dev (T261724)

2020-10-13

  • 17:54 andrewbogott: rebuilding cloudvirt1021 for backy support
  • 15:22 andrewbogott: draining cloudvirt1021 so I can rebuild it with backy support
  • 14:19 andrewbogott: rebuilding cloudvirt1022 with backy support
  • 14:03 andrewbogott: draining cloudvirt1022 so I can rebuild it with backy support
  • 11:19 arturo: [codfw1dev] rebooting labtestvirt2003

2020-10-09

  • 10:15 arturo: [codfw1dev] root@cloudcontrol2001-dev:~# openstack router set --disable-snat cloudinstances2b-gw --external-gateway wan-transport-codfw (T261724)
  • 09:22 arturo: [codfw1dev] rebooting cloudnet boxes for bridge and vlan changes (T261724)
  • 09:12 arturo: [codfw1dev] root@cloudcontrol2001-dev:~# openstack subnet delete 31214392-9ca5-4256-bff5-1e19a35661de (cloud-instances-transport1-b-codfw - 208.80.153.184/29) (T261724)
  • 09:10 arturo: [codfw1dev] root@cloudcontrol2001-dev:~# openstack router set --external-gateway wan-transport-codfw --fixed-ip subnet=cloud-gw-transport-codfw,ip-address=185.15.57.10 cloudinstances2b-gw (T261724)
  • 08:49 arturo: [codfw1dev] root@cloudcontrol2001-dev:~# openstack subnet create --network wan-transport-codfw --gateway 185.15.57.9 --no-dhcp --subnet-range 185.15.57.8/30 cloud-gw-transport-codfw (T261724)
  • 08:47 arturo: [codfw1dev] root@cloudcontrol2001-dev:~# openstack subnet delete a5ab5362-4ffb-4059-9ff7-391e22dcf3bc (T261724)

2020-10-08

  • 16:17 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# openstack subnet create --network wan-transport-codfw --gateway 185.15.57.8 --no-dhcp --subnet-range 185.15.57.8/31 cloud-gw-transport-codfw` (with a hack -- see task) (T263622)
  • 16:03 arturo: [codfw1dev] briefly live-hacked python3-neutron source code in all 3 cloudcontrol2xxx-dev servers to workaround /31 network definition issue (T263622)
  • 10:28 arturo: [codfw1dev] reimaging labtestvirt2003 (cloudgw) T261724
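
The /31 workaround above was needed because a /31 is an RFC 3021 point-to-point subnet: it holds exactly two addresses with no separate network/broadcast, which stock Neutron subnet validation rejects. A quick check of the arithmetic (subnet taken from the entry above):

```shell
# RFC 3021: a /31 contains exactly two addresses, leaving no room
# for the network/broadcast reservation Neutron normally expects.
python3 -c "import ipaddress; print(ipaddress.ip_network('185.15.57.8/31').num_addresses)"
# → 2
```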

2020-10-06

  • 21:30 andrewbogott: moved cloudvirt1013 out of the 'ceph' aggregate and into the 'maintenance' aggregate for T243414
  • 21:29 andrewbogott: draining cloudvirt1013 for upgrade to 10G networking
  • 14:45 arturo: icinga downtime every cloud* lab* host for 60 minutes for keystone maintenance

2020-10-05

  • 17:40 bd808: `service uwsgi-labspuppetbackend restart` on cloud-puppetmaster-03 (T264649)

2020-10-02

  • 11:05 arturo: [codfw1dev] restarting rabbitmq-server in all 3 control nodes, the l3 agent was misbehaving
  • 09:16 arturo: [codfw1dev] trying the labtestvirt2003 (cloudgw) reimage again (T261724)

2020-10-01

  • 16:06 arturo: rebooting cloudvirt1024 to validate changes to /etc/network/interfaces file
  • 15:36 arturo: [codfw1dev] reimaging labtestvirt2003

2020-09-30

  • 16:47 andrewbogott: rebooting cloudvirt1032, 1033, 1034 for T262979
  • 13:28 arturo: enable puppet, reboot and pool back cloudvirt1031
  • 13:27 arturo: extend icinga downtimes for another 120 mins
  • 13:15 arturo: `aborrero@cloudcontrol1003:~$ sudo nova-manage placement sync_aggregates` after reading a hint in nova-api.log
  • 13:02 arturo: rebooting cloudvirt1016 and moving it to the ceph host aggregate
  • 12:55 arturo: rebooting cloudvirt1014 and moving it to the ceph host aggregate
  • 12:51 arturo: rebooting cloudvirt1013 and moving it to the ceph host aggregate
  • 12:39 arturo: root@cloudcontrol1005:~# openstack aggregate add host maintenance cloudvirt1031
  • 12:36 arturo: rebooted cloudnet1003 (active) a couple of minutes ago
  • 12:36 arturo: move cloudvirt1012 and cloudvirt1039 to the ceph aggregate
  • 11:49 arturo: rebooting cloudvirt1039
  • 11:46 arturo: rebooting cloudvirt1012
  • 11:40 arturo: rebooting cloudnet1004 (standby) to pick up https://gerrit.wikimedia.org/r/c/operations/puppet/+/631167 (T262979)
  • 11:38 arturo: [codfw1dev] rebooting cloudnet2002-dev to pick up https://gerrit.wikimedia.org/r/c/operations/puppet/+/631167
  • 11:36 arturo: [codfw1dev] rebooting cloudnet2003-dev to pick up https://gerrit.wikimedia.org/r/c/operations/puppet/+/631167
  • 11:33 arturo: disabling puppet and downtiming every virt/net server in the fleet in preparation for merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/631167 (T262979)
  • 09:32 arturo: rebooting cloudvirt1012 to investigate linuxbridge agent issues

2020-09-29

  • 15:40 arturo: downgrade linux kernel from linux-image-4.19.0-11-amd64 to linux-image-4.19.0-10-amd64 on cloudvirt1012
  • 14:47 arturo: rebooting cloudvirt1012, chasing config weirdness in the linuxbridge agent
  • 14:05 andrewbogott: reimaging 1014 over and over in an attempt to get partman right
  • 13:51 arturo: rebooting cloudvirt1012

2020-09-28

  • 14:55 arturo: [jbond42] upgraded facter to v3 across the VM fleet
  • 13:54 andrewbogott: moving cloudvirt1035 from aggregate 'spare' to 'ceph'. We're going to need all the capacity we can get while converting older cloudvirts to ceph

2020-09-24

  • 15:47 arturo: stopping/restarting rabbitmq-server in all cloudcontrol servers
  • 15:45 arturo: restarting rabbitmq-server in cloudcontrol1003
  • 15:15 arturo: restarting floating_ip_ptr_records_updater.service in all 3 cloudcontrol servers to reset state after a DNS failure

2020-09-18

  • 10:16 arturo: cloudvirt1039 libvirtd service issues were fixed with a reboot
  • 09:56 arturo: rebooting cloudvirt1039 (spare) to try to fix some weird libvirtd failure
  • 09:50 arturo: enabling puppet in cloudvirts and effectively merging patches from T262979
  • 08:59 arturo: disable puppet in all buster cloudvirts (cloudvirt[1024,1031-1039].eqiad.wmnet) to merge a patch for T263205 and T262979
  • 08:50 arturo: installing iptables from buster-bpo in cloudvirt1036 (T263205 and T262979)

2020-09-15

  • 20:32 andrewbogott: rebooting cloudvirt1038 to see if it resolves T262979
  • 13:58 andrewbogott: draining cloudvirt1002 with wmcs-ceph-migrate

2020-09-14

  • 14:21 andrewbogott: draining cloudvirt1001, migrating all VMs with wmcs-ceph-migrate
  • 10:41 arturo: [codfw1dev] trying to get the bonding working for labtestvirt2003 (T261724)
  • 09:47 arturo: installed qemu security update in eqiad1 cloudvirts (T262386)
  • 09:43 arturo: [codfw1dev] installed qemu security update in codfw1dev cloudvirts (T262386)

2020-09-09

2020-09-08

  • 21:48 bd808: Renamed FQDN prefixes to wikimedia.cloud scheme in cloudinfra-db01's labspuppet db (T260614)
  • 14:29 andrewbogott: restarting nova-compute on all cloudvirts (everyone is upset from the reset switch failure)
  • 14:18 arturo: restarting nova-fullstack service in cloudcontrol1003
  • 14:17 andrewbogott: stopping apache2 on labweb1001 to make sure the Horizon outage is total

2020-09-03

  • 09:31 arturo: icinga downtime cloud* servers for 30 mins (T261866)

2020-09-02

  • 08:46 arturo: [codfw1dev] reimaging spare server labtestvirt2003 as debian buster (T261724)

2020-09-01

  • 18:18 andrewbogott: adding drives on cloudcephosd100[3-5] to ceph osd pool
  • 13:40 andrewbogott: adding drives on cloudcephosd101[0-2] to ceph osd pool
  • 13:35 andrewbogott: adding drives on cloudcephosd100[1-3] to ceph osd pool
  • 11:27 arturo: [codfw1dev] rebooting again cloudnet2002-dev after some network tests, to reset initial state (T261724)
  • 11:09 arturo: [codfw1dev] rebooting cloudnet2002-dev after some network tests, to reset initial state (T261724)
  • 10:49 arturo: disable puppet in cloudnet servers to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/623569/

2020-08-31

2020-08-28

  • 20:12 bd808: Running `wmcs-novastats-dnsleaks --delete` from cloudcontrol1003

2020-08-26

  • 17:12 bstorm: Running 'ionice -c 3 nice -19 find /srv/tools -type f -size +100M -printf "%k KB %p\n" > tools_large_files_20200826.txt' on labstore1004 T261336

2020-08-21

  • 21:34 andrewbogott: restarting nova-compute on cloudvirt1033; it seems stuck

2020-08-19

  • 14:21 andrewbogott: rebooting cloudweb2001-dev, labweb1001, labweb1002 to address mediawiki-induced memleak

2020-08-06

  • 21:02 andrewbogott: removing cloudvirt1004/1006 from nova's list of hypervisors; rebuilding them to use as backup test hosts
  • 20:06 bstorm: manually stopped the RAID check on cloudcontrol1003 T259760

2020-08-04

  • 18:54 bstorm: restarting mariadb on cloudcontrol1004 to setup parallel replication

2020-08-03

  • 17:02 bstorm: increased db connection limit to 800 across galera cluster because we were clearly hovering at limit

2020-07-31

  • 19:28 bd808: wmcs-novastats-dnsleaks --delete (lots of leaked fullstack-monitoring records to clean up)

2020-07-27

  • 22:17 andrewbogott: ceph osd pool set compute pg_num 2048
  • 22:14 andrewbogott: ceph osd pool set compute pg_autoscale_mode off

2020-07-24

  • 19:15 andrewbogott: ceph mgr module enable pg_autoscaler
  • 19:15 andrewbogott: ceph osd pool set compute pg_autoscale_mode on

2020-07-22

  • 08:55 jbond42: [codfw1dev] upgrading hiera to version 5
  • 08:48 arturo: [codfw1dev] add jbond as user in the bastion-codfw1dev and cloudinfra-codfw1dev projects
  • 08:45 arturo: [codfw1dev] enabled account creation in labtestwiki briefly for jbond42 to create an account

2020-07-16

2020-07-15

  • 23:15 bd808: Removed Merlijn van Deen from toollabs-trusted Gerrit group (T255697)
  • 11:48 arturo: [codfw1dev] created DNS records (A and PTR) for bastion.bastioninfra-codfw1dev.codfw1dev.wmcloud.org <-> 185.15.57.2
  • 11:41 arturo: [codfw1dev] add myself as projectadmin to the `bastioninfra-codfw1dev` project
  • 11:39 arturo: [codfw1dev] created DNS zone `bastioninfra-codfw1dev.codfw1dev.wmcloud.org.` in the cloudinfra-codfw1dev project and then transfer ownership to the bastioninfra-codfw1dev project

2020-07-14

  • 15:19 arturo: briefly set root@cloudnet1003:~ # sysctl net.ipv4.conf.all.accept_local=1 (in neutron qrouter netns) (T257534)
  • 10:43 arturo: icinga downtime cloudnet* hosts for 30 mins to introduce new check https://gerrit.wikimedia.org/r/c/operations/puppet/+/612390 (T257552)
  • 04:01 andrewbogott: added a wildcard *.wmflabs.org domain pointing at the domain proxy in project-proxy
  • 04:00 andrewbogott: shortened the ttl on .wmflabs.org. to 300
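
The accept_local sysctl above was set inside the Neutron qrouter network namespace rather than on the host itself. The general pattern looks roughly like this (a sketch; the router UUID below is a placeholder, not a real ID):

```shell
# Find the router namespaces on a cloudnet host, then apply the
# sysctl inside one. Requires root; placeholder UUID shown.
sudo ip netns list | grep '^qrouter-'
sudo ip netns exec qrouter-00000000-0000-0000-0000-000000000000 \
    sysctl -w net.ipv4.conf.all.accept_local=1
```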

2020-07-13

  • 16:17 arturo: icinga downtime cloudcontrol[1003-1005].wikimedia.org for 1h for galera database movements

2020-07-12

  • 17:39 andrewbogott: switched eqiad1 keystone from m5 to cloudcontrol galera

2020-07-10

  • 20:26 andrewbogott: disabling nova api to move database to galera

2020-07-09

  • 11:23 arturo: [codfw1dev] rebooting cloudnet2003-dev again for testing sysctl/puppet behavior (T257552)
  • 11:11 arturo: [codfw1dev] rebooting cloudnet2003-dev for testing sysctl/puppet behavior (T257552)
  • 09:16 arturo: manually increasing sysctl value of net.nf_conntrack_max in cloudnet servers (T257552)
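
A manual bump of net.nf_conntrack_max as above takes effect immediately but does not survive a reboot (or a puppet run that manages sysctls). A sketch of the manual step (the value shown is illustrative, not the one actually used):

```shell
# Inspect the current conntrack table ceiling, then raise it.
# Root required; value below is illustrative only.
sysctl net.nf_conntrack_max
sudo sysctl -w net.nf_conntrack_max=1048576
```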

2020-07-06

  • 15:16 arturo: installing 'aptitude' in all cloudvirts

2020-07-03

  • 12:51 arturo: [codfw1dev] galera cluster should be up and running, openstack happy (T256283)
  • 11:44 arturo: [codfw1dev] restoring glance database backup from bacula into cloudcontrol2001-dev (T256283)
  • 11:39 arturo: [codfw1dev] stopped mysql database in the galera cluster T256283
  • 11:36 arturo: [codfw1dev] dropped glance database in the galera cluster T256283

2020-07-02

  • 15:41 arturo: `sudo wmcs-openstack --os-compute-api-version 2.55 flavor create --private --vcpus 8 --disk 300 --ram 16384 --property aggregate_instance_extra_specs:ceph=true --description "for packaging envoy" bigdisk-ceph` (T256983)

2020-06-29

  • 14:24 arturo: starting rabbitmq-server in all 3 cloudcontrol servers
  • 14:23 arturo: stopping rabbitmq-server in all 3 cloudcontrol servers

2020-06-18

  • 20:38 andrewbogott: rebooting cloudservices2003-dev due to a mysterious 'host down' alert on a secondary ip

2020-06-16

  • 15:38 arturo: created by hand neutron port 9c0a9a13-e409-49de-9ba3-bc8ec4801dbf `paws-haproxy-vip` (T195217)

2020-06-12

  • 13:23 arturo: DNS zone `paws.wmcloud.org` transferred to the PAWS project (T195217)
  • 13:20 arturo: created DNS zone `paws.wmcloud.org` (T195217)

2020-06-11

  • 19:19 bstorm_: proceeding with failback to labstore1004 now that DRBD devices are consistent T224582
  • 17:22 bstorm_: delaying failback labstore1004 for drive syncs T224582
  • 17:17 bstorm_: failing NFS back to labstore1004 to complete the upgrade process T224582
  • 16:15 bstorm_: failing over NFS for labstore1004 to labstore1005 T224582

2020-06-10

  • 16:09 andrewbogott: deleting all old cloud-ns0.wikimedia.org and cloud-ns1.wikimedia.org ns records in designate database T254496

2020-06-09

  • 15:25 arturo: icinga downtime everything cloud* lab* for 2h more (T253780)
  • 14:09 andrewbogott: stopping puppet, all designate services and all pdns services on cloudservices1004 for T253780
  • 14:01 arturo: icinga downtime everything cloud* lab* for 2h (T253780)

2020-06-05

2020-06-04

  • 14:24 andrewbogott: disabling puppet on all instances for /labs/private recovery
  • 14:23 arturo: disabling puppet on all instances for /labs/private recovery

2020-05-28

  • 23:02 bd808: `/usr/local/sbin/maintain-dbusers --debug harvest-replicas` (T253930)
  • 13:36 andrewbogott: rebuilding cloudservices2002-dev with Buster
  • 00:33 andrewbogott: shutting down cloudservices2002-dev to see if we can live without it. This is in anticipation of rebuilding it entirely for T253780

2020-05-27

  • 23:29 andrewbogott: disabling the backup job on cloudbackup2001 (just like last week) so the backup doesn't start while Brooke is rebuilding labstore1004 tomorrow.
  • 06:03 bd808: `systemctl start mariadb` on clouddb1001 following reboot (take 2)
  • 05:58 bd808: `systemctl start mariadb` on clouddb1001 following reboot
  • 05:53 bd808: Hard reboot of clouddb1001 via Horizon. Console unresponsive.

2020-05-25

  • 16:35 arturo: [codfw1dev] created zone `0-29.57.15.185.in-addr.arpa.` (T247972)

2020-05-21

  • 19:23 andrewbogott: disabling puppet on cloudbackup2001 to prevent the backup job from starting during maintenance
  • 19:16 andrewbogott: systemctl disable block_sync-tools-project.service on cloudbackup2001.codfw.wmnet to avoid stepping on current upgrade
  • 15:48 andrewbogott: re-imaging cloudnet1003 with Buster

2020-05-19

  • 22:59 bd808: `apt-get install mariadb-client` on cloudcontrol1003
  • 21:12 bd808: Migrating wcdo.wcdo.eqiad.wmflabs to cloudvirt1023 (T251065)

2020-05-18

  • 21:37 andrewbogott: rebuilding cloudnet2003-dev with Buster

2020-05-15

  • 22:10 bd808: Added reedy as projectadmin in cloudinfra project (T249774)
  • 22:05 bd808: Added reedy as projectadmin in admin project (T249774)
  • 18:44 bstorm_: rebooting cloudvirt-wdqs1003 T252831
  • 15:47 bd808: Manually running wmcs-novastats-dnsleaks from cloudcontrol1003 (T252889)

2020-05-14

  • 23:28 bstorm_: downtimed cloudvirt1004/6 and cloudvirt-wdqs1003 until tomorrow around this time T252831
  • 22:21 bstorm_: upgrading qemu-system-x86 on cloudvirt1006 to backports version T252831
  • 22:15 bstorm_: changing /etc/libvirt/qemu.conf and restarting libvirtd on cloudvirt1006 T252831
  • 21:12 andrewbogott: rebuilding cloudvirt-wdqs1003 as part of T252831
  • 15:47 andrewbogott: moving cloudvirt1004 and cloudvirt1006 to the 'ceph' aggregate for T252784
  • 15:02 andrewbogott: moving all of cloudvirt100[1-9] into the 'toobusy' host aggregate. These are slower, have spinning disks, and are due for replacement.

2020-05-12

  • 20:33 andrewbogott: moving cloudvirt1023 to the 'standard' pool and out of the 'spare' pool
  • 19:10 jeh: disable neutron-openvswitch-agent service on cloudvirt2001-dev.codfw T248881
  • 19:09 jeh: Shutdown the unused eno2 network interface on cloudvirt2001-dev.codfw to clear up monitoring errors T248425
  • 18:20 andrewbogott: moving cloudvirt1024 out of the 'maintenance' aggregate and into 'spare'
  • 16:45 andrewbogott: restarting neutron-l3-agent on cloudnet1004 so it knows about all three cloudcontrols. Leaving cloudnet1003 since restarting it there will cause network interruptions
  • 14:06 arturo: icinga downtime everything for 2h for Debian Buster migration in some cloud components

2020-05-09

  • 16:53 andrewbogott: rebuilding cloudcontrol2001-dev and 2003-dev with buster for T252121

2020-05-08

  • 19:02 bstorm_: moving tools-k8s-haproxy-2 from cloudvirt1021 to cloudvirt1017 to improve spread

2020-05-05

  • 13:58 andrewbogott: rebuilding cloudcontrol2004-dev to test new puppet changes

2020-05-04

  • 09:04 arturo: [codfw1dev] manually modify iptables ruleset to only allow SSH from WMF bastions on cloudservices2003-dev and cloudcontrol2004-dev (T251604)

2020-04-21

  • 22:12 andrewbogott: moving cloudvirt1004 out of the 'standard' aggregate and into the 'maintenance' aggregate
  • 16:01 jeh: restart cloudceph mon and osd services for openssl upgrades

2020-04-15

  • 18:44 jeh: create indexes and views for grwikimedia T245912

2020-04-13

  • 15:07 jeh: restart memcached on labwebs to increase cache size T145703

2020-04-09

  • 19:57 andrewbogott: upgrading eqiad1 designate to rocky
  • 16:52 andrewbogott: cleaned up a bunch of leaked .eqiad.wmflabs dns records

2020-04-08

  • 19:20 andrewbogott: rotated password and api token for pdns servers on cloudservices1003 and cloudservices1004
  • 14:54 arturo: `root@cloudcontrol1003:~# cp /etc/inputrc .inputrc` to solve some bash shortcut weirdness

2020-04-07

  • 20:57 andrewbogott: service sssd stop; rm -rf /var/lib/sss/db*; service sssd start on tools-sgebastion-08

2020-04-06

  • 22:39 andrewbogott: deleting bogus groups cn=b'project-bastion',ou=groups,dc=wikimedia,dc=org and cn=b'project-tools',ou=groups,dc=wikimedia,dc=org from ldap
  • 17:42 arturo: [codfw1dev] transferred DNS zone 57.15.185.in-addr.arpa. to the cloudinfra-codfw1dev project (T247972)
  • 17:39 arturo: [codfw1dev] `openstack zone create --email root@wmflabs.org --type PRIMARY --ttl 3600 --description "floating IPs subnet" 57.15.185.in-addr.arpa.` (T247972)
  • 16:23 arturo: restarting apache2 in cloudcontrol1003/1004 to pick up latest wmfkeystonehooks changes T249494
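
The bogus cn=b'project-bastion' entries deleted above look like Python 3 bytes objects that were interpolated into an LDAP DN as their repr. A minimal reproduction of the mechanism (the surrounding LDAP code is an assumption; only the string behavior is shown):

```shell
# Interpolating a bytes object into a str embeds the b'...' repr,
# producing exactly the malformed DN seen in the log entry.
python3 -c "print('cn=%s,ou=groups' % b'project-bastion')"
# → cn=b'project-bastion',ou=groups
```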

2020-04-02

  • 20:59 jeh: codfw1dev clear VM error states and start bastions, puppet master and database

2020-04-01

  • 16:27 arturo: [codfw1dev] enable puppet across the fleet to clean up vxlan changes (T248881)

2020-03-31

  • 12:35 arturo: [codfw1dev] restarting VMs: designaterockytest14, bastion-codfw1dev-0[1,2] (T248881)
  • 12:34 arturo: [codfw1dev] installing neutron-openvswitch-agent on cloudvirt2001-dev (T248881)
  • 12:25 arturo: [codfw1dev] installing neutron-openvswitch-agent on cloudnet200[2,3]-dev (T248881)
  • 11:45 arturo: [codfw1dev] rebooting cloudvirt2003-dev to pick up latest kernel update. Otherwise modprobe is confused trying to load modules and openvswitch won't start (T248881)
  • 10:40 arturo: [codfw1dev] installing neutron-openvswitch-agent on cloudvirt2003-dev (T248881)
  • 10:09 arturo: [codfw1dev] reboot cloudnet2003-dev into linux 4.9 (was using 4.14 from a testing operation on 2020-03-10)

2020-03-30

2020-03-27

  • 21:28 bd808: Created huggle.wmcloud.org Designate zone and allocated it to the huggle project
  • 19:51 jeh: start haproxy on cloudcontrol2003-dev.wikimedia.org

2020-03-26

  • 15:01 arturo: icinga downtime cloudvirt* cloudcontrol* cloudnet* lab* cloudstore*
  • 15:01 andrewbogott: beginning openstack upgrade window for T242766
  • 12:32 arturo: [codfw1dev] downgraded systemd, libsystemd0, udev and friends to the non-backports versions (T247013)

2020-03-25

  • 19:29 andrewbogott: dumping a bunch of VMs on cloudvirt1015 to see if it still crashes
  • 17:56 jeh: add labweb1002 back into the pool - completed horizon testing T240852
  • 17:09 jeh: depool labweb1002 for horizon testing T240852

2020-03-24

  • 19:41 jeh: switch cloudvirt1016 from maintenance to standard host aggregate T243327
  • 15:31 andrewbogott: restarting nova-conductor and nova-api on cloudcontrol1003 and cloudcontrol1004

2020-03-23

  • 21:41 jeh: restart neutron-l3-agent on cloudnet100[3,4] to pickup policy.yaml changes
  • 13:28 jeh: disable puppet on labweb100[1,2] to enable horizon event traces T240852
  • 10:26 arturo: restarting apache in both labweb1001/labweb1002 upon reports of returning 500s

2020-03-21

  • 14:23 andrewbogott: restarting apache2 on labweb1001 and 1002

2020-03-18

  • 19:17 andrewbogott: deleted a bunch of records from the pdns database on cloudservices1003/1004 which had a record name but the content (where an IP address should be) was NULL, e.g. m.wikidata.beta.wmflabs.org.
  • 10:55 arturo: [codfw1dev] deleting BGP agent, undoing changes we did for T245606
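
The NULL-content cleanup above can be expressed against the stock PowerDNS gmysql schema (table `records`, column `content`) roughly as follows; the database name is an assumption, and previewing with SELECT before DELETE is prudent:

```shell
# Preview, then remove, records that have a name but NULL content.
# Database name "pdns" is assumed; run on the pdns database host.
sudo mysql pdns -e "SELECT name, type FROM records WHERE content IS NULL;"
sudo mysql pdns -e "DELETE FROM records WHERE content IS NULL;"
```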

2020-03-14

  • 17:40 jeh: restart maintain-dbusers on labstore1004 T247654

2020-03-13

2020-03-12

  • 22:29 bstorm_: running puppet across all dumps mounts to make sure active links are shifted to labstore1006

2020-03-11

2020-03-10

  • 17:02 arturo: [codfw1dev] deleting address scopes, bad interaction with our custom NAT setup T247135
  • 13:55 arturo: [codfw1dev] rebooting cloudnet2003-dev into linux kernel 4.14 for testing stuff related to T247135

2020-03-09

  • 18:09 arturo: enabling puppet in cloudvirt1006, all services have been restored
  • 17:59 arturo: deleted the neutron bridge on cloudvirt1006, for testing stuff related to the queens upgrade
  • 17:58 arturo: stopped neutron-linuxbridge-agent and nova-compute in cloudvirt1006 for testing stuff related to the queens upgrade

2020-03-06

  • 14:54 andrewbogott: draining all instances off of cloudvirt1006 for T246908

2020-03-05

  • 14:24 arturo: [codfw1dev] we just enabled BGP session between cloudnet2xxx-dev and cr1-codfw (T245606)
  • 13:07 arturo: [codfw1dev] move the extra IP address for BGP in cloudnet200x-dev servers from eno2.2120 to the br-external bridge device (T245606)
  • 13:06 arturo: [codfw1dev] upgrade neutron-dynamic-routing packages in cloudnet200X-dev and cloudcontrol200X-dev servers to 11.0.0-2~bpo9+1 (T245606)

2020-03-04

  • 22:22 andrewbogott: upgrading designate on cloudservices1003/1004 to Queens
  • 22:09 andrewbogott: moving cloudvirt1006 into the maintenance aggregate for T246908
  • 21:37 bd808: Running wmcs-wikireplica-dns to add service names for ngwikimedia.*.db.svc.eqiad.wmflabs (T240772)
  • 21:14 bd808: Running `sudo maintain-meta_p --all-databases --purge` on labsdb1009 (T246056)
  • 21:11 bd808: Running `sudo maintain-meta_p --all-databases --purge` on labsdb1010 (T246056)
  • 21:08 bd808: Running `sudo maintain-meta_p --all-databases --purge` on labsdb1011 (T246056)
  • 21:05 bd808: Running `sudo maintain-meta_p --all-databases --purge` on labsdb1002 (T246056)

2020-03-02

  • 16:54 arturo: [codfw1dev] deleted python3-os-ken debian package in cloudnet2003-dev which was installed by hand and had dependency issues

2020-02-29

  • 16:32 bstorm_: downtimed the smart alert on cloudvirt1009 until Monday since apparently predictive failures flap T244986

2020-02-26

  • 22:03 jeh: powering down cloudvirt1014 for hardware maintenance

2020-02-25

  • 16:08 andrewbogott: changing neutron's rabbitmq password because oslo is having trouble parsing some of the characters in the password
  • 15:26 andrewbogott: updated the cell_mapping record in the nova_api database to add the second rabbitmq server to the transport_url field
  • 15:26 andrewbogott: updated the cell_mapping record in the nova_api database to set the db uri to 'mysql+pymysql' -- this in response to a deprecation notice

2020-02-24

  • 12:16 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# neutron bgp-speaker-peer-add bgpspeaker cr2-codfw` (T245606)
  • 12:16 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# neutron bgp-speaker-peer-add bgpspeaker cr1-codfw` (T245606)
  • 12:09 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# neutron bgp-peer-create --peer-ip 208.80.153.187 --remote-as 65002 cr2-codfw` (T245606)
  • 12:09 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# neutron bgp-peer-create --peer-ip 208.80.153.186 --remote-as 65002 cr1-codfw` (T245606)
  • 12:06 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# neutron bgp-peer-delete 17b8c2a3-f0ce-4d50-a265-18ccac703c61` (T245606)
  • 10:59 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# neutron bgp-speaker-peer-add bgpspeaker bgppeer` (T245606)
  • 10:56 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# neutron bgp-peer-create --peer-ip 208.80.153.185 --remote-as 65002 bgppeer` (T245606)

2020-02-21

  • 12:48 arturo: [codfw1dev] running `root@cloudcontrol2001-dev:~# neutron bgp-speaker-network-add bgpspeaker wan-transport-codfw` (T245606)
  • 12:46 arturo: [codfw1dev] created bgpspeaker for AS64711 (T245606)
  • 12:42 arturo: [codfw1dev] run `sudo neutron-db-manage upgrade head` to upgrade the db schema for neutron bgp tables
  • 11:51 arturo: [codfw1dev] create a neutron subnet pool for each subnet object we have and manually update the DB to inter-associate them (T245606)
  • 11:49 arturo: [codfw1dev] rename neutron address scope `no-nat` to `bgp` (T245606)
  • 11:37 arturo: [codfw1dev] cleanup unused neutron subnet pools from previous address scope tests (T244851)

2020-02-20

  • 19:22 andrewbogott: updating designate pool config for https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/572213/
  • 15:33 andrewbogott: migrating all VMs on cloudvirt1014 to cloudvirt1022
  • 13:35 arturo: [codfw1dev] disable puppet in cloudcontrol servers to hack neutron.conf for tests related to T245606
  • 13:33 arturo: [codfw1dev] disable puppet in cloudnet servers to hack neutron.conf for tests related to T245606

2020-02-18

  • 22:19 andrewbogott: transferred the tools.wmcloud.org. to the tools project
  • 22:16 andrewbogott: moved wmcloud.org dns domain to the cloud-infra project
  • 21:02 andrewbogott: adding .eqiad1.wikimedia.cloud records to all existing eqiad1 VMs, updating all eqiad1 internal pointer records to reference the new eqiad1.wikimedia.cloud fqdns.
  • 09:44 arturo: deleted DNS zone wmcloud.org and try re-creating it

2020-02-14

  • 10:35 arturo: running `root@cloudcontrol2001-dev:~# designate server-create --name ns1.openstack.codfw1dev.wikimediacloud.org.` (T243766)
  • 10:32 arturo: running `root@cloudcontrol1004:~# designate server-create --name ns1.openstack.eqiad1.wikimediacloud.org.` (T243766)
  • 10:32 arturo: running `root@cloudcontrol1004:~# designate server-create --name ns0.openstack.eqiad1.wikimediacloud.org.` (T243766)

2020-02-12

  • 13:38 arturo: [codfw1dev] add reference to subnetpool to the instance subnet `MariaDB [neutron]> update subnets set subnetpool_id='d129650d-d4be-4fe1-b13e-6edb5565cb4a' where id = '7adfcebe-b3d0-4315-92fe-e8365cc80668';` (T244851)

2020-02-11

  • 13:46 arturo: [codfw1dev] creating some neutron objects to investigate T244851 (subnets, subnet pools, address scopes, ...)
  • 12:40 arturo: [codfw1dev] delete unknown address scope 'wmcs-v4-scope': `root@cloudcontrol2001-dev:~# openstack address scope delete 078cfd71-117b-4aac-9197-6ebbbb7dd3de` (T244851)
  • 12:40 arturo: [codfw1dev] delete unknown subnet pool 'cloudinstancesb-v4-pool0': `root@cloudcontrol2001-dev:~# openstack subnet pool delete d23a9b88-5c3d-4a53-ab88-053233a75365` (T244851)

2020-02-07

  • 18:11 jeh: shutdown cloudvirt1016 for hardware maintenance T241882

2020-02-06

  • 14:44 jeh: update apt packages on cloudvirt1015 T220853
  • 14:28 jeh: run hardware tests on cloudvirt1015 T220853

2020-01-28

  • 17:24 arturo: [codfw1dev] root@cloudcontrol2001-dev:~# designate server-create --name ns0.openstack.codfw1dev.wikimediacloud.org. (T243766)
  • 10:18 arturo: [codfw1dev] created DNS record `bastion-codfw1dev-01.codfw1dev.wmcloud.org A 185.15.57.2` (T242976, T229441)
  • 10:13 arturo: [codfw1dev] the zone `codfw1dev.wmcloud.org` belongs now to the `cloudinfra-codfw1dev` project (T242976)
  • 10:11 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# openstack zone create --description "main DNS domain for public addresses" --email "root@wmflabs.org" --type PRIMARY --ttl 3600 codfw1dev.wmcloud.org.` (T242976 and T243766)
  • 09:53 arturo: restart apache2 in labweb1001/1002 because horizon errors
  • 09:47 arturo: created DNS zone wmcloud.org in eqiad1 and transferred it to the cloudinfra project (T242976); right now its only use is to delegate the codfw1dev.wmcloud.org subdomain to designate in the other deployment

2020-01-27

  • 12:45 arturo: [codfw1dev] manually move the new domain to the `cloudinfra-codfw1dev` project clouddb2001-dev: `[designate]> update zones set tenant_id='cloudinfra-codfw1dev' where id = '4c75410017904858a5839de93c9e8b3d';` T243556
  • 12:44 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# openstack zone create --description "main DNS domain for VMs" --email "root@wmflabs.org" --type PRIMARY --ttl 3600 codfw1dev.wikimedia.cloud.` T243556

2020-01-24

  • 15:10 jeh: remove icinga downtime for cloudvirt1013 T241313
  • 12:52 arturo: repooling cloudvirt1013 after HW got fixed (T241313)

2020-01-21

  • 17:43 bstorm_: remounting /mnt/nfs/dumps-labstore1007.wikimedia.org/ on all dumps-mounting projects
  • 10:24 arturo: running `sudo systemctl restart apache2.service` in both labweb servers to try mitigating T240852

2020-01-15

  • 16:59 bd808: Changed the config for cloud-announce mailing list so that list admins do not get bounce unsubscribe notices

2020-01-14

  • 14:03 arturo: icinga downtime all cloudvirts for another 2h for fixing some icinga checks
  • 12:04 arturo: icinga downtime toolchecker for 2 hours for openstack upgrades T241347
  • 12:02 arturo: icinga downtime cloud* labs* hosts for 2 hours for openstack upgrades T241347
  • 04:26 andrewbogott: upgrading designate on cloudservices1003/1004

2020-01-13

  • 13:34 arturo: [codfw1dev] prevent neutron from allocating floating IPs from the wrong subnet by doing `neutron subnet-update --allocation-pool start=208.80.153.190,end=208.80.153.190 cloud-instances-transport1-b-codfw` (T242594)

2020-01-10

  • 13:27 arturo: cloudvirt1009: virsh undefine i-000069b6. This is tools-elastic-01 which is running on cloudvirt1008 (so, leaked on cloudvirt1009)

2020-01-09

  • 11:12 arturo: running `MariaDB [nova_eqiad1]> update quota_usages set in_use='0' where project_id='etytree';` (T242332)
  • 11:11 arturo: running `MariaDB [nova_eqiad1]> select * from quota_usages where project_id = 'etytree';` (T242332)
  • 10:32 arturo: ran `root@cloudcontrol1004:~# nova-manage project quota_usage_refresh --project etytree`
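The SELECT/UPDATE pair above can be sketched against a stand-in table. The table and column names follow the queries in the log; the schema, sample rows, and values here are hypothetical, and the real change was of course run against nova's MariaDB, not SQLite:

```python
import sqlite3

# Stand-in for nova's quota_usages table; schema and rows are made up.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE quota_usages (project_id TEXT, resource TEXT, in_use INTEGER)")
db.executemany(
    "INSERT INTO quota_usages VALUES (?, ?, ?)",
    [("etytree", "instances", 7), ("etytree", "cores", 14)],
)

# Inspect first, then reset -- mirroring the SELECT/UPDATE pair above.
print(db.execute("SELECT * FROM quota_usages WHERE project_id = 'etytree'").fetchall())
db.execute("UPDATE quota_usages SET in_use = 0 WHERE project_id = 'etytree'")
print(db.execute("SELECT DISTINCT in_use FROM quota_usages WHERE project_id = 'etytree'").fetchall())
# → [(0,)]
```

Inspecting before updating, as the log shows, is the safer order: it records what the stale counters were in case the reset needs to be reasoned about later.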

2020-01-08

  • 10:53 arturo: icinga downtime all cloudvirts for 30 minutes to re-create all canary VMs

2020-01-07

  • 11:12 arturo: icinga-downtime everything cloud* for 30 minutes to merge nova scheduler changes
  • 10:02 arturo: icinga downtime cloudvirt1009 for 30 minutes to re-create canary VM (T242078)

2020-01-06

  • 13:45 andrewbogott: restarting nova-api and nova-conductor on cloudcontrol1003 and 1004

2020-01-04

  • 16:34 arturo: icinga downtime cloudvirt1024 for 2 months because of hardware errors (T241884)

2019-12-31

  • 11:46 andrewbogott: I couldn't!
  • 11:40 andrewbogott: restarting cloudservices2002-dev to see if I can reproduce an issue I saw earlier

2019-12-25

2019-12-24

  • 15:13 arturo: icinga downtime all the lab* fleet for nova password change for 1h
  • 14:39 arturo: icinga downtime all the cloud* fleet for nova password change for 1h

2019-12-23

  • 11:13 arturo: enable puppet in cloudcontrol1003/1004
  • 10:40 arturo: disable puppet in cloudcontrol1003/1004 while doing changes related to python-ldap

2019-12-22

  • 23:48 andrewbogott: restarting nova-conductor and nova-api on cloudcontrol1003 and 1004
  • 09:45 arturo: cloudvirt1013 is back (did it alone) T241313
  • 09:37 arturo: cloudvirt1013 is down for good. Apparently powered off. I can't even reach it via iLO

2019-12-20

  • 12:43 arturo: icinga downtime cloudmetrics1001 for 128 hours

2019-12-18

  • 12:55 arturo: [codfw1dev] created a new subnet neutron object to hold the new CIDR for floating IPs (cloud-codfw1dev-floating - 185.15.57.0/29) T239347
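As a sanity check on that CIDR, Python's stdlib `ipaddress` module shows what a /29 actually provides for a floating-IP pool:

```python
import ipaddress

# The new floating-IP subnet from the log entry above.
net = ipaddress.ip_network("185.15.57.0/29")

print(net.num_addresses)     # 8 -- a /29 holds 8 addresses in total
hosts = list(net.hosts())    # excludes the network and broadcast addresses
print(hosts[0], hosts[-1])   # 185.15.57.1 185.15.57.6
```

So at most six floating IPs fit in this range, which is consistent with it being a small PoC allocation in codfw1dev.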

2019-12-17

  • 07:21 andrewbogott: deploying horizon/train to labweb1001/1002

2019-12-12

  • 06:11 arturo: schedule 4h downtime for labstores
  • 05:57 arturo: schedule 4h downtime for cloudvirts and other openstack components due to upgrade ops

2019-12-02

  • 06:28 andrewbogott: running nova-manage db sync on eqiad1
  • 06:27 andrewbogott: running nova-manage cell_v2 map_cell0 on eqiad1

2019-11-21

  • 16:07 jeh: created replica indexes and views for szywiki T237373
  • 15:48 jeh: creating replica indexes and views for shywiktionary T238115
  • 15:48 jeh: creating replica indexes and views for gcrwiki T238114
  • 15:46 jeh: creating replica indexes and views for minwiktionary T238522
  • 15:36 jeh: creating replica indexes and views for gewikimedia T236404

2019-11-18

  • 19:27 andrewbogott: repooling labsdb1011
  • 18:54 andrewbogott: running maintain-views --all-databases --replace-all --clean on labsdb1011 T238480
  • 18:44 andrewbogott: depooling labsdb1011 and killing remaining user queries T238480
  • 18:42 andrewbogott: repooled labsdb1009 and 1010 T238480
  • 18:19 andrewbogott: running maintain-views --all-databases --replace-all --clean on labsdb1010 T238480
  • 18:18 andrewbogott: depooling labsdb1010, killing remaining user queries
  • 17:46 andrewbogott: running maintain-views --all-databases --replace-all --clean on labsdb1009 T238480
  • 17:38 andrewbogott: depooling labsdb1009, killing remaining user queries
  • 16:54 andrewbogott: running maintain-views --all-databases --replace-all --clean on labsdb1012 T237509

2019-11-15

  • 20:04 andrewbogott: repool labsdb1011 (T237509)
  • 19:29 andrewbogott: running maintain-views --all-databases --replace-all --clean on labsdb1011
  • 19:25 andrewbogott: depooling labsdb1011, killing remaining queries
  • 19:25 andrewbogott: repooling labsdb1010
  • 18:59 andrewbogott: running maintain-views --all-databases --replace-all --clean on labsdb1012
  • 18:57 andrewbogott: running maintain-views --all-databases --replace-all --clean on labsdb1010
  • 18:54 andrewbogott: depooling labsdb1010, killing remaining user queries
  • 18:54 andrewbogott: depooled labsdb1009, ran maintain-views --clean --all-databases --replace-all, repooled

2019-11-11

  • 13:10 arturo: cloudweb2001-dev: disable puppet and redirect stderr in the loadExitNodes.php cron script to prevent cronspam while we investigate the cause of the issue (T237971)

2019-11-05

  • 11:59 arturo: icinga downtime for 1h cloudcontrol1004, cloudnet1003, cloudvirt1017/1020/1022 for PDU operations in the rack T227542

2019-11-04

  • 21:55 andrewbogott: deleting a ton of wikitech hiera pages that were either no-ops or refer to nonexistent VMs or prefixes

2019-10-31

  • 11:01 arturo: icinga-downtimed cloudvirt1030 and cloudservices1003 for 1h due to PDU upgrade operations T227543

2019-10-30

  • 22:43 jeh: reboot cloud-bootstrapvz-stretch to resolve bad bootstrapvz build

2019-10-29

  • 10:52 arturo: icinga downtime cloudvirt1001/1002/1024/1018/1012/1009/1015/1008 for 1h T227538

2019-10-25

  • 10:45 arturo: icinga downtime toolschecker for 1h to upgrade clouddb1002 mariadb (toolsdb secondary) (T236384, T236420)

2019-10-24

  • 12:30 arturo: starting cloudvirt1019, PDU operations ended (T227540)
  • 11:58 arturo: icinga downtime for 2h (T227540) cloudvirt1019
  • 11:15 arturo: poweroff cloudvirt1019 during the PDU operations (T227540)
  • 11:10 arturo: icinga downtime for 2h (T227540) toolschecker
  • 10:58 arturo: icinga downtime for 1h (T227540) cloudvirt100[3-7], cloudvirt1019, cloudvirt1016, cloudvirt1021, cloudvirt1013, cloudnet1004

2019-10-23

  • 09:23 arturo: cloudvirt1026 reboot ended OK
  • 09:12 arturo: rebooting cloudvirt1026 for kernel upgrade
  • 09:09 arturo: cloudvirt1025 reboot ended OK
  • 09:00 arturo: rebooting cloudvirt1025 for kernel upgrade
  • 08:51 arturo: icinga downtime cloudvirt1025/1026 for reboots

2019-10-18

  • 16:01 arturo: created the `eqiad1.wikimedia.cloud` DNS zone (T235846)
  • 14:27 andrewbogott: deleted a bunch of leaked VMS from earlier today from the admin-monitoring project. Fullstack leaks due to an api outage, maybe?
  • 10:44 arturo: double max_message_size from 40KB to 80KB in the cloud-admin mailing list. A simple email with a couple of quotes can go over the 40KB limit.

2019-10-16

  • 21:59 jeh: resync wiki replica tool and user accounts T235697
  • 09:40 arturo: reboot of cloudvirt1030 went fine
  • 09:28 arturo: reboot of cloudvirt1029 went fine
  • 09:28 arturo: rebooting cloudvirt1030 for kernel updates
  • 09:12 arturo: rebooting cloudvirt1029 for kernel updates
  • 09:11 arturo: reboot of cloudvirt1028 went fine
  • 09:00 arturo: rebooting cloudvirt1028 for kernel updates
  • 08:56 arturo: icinga downtime cloudvirt[1028-1030].eqiad.wmnet for 1h for reboots

2019-10-15

  • 13:30 jeh: creating indexes and views for banwiki T234770

2019-10-10

  • 18:55 bd808: Created indexes and views for nqowiki (T230543)
  • 11:59 arturo: network switch hardware is down, affecting cloudvirt1025/1026 (T227536); VMs are supposed to be online but unreachable

2019-10-09

  • 10:44 arturo: cloudvirt1013 rebooted well
  • 10:32 arturo: cloudvirt1013 is rebooting
  • 10:32 arturo: cloudvirt1012 rebooted just fine (very slow, 35 VMs)
  • 10:21 arturo: cloudvirt1012 is rebooting
  • 10:19 arturo: cloudvirt1009 rebooted just fine (very slow though)
  • 10:07 arturo: cloudvirt1009 is rebooting
  • 10:06 arturo: cloudvirt1008 rebooted just fine (very slow though)
  • 09:58 arturo: cloudvirt1008 is rebooting
  • 09:52 arturo: icinga downtime toolschecker, paws, etc for 2h, because cloudvirt reboots

2019-10-07

  • 14:07 arturo: horizon is disabled for maintenance (T212302)
  • 14:00 arturo: starting scheduled maintenance: upgrading eqiad1 from openstack mitaka to newton

2019-10-02

  • 15:23 arturo: codfw1dev renaming net/subnet objects to a more modern naming scheme T233665
  • 12:49 arturo: codfw1dev delete all floating ip allocations in the deployment for mangling the network config for testing T233665
  • 12:47 arturo: codfw1dev deleting all VMs in the deployment for mangling the network config for testing T233665
  • 11:08 arturo: codfw1dev rebooting cloudnet2002-dev and cloudnet2003-dev for testing T233665
  • 10:31 arturo: codfw1dev: add cloudinstances2b-gw router to the l3 agent in cloudnet2003-dev
  • 09:59 arturo: codfw1dev: cleanup leftover "HA port tenant admin" in neutron (ports from missing servers)
  • 09:46 arturo: codfw1dev: cleanup leftover neutron agents

2019-09-30

  • 10:21 arturo: we installed ferm in every VM by mistake. Deleting it and forcing a puppet agent run to try to go back to a clean state.
  • 09:38 arturo: downtime toolschecker for 24h
  • 09:33 arturo: force update ferm cloud-wide (in all VMs) for T153468

2019-08-18

  • 10:39 arturo: rebooting cloudvirt1023 for new interface names configuration
  • 10:34 arturo: downtimed cloudvirt1023 for 2 days

2019-08-05

  • 17:17 bd808: Set downtime on gridengine and kubernetes webservice checks in icinga until 2019-09-02 (flaky tests)

2019-07-29

  • 20:14 bd808: Restarted maintain-kubeusers on tools-k8s-master-01 (T194859)

2019-07-25

  • 12:32 arturo: eqiad1/glance: debian-9.9-stretch image deprecates debian-9.8-stretch (T228983)
  • 09:59 arturo: (codfw1dev) drop missing glance images (T228972)
  • 09:32 arturo: (codfw1dev) deleting a bunch of VMs that were running in now missing hypervisors
  • 09:31 arturo: (codfw1dev) deleting a bunch of VMs in ERROR and SHUTDOWN state
  • 09:27 arturo: last log entry refers to the codfw1dev deployment
  • 09:27 arturo: cleanup `nova service-list` from old hypervisors (labtest*)
  • 09:23 arturo: refreshed nova DB grants in clouddb2001-dev for the codfw1dev deployment
  • 08:47 arturo: cleanup the cloud-announce pending emails (spam)

2019-07-23

  • 19:43 andrewbogott: restarting rabbitmq-server on cloudcontrol1003 and 1004

2019-07-22

  • 23:44 bd808: Restarted maintain-kubeusers on tools-k8s-master-01 (T228529)

2019-07-11

  • 22:07 bd808: Ran `sudo systemctl stop designate_floating_ip_ptr_records_updater.service` on cloudcontrol1003
  • 22:01 bd808: `sudo apt-get install python2.7-dbg` on cloudcontrol1003 to debug hung python process
  • 21:48 bd808: Ran `sudo systemctl stop designate_floating_ip_ptr_records_updater.service` on cloudcontrol1004

2019-06-25

  • 16:05 bstorm_: updated python3.4 to update4 wherever it was installed on Jessie VMs to prevent issues with broken update3.
  • 14:56 bstorm_: Updated python 3.4 on the labs-puppetmaster server

2019-06-03

  • 15:55 arturo: T221769 rebooting cloudservices1003 after bootstrapping is apparently completed

2019-05-28

  • 21:42 bstorm_: unmounting labstore1003-scratch on all cloud clients
  • 18:14 bstorm_: T209527 switched mounts from labstore1003 to cloudstore1008 for scratch

2019-05-20

  • 17:25 arturo: T223923 dropped compat-network config from /etc/network/interfaces in eqiad1/codfw1dev neutron nodes
  • 17:22 arturo: T223923 dropped br-compat bridges and vlan interfaces (1102 and 2102) in eqiad1/codfw1dev neutron nodes
  • 17:07 arturo: T223923 dropped compat-network configuration from the neutron database in eqiad1
  • 16:55 arturo: T223923 dropped compat-network configuration from the neutron database in codfw1dev

2019-05-15

  • 17:00 andrewbogott: touching /root/firstboot_done on all VMs that cumin can reach. This will prevent firstboot.sh from running a second time if/when any of these are rebooted. T223370

2019-04-26

  • 15:51 arturo: andrew updated dns servers for the cloud-instances2-b-eqiad subnet in neutron: 208.80.154.143 and 208.80.154.24

2019-04-25

  • 11:14 arturo: T221760 increased size of conntrack table
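Raising the conntrack table size is typically a sysctl change; a purely illustrative fragment (the real key values applied are tracked in T221760, and the file path here is hypothetical) could look like:

```ini
# /etc/sysctl.d/90-conntrack.conf -- illustrative value only, see T221760
net.netfilter.nf_conntrack_max = 1048576
```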

2019-04-24

  • 12:54 arturo: T220051 puppet broken in every VM in Cloud VPS, fixing right now

2019-04-22

  • 11:14 arturo: create by hand /var/cache/labsaliaser/labs-ip-aliases.json in cloudservices2002-dev (T218575)

2019-04-16

  • 22:55 bd808: cloudcontrol2003-dev: added `exit 0` to /etc/cron.hourly/keystone to stop cron spam on partially configured cluster
  • 12:08 arturo: rebooting cloudvirt200[123]-dev because deep changes in config
  • 11:27 arturo: T219626 add DB grants for neutron and glance to clouddb2001-dev (codfw1dev)
  • 10:37 arturo: T219626 replace 208.80.153.75 with 208.80.153.59 in the clouddb2001-dev database (codfw1dev deployment)
  • 10:30 arturo: T219626 replace labtestcontrol2003 with cloudcontrol2001-dev in the clouddb2001-dev database (codfw1dev deployment)

2019-04-15

  • 13:08 arturo: T219626 add DB grants for keystone/nova/nova_api to clouddb2001-dev (codfw1dev)

2019-04-13

  • 18:25 bd808: Restarted nova-compute service on cloudvirt1015 (T220853)

2019-04-11

  • 12:00 arturo: T151704 deploying oidentd to cloudnet1xxx servers

2019-04-02

  • 19:52 andrewbogott: installed a new base Stretch image; it has updated packages and runs apt-get dist-upgrade on first boot.

2019-03-29

  • 14:34 andrewbogott: moving tools-static.wmflabs.org to point to tools-static-13 in eqiad1-r
  • 00:00 bstorm_: T193264 Added osm.db.svc.eqiad.wmflabs to cloud DNS

2019-03-25

  • 00:40 bd808: Restarted maintain-dbusers on labstore1004. Process hung up on failed LDAP connection.

2019-03-21

  • 19:32 andrewbogott: restarting keystone on cloudcontrol1003

2019-03-15

  • 16:00 gtirloni: increased nscd cache size (T217280)

2019-03-14

  • 19:04 gtirloni: bstorm started nfsd on labstore1006 (T218341)
  • 16:42 gtirloni: published new debian-9.8 image (T218314)

2019-03-04

  • 19:37 bstorm_: umounted /mnt/nfs/dumps-labstore1006.wikimedia.org across all VPS projects for T217473

2019-02-26

  • 12:46 gtirloni: shutdown toolsbeta-sgegrid-master (cronspam)

2019-02-25

  • 10:32 gtirloni: restarted nfsd on labstore1004

2019-02-21

  • 09:09 gtirloni: restarted uwsgi-labspuppetbackend.service on labpuppetmaster1001
  • 07:42 gtirloni: created project cloudstore
  • 07:36 gtirloni: deleted wmcs-nfs project

2019-02-20

  • 21:58 andrewbogott: silencing shinken and disabling puppet on shinken-02 for now

2019-02-19

  • 12:00 gtirloni: added nagios@icinga2001.wikimedia.org to cloud-admin-feed@ allowed senders

2019-02-18

  • 20:21 gtirloni: downtimed cloudvirt1020
  • 20:12 gtirloni: ran `labs-ip-alias-dump.py` on cloudservices/labservices servers

2019-02-15

  • 13:10 arturo: T216239 labvirt1019 has been drained
  • 12:22 arturo: T216239 draining labvirt1009 with a command like this: `root@cloudcontrol1004:~# wmcs-cold-migrate --region eqiad --nova-db nova 2c0cf363-c7c3-42ad-94bd-e586f2492321 labvirt1001`
  • 12:02 arturo: more nova service cleanups in the database (labvirts that were reallocated to eqiad1)
  • 11:34 arturo: T216190 cleanup from nova database `nova service-delete 35`
  • 03:50 andrewbogott: updated VPS base images for Jessie and Stretch, now featuring Stretch 9.7

2019-02-11

  • 18:13 gtirloni: cleaned old metrics data in labmon1001 T215417
  • 15:28 gtirloni: running `maintain-views --all-databases --replace-all` on labsdb1011
  • 14:18 gtirloni: running `maintain-views --all-databases --replace-all` on labsdb1010

2019-02-08

  • 14:56 gtirloni: running `maintain-views --all-databases --replace-all` on labsdb1009

2019-02-06

  • 11:47 gtirloni: downtimed labmon100{1,2} T215399
  • 00:17 bstorm_: T214106 deleted bstorm-test2 project to clean up

2019-02-05

  • 10:48 arturo: labmon1001 is now part of the 'eqiad1-r' region

2019-02-01

  • 09:54 arturo: moving canary1015-01 VM instance from cloudvirt1024 back to cloudvirt1015

2019-01-31

  • 12:44 arturo: T215012 depooling cloudvirt1015 and migrating all VMs to cloudvirt1024

2019-01-25

  • 20:11 gtirloni: deleted project yandex-proxy T212306
  • 20:11 gtirloni: deleted project T212306

2019-01-24

  • 11:50 arturo: T213925 modify subnet cloud-instances-transport1-b-eqiad1 to avoid floating IP allocations from here
  • 11:07 arturo: T214299 failover cloudnet1003 to cloudnet1004
  • 10:03 arturo: T214299 reimage cloudnet1004 to debian stretch
  • 09:51 arturo: T214299 failover cloudnet1004 to cloudnet1003

2019-01-22

  • 19:19 arturo: T214299 stretch cloudnet1003 is apparently all set
  • 18:40 arturo: T214299 manually delete from neutron agents from cloudnet1003 (must be added again after reimage, with new uuids)
  • 18:37 arturo: T214299 reimaging cloudnet1003 as debian stretch
  • 17:35 jbond42: starting roll out of apt package updates to
  • 14:41 gtirloni: T214369 deployed new jessie and stretch VM images

2019-01-21

  • 18:29 gtirloni: installed libguestfs-tools on cloudvirt1021

2019-01-16

  • 14:21 andrewbogott: stopping old VPS proxies in eqiad (T213540)

2019-01-15

  • 14:20 andrewbogott: changing tools.wmflabs.org to point to tools-proxy-03 in eqiad1

2019-01-13

  • 20:00 andrewbogott: VPS proxies are now running in eqiad1 on proxy-01. Old VMs will wait a bit for deletion. T213540
  • 19:12 andrewbogott: moving the VPS proxy API backend to proxy-01.project-proxy.eqiad.wmflabs, as per T213540
  • 17:11 andrewbogott: moving all VPS dynamic proxies to proxy-eqiad1.wmflabs.org aka proxy-01.project-proxy.eqiad.wmflabs, as per T213540

2019-01-09

  • 22:21 bd808: neutron quota-update --tenant-id tools --port 256

2019-01-08

  • 18:59 bd808: Definitely did NOT delete uid=novaadmin,ou=people,dc=wikimedia,dc=org
  • 18:59 bd808: Deleted LDAP user uid=neutron,ou=people,dc=wikimedia,dc=org
  • 18:58 bd808: Deleted LDAP user uid=novaadmin,ou=people,dc=wikimedia,dc=org

2019-01-06

  • 22:03 bd808: Set floatingip quota of 60 for tools project in eqiad1-r region (T212360)

2018-12-20

  • 17:10 arturo: T207663 renumbered transport network in eqiad1

2018-12-05

  • 17:59 arturo: T207663 changed labtestn transport network addressing from private to public

2018-12-03

  • 13:25 arturo: T202886 create again PTR records after dnsleak.py fix

2018-11-30

  • 14:08 arturo: running dns leaks cleanup `root@cloudcontrol1003:~# /root/novastats/dnsleaks.py --delete`

2018-11-28

  • 17:33 gtirloni: deleted contintcloud project (T209644)

2018-11-27

  • 13:32 gtirloni: enabled DRBD stats collection on labstore100[4-5] T208446

2018-11-22

  • 07:12 gtirloni: deployed new debian-9.6-stretch image

2018-11-21

  • 10:48 arturo: re-created compat-net as not shared in labtestn to test stuff related to T209954

2018-11-16

  • 12:43 gtirloni: armed keyholder on labpuppetmaster1001/1002 after reboots
  • 12:08 gtirloni: rebooted labpuppetmaster1001 (T207377)
  • 11:57 gtirloni: rebooted labpuppetmaster1002 (T207377)

2018-11-14

  • 17:19 gtirloni: added cloudvirt1016 to scheduler pool (T209426)
  • 15:41 gtirloni: reimaging labvirt1016 as cloudvirt1016
  • 15:14 gtirloni: reset-failed systemd unit nova-scheduler on cloudcontrol1004
  • 13:52 gtirloni: rebooted labservices1002 after package upgrades (T207377)
  • 13:23 gtirloni: rebooted labstore2004 after package upgrades (T207377)
  • 13:20 gtirloni: rebooted labstore2003 after package upgrades (T207377)
  • 13:20 gtirloni: rebooted labstore2001/labstore2003 after package upgrades (T207377)
  • 12:08 gtirloni: rebooted labnet1002 after package upgrades
  • 12:01 gtirloni: rebooted labmon1002 after package upgrades
  • 11:41 gtirloni: rebooted labcontrol1002 after package upgrades
  • 11:15 gtirloni: rebooted cloudcontrol1004 after package upgrades

2018-11-09

  • 18:17 gtirloni: restarted neutron-linuxbridge-agent on cloudvirt1018/1023

2018-11-08

  • 11:00 gtirloni: Added novaproxy-02 to $CACHES
  • 10:50 gtirloni: Added cloudvirt1017 to eqiad1 region

2018-11-07

  • 13:49 arturo: T208733 moving labvirt1017 from main deployment to eqiad1 and renaming it to cloudvirt1017

2018-10-22

  • 16:24 arturo: T206261 another update to dmz_cidr in eqiad1
  • 10:26 arturo: change again in dmz_cidr in eqiad1: VMs will connect between them without NAT even when using floating IPs (T206261)

2018-10-19

  • 12:02 arturo: revert change in dmz_cidr in eqiad1 for now (T206261)
  • 11:16 arturo: change in dmz_cidr in eqiad1: VMs will connect between them without NAT even when using floating IPs (T206261)
  • 10:14 arturo: we have new virt servers in the eqiad1 deployment since past week and this week: cloudvirt1018, cloudvirt1023, cloudvirt1024

2018-09-26

  • 10:40 arturo: T205524 all sorts of restarts in all neutron daemons
  • 10:20 arturo: T205524 stop/start all neutron agents in cloudnet1003.eqiad.wmnet
  • 10:13 arturo: T205524 restart all agents in cloudnet1004.eqiad.wmnet
  • 10:10 arturo: restart neutron-server in cloudcontrol1003, investigating T205524

2018-09-24

  • 10:57 arturo: try to increase floating ip allocation pool in eqiad1. Of 185.15.56.0/25 we are using only 185.15.56.10-185.15.56.31, I don't know why. Let's use 185.15.56.2-185.15.56.126
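The usable range of that /25 can be verified with Python's `ipaddress` module; .1 through .126 are the valid hosts, so the chosen pool of .2-.126 leaves only .1 out (presumably reserved for the gateway, though the log does not say):

```python
import ipaddress

# The eqiad1 floating-IP range discussed above.
net = ipaddress.ip_network("185.15.56.0/25")
hosts = list(net.hosts())    # excludes the network and broadcast addresses

print(len(hosts))            # 126
print(hosts[0], hosts[-1])   # 185.15.56.1 185.15.56.126
```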

2018-09-21

  • 17:18 bd808: Running `sudo maintain-meta_p --all-databases --purge` across labsdb10(09|10|11) for T201890

2018-09-17

  • 22:08 bd808: Granted gtirloni project roles of admin, projectadmin, and user

2018-09-12

  • 11:20 arturo: T202636 distributing default routes using classless-static-route for all VMs in main/labtest (dnsmasq/nova-network)

2018-09-11

  • 16:52 arturo: again, restarted nova-network after killing all dnsmasq procs in labnet1001 for T202636
  • 16:08 arturo: restarted nova-network after killing all dnsmasq procs in labnet1001 for T202636
  • 10:53 arturo: T202636 creating all the compat-network configuration in neutron
  • 10:36 arturo: T202636 creating br-compat bridge in eqiad1 for the compat network
  • 10:33 arturo: T202636 manually reserve 10.68.23.253 (in nova-network)

2018-09-10

  • 22:46 andrewbogott: deleting all VMs on labvirt1019 and 1020 as prep for T204003

2018-08-30

  • 15:46 andrewbogott: restarting rabbitmq-server on cloudcontrol1003
  • 13:07 arturo: T202636 internal network routing now exists in labtest/labtestn for VM to communicate with each other

2018-08-28

  • 11:04 arturo: T202549 eqiad1 databases are all now running in m5-master. Mysql has been cleaned from cloudcontrol100[3,4]

2018-08-23

  • 16:17 arturo: T188589 bstorm_ merged patch to reduce nova DB connection usage
  • 13:15 arturo: T202115 `root@cloudcontrol1003:~# neutron subnet-update --allocation-pool start=10.64.22.4,end=10.64.22.4 e4fb2771-a361-4add-ac4e-280cc300c59f`
  • 13:10 arturo: T202115 (was `{"start": "10.64.22.2", "end": "10.64.22.254"}` )
  • 13:08 arturo: T202115 `root@cloudcontrol1003:~# neutron subnet-update --allocation-pool start=10.64.22.254,end=10.64.22.254 e4fb2771-a361-4add-ac4e-280cc300c59f`

2018-08-22

  • 15:28 arturo: cleanup local glance,keystone databases in cloudcontrol1003.wikimedia.org (already in m5-master)
  • 15:27 arturo: cleanup local keystone database in cloudcontrol1003.wikimedia.org (already in m5-master)

2018-08-21

  • 15:39 andrewbogott: initial test message
  • 10:31 arturo: eqiad1 remove leftover port for HA on labnet1004
  • 10:15 arturo: test

2018-05-07

  • 18:07 bstorm_: stopped the toolhistory job because it is totally broken and fills /tmp.

2018-02-09

  • 00:55 bd808: Added Arturo Borrero Gonzalez and Bstorm as project members
  • 00:54 bd808: Removed Yuvipanda at user request (T186289)