Nova Resource:Tools/SAL: Difference between revisions

From Wikitech-static
(259 intermediate revisions by 2 users not shown)
=== 2023-06-01 ===
* 10:07 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-builds-api ({{Gerrit|7e57832}}) ([[phab:T337218|T337218]]) - cookbook ran by dcaro@vulcanus
* 09:21 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-builds-api ({{Gerrit|0f4076a}}) ([[phab:T336130|T336130]]) - cookbook ran by dcaro@vulcanus
* 09:18 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/buildpack-admission-controller ({{Gerrit|ef7f103}}) ([[phab:T336130|T336130]]) - cookbook ran by dcaro@vulcanus
* 07:52 dcaro: rebooted tools-package-builder-04 (stuck not letting me log in with my user)


=== 2021-05-06 ===
* 14:43 Majavah: clear error states from all currently erroring exec nodes
* 14:37 Majavah: clear error state from tools-sgeexec-0913
* 04:35 Majavah: add own root key to project hiera on horizon [[phab:T278390|T278390]]
* 02:36 andrewbogott: removing jhedden from sudo roots


=== 2023-05-31 ===
* 02:38 andrewbogott: rebooted tools-sgeweblight-10-16, [[phab:T337806|T337806]]


=== 2021-05-05 ===
* 19:27 andrewbogott: adding taavi as a sudo root to project toolforge for [[phab:T278390|T278390]]


=== 2023-05-30 ===
* 00:22 andrewbogott: rebooted tools-sgeweblight-10-30, oom
* 00:16 andrewbogott: rebooted tools-sgeweblight-10-24, seems to be oom


=== 2021-05-04 ===
* 15:23 arturo: upgrading exim4-daemon-heavy in tools-mail-03
* 10:47 arturo: rebase & resolve merge conflicts in labs/private.git


=== 2023-05-26 ===
* 13:13 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/buildpack-admission-controller ({{Gerrit|ef7f103}}) ([[phab:T337218|T337218]]) - cookbook ran by dcaro@vulcanus
* 12:59 dcaro: rebooting tools-sgeexec-10-16.tools.eqiad1.wikimedia.cloud for stale NFS handles (D processes)


=== 2021-05-03 ===
* 16:24 dcaro: started tools-sgeexec-0907, which was stuck in the initramfs due to an unclean fs (/dev/vda3, root); ran fsck manually, fixing all the errors, and it booted up correctly afterwards ([[phab:T280641|T280641]])
* 14:07 dcaro: depooling tools-sgeexec-0908/7 to be able to restart the VMs, as they got stuck during migration ([[phab:T280641|T280641]])
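The initramfs recovery in the 16:24 entry follows the usual pattern: run fsck against the unclean root filesystem, accept the repairs, and continue booting. A minimal sketch of that sequence; it is dry-run by default (RUN defaults to `echo`, so the commands are printed rather than executed), and /dev/vda3 is the device named in the log entry:

```shell
# Dry-run sketch of the initramfs fsck recovery: RUN defaults to "echo",
# so the commands below are printed, not executed. /dev/vda3 is the root
# device from the log entry.
RUN="${RUN:-echo}"

plan="$(
  $RUN fsck -y /dev/vda3   # answer yes to every repair prompt
  $RUN reboot              # boot normally once the fs is clean
)"
printf '%s\n' "$plan"
```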


=== 2023-05-24 ===
* 12:28 dcaro: deploy latest buildservice ([[phab:T335865|T335865]])
* 12:28 dcaro: deploy latest buildservice ([[phab:T336050|T336050]])


=== 2021-04-29 ===
* 18:23 bstorm: removing one more etcd node via cookbook [[phab:T279723|T279723]]
* 18:12 bstorm: removing an etcd node via cookbook [[phab:T279723|T279723]]


=== 2023-05-23 ===
* 14:40 wm-bot2: deployed kubernetes component https://github.com/toolforge/buildpack-admission-controller ({{Gerrit|0c7b25b}}) - cookbook ran by fran@wmf3169


=== 2021-04-27 ===
* 16:40 bstorm: deleted all the errored-out grid jobs stuck in queue wait
* 16:16 bstorm: cleared E status on grid queues to get things flowing again


=== 2023-05-22 ===
* 10:06 arturo: hard-reboot tools-sgeexec-10-18 (monitoring reporting it as down)


=== 2021-04-26 ===
* 12:17 arturo: allowing more tools into the legacy redirector ([[phab:T281003|T281003]])


=== 2023-05-19 ===
* 13:38 arturo: uncordon tools-k8s-worker-47/48/64/75
* 08:46 bd808: Building new perl532-sssd/<nowiki>{</nowiki>base,web<nowiki>}</nowiki> images ([[phab:T323522|T323522]], [[phab:T320904|T320904]])


=== 2021-04-22 ===
* 08:44 Krenair: Removed yuvipanda from roots sudo policy
* 08:42 Krenair: Removed yuvipanda from projectadmin per request
* 08:40 Krenair: Removed yuvipanda from tools.admin per request


=== 2023-05-17 ===
* 16:05 dcaro: release toolforge-cli 0.3.0 ([[phab:T336225|T336225]])
* 12:48 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway ({{Gerrit|fa8ed2c}}) ([[phab:T336225|T336225]]) - cookbook ran by dcaro@vulcanus
* 12:48 wm-bot2: rebooted k8s node tools-k8s-worker-71 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 12:45 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway ({{Gerrit|d1bb238}}) ([[phab:T336225|T336225]]) - cookbook ran by dcaro@vulcanus
* 12:43 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-builds-api ({{Gerrit|8d21314}}) - cookbook ran by dcaro@vulcanus
* 10:54 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-buildpack-admission-controller:7199a9e from https://github.com/toolforge/buildpack-admission-controller ({{Gerrit|7199a9e}}) - cookbook ran by fran@wmf3169
* 08:49 wm-bot2: rebooted k8s node tools-k8s-worker-55 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 08:33 wm-bot2: rebooted k8s node tools-k8s-worker-64 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 08:32 wm-bot2: rebooted k8s node tools-k8s-worker-75 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 08:25 wm-bot2: rebooted k8s node tools-k8s-worker-74 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 08:17 wm-bot2: rebooted k8s node tools-k8s-worker-61 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 08:10 wm-bot2: rebooted k8s node tools-k8s-worker-70 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 08:03 wm-bot2: rebooted k8s node tools-k8s-worker-66 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 07:54 wm-bot2: rebooted k8s node tools-k8s-worker-72 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 07:46 wm-bot2: rebooted k8s node tools-k8s-worker-47 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 07:45 wm-bot2: rebooted k8s node tools-k8s-worker-48 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 07:42 wm-bot2: rebooted k8s node tools-k8s-worker-69 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 07:29 wm-bot2: rebooted k8s node tools-k8s-worker-76 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus


=== 2021-04-20 ===
* 22:20 bd808: `clush -w @all -b "sudo exiqgrep -z -i {{!}} xargs sudo exim -Mt"`
* 22:19 bd808: `clush -w @exec -b "sudo exiqgrep -z -i {{!}} xargs sudo exim -Mt"`
* 21:52 bd808: Update hiera `profile::toolforge::active_mail_relay: tools-mail-03.tools.eqiad1.wikimedia.cloud`. Was using the wrong domain name in the prior update.
* 21:49 bstorm: tagged the latest maintain-kubeusers and deployed to toolforge (with kustomize changes to rbac) after testing in toolsbeta [[phab:T280300|T280300]]
* 21:27 bd808: Update hiera `profile::toolforge::active_mail_relay: tools-mail-03.tools.eqiad.wmflabs`. Was -2, which is decommissioned.
* 10:18 dcaro: setting the retention on the tools-prometheus VMs to 250GB (they have 276GB total, leaving some space for online data operations if needed) ([[phab:T279990|T279990]])
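In the 22:20 and 22:19 clush one-liners above, `{{!}}` is the wikitext escape for a literal pipe: on each host the command is `sudo exiqgrep -z -i | xargs sudo exim -Mt`, i.e. list the IDs of frozen queue messages and thaw them. A sketch of the same pipeline shape that runs anywhere; `fake_exiqgrep` and its message IDs are stand-ins, and the final `exim` invocation is echoed instead of executed:

```shell
# The real per-host pipeline (run via clush across the fleet):
#   sudo exiqgrep -z -i | xargs sudo exim -Mt
# Same shape with stand-ins: fake_exiqgrep plays "exiqgrep -z -i"
# (prints frozen-message IDs), and xargs batches them onto one
# "exim -Mt" command line, echoed here rather than executed.
fake_exiqgrep() { printf '%s\n' 1a2b3c-000001 1a2b3c-000002; }

thawed="$(fake_exiqgrep | xargs echo exim -Mt)"
printf '%s\n' "$thawed"   # exim -Mt 1a2b3c-000001 1a2b3c-000002
```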


=== 2023-05-16 ===
* 23:24 bd808: kubectl uncordon tools-k8s-worker-69
* 23:22 bd808: Force reboot tools-k8s-worker-69 via Horizon
* 23:18 bd808: kubectl drain --ignore-daemonsets --delete-emptydir-data --force tools-k8s-worker-69
* 23:17 bd808: kubectl cordon tools-k8s-worker-69
* 14:37 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/builds-api:35b57c6 from https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-builds-api.git ({{Gerrit|35b57c6}}) - cookbook ran by dcaro@vulcanus
* 13:05 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller ({{Gerrit|df52a39}}) ([[phab:T334081|T334081]]) - cookbook ran by dcaro@vulcanus
* 12:54 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller ({{Gerrit|ad5b2b5}}) ([[phab:T334081|T334081]]) - cookbook ran by dcaro@vulcanus
* 11:52 dcaro: release toolforge-weld 0.2.0 and toolforge-webservice 0.98
* 08:08 dcaro: reboot tools-mail-03 ([[phab:T316544|T316544]])
* 08:07 dcaro: reboot tools-sgebastion-10 ([[phab:T316544|T316544]])


=== 2021-04-19 ===
* 10:53 dcaro: reverting setting the prometheus data source in grafana to 'server', can't connect
* 10:51 dcaro: setting the prometheus data source in grafana to 'server' to avoid CORS issues
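The 23:17 through 23:24 entries above are the standard worker-reboot cycle: cordon, drain, reboot out of band, uncordon. A dry-run sketch of that sequence; KUBECTL defaults to `echo kubectl` so the commands are only printed, and the node name is the one from the log:

```shell
# Dry-run sketch of the cordon/drain/reboot/uncordon cycle logged above.
# KUBECTL defaults to "echo kubectl": the commands are printed, not run.
KUBECTL="${KUBECTL:-echo kubectl}"

drain_reboot_uncordon() {
  local node="$1"
  $KUBECTL cordon "$node"
  $KUBECTL drain --ignore-daemonsets --delete-emptydir-data --force "$node"
  # ...reboot the VM out of band (e.g. via Horizon), then:
  $KUBECTL uncordon "$node"
}

plan="$(drain_reboot_uncordon tools-k8s-worker-69)"
printf '%s\n' "$plan"
```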


=== 2023-05-15 ===
* 22:50 bd808: Rebuilding bullseye and buster docker containers to pick up make package addition ([[phab:T320343|T320343]])
* 22:09 wm-bot2: rebooted k8s node tools-k8s-worker-66 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 22:07 wm-bot2: rebooted k8s node tools-k8s-worker-65 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 22:06 wm-bot2: rebooted k8s node tools-k8s-worker-64 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 22:04 wm-bot2: rebooted k8s node tools-k8s-worker-62 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 22:02 wm-bot2: rebooted k8s node tools-k8s-worker-61 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:58 wm-bot2: rebooted k8s node tools-k8s-worker-60 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:56 wm-bot2: rebooted k8s node tools-k8s-worker-59 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:54 wm-bot2: rebooted k8s node tools-k8s-worker-58 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:52 wm-bot2: rebooted k8s node tools-k8s-worker-57 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:51 wm-bot2: rebooted k8s node tools-k8s-worker-56 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:50 wm-bot2: rebooted k8s node tools-k8s-worker-55 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:49 wm-bot2: rebooted k8s node tools-k8s-worker-54 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:47 wm-bot2: rebooted k8s node tools-k8s-worker-53 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:44 wm-bot2: rebooted k8s node tools-k8s-worker-52 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:42 wm-bot2: rebooted k8s node tools-k8s-worker-51 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:41 wm-bot2: rebooted k8s node tools-k8s-worker-50 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:40 wm-bot2: rebooted k8s node tools-k8s-worker-49 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:38 wm-bot2: rebooted k8s node tools-k8s-worker-48 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:37 wm-bot2: rebooted k8s node tools-k8s-worker-47 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:33 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by andrew@bullseye
* 21:16 wm-bot2: rebooted k8s node tools-k8s-worker-45 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 21:15 wm-bot2: rebooted k8s node tools-k8s-worker-44 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 21:13 wm-bot2: rebooted k8s node tools-k8s-worker-43 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 21:12 wm-bot2: rebooted k8s node tools-k8s-worker-42 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 21:09 wm-bot2: rebooted k8s node tools-k8s-worker-41 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 21:03 wm-bot2: rebooted k8s node tools-k8s-worker-40 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:58 wm-bot2: rebooted k8s node tools-k8s-worker-39 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:52 wm-bot2: rebooted k8s node tools-k8s-worker-38 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:50 wm-bot2: rebooted k8s node tools-k8s-worker-37 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:49 wm-bot2: rebooted k8s node tools-k8s-worker-36 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:48 wm-bot2: rebooted k8s node tools-k8s-worker-35 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:47 wm-bot2: rebooted k8s node tools-k8s-worker-34 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:42 wm-bot2: rebooted k8s node tools-k8s-worker-33 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:41 andrewbogott: rebooting frozen VMs: tools-k8s-worker-65, tools-sgeweblight-10-27, tools-k8s-worker-45, tools-k8s-worker-36, tools-sgewebgen-10-3 (fallout from earlier nfs outage)
* 20:36 wm-bot2: rebooted k8s node tools-k8s-worker-32 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:32 wm-bot2: rebooted k8s node tools-k8s-worker-31 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:24 wm-bot2: rebooted k8s node tools-k8s-worker-30 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 19:04 wm-bot2: rebooted k8s node tools-k8s-worker-67 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:56 wm-bot2: rebooted k8s node tools-k8s-worker-68 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:49 wm-bot2: rebooted k8s node tools-k8s-worker-69 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:46 bd808: Hard reboot tools-static-14 via Horizon per IRC report of unresponsive requests
* 18:44 wm-bot2: rebooted k8s node tools-k8s-worker-70 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:42 wm-bot2: rebooted k8s node tools-k8s-worker-71 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:39 wm-bot2: rebooted k8s node tools-k8s-worker-72 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:34 wm-bot2: rebooted k8s node tools-k8s-worker-73 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:28 wm-bot2: rebooted k8s node tools-k8s-worker-74 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:22 wm-bot2: rebooted k8s node tools-k8s-worker-75 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:22 taavi: clear mail queue
* 18:21 wm-bot2: rebooted k8s node tools-k8s-worker-76 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:15 wm-bot2: rebooted k8s node tools-k8s-worker-77 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:08 wm-bot2: rebooted k8s node tools-k8s-worker-80 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:06 wm-bot2: rebooted k8s node tools-k8s-worker-81 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:05 wm-bot2: rebooted k8s node tools-k8s-worker-82 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:57 wm-bot2: rebooted k8s node tools-k8s-worker-83 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:48 wm-bot2: rebooted k8s node tools-k8s-worker-84 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:47 wm-bot2: rebooted k8s node tools-k8s-worker-85 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:38 wm-bot2: rebooted k8s node tools-k8s-worker-86 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:37 wm-bot2: rebooted k8s node tools-k8s-worker-87 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:35 wm-bot2: rebooted k8s node tools-k8s-worker-88 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:34 wm-bot2: rebooting all the workers of tools k8s cluster (64 nodes) ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:20 wm-bot2: rebooted k8s node tools-k8s-worker-87 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:19 wm-bot2: rebooted k8s node tools-k8s-worker-88 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:17 bd808: Rebuilding bullseye and buster docker containers to pick up openssh-client package addition ([[phab:T258841|T258841]])
* 17:12 wm-bot2: rebooting the whole tools k8s cluster (64 nodes) ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:06 dcaro: rebooting tools-sgegrid-shadow ([[phab:T316544|T316544]])
* 17:00 dcaro: rebooting tools-sgegrid-master ([[phab:T316544|T316544]])
* 16:55 dcaro: rebooting tools-sgeexec-10-20 ([[phab:T316544|T316544]])
* 16:53 dcaro: rebooting tools-sgeweblight-10-18 ([[phab:T316544|T316544]])
* 16:53 dcaro: rebooting tools-sgeweblight-10-25 ([[phab:T316544|T316544]])
* 16:53 dcaro: rebooting tools-sgeweblight-10-20 ([[phab:T316544|T316544]])
* 16:52 dcaro: rebooting tools-sgeweblight-10-21 ([[phab:T316544|T316544]])
* 16:52 dcaro: rebooting tools-sgeexec-10-22 ([[phab:T316544|T316544]])
* 16:51 dcaro: rebooting tools-sgeweblight-10-28 ([[phab:T316544|T316544]])
* 16:50 dcaro: rebooting tools-sgeexec-10-17 ([[phab:T316544|T316544]])
* 16:48 dcaro: rebooting tools-sgeexec-10-21 ([[phab:T316544|T316544]])
* 16:47 dcaro: rebooting tools-sgeexec-10-19 ([[phab:T316544|T316544]])
* 16:45 dcaro: rebooting tools-sgeexec-10-8 ([[phab:T316544|T316544]])
* 16:45 dcaro: rebooting tools-sgeweblight-10-24 ([[phab:T316544|T316544]])
* 16:44 dcaro: rebooting tools-sgewebgen-10-2 ([[phab:T316544|T316544]])
* 16:44 dcaro: rebooting tools-sgeweblight-10-16 ([[phab:T316544|T316544]])
* 16:43 dcaro: rebooting tools-sgeweblight-10-30 ([[phab:T316544|T316544]])
* 16:43 dcaro: rebooting tools-sgeexec-10-18 ([[phab:T316544|T316544]])
* 16:42 dcaro: rebooting tools-sgeexec-10-16 ([[phab:T316544|T316544]])
* 16:42 dcaro: rebooting tools-sgeexec-10-14 ([[phab:T316544|T316544]])
* 16:41 dcaro: rebooting tools-sgeweblight-10-32 ([[phab:T316544|T316544]])
* 16:40 dcaro: rebooting tools-sgeweblight-10-22 ([[phab:T316544|T316544]])
* 16:39 dcaro: rebooting tools-sgeweblight-10-17 ([[phab:T316544|T316544]])
* 16:32 dcaro: rebooting tools-sgeexec-10-13.tools.eqiad1.wikimedia.cloud ([[phab:T316544|T316544]])
* 16:23 dcaro: rebooting tools-sgeweblight-10-26 ([[phab:T316544|T316544]])
* 16:15 bd808: Hard reboot of tools-sgebastion-11 via Horizon (done circa 16:11Z)
* 16:14 arturo: rebooted a bunch of nodes to clean up D procs and high load avg caused by the NFS outage (result of [[phab:T316544|T316544]])
* 12:36 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/builds-api:09f3b49-dev from https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-builds-api.git ({{Gerrit|32a8ae9}}) - cookbook ran by dcaro@vulcanus
* 09:12 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/volume-admission:c64da5a from https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller ({{Gerrit|c64da5a}}) - cookbook ran by dcaro@vulcanus


=== 2021-04-16 ===
* 23:15 bstorm: cleaned up all source files for the grid with the old domain name to enable future node creation [[phab:T277653|T277653]]
* 14:38 dcaro: added a 'will get out of space in X days' panel to the dashboard https://grafana-labs.wikimedia.org/goto/kBlGd0uGk ([[phab:T279990|T279990]]); we got <5 days xd
* 11:35 arturo: running `grid-configurator --all-domains`, which basically added tools-sgebastion-10,11 as submit hosts and removed tools-sgegrid-master,shadow as submit hosts


=== 2023-05-13 ===
* 09:13 taavi: reboot tools-sgeexec-10-15,17,18,21


=== 2021-04-15 ===
* 17:45 bstorm: cleared error state from tools-sgeexec-0920.tools.eqiad.wmflabs for a failed job


=== 2023-05-11 ===
* 15:48 bd808: Rebooted tools-sgebastion-10 for [[phab:T336510|T336510]]
* 15:31 bd808: Sent `wall` for reboot of tools-sgebastion-10 circa 15:40Z


=== 2021-04-13 ===
* 13:26 dcaro: upgrade puppet and python-wmflib on tools-prometheus-03
* 11:23 arturo: deleted shutoff VM tools-package-builder-02 ([[phab:T275864|T275864]])
* 11:21 arturo: deleted shutoff VMs tools-sge-services-03,04 ([[phab:T278354|T278354]])
* 11:20 arturo: deleted shutoff VMs tools-docker-registry-03,04 ([[phab:T278303|T278303]])
* 11:18 arturo: deleted shutoff VM tools-mail-02 ([[phab:T278538|T278538]])
* 11:17 arturo: deleted shutoff VMs tools-static-12,13 ([[phab:T278539|T278539]])


=== 2023-05-09 ===
* 16:36 taavi: delegated beta.toolforge.org domain to toolsbeta per [[phab:T257386|T257386]]
* 09:35 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|ad4fa2a}}) - cookbook ran by taavi@runko


=== 2021-04-11 ===
* 16:07 bstorm: cleared E state from tools-sgeexec-0917 tools-sgeexec-0933 tools-sgeexec-0934 tools-sgeexec-0937 from failures of jobs 761759, 815031, 815056, 855676, 898936


=== 2023-05-08 ===
* 09:12 arturo: force-reboot tools-sgeexec-10-13 (reported as down by the monitoring, no SSH)


=== 2021-04-08 ===
* 18:25 bstorm: cleaned up the deprecated entries in /data/project/.system_sge/gridengine/etc/submithosts for tools-sgegrid-master and tools-sgegrid-shadow using the old fqdns [[phab:T277653|T277653]]
* 09:24 arturo: allocate & associate floating IP 185.15.56.122 for tools-sgebastion-11, also with DNS A record `dev-buster.toolforge.org` ([[phab:T275865|T275865]])
* 09:22 arturo: create DNS A record `login-buster.toolforge.org` pointing to 185.15.56.66 (tools-sgebastion-10) ([[phab:T275865|T275865]])
* 09:20 arturo: associate floating IP 185.15.56.66 to tools-sgebastion-10 ([[phab:T275865|T275865]])
* 09:13 arturo: created tools-sgebastion-11 (buster) ([[phab:T275865|T275865]])


=== 2023-05-07 ===
* 16:06 taavi: remove inbound 25/tcp rule from the toolserver legacy server [[phab:T136225|T136225]]


=== 2021-04-07 ===
* 04:35 andrewbogott: replacing the MX record '10 mail.tools.wmcloud.org' with '10 mail.tools.wmcloud.org.' (note the trailing dot), trying to fix AXFR for the tools.wmcloud.org zone
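The 04:35 MX fix is the classic trailing-dot problem: in zone data, a name without a final dot is relative, so the zone origin gets appended to it again. A tiny illustration of that expansion rule, using the names from the log entry:

```shell
# Why the trailing dot in the MX record mattered: a name without a final
# dot is relative to the zone origin, so the origin is appended again.
origin="tools.wmcloud.org."

expand() {
  case "$1" in
    *.) printf '%s\n' "$1" ;;               # absolute name: used as-is
    *)  printf '%s.%s\n' "$1" "$origin" ;;  # relative name: origin appended
  esac
}

expand "mail.tools.wmcloud.org"    # -> mail.tools.wmcloud.org.tools.wmcloud.org.
expand "mail.tools.wmcloud.org."   # -> mail.tools.wmcloud.org.
```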


=== 2023-05-05 ===
* 22:21 bd808: Added "RepoLookoutBot" to hiera key "dynamicproxy::blocked_user_agent_regex" to stop unnecessary scans by https://www.repo-lookout.org/
* 22:20 bd808: Added
* 11:30 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:811164e from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|811164e}}) - cookbook ran by taavi@runko
* 09:13 dcaro: rebooted tools-sgeexec-10-16 as it was stuck ([[phab:T335009|T335009]])


=== 2021-04-06 ===
* 15:16 bstorm: cleared queue state since a few had "errored" for failed jobs
* 12:59 dcaro: Removing etcd member tools-k8s-etcd-7.tools.eqiad1.wikimedia.cloud to get an odd number ([[phab:T267082|T267082]])
* 11:45 arturo: upgrading jobutils & misctools to 1.42 everywhere
* 11:39 arturo: cleaning up aptly: old package versions, old repos (jessie, trusty, precise), etc.
* 10:31 dcaro: Removing etcd member tools-k8s-etcd-6.tools.eqiad.wmflabs ([[phab:T267082|T267082]])
* 10:21 arturo: published jobutils & misctools 1.42 ([[phab:T278748|T278748]])
* 10:21 arturo: published jobutils & misctools 1.42
* 10:21 arturo: aptly repo had some weirdness due to the cinder volume: hardlinks created by aptly were broken, solved with `sudo aptly publish --skip-signing repo stretch-tools -force-overwrite`
* 10:07 dcaro: adding new etcd member using the cookbook wmcs.toolforge.add_etcd_node ([[phab:T267082|T267082]])
* 10:05 arturo: installed aptly from buster-backports on tools-services-05 to see if that makes any difference with an issue when publishing repos
* 09:53 dcaro: Removing etcd member tools-k8s-etcd-4.tools.eqiad.wmflabs ([[phab:T267082|T267082]])
* 08:55 dcaro: adding new etcd member using the cookbook wmcs.toolforge.add_etcd_node ([[phab:T267082|T267082]])
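The etcd churn above (the wmcs.toolforge.add_etcd_node cookbook plus manual removals, keeping the member count odd) maps onto plain etcdctl member operations. A dry-run sketch; ETCDCTL defaults to `echo etcdctl` so nothing is executed, and while the node name matches the log, the peer URL and member ID are illustrative values, not real cluster data:

```shell
# Dry-run sketch of an etcd member rotation: add the new member first,
# then remove the retiring one, so the member count stays odd.
# ETCDCTL defaults to "echo etcdctl"; the peer URL and the hex member ID
# below are illustrative, not values taken from the real cluster.
ETCDCTL="${ETCDCTL:-echo etcdctl}"

plan="$(
  $ETCDCTL member add tools-k8s-etcd-9 \
    --peer-urls=https://tools-k8s-etcd-9.tools.eqiad1.wikimedia.cloud:2380
  $ETCDCTL member remove 8e9e05c52164694d   # ID of the member being retired
)"
printf '%s\n' "$plan"
```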


=== 2023-05-04 ===
* 15:15 wm-bot2: removed instance tools-k8s-etcd-15 - cookbook ran by andrew@bullseye
* 14:13 wm-bot2: removed instance tools-k8s-etcd-14 - cookbook ran by andrew@bullseye


=== 2021-04-05 ===
* 17:02 bstorm: chowned the data volume for the docker registry to docker-registry:docker-registry
* 09:56 arturo: make jhernandez (IRC joakino) projectadmin ([[phab:T278975|T278975]])


=== 2023-05-03 ===
* 12:41 wm-bot2: removed instance tools-k8s-etcd-13 - cookbook ran by andrew@bullseye


=== 2021-04-01 ===
* 20:43 bstorm: cleared error state from the grid queues caused by unspecified job errors
* 15:53 dcaro: Removed etcd member tools-k8s-etcd-5.tools.eqiad.wmflabs, adding a new member ([[phab:T267082|T267082]])
* 15:43 dcaro: Removing etcd member tools-k8s-etcd-5.tools.eqiad.wmflabs ([[phab:T267082|T267082]])
* 15:36 dcaro: Added new etcd member tools-k8s-etcd-9.tools.eqiad1.wikimedia.cloud ([[phab:T267082|T267082]])
* 15:18 dcaro: adding new etcd member using the cookbook wmcs.toolforge.add_etcd_node ([[phab:T267082|T267082]])


=== 2023-05-02 ===
* 00:29 wm-bot2: deployed kubernetes component https://github.com/toolforge/buildpack-admission-controller ({{Gerrit|7199a9e}}) - cookbook ran by raymond@ubuntu


=== 2021-03-31 ===
* 15:57 arturo: rebooting `tools-mail-03` after enabling NFS ([[phab:T267082|T267082]], [[phab:T278538|T278538]])
* 15:04 arturo: created MX record for `tools.wmcloud.org` pointing to `mail.tools.wmcloud.org`
* 15:03 arturo: created DNS A record `mail.tools.wmcloud.org` pointing to 185.15.56.63
* 14:56 arturo: shutoff tools-mail-02 ([[phab:T278538|T278538]])
* 14:55 arturo: point floating IP 185.15.56.63 to tools-mail-03 ([[phab:T278538|T278538]])
* 14:45 arturo: created VM `tools-mail-03` as Debian Buster ([[phab:T278538|T278538]])
* 14:39 arturo: relocate some of the hiera keys for the email server from project-level to prefix
* 09:44 dcaro: running disk performance test on etcd-4 (round 2)
* 09:05 dcaro: running disk performance test on etcd-8
* 08:43 dcaro: running disk performance test on etcd-4


=== 2023-05-01 ===
* 23:17 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-buildpack-admission-controller:3b3803f from https://github.com/toolforge/buildpack-admission-controller ({{Gerrit|3b3803f}}) - cookbook ran by raymond@ubuntu


=== 2021-03-30 ===
* 16:15 bstorm: added `labstore::traffic_shaping::egress: 800mbps` to tools-static prefix [[phab:T278539|T278539]]
* 15:44 arturo: shutoff tools-static-12/13 ([[phab:T278539|T278539]])
* 15:41 arturo: point horizon web proxy `tools-static.wmflabs.org` to tools-static-14 ([[phab:T278539|T278539]])
* 15:37 arturo: add `mount_nfs: true` to tools-static prefix ([[phab:T278539|T278539]])
* 15:26 arturo: create VM tools-static-14 with Debian Buster image ([[phab:T278539|T278539]])
* 12:19 arturo: introduce horizon proxy `deb-tools.wmcloud.org` ([[phab:T278436|T278436]])
* 12:15 arturo: shutdown tools-sgebastion-09 (stretch)
* 11:05 arturo: created VM `tools-sgebastion-10` as Debian Buster ([[phab:T275865|T275865]])
* 11:04 arturo: created server group `tools-bastion` with anti-affinity policy


=== 2023-04-28 ===
* 15:01 arturo: force reboot tools-k8s-worker-79, unresponsive
* 08:27 dcaro: rebooting tools-sgeweblight-10-28 ([[phab:T335336|T335336]])
* 07:20 dcaro: rebooting tools-sgegrid-shadow due to stale nfs mount
* 00:09 bd808: `kubectl uncordon tools-k8s-worker-67` ([[phab:T335543|T335543]])
* 00:07 bd808: Hard reboot tools-k8s-worker-67.tools.eqiad1.wikimedia.cloud via horizon ([[phab:T335543|T335543]])
* 00:04 bd808: Rebooting tools-k8s-worker-67.tools.eqiad1.wikimedia.cloud ([[phab:T335543|T335543]])


=== 2021-03-28 ===
* 19:31 legoktm: legoktm@tools-sgebastion-08:~$ sudo qdel -f {{Gerrit|9999704}} # [[phab:T278645|T278645]]


=== 2023-04-27 ===
* 23:59 bd808: `kubectl drain --ignore-daemonsets --delete-emptydir-data --force tools-k8s-worker-67` ([[phab:T335543|T335543]])
* 20:50 bd808: Started process to rebuild all buster and bullseye based container images again. Prior problem seems to have been stale images in the local cache on the build server.
* 20:42 bd808: Container image rebuild failed with GPG errors in buster-sssd base image. Will investigate and attempt to restart once resolved in a local dev environment.
* 20:33 bd808: Started process to rebuild all buster and bullseye based container images per https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes#Building_toolforge_specific_images


=== 2021-03-27 ===
* 02:48 Reedy: qdel -f {{Gerrit|9999895}} {{Gerrit|9999799}}


=== 2023-04-18 ===
* 16:46 dcaro: force-rebooting tools-sgeweblight-10-25/26/27 as they got stuck stopping the grid_exec process
* 16:35 dcaro: rebooting root@tools-sgeweblight-10-27 due to stuck exec daemon not releasing port 6445
* 16:35 dcaro: rebooting root@tools-sgeweblight-10-25 due to stuck exec daemon not releasing port 6445
* 16:32 dcaro: rebooting root@tools-sgeweblight-10-26 due to stuck exec daemon not releasing port 6445
* 16:26 dcaro: rebooting root@tools-sgeexec-10-14 due to stuck exec daemon not releasing port 6445


=== 2021-03-26 ===
* 12:21 arturo: shutdown tools-package-builder-02 (stretch); we keep -03, which is buster ([[phab:T275864|T275864]])


=== 2023-04-17 ===
* 13:10 dcaro: rebooting tools-sgegrid-master node ([[phab:T334847|T334847]])
* 02:43 legoktm: manual restart of apache2 on toolserver-proxy-1 to completely pick up renewed TLS cert (alert was flapping)


=== 2021-03-25 ===
* 19:30 bstorm: forced deletion of all jobs stuck in a deleting state [[phab:T277653|T277653]]
* 17:46 arturo: rebooting tools-sgeexec-* nodes to account for new grid master ([[phab:T277653|T277653]])
* 16:20 arturo: rebuilding tools-sgegrid-master VM as Debian Buster ([[phab:T277653|T277653]])
* 16:18 arturo: icinga-downtime toolschecker for 2h
* 16:05 bstorm: failed over the tools grid to the shadow master [[phab:T277653|T277653]]
* 13:36 arturo: shutdown tools-sge-services-03 ([[phab:T278354|T278354]])
* 13:33 arturo: shutdown tools-sge-services-04 ([[phab:T278354|T278354]])
* 13:31 arturo: point aptly clients to `tools-services-05.tools.eqiad1.wikimedia.cloud` (hiera change) ([[phab:T278354|T278354]])
* 12:58 arturo: created VM `tools-services-05` as Debian Buster ([[phab:T278354|T278354]])
* 12:51 arturo: create cinder volume `tools-aptly-data` ([[phab:T278354|T278354]])


=== 2023-04-11 ===
* 16:11 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|b65439b}}) - cookbook ran by arturo@nostromo
* 15:46 arturo: upload toolforge-jobs-framework-cli v11 to aptly
* 14:17 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller.git ({{Gerrit|d878e49}}) ([[phab:T324834|T324834]]) - cookbook ran by dcaro@vulcanus
* 13:19 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:c6c693c from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|c6c693c}}) - cookbook ran by arturo@nostromo
* 12:09 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/volume-admission:40bd3b3 from https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller ({{Gerrit|40bd3b3}}) - cookbook ran by dcaro@vulcanus
* 10:34 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-nginx ({{Gerrit|9aed7e5}}) - cookbook ran by taavi@runko
* 09:15 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/calico ({{Gerrit|c6a3e29}}) ([[phab:T329677|T329677]]) - cookbook ran by taavi@runko
* 08:45 wm-bot2: Adding a new k8s worker node - cookbook ran by taavi@runko


=== 2021-03-24 ===
* 12:46 arturo: shutoff the old stretch VMs `tools-docker-registry-03` and `tools-docker-registry-04` ([[phab:T278303|T278303]])
* 12:38 arturo: associate floating IP 185.15.56.67 with `tools-docker-registry-05` and refresh FQDN docker-registry.tools.wmflabs.org accordingly ([[phab:T278303|T278303]])
* 12:33 arturo: attach cinder volume `tools-docker-registry-data` to VM `tools-docker-registry-05` ([[phab:T278303|T278303]])
* 12:32 arturo: snapshot cinder volume `tools-docker-registry-data` into `tools-docker-registry-data-stretch-migration` ([[phab:T278303|T278303]])
* 12:32 arturo: bump cinder storage quota from 80G to 400G (without quota request task)
* 12:11 arturo: created VM `tools-docker-registry-06` as Debian Buster ([[phab:T278303|T278303]])
* 12:09 arturo: detach cinder volume `tools-docker-registry-data` ([[phab:T278303|T278303]])
* 11:46 arturo: attach cinder volume `tools-docker-registry-data` to VM `tools-docker-registry-03` to format it and pre-populate it with registry data ([[phab:T278303|T278303]])
* 11:20 arturo: created 80G cinder volume tools-docker-registry-data ([[phab:T278303|T278303]])
* 11:10 arturo: starting VM tools-docker-registry-04, which was stopped (probably since 2021-03-09) due to hypervisor draining


=== 2021-03-23 ===
* 12:46 arturo: aborrero@tools-sgegrid-master:~$ sudo systemctl restart gridengine-master.service
* 12:16 arturo: delete & re-create VM tools-sgegrid-shadow as Debian Buster ([[phab:T277653|T277653]])
* 12:14 arturo: created puppet prefix 'tools-sgegrid-shadow' and migrated puppet configuration from VM-puppet
* 12:13 arturo: created server group 'tools-grid-master-shadow' with anti-affinity policy

=== 2023-04-10 ===
* 10:46 taavi: patch existing PSP roles to use policy/v1beta1 [[phab:T331619|T331619]]
* 09:16 arturo: upgrading k8s cluster to 1.22 ([[phab:T286856|T286856]])


=== 2021-03-18 ===
* 19:24 bstorm: set profile::toolforge::infrastructure across the entire project with login_server set on the bastion and exec node-related prefixes
* 16:21 andrewbogott: enabling puppet tools-wide
* 16:20 andrewbogott: disabling puppet tools-wide to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/672456
* 16:19 bstorm: added profile::toolforge::infrastructure class to puppetmaster [[phab:T277756|T277756]]
* 04:12 bstorm: rebooted tools-sgeexec-0935.tools.eqiad.wmflabs because it forgot how to LDAP...likely root cause of the issues tonight
* 03:59 bstorm: rebooting grid master. sorry for the cron spam
* 03:49 bstorm: restarting sssd on tools-sgegrid-master
* 03:37 bstorm: deleted a massive number of stuck jobs that misfired from the cron server
* 03:35 bstorm: rebooting tools-sgecron-01 to try to clear up the ldap-related errors coming out of it
* 01:46 bstorm: killed the toolschecker cron job, which had an LDAP error, and ran it again by hand

=== 2023-04-07 ===
* 14:34 wm-bot2: drained, depooled and removed k8s control node tools-k8s-control-3 ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 14:30 wm-bot2: removed instance tools-k8s-control-2 - cookbook ran by taavi@runko
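The 2021-03-18 entries above show hosts that "forgot how to LDAP", recovered by restarting sssd or rebooting. A hedged sketch of that recovery sequence, printing the commands for review rather than running them (the host and user names are placeholders, not taken from the log):

```shell
# Sketch of the sssd/LDAP first-response used above: check resolution,
# restart sssd, reboot as a last resort. Prints commands for review.
ldap_recovery_cmds() {
  local host="$1" user="$2"
  printf 'ssh %s id %s  # does NSS still resolve LDAP users?\n' "$host" "$user"
  printf 'ssh %s sudo systemctl restart sssd  # drop wedged LDAP connections\n' "$host"
  printf 'ssh %s sudo reboot  # last resort, as done in the log above\n' "$host"
}

ldap_recovery_cmds tools-sgeexec-0935.tools.eqiad.wmflabs sometool
```

The actual incident also required rebooting the cron server and the grid master; this only covers the per-host steps.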


=== 2021-03-17 ===
* 20:57 bstorm: deployed changes to rbac for kubernetes to add kubectl top access for tools
* 20:26 andrewbogott: moving tools-elastic-3 to cloudvirt1034; two elastic nodes shouldn't be on the same hv

=== 2023-04-05 ===
* 15:16 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|5ea5992}}) - cookbook ran by taavi@runko
* 15:10 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:3569803 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|3569803}}) - cookbook ran by taavi@runko
* 14:56 wm-bot2: Added a new k8s worker tools-k8s-worker-88.tools.eqiad1.wikimedia.cloud to the cluster ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:42 wm-bot2: Adding a new k8s worker node ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:42 wm-bot2: Added a new k8s worker tools-k8s-worker-87.tools.eqiad1.wikimedia.cloud to the cluster ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:28 wm-bot2: Adding a new k8s worker node ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:28 wm-bot2: Added a new k8s worker tools-k8s-worker-86.tools.eqiad1.wikimedia.cloud to the cluster ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:15 wm-bot2: Adding a new k8s worker node ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:15 wm-bot2: Added a new k8s worker tools-k8s-worker-85.tools.eqiad1.wikimedia.cloud to the cluster ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:01 wm-bot2: Adding a new k8s worker node ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:01 wm-bot2: Added a new k8s worker tools-k8s-worker-84.tools.eqiad1.wikimedia.cloud to the cluster ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 13:47 wm-bot2: Adding a new k8s worker node ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 13:47 wm-bot2: Added a new k8s worker tools-k8s-worker-83.tools.eqiad1.wikimedia.cloud to the cluster ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 13:34 wm-bot2: Adding a new k8s worker node ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 13:33 wm-bot2: removed instance tools-k8s-worker-83 - cookbook ran by taavi@runko
* 13:15 wm-bot2: Adding a new k8s worker node ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 13:06 wm-bot2: removing grid node tools-sgeweblight-10-31.tools.eqiad1.wikimedia.cloud ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 13:02 wm-bot2: removing grid node tools-sgeweblight-10-29.tools.eqiad1.wikimedia.cloud ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 13:00 wm-bot2: removing grid node tools-sgeexec-10-9.tools.eqiad1.wikimedia.cloud ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 12:58 wm-bot2: removing grid node tools-sgeweblight-10-15.tools.eqiad1.wikimedia.cloud ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 12:54 wm-bot2: removing grid node tools-sgeexec-10-7.tools.eqiad1.wikimedia.cloud ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 12:52 wm-bot2: removing grid node tools-sgeweblight-10-13.tools.eqiad1.wikimedia.cloud ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 12:34 wm-bot2: drained, depooled and removed k8s control node tools-k8s-control-1 - cookbook ran by taavi@runko
* 12:07 wm-bot2: Added a new k8s control tools-k8s-control-6.tools.eqiad1.wikimedia.cloud to the cluster - cookbook ran by taavi@runko
* 11:53 wm-bot2: Adding a new k8s control node - cookbook ran by taavi@runko
* 11:51 wm-bot2: removed instance tools-k8s-control-6 - cookbook ran by taavi@runko
* 11:39 wm-bot2: Adding a new k8s control node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 11:38 wm-bot2: removed instance tools-k8s-control-6 - cookbook ran by taavi@runko
* 11:21 wm-bot2: Adding a new k8s control node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 11:21 wm-bot2: removed instance tools-k8s-control-6 - cookbook ran by taavi@runko
* 11:09 wm-bot2: Adding a new k8s control node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 10:53 wm-bot2: removed instance tools-k8s-control-6 - cookbook ran by taavi@runko
* 10:41 wm-bot2: Adding a new k8s control node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 10:41 wm-bot2: removed instance tools-k8s-control-6 - cookbook ran by taavi@runko
* 10:16 wm-bot2: Adding a new k8s control node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
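The "drained, depooled and removed k8s control node" entries above (2023-04-05 and 2023-04-07) are automated cookbook runs. A hedged sketch of roughly what such a removal amounts to by hand — the real cookbook also handles etcd/control-plane membership and other cleanup not shown here:

```shell
# Rough manual equivalent of removing a cluster node (sketch only;
# prints the commands for review instead of executing them).
remove_node_cmds() {
  local node="$1"
  printf 'kubectl drain %s --ignore-daemonsets  # evict workloads\n' "$node"
  printf 'kubectl delete node %s  # depool from the cluster\n' "$node"
  printf 'openstack server delete %s  # remove the backing VM\n' "$node"
}

remove_node_cmds tools-k8s-control-1
```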


=== 2021-03-16 ===
* 16:31 arturo: installing jobutils and misctools 1.41
* 15:55 bstorm: deleted a bunch of messed up grid jobs (9989481,8813,81682,86317,122602,122623,583621,606945,606999)
* 12:32 arturo: add packages jobutils / misctools v1.41 to <nowiki>{</nowiki>stretch,buster<nowiki>}</nowiki>-tools aptly repository in tools-sge-services-03

=== 2023-04-04 ===
* 19:00 wm-bot2: Adding a new k8s control node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 18:59 wm-bot2: removed instance tools-k8s-control-5 - cookbook ran by taavi@runko
* 18:46 wm-bot2: Adding a new k8s control node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 18:45 wm-bot2: Adding a new k8s control node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 10:15 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@nostromo
* 09:28 arturo: hard-reboot the 3 k8s control nodes


=== 2021-03-12 ===
* 23:13 bstorm: cleared error state for all grid queues

=== 2023-04-03 ===
* 17:13 wm-bot2: rebooted k8s node tools-k8s-worker-31 - cookbook ran by taavi@runko
* 17:11 wm-bot2: rebooted k8s node tools-k8s-worker-32 - cookbook ran by taavi@runko
* 17:09 wm-bot2: rebooted k8s node tools-k8s-worker-33 - cookbook ran by taavi@runko
* 17:07 wm-bot2: rebooted k8s node tools-k8s-worker-34 - cookbook ran by taavi@runko
* 17:05 wm-bot2: rebooted k8s node tools-k8s-worker-35 - cookbook ran by taavi@runko
* 17:04 wm-bot2: rebooted k8s node tools-k8s-worker-36 - cookbook ran by taavi@runko
* 17:02 wm-bot2: rebooted k8s node tools-k8s-worker-37 - cookbook ran by taavi@runko
* 17:00 wm-bot2: rebooted k8s node tools-k8s-worker-38 - cookbook ran by taavi@runko
* 16:58 wm-bot2: rebooted k8s node tools-k8s-worker-39 - cookbook ran by taavi@runko
* 16:56 wm-bot2: rebooted k8s node tools-k8s-worker-40 - cookbook ran by taavi@runko
* 16:55 wm-bot2: rebooted k8s node tools-k8s-worker-41 - cookbook ran by taavi@runko
* 16:53 wm-bot2: rebooted k8s node tools-k8s-worker-42 - cookbook ran by taavi@runko
* 16:51 wm-bot2: rebooted k8s node tools-k8s-worker-43 - cookbook ran by taavi@runko
* 16:49 wm-bot2: rebooted k8s node tools-k8s-worker-44 - cookbook ran by taavi@runko
* 16:45 wm-bot2: rebooted k8s node tools-k8s-worker-45 - cookbook ran by taavi@runko
* 16:43 wm-bot2: rebooted k8s node tools-k8s-worker-46 - cookbook ran by taavi@runko
* 16:41 wm-bot2: rebooted k8s node tools-k8s-worker-47 - cookbook ran by taavi@runko
* 16:40 wm-bot2: rebooted k8s node tools-k8s-worker-48 - cookbook ran by taavi@runko
* 16:38 wm-bot2: rebooted k8s node tools-k8s-worker-49 - cookbook ran by taavi@runko
* 16:36 wm-bot2: rebooted k8s node tools-k8s-worker-50 - cookbook ran by taavi@runko
* 16:35 wm-bot2: rebooted k8s node tools-k8s-worker-51 - cookbook ran by taavi@runko
* 16:33 wm-bot2: rebooted k8s node tools-k8s-worker-52 - cookbook ran by taavi@runko
* 16:31 wm-bot2: rebooted k8s node tools-k8s-worker-53 - cookbook ran by taavi@runko
* 16:28 wm-bot2: rebooted k8s node tools-k8s-worker-54 - cookbook ran by taavi@runko
* 16:27 wm-bot2: rebooted k8s node tools-k8s-worker-55 - cookbook ran by taavi@runko
* 16:25 wm-bot2: rebooted k8s node tools-k8s-worker-56 - cookbook ran by taavi@runko
* 16:23 wm-bot2: rebooted k8s node tools-k8s-worker-57 - cookbook ran by taavi@runko
* 16:21 wm-bot2: rebooted k8s node tools-k8s-worker-58 - cookbook ran by taavi@runko
* 16:20 wm-bot2: rebooted k8s node tools-k8s-worker-59 - cookbook ran by taavi@runko
* 16:18 wm-bot2: rebooted k8s node tools-k8s-worker-60 - cookbook ran by taavi@runko
* 16:09 wm-bot2: rebooted k8s node tools-k8s-worker-61 - cookbook ran by taavi@runko
* 16:07 wm-bot2: rebooted k8s node tools-k8s-worker-62 - cookbook ran by taavi@runko
* 16:01 wm-bot2: rebooted k8s node tools-k8s-worker-64 - cookbook ran by taavi@runko
* 16:00 wm-bot2: rebooting the whole tools k8s cluster (58 nodes) - cookbook ran by taavi@runko
* 15:58 wm-bot2: rebooted k8s node tools-k8s-worker-65 - cookbook ran by taavi@runko
* 15:56 wm-bot2: rebooted k8s node tools-k8s-worker-66 - cookbook ran by taavi@runko
* 15:48 wm-bot2: rebooted k8s node tools-k8s-worker-67 - cookbook ran by taavi@runko
* 15:38 wm-bot2: rebooted k8s node tools-k8s-worker-68 - cookbook ran by taavi@runko
* 15:36 wm-bot2: rebooted k8s node tools-k8s-worker-69 - cookbook ran by taavi@runko
* 15:34 wm-bot2: rebooted k8s node tools-k8s-worker-70 - cookbook ran by taavi@runko
* 15:32 wm-bot2: rebooted k8s node tools-k8s-worker-71 - cookbook ran by taavi@runko
* 15:30 wm-bot2: rebooted k8s node tools-k8s-worker-72 - cookbook ran by taavi@runko
* 15:28 wm-bot2: rebooted k8s node tools-k8s-worker-73 - cookbook ran by taavi@runko
* 15:26 wm-bot2: rebooted k8s node tools-k8s-worker-74 - cookbook ran by taavi@runko
* 15:24 wm-bot2: rebooted k8s node tools-k8s-worker-75 - cookbook ran by taavi@runko
* 15:22 wm-bot2: rebooting the whole tools k8s cluster (58 nodes) - cookbook ran by taavi@runko
* 15:17 wm-bot2: rebooted k8s node tools-k8s-worker-75 - cookbook ran by taavi@runko
* 15:14 wm-bot2: rebooted k8s node tools-k8s-worker-76 - cookbook ran by taavi@runko
* 15:12 wm-bot2: rebooted k8s node tools-k8s-worker-77 - cookbook ran by taavi@runko
* 15:10 wm-bot2: rebooted k8s node tools-k8s-worker-78 - cookbook ran by taavi@runko
* 15:08 wm-bot2: rebooted k8s node tools-k8s-worker-79 - cookbook ran by taavi@runko
* 15:06 wm-bot2: rebooted k8s node tools-k8s-worker-80 - cookbook ran by taavi@runko
* 14:59 wm-bot2: rebooted k8s node tools-k8s-worker-81 - cookbook ran by taavi@runko
* 14:41 wm-bot2: rebooted k8s node tools-k8s-worker-82 - cookbook ran by taavi@runko
* 14:38 wm-bot2: rebooting the whole tools k8s cluster (58 nodes) - cookbook ran by taavi@runko
* 14:13 andrewbogott: test log to see if stashbot is back working
* 13:19 andrewbogott: forcing puppet run on all toolforge VMs
* 08:28 taavi: stop exim4.service on tools-sgecron-2 [[phab:T333477|T333477]]
* 06:52 taavi: stop jobs-framework-emailer to prevent spam due to NFS being read-only [[phab:T333477|T333477]]
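The long run of "rebooted k8s node" entries above comes from a cookbook that cycles through the cluster one node at a time. A simplified, hedged dry-run of that loop (the node list is truncated to two hypothetical workers; the real cookbook waits for the node to rejoin before uncordoning):

```shell
# Dry-run sketch of a rolling reboot: drain, reboot, uncordon, per node.
# Echoes the commands instead of running them.
rolling_reboot_cmds() {
  local node
  for node in "$@"; do
    printf 'kubectl drain %s --ignore-daemonsets\n' "$node"
    printf 'ssh %s sudo reboot\n' "$node"
    printf 'kubectl uncordon %s\n' "$node"
  done
}

rolling_reboot_cmds tools-k8s-worker-31 tools-k8s-worker-32
```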


=== 2021-03-11 ===
* 17:40 bstorm: deployed metrics-server:0.4.1 to kubernetes
* 16:21 bstorm: add jobutils 1.40 and misctools 1.40 to stretch-tools
* 13:11 arturo: add misctools 1.37 to buster-tools{{!}}toolsbeta aptly repo for [[phab:T275865|T275865]]
* 13:10 arturo: add jobutils 1.40 to buster-tools aptly repo for [[phab:T275865|T275865]]

=== 2023-03-29 ===
* 16:07 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/labs/tools/registry-admission-webhook ({{Gerrit|dc26f52}}) - cookbook ran by raymond@ubuntu
* 15:21 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/registry-admission:24115c7 from https://gerrit.wikimedia.org/r/labs/tools/registry-admission-webhook ({{Gerrit|24115c7}}) - cookbook ran by raymond@ubuntu


=== 2021-03-10 ===
* 10:56 arturo: briefly stopped VM tools-k8s-etcd-7 to disable VMX cpu flag

=== 2023-03-28 ===
* 19:43 wm-bot2: deployed kubernetes component https://github.com/toolforge/buildpack-admission-controller ({{Gerrit|e1b9815}}) - cookbook ran by raymond@ubuntu


=== 2021-03-09 ===
* 13:31 arturo: hard-reboot tools-docker-registry-04 because of issues related to [[phab:T276922|T276922]]
* 12:34 arturo: briefly rebooting VM tools-docker-registry-04; we need to reboot the hypervisor cloudvirt1038 and the VM failed to migrate away

=== 2023-03-27 ===
* 22:51 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-buildpack-admission-controller:70d550a from https://github.com/toolforge/buildpack-admission-controller ({{Gerrit|70d550a}}) - cookbook ran by raymond@ubuntu


=== 2021-03-05 ===
* 12:30 arturo: started tools-redis-1004 again
* 12:22 arturo: stop tools-redis-1004 to ease draining of cloudvirt1035

=== 2023-03-26 ===
* 20:28 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko


=== 2021-03-04 ===
* 11:25 arturo: rebooted tools-sgewebgrid-generic-0901, repooled it again
* 09:58 arturo: depool tools-sgewebgrid-generic-0901 to reboot VM. It was stuck in MIGRATING state when draining cloudvirt1022

=== 2023-03-24 ===
* 14:13 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@endurance
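The recurring "cleaned up grid queue errors" cookbook entries refer to clearing (Son of) Grid Engine queue instances stuck in the error (E) state. On the grid master that boils down to roughly the following; a hedged sketch that prints the commands rather than running them:

```shell
# Sketch of inspecting and clearing grid queue error states on the
# grid master (printed for review; '*' targets all queue instances).
clear_queue_errors_cmds() {
  printf "qstat -f -explain E  # list queue instances in error state and why\n"
  printf "sudo qmod -cq '*'  # clear the error state on all queues\n"
}

clear_queue_errors_cmds
```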


=== 2021-03-03 ===
* 15:17 arturo: shutting down tools-sgebastion-07 in an attempt to fix nova state and finish hypervisor migration
* 15:11 arturo: tools-sgebastion-07 triggered a neutron exception (unauthorized) while being live-migrated from cloudvirt1021 to 1029. Resetting nova state with `nova reset-state bd685d48-1011-404e-a755-372f6022f345 --active` and trying again
* 14:48 arturo: killed pywikibot instance running in tools-sgebastion-07 by user msyn

=== 2023-03-21 ===
* 08:11 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko


=== 2021-03-02 ===
* 15:23 bstorm: depooling tools-sgewebgrid-lighttpd-0914.tools.eqiad.wmflabs for reboot. It isn't communicating right
* 15:22 bstorm: cleared queue error states...will need to keep a better eye on what's causing those

=== 2023-03-20 ===
* 13:39 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko
* 10:57 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@endurance


=== 2021-02-27 ===
* 02:23 bstorm: deployed typo fix to maintain-kubeusers in an innocent effort to make the weekend better [[phab:T275910|T275910]]
* 02:00 bstorm: running a script to repair the dumps mount in all podpresets [[phab:T275371|T275371]]

=== 2023-03-19 ===
* 09:32 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko


=== 2021-02-26 ===
* 22:04 bstorm: cleaned up grid jobs 1230666,1908277,1908299,2441500,2441513
* 21:27 bstorm: hard rebooting tools-sgeexec-0947
* 21:21 bstorm: hard rebooting tools-sgeexec-0952.tools.eqiad.wmflabs
* 20:01 bd808: Deleted csr in strange state for tool-ores-inspect

=== 2023-03-17 ===
* 15:56 andrewbogott: truncating .out, .err, and .log files to 10MB in anticipation of moving the NFS volumes


=== 2021-02-24 ===
* 18:30 bd808: `sudo wmcs-openstack role remove --user zfilipin --project tools user` [[phab:T267313|T267313]]
* 01:04 bstorm: hard rebooting tools-k8s-worker-76 because it's in a sorry state

=== 2023-03-13 ===
* 09:50 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-buildpack-admission-controller:f90bd8f from https://github.com/toolforge/buildpack-admission-controller ({{Gerrit|f90bd8f}}) - cookbook ran by dcaro@vulcanus


=== 2021-02-23 ===
* 23:11 bstorm: draining a bunch of k8s workers to clean up after dumps changes [[phab:T272397|T272397]]
* 23:06 bstorm: draining tools-k8s-worker-55 to clean up after dumps changes [[phab:T272397|T272397]]

=== 2023-03-12 ===
* 13:40 taavi: restart haproxy on tools-k8s-haproxy-3


=== 2021-02-22 ===
* 20:40 bstorm: repooled tools-sgeexec-0918.tools.eqiad.wmflabs
* 19:09 bstorm: hard rebooted tools-sgeexec-0918 from openstack [[phab:T275411|T275411]]
* 19:07 bstorm: shutting down tools-sgeexec-0918 with the VM's command line (not libvirt directly yet) [[phab:T275411|T275411]]
* 19:05 bstorm: shutting down tools-sgeexec-0918 (with openstack to see what happens) [[phab:T275411|T275411]]
* 19:03 bstorm: depooled tools-sgeexec-0918 [[phab:T275411|T275411]]
* 18:56 bstorm: deleted job 1962508 from the grid to clear it up [[phab:T275301|T275301]]
* 16:58 bstorm: cleared error state on several grid queues

=== 2023-03-11 ===
* 18:38 wm-bot2: removing grid node tools-sgeexec-10-11.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 18:36 wm-bot2: removing grid node tools-sgeexec-10-11.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 18:34 wm-bot2: removing grid node tools-sgeexec-10-11.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 18:31 taavi: reboot misbehaving tools-sgeexec-10-11


=== 2021-02-19 ===
* 12:31 arturo: deploying new version of toolforge ingress admission controller

=== 2023-03-10 ===
* 16:36 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/labs/tools/maintain-kubeusers ({{Gerrit|8b42b15}}) - cookbook ran by taavi@runko


=== 2021-02-17 ===
* 21:26 bstorm: deleted tools-puppetdb-01 since it is unused at this time (and undersized anyway)

=== 2023-03-09 ===
* 10:13 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/labs/tools/maintain-kubeusers ({{Gerrit|53e7f81}}) - cookbook ran by taavi@runko
* 10:04 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/maintain-kubeusers:834807c from https://gerrit.wikimedia.org/r/labs/tools/maintain-kubeusers ({{Gerrit|834807c}}) - cookbook ran by taavi@runko


=== 2021-02-04 ===
* 16:27 bstorm: rebooting tools-package-builder-02

=== 2023-03-08 ===
* 22:31 bd808: Live hacked user-maintainer clusterrole to work around breakage in [[phab:T331572|T331572]]


=== 2021-01-26 ===
* 16:27 bd808: Hard reboot of tools-sgeexec-0906 via Horizon for [[phab:T272978|T272978]]

=== 2023-03-07 ===
* 11:34 wm-bot2: Increased quotas by 2 volumes - cookbook ran by fran@wmf3169
* 11:09 wm-bot2: Increased quotas by 6 snapshots - cookbook ran by fran@wmf3169
* 11:07 wm-bot2: Increased quotas by 4000 gigabytes - cookbook ran by fran@wmf3169


=== 2021-01-22 ===
* 09:59 dcaro: added the record redis.svc.tools.eqiad1.wikimedia.cloud pointing to tools-redis1003 ([[phab:T272679|T272679]])

=== 2023-03-06 ===
* 12:51 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/labs/tools/registry-admission-webhook ({{Gerrit|6688477}}) - cookbook ran by taavi@runko
* 12:33 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/registry-admission:e916fee from https://gerrit.wikimedia.org/r/labs/tools/registry-admission-webhook ({{Gerrit|e916fee}}) - cookbook ran by taavi@runko
* 12:16 arturo: delete calico deployment, redeploy from https://gitlab.wikimedia.org/repos/cloud/toolforge/calico ([[phab:T328539|T328539]])


=== 2021-01-21 ===
* 23:58 bstorm: deployed new maintain-kubeusers to tools [[phab:T271847|T271847]]

=== 2023-03-05 ===
* 15:43 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/labs/tools/maintain-kubeusers ({{Gerrit|3e04025}}) - cookbook ran by taavi@runko


=== 2021-01-19 ===
* 22:57 bstorm: truncated 75GB error log /data/project/robokobot/virgule.err [[phab:T272247|T272247]]
* 22:48 bstorm: truncated 100GB error log /data/project/magnus-toolserver/error.log [[phab:T272247|T272247]]
* 22:43 bstorm: truncated 107GB log '/data/project/meetbot/logs/messages.log' [[phab:T272247|T272247]]
* 22:34 bstorm: truncating 194 GB error log '/data/project/mix-n-match/mnm-microsync.err' [[phab:T272247|T272247]]
* 16:37 bd808: Added Jhernandez to root sudoers group

=== 2023-03-02 ===
* 11:32 arturo: aborrero@tools-k8s-control-2:~$ sudo -i kubectl apply -f /etc/kubernetes/toolforge-tool-roles.yaml (https://gerrit.wikimedia.org/r/c/operations/puppet/+/889836)
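Several entries (2021-01-19 and 2023-03-17 above) reclaim NFS space by truncating runaway `.err`/`.out` logs in place. A self-contained demonstration on a throwaway sparse file — the paths are examples, not the real tool directories:

```shell
# Demonstrate finding and truncating an oversized log, as done in the
# entries above, using a sparse stand-in file instead of real tool data.
tmpdir=$(mktemp -d)
truncate -s 2G "$tmpdir/virgule.err"   # sparse stand-in for a huge log

find "$tmpdir" -type f -size +1G       # locate logs over 1 GiB

truncate -s 0 "$tmpdir/virgule.err"    # reclaim the space in place
stat -c '%s' "$tmpdir/virgule.err"     # prints 0
rm -r "$tmpdir"
```

Truncating (rather than deleting) keeps the file's inode, so a process still writing to the log keeps working and the space is actually freed.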


=== 2021-01-14 ===
* 20:56 bstorm: setting bastions to have mostly-uncapped egress network and 40MBps nfs_read for better shared use
* 20:43 bstorm: running tc-setup across the k8s workers
* 20:40 bstorm: running tc-setup across the grid fleet
* 17:58 bstorm: hard rebooting tools-sgecron-01 following network issues during upgrade to stein [[phab:T261134|T261134]]

=== 2023-03-01 ===
* 13:18 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|13eda9d}}) - cookbook ran by taavi@runko


=== 2021-01-13 ===
* 10:02 arturo: delete floating IP allocation 185.15.56.245 ([[phab:T271867|T271867]])

=== 2023-02-28 ===
* 17:19 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway ({{Gerrit|9252af7}}) - cookbook ran by taavi@runko
* 17:04 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|e46da83}}) - cookbook ran by taavi@runko


=== 2021-01-12 ===
* 18:16 bstorm: deleted wedged CSR tool-adhs-wde to get maintain-kubeusers working again [[phab:T271842|T271842]]

=== 2023-02-23 ===
* 18:07 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway ({{Gerrit|efb60b3}}) - cookbook ran by taavi@runko
* 09:33 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/buildpack-admission:b34e2f8 from https://github.com/toolforge/buildpack-admission-controller.git ({{Gerrit|b34e2f8}}) - cookbook ran by taavi@runko


=== 2021-01-05 ===
* 18:49 bstorm: changing the limits on k8s etcd nodes again, so disabling puppet on them [[phab:T267966|T267966]]

=== 2023-02-21 ===
* 09:37 arturo: hard-reboot tools-sgeexec-10-11 (unresponsive to ssh)


=== 2021-01-04 ===
* 18:21 bstorm: ran 'sudo systemctl stop getty@ttyS1.service && sudo systemctl disable getty@ttyS1.service' on tools-k8s-etcd-5. I have no idea why that keeps coming back.

=== 2023-02-20 ===
* 11:24 taavi: redeploy volume-admission with helm and cert-manager certificates [[phab:T329530|T329530]] [[phab:T292238|T292238]]
* 11:15 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/volume-admission:7fd13ac from https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller ({{Gerrit|ede8bd0}}) - cookbook ran by taavi@runko
* 11:05 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-volume-admission-controller:7fd13ac from https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller ({{Gerrit|7fd13ac}}) - cookbook ran by taavi@runko
* 10:39 wm-bot2: Increased quotas by 4000 gigabytes - cookbook ran by fran@wmf3169
* 09:20 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@nostromo


=== 2020-12-22 ===
* 18:22 bstorm: rebooting the grid master because it is misbehaving following the NFS outage
* 10:53 arturo: rebase & resolve ugly git merge conflict in labs/private.git

=== 2023-02-19 ===
* 09:16 taavi: uncordon tools-k8s-worker-[80-82] after fixing security groups [[phab:T329378|T329378]]


=== 2020-12-18 ===
* 18:37 bstorm: set profile::wmcs::kubeadm::etcd_latency_ms: 15 [[phab:T267966|T267966]]

=== 2023-02-17 ===
* 11:32 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|eeeea4c}}) - cookbook ran by arturo@endurance
* 11:31 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config ({{Gerrit|7729b18}}) ([[phab:T254636|T254636]]) - cookbook ran by arturo@endurance
* 11:26 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:8a9b97e from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|eeeea4c}}) - cookbook ran by arturo@endurance
* 11:24 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:8a9b97e from https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-framework-api ({{Gerrit|618ab29}}) - cookbook ran by arturo@endurance
* 10:25 arturo: build and push mariadb-sssd/base docker image for Toolforge ([[phab:T320178|T320178]], [[phab:T254636|T254636]])


=== 2020-12-17 ===
* 21:42 bstorm: doing the same procedure to increase the timeouts more [[phab:T267966|T267966]]
* 19:56 bstorm: puppet enabled one at a time, letting things catch up. Timeouts are now adjusted to something closer to fsync values [[phab:T267966|T267966]]
* 19:44 bstorm: set etcd timeouts seed value to 20 instead of the default 10 (profile::wmcs::kubeadm::etcd_latency_ms) [[phab:T267966|T267966]]
* 18:58 bstorm: disabling puppet on k8s-etcd servers to alter the timeouts [[phab:T267966|T267966]]
* 14:23 arturo: regenerating puppet cert with proper alt names in tools-k8s-etcd-4 ([[phab:T267966|T267966]])
* 14:21 arturo: regenerating puppet cert with proper alt names in tools-k8s-etcd-5 ([[phab:T267966|T267966]])
* 14:19 arturo: regenerating puppet cert with proper alt names in tools-k8s-etcd-6 ([[phab:T267966|T267966]])
* 14:17 arturo: regenerating puppet cert with proper alt names in tools-k8s-etcd-7 ([[phab:T267966|T267966]])
* 14:15 arturo: regenerating puppet cert with proper alt names in tools-k8s-etcd-8 ([[phab:T267966|T267966]])
* 14:12 arturo: updated kube-apiserver manifest with new etcd nodes ([[phab:T267966|T267966]])
* 13:56 arturo: adding etcd dns_alt_names hiera keys to the puppet prefix https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/beb27b45a74765a64552f2d4f70a40b217b4f4e9%5E%21/
* 13:12 arturo: making k8s api server aware of the new etcd nodes via hiera update https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/3761c4c4dab1c3ed0ab0a1133d2ccf3df6c28baf%5E%21/ ([[phab:T267966|T267966]])
* 12:54 arturo: joining new etcd nodes in the k8s etcd cluster ([[phab:T267966|T267966]])
* 12:52 arturo: adding more etcd nodes in the hiera key in tools-k8s-etcd puppet prefix https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/b4f60768078eccdabdfab4cd99c7c57076de51b2
* 12:50 arturo: dropping more unused hiera keys in the tools-k8s-etcd puppet prefix https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/e9e66a6787d9b91c08cf4742a27b90b3e6d05aac
* 12:49 arturo: dropping unused hiera keys in the tools-k8s-etcd puppet prefix https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/2b4cb4a41756e602fb0996e7d0210e9102172424
* 12:16 arturo: created VM `tools-k8s-etcd-8` ([[phab:T267966|T267966]])
* 12:15 arturo: created VM `tools-k8s-etcd-7` ([[phab:T267966|T267966]])
* 12:13 arturo: created `tools-k8s-etcd` anti-affinity server group

=== 2023-02-16 ===
* 15:58 wm-bot2: Increased quotas by 4000 gigabytes - cookbook ran by fran@wmf3169
* 15:30 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/cert-manager ({{Gerrit|d71994e}}) - cookbook ran by arturo@nostromo
* 13:52 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/ingress-admission-controller ({{Gerrit|7191997}}) - cookbook ran by taavi@runko
* 13:44 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/ingress-admission:1fe8ec4 from https://gerrit.wikimedia.org/r/cloud/toolforge/ingress-admission-controller ({{Gerrit|1fe8ec4}}) - cookbook ran by taavi@runko
* 12:47 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/ingress-admission:e9b9920 from https://gerrit.wikimedia.org/r/cloud/toolforge/ingress-admission-controller ({{Gerrit|e9b9920}}) - cookbook ran by taavi@runko
* 10:35 arturo: aborrero@tools-k8s-control-1:~$ sudo -i kubectl apply -f /etc/kubernetes/psp/base-pod-security-policies.yaml
* 09:48 arturo: grid engine was failed over to shadow server, manually put it back into normal https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Grid#GridEngine_Master
* 09:39 arturo: aborrero@tools-sgegrid-shadow:~$ sudo truncate -s 1G /var/log/syslog (was 17G, full root disk)
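The 2020-12-17 entries above add new members (tools-k8s-etcd-7 and -8) to the k8s etcd cluster. Registering a new member before starting its daemon looks roughly like the following; a hedged sketch that prints the commands — the endpoint FQDN and ports are illustrative, not copied from the cluster:

```shell
# Sketch of registering a new etcd v3 member (printed for review).
# The existing-member endpoint and the domain are hypothetical examples.
add_etcd_member_cmds() {
  local new="$1"
  printf 'etcdctl --endpoints=https://tools-k8s-etcd-4.tools.eqiad.wmflabs:2379 \\\n'
  printf '  member add %s --peer-urls=https://%s.tools.eqiad.wmflabs:2380\n' "$new" "$new"
}

add_etcd_member_cmds tools-k8s-etcd-7
```

After `member add`, the new node's etcd is started with the cluster state set to "existing" so it joins rather than bootstrapping a fresh cluster.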


=== 2020-12-11 ===
* 18:29 bstorm: certificatesigningrequest.certificates.k8s.io "tool-production-error-tasks-metrics" deleted to stop maintain-kubeusers issues
* 12:14 dcaro: upgrading stable/main (clinic duty)
* 12:12 dcaro: upgrading buster-wikimedia/main (clinic duty)
* 12:03 dcaro: upgrading stable-updates/main, mainly cacertificates (clinic duty)
* 12:01 dcaro: upgrading stretch-backports/main, mainly libuv (clinic duty)
* 11:58 dcaro: disabled all the repos blocking upgrades on tools-package-builder-02 (duplicated, other releases...)
* 11:35 arturo: uncordon tools-k8s-worker-71 and tools-k8s-worker-55, they weren't uncordoned yesterday for whatever reason ([[phab:T263284|T263284]])
* 11:27 dcaro: upgrading stretch-wikimedia/main (clinic duty)
* 11:20 dcaro: upgrading stretch-wikimedia/thirdparty/mono-project-stretch (clinic duty)
* 11:08 dcaro: upgrade stretch-wikimedia/component/php72 (minor upgrades) (clinic duty)
* 11:04 dcaro: upgrade oldstable/main packages (clinic duty)
* 10:58 dcaro: upgrade kubectl done (clinic duty)
* 10:53 dcaro: upgrade kubectl (clinic duty)
* 10:16 dcaro: upgrading oldstable/main packages (clinic duty)

=== 2023-02-15 ===
* 18:03 taavi: deployed https://gerrit.wikimedia.org/r/c/operations/puppet/+/889585/ to increase amount of haproxy max connections
* 15:19 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@nostromo
* 09:50 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/cert-manager.git ({{Gerrit|e3f3ce1}}) ([[phab:T329453|T329453]]) - cookbook ran by taavi@runko
* 09:30 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@nostromo


=== 2020-12-10 ===
* 17:35 bstorm: k8s-control nodes upgraded to 1.17.13 [[phab:T263284|T263284]]
* 17:16 arturo: k8s control nodes were all upgraded to 1.17, now upgrading worker nodes ([[phab:T263284|T263284]])
* 15:50 dcaro: puppet upgraded to 5.5.10 on the hosts, ping me if you see anything weird (clinic duty)
* 15:41 arturo: icinga-downtime toolschecker for 2h ([[phab:T263284|T263284]])
* 15:35 dcaro: Puppet 5 on tools-sgebastion-09 ran well and without issues, upgrading the other sge nodes (clinic duty)
* 15:32 dcaro: Upgrading puppet from 4 to 5 on tools-sgebastion-09 (clinic duty)
* 12:41 arturo: set hiera `profile::wmcs::kubeadm::component: thirdparty/kubeadm-k8s-1-17` in project & tools-k8s-control prefix ([[phab:T263284|T263284]])
* 11:50 arturo: disabled puppet in all k8s nodes in preparation for version upgrade ([[phab:T263284|T263284]])
* 09:58 dcaro: successful tesseract upgrade on tools-sgewebgrid-lighttpd-0914, upgrading the rest of nodes (clinic duty)
* 09:49 dcaro: upgrading tesseract on tools-sgewebgrid-lighttpd-0914 (clinic duty)

=== 2023-02-14 ===
* 15:07 taavi: import cert-manager components to local docker registry [[phab:T329453|T329453]]
* 12:12 arturo: the fixed webservicemonitor is starting a bunch of grid webservices ([[phab:T329611|T329611]])
* 12:10 arturo: included tools-manifests 0.25 in tools-buster aptly repo, deploying it now! ([[phab:T329611|T329611]], [[phab:T329467|T329467]], [[phab:T244809|T244809]])


=== 2023-02-13 ===
* 16:05 wm-bot2: Increased quotas by 4000 gigabytes - cookbook ran by fran@wmf3169
* 16:03 taavi: update maintain-kubeusers deployment to use helm
* 15:05 taavi: deploy jobs-api updates, improving some status messages
* 15:04 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|13d87c4}}) - cookbook ran by taavi@runko
* 15:00 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:390ed64 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|390ed64}}) - cookbook ran by taavi@runko
* 13:14 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/maintain-kubeusers:aac195b from https://gerrit.wikimedia.org/r/labs/tools/maintain-kubeusers ({{Gerrit|aac195b}}) - cookbook ran by taavi@runko

=== 2020-12-08 ===
* 19:01 bstorm: pushed updated calico node image (v3.14.0) to internal docker registry as well [[phab:T269016|T269016]]

=== 2023-02-10 ===
* 15:45 taavi: reboot tools-k8s-worker-82 to troubleshoot network issues
* 12:44 wm-bot2: Added a new k8s worker tools-k8s-worker-82.tools.eqiad1.wikimedia.cloud to the worker pool ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 12:31 wm-bot2: Adding a new k8s worker node ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 12:29 wm-bot2: Added a new k8s worker tools-k8s-worker-81.tools.eqiad1.wikimedia.cloud to the worker pool ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 12:15 wm-bot2: Adding a new k8s worker node ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 11:53 wm-bot2: Adding a new k8s worker node ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 11:44 wm-bot2: removing grid node tools-sgeweblight-10-23.tools.eqiad1.wikimedia.cloud ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 11:42 wm-bot2: removing grid node tools-sgeexec-10-5.tools.eqiad1.wikimedia.cloud ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 11:39 wm-bot2: removing grid node tools-sgeweblight-10-19.tools.eqiad1.wikimedia.cloud ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 11:26 wm-bot2: removing grid node tools-sgeweblight-10-12.tools.eqiad1.wikimedia.cloud ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 11:24 wm-bot2: removing grid node tools-sgeexec-10-1.tools.eqiad1.wikimedia.cloud ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko

=== 2020-12-07 ===
* 22:56 bstorm: pushed updated local copies of the typha, calico-cni and calico-pod2daemon-flexvol images to the tools internal registry [[phab:T269016|T269016]]


=== 2023-02-01 ===
* 16:03 taavi: deployed tools-webservice 0.89
* 15:43 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config ({{Gerrit|372037f}}) - cookbook ran by taavi@runko

=== 2020-12-03 ===
* 09:18 arturo: restarted kubelet systemd service on tools-k8s-worker-38. Node was NotReady, complaining about 'use of closed network connection'
* 09:16 arturo: restarted kubelet systemd service on tools-k8s-worker-59. Node was NotReady, complaining about 'use of closed network connection'


=== 2023-01-26 ===
* 15:05 taavi: drain and reboot tools-k8s-worker-74 which seems to have some issues with nfs
* 14:37 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|307f302}}) - cookbook ran by taavi@runko
* 14:30 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:05966c6 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|05966c6}}) - cookbook ran by taavi@runko

=== 2020-11-28 ===
* 23:35 Krenair: Re-scheduled 4 continuous jobs from tools-sgeexec-0908 as it appears to be broken, at about 23:20 UTC
* 04:35 Krenair: Ran `sudo -i kubectl -n tool-mdbot delete cm maintain-kubeusers` on tools-k8s-control-1 for [[phab:T268904|T268904]], seems to have regenerated ~tools.mdbot/.kube/config


=== 2023-01-24 ===
* 12:04 taavi: deploying toolforge-jobs-framework-cli v10 [[phab:T327775|T327775]]
* 10:07 taavi: publish toolforge-jobs-framework-cli v9

=== 2020-11-24 ===
* 17:44 arturo: rebased labs/private.git. 2 patches had merge conflicts
* 16:36 bd808: clush -w @all -b 'sudo -i apt-get purge nscd'
* 16:31 bd808: Ran `sudo -i apt-get purge nscd` on tools-sgeexec-0932 to try and fix apt state for puppet


=== 2023-01-23 ===
* 11:31 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d5ae229}}) - cookbook ran by taavi@runko
* 11:23 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:d085c50 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d085c50}}) - cookbook ran by taavi@runko
* 11:17 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config ({{Gerrit|864171a}}) - cookbook ran by taavi@runko

=== 2020-11-10 ===
* 19:45 andrewbogott: rebooting tools-sgeexec-0950; OOM


=== 2023-01-20 ===
* 23:24 andrewbogott: truncating logfiles with find . -name '*.err' -size +1G -exec truncate --size=100M <nowiki>{</nowiki><nowiki>}</nowiki> \;
* 21:24 andrewbogott: truncating logfiles with find . -name '*.out' -size +1G -exec truncate --size=100M <nowiki>{</nowiki><nowiki>}</nowiki> \;
* 01:06 andrewbogott: truncating logfiles with find . -name '*.log' -size +1G -exec truncate --size=100M <nowiki>{</nowiki><nowiki>}</nowiki> \;

=== 2020-11-02 ===
* 13:35 arturo: (typo: dcaro)
* 13:35 arturo: added dcar as projectadmin & user ([[phab:T266068|T266068]])
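The `find … -exec truncate` one-liners logged above reclaim space in place, without deleting files that a running job may still hold open. A scaled-down, runnable sketch of the same pattern (file names and the 1 MiB/100 KiB thresholds here are illustrative stand-ins for the 1G/100M values used in the real runs):

```shell
set -e
demo=$(mktemp -d)
# Create one "oversized" log (2 MiB) and one small log (10 KiB)
dd if=/dev/zero of="$demo/big.err" bs=1024 count=2048 2>/dev/null
dd if=/dev/zero of="$demo/small.err" bs=1024 count=10 2>/dev/null
# Truncate every *.err over 1 MiB down to 100 KiB; small files are untouched
find "$demo" -name '*.err' -size +1M -exec truncate --size=100K {} \;
stat -c '%n %s' "$demo"/*.err
```

Truncation keeps the inode (and any open file descriptors) valid, which is why it is safer than `rm` for logs that daemons keep appending to.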


=== 2023-01-19 ===
* 11:46 arturo: `aborrero@tools-k8s-control-1:~$ sudo -i kubectl delete clusterrolebinding jobs-api-psp` (cleanup unused stuff)

=== 2020-10-29 ===
* 21:33 legoktm: published docker-registry.tools.wmflabs.org/toolbeta-test image ([[phab:T265681|T265681]])
* 21:10 bstorm: Added another ingress node to k8s cluster in case the load spikes are the problem [[phab:T266506|T266506]]
* 17:33 bstorm: hard rebooting tools-sgeexec-0905 and tools-sgeexec-0916 to get the grid back to full capacity
* 04:03 legoktm: published docker-registry.tools.wmflabs.org/toolforge-buster0-builder:latest image ([[phab:T265686|T265686]])


=== 2023-01-18 ===
* 15:42 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|0ad4c66}}) - cookbook ran by arturo@nostromo
* 15:29 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:54cc15e from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|54cc15e}}) - cookbook ran by arturo@nostromo

=== 2020-10-28 ===
* 23:42 bstorm: dramatically elevated the egress cap on tools-k8s-ingress nodes that were affected by the NFS settings [[phab:T266506|T266506]]
* 22:10 bstorm: launching tools-k8s-ingress-3 to try and get an NFS-free node [[phab:T266506|T266506]]
* 21:58 bstorm: set 'mount_nfs: false' on the tools-k8s-ingress prefix [[phab:T266506|T266506]]


=== 2023-01-17 ===
* 13:55 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|8cf38a1}}) - cookbook ran by arturo@endurance
* 13:51 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|0d0a882}}) - cookbook ran by arturo@endurance
* 13:34 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:3a58c1d from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|3a58c1d}}) - cookbook ran by arturo@endurance

=== 2020-10-23 ===
* 22:22 legoktm: imported pack_0.14.2-1_amd64.deb into buster-tools ([[phab:T266270|T266270]])


=== 2023-01-10 ===
* 11:55 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|8e0a2f9}}) - cookbook ran by arturo@endurance
* 11:52 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:9514b00 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|8e0a2f9}}) - cookbook ran by arturo@endurance
* 11:36 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|0243967}}) - cookbook ran by arturo@endurance

=== 2020-10-21 ===
* 17:58 legoktm: pushed toolforge-buster0-<nowiki>{</nowiki>build,run<nowiki>}</nowiki>:latest images to docker registry


=== 2023-01-03 ===
* 17:17 andrewbogott: find -name '*.log' -size +1G -exec truncate --size=1G <nowiki>{</nowiki><nowiki>}</nowiki> \;

=== 2020-10-15 ===
* 22:00 bstorm: manually removing nscd from tools-sgebastion-08 and running puppet
* 18:23 andrewbogott: uncordoning tools-k8s-worker-53, 54, 55, 59
* 17:28 andrewbogott: depooling tools-k8s-worker-53, 54, 55, 59
* 17:27 andrewbogott: uncordoning tools-k8s-worker-35, 37, 45
* 16:44 andrewbogott: depooling tools-k8s-worker-35, 37, 45


=== 2022-12-20 ===
* 09:07 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@nostromo

=== 2020-10-14 ===
* 21:00 andrewbogott: repooling tools-sgewebgrid-generic-0901 and tools-sgewebgrid-lighttpd-0915
* 20:37 andrewbogott: depooling tools-sgewebgrid-generic-0901 and tools-sgewebgrid-lighttpd-0915
* 20:35 andrewbogott: repooling tools-sgewebgrid-lighttpd-0911, 12, 13, 16
* 20:31 bd808: Deployed toollabs-webservice v0.74
* 19:53 andrewbogott: depooling tools-sgewebgrid-lighttpd-0911, 12, 13, 16 and moving to Ceph
* 19:47 andrewbogott: repooling tools-sgeexec-0932, 33, 34 and moving to Ceph
* 19:07 andrewbogott: depooling tools-sgeexec-0932, 33, 34 and moving to Ceph
* 19:06 andrewbogott: repooling tools-sgeexec-0935, 36, 38, 40 and moving to Ceph
* 16:56 andrewbogott: depooling tools-sgeexec-0935, 36, 38, 40 and moving to Ceph


=== 2022-12-12 ===
* 14:36 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus

=== 2020-10-10 ===
* 17:07 bstorm: cleared errors on tools-sgeexec-0912.tools.eqiad.wmflabs to get the queue moving again


=== 2022-12-09 ===
* 07:20 taavi: change the canonical tools-mail external hostname to use mail.tools.wmcloud.org and add valid spf to toolforge.org [[phab:T324809|T324809]]

=== 2020-10-08 ===
* 17:07 bstorm: rebuilding docker images with locales-all [[phab:T263339|T263339]]


=== 2022-12-05 ===
* 11:06 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus

=== 2020-10-06 ===
* 19:04 andrewbogott: uncordoned tools-k8s-worker-38
* 18:51 andrewbogott: uncordoned tools-k8s-worker-52
* 18:40 andrewbogott: draining and cordoning tools-k8s-worker-52 and tools-k8s-worker-38 for ceph migration


=== 2022-11-30 ===
* 10:39 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|bc3529d}}) - cookbook ran by arturo@nostromo
* 10:17 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:c360d54 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|c360d54}}) - cookbook ran by arturo@nostromo

=== 2020-10-02 ===
* 21:09 bstorm: rebooting tools-k8s-worker-70 because it seems to be unable to recover from an old NFS disconnect
* 17:37 andrewbogott: stopping tools-prometheus-03 to attempt a snapshot
* 16:03 bstorm: shutting down tools-prometheus-04 to try to fsck the disk


=== 2022-11-29 ===
* 19:52 taavi: clear puppet failure emails from exim queues

=== 2020-10-01 ===
* 21:39 andrewbogott: migrating tools-proxy-06 to ceph
* 21:35 andrewbogott: moving k8s.tools.eqiad1.wikimedia.cloud from 172.16.0.99 (toolsbeta-test-k8s-haproxy-1) to 172.16.0.108 (toolsbeta-test-k8s-haproxy-2) in anticipation of downtime for haproxy-1 tomorrow


=== 2022-11-09 ===
* 08:58 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@nostromo

=== 2020-09-30 ===
* 18:34 andrewbogott: repooling tools-sgeexec-0918
* 18:29 andrewbogott: depooling tools-sgeexec-0918 so I can reboot cloudvirt1036


=== 2022-11-05 ===
* 19:28 andrewbogott: cleaning up nfs share with root@labstore1004:/srv/tools/shared/tools# find -name '*.err' -size +1G -exec truncate --size=1G <nowiki>{</nowiki><nowiki>}</nowiki> \;
* 13:26 andrewbogott: cleaning up nfs share with root@labstore1004:/srv/tools/shared/tools# find -name '*.log' -size +1G -exec truncate --size=1G <nowiki>{</nowiki><nowiki>}</nowiki> \;

=== 2020-09-23 ===
* 21:38 bstorm: ran an 'apt clean' across the fleet to get ahead of the new locale install


=== 2022-11-04 ===
* 20:41 andrewbogott: cleaning up nfs share with root@labstore1004:/srv/tools/shared/tools# find -name '*.err' -not -newermt "Nov 1, 2021" -exec rm <nowiki>{</nowiki><nowiki>}</nowiki> \;
* 14:02 andrewbogott: cleaning up nfs share with root@labstore1004:/srv/tools/shared/tools# find -name '*.log' -not -newermt "Nov 1, 2021" -exec rm <nowiki>{</nowiki><nowiki>}</nowiki> \;
* 12:20 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d464be4}}) ([[phab:T304900|T304900]]) - cookbook ran by arturo@nostromo
* 12:12 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:2b800f5 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|2b800f5}}) ([[phab:T304900|T304900]]) - cookbook ran by arturo@nostromo

=== 2020-09-18 ===
* 19:41 andrewbogott: repooling tools-k8s-worker-30, 33, 34, 57, 60
* 19:04 andrewbogott: depooling tools-k8s-worker-30, 33, 34, 57, 60
* 19:02 andrewbogott: repooling tools-k8s-worker-41, 43, 44, 47, 48, 49, 50, 51
* 17:48 andrewbogott: depooling tools-k8s-worker-41, 43, 44, 47, 48, 49, 50, 51
* 17:47 andrewbogott: repooling tools-k8s-worker-31, 32, 36, 39, 40
* 16:40 andrewbogott: depooling tools-k8s-worker-31, 32, 36, 39, 40
* 16:38 andrewbogott: repooling tools-sgewebgrid-lighttpd-0914, tools-sgewebgrid-generic-0902, tools-sgewebgrid-lighttpd-0919, tools-sgewebgrid-lighttpd-0918
* 16:10 andrewbogott: depooling tools-sgewebgrid-lighttpd-0914, tools-sgewebgrid-generic-0902, tools-sgewebgrid-lighttpd-0919, tools-sgewebgrid-lighttpd-0918
* 13:54 andrewbogott: repooling tools-sgeexec-0913, tools-sgeexec-0915, tools-sgeexec-0916
* 13:50 andrewbogott: depooling tools-sgeexec-0913, tools-sgeexec-0915, tools-sgeexec-0916 for flavor update
* 01:20 andrewbogott: repooling tools-sgeexec-0901, tools-sgeexec-0905, tools-sgeexec-0910, tools-sgeexec-0911, tools-sgeexec-0912 after flavor update
* 01:11 andrewbogott: depooling tools-sgeexec-0901, tools-sgeexec-0905, tools-sgeexec-0910, tools-sgeexec-0911, tools-sgeexec-0912 for flavor update
* 01:08 andrewbogott: repooling tools-sgeexec-0917, tools-sgeexec-0918, tools-sgeexec-0919, tools-sgeexec-0920 after flavor update
* 01:00 andrewbogott: depooling tools-sgeexec-0917, tools-sgeexec-0918, tools-sgeexec-0919, tools-sgeexec-0920 for flavor update
* 00:58 andrewbogott: repooling tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 after flavor update
* 00:49 andrewbogott: depooling tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 for flavor update
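The age-based `find … -not -newermt … -exec rm` cleanups logged above delete only files whose mtime is at or before a cutoff date, leaving recently written logs alone. A small runnable sketch of that pattern (file names and the cutoff here are illustrative, not from the log):

```shell
set -e
demo=$(mktemp -d)
# One file backdated well before the cutoff, one modified just now
touch -d '2021-01-01' "$demo/stale.err"
touch "$demo/fresh.err"
# -newermt is true for files modified after the given date; -not inverts it,
# so only files untouched since the cutoff are removed
find "$demo" -name '*.err' -not -newermt '2021-11-01' -exec rm {} \;
ls "$demo"
```

Using `-exec rm {} \;` rather than `-delete` mirrors the logged invocations; both forms work in GNU find.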


=== 2022-11-01 ===
* 09:37 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master ([[phab:T322110|T322110]]) - cookbook ran by dcaro@vulcanus

=== 2020-09-17 ===
* 21:56 bd808: Built and deployed tools-manifest v0.22 ([[phab:T263190|T263190]])
* 21:55 bd808: Built and deployed tools-manifest v0.22 ([[phab:T169695|T169695]])
* 20:34 bd808: Live hacked "--backend=gridengine" into webservicemonitor on tools-sgecron-01 ([[phab:T263190|T263190]])
* 20:21 bd808: Restarted webservicemonitor on tools-sgecron-01.tools.eqiad.wmflabs
* 20:09 andrewbogott: I didn't actually depool tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 because there was some kind of brief outage just now
* 19:58 andrewbogott: depooling tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 for flavor update
* 19:55 andrewbogott: repooling tools-k8s-worker-61,62,64,65,67,68,69 for flavor update
* 19:29 andrewbogott: depooling tools-k8s-worker-61,62,64,65,67,68,69 for flavor update
* 15:38 andrewbogott: repooling tools-k8s-worker-70 and tools-k8s-worker-66 after flavor remapping
* 15:34 andrewbogott: depooling tools-k8s-worker-70 and tools-k8s-worker-66 for flavor remapping
* 15:30 andrewbogott: repooling tools-sgeexec-0909, 0908, 0907, 0906, 0904
* 15:21 andrewbogott: depooling tools-sgeexec-0909, 0908, 0907, 0906, 0904 for flavor remapping
* 13:55 andrewbogott: depooled tools-sgewebgrid-lighttpd-0917 and tools-sgewebgrid-lighttpd-0920
* 13:55 andrewbogott: repooled tools-sgeexec-0937 after move to ceph
* 13:45 andrewbogott: depooled tools-sgeexec-0937 for move to ceph


=== 2022-10-26 ===
* 08:45 dcaro: depooling and rebooting tools-sgeexec-10-22 to get nfs scratch working again

=== 2020-09-16 ===
* 23:20 andrewbogott: repooled tools-sgeexec-0941 and tools-sgeexec-0939 for move to ceph
* 23:03 andrewbogott: depooled tools-sgeexec-0941 and tools-sgeexec-0939 for move to ceph
* 23:02 andrewbogott: uncordoned tools-k8s-worker-58, tools-k8s-worker-56, tools-k8s-worker-42 for migration to ceph
* 22:29 andrewbogott: draining tools-k8s-worker-58, tools-k8s-worker-56, tools-k8s-worker-42 for migration to ceph
* 17:37 andrewbogott: service gridengine-master restart on tools-sgegrid-master


=== 2022-10-25 ===
* 16:14 wm-bot2: Increased quotas by 5120 gigabytes - cookbook ran by fran@wmf3169
* 15:26 dcaro: pushed a newer docker-registry.tools.wmflabs.org/python:3.9-slim-bullseye (from upstream python:3.9-slim-bullseye)

=== 2020-09-10 ===
* 15:37 arturo: hard-rebooting tools-proxy-05
* 15:33 arturo: rebooting tools-proxy-05 to try flushing local DNS caches
* 15:25 arturo: detected missing DNS record for k8s.tools.eqiad1.wikimedia.cloud which means the k8s cluster is down
* 10:22 arturo: enabling ingress dedicated worker nodes in the k8s cluster ([[phab:T250172|T250172]])


=== 2022-10-20 ===
* 16:54 andrewbogott: rebooting tools-package-builder-04
* 16:49 andrewbogott: rebooting redis nodes (one at a time)
* 10:54 taavi: rebuild mono68-sssd image with the expired DST Root CA X3 removed [[phab:T311466|T311466]]

=== 2020-09-09 ===
* 11:12 arturo: new ingress nodes added to the cluster, and tainted/labeled per the docs https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Deploying#ingress_nodes ([[phab:T250172|T250172]])
* 10:50 arturo: created puppet prefix `tools-k8s-ingress` ([[phab:T250172|T250172]])
* 10:42 arturo: created VMs tools-k8s-ingress-1 and tools-k8s-ingress-2 in the `tools-ingress` server group ([[phab:T250172|T250172]])
* 10:38 arturo: created server group `tools-ingress` with soft anti affinity policy ([[phab:T250172|T250172]])


=== 2022-10-18 ===
* 11:52 taavi: deploy toolforge-jobs-framework-cli deb v8
* 10:30 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-emailer ({{Gerrit|64385e9}}) ([[phab:T320405|T320405]]) - cookbook ran by arturo@nostromo
* 10:27 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:9be2272 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|9be2272}}) - cookbook ran by taavi@runko
* 10:18 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-emailer:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-emailer ({{Gerrit|64385e9}}) ([[phab:T320405|T320405]]) - cookbook ran by arturo@nostromo

=== 2020-09-08 ===
* 23:24 bstorm: clearing grid queue error states blocking job runs
* 22:53 bd808: forcing puppet run on tools-sgebastion-07


=== 2022-10-17 ===
* 07:25 taavi: push updated perl532 images [[phab:T320824|T320824]]

=== 2020-09-02 ===
* 18:13 andrewbogott: moving tools-sgeexec-0920 to ceph
* 17:57 andrewbogott: moving tools-sgeexec-0942 to ceph


=== 2022-10-14 ===
* 07:54 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|0cc020e}}) ([[phab:T311466|T311466]]) - cookbook ran by taavi@runko

=== 2020-08-31 ===
* 19:58 andrewbogott: migrating tools-sgeexec-091[0-9] to ceph
* 17:19 andrewbogott: migrating tools-sgeexec-090[4-9] to ceph
* 17:19 andrewbogott: repooled tools-sgeexec-0901
* 16:52 bstorm: `apt install uwsgi` was run on tools-checker-03 in the last log [[phab:T261677|T261677]]
* 16:51 bstorm: running `apt install uwsgi` with --allow-downgrades to fix the puppet setup there [[phab:T261677|T261677]]
* 14:26 andrewbogott: depooling tools-sgeexec-0901, migrating to ceph


=== 2022-10-13 ===
* 15:10 arturo: restart jobs-emailer pod

=== 2020-08-30 ===
* 00:57 Krenair: also ran qconf -ds on each
* 00:35 Krenair: Tidied up SGE problems (it was spamming root@ every minute for hours) following host deletions some hours ago - removed tools-sgeexec-0921 through 0931 from @general, ran qmod -rj on all jobs registered for those nodes, then qdel -f on the remainders, then qconf -de on each deleted node


=== 2022-10-12 ===
* 23:25 bd808: Rebuilding all Toolforge docker images ([[phab:T278436|T278436]], [[phab:T311466|T311466]], [[phab:T293552|T293552]])
* 20:43 bd808: Rebuilding all Toolforge docker images to pick up bug and security fix packages. Third try seems to be working. ([[phab:T316554|T316554]])
* 20:31 bd808: Rebuilding all Toolforge docker images to pick up bug and security fix packages after fixing bug in building the bullseye base image. ([[phab:T316554|T316554]])
* 16:26 dcaro: deploy the latest registry admission webhook, now for real (image tag {{Gerrit|07bc7db}})
* 12:48 dcaro: deploy the latest registry admission webhook (image tag {{Gerrit|07bc7db}})
* 09:26 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus
* 09:19 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus

=== 2020-08-29 ===
* 16:02 bstorm: deleting "tools-sgeexec-0931", "tools-sgeexec-0930", "tools-sgeexec-0929", "tools-sgeexec-0928", "tools-sgeexec-0927"
* 16:00 bstorm: deleting "tools-sgeexec-0926", "tools-sgeexec-0925", "tools-sgeexec-0924", "tools-sgeexec-0923", "tools-sgeexec-0922", "tools-sgeexec-0921"


=== 2022-10-11 ===
* 13:52 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:8574c36 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|8574c36}}) - cookbook ran by taavi@runko

=== 2020-08-26 ===
* 21:08 bd808: Disabled puppet on tools-proxy-06 to test fixes for a bug in the new [[phab:T251628|T251628]] code
* 08:54 arturo: merged several patches by bryan for toolforge front proxy (cleanups, etc) example: https://gerrit.wikimedia.org/r/c/operations/puppet/+/622435


=== 2022-10-10 ===
* 19:30 taavi: rebooting all k8s worker nodes to clean up labstore1006/7 remains
* 16:51 taavi: clean up labstore1006/7 mounts from k8s control nodes [[phab:T320425|T320425]]
* 11:35 arturo: aborrero@tools-k8s-control-1:~$ sudo -i kubectl -n jobs-emailer rollout restart deployment/jobs-emailer ([[phab:T317998|T317998]])
* 08:44 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|afa90ed}}) ([[phab:T320284|T320284]]) - cookbook ran by taavi@runko
* 08:39 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|afa90ed}}) - cookbook ran by taavi@runko

=== 2020-08-25 ===
* 19:38 andrewbogott: deleting tools-sgeexec-0943.tools.eqiad.wmflabs, tools-sgeexec-0944.tools.eqiad.wmflabs, tools-sgeexec-0945.tools.eqiad.wmflabs, tools-sgeexec-0946.tools.eqiad.wmflabs, tools-sgeexec-0948.tools.eqiad.wmflabs, tools-sgeexec-0949.tools.eqiad.wmflabs, tools-sgeexec-0953.tools.eqiad.wmflabs — they are broken and we're not very curious why; will retry this exercise when everything is standardized on
* 15:03 andrewbogott: removing non-ceph nodes tools-sgeexec-0921 through tools-sgeexec-0931
* 15:02 andrewbogott: added new sge-exec nodes tools-sgeexec-0943 through tools-sgeexec-0953 (for real this time)


=== 2022-10-09 ===
* 17:29 taavi: kill 10 idle tmux sessions of user 'hoi' on tools-sgebastion-10 [[phab:T320352|T320352]]

=== 2020-08-19 ===
* 21:29 andrewbogott: shutting down and removing tools-k8s-worker-20 through tools-k8s-worker-29; this load can now be handled by new nodes on ceph hosts
* 21:15 andrewbogott: shutting down and removing tools-k8s-worker-1 through tools-k8s-worker-19; this load can now be handled by new nodes on ceph hosts
* 18:40 andrewbogott: creating 13 new xlarge k8s worker nodes, tools-k8s-worker-67 through tools-k8s-worker-79


=== 2022-10-07 ===
* 13:02 taavi: taavi@cloudcontrol1005 ~ $ sudo mark_tool --disable oncall # [[phab:T320240|T320240]]

=== 2020-08-18 ===
* 15:24 bd808: Rebuilding all Docker containers to pick up newest versions of installed packages


=== 2022-10-06 ===
* 00:39 bd808: Image rebuild failing with debian apt repo signature issue. Will investigate tomorrow. ([[phab:T316554|T316554]])
* 00:36 bd808: Rebuilding all Toolforge docker images to pick up bug and security fix packages. ([[phab:T316554|T316554]])
* 00:04 bd808: Building new php74-sssd-base & web images ([[phab:T310435|T310435]])

=== 2020-07-30 ===
* 16:28 andrewbogott: added new xlarge ceph-hosted worker nodes: tools-k8s-worker-61, 62, 63, 64, 65, 66. [[phab:T258663|T258663]]


=== 2022-10-03 ===
* 14:36 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/volume-admission:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller ({{Gerrit|8da432b}}) - cookbook ran by taavi@runko

=== 2020-07-29 ===
* 23:24 bd808: Pushed a copy of docker-registry.wikimedia.org/wikimedia-jessie:latest to docker-registry.tools.wmflabs.org/wikimedia-jessie:latest in preparation for the upstream image going away


=== 2022-09-28 ===
* 21:23 lucaswerkmeister: on tools-sgebastion-10: run-puppet-agent # [[phab:T318858|T318858]]
* 21:22 lucaswerkmeister: on tools-sgebastion-10: apt remove emacs-common emacs-bin-common # fix package conflict, [[phab:T318858|T318858]]
* 21:15 lucaswerkmeister: added root SSH key for myself, manually ran puppet on tools-sgebastion-10 to apply it (seemingly successfully)

=== 2020-07-24 ===
* 22:33 bd808: Removed a few more ancient docker images: grrrit, jessie-toollabs, and nagf
* 21:02 bd808: Running cleanup script to delete the non-sssd toolforge images from docker-registry.tools.wmflabs.org
* 20:17 bd808: Forced garbage collection on docker-registry.tools.wmflabs.org
* 20:06 bd808: Running cleanup script to delete all of the old toollabs-* images from docker-registry.tools.wmflabs.org


=== 2022-09-22 ===
* 12:30 taavi: add TheresNoTime to the 'toollabs-trusted' gerrit group [[phab:T317438|T317438]]
* 12:27 taavi: add TheresNoTime as a project admin and to the roots sudo policy [[phab:T317438|T317438]]

=== 2020-07-22 ===
* 23:24 bstorm: created server group 'tools-k8s-worker' to create any new worker nodes in so that they have a low chance of being scheduled together by openstack unless it is necessary [[phab:T258663|T258663]]
* 23:22 bstorm: running puppet and NFS 4.2 remount on tools-k8s-worker-[56-60] [[phab:T257945|T257945]]
* 23:17 bstorm: running puppet and NFS 4.2 remount on tools-k8s-worker-[41-55] [[phab:T257945|T257945]]
* 23:14 bstorm: running puppet and NFS 4.2 remount on tools-k8s-worker-[21-40] [[phab:T257945|T257945]]
* 23:11 bstorm: running puppet and NFS remount on tools-k8s-worker-[1-15] [[phab:T257945|T257945]]
* 23:07 bstorm: disabling puppet on k8s workers to reduce the effect of changing the NFS mount version all at once [[phab:T257945|T257945]]
* 22:28 bstorm: setting tools-k8s-control prefix to mount NFS v4.2 [[phab:T257945|T257945]]
* 22:15 bstorm: set the tools-k8s-control nodes to also use 800MBps to prevent issues with toolforge ingress and api system
* 22:07 bstorm: set the tools-k8s-haproxy-1 (main load balancer for toolforge) to have an egress limit of 800MB per sec instead of the same as all the other servers


=== 2022-09-10 ===
* 07:39 wm-bot2: removing instance tools-prometheus-03 - cookbook ran by taavi@runko

=== 2020-07-21 ===
* 16:09 bstorm: rebooting tools-sgegrid-shadow to remount NFS correctly
* 15:55 bstorm: set the bastion prefix to have explicitly set hiera value of profile::wmcs::nfsclient::nfs_version: '4'


=== 2022-09-07 ===
* 10:22 dcaro: Pushing the new toolforge builder image based on the new 0.8 buildpacks ([[phab:T316854|T316854]])

=== 2020-07-17 ===
* 16:47 bd808: Enabled Puppet on tools-proxy-06 following successful test ([[phab:T102367|T102367]])
* 16:29 bd808: Disabled Puppet on tools-proxy-06 to test nginx config changes manually ([[phab:T102367|T102367]])


=== 2022-09-06 ===
* 08:06 dcaro_away: Published new toolforge-bullseye0-run and toolforge-bullseye0-build images for the toolforge buildpack builder ([[phab:T316854|T316854]])

=== 2020-07-15 ===
* 23:11 bd808: Removed ssh root key for valhallasw from project hiera ([[phab:T255697|T255697]])


=== 2022-08-25 ===
* 10:40 taavi: tagged new version of the python39-web container with a shell implementation of webservice-runner [[phab:T293552|T293552]]

=== 2020-07-09 ===
* 18:53 bd808: Updating git-review to 1.27 via clush across cluster ([[phab:T257496|T257496]])


=== 2022-08-24 ===
* 12:20 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-nginx ({{Gerrit|eba66bc}}) - cookbook ran by taavi@runko
* 12:20 taavi: upgrading ingress-nginx to v1.3

=== 2020-07-08 ===
* 11:16 arturo: merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/610029 -- important change to front-proxy ([[phab:T234617|T234617]])
* 11:11 arturo: live-hacking puppetmaster with https://gerrit.wikimedia.org/r/c/operations/puppet/+/610029 ([[phab:T234617|T234617]])


=== 2022-08-20 ===
* 07:44 dcaro_away: all k8s nodes ready now \o/ ([[phab:T315718|T315718]])
* 07:43 dcaro_away: rebooted tools-k8s-control-2, seemed stuck trying to wait for tools home (nfs?), after reboot came back up ([[phab:T315718|T315718]])
* 07:41 dcaro_away: cloudvirt1023 down took out 3 workers, 1 control, and a grid exec and a weblight, they are taking long to restart, looking ([[phab:T315718|T315718]])

=== 2020-07-07 ===
* 23:22 bd808: Rebuilding all Docker images to pick up webservice v0.73 ([[phab:T234617|T234617]], [[phab:T257229|T257229]])
* 23:19 bd808: Deploying webservice v0.73 via clush ([[phab:T234617|T234617]], [[phab:T257229|T257229]])
* 23:16 bd808: Building webservice v0.73 ([[phab:T234617|T234617]], [[phab:T257229|T257229]])
* 15:01 Reedy: killed python process from tools.experimental-embeddings using a lot of cpu on tools-sgebastion-07
* 15:01 Reedy: killed meno25 process running pwb.py on tools-sgebastion-07
* 09:59 arturo: point DNS tools.wmflabs.org A record to 185.15.56.60 (tools-legacy-redirector) ([[phab:T247236|T247236]])


=== 2022-08-18 ===
* 14:45 andrewbogott: adding lucaswerkmeister as projectadmin ([[phab:T314527|T314527]])
* 14:43 andrewbogott: removing some inactive projectadmins: rush, petrb, mdipietro, jeh, krenair


=== 2020-07-06 ===
* 11:54 arturo: briefly point DNS tools.wmflabs.org A record to 185.15.56.60 (tools-legacy-redirector) and then switch back to 185.15.56.11 (tools-proxy-05). The legacy redirector does HTTP/307 ([[phab:T247236|T247236]])
* 11:50 arturo: associate floating IP address 185.15.56.60 to tools-legacy-redirector ([[phab:T247236|T247236]])


=== 2022-08-17 ===
* 16:34 taavi: kubectl sudo delete cm -n tool-wdml maintain-kubeusers # [[phab:T315459|T315459]]
* 08:30 taavi: failing the grid from the shadow back to the master, some disruption expected


=== 2020-07-01 ===
* 11:19 arturo: cleanup exim email queue (4 frozen messages)
* 11:02 arturo: live-hacking puppetmaster with https://gerrit.wikimedia.org/r/c/operations/puppet/+/608849 ([[phab:T256737|T256737]])


=== 2022-08-16 ===
* 17:28 taavi: fail over docker-registry, tools-docker-registry-06->docker-registry-05


=== 2020-06-30 ===
* 11:18 arturo: set some hiera keys for mtail in puppet prefix `tools-mail` ([[phab:T256737|T256737]])


=== 2022-08-11 ===
* 16:57 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko
* 16:55 taavi: restart puppetdb on tools-puppetdb-1, crashed during the ceph issues


=== 2020-06-29 ===
* 22:48 legoktm: built html-sssd/web image ([[phab:T241817|T241817]])
* 22:23 legoktm: rebuild python<nowiki>{</nowiki>34,35,37<nowiki>}</nowiki>-sssd/web images for https://gerrit.wikimedia.org/r/608093
* 12:01 arturo: introduced spam filter in the mail server ([[phab:T120210|T120210]])


=== 2022-08-05 ===
* 15:08 wm-bot2: removing grid node tools-sgewebgen-10-1.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:05 wm-bot2: removing grid node tools-sgeexec-10-12.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:00 wm-bot2: created node tools-sgewebgen-10-3.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko


=== 2020-06-25 ===
* 21:49 zhuyifei1999_: re-enabling puppet on tools-sgebastion-09 [[phab:T256426|T256426]]
* 21:39 zhuyifei1999_: disabling puppet on tools-sgebastion-09 so I can play with mount settings [[phab:T256426|T256426]]
* 21:24 bstorm: hard rebooting tools-sgebastion-09


=== 2022-08-03 ===
* 15:51 dhinus: recreated jobs-api pods to pick up new ConfigMap
* 15:02 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|c47ac41}}) - cookbook ran by fran@MacBook-Pro.station


=== 2020-06-24 ===
* 12:36 arturo: live-hacking puppetmaster with exim prometheus stuff ([[phab:T175964|T175964]])
* 11:57 arturo: merging email ratelimiting patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/607320 ([[phab:T175964|T175964]])


=== 2022-07-20 ===
* 19:31 taavi: reboot toolserver-proxy-01 to free up disk space probably held by stale file handles
* 08:06 wm-bot2: removing grid node tools-sgeexec-10-6.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko


=== 2020-06-23 ===
* 17:55 arturo: killed procs for users `hamishz` and `msyn` which apparently were tools that should be running in the grid / kubernetes instead
* 16:08 arturo: created acme-chief cert `tools_mail` in the prefix hiera


=== 2022-07-19 ===
* 17:53 wm-bot2: created node tools-sgeexec-10-21.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 17:00 wm-bot2: removing grid node tools-sgeexec-10-3.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 16:58 wm-bot2: removing grid node tools-sgeexec-10-4.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 16:24 wm-bot2: created node tools-sgeexec-10-20.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 15:59 taavi: tag current maintain-kubernetes :beta image as: :latest


=== 2020-06-17 ===
* 10:40 arturo: created VM tools-legacy-redirector, with the corresponding puppet prefix ([[phab:T247236|T247236]], [[phab:T234617|T234617]])


=== 2022-07-17 ===
* 15:52 wm-bot2: removing grid node tools-sgeexec-10-10.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:43 wm-bot2: removing grid node tools-sgeexec-10-2.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 13:26 wm-bot2: created node tools-sgeexec-10-16.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko


=== 2020-06-16 ===
* 23:01 bd808: Building new Docker images to pick up webservice 0.72
* 22:58 bd808: Deploying webservice 0.72 to bastions and grid
* 22:56 bd808: Building webservice 0.72
* 15:10 arturo: merging a patch with changes to the template for keepalived (used in the elastic cluster) https://gerrit.wikimedia.org/r/c/operations/puppet/+/605898


=== 2022-07-14 ===
* 13:48 taavi: rebooting tools-sgeexec-10-2


=== 2020-06-15 ===
* 21:28 bstorm_: cleaned up killgridjobs.sh on the tools bastions [[phab:T157792|T157792]]
* 18:14 bd808: Rebuilding all Docker images to pick up webservice 0.71 ([[phab:T254640|T254640]], [[phab:T253412|T253412]])
* 18:12 bd808: Deploying webservice 0.71 to bastions and grid via clush
* 18:05 bd808: Building webservice 0.71


=== 2022-07-13 ===
* 12:09 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus


=== 2020-06-12 ===
* 13:13 arturo: live-hacking session in the puppetmaster ended
* 13:10 arturo: live-hacking puppet tree in tools-puppetmaster-02 for testing PAWS related patch (they share haproxy puppet code)
* 00:16 bstorm_: remounted NFS for tools-k8s-control-3 and tools-acme-chief-01


=== 2022-07-11 ===
* 16:06 wm-bot2: Increased quotas by <nowiki>{</nowiki>self.increases<nowiki>}</nowiki> ([[phab:T312692|T312692]]) - cookbook ran by nskaggs@x1carbon


=== 2020-06-11 ===
* 23:35 bstorm_: rebooting tools-k8s-control-2 because it seems to be confused on NFS, interestingly enough


=== 2022-07-07 ===
* 07:34 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus


=== 2020-06-04 ===
* 13:32 bd808: Manually restored /etc/haproxy/conf.d/elastic.cfg on tools-elastic-*


=== 2022-06-28 ===
* 17:34 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master ([[phab:T311538|T311538]]) - cookbook ran by dcaro@vulcanus
* 15:51 taavi: add 4096G cinder quota [[phab:T311509|T311509]]


=== 2020-06-02 ===
* 12:23 arturo: renewed TLS cert for k8s metrics-server ([[phab:T250874|T250874]]) following docs: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Certificates#internal_API_access
* 11:00 arturo: renewed TLS cert for prometheus to contact toolforge k8s ([[phab:T250874|T250874]]) following docs: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Certificates#external_API_access
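The recurring "cleaned up grid queue errors" cookbook entries refer to clearing error states on (Son of) Grid Engine queue instances. The cookbook's internals are not shown in the log; this is only a hedged sketch of the equivalent manual commands on the grid master, with a `DRY_RUN` guard (on by default) that prints rather than executes:

```shell
# DRY_RUN=1 (the default) echoes commands instead of executing them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Show queue instances in error (E) state and the reason for each.
run qstat -f -explain E
# Clear the error state on all queue instances.
run sudo qmod -cq '*'
```

`qmod -cq` only clears the flag; if the underlying cause (full disk, NFS trouble) persists, the queue re-enters the error state.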


=== 2022-06-27 ===
* 18:14 taavi: restart calico, appears to have got stuck after the ca replacement operation
* 18:02 taavi: switchover active cron server to tools-sgecron-2 [[phab:T284767|T284767]]
* 17:54 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0915.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:52 wm-bot2: removing grid node tools-sgewebgrid-generic-0902.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:49 wm-bot2: removing grid node tools-sgeexec-0942.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:15 taavi: [[phab:T311412|T311412]] updating ca used by k8s-apiserver->etcd communication, breakage may happen
* 14:58 taavi: renew puppet ca cert and certificate for tools-puppetmaster-02 [[phab:T311412|T311412]]
* 14:50 taavi: backup /var/lib/puppet/server to /root/puppet-ca-backup-2022-06-27.tar.gz on tools-puppetmaster-02 before we do anything else to it [[phab:T311412|T311412]]


=== 2020-06-01 ===
* 23:51 bstorm_: refreshed certs for the custom webhook controllers on the k8s cluster [[phab:T250874|T250874]]
* 00:39 bd808: Ugh. Prior SAL message was about tools-sgeexec-0940
* 00:39 bd808: Compressed /var/log/account/pacct.0 ahead of rotation schedule to free some space on the root partition


=== 2022-06-23 ===
* 17:51 wm-bot2: removing grid node tools-sgeexec-0941.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:49 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0916.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:46 wm-bot2: removing grid node tools-sgewebgrid-generic-0901.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:32 wm-bot2: removing grid node tools-sgeexec-0939.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:30 wm-bot2: removing grid node tools-sgeexec-0938.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:27 wm-bot2: removing grid node tools-sgeexec-0937.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:22 wm-bot2: removing grid node tools-sgeexec-0936.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:19 wm-bot2: removing grid node tools-sgeexec-0935.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:17 wm-bot2: removing grid node tools-sgeexec-0934.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:14 wm-bot2: removing grid node tools-sgeexec-0933.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:11 wm-bot2: removing grid node tools-sgeexec-0932.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:09 wm-bot2: removing grid node tools-sgeexec-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:30 wm-bot2: removing grid node tools-sgeexec-0947.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 13:59 taavi: removing remaining continuous jobs from the stretch grid [[phab:T277653|T277653]]


=== 2020-05-29 ===
* 19:37 bstorm_: adding docker image for paws-public docker-registry.tools.wmflabs.org/paws-public-nginx:openresty [[phab:T252217|T252217]]


=== 2022-06-22 ===
* 15:54 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0917.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:51 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0918.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:47 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0919.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:45 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko


=== 2020-05-28 ===
* 21:19 bd808: Killed 7 python processes run by user 'mattho69' on login.toolforge.org
* 21:06 bstorm_: upgrading tools-k8s-worker-[30-60] to kubernetes 1.16.10 [[phab:T246122|T246122]]
* 17:54 bstorm_: upgraded tools-k8s-worker-[11..15] and starting on -21-29 now [[phab:T246122|T246122]]
* 16:01 bstorm_: kubectl upgraded to 1.16.10 on all bastions [[phab:T246122|T246122]]
* 15:58 arturo: upgrading tools-k8s-worker-[1..10] to 1.16.10 ([[phab:T246122|T246122]])
* 15:41 arturo: upgrading tools-k8s-control-3 to 1.16.10 ([[phab:T246122|T246122]])
* 15:17 arturo: upgrading tools-k8s-control-2 to 1.16.10 ([[phab:T246122|T246122]])
* 15:09 arturo: upgrading tools-k8s-control-1 to 1.16.10 ([[phab:T246122|T246122]])
* 14:49 arturo: cleanup /etc/apt/sources.list.d/ directory in all tools-k8s-* VMs
* 11:27 arturo: merging change to front-proxy: https://gerrit.wikimedia.org/r/c/operations/puppet/+/599139 ([[phab:T253816|T253816]])


=== 2022-06-21 ===
* 15:23 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:20 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:18 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0913.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:07 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0912.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko


=== 2020-05-27 ===
* 17:23 bstorm_: deleting "tools-k8s-worker-20", "tools-k8s-worker-19", "tools-k8s-worker-18", "tools-k8s-worker-17", "tools-k8s-worker-16"


=== 2022-06-03 ===
* 20:07 wm-bot2: created node tools-sgeweblight-10-26.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:51 balloons: Scaling webservice nodes to 20, using new 8G swap flavor [[phab:T309821|T309821]]
* 19:35 wm-bot2: created node tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:03 wm-bot2: created node tools-sgeweblight-10-20.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:01 wm-bot2: created node tools-sgeweblight-10-19.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:00 balloons: depooled old nodes, bringing entirely new grid of nodes online [[phab:T309821|T309821]]
* 18:22 wm-bot2: created node tools-sgeweblight-10-17.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 17:54 wm-bot2: created node tools-sgeweblight-10-16.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 17:52 wm-bot2: created node tools-sgeweblight-10-15.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 16:59 andrewbogott: building a bunch of new lighttpd nodes (beginning with tools-sgeweblight-10-12) using a flavor with more swap space
* 16:56 wm-bot2: created node tools-sgeweblight-10-12.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 15:50 balloons: fix g3.cores4.ram8.disk20.swap24.ephem20 flavor to include swap. Convert to g3.cores4.ram8.disk20.swap8.ephem20 flavor [[phab:T309821|T309821]]
* 15:50 balloons: temp add 1.0G swap to sgeweblight hosts [[phab:T309821|T309821]]
* 13:25 bd808: Upgrading fleet to tools-webservice 0.86 ([[phab:T309821|T309821]])
* 13:20 bd808: publish tools-webservice 0.86 ([[phab:T309821|T309821]])
* 12:46 taavi: start webservicemonitor on tools-sgecron-01 [[phab:T309821|T309821]]
* 10:36 taavi: draining each sgeweblight node one by one, and removing the jobs stuck in 'deleting' too
* 05:05 taavi: removing duplicate (there should be only one per tool) web service jobs from the grid [[phab:T309821|T309821]]
* 04:52 taavi: revert bd808's changes to profile::toolforge::active_proxy_host
* 03:21 bd808: Cleared queue error states after deploying new toolforge-webservice package ([[phab:T309821|T309821]])
* 03:10 bd808: publish tools-webservice 0.85 with hack for [[phab:T309821|T309821]]


=== 2020-05-26 ===
* 18:45 bstorm_: upgrading maintain-kubeusers to match what is in toolsbeta [[phab:T246059|T246059]] [[phab:T211096|T211096]]
* 16:20 bstorm_: fix incorrect volume name in kubeadm-config configmap [[phab:T246122|T246122]]


=== 2022-06-02 ===
* 22:26 bd808: Rebooting tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud. Node is full of jobs that are not tracked by grid master and failing to spawn new jobs sent by the scheduler
* 21:56 bd808: Removed legacy "active_proxy_host" hiera setting
* 21:55 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for profile::toolforge::active_proxy_host key
* 21:41 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for active_redis key
* 21:23 wm-bot2: created node tools-sgeweblight-10-8.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 12:42 wm-bot2: rebooting stretch exec grid workers - cookbook ran by taavi@runko
* 12:13 wm-bot2: created node tools-sgeweblight-10-7.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 12:03 dcaro: refresh prometheus certs ([[phab:T308402|T308402]])
* 11:47 dcaro: refresh registry-admission-controller certs ([[phab:T308402|T308402]])
* 11:42 dcaro: refresh ingress-admission-controller certs ([[phab:T308402|T308402]])
* 11:36 dcaro: refresh volume-admission-controller certs ([[phab:T308402|T308402]])
* 11:24 wm-bot2: created node tools-sgeweblight-10-6.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 11:17 taavi: publish jobutils 1.44 that updates the grid default from stretch to buster [[phab:T277653|T277653]]
* 10:16 taavi: publish tools-webservice 0.84 that updates the grid default from stretch to buster [[phab:T277653|T277653]]
* 09:54 wm-bot2: created node tools-sgeexec-10-14.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko


=== 2020-05-22 ===
* 20:00 bstorm_: rebooted tools-sgebastion-07 to clear up tmp file problems with 10 min warning
* 19:12 bstorm_: running command to delete over 2000 tmp ca certs on tools-bastion-07 [[phab:T253412|T253412]]
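The certificate refreshes logged above follow the Toolforge admin docs; a common first step before refreshing is checking when a cert actually expires. A hedged sketch, demonstrated on a throwaway self-signed certificate since the real webhook cert paths are cluster-specific and not given in the log:

```shell
# Generate a short-lived self-signed cert to stand in for a webhook cert.
tmp="$(mktemp -d)"
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=example.test" \
    -keyout "$tmp/key.pem" -out "$tmp/cert.pem" -days 30 2>/dev/null

# Print the expiry date; on a real cluster, point -in at the webhook's cert.
openssl x509 -noout -enddate -in "$tmp/cert.pem"
```

The output is a single `notAfter=...` line, which is easy to feed into monitoring or a renewal decision.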


=== 2022-06-01 ===
* 11:18 taavi: depool and remove tools-sgeexec-09[07-14]


=== 2020-05-21 ===
* 22:40 bd808: Rebuilding all Docker containers for tools-webservice 0.70 ([[phab:T252700|T252700]])
* 22:36 bd808: Updated tools-webservice to 0.70 across instances ([[phab:T252700|T252700]])
* 22:29 bd808: Building tools-webservice 0.70 via wmcs-package-build.py


=== 2022-05-31 ===
* 16:51 taavi: delete tools-sgeexec-0904 for [[phab:T309525|T309525]] experimentation


=== 2020-05-20 ===
* 09:59 arturo: now running tesseract-ocr v4.1.1-2~bpo9+1 in the Toolforge grid ([[phab:T247422|T247422]])
* 09:50 arturo: `aborrero@cloud-cumin-01:~$ sudo cumin --force -x 'O<nowiki>{</nowiki>project:tools name:tools-sge[bcew].*<nowiki>}</nowiki>' 'apt-get install tesseract-ocr -t stretch-backports -y'` ([[phab:T247422|T247422]])
* 09:35 arturo: `aborrero@cloud-cumin-01:~$ sudo cumin --force -x 'O<nowiki>{</nowiki>project:tools name:tools-sge[bcew].*<nowiki>}</nowiki>' 'rm /etc/apt/sources.lists.d/kubeadm-k8s-component-repo.list ; rm /etc/apt/sources.list.d/repository_thirdparty-kubeadm-k8s-1-15.list ; run-puppet-agent'` ([[phab:T247422|T247422]])
* 09:23 arturo: `aborrero@cloud-cumin-01:~$ sudo cumin --force -x 'O<nowiki>{</nowiki>project:tools name:tools-sge[bcew].*<nowiki>}</nowiki>' 'rm /etc/apt/preferences.d/* ; run-puppet-agent'` ([[phab:T247422|T247422]])


=== 2022-05-30 ===
* 08:24 taavi: depool tools-sgeexec-[0901-0909] (7 nodes total) [[phab:T277653|T277653]]


=== 2020-05-19 ===
* 17:00 bstorm_: deleting/restarting the paws db-proxy pod because it cannot connect to the replicas...and I'm hoping that's due to depooling and such


=== 2022-05-26 ===
* 15:39 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|e6fa299}}) ([[phab:T309146|T309146]]) - cookbook ran by taavi@runko


=== 2020-05-13 ===
* 18:14 bstorm_: upgrading calico to 3.14.0 with typha enabled in Toolforge K8s [[phab:T250863|T250863]]
* 18:10 bstorm_: set "profile::toolforge::k8s::typha_enabled: true" in tools project for calico upgrade [[phab:T250863|T250863]]


=== 2022-05-22 ===
* 17:04 taavi: failover tools-redis to the updated cluster [[phab:T278541|T278541]]
* 16:42 wm-bot2: removing grid node tools-sgeexec-0940.tools.eqiad1.wikimedia.cloud ([[phab:T308982|T308982]]) - cookbook ran by taavi@runko


=== 2020-05-09 ===
* 00:28 bstorm_: added nfs.* to ignored_fs_types for the prometheus::node_exporter params in project hiera [[phab:T252260|T252260]]


=== 2022-05-16 ===
* 14:02 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-nginx ({{Gerrit|7037eca}}) - cookbook ran by taavi@runko


=== 2020-05-08 ===
* 18:17 bd808: Building all jessie-sssd derived images ([[phab:T197930|T197930]])
* 17:29 bd808: Building new jessie-sssd base image ([[phab:T197930|T197930]])


=== 2022-05-14 ===
* 10:47 taavi: hard reboot unresponsive tools-sgeexec-0940


=== 2020-05-07 ===
* 21:51 bstorm_: rebuilding the docker images for Toolforge k8s
* 19:03 bstorm_: toollabs-webservice 0.69 now pushed to the Toolforge bastions
* 18:57 bstorm_: pushing new toollabs-webservice package v0.69 to the tools repos


=== 2022-05-12 ===
* 12:36 taavi: re-enable CronJobControllerV2 [[phab:T308205|T308205]]
* 09:28 taavi: deploy jobs-api update [[phab:T308204|T308204]]
* 09:15 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|e6fa299}}) ([[phab:T308204|T308204]]) - cookbook ran by taavi@runko


=== 2020-05-06 ===
* 21:20 bd808: Kubectl delete node tools-k8s-worker-[16-20] ([[phab:T248702|T248702]])
* 18:24 bd808: Updated "profile::toolforge::k8s::worker_nodes" list in "tools-k8s-haproxy" prefix puppet ([[phab:T248702|T248702]])
* 18:14 bd808: Shutdown tools-k8s-worker-[16-20] instances ([[phab:T248702|T248702]])
* 18:04 bd808: Draining tools-k8s-worker-[16-20] in preparation for decomm ([[phab:T248702|T248702]])
* 17:56 bd808: Cordoned tools-k8s-worker-[16-20] in preparation for decomm ([[phab:T248702|T248702]])
* 00:01 bd808: Joining tools-k8s-worker-60 to the k8s worker pool
* 00:00 bd808: Joining tools-k8s-worker-59 to the k8s worker pool
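The cordon/drain/delete sequence bd808 logs above is the standard way to decommission a Kubernetes worker. A hedged sketch of it; the node name is illustrative, exact drain flags vary by kubectl version, and the `DRY_RUN` guard (on by default) only prints the commands:

```shell
# DRY_RUN=1 (the default) echoes commands instead of executing them.
DRY_RUN="${DRY_RUN:-1}"
NODE="tools-k8s-worker-16"   # illustrative node name

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Stop new pods landing on the node, evict what is running, then remove it.
run kubectl cordon "$NODE"
run kubectl drain "$NODE" --ignore-daemonsets
run kubectl delete node "$NODE"
```

Cordoning first means the scheduler stops placing work on the node while the drain evictions are still in flight.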


=== 2022-05-10 ===
* 15:18 taavi: depool tools-k8s-worker-42 for experiments
* 13:54 taavi: enable distro-wikimedia unattended upgrades [[phab:T290494|T290494]]


=== 2020-05-05 ===
* 23:58 bd808: Joining tools-k8s-worker-58 to the k8s worker pool
* 23:55 bd808: Joining tools-k8s-worker-57 to the k8s worker pool
* 23:53 bd808: Joining tools-k8s-worker-56 to the k8s worker pool
* 21:51 bd808: Building 5 new k8s worker nodes ([[phab:T248702|T248702]])


=== 2022-05-06 ===
* 19:46 bd808: Rebuilt toolforge-perl532-sssd-base & toolforge-perl532-sssd-web to add liblocale-codes-perl ([[phab:T307812|T307812]])


=== 2020-05-04 ===
* 22:08 bstorm_: deleting tools-elastic-01/2/3 [[phab:T236606|T236606]]
* 16:46 arturo: removing the now unused `/etc/apt/preferences.d/toolforge_k8s_kubeadmrepo*` files ([[phab:T250866|T250866]])
* 16:43 arturo: removing the now unused `/etc/apt/sources.list.d/toolforge-k8s-kubeadmrepo.list` file ([[phab:T250866|T250866]])


=== 2022-05-05 ===
* 17:28 taavi: deploy tools-webservice 0.83 [[phab:T307693|T307693]]


=== 2020-04-29 ===
* 22:13 bstorm_: running a fixup script after fixing a bug [[phab:T247455|T247455]]
* 21:28 bstorm_: running the rewrite-psp-preset.sh script across all tools [[phab:T247455|T247455]]
* 16:54 bstorm_: deleted the maintain-kubeusers pod to start running the new image [[phab:T247455|T247455]]
* 16:52 bstorm_: tagged docker-registry.tools.wmflabs.org/maintain-kubeusers:beta to latest to deploy to toolforge [[phab:T247455|T247455]]


=== 2022-05-03 ===
* 08:20 taavi: redis: start replication from the old cluster to the new one ([[phab:T278541|T278541]])


=== 2020-04-28 ===
* 22:58 bstorm_: rebuilding docker-registry.tools.wmflabs.org/maintain-kubeusers:beta [[phab:T247455|T247455]]
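Seeding a replacement Redis cluster from the old one, as in the replication entry above, is usually done with `REPLICAOF`. A hedged sketch under assumed, illustrative hostnames (the real tools-redis instance names are not given here); `DRY_RUN` (on by default) only prints the commands:

```shell
# DRY_RUN=1 (the default) echoes commands instead of executing them.
DRY_RUN="${DRY_RUN:-1}"
OLD="tools-redis-old.example"   # illustrative hostnames
NEW="tools-redis-new.example"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Make the new instance replicate the old, check sync progress, then promote.
run redis-cli -h "$NEW" REPLICAOF "$OLD" 6379
run redis-cli -h "$NEW" INFO replication
run redis-cli -h "$NEW" REPLICAOF NO ONE
```

Promoting with `REPLICAOF NO ONE` only after `INFO replication` shows the replica caught up avoids losing writes during the failover.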


=== 2022-05-02 ===
* 08:54 taavi: restart acme-chief.service [[phab:T307333|T307333]]


=== 2020-04-23 ===
* 19:22 bd808: Increased Kubernetes services quota for bd808-test tool.


=== 2022-04-25 ===
* 14:56 bd808: Rebuilding all docker images to pick up toolforge-webservice v0.82 ([[phab:T214343|T214343]])
* 14:46 bd808: Building toolforge-webservice v0.82


=== 2020-04-21 ===
* 23:06 bstorm_: repooled tools-k8s-worker-38/52, tools-sgewebgrid-lighttpd-0918/9 and tools-sgeexec-0901 [[phab:T250869|T250869]]
* 22:09 bstorm_: depooling tools-sgewebgrid-lighttpd-0918/9 and tools-sgeexec-0901 [[phab:T250869|T250869]]
* 22:02 bstorm_: draining tools-k8s-worker-38 and tools-k8s-worker-52 as they are on the crashed host [[phab:T250869|T250869]]


=== 2022-04-23 ===
* 16:51 bd808: Built new perl532-sssd/<nowiki>{</nowiki>base,web<nowiki>}</nowiki> images and pushed to registry ([[phab:T214343|T214343]])


=== 2020-04-20 ===
* 15:31 bd808: Rebuilding Docker containers to pick up tools-webservice v0.68 ([[phab:T250625|T250625]])
* 14:47 arturo: added joakino to tools.admin LDAP group
* 13:28 jeh: shutdown elasticsearch v5 cluster running Jessie [[phab:T236606|T236606]]
* 12:46 arturo: uploading tools-webservice v0.68 to aptly stretch-tools and update it on relevant servers ([[phab:T250625|T250625]])
* 12:06 arturo: uploaded tools-webservice v0.68 to stretch-toolsbeta for testing
* 11:59 arturo: `root@tools-sge-services-03:~# aptly db cleanup` removed 340 unreferenced packages, and 2 unreferenced files


=== 2022-04-20 ===
* 16:58 taavi: reboot toolserver-proxy-01 to free up disk space from stale file handles(?)
* 07:51 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|8f37a04}}) - cookbook ran by taavi@runko


=== 2020-04-15 ===
* 23:20 bd808: Building ruby25-sssd/base and children ([[phab:T141388|T141388]], [[phab:T250118|T250118]])
* 20:09 jeh: update default security group to allow prometheus01.metricsinfra.eqiad.wmflabs TCP 9100 [[phab:T250206|T250206]]


=== 2022-04-16 ===
* 18:53 wm-bot: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/kubernetes-metrics ({{Gerrit|2c485e9}}) - cookbook ran by taavi@runko


=== 2020-04-14 ===
* 18:26 bstorm_: Deployed new code and RBAC for maintain-kubeusers [[phab:T246123|T246123]]
* 18:19 bstorm_: updating the maintain-kubeusers:latest image [[phab:T246123|T246123]]
* 17:32 bstorm_: updating the maintain-kubeusers:beta image on tools-docker-imagebuilder-01 [[phab:T246123|T246123]]


=== 2022-04-12 ===
* 21:32 bd808: Added komla to Gerrit group 'toollabs-trusted' ([[phab:T305986|T305986]])
* 21:27 bd808: Added komla to 'roots' sudoers policy ([[phab:T305986|T305986]])
* 21:24 bd808: Add komla as projectadmin ([[phab:T305986|T305986]])


=== 2020-04-10 ===
* 21:33 bd808: Rebuilding all Docker images for the Kubernetes cluster ([[phab:T249843|T249843]])
* 19:36 bstorm_: after testing deploying toollabs-webservice 0.67 to tools repos [[phab:T249843|T249843]]
* 14:53 arturo: live-hacking tools-puppetmaster-02 with https://gerrit.wikimedia.org/r/c/operations/puppet/+/587991 for [[phab:T249837|T249837]]


=== 2022-04-10 ===
* 18:43 taavi: deleted `/tmp/dwl02.out-20210915` on tools-sgebastion-07 (not touched since september, taking up 1.3G of disk space)


=== 2020-04-09 ===
* 15:13 bd808: Rebuilding all stretch and buster Docker images. Jessie is broken at the moment due to package version mismatches
* 11:18 arturo: bump nproc limit in bastions https://gerrit.wikimedia.org/r/c/operations/puppet/+/587715 ([[phab:T219070|T219070]])
* 04:29 bd808: Running rebuild_all for Docker images to pick up toollabs-webservice v0.66 [try #2] ([[phab:T154504|T154504]], [[phab:T234617|T234617]])
* 04:19 bd808: python3 build.py --image-prefix toolforge --tag latest --no-cache --push --single jessie-sssd
* 00:20 bd808: Docker rebuild failed in toolforge-python2-sssd-base: "zlib1g-dev : Depends: zlib1g (= 1:1.2.8.dfsg-2+b1) but 1:1.2.8.dfsg-2+deb8u1 is to be installed"


=== 2022-04-09 ===
* 15:30 taavi: manually prune user.log on tools-prometheus-03 to free up some space on /


=== 2020-04-08 ===
* 23:49 bd808: Running rebuild_all for Docker images to pick up toollabs-webservice v0.66 ([[phab:T154504|T154504]], [[phab:T234617|T234617]])
* 23:35 bstorm_: deploy toollabs-webservice v0.66 [[phab:T154504|T154504]] [[phab:T234617|T234617]]


=== 2022-04-08 ===
* 10:44 arturo: disabled debug mode on the k8s jobs-emailer component


=== 2020-04-07 ===
* 20:06 andrewbogott: sss_cache -E on tools-sgebastion-08 and  tools-sgebastion-09
* 20:00 andrewbogott: sss_cache -E on tools-sgebastion-07


=== 2022-04-05 ===
* 07:52 wm-bot: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d7d3463}}) - cookbook ran by arturo@nostromo
* 07:44 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d7d3463}}) - cookbook ran by arturo@nostromo
* 07:21 arturo: deploying toolforge-jobs-framework-cli v7


=== 2020-04-06 ===
* 19:16 bstorm_: deleted tools-redis-1001/2 [[phab:T248929|T248929]]


=== 2022-04-04 ===
* 17:05 wm-bot: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|cbcfc47}}) - cookbook ran by arturo@nostromo
* 16:56 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|cbcfc47}}) - cookbook ran by arturo@nostromo
* 09:28 arturo: deployed toolforge-jobs-framework-cli v6 into aptly and installed it on buster bastions


=== 2020-04-03 ===
* 22:40 bstorm_: shut down tools-redis-1001/2 [[phab:T248929|T248929]]
* 22:32 bstorm_: switch tools-redis-1003 to the active redis server [[phab:T248929|T248929]]
* 20:41 bstorm_: deleting tools-redis-1003/4 to attach them to an anti-affinity group [[phab:T248929|T248929]]
* 18:53 bstorm_: spin up tools-redis-1004 on stretch and connect to cluster [[phab:T248929|T248929]]
* 18:23 bstorm_: spin up tools-redis-1003 on stretch and connect to the cluster [[phab:T248929|T248929]]
* 16:50 bstorm_: launching tools-redis-03 (Buster) to see what happens


=== 2022-03-28 ===
* 09:32 wm-bot: cleaned up grid queue errors on tools-sgegrid-master.tools.eqiad1.wikimedia.cloud ([[phab:T304816|T304816]]) - cookbook ran by arturo@nostromo


=== 2020-03-30 ===
* 18:28 bstorm_: Beginning rolling depool, remount, repool of k8s workers for [[phab:T248702|T248702]]
* 18:22 bstorm_: disabled puppet across tools-k8s-worker-[1-55].tools.eqiad.wmflabs [[phab:T248702|T248702]]
* 16:56 arturo: dropping `_psl.toolforge.org` TXT record ([[phab:T168677|T168677]])


=== 2022-03-15 ===
* 16:57 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-emailer:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-emailer ({{Gerrit|084ee51}}) - cookbook ran by arturo@nostromo
* 11:24 arturo: cleared error state on queue continuous@tools-sgeexec-0939.tools.eqiad.wmflabs (a job took a very long time to be scheduled...)


=== 2020-03-27 ===
* 21:22 bstorm_: removed puppet prefix tools-docker-builder [[phab:T248703|T248703]]
* 21:15 bstorm_: deleted tools-docker-builder-06 [[phab:T248703|T248703]]
* 18:55 bstorm_: launching tools-docker-imagebuilder-01 [[phab:T248703|T248703]]
* 12:52 arturo: install python3-pykube on tools-k8s-control-3 for some tests interaction with the API from python


=== 2022-03-14 ===
* 11:44 arturo: deploy jobs-framework-emailer {{Gerrit|9470a5f339fd5a44c97c69ce97239aef30f5ee41}} ([[phab:T286135|T286135]])
* 10:48 dcaro: pushed v0.33.2 tekton control and webhook images, and bash 5.1.4 to the local repo ([[phab:T297090|T297090]])


=== 2020-03-24 ===
* 11:44 arturo: trying to solve a rebase/merge conflict in labs/private.git in tools-puppetmaster-02
* 11:33 arturo: merging tools-proxy patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/579952/ ([[phab:T234617|T234617]]) (second try with some additional bits in LUA)
* 10:16 arturo: merging tools-proxy patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/579952/ ([[phab:T234617|T234617]])


=== 2022-03-10 ===
* 09:42 arturo: cleaned grid queue error state @ tools-sgewebgrid-generic-0902


=== 2020-03-18 ===
* 19:07 bstorm_: removed role::toollabs::logging::sender from project puppet (it wouldn't work anyway)
* 18:04 bstorm_: removed puppet prefix tools-flannel-etcd [[phab:T246689|T246689]]
* 17:58 bstorm_: removed puppet prefix tools-worker [[phab:T246689|T246689]]
* 17:57 bstorm_: removed puppet prefix tools-k8s-master [[phab:T246689|T246689]]
* 17:36 bstorm_: removed lots of deprecated hiera keys from horizon for the old cluster [[phab:T246689|T246689]]
* 16:59 bstorm_: deleting "tools-worker-1002", "tools-worker-1001", "tools-k8s-master-01", "tools-flannel-etcd-03", "tools-k8s-etcd-03", "tools-flannel-etcd-02", "tools-k8s-etcd-02", "tools-flannel-etcd-01", "tools-k8s-etcd-01" [[phab:T246689|T246689]]


=== 2022-03-01 ===
* 13:41 dcaro: rebooting tools-sgeexec-0916 to clear any state ([[phab:T302702|T302702]])
* 12:11 dcaro: cleared error state queues for sgeexec-0916 ([[phab:T302702|T302702]])
* 10:23 arturo: tools-sgeexec-0913/0916 are depooled, queue errors. Reboot them and clean errors by hand

=== 2020-03-17 ===
* 13:29 arturo: set `profile::toolforge::bastion::nproc: 200` for tools-sgebastion-08 ([[phab:T219070|T219070]])
* 00:08 bstorm_: shut off tools-flannel-etcd-01/02/03 [[phab:T246689|T246689]]


=== 2022-02-28 ===
* 08:02 taavi: reboot sgeexec-0916
* 07:49 taavi: depool tools-sgeexec-0916.tools as it is out of disk space on /

=== 2020-03-16 ===
* 22:01 bstorm_: shut off tools-k8s-etcd-01/02/03 [[phab:T246689|T246689]]
* 22:00 bstorm_: shut off tools-k8s-master-01 [[phab:T246689|T246689]]
* 21:59 bstorm_: shut down tools-worker-1001 and tools-worker-1002 [[phab:T246689|T246689]]


=== 2022-02-17 ===
* 08:23 taavi: deleted tools-clushmaster-02
* 08:14 taavi: made tools-puppetmaster-02 its own client to fix `puppet node deactivate` puppetdb access

=== 2020-03-11 ===
* 17:00 jeh: clean up apt cache on tools-sgebastion-07


=== 2022-02-16 ===
* 00:12 bd808: Image builds completed.

=== 2020-03-06 ===
* 16:25 bstorm_: updating maintain-kubeusers image to filter invalid tool names


=== 2022-02-15 ===
* 23:17 bd808: Image builds failed in buster php image with an apt error. The error looks transient, so starting builds over.
* 23:06 bd808: Started full rebuild of Toolforge containers to pick up webservice 0.81 and other package updates in tmux session on tools-docker-imagebuilder-01
* 22:58 bd808: `sudo apt-get update && sudo apt-get install toolforge-webservice` on all bastions to pick up 0.81
* 22:50 bd808: Built new toollabs-webservice 0.81
* 18:43 bd808: Enabled puppet on tools-proxy-05
* 18:38 bd808: Disabled puppet on tools-proxy-05 for manual testing of nginx config changes
* 18:21 taavi: delete tools-package-builder-03
* 11:49 arturo: invalidate sssd cache in all bastions to debug [[phab:T301736|T301736]]
* 11:16 arturo: purge debian package `unscd` on tools-sgebastion-10/11 for [[phab:T301736|T301736]]
* 11:15 arturo: reboot tools-sgebastion-10 for [[phab:T301736|T301736]]

=== 2020-03-03 ===
* 18:16 jeh: create OpenStack DNS record for elasticsearch.svc.tools.eqiad1.wikimedia.cloud (eqiad1 subdomain change) [[phab:T236606|T236606]]
* 18:02 jeh: create OpenStack DNS record for elasticsearch.svc.tools.eqiad.wikimedia.cloud [[phab:T236606|T236606]]
* 17:31 jeh: create an OpenStack virtual ip address for the new elasticsearch cluster [[phab:T236606|T236606]]
* 10:54 arturo: deleted VMs `tools-worker-[1003-1020]` (legacy k8s cluster) ([[phab:T246689|T246689]])
* 10:51 arturo: cordoned/drained all legacy k8s worker nodes except 1001/1002 ([[phab:T246689|T246689]])


=== 2022-02-10 ===
* 15:07 taavi: shutdown tools-clushmaster-02 [[phab:T298191|T298191]]
* 13:25 wm-bot: trying to join node tools-sgewebgen-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:24 wm-bot: trying to join node tools-sgewebgen-10-1 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:07 wm-bot: trying to join node tools-sgeweblight-10-5 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:06 wm-bot: trying to join node tools-sgeweblight-10-4 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:05 wm-bot: trying to join node tools-sgeweblight-10-3 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:03 wm-bot: trying to join node tools-sgeweblight-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 12:54 wm-bot: trying to join node tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 08:45 taavi: set `profile::base::manage_ssh_keys: true` globally [[phab:T214427|T214427]]
* 08:16 taavi: enable puppetdb and re-enable puppet with puppetdb ssh key management disabled (profile::base::manage_ssh_keys: false) - [[phab:T214427|T214427]]
* 08:06 taavi: disable puppet globally for enabling puppetdb [[phab:T214427|T214427]]

=== 2020-03-02 ===
* 22:26 jeh: starting first pass of elasticsearch data migration to new cluster [[phab:T236606|T236606]]


=== 2022-02-09 ===
* 19:29 taavi: installed tools-puppetdb-1, not configured on puppetmaster side yet [[phab:T214427|T214427]]
* 18:56 wm-bot: pooled 10 grid nodes tools-sgeweblight-10-[1-5],tools-sgewebgen-10-[1,2],tools-sgeexec-10-[1-10] ([[phab:T277653|T277653]]) - cookbook ran by arturo@nostromo
* 18:30 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo
* 18:25 arturo: ignore last message
* 18:24 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo
* 14:04 taavi: created tools-cumin-1/toolsbeta-cumin-1 [[phab:T298191|T298191]]

=== 2020-03-01 ===
* 01:48 bstorm_: old version of kubectl removed. Anyone who needs it can download it with `curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.4.12/bin/linux/amd64/kubectl`
* 01:27 bstorm_: running the force-migrate command to make sure any new kubernetes deployments are on the new cluster.


=== 2022-02-07 ===
* 17:37 taavi: generated authdns_acmechief ssh key and stored password in a text file in local labs/private repository ([[phab:T288406|T288406]])
* 12:52 taavi: updated maintain-kubeusers for [[phab:T301081|T301081]]

=== 2020-02-28 ===
* 22:14 bstorm_: shutting down the old maintain-kubeusers and taking the gloves off the new one (removing --gentle-mode)
* 16:51 bstorm_: node/tools-k8s-worker-15 uncordoned
* 16:44 bstorm_: drained tools-k8s-worker-15 and hard rebooting it because it wasn't happy
* 16:36 bstorm_: rebooting k8s workers 1-35 on the 2020 cluster to clear a strange nologin condition that has been there since the NFS maintenance
* 16:14 bstorm_: rebooted tools-k8s-worker-7 to clear some puppet issues
* 16:00 bd808: Devoicing stashbot in #wikimedia-cloud to reduce irc spam while migrating tools to 2020 Kubernetes cluster
* 15:28 jeh: create OpenStack server group tools-elastic with anti-affinity policy enabled [[phab:T236606|T236606]]
* 15:09 jeh: create 3 new elasticsearch VMs tools-elastic-[1,2,3] [[phab:T236606|T236606]]
* 14:20 jeh: create new puppet prefixes for existing (no change in data) and new elasticsearch VMs
* 04:35 bd808: Joined tools-k8s-worker-54 to 2020 Kubernetes cluster
* 04:34 bd808: Joined tools-k8s-worker-53 to 2020 Kubernetes cluster
* 04:32 bd808: Joined tools-k8s-worker-52 to 2020 Kubernetes cluster
* 04:31 bd808: Joined tools-k8s-worker-51 to 2020 Kubernetes cluster
* 04:28 bd808: Joined tools-k8s-worker-50 to 2020 Kubernetes cluster
* 04:24 bd808: Joined tools-k8s-worker-49 to 2020 Kubernetes cluster
* 04:23 bd808: Joined tools-k8s-worker-48 to 2020 Kubernetes cluster
* 04:21 bd808: Joined tools-k8s-worker-47 to 2020 Kubernetes cluster
* 04:21 bd808: Joined tools-k8s-worker-46 to 2020 Kubernetes cluster
* 04:19 bd808: Joined tools-k8s-worker-45 to 2020 Kubernetes cluster
* 04:14 bd808: Joined tools-k8s-worker-44 to 2020 Kubernetes cluster
* 04:13 bd808: Joined tools-k8s-worker-43 to 2020 Kubernetes cluster
* 04:12 bd808: Joined tools-k8s-worker-42 to 2020 Kubernetes cluster
* 04:10 bd808: Joined tools-k8s-worker-41 to 2020 Kubernetes cluster
* 04:09 bd808: Joined tools-k8s-worker-40 to 2020 Kubernetes cluster
* 04:08 bd808: Joined tools-k8s-worker-39 to 2020 Kubernetes cluster
* 04:07 bd808: Joined tools-k8s-worker-38 to 2020 Kubernetes cluster
* 04:06 bd808: Joined tools-k8s-worker-37 to 2020 Kubernetes cluster
* 03:49 bd808: Joined tools-k8s-worker-36 to 2020 Kubernetes cluster
* 00:50 bstorm_: rebuilt all docker images to include webservice 0.64


=== 2022-02-04 ===
* 22:33 taavi: `root@tools-sgebastion-10:/data/project/ru_monuments/.kube# mv config old_config` # experimenting with [[phab:T301015|T301015]]
* 21:36 taavi: clear error state from some webgrid nodes

=== 2020-02-27 ===
* 23:27 bstorm_: installed toollabs-webservice 0.64 on the bastions
* 23:24 bstorm_: pushed toollabs-webservice version 0.64 to all toolforge repos
* 21:03 jeh: add reindex service account to elasticsearch for data migration [[phab:T236606|T236606]]
* 20:57 bstorm_: upgrading toollabs-webservice to stretch-toolsbeta version for jdk8:testing image only
* 20:19 jeh: update elasticsearch VPS security group to allow toolsbeta-elastic7-1 access on tcp 80 [[phab:T236606|T236606]]
* 18:53 bstorm_: hard rebooted a rather stuck tools-sgecron-01
* 18:20 bd808: Building tools-k8s-worker-[36-55]
* 17:56 bd808: Deleted instances tools-worker-10[21-40]
* 16:14 bd808: Decommissioning tools-worker-10[21-40]
* 16:02 bd808: Drained tools-worker-1021
* 15:51 bd808: Drained tools-worker-1022
* 15:44 bd808: Drained tools-worker-1023 (there is no tools-worker-1024)
* 15:39 bd808: Drained tools-worker-1025
* 15:39 bd808: Drained tools-worker-1026
* 15:11 bd808: Drained tools-worker-1027
* 15:09 bd808: Drained tools-worker-1028 (there is no tools-worker-1029)
* 15:07 bd808: Drained tools-worker-1030
* 15:06 bd808: Uncordoned tools-worker-10[16-20]. Was over optimistic about repacking legacy Kubernetes cluster into 15 instances. Will keep 20 for now.
* 15:00 bd808: Drained tools-worker-1031
* 14:54 bd808: Hard reboot tools-worker-1016. Direct virsh console unresponsive. Stuck in shutdown since 2020-01-22?
* 14:44 bd808: Uncordoned tools-worker-1009.tools.eqiad.wmflabs
* 14:41 bd808: Drained tools-worker-1032
* 14:37 bd808: Drained tools-worker-1033
* 14:35 bd808: Drained tools-worker-1034
* 14:34 bd808: Drained tools-worker-1035
* 14:33 bd808: Drained tools-worker-1036
* 14:33 bd808: Drained tools-worker-10<nowiki>{</nowiki>39,38,37<nowiki>}</nowiki> yesterday but did not !log
* 00:29 bd808: Drained tools-worker-1009 for reboot (NFS flakey)
* 00:11 bd808: Uncordoned tools-worker-1009.tools.eqiad.wmflabs
* 00:08 bd808: Uncordoned tools-worker-1002.tools.eqiad.wmflabs
* 00:02 bd808: Rebooting tools-worker-1002
* 00:00 bd808: Draining tools-worker-1002 to reboot for NFS problems


=== 2022-02-03 ===
* 09:06 taavi: run `sudo apt-get clean` on login-buster/dev-buster to clean up disk space
* 08:01 taavi: restart acme-chief to force renewal of toolserver.org certificate

=== 2020-02-26 ===
* 23:42 bd808: Drained tools-worker-1040
* 23:41 bd808: Cordoned tools-worker-10[16-40] in preparation for shrinking legacy Kubernetes cluster
* 23:12 bstorm_: replacing all tool limit-ranges in the 2020 cluster with a lower cpu request version
* 22:29 bstorm_: deleted pod maintain-kubeusers-6d9c45f4bc-5bqq5 to deploy new image
* 21:06 bstorm_: deleting loads of stuck grid jobs
* 20:27 jeh: rebooting tools-worker-[1008,1015,1021]
* 20:15 bstorm_: rebooting tools-sgegrid-master because it actually had the permissions thing going on still
* 18:03 bstorm_: downtimed toolschecker for nfs maintenance


=== 2022-01-30 ===
* 14:41 taavi: created a neutron port with ip 172.16.2.46 for a service ip for toolforge redis automatic failover [[phab:T278541|T278541]]
* 14:22 taavi: creating a cluster of 3 bullseye redis hosts for [[phab:T278541|T278541]]

=== 2020-02-25 ===
* 15:31 bd808: `wmcs-k8s-enable-cluster-monitor toolschecker`


=== 2022-01-26 ===
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-10 - cookbook ran by arturo@nostromo
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-9 - cookbook ran by arturo@nostromo
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-8 - cookbook ran by arturo@nostromo
* 18:32 wm-bot: depooled grid node tools-sgeexec-10-7 - cookbook ran by arturo@nostromo
* 18:32 wm-bot: depooled grid node tools-sgeexec-10-6 - cookbook ran by arturo@nostromo
* 18:31 wm-bot: depooled grid node tools-sgeexec-10-5 - cookbook ran by arturo@nostromo
* 18:30 wm-bot: depooled grid node tools-sgeexec-10-4 - cookbook ran by arturo@nostromo
* 18:28 wm-bot: depooled grid node tools-sgeexec-10-3 - cookbook ran by arturo@nostromo
* 18:27 wm-bot: depooled grid node tools-sgeexec-10-2 - cookbook ran by arturo@nostromo
* 18:27 wm-bot: depooled grid node tools-sgeexec-10-1 - cookbook ran by arturo@nostromo
* 13:55 arturo: scaling up the buster web grid with 5 lighttd and 2 generic nodes ([[phab:T277653|T277653]])

=== 2020-02-23 ===
* 00:40 Krenair: [[phab:T245932|T245932]]


=== 2022-01-25 ===
* 11:50 wm-bot: reconfiguring the grid by using grid-configurator - cookbook ran by arturo@nostromo
* 11:44 arturo: rebooting buster exec nodes
* 08:34 taavi: sign puppet certificate for tools-sgeexec-10-4

=== 2020-02-21 ===
* 16:02 andrewbogott: moving tools-sgecron-01 to cloudvirt1022


=== 2022-01-24 ===
* 17:44 wm-bot: reconfiguring the grid by using grid-configurator - cookbook ran by arturo@nostromo
* 15:23 arturo: scaling up the grid with 10 buster exec nodes ([[phab:T277653|T277653]])

=== 2020-02-20 ===
* 14:49 andrewbogott: moving tools-k8s-worker-19 and tools-k8s-worker-18 to cloudvirt1022 (as part of draining 1014)
* 00:04 Krenair: Shut off tools-puppetmaster-01 - to be deleted in one week [[phab:T245365|T245365]]


=== 2022-01-20 ===
* 17:05 arturo: drop 9 of the 10 buster exec nodes created earlier. They didn't get DNS records
* 12:56 arturo: scaling up the grid with 10 buster exec nodes ([[phab:T277653|T277653]])

=== 2020-02-19 ===
* 22:05 Krenair: Project-wide hiera change to swap puppetmaster to tools-puppetmaster-02 [[phab:T245365|T245365]]
* 15:36 bstorm_: setting 'puppetmaster: tools-puppetmaster-02.tools.eqiad.wmflabs' on tools-sgeexec-0942 to test new puppetmaster on grid [[phab:T245365|T245365]]
* 11:50 arturo: fix invalid yaml format in horizon puppet prefix 'tools-k8s-haproxy' that prevented clean puppet run in the VMs
* 00:59 bd808: Live hacked the "nginx-configuration" ConfigMap for [[phab:T245426|T245426]] (done several hours ago, but I forgot to !log it)


=== 2022-01-19 ===
* 23:26 bstorm_: added tools-sgegrid-master.tools.eqiad1.wikimedia.cloud and tools-sgegrid-shadow.tools.eqiad1.wikimedia.cloud to gridengine admin host lists
* 17:34 andrewbogott: rebooting tools-sgeexec-0913.tools.eqiad1.wikimedia.cloud to recover from (presumed) fallout from the scratch/nfs move

=== 2020-02-18 ===
* 09:50 arturo: temporarily delete DNS zone tools.wmcloud.org to try re-creating it


=== 2022-01-14 ===
* 19:09 taavi: set /var/run/lighttpd as world-writable on all lighttpd webgrid nodes, [[phab:T299243|T299243]]

=== 2020-02-17 ===
* 18:53 arturo: [[phab:T168677|T168677]] created DNS TXT record _psl.toolforge.org. with value `https://github.com/publicsuffix/list/pull/970`
* 13:22 arturo: relocating tools-sgewebgrid-lighttpd-0914 to cloudvirt1012 to spread same VMs across different hypervisors


=== 2022-01-12 ===
* 11:27 arturo: created puppet prefix `tools-sgeweblight`, drop `tools-sgeweblig`
* 11:03 arturo: created puppet prefix 'tools-sgeweblig'
* 11:02 arturo: created puppet prefix 'toolsbeta-sgeweblig'

=== 2020-02-14 ===
* 00:38 bd808: Added tools-k8s-worker-35 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:34 bd808: Added tools-k8s-worker-34 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:32 bd808: Added tools-k8s-worker-33 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:29 bd808: Added tools-k8s-worker-32 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:25 bd808: Added tools-k8s-worker-31 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:25 bd808: Added tools-k8s-worker-30 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:17 bd808: Added tools-k8s-worker-29 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:15 bd808: Added tools-k8s-worker-28 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:13 bd808: Added tools-k8s-worker-27 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:07 bd808: Added tools-k8s-worker-26 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:03 bd808: Added tools-k8s-worker-25 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])


=== 2022-01-04 ===
* 17:18 bd808: tools-acme-chief-01: sudo service acme-chief restart
* 08:12 taavi: disable puppet & exim4 on [[phab:T298501|T298501]]

=== 2020-02-13 ===
* 23:53 bd808: Added tools-k8s-worker-24 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 23:50 bd808: Added tools-k8s-worker-23 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 23:38 bd808: Added tools-k8s-worker-22 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 21:35 bd808: Deleted tools-sgewebgrid-lighttpd-092<nowiki>{</nowiki>1,2,3,4,5,6,7,8<nowiki>}</nowiki> & tools-sgewebgrid-generic-090<nowiki>{</nowiki>3,4<nowiki>}</nowiki> ([[phab:T244791|T244791]])
* 21:33 bd808: Removed tools-sgewebgrid-lighttpd-092<nowiki>{</nowiki>1,2,3,4,5,6,7,8<nowiki>}</nowiki> & tools-sgewebgrid-generic-090<nowiki>{</nowiki>3,4<nowiki>}</nowiki> from grid engine config ([[phab:T244791|T244791]])
* 17:43 andrewbogott: migrating b24e29d7-a468-4882-9652-9863c8acfb88 to cloudvirt1022
 
=== 2020-02-12 ===
* 19:29 bd808: Rebuilding all Docker images to pick up toollabs-webservice (0.63) ([[phab:T244954|T244954]])
* 19:15 bd808: Deployed toollabs-webservice (0.63) on bastions ([[phab:T244954|T244954]])
* 00:20 bd808: Depooling tools-sgewebgrid-generic-0903 ([[phab:T244791|T244791]])
* 00:19 bd808: Depooling tools-sgewebgrid-generic-0904 ([[phab:T244791|T244791]])
* 00:14 bd808: Depooling tools-sgewebgrid-lighttpd-0921 ([[phab:T244791|T244791]])
* 00:09 bd808: Depooling tools-sgewebgrid-lighttpd-0922 ([[phab:T244791|T244791]])
* 00:05 bd808: Depooling tools-sgewebgrid-lighttpd-0923 ([[phab:T244791|T244791]])
* 00:05 bd808: Depooling tools-sgewebgrid-lighttpd-0924 ([[phab:T244791|T244791]])
 
=== 2020-02-11 ===
* 23:58 bd808: Depooling tools-sgewebgrid-lighttpd-0925 ([[phab:T244791|T244791]])
* 23:56 bd808: Depooling tools-sgewebgrid-lighttpd-0926 ([[phab:T244791|T244791]])
* 23:38 bd808: Depooling tools-sgewebgrid-lighttpd-0927 ([[phab:T244791|T244791]])
 
=== 2020-02-10 ===
* 23:39 bstorm_: updated tools-manifest to 0.21 on aptly for stretch
* 22:51 bstorm_: all docker images now use webservice 0.62
* 22:01 bd808: Manually starting webservices for tools that were running on tools-sgewebgrid-lighttpd-0928 ([[phab:T244791|T244791]])
* 21:47 bd808: Depooling tools-sgewebgrid-lighttpd-0928 ([[phab:T244791|T244791]])
* 21:25 bstorm_: upgraded toollabs-webservice package for tools to 0.62 [[phab:T244293|T244293]] [[phab:T244289|T244289]] [[phab:T234617|T234617]] [[phab:T156626|T156626]]
 
=== 2020-02-07 ===
* 10:55 arturo: drop jessie VM instances tools-prometheus-<nowiki>{</nowiki>01,02<nowiki>}</nowiki> which were shutdown ([[phab:T238096|T238096]])
 
=== 2020-02-06 ===
* 10:44 arturo: merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/565556 which is a behavior change to the Toolforge front proxy ([[phab:T234617|T234617]])
* 10:27 arturo: shutdown again tools-prometheus-01, no longer in use ([[phab:T238096|T238096]])
* 05:07 andrewbogott: cleared out old /tmp and /var/log files on tools-sgebastion-07
 
=== 2020-02-05 ===
* 11:22 arturo: restarting ferm fleet-wide to account for prometheus servers changed IP (but same hostname) ([[phab:T238096|T238096]])
 
=== 2020-02-04 ===
* 11:38 arturo: start tools-prometheus-01 again to sync data to the new tools-prometheus-03/04 VMs ([[phab:T238096|T238096]])
* 11:37 arturo: re-create tools-prometheus-03/04 as 'bigdisk2' instances (300GB) [[phab:T238096|T238096]]
 
=== 2020-02-03 ===
* 14:12 arturo: move tools-prometheus-04 from cloudvirt1022 to cloudvirt1013
* 12:48 arturo: shutdown tools-prometheus-01 and tools-prometheus-02, after fixing the proxy `tools-prometheus.wmflabs.org` to tools-prometheus-03, data synced ([[phab:T238096|T238096]])
* 09:38 arturo: tools-prometheus-01: systemctl stop prometheus@tools. Another try to migrate data to tools-prometheus-<nowiki>{</nowiki>03,04<nowiki>}</nowiki> ([[phab:T238096|T238096]])
 
=== 2020-01-31 ===
* 14:06 arturo: leave tools-prometheus-01 as the backend for tools-prometheus.wmflabs.org for the weekend so grafana dashboards keep working ([[phab:T238096|T238096]])
* 14:00 arturo: syncing again prometheus data from tools-prometheus-01 to tools-prometheus-0<nowiki>{</nowiki>3,4<nowiki>}</nowiki> due to some inconsistencies preventing prometheus from starting ([[phab:T238096|T238096]])
 
=== 2020-01-30 ===
* 21:04 andrewbogott: also apt-get install python3-novaclient  on tools-prometheus-03 and tools-prometheus-04 to suppress cronspam.  Possible real fix for this is https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/569084/
* 20:39 andrewbogott: apt-get install python3-keystoneclient  on tools-prometheus-03 and tools-prometheus-04 to suppress cronspam.  Possible real fix for this is https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/569084/
* 16:27 arturo: create VM tools-prometheus-04 as cold standby of tools-prometheus-03 ([[phab:T238096|T238096]])
* 16:25 arturo: point tools-prometheus.wmflabs.org proxy to tools-prometheus-03 ([[phab:T238096|T238096]])
* 13:42 arturo: disable puppet in prometheus servers while syncing metric data ([[phab:T238096|T238096]])
* 13:15 arturo: drop floating IP 185.15.56.60 and FQDN `prometheus.tools.wmcloud.org` because this is not how the prometheus setup is right now. Use a web proxy instead `tools-prometheus-new.wmflabs.org` ([[phab:T238096|T238096]])
* 13:09 arturo: created FQDN `prometheus.tools.wmcloud.org` pointing to IPv4 185.15.56.60 (tools-prometheus-03) to test [[phab:T238096|T238096]]
* 12:59 arturo: associated floating IPv4 185.15.56.60 to tools-prometheus-03 ([[phab:T238096|T238096]])
* 12:57 arturo: created domain `tools.wmcloud.org` in the tools project after some back and forth with designate, permissions and the database. I plan to use this domain to test the new Debian Buster-based prometheus setup ([[phab:T238096|T238096]])
* 10:20 arturo: create new VM instance tools-prometheus-03 ([[phab:T238096|T238096]])
 
=== 2020-01-29 ===
* 20:07 bd808: Created <nowiki>{</nowiki>bastion,login,dev<nowiki>}</nowiki>.toolforge.org service names for Toolforge bastions using Horizon & Designate
 
=== 2020-01-28 ===
* 13:35 arturo: `aborrero@tools-clushmaster-02:~$ clush -w @exec-stretch 'for i in $(ps aux {{!}} grep [t]ools.j {{!}} awk -F" " "<nowiki>{</nowiki>print \$2<nowiki>}</nowiki>") ; do  echo "killing $i" ; sudo kill $i ; done {{!}}{{!}} true'` ([[phab:T243831|T243831]])
 
=== 2020-01-27 ===
* 07:05 zhuyifei1999_: wrong package. uninstalled. the correct one is bpfcc-tools and seems only available in buster+. [[phab:T115231|T115231]]
* 07:01 zhuyifei1999_: apt installing bcc on tools-worker-1037 to see who is sending SIGTERM, will uninstall after done. dependency: bin86. [[phab:T115231|T115231]]
 
=== 2020-01-24 ===
* 20:58 bd808: Built tools-k8s-worker-21 to test out build script following openstack client upgrade
* 15:45 bd808: Rebuilding all Docker containers again because I failed to actually update the build server git clone properly last time I did this
* 05:23 bd808: Building 6 new tools-k8s-worker instances for the 2020 Kubernetes cluster (take 2)
* 04:41 bd808: Rebuilding all Docker images to pick up webservice-python-bootstrap changes
 
=== 2020-01-23 ===
* 23:38 bd808: Halted tools-k8s-worker build script after first instance (tools-k8s-worker-10) stuck in "scheduling" state for 20 minutes
* 23:16 bd808: Building 6 new tools-k8s-worker instances for the 2020 Kubernetes cluster
* 05:15 bd808: Building tools-elastic-04
* 04:39 bd808: wmcs-openstack quota set --instances 192
* 04:36 bd808: wmcs-openstack quota set --cores 768 --ram 1536000
 
=== 2020-01-22 ===
* 12:43 arturo: for the record, issue with tools-worker-1016 was memory exhaustion apparently
* 12:35 arturo: hard-reboot tools-worker-1016 (not responding to even console access)
 
=== 2020-01-21 ===
* 19:25 bstorm_: hard rebooting tools-sgeexec-0913/14/35 because they aren't even on the network
* 19:17 bstorm_: depooled and rebooted tools-sgeexec-0914 because it was acting funny
* 18:30 bstorm_: depooling and rebooting tools-sgeexec-[0911,0913,0919,0921,0924,0931,0933,0935,0939,0941].tools.eqiad.wmflabs
* 17:21 bstorm_: rebooting toolschecker to recover stale nfs handle
 
=== 2020-01-16 ===
* 23:54 bstorm_: rebooting tools-docker-builder-06 because there are a couple running containers that don't want to die cleanly
* 23:45 bstorm_: rebuilding docker containers to include new webservice version (0.58)
* 23:41 bstorm_: deployed toollabs-webservice 0.58 to everything that isn't a container
* 16:45 bstorm_: ran configurator to set the gridengine web queues to `rerun FALSE` [[phab:T242397|T242397]]
 
=== 2020-01-14 ===
* 15:29 bstorm_: failed the gridengine master back to the master server from the shadow
* 02:23 andrewbogott: rebooting tools-paws-worker-1006  to resolve hangs associated with an old NFS failure
 
=== 2020-01-13 ===
* 17:48 bd808: Running `puppet ca destroy` for each unsigned cert on tools-puppetmaster-01 ([[phab:T242642|T242642]])
* 16:42 bd808: Cordoned and fixed puppet on tools-k8s-worker-12. Rebooting now. [[phab:T242559|T242559]]
* 16:33 bd808: Cordoned and fixed puppet on tools-k8s-worker-11. Rebooting now. [[phab:T242559|T242559]]
* 16:31 bd808: Cordoned and fixed puppet on tools-k8s-worker-10. Rebooting now. [[phab:T242559|T242559]]
* 16:26 bd808: Cordoned and fixed puppet on tools-k8s-worker-9. Rebooting now. [[phab:T242559|T242559]]
 
=== 2020-01-12 ===
* 22:31 Krenair: same on -13 and -14
* 22:28 Krenair: same on -8
* 22:18 Krenair: same on -7
* 22:11 Krenair: Did usual new instance creation puppet dance on tools-k8s-worker-6, /data/project got created
 
=== 2020-01-11 ===
* 01:33 bstorm_: updated toollabs-webservice package to 0.57, which should allow persisting mem and cpu in manifests with burstable qos.
 
=== 2020-01-10 ===
* 23:31 bstorm_: updated toollabs-webservice package to 0.56
* 15:45 bstorm_: depooled tools-paws-worker-1013 to reboot because I think it is the last tools server with that mount issue (I hope)
* 15:35 bstorm_: depooling and rebooting tools-worker-1016 because it still had the leftover mount problems
* 15:30 bstorm_: git stash-ing local puppet changes in hopes that arturo has that material locally, and it doesn't break anything to do so
 
=== 2020-01-09 ===
* 23:35 bstorm_: depooled tools-sgeexec-0939 because it isn't acting right and rebooting it
* 18:26 bstorm_: re-joining the k8s nodes OF THE PAWS CLUSTER to the cluster one at a time to rotate the certs [[phab:T242353|T242353]]
* 18:25 bstorm_: re-joining the k8s nodes to the cluster one at a time to rotate the certs [[phab:T242353|T242353]]
* 18:06 bstorm_: rebooting tools-paws-master-01 [[phab:T242353|T242353]]
* 17:46 bstorm_: refreshing the paws cluster's entire x509 environment [[phab:T242353|T242353]]
 
=== 2020-01-07 ===
* 22:40 bstorm_: rebooted tools-worker-1007 to recover it from disk full and general badness
* 16:33 arturo: deleted by hand pod metrics/cadvisor-5pd46 due to prometheus having issues scraping it
* 15:46 bd808: Rebooting tools-k8s-worker-[6-14]
* 15:35 bstorm_: changed kubeadm-config to use a list instead of a hash for extravols on the apiserver in the new k8s cluster [[phab:T242067|T242067]]
* 14:02 arturo: `root@tools-k8s-control-3:~# wmcs-k8s-secret-for-cert -n metrics -s metrics-server-certs -a metrics-server` ([[phab:T241853|T241853]])
* 13:33 arturo: upload docker-registry.tools.wmflabs.org/coreos/kube-state-metrics:v1.8.0 copied from quay.io/coreos/kube-state-metrics:v1.8.0 ([[phab:T241853|T241853]])
* 13:31 arturo: upload docker-registry.tools.wmflabs.org/metrics-server-amd64:v0.3.6 copied from k8s.gcr.io/metrics-server-amd64:v0.3.6 ([[phab:T241853|T241853]])
* 13:23 arturo: [new k8s] doing changes to kube-state-metrics and metrics-server trying to relocate them to the 'metrics' namespace ([[phab:T241853|T241853]])
* 05:28 bd808: Creating tools-k8s-worker-[6-14] (again)
* 05:20 bd808: Deleting busted tools-k8s-worker-[6-14]
* 05:02 bd808: Creating tools-k8s-worker-[6-14]
* 00:26 bstorm_: repooled tools-sgewebgrid-lighttpd-0919
* 00:17 bstorm_: repooled tools-sgewebgrid-lighttpd-0918
* 00:15 bstorm_: moving tools-sgewebgrid-lighttpd-0918 and -0919 to cloudvirt1004 from cloudvirt1029 to rebalance load
* 00:02 bstorm_: depooled tools-sgewebgrid-lighttpd-0918 and 0919 to move to cloudvirt1004 to improve spread
 
=== 2020-01-06 ===
* 23:40 bd808: Deleted tools-sgewebgrid-lighttpd-09<nowiki>{</nowiki>0[1-9],10<nowiki>}</nowiki>
* 23:36 bd808: Shutdown tools-sgewebgrid-lighttpd-09<nowiki>{</nowiki>0[1-9],10<nowiki>}</nowiki>
* 23:34 bd808: Decommissioned tools-sgewebgrid-lighttpd-09<nowiki>{</nowiki>0[1-9],10<nowiki>}</nowiki>
* 23:13 bstorm_: Repooled tools-sgeexec-0922 because I don't know why it was depooled
* 23:01 bd808: Depooled tools-sgewebgrid-lighttpd-0910.tools.eqiad.wmflabs
* 22:58 bd808: Depooling tools-sgewebgrid-lighttpd-090[2-9]
* 22:57 bd808: Disabling queues on tools-sgewebgrid-lighttpd-090[2-9]
* 21:07 bd808: Restarted kube2proxy on tools-proxy-05 to try and refresh admin tool's routes
* 18:54 bstorm_: edited /etc/fstab to remove NFS and unmounted the nfs volumes tools-k8s-haproxy-1 [[phab:T241908|T241908]]
* 18:49 bstorm_: edited /etc/fstab to remove NFS and rebooted to clear stale mounts on tools-k8s-haproxy-2 [[phab:T241908|T241908]]
* 18:47 bstorm_: added mount_nfs=false to tools-k8s-haproxy puppet prefix [[phab:T241908|T241908]]
* 18:24 bd808: Deleted shutdown instance tools-worker-1029 (was an SSSD testing instance)
* 16:42 bstorm_: failed sge-shadow-master back to the main grid master
* 16:42 bstorm_: Removed files for old S1tty that wasn't working on sge-grid-master
 
=== 2020-01-04 ===
* 18:11 bd808: Shutdown tools-worker-1029
* 18:10 bd808: kubectl delete node tools-worker-1029.tools.eqiad.wmflabs
* 18:06 bd808: Removed tools-worker-1029.tools.eqiad.wmflabs from k8s::worker_hosts hiera in preparation for decom
* 16:54 bstorm_: moving VMs tools-worker-1012/1028/1005 from cloudvirt1024 to cloudvirt1003 due to hardware errors [[phab:T241884|T241884]]
* 16:47 bstorm_: moving VM tools-flannel-etcd-02 from cloudvirt1024 to cloudvirt1003 due to hardware errors [[phab:T241884|T241884]]
* 16:16 bd808: Draining tools-worker-10<nowiki>{</nowiki>05,12,28<nowiki>}</nowiki> due to hardware errors ([[phab:T241884|T241884]])
* 16:13 arturo: moving VM tools-sgewebgrid-lighttpd-0927 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:11 arturo: moving VM tools-sgewebgrid-lighttpd-0926 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:09 arturo: moving VM tools-sgewebgrid-lighttpd-0925 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:08 arturo: moving VM tools-sgewebgrid-lighttpd-0924 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:07 arturo: moving VM tools-sgewebgrid-lighttpd-0923 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:06 arturo: moving VM tools-sgewebgrid-lighttpd-0909 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:04 arturo: moving VM tools-sgeexec-0923 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:02 arturo: moving VM tools-sgeexec-0910 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241873|T241873]])
 
=== 2020-01-03 ===
* 16:48 bstorm_: updated the ValidatingWebhookConfiguration for the ingress admission controller to the working settings
* 11:51 arturo: [new k8s] deploy cadvisor as in https://gerrit.wikimedia.org/r/c/operations/puppet/+/561654 ([[phab:T237643|T237643]])
* 11:21 arturo: upload k8s.gcr.io/cadvisor:v0.30.2 docker image to the docker registry as docker-registry.tools.wmflabs.org/cadvisor:0.30.2 for [[phab:T237643|T237643]]
* 03:04 bd808: Really rebuilding all <nowiki>{</nowiki>jessie,stretch,buster<nowiki>}</nowiki>-sssd images. Last time I forgot to actually update the git clone.
* 00:11 bd808: Rebuilding all stretch-sssd Docker images to pick up busybox
 
=== 2020-01-02 ===
* 23:54 bd808: Rebuilding all buster-sssd Docker images to pick up busybox
 
=== 2019-12-30 ===
* 05:02 andrewbogott: moving tools-worker-1012 to cloudvirt1024 for [[phab:T241523|T241523]]
* 04:49 andrewbogott: draining and rebooting tools-worker-1031, its drive is full
 
=== 2019-12-29 ===
* 01:38 Krenair: Cordoned tools-worker-1012 and deleted pods associated with dplbot and dewikigreetbot as well as my own testing one, host seems to be under heavy load - [[phab:T241523|T241523]]
 
=== 2019-12-27 ===
* 15:06 Krenair: Killed a "python parse_page.py outreachy" process by aikochou that was hogging IO on tools-sgebastion-07
 
=== 2019-12-25 ===
* 16:07 zhuyifei1999_: pkilled 5 `python pwb.py` processes belonging to `tools.kaleem-bot` on tools-sgebastion-07
 
=== 2019-12-22 ===
* 20:13 bd808: Enabled Puppet on tools-proxy-06.tools.eqiad.wmflabs after nginx config test ([[phab:T241310|T241310]])
* 18:52 bd808: Disabled Puppet on tools-proxy-06.tools.eqiad.wmflabs to test nginx config change ([[phab:T241310|T241310]])
 
=== 2019-12-20 ===
* 22:28 bd808: Re-enabled Puppet on tools-sgebastion-09. Reason for disable was "arturo raising systemd limits"
* 11:33 arturo: reboot tools-k8s-control-3 to fix some stale NFS mount issues
 
=== 2019-12-18 ===
* 17:33 bstorm_: updated package in aptly for toollabs-webservice to 0.53
* 11:49 arturo: introduce placeholder DNS records for toolforge.org domain. No services are provided under this domain yet for end users, this is just us testing (SSL, proxy stuff etc). This may be reverted anytime.
 
=== 2019-12-17 ===
* 20:25 bd808: Fixed https://tools.wmflabs.org/ to redirect to https://tools.wmflabs.org/admin/
* 19:21 bstorm_: deployed the changes to the live proxy to enable the new kubernetes cluster [[phab:T234037|T234037]]
* 16:53 bstorm_: maintain-kubeusers app deployed fully in tools for new kubernetes cluster [[phab:T214513|T214513]] [[phab:T228499|T228499]]
* 16:50 bstorm_: updated the maintain-kubeusers docker image for beta and tools
* 04:48 bstorm_: completed first run of maintain-kubeusers 2 in the new cluster [[phab:T214513|T214513]]
* 01:26 bstorm_: running the first run of maintain-kubeusers 2.0 for the new cluster [[phab:T214513|T214513]] (more successfully this time)
* 01:25 bstorm_: unset the immutable bit from 1704 tool kubeconfigs [[phab:T214513|T214513]]
* 01:05 bstorm_: beginning the first run of the new maintain-kubeusers in gentle-mode -- but it was just killed by some files setting the immutable bit [[phab:T214513|T214513]]
* 00:45 bstorm_: enabled encryption at rest on the new k8s cluster
 
=== 2019-12-16 ===
* 22:04 bd808: Added 'ALLOW IPv4 25/tcp from 0.0.0.0/0' to "MTA" security group applied to tools-mail-02
* 19:05 bstorm_: deployed the maintain-kubeusers operations pod to the new cluster
 
=== 2019-12-14 ===
* 10:48 valhallasw`cloud: re-enabling puppet on tools-sgeexec-0912, likely left-over from NFS maintenance (no reason was specified).
 
=== 2019-12-13 ===
* 18:46 bstorm_: updated tools-k8s-control-2 and 3 to the new config as well
* 17:56 bstorm_: updated tools-k8s-control-1 to the new control plane configuration
* 17:47 bstorm_: edited kubeadm-config configMap object to match the new init config
* 17:32 bstorm_: rebooting tools-k8s-control-2 to correct mount issue
* 00:45 bstorm_: rebooting tools-static-13
* 00:28 bstorm_: rebooting the k8s master to clear NFS errors
* 00:15 bstorm_: switch tools-acme-chief config to match the new authdns_servers format upstream
 
=== 2019-12-12 ===
* 23:36 bstorm_: rebooting toolschecker after downtiming the services
* 22:58 bstorm_: rebooting tools-acme-chief-01
* 22:53 bstorm_: rebooting the cron server, tools-sgecron-01 as it wasn't recovered from last night's maintenance
* 11:20 arturo: rolling reboot for all grid & k8s worker nodes due to NFS staleness
* 09:22 arturo: reboot tools-sgeexec-0911 to try fixing weird NFS state
* 08:46 arturo: doing `run-puppet-agent` in all VMs to see state of NFS
* 08:34 arturo: reboot tools-worker-1033/1034 and tools-sgebastion-08 to try to correct NFS mount issues
 
=== 2019-12-11 ===
* 18:13 bd808: Restarted maintain-dbusers on labstore1004. Process had not logged any account creations since 2019-12-01T22:45:45.
* 17:24 andrewbogott: deleted and/or truncated a bunch of logfiles on tools-worker-1031
 
=== 2019-12-10 ===
* 13:59 arturo: set pod replicas to 3 in the new k8s cluster ([[phab:T239405|T239405]])
 
=== 2019-12-09 ===
* 11:06 andrewbogott: deleting unused security groups:  catgraph, devpi, MTA, mysql, syslog, test    [[phab:T91619|T91619]]
 
=== 2019-12-04 ===
* 13:45 arturo: drop puppet prefix `tools-cron`, deprecated and no longer in use
 
=== 2019-11-29 ===
* 11:45 arturo: created 3 new VMs `tools-k8s-worker-[3,4,5]` ([[phab:T239403|T239403]])
* 10:28 arturo: re-arm keyholder in tools-acme-chief-01 (password in labs/private.git @ tools-puppetmaster-01)
* 10:27 arturo: re-arm keyholder in tools-acme-chief-02 (password in labs/private.git @ tools-puppetmaster-01)
 
=== 2019-11-26 ===
* 23:25 bstorm_: rebuilding docker images to include the new webservice 0.52 in all versions instead of just the stretch ones [[phab:T236202|T236202]]
* 22:57 bstorm_: push upgraded webservice 0.52 to the buster and jessie repos for container rebuilds [[phab:T236202|T236202]]
* 19:55 phamhi: drained tools-worker-1002,8,15,32 to rebalance the cluster
* 19:45 phamhi: cleaned up container that was taking up 16G of disk space on tools-worker-1020 in order to re-run puppet client
* 14:01 arturo: drop hiera references to `tools-test-proxy-01.tools.eqiad.wmflabs`. Such VM no longer exists
* 14:00 arturo: introduce the `profile::toolforge::proxies` hiera key in the global puppet config
 
=== 2019-11-25 ===
* 10:35 arturo: refresh puppet certs for tools-k8s-etcd-[4-6] nodes ([[phab:T238655|T238655]])
* 10:35 arturo: add puppet cert SANs via instance hiera to tools-k8s-etcd-[4-6] nodes ([[phab:T238655|T238655]])
 
=== 2019-11-22 ===
* 13:32 arturo: created security group `tools-new-k8s-full-connectivity` and add new k8s VMs to it ([[phab:T238654|T238654]])
* 05:55 jeh: add Riley Huntley `riley` to base tools project
 
=== 2019-11-21 ===
* 12:48 arturo: reboot the new k8s cluster after the upgrade
* 11:49 arturo: upgrading new k8s kubectl version to 1.15.6 ([[phab:T238654|T238654]])
* 11:44 arturo: upgrading new k8s kubelet version to 1.15.6 ([[phab:T238654|T238654]])
* 10:29 arturo: upgrading new k8s cluster version to 1.15.6 using kubeadm ([[phab:T238654|T238654]])
* 10:28 arturo: install kubeadm 1.15.6 on worker/control nodes in the new k8s cluster ([[phab:T238654|T238654]])
 
=== 2019-11-19 ===
* 13:49 arturo: re-create nginx-ingress pod due to deployment template refresh ([[phab:T237643|T237643]])
* 12:46 arturo: deploy changes to tools-prometheus to account for the new k8s cluster ([[phab:T237643|T237643]])
 
=== 2019-11-15 ===
* 14:44 arturo: stop live-hacks on tools-prometheus-01 [[phab:T237643|T237643]]
 
=== 2019-11-13 ===
* 17:20 arturo: live-hacking tools-prometheus-01 to test some experimental configs for the new k8s cluster ([[phab:T237643|T237643]])
 
=== 2019-11-12 ===
* 12:52 arturo: reboot tools-proxy-06 to reset iptables setup [[phab:T238058|T238058]]
 
=== 2019-11-10 ===
* 02:17 bd808: Building new Docker images for [[phab:T237836|T237836]] (retrying after cleaning out old images on tools-docker-builder-06)
* 02:15 bd808: Cleaned up old images on tools-docker-builder-06 using instructions from https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes#Building_toolforge_specific_images
* 02:10 bd808: Building new Docker images for [[phab:T237836|T237836]]
* 01:45 bstorm_: deploying bugfix for webservice in tools and toolsbeta [[phab:T237836|T237836]]
 
=== 2019-11-08 ===
* 22:47 bstorm_: adding rsync::server::wrap_with_stunnel: false to the tools-docker-registry-03/4 servers to unbreak puppet
* 18:40 bstorm_: pushed new webservice package to the bastions [[phab:T230961|T230961]]
* 18:37 bstorm_: pushed new webservice package supporting buster containers to repo [[phab:T230961|T230961]]
* 18:36 bstorm_: pushed buster-sssd images to the docker repo
* 17:15 phamhi: pushed new buster images with the prefix name "toolforge"
 
=== 2019-11-07 ===
* 13:27 arturo: deployed registry-admission-webhook and ingress-admission-controller into the new k8s cluster ([[phab:T236826|T236826]])
* 13:01 arturo: creating puppet prefix `tools-k8s-worker` and a couple of VMs `tools-k8s-worker-[1,2]` [[phab:T236826|T236826]]
* 12:57 arturo: increasing project quota [[phab:T237633|T237633]]
* 11:54 arturo: point `k8s.tools.eqiad1.wikimedia.cloud` to tools-k8s-haproxy-1 [[phab:T236826|T236826]]
* 11:43 arturo: create VMs `tools-k8s-haproxy-[1,2]` [[phab:T236826|T236826]]
* 11:43 arturo: create puppet prefix `tools-k8s-haproxy` [[phab:T236826|T236826]]
 
=== 2019-11-06 ===
* 22:32 bstorm_: added rsync::server::wrap_with_stunnel: false to tools-sge-services prefix to fix puppet
* 21:33 bstorm_: docker images needed for kubernetes cluster upgrade deployed [[phab:T215531|T215531]]
* 20:26 bstorm_: building and pushing docker images needed for kubernetes cluster upgrade
* 16:10 arturo: new k8s cluster control nodes are bootstrapped ([[phab:T236826|T236826]])
* 13:51 arturo: created FQDN `k8s.tools.eqiad1.wikimedia.cloud` pointing to `tools-k8s-control-1` for the initial bootstrap ([[phab:T236826|T236826]])
* 13:50 arturo: created 3 VMs`tools-k8s-control-[1,2,3]` ([[phab:T236826|T236826]])
* 13:43 arturo: created `tools-k8s-control` puppet prefix [[phab:T236826|T236826]]
* 11:57 phamhi: restarted all webservices in grid ([[phab:T233347|T233347]])
 
=== 2019-11-05 ===
* 23:08 Krenair: Dropped {{Gerrit|59a77a3}}, {{Gerrit|3830802}}, and {{Gerrit|83df61f}} from tools-puppetmaster-01:/var/lib/git/labs/private cherry-picks as these are no longer required [[phab:T206235|T206235]]
* 22:49 Krenair: Disassociated floating IP 185.15.56.60 from tools-static-13, traffic to this host goes via the project-proxy now. DNS was already changed a few days ago. [[phab:T236952|T236952]]
* 22:35 bstorm_: upgraded libpython3.4 libpython3.4-dbg libpython3.4-minimal libpython3.4-stdlib python3.4 python3.4-dbg python3.4-minimal to fix an old broken patch [[phab:T237468|T237468]]
* 22:12 bstorm_: pushed docker-registry.tools.wmflabs.org/maintain-kubeusers:beta to the registry to deploy in toolsbeta
* 17:38 phamhi: restarted lighttpd based webservice pods on tools-worker-103x and 1040 ([[phab:T233347|T233347]])
* 17:34 phamhi: restarted lighttpd based webservice pods on tools-worker-102[0-9] ([[phab:T233347|T233347]])
* 17:06 phamhi: restarted lighttpd based webservice pods on tools-worker-101[0-9] ([[phab:T233347|T233347]])
* 16:44 phamhi: restarted lighttpd based webservice pods on tools-worker-100[1-9] ([[phab:T233347|T233347]])
* 13:55 arturo: created 3 new VMs: `tools-k8s-etcd-[4,5,6]` [[phab:T236826|T236826]]
 
=== 2019-11-04 ===
* 14:45 phamhi: Built and pushed ruby25 docker image based on buster ([[phab:T230961|T230961]])
* 14:45 phamhi: Built and pushed golang111 docker image based on buster ([[phab:T230961|T230961]])
* 14:45 phamhi: Built and pushed jdk11 docker image based on buster ([[phab:T230961|T230961]])
* 14:45 phamhi: Built and pushed php73 docker image based on buster ([[phab:T230961|T230961]])
* 11:10 phamhi: Built and pushed python37 docker image based on buster ([[phab:T230961|T230961]])
 
=== 2019-11-01 ===
* 21:00 Krenair: Removed tools-checker.wmflabs.org A record to 208.80.155.229 as that target IP is in the old pre-neutron range that is no longer routed
* 20:57 Krenair: Removed trusty.tools.wmflabs.org CNAME to login-trusty.tools.wmflabs.org as that target record does not exist, presumably deleted ages ago
* 20:56 Krenair: Removed tools-trusty.wmflabs.org CNAME to login-trusty.tools.wmflabs.org as that target record does not exist, presumably deleted ages ago
* 20:38 Krenair: Updated A record for tools-static.wmflabs.org to point towards project-proxy [[phab:T236952|T236952]]
 
=== 2019-10-31 ===
* 18:47 andrewbogott: deleted and/or truncated a bunch of logfiles on tools-worker-1001.  Runaway logfiles filled up the drive which prevented puppet from running.  If puppet had run, it would have prevented the runaway logfiles.
* 13:59 arturo: update puppet prefix `tools-k8s-etcd-` to use the `role::wmcs::toolforge::k8s::etcd` [[phab:T236826|T236826]]
* 13:41 arturo: disabling puppet in tools-k8s-etcd- nodes to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/546995
* 10:15 arturo: SSL cert replacement for tools-docker-registry and tools-k8s-master went fine apparently ([[phab:T236962|T236962]])
* 10:02 arturo: icinga downtime toolschecker for 1h for replacing SSL certs in tools-docker-registry and tools-k8s-master ([[phab:T236962|T236962]])
 
=== 2019-10-30 ===
* 13:53 arturo: replacing SSL cert in tools-proxy-x server apparently OK (merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/545679) [[phab:T235252|T235252]]
* 13:48 arturo: replacing SSL cert in tools-proxy-x server (live-hacking https://gerrit.wikimedia.org/r/c/operations/puppet/+/545679 first for testing) [[phab:T235252|T235252]]
* 13:40 arturo: icinga downtime toolschecker for 1h for replacing SSL cert [[phab:T235252|T235252]]
 
=== 2019-10-29 ===
* 10:49 arturo: deleting VMs tools-test-proxy-01, no longer in use
* 10:07 arturo: deleting old jessie VMs tools-proxy-03/04 [[phab:T235627|T235627]]
 
=== 2019-10-28 ===
* 16:06 arturo: delete VM instance `tools-test-proxy-01` and the puppet prefix `tools-test-proxy`
* 15:54 arturo: tools-proxy-05 has now the 185.15.56.11 floating IP as active proxy. Old one 185.15.56.6 has been freed [[phab:T235627|T235627]]
* 15:54 arturo: shutting down tools-proxy-03 [[phab:T235627|T235627]]
* 15:26 bd808: Killed all processes owned by jem on tools-sgebastion-08
* 15:16 arturo: tools-proxy-05 has now the 185.15.56.5 floating IP as active proxy [[phab:T235627|T235627]]
* 15:14 arturo: refresh hiera to use tools-proxy-05 as active proxy [[phab:T235627|T235627]]
* 15:11 bd808: Killed ircbot.php processes started by jem on tools-sgebastion-08 per request on irc
* 14:58 arturo: added `webproxy` security group to tools-proxy-05 and tools-proxy-06 ([[phab:T235627|T235627]])
* 14:57 phamhi: drained tools-worker-1031.tools.eqiad.wmflabs to clean up disk space
* 14:45 arturo: created VMs tools-proxy-05 and tools-proxy-06 ([[phab:T235627|T235627]])
* 14:43 arturo: adding `role::wmcs::toolforge::proxy` to the `tools-proxy` puppet prefix ([[phab:T235627|T235627]])
* 14:42 arturo: deleted `role::toollabs::proxy` from the `tools-proxy` puppet profile ([[phab:T235627|T235627]])
* 14:34 arturo: icinga downtime toolschecker for 1h ([[phab:T235627|T235627]])
* 12:25 arturo: upload image `coredns` v1.3.1 ({{Gerrit|eb516548c180}}) to docker registry ([[phab:T236249|T236249]])
* 12:23 arturo: upload image `kube-apiserver` v1.15.1 ({{Gerrit|68c3eb07bfc3}}) to docker registry ([[phab:T236249|T236249]])
* 12:22 arturo: upload image `kube-controller-manager` v1.15.1 ({{Gerrit|d75082f1d121}}) to docker registry ([[phab:T236249|T236249]])
* 12:20 arturo: upload image `kube-proxy` v1.15.1 ({{Gerrit|89a062da739d}}) to docker registry ([[phab:T236249|T236249]])
* 12:19 arturo: upload image `kube-scheduler` v1.15.1 ({{Gerrit|b0b3c4c404da}}) to docker registry ([[phab:T236249|T236249]])
* 12:04 arturo: upload image `calico/node` v3.8.0 ({{Gerrit|cd3efa20ff37}}) to docker registry ([[phab:T236249|T236249]])
* 12:03 arturo: upload image `calico/calico/pod2daemon-flexvol` v3.8.0 ({{Gerrit|f68c8f870a03}}) to docker registry ([[phab:T236249|T236249]])
* 12:01 arturo: upload image `calico/cni` v3.8.0 ({{Gerrit|539ca36a4c13}}) to docker registry ([[phab:T236249|T236249]])
* 11:58 arturo: upload image `calico/kube-controllers` v3.8.0 ({{Gerrit|df5ff96cd966}}) to docker registry ([[phab:T236249|T236249]])
* 11:47 arturo: upload image `nginx-ingress-controller` v0.25.1 ({{Gerrit|0439eb3e11f1}}) to docker registry ([[phab:T236249|T236249]])
 
=== 2019-10-24 ===
* 16:32 bstorm_: set the prod rsyslog config for kubernetes to false for Toolforge
 
=== 2019-10-23 ===
* 20:00 phamhi: Rebuilding all jessie and stretch docker images to pick up toollabs-webservice 0.47 ([[phab:T233347|T233347]])
* 12:09 phamhi: Deployed toollabs-webservice 0.47 to buster-tools and stretch-tools ([[phab:T233347|T233347]])
* 09:13 arturo: 9 tools-sgeexec nodes and 6 other related VMs are down because hypervisor is rebooting
* 09:03 arturo: tools-sgebastion-08 is down because hypervisor is rebooting
 
=== 2019-10-22 ===
* 16:56 bstorm_: drained tools-worker-1025.tools.eqiad.wmflabs which was malfunctioning
* 09:25 arturo: created the `tools.eqiad1.wikimedia.cloud.` DNS zone
 
=== 2019-10-21 ===
* 17:32 phamhi: Rebuilding all jessie and stretch docker images to pick up toollabs-webservice 0.46
 
=== 2019-10-18 ===
* 22:15 bd808: Rescheduled continuous jobs away from tools-sgeexec-0904 because of high system load
* 22:09 bd808: Cleared error state of webgrid-generic@tools-sgewebgrid-generic-0901, webgrid-lighttpd@tools-sgewebgrid-lighttpd-09{12,15,19,20,26}
* 21:29 bd808: Rescheduled all grid engine webservice jobs ([[phab:T217815|T217815]])
 
=== 2019-10-16 ===
* 16:21 phamhi: Deployed toollabs-webservice 0.46 to buster-tools and stretch-tools ([[phab:T218461|T218461]])
* 09:29 arturo: toolforge is recovered from the reboot of cloudvirt1029
* 09:17 arturo: due to the reboot of cloudvirt1029, several sgeexec nodes (8) are offline, also sgewebgrid-lighttpd (8) and tools-worker (3) and the main toolforge proxy (tools-proxy-03)
 
=== 2019-10-15 ===
* 17:10 phamhi: restart tools-worker-1035 because it is no longer responding
 
=== 2019-10-14 ===
* 09:26 arturo: cleaned-up updatetools from tools-sge-services nodes ([[phab:T229261|T229261]])
 
=== 2019-10-11 ===
* 19:52 bstorm_: restarted docker on tools-docker-builder after phamhi noticed the daemon had a routing issue (blank iptables)
* 11:55 arturo: create tools-test-proxy-01 VM for testing [[phab:T235059|T235059]] and a puppet prefix for it
* 10:53 arturo: added kubernetes-node_1.4.6-7_amd64.deb to buster-tools and buster-toolsbeta (aptly) for [[phab:T235059|T235059]]
* 10:51 arturo: added docker-engine_1.12.6-0~debian-jessie_amd64.deb to buster-tools and buster-toolsbeta (aptly) for [[phab:T235059|T235059]]
* 10:46 arturo: added logster_0.0.10-2~jessie1_all.deb to buster-tools and buster-toolsbeta (aptly) for [[phab:T235059|T235059]]
 
=== 2019-10-10 ===
* 02:33 bd808: Rebooting tools-sgewebgrid-lighttpd-0903. Instance hung.
 
=== 2019-10-09 ===
* 22:52 jeh: removing test instances tools-sssd-sgeexec-test-[12] from SGE
* 15:32 phamhi: drained tools-worker-1020/23/33/35/36/40 to rebalance the cluster
* 14:46 phamhi: drained and cordoned tools-worker-1029 after status reset on reboot
* 12:37 arturo: drain tools-worker-1038 to rebalance load in the k8s cluster
* 12:35 arturo: uncordon tools-worker-1029 (was disabled for unknown reasons)
* 12:33 arturo: drain tools-worker-1010 to rebalance load
* 10:33 arturo: several sgewebgrid-lighttpd nodes (9) not available because cloudvirt1013 is rebooting
* 10:21 arturo: several worker nodes (7) not available because cloudvirt1012 is rebooting
* 10:08 arturo: several worker nodes (6) not available because cloudvirt1009 is rebooting
* 09:59 arturo: several worker nodes (5) not available because cloudvirt1008 is rebooting
 
=== 2019-10-08 ===
* 19:40 bstorm_: drained tools-worker-1007/8 to rebalance the cluster
* 19:34 bstorm_: drained tools-worker-1009 and then 1014 for rebalancing
* 19:27 bstorm_: drained tools-worker-1005 for rebalancing (and put these back in service as I went)
* 19:24 bstorm_: drained tools-worker-1003 and 1009 for rebalancing
* 15:41 arturo: deleted VM instance tools-sgebastion-0test. No longer in use.
 
=== 2019-10-07 ===
* 20:17 bd808: Dropped backlog of messages for delivery to tools.usrd-tools
* 20:16 bd808: Dropped backlog of messages for delivery to tools.mix-n-match
* 20:13 bd808: Dropped backlog of frozen messages for delivery (240 dropped)
* 19:25 bstorm_: deleted tools-puppetmaster-02
* 19:20 Krenair: reboot tools-k8s-master-01 due to nfs stale issue
* 19:18 Krenair: reboot tools-paws-worker-1006 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1040 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1039 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1038 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1037 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1036 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1035 due to nfs stale issue
* 19:15 phamhi: reboot tools-worker-1034 due to nfs stale issue
* 19:15 phamhi: reboot tools-worker-1033 due to nfs stale issue
* 19:15 phamhi: reboot tools-worker-1032 due to nfs stale issue
* 19:15 phamhi: reboot tools-worker-1031 due to nfs stale issue
* 19:15 phamhi: reboot tools-worker-1030 due to nfs stale issue
* 19:10 Krenair: reboot tools-puppetmaster-02 due to nfs stale issue
* 19:09 Krenair: reboot tools-sgebastion-0test due to nfs stale issue
* 19:08 Krenair: reboot tools-sgebastion-09 due to nfs stale issue
* 19:08 Krenair: reboot tools-sge-services-04 due to nfs stale issue
* 19:07 Krenair: reboot tools-paws-worker-1002 due to nfs stale issue
* 19:06 Krenair: reboot tools-mail-02 due to nfs stale issue
* 19:06 Krenair: reboot tools-docker-registry-03 due to nfs stale issue
* 19:04 Krenair: reboot tools-worker-1029 due to nfs stale issue
* 19:00 Krenair: reboot tools-static-12 tools-docker-registry-04 and tools-clushmaster-02 due to NFS stale issue
* 18:55 phamhi: reboot tools-worker-1028 due to nfs stale issue
* 18:55 phamhi: reboot tools-worker-1027 due to nfs stale issue
* 18:55 phamhi: reboot tools-worker-1026 due to nfs stale issue
* 18:55 phamhi: reboot tools-worker-1025 due to nfs stale issue
* 18:47 phamhi: reboot tools-worker-1023 due to nfs stale issue
* 18:47 phamhi: reboot tools-worker-1022 due to nfs stale issue
* 18:46 phamhi: reboot tools-worker-1021 due to nfs stale issue
* 18:46 phamhi: reboot tools-worker-1020 due to nfs stale issue
* 18:46 phamhi: reboot tools-worker-1019 due to nfs stale issue
* 18:46 phamhi: reboot tools-worker-1018 due to nfs stale issue
* 18:34 phamhi: reboot tools-worker-1017 due to nfs stale issue
* 18:34 phamhi: reboot tools-worker-1016 due to nfs stale issue
* 18:32 phamhi: reboot tools-worker-1015 due to nfs stale issue
* 18:32 phamhi: reboot tools-worker-1014 due to nfs stale issue
* 18:23 phamhi: reboot tools-worker-1013 due to nfs stale issue
* 18:21 phamhi: reboot tools-worker-1012 due to nfs stale issue
* 18:12 phamhi: reboot tools-worker-1011 due to nfs stale issue
* 18:12 phamhi: reboot tools-worker-1010 due to nfs stale issue
* 18:08 phamhi: reboot tools-worker-1009 due to nfs stale issue
* 18:07 phamhi: reboot tools-worker-1008 due to nfs stale issue
* 17:58 phamhi: reboot tools-worker-1007 due to nfs stale issue
* 17:57 phamhi: reboot tools-worker-1006 due to nfs stale issue
* 17:47 phamhi: reboot tools-worker-1005 due to nfs stale issue
* 17:47 phamhi: reboot tools-worker-1004 due to nfs stale issue
* 17:43 phamhi: reboot tools-worker-1002.tools.eqiad.wmflabs due to nfs stale issue
* 17:35 phamhi: drained and uncordoned tools-worker-100[1-5]
* 17:32 bstorm_: reboot tools-sgewebgrid-lighttpd-0912
* 17:30 bstorm_: reboot tools-sgewebgrid-lighttpd-0923/24/08
* 17:01 bstorm_: rebooting tools-sgegrid-master and tools-sgegrid-shadow 😭
* 16:58 bstorm_: rebooting tools-sgewebgrid-lighttpd-0902/4/6/7/8/19
* 16:53 bstorm_: rebooting tools-sgewebgrid-generic-0902/4
* 16:50 bstorm_: rebooting tools-sgeexec-0915/18/19/23/26
* 16:49 bstorm_: rebooting tools-sgeexec-0901 and tools-sgeexec-0909/10/11
* 16:46 bd808: `sudo shutdown -r now` for tools-sgebastion-08
* 16:41 bstorm_: reboot tools-sgebastion-07
* 16:39 bd808: `sudo service nslcd restart` on tools-sgebastion-08
 
=== 2019-10-04 ===
* 21:43 bd808: `sudo exec-manage repool tools-sgeexec-0923.tools.eqiad.wmflabs`
* 21:26 bd808: Rebooting tools-sgeexec-0923 after lots of messing about with a broken update-initramfs build
* 20:35 bd808: Manually running `/usr/bin/python3 /usr/bin/unattended-upgrade` on tools-sgeexec-0923
* 20:33 bd808: Killed 2 /usr/bin/unattended-upgrade procs on tools-sgeexec-0923 that seemed stuck
* 13:33 arturo: remove /etc/init.d/rsyslog on tools-worker-XXXX nodes so the rsyslog deb prerm script doesn't prevent the package from being updated
 
=== 2019-10-03 ===
* 13:05 arturo: delete servers tools-sssd-sgeexec-test-[1,2], no longer required
 
=== 2019-09-27 ===
* 16:59 bd808: Set "profile::rsyslog::kafka_shipper::kafka_brokers: []" in tools-elastic prefix puppet
* 00:40 bstorm_: depooled and rebooted tools-sgewebgrid-lighttpd-0927
 
=== 2019-09-25 ===
* 19:08 andrewbogott: moving tools-sgewebgrid-lighttpd-0903 to cloudvirt1021
 
=== 2019-09-23 ===
* 16:58 bstorm_: deployed tools-manifest 0.20 and restarted webservicemonitor
* 06:01 bd808: Restarted maintain-dbusers process on labstore1004. ([[phab:T233530|T233530]])
 
=== 2019-09-12 ===
* 20:48 phamhi: Deleted tools-puppetdb-01.tools as it is no longer in use
 
=== 2019-09-11 ===
* 13:30 jeh: restart tools-sgeexec-0912
 
=== 2019-09-09 ===
* 22:44 bstorm_: uncordoned tools-worker-1030 and tools-worker-1038
 
=== 2019-09-06 ===
* 15:11 bd808: `sudo kill -9 10635` on tools-k8s-master-01 ([[phab:T194859|T194859]])
 
=== 2019-09-05 ===
* 21:02 bd808: Enabled Puppet on tools-docker-registry-03 and forced puppet run ([[phab:T232135|T232135]])
* 18:13 bd808: Disabled Puppet on tools-docker-registry-03 to investigate docker-registry issue (no phab task yet)
 
=== 2019-09-01 ===
* 20:51 Reedy: `sudo service maintain-kubeusers restart` on tools-k8s-master-01
 
=== 2019-08-30 ===
* 16:54 phamhi: restart maintain-kubeusers service in tools-k8s-master-01
* 16:21 bstorm_: depooling tools-sgewebgrid-lighttpd-0923 to reboot -- high iowait likely from NFS mounts
 
=== 2019-08-29 ===
* 22:18 bd808: Finished building new stretch Docker images for Toolforge Kubernetes use
* 22:06 bd808: Starting process of building new stretch Docker images for Toolforge Kubernetes use
* 22:05 bd808: Jessie Docker image rebuild complete
* 21:31 bd808: Starting process of building new jessie Docker images for Toolforge Kubernetes use
 
=== 2019-08-27 ===
* 19:10 bd808: Restarted maintain-kubeusers after complaint on irc. It was stuck in limbo again
 
=== 2019-08-26 ===
* 21:48 bstorm_: repooled tools-sgewebgrid-generic-0902, tools-sgewebgrid-lighttpd-0902, tools-sgewebgrid-lighttpd-0903 and tools-sgeexec-0905
 
=== 2019-08-18 ===
* 08:11 arturo: restart maintain-kubeusers service in tools-k8s-master-01
 
=== 2019-08-17 ===
* 10:56 arturo: force-reboot tools-worker-1006. Is completely stuck
 
=== 2019-08-15 ===
* 15:32 jeh: upgraded jobutils debian package to 1.38 [[phab:T229551|T229551]]
* 09:22 arturo: restart maintain-kubeusers service in tools-k8s-master-01 because some tools were missing their namespaces
 
=== 2019-08-13 ===
* 22:00 bstorm_: truncated exim paniclog on tools-sgecron-01 because it was being spammy
* 13:41 jeh: Set icinga downtime for toolschecker labs showmount [[phab:T229448|T229448]]
 
=== 2019-08-12 ===
* 16:08 phamhi: updated prometheus-node-exporter from 0.14.0~git20170523-1 to 0.17.0+ds-3 in tools-worker-[1030-1040] nodes ([[phab:T230147|T230147]])
 
=== 2019-08-08 ===
* 19:26 jeh: restarting tools-sgewebgrid-lighttpd-0915 [[phab:T230157|T230157]]
 
=== 2019-08-07 ===
* 19:07 bd808: Disassociated SUL and Phabricator accounts from user Lophi ([[phab:T229713|T229713]])
 
=== 2019-08-06 ===
* 16:18 arturo: add phamhi as user/projectadmin ([[phab:T228942|T228942]]) and delete hpham
* 15:59 arturo: add hpham as user/projectadmin ([[phab:T228942|T228942]])
* 13:44 jeh: disabling puppet on tools-checker-03 while testing nginx timeouts [[phab:T221301|T221301]]
 
=== 2019-08-05 ===
* 22:49 bstorm_: launching tools-worker-1040
* 20:36 andrewbogott: rebooting oom tools-worker-1026
* 16:10 jeh: `tools-k8s-master-01: systemctl restart maintain-kubeusers` [[phab:T229846|T229846]]
* 09:39 arturo: `root@tools-checker-03:~# toolscheckerctl restart` again ([[phab:T229787|T229787]])
* 09:30 arturo: `root@tools-checker-03:~# toolscheckerctl restart` ([[phab:T229787|T229787]])
 
=== 2019-08-02 ===
* 14:00 andrewbogott_: rebooting tools-worker-1022 as it is unresponsive
 
=== 2019-07-31 ===
* 18:07 bstorm_: drained tools-worker-1015/05/03/17 to rebalance load
* 17:41 bstorm_: drained tools-worker-1025 and 1026 to rebalance load
* 17:32 bstorm_: drained tools-worker-1028 to rebalance load
* 17:29 bstorm_: drained tools-worker-1008 to rebalance load
* 17:23 bstorm_: drained tools-worker-1021 to rebalance load
* 17:17 bstorm_: drained tools-worker-1007 to rebalance load
* 17:07 bstorm_: drained tools-worker-1004 to rebalance load
* 16:27 andrewbogott: moving tools-static-12 to cloudvirt1018
* 15:33 bstorm_: [[phab:T228573|T228573]] spinning up 5 worker nodes for kubernetes cluster (tools-worker-1035-9)
 
=== 2019-07-27 ===
* 23:00 zhuyifei1999_: a past probably related ticket: [[phab:T194859|T194859]]
* 22:57 zhuyifei1999_: maintain-kubeusers seems stuck. Traceback: https://phabricator.wikimedia.org/P8812, core dump: /root/core.17898. Restarting
 
=== 2019-07-26 ===
* 17:39 bstorm_: restarted maintain-kubeusers because it was suspiciously tardy and quiet
* 17:14 bstorm_: drained tools-worker-1013.tools.eqiad.wmflabs to rebalance load
* 17:09 bstorm_: draining tools-worker-1020.tools.eqiad.wmflabs to rebalance load
* 16:32 bstorm_: created tools-worker-1034 - [[phab:T228573|T228573]]
* 15:57 bstorm_: created tools-worker-1032 and 1033 - [[phab:T228573|T228573]]
* 15:55 bstorm_: created tools-worker-1031 - [[phab:T228573|T228573]]
 
=== 2019-07-25 ===
* 22:01 bstorm_: [[phab:T228573|T228573]] created tools-worker-1030
* 21:22 jeh: rebooting tools-worker-1016 unresponsive
 
=== 2019-07-24 ===
* 10:14 arturo: reallocating tools-puppetmaster-01 from cloudvirt1027 to cloudvirt1028 ([[phab:T227539|T227539]])
* 10:12 arturo: reallocating tools-docker-registry-04 from cloudvirt1027 to cloudvirt1028 ([[phab:T227539|T227539]])
 
=== 2019-07-22 ===
* 18:39 bstorm_: repooled tools-sgeexec-0905 after reboot
* 18:33 bstorm_: depooled tools-sgeexec-0905 because it's acting kind of weird and not responding to prometheus
* 18:32 bstorm_: repooled tools-sgewebgrid-lighttpd-0902 after restarting the grid-exec service
* 18:28 bstorm_: depooled tools-sgewebgrid-lighttpd-0902 to find out why it is behaving weird
* 17:55 bstorm_: draining tools-worker-1023 since it is having issues
* 17:38 bstorm_: Adding the prometheus servers to the ferm rules via wikitech hiera for kubelet stats [[phab:T228573|T228573]]
 
=== 2019-07-20 ===
* 19:52 andrewbogott: rebooting tools-worker-1023
 
=== 2019-07-17 ===
* 20:23 andrewbogott: migrating tools-sgegrid-shadow to cloudvirt1014
 
=== 2019-07-15 ===
* 14:50 bstorm_: cleared error state from tools-sgeexec-0911 which went offline after error from job {{Gerrit|5190035}}
 
=== 2019-06-25 ===
* 09:30 arturo: detected puppet issue in all VMs: [[phab:T226480|T226480]]
 
=== 2019-06-24 ===
* 17:42 andrewbogott: moving tools-sgeexec-0905 to cloudvirt1015
 
=== 2019-06-17 ===
* 14:07 andrewbogott: moving tools-sgewebgrid-lighttpd-0903 to cloudvirt1015
* 13:59 andrewbogott: moving tools-sgewebgrid-generic-0902 and tools-sgewebgrid-lighttpd-0902 to cloudvirt1015 (optimistic re: [[phab:T220853|T220853]] )
 
=== 2019-06-11 ===
* 18:03 bstorm_: deleted anomalous kubernetes node tools-worker-1019.eqiad.wmflabs
 
=== 2019-06-05 ===
* 18:33 andrewbogott: repooled  tools-sgeexec-0921 and tools-sgeexec-0929
* 18:16 andrewbogott: depooling and moving tools-sgeexec-0921 and tools-sgeexec-0929
 
=== 2019-05-30 ===
* 13:01 arturo: uncordon/repool tools-worker-1001/2/3. They should be fine now. I'm only leaving 1029 cordoned for testing purposes
* 13:01 arturo: reboot tools-worker-1003 to clean up sssd config and let nslcd/nscd start fresh
* 12:47 arturo: reboot tools-worker-1002 to clean up sssd config and let nslcd/nscd start fresh
* 12:42 arturo: reboot tools-worker-1001 to clean up sssd config and let nslcd/nscd start fresh
* 12:35 arturo: enable puppet in tools-worker nodes
* 12:29 arturo: switch hiera setting back to classic/sudoldap for tools-worker because [[phab:T224651|T224651]] ([[phab:T224558|T224558]])
* 12:25 arturo: cordon/drain tools-worker-1002 because [[phab:T224651|T224651]]
* 12:23 arturo: cordon/drain tools-worker-1001 because [[phab:T224651|T224651]]
* 12:22 arturo: cordon/drain tools-worker-1029 because [[phab:T224651|T224651]]
* 12:20 arturo: cordon/drain tools-worker-1003 because [[phab:T224651|T224651]]
* 11:59 arturo: [[phab:T224558|T224558]] repool tools-worker-1003 (using sssd/sudo now!)
* 11:23 arturo: [[phab:T224558|T224558]] depool tools-worker-1003
* 10:48 arturo: [[phab:T224558|T224558]] drop/build a VM for tools-worker-1002. It didn't like the sssd/sudo change :-(
* 10:33 arturo: [[phab:T224558|T224558]] switch tools-worker-1002 to sssd/sudo. Includes drain/depool/reboot/repool
* 10:28 arturo: [[phab:T224558|T224558]] use hiera config in prefix tools-worker for sssd/sudo
* 10:27 arturo: [[phab:T224558|T224558]] switch tools-worker-1001 to sssd/sudo. Includes drain/depool/reboot/repool
* 10:09 arturo: [[phab:T224558|T224558]] disable puppet in all tools-worker- nodes
* 10:01 arturo: [[phab:T224558|T224558]] add tools-worker-1029 to the nodes pool of k8s
* 09:58 arturo: [[phab:T224558|T224558]] reboot tools-worker-1029 after puppet changes for sssd/sudo in jessie
 
=== 2019-05-29 ===
* 11:13 arturo: briefly tested some sssd config changes in tools-sgebastion-09
* 10:13 arturo: enroll the tools-worker-1029 VM into toolforge k8s, but leave it cordoned for sssd testing purposes ([[phab:T221225|T221225]])
* 10:12 arturo: re-create the tools-worker-1001 VM, already enrolled into toolforge k8s
* 09:34 arturo: delete tools-worker-1001, it was totally malfunctioning
 
=== 2019-05-28 ===
* 18:15 arturo: [[phab:T221225|T221225]] for the record, tools-worker-1001 is not working after trying with sssd
* 18:13 arturo: [[phab:T221225|T221225]] created tools-worker-1029 to test sssd/sudo stuff
* 17:49 arturo: [[phab:T221225|T221225]] repool tools-worker-1002 (using nscd/nslcd and sudoldap)
* 17:44 arturo: [[phab:T221225|T221225]] back to classic/ldap hiera config in the tools-worker puppet prefix
* 17:35 arturo: [[phab:T221225|T221225]] hard reboot tools-worker-1001 again
* 17:27 arturo: [[phab:T221225|T221225]] hard reboot tools-worker-1001
* 17:12 arturo: [[phab:T221225|T221225]] depool & switch to sssd/sudo & reboot & repool tools-worker-1002
* 17:09 arturo: [[phab:T221225|T221225]] depool & switch to sssd/sudo & reboot & repool tools-worker-1001
* 17:08 arturo: [[phab:T221225|T221225]] switch to sssd/sudo in puppet prefix for tools-worker
* 13:04 arturo: [[phab:T221225|T221225]] depool and rebooted tools-worker-1001 in preparation for sssd migration
* 12:39 arturo: [[phab:T221225|T221225]] disable puppet in all tools-worker nodes in preparation for sssd
* 12:32 arturo: drop the tools-bastion puppet prefix, unused
* 12:31 arturo: [[phab:T221225|T221225]] set sssd/sudo in the hiera config for the tools-checker prefix, and reboot tools-checker-03
* 12:27 arturo: [[phab:T221225|T221225]] set sssd/sudo in the hiera config for the tools-docker-registry prefix, and reboot tools-docker-registry-[03-04]
* 12:16 arturo: [[phab:T221225|T221225]] set sssd/sudo in the hiera config for the tools-sgebastion prefix, and reboot tools-sgebastion-07/08
* 11:26 arturo: merged change to the sudo module to allow sssd transition
 
=== 2019-05-27 ===
* 09:47 arturo: run `apt-get clean` to wipe 4GB of unused .deb packages, usage on / (root) was > 90% (on tools-sgebastion-08)
* 09:47 arturo: run `apt-get clean` to wipe 4GB of unused .deb packages, usage on / (root) was > 90%
 
=== 2019-05-21 ===
* 12:35 arturo: [[phab:T223992|T223992]] rebooting tools-redis-1002
 
=== 2019-05-20 ===
* 11:25 arturo: [[phab:T223332|T223332]] enable puppet agent in tools-k8s-master and tools-docker-registry nodes and deploy new SSL cert
* 10:53 arturo: [[phab:T223332|T223332]] disable puppet agent in tools-k8s-master and tools-docker-registry nodes
 
=== 2019-05-18 ===
* 11:13 chicocvenancio: PAWS update helm chart to point to new singleuser image ([[phab:T217908|T217908]])
* 09:06 bd808: Rebuilding all stretch docker images to pick up toollabs-webservice 0.45
 
=== 2019-05-17 ===
* 17:36 bd808: Rebuilding all docker images to pick up toollabs-webservice 0.45
* 17:35 bd808: Deployed toollabs-webservice 0.45 (python 3.5 and nodejs 10 containers)
 
=== 2019-05-16 ===
* 11:22 chicocvenancio: PAWS: restart hub to get new configured announcement
* 11:05 chicocvenancio: PAWS: change configmap to reference WMHACK 2019 as busiest time
 
=== 2019-05-15 ===
* 16:20 arturo: [[phab:T223148|T223148]] repool both tools-sgeexec-0921 and -0929
* 15:32 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0921 and move to cloudvirt1014
* 15:32 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0920 and move to cloudvirt1014
* 12:29 arturo: [[phab:T223148|T223148]] repool both tools-sgeexec-09[37,39]
* 12:13 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0937 and move to cloudvirt1008
* 12:13 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0939 and move to cloudvirt1007
* 11:34 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0940
* 11:20 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0940 and move to cloudvirt1006
* 11:11 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0941
* 10:46 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0941 and move to cloudvirt1005
* 09:44 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0901
* 09:00 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0901 and reallocate to cloudvirt1004
 
=== 2019-05-14 ===
* 17:12 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0920
* 16:37 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0920 and reallocate to cloudvirt1003
* 16:36 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0911
* 15:56 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0911 and reallocate to cloudvirt1003
* 15:52 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0909
* 15:24 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0909 and reallocate to cloudvirt1002
* 15:24 arturo: [[phab:T223148|T223148]] last SAL entry is bogus, please ignore (depool tools-worker-1009)
* 15:23 arturo: [[phab:T223148|T223148]] depool tools-worker-1009
* 15:13 arturo: [[phab:T223148|T223148]] repool tools-worker-1023
* 13:16 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0942
* 13:03 arturo: [[phab:T223148|T223148]] repool tools-sgewebgrid-generic-0904
* 12:58 arturo: [[phab:T223148|T223148]] reallocating tools-worker-1023 to cloudvirt1001
* 12:56 arturo: [[phab:T223148|T223148]] depool tools-worker-1023
* 12:52 arturo: [[phab:T223148|T223148]] reallocating tools-sgeexec-0942 to cloudvirt1001
* 12:50 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0942
* 12:49 arturo: [[phab:T223148|T223148]] reallocating tools-sgewebgrid-generic-0904 to cloudvirt1001
* 12:43 arturo: [[phab:T223148|T223148]] depool tools-sgewebgrid-generic-0904
 
=== 2019-05-13 ===
* 08:15 zhuyifei1999_: `truncate -s 0 /var/log/exim4/paniclog` on tools-sgecron-01.tools.eqiad.wmflabs & tools-sgewebgrid-lighttpd-0921.tools.eqiad.wmflabs
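The paniclog cleanup above is a plain file truncation; here is a minimal sketch of the same operation run against a throwaway temp file (a stand-in, since the real /var/log/exim4/paniclog needs root on the affected host):

```shell
# Simulate the `truncate -s 0` paniclog cleanup from the entry above,
# using a temp file instead of the real /var/log/exim4/paniclog.
tmplog="$(mktemp)"
printf 'exim panic entries\n' > "$tmplog"  # pretend the paniclog has content
truncate -s 0 "$tmplog"                    # zero the file, as in the SAL entry
wc -c < "$tmplog"                          # file size is now 0
rm -f "$tmplog"
```

Truncating to zero (rather than deleting) keeps the file in place with its ownership and permissions, so exim can keep appending to it without recreating it.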
 
=== 2019-05-07 ===
* 14:38 arturo: [[phab:T222718|T222718]] uncordon tools-worker-1019, I couldn't find a reason for it to be cordoned
* 14:31 arturo: [[phab:T222718|T222718]] reboot tools-worker-1009 and 1022 after being drained
* 14:28 arturo: k8s drain tools-worker-1009 and 1022
* 11:46 arturo: [[phab:T219362|T219362]] enable puppet in tools-redis servers and use the new puppet role
* 11:33 arturo: [[phab:T219362|T219362]] disable puppet in tools-redis servers for puppet code cleanup
* 11:12 arturo: [[phab:T219362|T219362]] drop the `tools-services` puppet prefix (we are actually using `tools-sgeservices`)
* 11:10 arturo: [[phab:T219362|T219362]] enable puppet in tools-static servers and use new puppet role
* 11:01 arturo: [[phab:T219362|T219362]] disable puppet in tools-static servers for puppet code cleanup
* 10:16 arturo: [[phab:T219362|T219362]] drop the `tools-webgrid-lighttpd` puppet prefix
* 10:14 arturo: [[phab:T219362|T219362]] drop the `tools-webgrid-generic` puppet prefix
* 10:06 arturo: [[phab:T219362|T219362]] drop the `tools-exec-1` puppet prefix
 
=== 2019-05-06 ===
* 11:34 arturo: [[phab:T221225|T221225]] reenable puppet
* 10:53 arturo: [[phab:T221225|T221225]] disable puppet in all toolforge servers for testing sssd patch (puppetmaster livehack)
 
=== 2019-05-03 ===
* 09:43 arturo: fixed puppet in tools-puppetdb-01 too
* 09:39 arturo: puppet should be now fine across toolforge (except tools-puppetdb-01 which is WIP I think)
* 09:37 arturo: fix puppet in tools-elastic-03, archived jessie repos, weird rsyslog-kafka package situation
* 09:33 arturo: fix puppet in tools-elastic-02, archived jessie repos, weird rsyslog-kafka package situation
* 09:24 arturo: fix puppet in tools-elastic-01, archived jessie repos, weird rsyslog-kafka package situation
* 09:18 arturo: solve a weird apt situation in tools-puppetmaster-01 regarding the rsyslog-kafka package (puppet agent was failing)
* 09:16 arturo: solve a weird apt situation in tools-worker-1028 regarding the rsyslog-kafka package
 
=== 2019-04-30 ===
* 12:50 arturo: enable puppet in all servers [[phab:T221225|T221225]]
* 12:45 arturo: adding `sudo_flavor: sudo` hiera config to all puppet prefixes with sssd ([[phab:T221225|T221225]])
* 12:45 arturo: adding `sudo_flavor: sudo` hiera config to all puppet prefixes with sssd
* 11:07 arturo: [[phab:T221225|T221225]] disable puppet in toolforge
* 10:56 arturo: [[phab:T221225|T221225]] create tools-sgebastion-0test for more sssd tests
 
=== 2019-04-29 ===
* 11:22 arturo: [[phab:T221225|T221225]] re-enable puppet agent in all toolforge servers
* 10:27 arturo: [[phab:T221225|T221225]] reboot tool-sgebastion-09 for testing sssd
* 10:21 arturo: disable puppet in all servers to livehack tools-puppetmaster-01 to test [[phab:T221225|T221225]]
* 08:29 arturo: cleanup disk in tools-sgebastion-09, was full of debug logs and unused apt packages
 
=== 2019-04-26 ===
* 12:20 andrewbogott: rescheduling every pod everywhere
* 12:18 andrewbogott: rescheduling all pods on tools-worker-1023.tools.eqiad.wmflabs
 
=== 2019-04-25 ===
* 12:49 arturo: [[phab:T221225|T221225]] using `profile::ldap::client::labs::client_stack: sssd` in horizon for tools-sgebastion-09 (testing)
* 11:43 arturo: [[phab:T221793|T221793]] removing prometheus crontab and letting puppet agent re-create it again to resolve staleness
 
=== 2019-04-24 ===
* 12:54 arturo: puppet broken, fixing right now
* 09:18 arturo: [[phab:T221225|T221225]] reallocating tools-sgebastion-09 to cloudvirt1008
 
=== 2019-04-23 ===
* 15:26 arturo: [[phab:T221225|T221225]] rebooting tools-sgebastion-08 to cleanup sssd
* 15:19 arturo: [[phab:T221225|T221225]] creating tools-sgebastion-09 for testing sssd stuff
* 13:06 arturo: [[phab:T221225|T221225]] use `profile::ldap::client::labs::client_stack: classic` in the puppet bastion prefix, again. Rollback again.
* 12:57 arturo: [[phab:T221225|T221225]] use `profile::ldap::client::labs::client_stack: sssd` in the puppet bastion prefix, try again with sssd in the bastions, reboot them
* 10:28 arturo: [[phab:T221225|T221225]] use `profile::ldap::client::labs::client_stack: classic` in the puppet bastion prefix
* 10:27 arturo: [[phab:T221225|T221225]] rebooting tools-sgebastion-07 to clean sssd configuration
* 10:16 arturo: [[phab:T221225|T221225]] disable puppet in tools-sgebastion-08 for sssd testing
* 09:49 arturo: [[phab:T221225|T221225]] run puppet agent in the bastions and reboot them with sssd
* 09:43 arturo: [[phab:T221225|T221225]] use `profile::ldap::client::labs::client_stack: sssd` in the puppet bastion prefix
* 09:41 arturo: [[phab:T221225|T221225]] disable puppet agent in the bastions
 
=== 2019-04-17 ===
* 12:09 arturo: [[phab:T221225|T221225]] rebooting bastions to clean sssd. We are back to nscd/nslcd until we figure out what's wrong here
* 11:59 arturo: [[phab:T221205|T221205]] sssd was deployed successfully into all webgrid nodes
* 11:39 arturo: deploy sssd to tools-sge-services-03/04 (includes reboot)
* 11:31 arturo: reboot bastions for sssd deployment
* 11:30 arturo: deploy sssd to bastions
* 11:24 arturo: disable puppet in bastions to deploy sssd
* 09:52 arturo: [[phab:T221205|T221205]] tools-sgewebgrid-lighttpd-0915 requires some manual intervention because issues in the dpkg database prevent deleting the nscd/nslcd packages
* 09:45 arturo: [[phab:T221205|T221205]] tools-sgewebgrid-lighttpd-0913 requires some manual intervention because unconfigured packages prevent a clean puppet agent run
* 09:12 arturo: [[phab:T221205|T221205]] start deploying sssd to sgewebgrid nodes
* 09:00 arturo: [[phab:T221205|T221205]] add `profile::ldap::client::labs::client_stack: sssd` in horizon for the puppet prefixes `tools-sgewebgrid-lighttpd` and `tools-sgewebgrid-generic`
* 08:57 arturo: [[phab:T221205|T221205]] disable puppet in all tools-sgewebgrid-* nodes
 
=== 2019-04-16 ===
* 20:49 chicocvenancio: change paws announcement in configmap hub-config back to a welcome message
* 17:15 chicocvenancio: add paws outage announcement in configmap hub-config
* 17:00 andrewbogott: moving tools-k8s-master-01 to eqiad1-r
 
=== 2019-04-15 ===
* 18:50 andrewbogott: moving tools-elastic-01 to cloudvirt1008 to make spreadcheck happy
* 15:01 andrewbogott: moving tools-redis-1001 to eqiad1-r
 
=== 2019-04-14 ===
* 16:23 andrewbogott: moved all tools-worker nodes off of cloudvirt1015 and uncordoned them
 
=== 2019-04-13 ===
* 21:09 bstorm_: Moving tools-prometheus-01 to cloudvirt1009 and tools-clushmaster-02 to cloudvirt1008 for [[phab:T220853|T220853]]
* 20:36 bstorm_: moving tools-elastic-02 to cloudvirt1009 for [[phab:T220853|T220853]]
* 19:58 bstorm_: started migrating tools-k8s-etcd-03 to cloudvirt1012 [[phab:T220853|T220853]]
* 19:51 bstorm_: started migrating tools-flannel-etcd-02 to cloudvirt1013 [[phab:T220853|T220853]]
 
=== 2019-04-11 ===
* 22:38 andrewbogott: moving tools-paws-worker-1005 to cloudvirt1009 to make spreadcheck happier
* 21:49 bd808: Re-enabled puppet on tools-elastic-02 and forced puppet run
* 21:44 andrewbogott: moving tools-mail-02 to eqiad1-r
* 20:56 andrewbogott: shutting down tools-logs-02 — seems unused
* 19:44 bd808: Disabled puppet on tools-elastic-02 and set to 1-node master
* 19:34 andrewbogott: moving tools-puppetmaster-01 to eqiad1-r
* 15:40 andrewbogott: moving tools-redis-1002  to eqiad1-r
* 13:52 andrewbogott: moving tools-prometheus-01 and tools-elastic-01 to eqiad1-r
* 12:01 arturo: [[phab:T151704|T151704]] deploying oidentd
* 11:54 arturo: disable puppet in all hosts to deploy oidentd
* 02:33 andrewbogott: tools-paws-worker-1005,  tools-paws-worker-1006 to eqiad1-r
* 00:03 andrewbogott: tools-paws-worker-1002,  tools-paws-worker-1003 to eqiad1-r
 
=== 2019-04-10 ===
* 23:58 andrewbogott: moving tools-clushmaster-02, tools-elastic-03 and tools-paws-worker-1001 to eqiad1-r
* 18:52 bstorm_: depooled and rebooted tools-sgeexec-0929 because systemd was in a weird state
* 18:46 bstorm_: depooled and rebooted tools-sgewebgrid-lighttpd-0913 because high load was caused by ancient lsof processes
* 14:49 bstorm_: cleared E state from 5 queues
* 13:06 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0906
* 12:31 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0926
* 12:27 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0925
* 12:06 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0901
* 11:55 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0924
* 11:47 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0921
* 11:23 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0940
* 11:03 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0928
* 10:49 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0923
* 10:43 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0915
* 10:27 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0935
* 10:19 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0914
* 10:02 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0907
* 09:41 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0918
* 09:27 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0932
* 09:26 arturo: [[phab:T218216|T218216]] hard reboot tools-sgeexec-0932
* 09:04 arturo: [[phab:T218216|T218216]] add `profile::ldap::client::labs::client_stack: sssd` to prefix puppet for sge-exec nodes
* 09:03 arturo: [[phab:T218216|T218216]] do a controlled rollover of sssd, depooling sgeexec nodes, reboot and repool
* 08:39 arturo: [[phab:T218216|T218216]] disable puppet in all tools-sgeexec-XXXX nodes for controlled sssd rollout
* 00:32 andrewbogott: migrating tools-worker-1022, 1023, 1025, 1026 to eqiad1-r
 
=== 2019-04-09 ===
* 22:04 bstorm_: added the new region on port 80 to the elasticsearch security group for stashbot
* 21:16 andrewbogott: moving tools-flannel-etcd-03 to eqiad1-r
* 20:43 andrewbogott: moving tools-worker-1018, 1019, 1020, 1021 to eqiad1-r
* 20:04 andrewbogott: moving tools-k8s-etcd-03 to eqiad1-r
* 19:54 andrewbogott: moving tools-flannel-etcd-02 to eqiad1-r
* 18:36 andrewbogott: moving tools-worker-1016, tools-worker-1017 to eqiad1-r
* 18:05 andrewbogott: migrating tools-k8s-etcd-02 to eqiad1-r
* 18:00 andrewbogott: migrating tools-flannel-etcd-01 to eqiad1-r
* 17:36 andrewbogott: moving tools-worker-1014, tools-worker-1015 to eqiad1-r
* 17:05 andrewbogott: migrating  tools-k8s-etcd-01 to eqiad1-r
* 15:56 andrewbogott: moving tools-worker-1012, tools-worker-1013 to eqiad1-r
* 14:56 bstorm_: cleared 4 queues on gridengine of E status (ldap again)
* 14:07 andrewbogott: moving tools-worker-1010, tools-worker-1011, tools-worker-1001 to eqiad1-r
* 03:48 andrewbogott: moving tools-worker-1008 and tools-worker-1009 to eqiad1-r
* 02:07 bstorm_: reloaded ferm on tools-flannel-etcd-0[1-3] so the k8s node moves would register
 
=== 2019-04-08 ===
* 22:36 andrewbogott: moving tools-worker-1006 and tools-worker-1007 to eqiad1-r
* 20:03 andrewbogott: moving tools-worker-1003 and tools-worker-1004 to eqiad1-r
 
=== 2019-04-07 ===
* 16:54 zhuyifei1999_: tools-sgeexec-0928 unresponsive since around 22 UTC. No data on Graphite. Can't ssh in even as root. Hard rebooting via Horizon
* 01:06 bstorm_: cleared E state from 6 queues
 
=== 2019-04-05 ===
* 15:44 bstorm_: cleared E state from two exec queues
 
=== 2019-04-04 ===
* 21:21 bd808: Uncordoned tools-worker-1013.tools.eqiad.wmflabs after reboot and forced puppet run
* 20:53 bd808: Rebooting tools-worker-1013
* 20:50 bd808: Draining tools-worker-1013.tools.eqiad.wmflabs
* 20:29 bd808: Released floating IP and deleted instance tools-checker-01 via Horizon
* 20:28 bd808: Shutdown tools-checker-01 via Horizon
* 20:17 bd808: Repooled tools-webgrid-lighttpd-0906 after reboot, apt-get dist-upgrade, and forced puppet run
* 20:13 bd808: Hard reboot of tools-sgewebgrid-lighttpd-0906 via Horizon
* 20:09 bd808: Repooled tools-webgrid-lighttpd-0912 after reboot, apt-get dist-upgrade, and forced puppet run
* 20:05 bd808: Depooled and rebooted tools-sgewebgrid-lighttpd-0912
* 20:05 bstorm_: rebooted tools-webgrid-lighttpd-0912
* 20:03 bstorm_: depooled  tools-webgrid-lighttpd-0912
* 19:59 bstorm_: depooling and rebooting tools-webgrid-lighttpd-0906
* 19:43 bd808: Repooled tools-sgewebgrid-lighttpd-0926 after reboot, apt-get dist-upgrade, and forced puppet run
* 19:36 bd808: Hard reboot of tools-sgewebgrid-lighttpd-0926 via Horizon
* 19:30 bd808: Rebooting tools-sgewebgrid-lighttpd-0926
* 19:28 bd808: Depooled tools-sgewebgrid-lighttpd-0926
* 19:13 bstorm_: cleared E state from 7 queues
* 17:32 andrewbogott: moving tools-static-12 to cloudvirt1023 to keep the two static nodes off the same host
 
=== 2019-04-03 ===
* 11:22 arturo: puppet breakage due to me introducing the openstack-mitaka-jessie repo by mistake. Cleaning up already
 
=== 2019-04-02 ===
* 12:11 arturo: icinga downtime toolschecker for 1 month [[phab:T219243|T219243]]
* 03:55 bd808: Added etcd service group to tools-k8s-etcd-* ([[phab:T219243|T219243]])
 
=== 2019-04-01 ===
* 19:44 bd808: Deleted tools-checker-02 via Horizon ([[phab:T219243|T219243]])
* 19:43 bd808: Shutdown tools-checker-02 via Horizon ([[phab:T219243|T219243]])
* 16:53 bstorm_: cleared E state on 6 grid queues
* 14:54 andrewbogott: moving tools-static-12 to eqiad1-r (for real this time maybe)
 
=== 2019-03-29 ===
* 21:13 bstorm_: depooled tools-sgewebgrid-generic-0903 because of some stuck jobs and odd load characteristics
* 21:09 bd808: Updated cherry-pick of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/500095/ on tools-puppetmaster-01 ([[phab:T219243|T219243]])
* 20:48 bd808: Using root console to fix broken initial puppet run on tools-checker-03.
* 20:32 bd808: Creating tools-checker-03 with role::wmcs::toolforge::checker ([[phab:T219243|T219243]])
* 20:24 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/500095/ to tools-puppetmaster-01 for testing ([[phab:T219243|T219243]])
* 20:22 bd808: Disabled puppet on tools-checker-0{1,2} to make testing new role::wmcs::toolforge::checker easier ([[phab:T219243|T219243]])
* 17:25 bd808: Cleared the "Eqw" state of 44 jobs with `qstat -u '*' {{!}} grep Eqw {{!}} awk '{print $1;}' {{!}} xargs -L1 sudo qmod -cj` on tools-sgegrid-master
* 17:16 andrewbogott: aborted move of tools-static-12; will wait until tomorrow and give DNS caches more time to update
* 17:11 bd808: Restarted nginx on tools-static-13
* 16:53 andrewbogott: moving tools-static-12 to eqiad1-r
* 16:49 bstorm_: cleared E state from 21 queues
* 14:34 andrewbogott: moving tools-static.wmflabs.org to point to tools-static-13 in eqiad1-r
* 13:54 andrewbogott: moving tools-static-13 to eqiad1-r
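The Eqw-clearing one-liner logged at 17:25 above splits into grep/awk stages; the sketch below runs the same filter against canned qstat-style output (hypothetical job rows, no live grid), keeping only the job ids that the real command pipes into `sudo qmod -cj` on tools-sgegrid-master:

```shell
# Two fake qstat rows: job 1234 stuck in Eqw (error-queued-waiting),
# job 5678 running normally. Column 1 is the grid engine job id.
qstat_sample='1234 0.50000 myjob tools.foo Eqw 03/29/2019
5678 0.50000 other tools.bar r 03/29/2019'

# Same filter as the SAL entry: keep Eqw rows, print the job id column.
printf '%s\n' "$qstat_sample" | grep Eqw | awk '{print $1;}'   # prints 1234
```

`qmod -cj` clears the error state on a job so the scheduler will retry it; the `xargs -L1` in the logged command applies it to one job id per invocation.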
 
=== 2019-03-28 ===
* 01:00 bstorm_: cleared error states from two queues
* 00:23 bstorm_: [[phab:T216060|T216060]] created tools-sgewebgrid-generic-0901...again!
 
=== 2019-03-27 ===
* 23:35 bstorm_: rebooted tools-paws-master-01 for NFS issue [[phab:T219460|T219460]]
* 14:45 bstorm_: cleared several "E" state queues
* 12:26 gtirloni: truncated exim4/paniclog on tools-sgewebgrid-lighttpd-0921
* 12:25 gtirloni: truncated exim4/paniclog on tools-sgecron-01
* 12:15 arturo: [[phab:T218126|T218126]] `aborrero@tools-sgegrid-master:~$ sudo qmod -d 'test@tools-sssd-sgeexec-test-2'` (and 1)
 
=== 2019-03-26 ===
* 22:00 gtirloni: downtimed toolschecker
* 17:31 arturo: [[phab:T218126|T218126]] create VM instances tools-sssd-sgeexec-test-[12]
* 00:26 bd808: Deleted DNS record for login-trusty.tools.wmflabs.org
* 00:26 bd808: Deleted DNS record for trusty-dev.tools.wmflabs.org
 
=== 2019-03-25 ===
* 21:21 bd808: All Trusty grid engine hosts shutdown and deleted ([[phab:T217152|T217152]])
* 21:19 bd808: Deleted tools-grid-{master,shadow} ([[phab:T217152|T217152]])
* 21:18 bd808: Deleted tools-webgrid-lighttpd-14*  ([[phab:T217152|T217152]])
* 20:55 bstorm_: reboot tools-sgewebgrid-generic-0903 to clear up some issues
* 20:52 bstorm_: rebooting tools-package-builder-02 due to lots of hung /usr/bin/lsof +c 15 -nXd DEL processes
* 20:51 bd808: Deleted tools-webgrid-generic-14* ([[phab:T217152|T217152]])
* 20:49 bd808: Deleted tools-exec-143* ([[phab:T217152|T217152]])
* 20:49 bd808: Deleted tools-exec-142* ([[phab:T217152|T217152]])
* 20:48 bd808: Deleted tools-exec-141* ([[phab:T217152|T217152]])
* 20:47 bd808: Deleted tools-exec-140* ([[phab:T217152|T217152]])
* 20:43 bd808: Deleted  tools-cron-01 ([[phab:T217152|T217152]])
* 20:42 bd808: Deleted tools-bastion-0{2,3} ([[phab:T217152|T217152]])
* 20:35 bstorm_: rebooted tools-worker-1025 and tools-worker-1021
* 19:59 bd808: Shutdown tools-exec-143* ([[phab:T217152|T217152]])
* 19:51 bd808: Shutdown tools-exec-142* ([[phab:T217152|T217152]])
* 19:47 bstorm_: depooling tools-worker-1025.tools.eqiad.wmflabs because it's not responding and showing insane load
* 19:33 bd808: Shutdown tools-exec-141* ([[phab:T217152|T217152]])
* 19:31 bd808: Shutdown tools-bastion-0{2,3} ([[phab:T217152|T217152]])
* 19:19 bd808: Shutdown tools-exec-140* ([[phab:T217152|T217152]])
* 19:12 bd808: Shutdown tools-webgrid-generic-14* ([[phab:T217152|T217152]])
* 19:11 bd808: Shutdown tools-webgrid-lighttpd-14* ([[phab:T217152|T217152]])
* 18:53 bd808: Shutdown tools-grid-master ([[phab:T217152|T217152]])
* 18:53 bd808: Shutdown tools-grid-shadow ([[phab:T217152|T217152]])
* 18:49 bd808: All jobs still running on the Trusty job grid force deleted.
* 18:46 bd808: All Trusty job grid queues marked as disabled. This should stop all new Trusty job submissions.
* 18:43 arturo: icinga downtime tools-checker for 24h due to trusty grid shutdown
* 18:39 bd808: Shutdown tools-cron-01.tools.eqiad.wmflabs ([[phab:T217152|T217152]])
* 15:27 bd808: Copied all crontab files still on tools-cron-01 to tool's $HOME/crontab.trusty.save
* 02:34 bd808: Disassociated floating IPs and deleted shutdown Trusty grid nodes tools-exec-14{33,34,35,36,37,38,39,40,41,42} ([[phab:T217152|T217152]])
* 02:26 bd808: Deleted shutdown Trusty grid nodes tools-webgrid-lighttpd-14{20,21,22,24,25,26,27,28} ([[phab:T217152|T217152]])
 
=== 2019-03-22 ===
* 17:16 andrewbogott: switching all instances to use ldap-ro.eqiad.wikimedia.org as both primary and secondary ldap server
* 16:12 bstorm_: cleared errored out stretch grid queues
* 15:56 bd808: Rebooting tools-static-12
* 03:09 bstorm_: [[phab:T217280|T217280]] depooled and rebooted 15 other nodes.  Entire stretch grid is in a good state for now.
* 02:31 bstorm_: [[phab:T217280|T217280]] depooled and rebooted tools-sgeexec-0908 since it had no jobs but very high load from an NFS event that was no longer happening
* 02:09 bstorm_: [[phab:T217280|T217280]] depooled and rebooted tools-sgewebgrid-lighttpd-0924
* 00:39 bstorm_: [[phab:T217280|T217280]] depooled and rebooted tools-sgewebgrid-lighttpd-0902
 
=== 2019-03-21 ===
* 23:28 bstorm_: [[phab:T217280|T217280]] depooled, reloaded and repooled tools-sgeexec-0938
* 21:53 bstorm_: [[phab:T217280|T217280]] rebooted and cleared "unknown status" from tools-sgeexec-0914 after depooling
* 21:51 bstorm_: [[phab:T217280|T217280]] rebooted and cleared "unknown status" from tools-sgeexec-0909 after depooling
* 21:26 bstorm_: [[phab:T217280|T217280]] cleared error state from a couple queues and rebooted tools-sgeexec-0901 and 04 to clear other issues related
 
=== 2019-03-18 ===
* 18:43 bd808: Rebooting tools-static-12
* 18:42 chicocvenancio: PAWS: 3 nodes still in not ready state, `worker-10(01{{!}}07{{!}}10)` all else working
* 18:41 chicocvenancio: PAWS: deleting pods stuck in Unknown state with ` --grace-period=0 --force`
* 18:40 andrewbogott: rebooting tools-static-13 in hopes of fixing some nfs mounts
* 18:25 chicocvenancio: removing postStart hook for PWB update and restarting hub while gerrit.wikimedia.org is down
 
=== 2019-03-17 ===
* 23:41 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/497210/ as a quick fix for [[phab:T218494|T218494]]
* 22:30 bd808: Investigating strange system state on tools-bastion-03.
* 17:48 bstorm_: [[phab:T218514|T218514]] rebooting tools-worker-1009 and 1012
* 17:46 bstorm_: depooling tools-worker-1009 and tools-worker-1012 for [[phab:T218514|T218514]]
* 17:13 bstorm_: depooled and rebooting tools-worker-1018
* 15:09 andrewbogott: running 'killall dpkg and dpkg --configure -a' on all nodes to try to work around a race with initramfs
 
=== 2019-03-16 ===
* 22:34 bstorm_: clearing errored out queues again
 
=== 2019-03-15 ===
* 21:08 bstorm_: cleared error state on several queues [[phab:T217280|T217280]]
* 15:58 gtirloni: rebooted tools-clushmaster-02
* 14:40 mutante: tools-sgebastion-07 - dpkg-reconfigure locales and adding Korean ko_KR.EUC-KR - [[phab:T130532|T130532]]
* 14:32 mutante: tools-sgebastion-07 - generating locales for user request in [[phab:T130532|T130532]]
 
=== 2019-03-14 ===
* 23:52 bd808: Disabled job queues and rescheduled continuous jobs away from tools-exec-14{21,22,23,24,25,26,27,28,29,30,31,32} ([[phab:T217152|T217152]])
* 23:28 bd808: Deleted tools-bastion-05 ([[phab:T217152|T217152]])
* 22:30 bd808: Removed obsolete submit hosts from Trusty grid config
* 22:20 bd808: Removed tools-webgrid-lighttpd-142{0,1,2,5} from the grid and shutdown instances via horizon ([[phab:T217152|T217152]])
* 22:10 bd808: Depooled tools-webgrid-lighttpd-142{0,1,2,5} ([[phab:T217152|T217152]])
* 21:55 bd808: Removed submit host flag from tools-bastion-05.tools.eqiad.wmflabs, removed floating ip, and shutdown instance via horizon ([[phab:T217152|T217152]])
* 21:48 bd808: Removed tools-exec-14{33,34,35,36,37,38,39,40,41,42} from the grid and shutdown instances via horizon ([[phab:T217152|T217152]])
* 21:38 gtirloni: rebooted tools-sgewebgrid-generic-0904 ([[phab:T218341|T218341]])
* 21:32 gtirloni: rebooted tools-exec-1020 ([[phab:T218341|T218341]])
* 21:23 gtirloni: rebooted tools-sgeexec-0919, tools-sgeexec-0934, tools-worker-1018 ([[phab:T218341|T218341]])
* 21:19 bd808: Killed jobs still running on tools-exec-14{33,34,35,36,37,38,39,40,41,42}.tools.eqiad.wmflabs 2 weeks after being depooled ([[phab:T217152|T217152]])
* 20:58 bd808: Repooled tools-sgeexec-0941 following reboot
* 20:57 bd808: Hard reboot of tools-sgeexec-0941 via horizon
* 20:54 bd808: Depooled and rebooted tools-sgeexec-0941.tools.eqiad.wmflabs
* 20:53 bd808: Repooled tools-sgeexec-0917 following reboot
* 20:52 bd808: Hard reboot of tools-sgeexec-0917 via horizon
* 20:47 bd808: Depooled and rebooted tools-sgeexec-0917
* 20:44 bd808: Repooled tools-sgeexec-0908 after reboot
* 20:36 bd808: depooled and rebooted tools-sgeexec-0908
* 19:08 gtirloni: rebooted tools-worker-1028 ([[phab:T218341|T218341]])
* 19:08 gtirloni: rebooted tools-sgewebgrid-lighttpd-0914 ([[phab:T218341|T218341]])
* 19:07 gtirloni: rebooted tools-sgewebgrid-lighttpd-0914
* 18:13 gtirloni: drained tools-worker-1028 for reboot (processes in D state)
 
=== 2019-03-13 ===
* 23:30 bd808: Rebuilding stretch Kubernetes images
* 22:55 bd808: Rebuilding jessie Kubernetes images
* 17:11 bstorm_: specifically rebooted SGE cron server tools-sgecron-01
* 17:10 bstorm_: rebooted cron server
* 16:10 bd808: Updated DNS for dev.tools.wmflabs.org to point to Stretch secondary bastion. This was missed on 2019-03-07
* 12:33 arturo: reboot tools-sgebastion-08 ([[phab:T215154|T215154]])
* 12:17 arturo: reboot tools-sgebastion-07 ([[phab:T215154|T215154]])
* 11:53 arturo: enable puppet in tools-sgebastion-07 ([[phab:T215154|T215154]])
* 11:20 arturo: disable puppet in tools-sgebastion-07 for testing [[phab:T215154|T215154]]
* 05:07 bstorm_: re-enabled puppet for tools-sgebastion-07
* 04:59 bstorm_: disabled puppet for a little bit on tools-sgebastion-07
* 00:22 bd808: Raise web-memlimit for isbn tool to 6G for tomcat8 ([[phab:T217406|T217406]])
 
=== 2019-03-11 ===
* 15:53 bd808: Manually started `service gridengine-master` on tools-sgegrid-master after reboot ([[phab:T218038|T218038]])
* 15:47 bd808: Hard reboot of tools-sgegrid-master via Horizon UI ([[phab:T218038|T218038]])
* 15:42 bd808: Rebooting tools-sgegrid-master ([[phab:T218038|T218038]])
* 14:49 gtirloni: deleted tools-webgrid-lighttpd-1419
* 00:53 bd808: Re-enabled 13 queue instances that had been disabled by LDAP failures during job initialization ([[phab:T217280|T217280]])
 
=== 2019-03-10 ===
* 22:36 gtirloni: increased nscd group TTL from 60 to 300sec
 
=== 2019-03-08 ===
* 19:48 andrewbogott: repooling tools-exec-1430 and tools-sgeexec-0905 to compare ldap usage
* 19:21 andrewbogott: depooling tools-exec-1430 and tools-sgeexec-0905 to compare ldap usage
* 17:49 bd808: Re-enabled 4 queue instances that had been disabled by LDAP failures during job initialization ([[phab:T217280|T217280]])
* 00:30 bd808: DNS record created for trusty-dev.tools.wmflabs.org (Trusty secondary bastion)
 
=== 2019-03-07 ===
* 23:31 bd808: Updated DNS to point login.tools.wmflabs.org at 185.15.56.48 (Stretch bastion)
* 04:15 bd808: Killed 3 orphan processes on Trusty grid
* 04:01 bd808: Cleared error state on a large number of Stretch grid queues which had been disabled by LDAP and/or NFS hiccups ([[phab:T217280|T217280]])
* 00:49 zhuyifei1999_: clushed misctools 1.37 upgrade on @bastion,@cron,@bastion-stretch [[phab:T217406|T217406]]
* 00:38 zhuyifei1999_: published misctools 1.37 [[phab:T217406|T217406]]
* 00:34 zhuyifei1999_: begin building misctools 1.37 using debuild [[phab:T217406|T217406]]
 
=== 2019-03-06 ===
* 13:57 gtirloni: fixed SSH warnings in tools-clushmaster-02
 
=== 2019-03-04 ===
* 19:07 bstorm_: umounted /mnt/nfs/dumps-labstore1006.wikimedia.org for [[phab:T217473|T217473]]
* 14:05 gtirloni: rebooted tools-docker-registry-{03,04}, tools-puppetmaster-02 and tools-puppetdb-01 (load avg >45, not accessible)
 
=== 2019-03-03 ===
* 20:54 andrewbogott: cleaning out /tmp on tools-exec-1412
 
=== 2019-02-28 ===
* 19:36 zhuyifei1999_: built with debuild instead [[phab:T217297|T217297]]
* 19:08 zhuyifei1999_: test failures during build, see ticket
* 18:55 zhuyifei1999_: start building jobutils 1.36 [[phab:T217297|T217297]]
 
=== 2019-02-27 ===
* 20:41 andrewbogott: restarting nginx on tools-checker-01
* 19:34 andrewbogott: uncordoning tools-worker-1028, 1002 and 1005, now in eqiad1-r
* 16:20 zhuyifei1999_: regenerating k8s creds for tools.whichsub & tools.permission-denied-test [[phab:T176027|T176027]]
* 15:40 andrewbogott: moving tools-worker-1002, 1005, 1028 to eqiad1-r
* 01:36 bd808: Shutdown tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs via horizon ([[phab:T217152|T217152]])
* 01:29 bd808: Depooled tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs ([[phab:T217152|T217152]])
* 01:26 bd808: Disabled job queues and rescheduled continuous jobs away from tools-exec-14{33,34,35,36,37,38,39,40,41,42}.tools.eqiad.wmflabs ([[phab:T217152|T217152]])
 
=== 2019-02-26 ===
* 20:51 gtirloni: reboot tools-package-builder-02 (unresponsive)
* 19:01 gtirloni: pushed updated docker images
* 17:30 andrewbogott: draining and cordoning tools-worker-1027 for a region migration test
 
=== 2019-02-25 ===
* 23:20 bstorm_: Depooled tools-sgeexec-0914 and tools-sgeexec-0915 for [[phab:T217066|T217066]]
* 21:41 andrewbogott: depooling tools-sgeexec-0911, tools-sgeexec-0912, tools-sgeexec-0913 to test [[phab:T217066|T217066]]
* 13:11 chicocvenancio: PAWS:  Stopped AABot notebook pod [[phab:T217010|T217010]]
* 12:54 chicocvenancio: PAWS:  Restarted Criscod notebook pod [[phab:T217010|T217010]]
* 12:21 chicocvenancio: PAWS: killed proxy and hub pods to attempt to get it to see routes to open notebooks servers to no avail. Restarted BernhardHumm's notebook pod [[phab:T217010|T217010]]
* 09:50 gtirloni: rebooted tools-sgeexec-09{16,22,40} ([[phab:T216988|T216988]])
* 09:41 gtirloni: rebooted tools-sgeexec-09{16,22,40}
* 08:37 zhuyifei1999_: uncordon tools-worker-1015.tools.eqiad.wmflabs
* 08:34 legoktm: hard rebooted tools-worker-1015 via horizon
* 07:48 zhuyifei1999_: systemd stuck in D state. :(
* 07:44 zhuyifei1999_: I saved dmesg and process list to a few files in /root if that helps debugging
* 07:43 zhuyifei1999_: D states are not responding to SIGKILL. Will reboot.
* 07:37 zhuyifei1999_: tools-worker-1015.tools.eqiad.wmflabs having severe NFS issues (all NFS accessing processes are stuck in D state). Draining.
 
=== 2019-02-22 ===
* 16:29 gtirloni: upgraded and rebooted tools-puppetmaster-01 (new kernel)
* 15:59 gtirloni: started tools-puppetmaster-01 (new size: m1.large)
* 15:13 gtirloni: shutdown tools-puppetmaster-01
 
=== 2019-02-21 ===
* 09:59 gtirloni: upgraded all packages in all stretch nodes
* 00:12 zhuyifei1999_: forcing puppet run on tools-k8s-master-01
* 00:08 zhuyifei1999_: running /usr/local/bin/git-sync-upstream on tools-puppetmaster-01 to speed puppet changes up
 
=== 2019-02-20 ===
* 23:30 zhuyifei1999_: begin rebuilding all docker images [[phab:T178601|T178601]] [[phab:T193646|T193646]] [[phab:T215683|T215683]]
* 23:25 zhuyifei1999_: upgraded toollabs-webservice on tools-bastion-02 to 0.44 (newly-built version)
* 23:19 zhuyifei1999_: this was built for stretch. hopefully it works for all distros
* 23:17 zhuyifei1999_: begin build new tools-webservice package [[phab:T178601|T178601]] [[phab:T193646|T193646]] [[phab:T215683|T215683]]
* 21:57 andrewbogott: moving tools-static-13  to a new virt host
* 21:34 andrewbogott: moving the tools-static IP from tools-static-13 to tools-static-12
* 19:17 andrewbogott: moving tools-bastion-02 to labvirt1004
* 16:56 andrewbogott: moving tools-paws-worker-1003
* 15:53 andrewbogott: moving tools-worker-1017, tools-worker-1027, tools-worker-1028
* 15:04 andrewbogott: moving tools-exec-1413 and tools-exec-1442
 
=== 2019-02-19 ===
* 01:49 bd808: Revoked Toolforge project membership for user DannyS712 ([[phab:T215092|T215092]])
 
=== 2019-02-18 ===
* 20:45 gtirloni: upgraded and rebooted tools-sgebastion-07 (login-stretch)
* 20:22 gtirloni: enabled toolsdb monitoring in Icinga
* 20:03 gtirloni: pointed tools-db.eqiad.wmflabs to 172.16.7.153
* 18:50 chicocvenancio: moving paws back to toolsdb [[phab:T216208|T216208]]
* 13:47 arturo: rebooting tools-sgebastion-07 to try fixing general slowness
 
=== 2019-02-17 ===
* 22:23 zhuyifei1999_: uncordon tools-worker-1010.tools.eqiad.wmflabs
* 22:11 zhuyifei1999_: rebooting tools-worker-1010.tools.eqiad.wmflabs
* 22:10 zhuyifei1999_: draining tools-worker-1010.tools.eqiad.wmflabs, `docker ps` is hanging. no idea why. also other weirdness like ContainerCreating forever
 
=== 2019-02-16 ===
* 05:00 zhuyifei1999_: fixed by restarting flannel. another puppet run simply started kubelet
* 04:58 zhuyifei1999_: puppet logs: https://phabricator.wikimedia.org/P8097. Docker is failing with 'Failed to load environment files: No such file or directory'
* 04:52 zhuyifei1999_: copied the resolv.conf from tools-k8s-master-01, removing secondary DNS to make sure puppet fixes that, and starting puppet
* 04:48 zhuyifei1999_: that host's resolv.conf is badly broken https://phabricator.wikimedia.org/P8096. The last Puppet run was at Thu Feb 14 15:21:09 UTC 2019 (2247 minutes ago)
* 04:44 zhuyifei1999_: puppet is also failing bad here 'Error: Could not request certificate: getaddrinfo: Name or service not known'
* 04:43 zhuyifei1999_: this one has logs full of 'Can't contact LDAP server'
* 04:41 zhuyifei1999_: nslcd also broken on tools-worker-1005
* 04:34 zhuyifei1999_: uncordon tools-worker-1014.tools.eqiad.wmflabs
* 04:33 zhuyifei1999_: the issue was, /var/run/nslcd/socket was somehow a directory, AFAICT
* 04:31 zhuyifei1999_: then started nslcd via systemctl and `id zhuyifei1999` returns the correct results
* 04:30 zhuyifei1999_: `nslcd -nd` complains about 'nslcd: bind() to /var/run/nslcd/socket failed: Address already in use'. SIGTERMed a background nslcd, `rmdir /var/run/nslcd/socket`, and `nslcd -nd` seemingly starts to work
* 04:23 zhuyifei1999_: drained tools-worker-1014.tools.eqiad.wmflabs
* 04:16 zhuyifei1999_: logs: https://phabricator.wikimedia.org/P8095
* 04:14 zhuyifei1999_: restarting nslcd on tools-worker-1014 in an attempt to fix that, service failed to start, looking into logs
* 04:12 zhuyifei1999_: restarting nscd on tools-worker-1014 in an attempt to fix seemingly-not-attached-to-LDAP
 
=== 2019-02-14 ===
* 21:57 bd808: Deleted old tools-proxy-02 instance
* 21:57 bd808: Deleted old tools-proxy-01 instance
* 21:56 bd808: Deleted old tools-package-builder-01 instance
* 20:57 andrewbogott: rebooting tools-worker-1005
* 20:34 andrewbogott: moving tools-exec-1409, tools-exec-1410, tools-exec-1414, tools-exec-1419
* 19:55 andrewbogott: moving tools-webgrid-generic-1401 and tools-webgrid-lighttpd-1419
* 19:33 andrewbogott: moving tools-checker-01 to labvirt1003
* 19:25 andrewbogott: moving tools-elastic-02 to labvirt1003
* 19:11 andrewbogott: moving tools-k8s-etcd-01 to labvirt1002
* 18:37 andrewbogott: moving tools-exec-1418, tools-exec-1424 to labvirt1003
* 18:34 andrewbogott: moving tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1406, tools-webgrid-lighttpd-1410 to labvirt1002
* 17:35 arturo: [[phab:T215154|T215154]] tools-sgebastion-07 now running systemd 239 and starts enforcing user limits
* 15:33 andrewbogott: moving tools-worker-1002, 1003, 1005, 1006, 1007, 1010, 1013, 1014 to different labvirts in order to move labvirt1012 to eqiad1-r
 
=== 2019-02-13 ===
* 19:16 andrewbogott: deleting tools-sgewebgrid-generic-0901, tools-sgewebgrid-lighttpd-0901, tools-sgebastion-06
* 15:16 zhuyifei1999_: `sudo /usr/local/bin/grid-configurator --all-domains --observer-pass $(grep OS_PASSWORD /etc/novaobserver.yaml{{!}}awk '{gsub(/"/,"",$2);print $2}')` on tools-sgegrid-master to attempt to make it recognize -sgebastion-07 [[phab:T216042|T216042]]
* 15:06 zhuyifei1999_: `sudo systemctl restart gridengine-master` on tools-sgegrid-master to attempt to make it recognize -sgebastion-07 [[phab:T216042|T216042]]
* 13:03 arturo: [[phab:T216030|T216030]] switch login-stretch.tools.wmflabs.org floating IP to tools-sgebastion-07
 
=== 2019-02-12 ===
* 01:24 bd808: Stopped maintain-kubeusers, edited /etc/kubernetes/tokenauth, restarted maintain-kubeusers ([[phab:T215704|T215704]])
 
=== 2019-02-11 ===
* 22:57 bd808: Shutoff tools-webgrid-lighttpd-14{01,13,24,26,27,28} via Horizon UI
* 22:34 bd808: Decommissioned tools-webgrid-lighttpd-14{01,13,24,26,27,28}
* 22:23 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1401.tools.eqiad.wmflabs
* 22:21 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1413.tools.eqiad.wmflabs
* 22:18 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1428.tools.eqiad.wmflabs
* 22:07 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1427.tools.eqiad.wmflabs
* 22:06 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1424.tools.eqiad.wmflabs
* 22:05 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1426.tools.eqiad.wmflabs
* 20:06 bstorm_: Ran apt-get clean on tools-sgebastion-07 since it was running out of disk (and lots of it was the apt cache)
* 19:09 bd808: Upgraded tools-manifest on tools-cron-01 to v0.19 ([[phab:T107878|T107878]])
* 18:57 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.19 ([[phab:T107878|T107878]])
* 18:57 bd808: Built tools-manifest_0.19_all.deb and published to aptly repos ([[phab:T107878|T107878]])
* 18:26 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.18 ([[phab:T107878|T107878]])
* 18:25 bd808: Built tools-manifest_0.18_all.deb and published to aptly repos ([[phab:T107878|T107878]])
* 18:12 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.17 ([[phab:T107878|T107878]])
* 18:08 bd808: Built tools-manifest_0.17_all.deb and published to aptly repos ([[phab:T107878|T107878]])
* 10:41 godog: flip tools-prometheus proxy back to tools-prometheus-01 and upgrade to prometheus 2.7.1
 
=== 2019-02-08 ===
* 19:17 hauskatze: Stopped webservice of `tools.sulinfo`, which redirects to `tools.quentinv57-tools`, which is also unavailable
* 18:32 hauskatze: Stopped webservice for `tools.quentinv57-tools` for [[phab:T210829|T210829]].
* 13:49 gtirloni: upgraded all packages in SGE cluster
* 12:25 arturo: install aptitude in tools-sgebastion-06
* 11:08 godog: flip tools-prometheus.wmflabs.org to tools-prometheus-02 - [[phab:T215272|T215272]]
* 01:07 bd808: Creating tools-sgebastion-07
 
=== 2019-02-07 ===
* 23:48 bd808: Updated DNS to make tools-trusty.wmflabs.org and trusty.tools.wmflabs.org CNAMEs for login-trusty.tools.wmflabs.org
* 20:18 gtirloni: cleared mail queue on tools-mail-02
* 08:41 godog: upgrade prometheus-02 to prometheus 2.6 - [[phab:T215272|T215272]]
 
=== 2019-02-04 ===
* 13:20 arturo: [[phab:T215154|T215154]] another reboot for tools-sgebastion-06
* 12:26 arturo: [[phab:T215154|T215154]] another reboot for tools-sgebastion-06. Puppet is disabled
* 11:38 arturo: [[phab:T215154|T215154]] reboot tools-sgebastion-06 to totally refresh systemd status
* 11:36 arturo: [[phab:T215154|T215154]] manually install systemd 239 in tools-sgebastion-06
 
=== 2019-01-30 ===
* 23:54 gtirloni: cleared apt cache on sge* hosts
 
=== 2019-01-25 ===
* 20:50 bd808: Deployed new tcl/web Kubernetes image based on Debian Stretch ([[phab:T214668|T214668]])
* 14:22 andrewbogott: draining and moving tools-worker-1016 to a new labvirt for [[phab:T214447|T214447]]
* 14:22 andrewbogott: draining and moving tools-worker-1021 to a new labvirt for [[phab:T214447|T214447]]
 
=== 2019-01-24 ===
* 11:09 arturo: [[phab:T213421|T213421]] delete tools-services-01/02
* 09:46 arturo: [[phab:T213418|T213418]] delete tools-docker-registry-02
* 09:45 arturo: [[phab:T213418|T213418]] delete tools-docker-builder-05 and tools-docker-registry-01
* 03:28 bd808: Fixed rebase conflict in labs/private on tools-puppetmaster-01
 
=== 2019-01-23 ===
* 22:18 bd808: Building new tools-sgewebgrid-lighttpd-0904 instance using Stretch base image ([[phab:T214519|T214519]])
* 22:09 bd808: Deleted tools-sgewebgrid-lighttpd-0904 instance via Horizon, used wrong base image ([[phab:T214519|T214519]])
* 21:04 bd808: Building new tools-sgewebgrid-lighttpd-0904 instance ([[phab:T214519|T214519]])
* 20:53 bd808: Deleted broken tools-sgewebgrid-lighttpd-0904 instance via Horizon ([[phab:T214519|T214519]])
* 19:49 andrewbogott: shutting down eqiad-region proxies tools-proxy-01 and tools-proxy-02
* 17:44 bd808: Added rules to default security group for prometheus monitoring on port 9100 ([[phab:T211684|T211684]])
 
=== 2019-01-22 ===
* 20:21 gtirloni: published new docker images (all)
* 18:57 bd808: Changed deb-tools.wmflabs.org proxy to point to tools-sge-services-03.tools.eqiad.wmflabs
 
=== 2019-01-21 ===
* 05:25 andrewbogott: restarted tools-sgeexec-0906 and tools-sgeexec-0904; they seem better now but I have not repooled them yet
 
=== 2019-01-18 ===
* 21:22 bd808: Forcing php-igbinary update via clush for [[phab:T213666|T213666]]
 
=== 2019-01-17 ===
* 23:37 bd808: Shutdown tools-package-builder-01. Use tools-package-builder-02 instead!
* 22:09 bd808: Upgrading tools-manifest to 0.16 on tools-sgecron-01
* 22:05 bd808: Upgrading tools-manifest to 0.16 on tools-cron-01
* 21:51 bd808: Upgrading tools-manifest to 0.15 on tools-cron-01
* 20:41 bd808: Building tools-package-builder-02 to replace tools-package-builder-01
* 17:16 arturo: [[phab:T213421|T213421]] shutdown tools-services-01/02. Will delete VMs after a grace period
* 12:54 arturo: add webservice security group to tools-sge-services-03/04
 
=== 2019-01-16 ===
* 17:29 andrewbogott: depooling and moving tools-sgeexec-0904 tools-sgeexec-0906 tools-sgewebgrid-lighttpd-0904
* 16:38 arturo: [[phab:T213418|T213418]] shutdown tools-docker-registry-01 and 02. Will delete the instances in a week or so
* 14:34 arturo: [[phab:T213418|T213418]] point docker-registry.tools.wmflabs.org to tools-docker-registry-03 (was in -02)
* 14:24 arturo: [[phab:T213418|T213418]] allocate floating IPs for tools-docker-registry-03 & 04
 
=== 2019-01-15 ===
* 21:02 bstorm_: restarting webservicemonitor on tools-services-02 -- acting funny
* 18:46 bd808: Dropped A record for www.tools.wmflabs.org and replaced it with a CNAME pointing to tools.wmflabs.org.
* 18:29 bstorm_: [[phab:T213711|T213711]] installed python3-requests=2.11.1-1~bpo8+1 python3-urllib3=1.16-1~bpo8+1 on tools-proxy-03, which stopped the bleeding
* 14:55 arturo: disable puppet in tools-docker-registry-01 and tools-docker-registry-02, trying with `role::wmcs::toolforge::docker::registry` in the puppetmaster for -03 and -04. The registry shouldn't be affected by this
* 14:21 arturo: [[phab:T213418|T213418]] put a backup of the docker registry in NFS just in case: `aborrero@tools-docker-registry-02:$ sudo cp /srv/registry/registry.tar.gz /data/project/.system_sge/docker-registry-backup/`
 
=== 2019-01-14 ===
* 22:03 bstorm_: [[phab:T213711|T213711]] Added UDP port needed for flannel packets to work to k8s worker sec groups in both eqiad and eqiad1-r
* 22:03 bstorm_: [[phab:T213711|T213711]] Added ports needed for etcd-flannel to work on the etcd security group in eqiad
* 21:42 zhuyifei1999_: also `write`-ed to them (as root). auth on my personal account would take a long time
* 21:37 zhuyifei1999_: that command belonged to tools.scholia (with fnielsen as the ssh user)
* 21:36 zhuyifei1999_: killed an egrep using too much NFS bandwidth on tools-bastion-03
* 21:33 zhuyifei1999_: SIGTERM PID 12542 24780 875 14569 14722. `tail`s with parent as init, belonging to user maxlath. they should submit to the grid.
* 16:44 arturo: [[phab:T213418|T213418]] docker-registry.tools.wmflabs.org point floating IP to tools-docker-registry-02
* 14:00 arturo: [[phab:T213421|T213421]] disable updatetools in the new services nodes while building them
* 13:53 arturo: [[phab:T213421|T213421]] delete tools-services-03/04 and create them with another prefix: tools-sge-services-03/04 to actually use the new role
* 13:47 arturo: [[phab:T213421|T213421]] create tools-services-03 and tools-services-04 (stretch) they will use the new puppet role `role::wmcs::toolforge::services`
 
=== 2019-01-11 ===
* 11:55 arturo: [[phab:T213418|T213418]] shutdown tools-docker-builder-05, will give a grace period before deleting the VM
* 10:51 arturo: [[phab:T213418|T213418]] created tools-docker-builder-06 in eqiad1
* 10:46 arturo: [[phab:T213418|T213418]] migrating tools-docker-registry-02 from eqiad to eqiad1
 
=== 2019-01-10 ===
* 22:45 bstorm_: [[phab:T213357|T213357]] - Added 24 lighttpd nodes to the new grid
* 18:54 bstorm_: [[phab:T213355|T213355]] built and configured two more generic web nodes for the new grid
* 10:35 gtirloni: deleted non-puppetized checks from tools-checker-0[1,2]
* 00:12 bstorm_: [[phab:T213353|T213353]] Added 36 exec nodes to the new grid
 
=== 2019-01-09 ===
* 20:16 andrewbogott: moving tools-paws-worker-1013 and tools-paws-worker-1007 to eqiad1
* 17:17 andrewbogott: moving paws-worker-1017 and paws-worker-1016 to eqiad1
* 14:42 andrewbogott: experimentally moving tools-paws-worker-1019 to eqiad1
* 09:59 gtirloni: rebooted tools-checker-01 ([[phab:T213252|T213252]])
 
=== 2019-01-07 ===
* 17:21 bstorm_: [[phab:T67777|T67777]] - set the max_u_jobs global grid config setting to 50 in the new grid
* 15:54 bstorm_: [[phab:T67777|T67777]] Set stretch grid user job limit to 16
* 05:45 bd808: Manually installed python3-venv on tools-sgebastion-06. Gerrit patch submitted for proper automation.
 
=== 2019-01-06 ===
* 22:06 bd808: Added floating ip to tools-sgebastion-06 ([[phab:T212360|T212360]])
 
=== 2019-01-05 ===
* 23:54 bd808: Manually installed php-mbstring on tools-sgebastion-06. Gerrit patch submitted to install it on the rest of the Son of Grid Engine nodes.
 
=== 2019-01-04 ===
* 21:37 bd808: Truncated /data/project/.system/accounting after archiving ~30 days of history
 
=== 2019-01-03 ===
* 21:03 bd808: Enabled Puppet on tools-proxy-02
* 20:53 bd808: Disabled Puppet on tools-proxy-02
* 20:51 bd808: Enabled Puppet on tools-proxy-01
* 20:49 bd808: Disabled Puppet on tools-proxy-01
 
=== 2018-12-21 ===
* 16:29 andrewbogott: migrating tools-exec-1416  to labvirt1004
* 16:01 andrewbogott: moving tools-grid-master to labvirt1004
* 00:35 bd808: Installed tools-manifest 0.14 for [[phab:T212390|T212390]]
* 00:22 bd808: Rebuilding all docker containers with toollabs-webservice 0.43 for [[phab:T212390|T212390]]
* 00:19 bd808: Installed toollabs-webservice 0.43 on all hosts for [[phab:T212390|T212390]]