
Nova Resource:Tools/SAL: Difference between revisions

From Wikitech-static
=== 2021-06-10 ===
* 17:38 majavah: clear error state from tools-sgeexec-0907, task@tools-sgeexec-0939

=== 2023-06-01 ===
* 10:07 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-builds-api ({{Gerrit|7e57832}}) ([[phab:T337218|T337218]]) - cookbook ran by dcaro@vulcanus
* 09:21 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-builds-api ({{Gerrit|0f4076a}}) ([[phab:T336130|T336130]]) - cookbook ran by dcaro@vulcanus
* 09:18 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/buildpack-admission-controller ({{Gerrit|ef7f103}}) ([[phab:T336130|T336130]]) - cookbook ran by dcaro@vulcanus
* 07:52 dcaro: rebooted tools-package-builder-04 (stuck not letting me log in with my user)


=== 2021-06-09 ===
* 13:57 majavah: clear error state from exec nodes tools-sgeexec-0913, tools-sgeexec-0936, task@tools-sgeexec-0940

=== 2023-05-31 ===
* 02:38 andrewbogott: rebooted tools-sgeweblight-10-16, [[phab:T337806|T337806]]


=== 2021-06-07 ===
* 18:39 bstorm: cleaning up more error conditions on grid queues
* 17:42 majavah: delete `ingress-nginx` namespace and related objects [[phab:T264221|T264221]]
* 17:37 majavah: remove tools-k8s-ingress-[1-3] from kubernetes, follow-up to https://sal.toolforge.org/log/nd7v2HkB1jz_IcWuCX5M [[phab:T264221|T264221]]

=== 2023-05-30 ===
* 00:22 andrewbogott: rebooted tools-sgeweblight-10-30, oom
* 00:16 andrewbogott: rebooted tools-sgeweblight-10-24, seems to be oom


=== 2021-06-04 ===
* 21:30 bstorm: deleting "tools-k8s-ingress-3", "tools-k8s-ingress-2", "tools-k8s-ingress-1" [[phab:T264221|T264221]]
* 21:21 bstorm: cleared error state from 4 grid queues

=== 2023-05-26 ===
* 13:13 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/buildpack-admission-controller ({{Gerrit|ef7f103}}) ([[phab:T337218|T337218]]) - cookbook ran by dcaro@vulcanus
* 12:59 dcaro: rebooting tools-sgeexec-10-16.tools.eqiad1.wikimedia.cloud for stale NFS handles (D processes)


=== 2021-06-03 ===
* 18:27 majavah: renew prometheus kubernetes certificate [[phab:T280301|T280301]]
* 17:06 majavah: renew admission webhook certificates [[phab:T280301|T280301]]

=== 2023-05-24 ===
* 12:28 dcaro: deploy latest buildservice ([[phab:T335865|T335865]])
* 12:28 dcaro: deploy latest buildservice ([[phab:T336050|T336050]])


=== 2021-06-01 ===
* 10:10 majavah: properly clean up deleted vms tools-k8s-haproxy-[1,2], tools-checker-03 from puppet after using the wrong fqdn first time
* 09:54 majavah: clear error state from tools-sgeexec-0913, tools-sgeexec-0950

=== 2023-05-23 ===
* 14:40 wm-bot2: deployed kubernetes component https://github.com/toolforge/buildpack-admission-controller ({{Gerrit|0c7b25b}}) - cookbook ran by fran@wmf3169


=== 2021-05-30 ===
* 18:58 majavah: clear grid error state from 14 queues

=== 2023-05-22 ===
* 10:06 arturo: hard-reboot tools-sgeexec-10-18 (monitoring reporting it as down)
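Several entries in this log "clear grid error state" on queues. On the (Son of) Grid Engine setup this is typically done from the grid master with `qmod`; a minimal sketch, where the `'*'` queue pattern is illustrative and a specific queue instance could be named instead:

```shell
# On the grid master: list queue instances and explain any in error ('E') state.
qstat -f -explain E

# Clear the error state. '*' (all queues) is illustrative; a single queue
# instance such as 'continuous@tools-sgeexec-0907' can be given instead.
sudo qmod -c '*'
```

The error state marks a queue instance that failed to start a job; clearing it lets the scheduler dispatch work there again.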


=== 2021-05-27 ===
* 18:03 bstorm: adjusted profile::wmcs::kubeadm::etcd_latency_ms from 30 back to the default (10)
* 16:04 bstorm: cleared error state from several exec node queues
* 14:49 andrewbogott: swapping in three new etcd nodes with local storage: tools-k8s-etcd-13,14,15

=== 2023-05-19 ===
* 13:38 arturo: uncordon tools-k8s-worker-47/48/64/75
* 08:46 bd808: Building new perl532-sssd/<nowiki>{</nowiki>base,web<nowiki>}</nowiki> images ([[phab:T323522|T323522]], [[phab:T320904|T320904]])


=== 2021-05-24 ===
* 10:36 arturo: rebased labs/private.git after merge conflict
* 06:49 majavah: remove scfc kubernetes admin access after bd808 removed tools.admin membership to avoid maintain-kubeusers crashes when it expires

=== 2023-05-17 ===
* 16:05 dcaro: release toolforge-cli 0.3.0 ([[phab:T336225|T336225]])
* 12:48 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway ({{Gerrit|fa8ed2c}}) ([[phab:T336225|T336225]]) - cookbook ran by dcaro@vulcanus
* 12:48 wm-bot2: rebooted k8s node tools-k8s-worker-71 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 12:45 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway ({{Gerrit|d1bb238}}) ([[phab:T336225|T336225]]) - cookbook ran by dcaro@vulcanus
* 12:43 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-builds-api ({{Gerrit|8d21314}}) - cookbook ran by dcaro@vulcanus
* 10:54 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-buildpack-admission-controller:7199a9e from https://github.com/toolforge/buildpack-admission-controller ({{Gerrit|7199a9e}}) - cookbook ran by fran@wmf3169
* 08:49 wm-bot2: rebooted k8s node tools-k8s-worker-55 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 08:33 wm-bot2: rebooted k8s node tools-k8s-worker-64 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 08:32 wm-bot2: rebooted k8s node tools-k8s-worker-75 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 08:25 wm-bot2: rebooted k8s node tools-k8s-worker-74 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 08:17 wm-bot2: rebooted k8s node tools-k8s-worker-61 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 08:10 wm-bot2: rebooted k8s node tools-k8s-worker-70 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 08:03 wm-bot2: rebooted k8s node tools-k8s-worker-66 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 07:54 wm-bot2: rebooted k8s node tools-k8s-worker-72 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 07:46 wm-bot2: rebooted k8s node tools-k8s-worker-47 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 07:45 wm-bot2: rebooted k8s node tools-k8s-worker-48 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 07:42 wm-bot2: rebooted k8s node tools-k8s-worker-69 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 07:29 wm-bot2: rebooted k8s node tools-k8s-worker-76 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus


=== 2021-05-22 ===
* 14:47 majavah: manually remove jeh admin certificates and from maintain-kubeusers configmap [[phab:T282725|T282725]]
* 14:32 majavah: manually remove valhallasw yuvipanda admin certificates and from configmap and restart maintain-kubeusers pod [[phab:T282725|T282725]]
* 02:51 bd808: Restarted nginx on tools-static-14 to see if that clears up the fontcdn 502 errors

=== 2023-05-16 ===
* 23:24 bd808: kubectl uncordon tools-k8s-worker-69
* 23:22 bd808: Force reboot tools-k8s-worker-69 via Horizon
* 23:18 bd808: kubectl drain --ignore-daemonsets --delete-emptydir-data --force tools-k8s-worker-69
* 23:17 bd808: kubectl cordon tools-k8s-worker-69
* 14:37 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/builds-api:35b57c6 from https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-builds-api.git ({{Gerrit|35b57c6}}) - cookbook ran by dcaro@vulcanus
* 13:05 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller ({{Gerrit|df52a39}}) ([[phab:T334081|T334081]]) - cookbook ran by dcaro@vulcanus
* 12:54 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller ({{Gerrit|ad5b2b5}}) ([[phab:T334081|T334081]]) - cookbook ran by dcaro@vulcanus
* 11:52 dcaro: release toolforge-weld 0.2.0 and toolforge-webservice 0.98
* 08:08 dcaro: reboot tools-mail-03 ([[phab:T316544|T316544]])
* 08:07 dcaro: reboot tools-sgebastion-10 ([[phab:T316544|T316544]])
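The cordon/drain/reboot/uncordon entries for tools-k8s-worker-69 follow the standard Kubernetes node-maintenance cycle. A sketch in chronological order, using the node name and flags exactly as they appear in the log:

```shell
# Standard maintenance cycle for a k8s worker node.
NODE=tools-k8s-worker-69

# 1. Mark the node unschedulable so no new pods land on it.
kubectl cordon "$NODE"

# 2. Evict running pods (flags as logged: tolerate daemonsets,
#    discard emptyDir data, evict unmanaged pods too).
kubectl drain --ignore-daemonsets --delete-emptydir-data --force "$NODE"

# 3. Reboot the instance out-of-band (here: via Horizon), then
#    readmit it to the scheduler once it is back.
kubectl uncordon "$NODE"
```

Note the log reads newest-first, so the uncordon appears above the cordon even though it ran last.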


=== 2021-05-21 ===
* 17:06 majavah: unpool tools-k8s-ingress-[4-6]
* 17:06 majavah: repool tools-k8s-ingress-6
* 17:02 majavah: repool tools-k8s-ingress-4 and -5
* 16:59 bstorm: upgrading the ingress-gen2 controllers to release 3 to capture new RAM/CPU limits
* 16:43 bstorm: resize tools-k8s-ingress-4 to g3.cores4.ram8.disk20
* 16:43 bstorm: resize tools-k8s-ingress-6 to g3.cores4.ram8.disk20
* 16:40 bstorm: resize tools-k8s-ingress-5 to g3.cores4.ram8.disk20
* 16:04 majavah: rollback kubernetes ingress update from front proxy
* 06:52 Majavah: pool tools-k8s-ingress-6 and depool ingress-[2,3] [[phab:T264221|T264221]]

=== 2023-05-15 ===
* 22:50 bd808: Rebuilding bullseye and buster docker containers to pick up make package addition ([[phab:T320343|T320343]])
* 22:09 wm-bot2: rebooted k8s node tools-k8s-worker-66 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 22:07 wm-bot2: rebooted k8s node tools-k8s-worker-65 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 22:06 wm-bot2: rebooted k8s node tools-k8s-worker-64 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 22:04 wm-bot2: rebooted k8s node tools-k8s-worker-62 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 22:02 wm-bot2: rebooted k8s node tools-k8s-worker-61 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:58 wm-bot2: rebooted k8s node tools-k8s-worker-60 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:56 wm-bot2: rebooted k8s node tools-k8s-worker-59 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:54 wm-bot2: rebooted k8s node tools-k8s-worker-58 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:52 wm-bot2: rebooted k8s node tools-k8s-worker-57 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:51 wm-bot2: rebooted k8s node tools-k8s-worker-56 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:50 wm-bot2: rebooted k8s node tools-k8s-worker-55 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:49 wm-bot2: rebooted k8s node tools-k8s-worker-54 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:47 wm-bot2: rebooted k8s node tools-k8s-worker-53 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:44 wm-bot2: rebooted k8s node tools-k8s-worker-52 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:42 wm-bot2: rebooted k8s node tools-k8s-worker-51 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:41 wm-bot2: rebooted k8s node tools-k8s-worker-50 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:40 wm-bot2: rebooted k8s node tools-k8s-worker-49 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:38 wm-bot2: rebooted k8s node tools-k8s-worker-48 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:37 wm-bot2: rebooted k8s node tools-k8s-worker-47 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:33 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by andrew@bullseye
* 21:16 wm-bot2: rebooted k8s node tools-k8s-worker-45 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 21:15 wm-bot2: rebooted k8s node tools-k8s-worker-44 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 21:13 wm-bot2: rebooted k8s node tools-k8s-worker-43 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 21:12 wm-bot2: rebooted k8s node tools-k8s-worker-42 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 21:09 wm-bot2: rebooted k8s node tools-k8s-worker-41 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 21:03 wm-bot2: rebooted k8s node tools-k8s-worker-40 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:58 wm-bot2: rebooted k8s node tools-k8s-worker-39 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:52 wm-bot2: rebooted k8s node tools-k8s-worker-38 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:50 wm-bot2: rebooted k8s node tools-k8s-worker-37 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:49 wm-bot2: rebooted k8s node tools-k8s-worker-36 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:48 wm-bot2: rebooted k8s node tools-k8s-worker-35 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:47 wm-bot2: rebooted k8s node tools-k8s-worker-34 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:42 wm-bot2: rebooted k8s node tools-k8s-worker-33 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:41 andrewbogott: rebooting frozen VMs: tools-k8s-worker-65, tools-sgeweblight-10-27, tools-k8s-worker-45, tools-k8s-worker-36, tools-sgewebgen-10-3 (fallout from earlier nfs outage)
* 20:36 wm-bot2: rebooted k8s node tools-k8s-worker-32 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:32 wm-bot2: rebooted k8s node tools-k8s-worker-31 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:24 wm-bot2: rebooted k8s node tools-k8s-worker-30 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 19:04 wm-bot2: rebooted k8s node tools-k8s-worker-67 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:56 wm-bot2: rebooted k8s node tools-k8s-worker-68 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:49 wm-bot2: rebooted k8s node tools-k8s-worker-69 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:46 bd808: Hard reboot tools-static-14 via Horizon per IRC report of unresponsive requests
* 18:44 wm-bot2: rebooted k8s node tools-k8s-worker-70 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:42 wm-bot2: rebooted k8s node tools-k8s-worker-71 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:39 wm-bot2: rebooted k8s node tools-k8s-worker-72 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:34 wm-bot2: rebooted k8s node tools-k8s-worker-73 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:28 wm-bot2: rebooted k8s node tools-k8s-worker-74 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:22 wm-bot2: rebooted k8s node tools-k8s-worker-75 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:22 taavi: clear mail queue
* 18:21 wm-bot2: rebooted k8s node tools-k8s-worker-76 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:15 wm-bot2: rebooted k8s node tools-k8s-worker-77 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:08 wm-bot2: rebooted k8s node tools-k8s-worker-80 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:06 wm-bot2: rebooted k8s node tools-k8s-worker-81 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:05 wm-bot2: rebooted k8s node tools-k8s-worker-82 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:57 wm-bot2: rebooted k8s node tools-k8s-worker-83 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:48 wm-bot2: rebooted k8s node tools-k8s-worker-84 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:47 wm-bot2: rebooted k8s node tools-k8s-worker-85 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:38 wm-bot2: rebooted k8s node tools-k8s-worker-86 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:37 wm-bot2: rebooted k8s node tools-k8s-worker-87 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:35 wm-bot2: rebooted k8s node tools-k8s-worker-88 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:34 wm-bot2: rebooting all the workers of tools k8s cluster (64 nodes) ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:20 wm-bot2: rebooted k8s node tools-k8s-worker-87 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:19 wm-bot2: rebooted k8s node tools-k8s-worker-88 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:17 bd808: Rebuilding bullseye and buster docker containers to pick up openssh-client package addition ([[phab:T258841|T258841]])
* 17:12 wm-bot2: rebooting the whole tools k8s cluster (64 nodes) ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:06 dcaro: rebooting tools-sgegrid-shadow ([[phab:T316544|T316544]])
* 17:00 dcaro: rebooting tools-sgegrid-master ([[phab:T316544|T316544]])
* 16:55 dcaro: rebooting tools-sgeexec-10-20 ([[phab:T316544|T316544]])
* 16:53 dcaro: rebooting tools-sgeweblight-10-18 ([[phab:T316544|T316544]])
* 16:53 dcaro: rebooting tools-sgeweblight-10-25 ([[phab:T316544|T316544]])
* 16:53 dcaro: rebooting tools-sgeweblight-10-20 ([[phab:T316544|T316544]])
* 16:52 dcaro: rebooting tools-sgeweblight-10-21 ([[phab:T316544|T316544]])
* 16:52 dcaro: rebooting tools-sgeexec-10-22 ([[phab:T316544|T316544]])
* 16:51 dcaro: rebooting tools-sgeweblight-10-28 ([[phab:T316544|T316544]])
* 16:50 dcaro: rebooting tools-sgeexec-10-17 ([[phab:T316544|T316544]])
* 16:48 dcaro: rebooting tools-sgeexec-10-21 ([[phab:T316544|T316544]])
* 16:47 dcaro: rebooting tools-sgeexec-10-19 ([[phab:T316544|T316544]])
* 16:45 dcaro: rebooting tools-sgeexec-10-8 ([[phab:T316544|T316544]])
* 16:45 dcaro: rebooting tools-sgeweblight-10-24 ([[phab:T316544|T316544]])
* 16:44 dcaro: rebooting tools-sgewebgen-10-2 ([[phab:T316544|T316544]])
* 16:44 dcaro: rebooting tools-sgeweblight-10-16 ([[phab:T316544|T316544]])
* 16:43 dcaro: rebooting tools-sgeweblight-10-30 ([[phab:T316544|T316544]])
* 16:43 dcaro: rebooting tools-sgeexec-10-18 ([[phab:T316544|T316544]])
* 16:42 dcaro: rebooting tools-sgeexec-10-16 ([[phab:T316544|T316544]])
* 16:42 dcaro: rebooting tools-sgeexec-10-14 ([[phab:T316544|T316544]])
* 16:41 dcaro: rebooting tools-sgeweblight-10-32 ([[phab:T316544|T316544]])
* 16:40 dcaro: rebooting tools-sgeweblight-10-22 ([[phab:T316544|T316544]])
* 16:39 dcaro: rebooting tools-sgeweblight-10-17 ([[phab:T316544|T316544]])
* 16:32 dcaro: rebooting tools-sgeexec-10-13.tools.eqiad1.wikimedia.cloud ([[phab:T316544|T316544]])
* 16:23 dcaro: rebooting tools-sgeweblight-10-26 ([[phab:T316544|T316544]])
* 16:15 bd808: Hard reboot of tools-sgebastion-11 via Horizon (done circa 16:11Z)
* 16:14 arturo: rebooted a bunch of nodes to cleanup D procs and high load avg because NFS outage (result of [[phab:T316544|T316544]])
* 12:36 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/builds-api:09f3b49-dev from https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-builds-api.git ({{Gerrit|32a8ae9}}) - cookbook ran by dcaro@vulcanus
* 09:12 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/volume-admission:c64da5a from https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller ({{Gerrit|c64da5a}}) - cookbook ran by dcaro@vulcanus


=== 2021-05-20 ===
* 17:05 Majavah: pool tools-k8s-ingress-5 as an ingress node, depool ingress-1 [[phab:T264221|T264221]]
* 16:31 Majavah: pool tools-k8s-worker-4 as an ingress node [[phab:T264221|T264221]]
* 15:17 Majavah: trying to install ingress-nginx via helm again after adjusting security groups [[phab:T264221|T264221]]
* 15:15 Majavah: move tools-k8s-ingress-[5-6] from "tools-k8s-full-connectivity" to "tools-new-k8s-full-connectivity" security group [[phab:T264221|T264221]]

=== 2023-05-13 ===
* 09:13 taavi: reboot tools-sgeexec-10-15,17,18,21


=== 2021-05-19 ===
* 12:15 Majavah: rollback ingress-nginx-gen2
* 11:09 Majavah: deploy helm-based nginx ingress controller v0.46.0 to ingress-nginx-gen2 namespace [[phab:T264221|T264221]]
* 10:44 Majavah: create tools-k8s-ingress-[4-6] [[phab:T264221|T264221]]

=== 2023-05-11 ===
* 15:48 bd808: Rebooted tools-sgebastion-10 for [[phab:T336510|T336510]]
* 15:31 bd808: Sent `wall` for reboot of tools-sgebastion-10 circa 15:40Z


=== 2021-05-16 ===
* 16:52 Majavah: clear error state from tools-sgeexec-0905 tools-sgeexec-0907 tools-sgeexec-0936 tools-sgeexec-0941

=== 2023-05-09 ===
* 16:36 taavi: delegated beta.toolforge.org domain to toolsbeta per [[phab:T257386|T257386]]
* 09:35 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|ad4fa2a}}) - cookbook ran by taavi@runko


=== 2021-05-14 ===
* 19:18 bstorm: adjusting the rate limits for bastions nfs_write upward a lot to make NFS writes faster now that the cluster is finally using 10Gb on the backend and frontend [[phab:T218338|T218338]]
* 16:55 andrewbogott: rebooting toolserver-proxy-01 to clear up stray files
* 16:47 andrewbogott: deleting log files older than 14 days on toolserver-proxy-01

=== 2023-05-08 ===
* 09:12 arturo: force-reboot tools-sgeexec-10-13 (reported as down by the monitoring, no SSH)


=== 2021-05-12 ===
* 19:45 bstorm: cleared error state from some queues
* 19:05 Majavah: remove phamhi-binding phamhi-view-binding cluster role bindings [[phab:T282725|T282725]]
* 19:04 bstorm: deleted the maintain-kubeusers pod to get it up and running fast [[phab:T282725|T282725]]
* 19:03 bstorm: deleted phamhi from admin configmap in maintain-kubeusers [[phab:T282725|T282725]]

=== 2023-05-07 ===
* 16:06 taavi: remove inbound 25/tcp rule from the toolserver legacy server [[phab:T136225|T136225]]


=== 2021-05-11 ===
* 17:17 Majavah: shutdown and delete tools-checker-03 [[phab:T278540|T278540]]
* 17:14 Majavah: move floating ip 185.15.56.61 to tools-checker-04
* 17:12 Majavah: add tools-checker-04 as a grid submit host [[phab:T278540|T278540]]
* 16:58 Majavah: add tools-checker-04 to toollabs::checker_hosts hiera key [[phab:T278540|T278540]]
* 16:49 Majavah: creating tools-checker-04 with buster [[phab:T278540|T278540]]
* 16:32 Majavah: carefully shutdown tools-k8s-haproxy-1 [[phab:T252239|T252239]]
* 16:29 Majavah: carefully shutdown tools-k8s-haproxy-2 [[phab:T252239|T252239]]

=== 2023-05-05 ===
* 22:21 bd808: Added "RepoLookoutBot" to hiera key "dynamicproxy::blocked_user_agent_regex" to stop unnecessary scans by https://www.repo-lookout.org/
* 22:20 bd808: Added
* 11:30 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:811164e from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|811164e}}) - cookbook ran by taavi@runko
* 09:13 dcaro: rebooted tools-sgeexec-10-16 as it was stuck ([[phab:T335009|T335009]])


=== 2021-05-10 ===
* 22:58 bstorm: cleared error state on a grid queue
* 22:58 bstorm: setting `profile::wmcs::kubeadm::docker_vol: false` on ingress nodes
* 15:22 Majavah: change k8s.svc.tools.eqiad1.wikimedia.cloud. to point to the tools-k8s-haproxy-keepalived-vip address 172.16.6.113 ([[phab:T252239|T252239]])
* 15:06 Majavah: carefully rolling out keepalived to tools-k8s-haproxy-[3-4] while making sure [1-2] do not have changes
* 15:03 Majavah: clear all error states caused by overloaded exec nodes
* 14:57 arturo: allow tools-k8s-haproxy-[3-4] to use the tools-k8s-haproxy-keepalived-vip address (172.16.6.113) ([[phab:T252239|T252239]])
* 12:53 Majavah: creating tools-k8s-haproxy-[3-4] to rebuild current ones without nfs and with keepalived

=== 2023-05-04 ===
* 15:15 wm-bot2: removed instance tools-k8s-etcd-15 - cookbook ran by andrew@bullseye
* 14:13 wm-bot2: removed instance tools-k8s-etcd-14 - cookbook ran by andrew@bullseye


=== 2021-05-09 ===
* 06:55 Majavah: clear error state from tools-sgeexec-0916

=== 2023-05-03 ===
* 12:41 wm-bot2: removed instance tools-k8s-etcd-13 - cookbook ran by andrew@bullseye


=== 2021-05-08 ===
* 10:57 Majavah: import docker image k8s.gcr.io/ingress-nginx/controller:v0.46.0 to local registry as docker-registry.tools.wmflabs.org/nginx-ingress-controller:v0.46.0 [[phab:T264221|T264221]]

=== 2023-05-02 ===
* 00:29 wm-bot2: deployed kubernetes component https://github.com/toolforge/buildpack-admission-controller ({{Gerrit|7199a9e}}) - cookbook ran by raymond@ubuntu


=== 2021-05-07 ===
* 18:07 Majavah: generate and add k8s haproxy keepalived password (profile::toolforge::k8s::haproxy::keepalived_password) to private puppet repo
* 17:15 bstorm: recreated recordset of k8s.tools.eqiad1.wikimedia.cloud as CNAME to k8s.svc.tools.eqiad1.wikimedia.cloud [[phab:T282227|T282227]]
* 17:12 bstorm: created A record of k8s.svc.tools.eqiad1.wikimedia.cloud pointing at current cluster with TTL of 300 for quick initial failover when the new set of haproxy nodes are ready [[phab:T282227|T282227]]
* 09:44 arturo: `sudo wmcs-openstack --os-project-id=tools port create --network lan-flat-cloudinstances2b tools-k8s-haproxy-keepalived-vip`

=== 2023-05-01 ===
* 23:17 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-buildpack-admission-controller:3b3803f from https://github.com/toolforge/buildpack-admission-controller ({{Gerrit|3b3803f}}) - cookbook ran by raymond@ubuntu
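The 2021-05-07 entries set up a keepalived-managed VIP for the k8s haproxy nodes: a Neutron port reserves the address, and the shared VRRP password goes into the private puppet repo. A hypothetical keepalived fragment for that arrangement; the VIP address and password key come from the log entries, while the interface name, router id, priority, and state are assumptions:

```
# Hypothetical keepalived VRRP sketch for the tools-k8s-haproxy VIP.
vrrp_instance haproxy_vip {
    state BACKUP            # assumption: both nodes start BACKUP, highest priority wins
    interface eth0          # assumption: instance's primary interface
    virtual_router_id 51    # assumption: any id unique on the segment
    priority 100
    authentication {
        auth_type PASS
        # value of profile::toolforge::k8s::haproxy::keepalived_password
        auth_pass CHANGEME
    }
    virtual_ipaddress {
        172.16.6.113        # the tools-k8s-haproxy-keepalived-vip port's address
    }
}
```

The DNS records created the same day (k8s.svc.tools.eqiad1.wikimedia.cloud with a 300s TTL) point at this VIP so failover between haproxy nodes needs no DNS change.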


=== 2021-05-06 ===
* 14:43 Majavah: clear error states from all currently erroring exec nodes
* 14:37 Majavah: clear error state from tools-sgeexec-0913
* 04:35 Majavah: add own root key to project hiera on horizon [[phab:T278390|T278390]]
* 02:36 andrewbogott: removing jhedden from sudo roots

=== 2023-04-28 ===
* 15:01 arturo: force reboot tools-k8s-worker-79, unresponsive
* 08:27 dcaro: rebooting tools-sgeweblight-10-28 ([[phab:T335336|T335336]])
* 07:20 dcaro: rebooting tools-sgegrid-shadow due to stale nfs mount
* 00:09 bd808: `kubectl uncordon tools-k8s-worker-67` ([[phab:T335543|T335543]])
* 00:07 bd808: Hard reboot tools-k8s-worker-67.tools.eqiad1.wikimedia.cloud via horizon ([[phab:T335543|T335543]])
* 00:04 bd808: Rebooting tools-k8s-worker-67.tools.eqiad1.wikimedia.cloud ([[phab:T335543|T335543]])


=== 2021-05-05 ===
* 19:27 andrewbogott: adding taavi as a sudo root to project toolforge for [[phab:T278390|T278390]]

=== 2023-04-27 ===
* 23:59 bd808: `kubectl drain --ignore-daemonsets --delete-emptydir-data --force tools-k8s-worker-67` ([[phab:T335543|T335543]])
* 20:50 bd808: Started process to rebuild all buster and bullseye based container images again. Prior problem seems to have been stale images in local cache on the build server.
* 20:42 bd808: Container image rebuild failed with GPG errors in buster-sssd base image. Will investigate and attempt to restart once resolved in a local dev environment.
* 20:33 bd808: Started process to rebuild all buster and bullseye based container images per https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes#Building_toolforge_specific_images


=== 2021-05-04 ===
* 15:23 arturo: upgrading exim4-daemon-heavy in tools-mail-03
* 10:47 arturo: rebase & resolve merge conflicts in labs/private.git

=== 2023-04-18 ===
* 16:46 dcaro: force-rebooting tools-sgeweblight-10-25/26/27 as they got stuck stopping the grid_exec process
* 16:35 dcaro: rebooting root@tools-sgeweblight-10-27 due to stuck exec daemon not releasing port 6445
* 16:35 dcaro: rebooting root@tools-sgeweblight-10-25 due to stuck exec daemon not releasing port 6445
* 16:32 dcaro: rebooting root@tools-sgeweblight-10-26 due to stuck exec daemon not releasing port 6445
* 16:26 dcaro: rebooting root@tools-sgeexec-10-14 due to stuck exec daemon not releasing port 6445


=== 2021-05-03 ===
* 16:24 dcaro: started tools-sgeexec-0907, was stuck on initramfs due to an unclean fs (/dev/vda3, root), ran fsck manually fixing all the errors and booted up correctly after ([[phab:T280641|T280641]])
* 14:07 dcaro: depooling tools-sgeexec-0908/7 to be able to restart the VMs as they got stuck during migration ([[phab:T280641|T280641]])

=== 2023-04-17 ===
* 13:10 dcaro: rebooting tools-sgegrid-master node ([[phab:T334847|T334847]])
* 02:43 legoktm: manual restart of apache2 on toolserver-proxy-1 to completely pick up renewed TLS cert (alert was flapping)
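The tools-sgeexec-0907 recovery logged above (a VM dropped to the initramfs prompt because its root filesystem was unclean) boils down to a manual fsck from the emergency shell. A sketch, with the device name taken from the log entry:

```shell
# From the initramfs emergency shell: repair the unclean root filesystem.
# -y answers "yes" to every repair prompt so the run is unattended.
fsck -y /dev/vda3

# On a Debian-style initramfs, exiting the shell retries the mount and
# resumes the normal boot once the filesystem is clean.
exit
```
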


=== 2021-04-29 ===
* 18:23 bstorm: removing one more etcd node via cookbook [[phab:T279723|T279723]]
* 18:12 bstorm: removing an etcd node via cookbook [[phab:T279723|T279723]]

=== 2023-04-11 ===
* 16:11 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|b65439b}}) - cookbook ran by arturo@nostromo
* 15:46 arturo: upload toolforge-jobs-framework-cli v11 to aptly
* 14:17 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller.git ({{Gerrit|d878e49}}) ([[phab:T324834|T324834]]) - cookbook ran by dcaro@vulcanus
* 13:19 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:c6c693c from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|c6c693c}}) - cookbook ran by arturo@nostromo
* 12:09 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/volume-admission:40bd3b3 from https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller ({{Gerrit|40bd3b3}}) - cookbook ran by dcaro@vulcanus
* 10:34 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-nginx ({{Gerrit|9aed7e5}}) - cookbook ran by taavi@runko
* 09:15 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/calico ({{Gerrit|c6a3e29}}) ([[phab:T329677|T329677]]) - cookbook ran by taavi@runko
* 08:45 wm-bot2: Adding a new k8s worker node - cookbook ran by taavi@runko


=== 2021-04-27 ===
* 16:40 bstorm: deleted all the errored out grid jobs stuck in queue wait
* 16:16 bstorm: cleared E status on grid queues to get things flowing again

=== 2023-04-10 ===
* 10:46 taavi: patch existing PSP roles to use policy/v1beta1 [[phab:T331619|T331619]]
* 09:16 arturo: upgrading k8s cluster to 1.22 ([[phab:T286856|T286856]])


=== 2021-04-26 ===
* 12:17 arturo: allowing more tools into the legacy redirector ([[phab:T281003|T281003]])

=== 2023-04-07 ===
* 14:34 wm-bot2: drained, depooled and removed k8s control node tools-k8s-control-3 ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 14:30 wm-bot2: removed instance tools-k8s-control-2 - cookbook ran by taavi@runko


=== 2021-04-22 ===
* 08:44 Krenair: Removed yuvipanda from roots sudo policy
* 08:42 Krenair: Removed yuvipanda from projectadmin per request
* 08:40 Krenair: Removed yuvipanda from tools.admin per request

=== 2023-04-05 ===
* 15:16 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|5ea5992}}) - cookbook ran by taavi@runko
* 15:10 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:3569803 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|3569803}}) - cookbook ran by taavi@runko
* 14:56 wm-bot2: Added a new k8s worker tools-k8s-worker-88.tools.eqiad1.wikimedia.cloud to the cluster ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:42 wm-bot2: Adding a new k8s worker node ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:42 wm-bot2: Added a new k8s worker tools-k8s-worker-87.tools.eqiad1.wikimedia.cloud to the cluster ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:28 wm-bot2: Adding a new k8s worker node ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:28 wm-bot2: Added a new k8s worker tools-k8s-worker-86.tools.eqiad1.wikimedia.cloud to the cluster ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:15 wm-bot2: Adding a new k8s worker node ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:15 wm-bot2: Added a new k8s worker tools-k8s-worker-85.tools.eqiad1.wikimedia.cloud to the cluster ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:01 wm-bot2: Adding a new k8s worker node ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:01 wm-bot2: Added a new k8s worker tools-k8s-worker-84.tools.eqiad1.wikimedia.cloud to the cluster ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 13:47 wm-bot2: Adding a new k8s worker node ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 13:47 wm-bot2: Added a new k8s worker tools-k8s-worker-83.tools.eqiad1.wikimedia.cloud to the cluster ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 13:34 wm-bot2: Adding a new k8s worker node ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 13:33 wm-bot2: removed instance tools-k8s-worker-83 - cookbook ran by taavi@runko
* 13:15 wm-bot2: Adding a new k8s worker node ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 13:06 wm-bot2: removing grid node tools-sgeweblight-10-31.tools.eqiad1.wikimedia.cloud ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 13:02 wm-bot2: removing grid node tools-sgeweblight-10-29.tools.eqiad1.wikimedia.cloud ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 13:00 wm-bot2: removing grid node tools-sgeexec-10-9.tools.eqiad1.wikimedia.cloud ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 12:58 wm-bot2: removing grid node tools-sgeweblight-10-15.tools.eqiad1.wikimedia.cloud ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 12:54 wm-bot2: removing grid node tools-sgeexec-10-7.tools.eqiad1.wikimedia.cloud ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 12:52 wm-bot2: removing grid node tools-sgeweblight-10-13.tools.eqiad1.wikimedia.cloud ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 12:34 wm-bot2: drained, depooled and removed k8s control node tools-k8s-control-1 - cookbook ran by taavi@runko
* 12:07 wm-bot2: Added a new k8s control tools-k8s-control-6.tools.eqiad1.wikimedia.cloud to the cluster - cookbook ran by taavi@runko
* 11:53 wm-bot2: Adding a new k8s control node - cookbook ran by taavi@runko
* 11:51 wm-bot2: removed instance tools-k8s-control-6 - cookbook ran by taavi@runko
* 11:39 wm-bot2: Adding a new k8s control node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 11:38 wm-bot2: removed instance tools-k8s-control-6 - cookbook ran by taavi@runko
* 11:21 wm-bot2: Adding a new k8s control node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 11:21 wm-bot2: removed instance tools-k8s-control-6 - cookbook ran by taavi@runko
* 11:09 wm-bot2: Adding a new k8s control node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 10:53 wm-bot2: removed instance tools-k8s-control-6 - cookbook ran by taavi@runko
* 10:41 wm-bot2: Adding a new k8s control node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 10:41 wm-bot2: removed instance tools-k8s-control-6 - cookbook ran by taavi@runko
* 10:16 wm-bot2: Adding a new k8s control node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko


=== 2023-04-04 ===
* 19:00 wm-bot2: Adding a new k8s control node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 18:59 wm-bot2: removed instance tools-k8s-control-5 - cookbook ran by taavi@runko
* 18:46 wm-bot2: Adding a new k8s control node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 18:45 wm-bot2: Adding a new k8s CONTROL node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 10:15 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@nostromo
* 09:28 arturo: hard-reboot the 3 k8s control nodes

=== 2021-04-20 ===
* 22:20 bd808: `clush -w @all -b "sudo exiqgrep -z -i {{!}} xargs sudo exim -Mt"`
* 22:19 bd808: `clush -w @exec -b "sudo exiqgrep -z -i {{!}} xargs sudo exim -Mt"`
* 21:52 bd808: Update hiera `profile::toolforge::active_mail_relay: tools-mail-03.tools.eqiad1.wikimedia.cloud`. Was using wrong domain name in prior update.
* 21:49 bstorm: tagged the latest maintain-kubeusers and deployed to toolforge (with kustomize changes to rbac) after testing in toolsbeta [[phab:T280300|T280300]]
* 21:27 bd808: Update hiera `profile::toolforge::active_mail_relay: tools-mail-03.tools.eqiad.wmflabs`. Was -2 which is decommed.
* 10:18 dcaro: setting the retention on the tools-prometheus VMs to 250GB (they have 276GB total, leaving some space for online data operations if needed) ([[phab:T279990|T279990]])


=== 2023-04-03 ===
* 17:13 wm-bot2: rebooted k8s node tools-k8s-worker-31 - cookbook ran by taavi@runko
* 17:11 wm-bot2: rebooted k8s node tools-k8s-worker-32 - cookbook ran by taavi@runko
* 17:09 wm-bot2: rebooted k8s node tools-k8s-worker-33 - cookbook ran by taavi@runko
* 17:07 wm-bot2: rebooted k8s node tools-k8s-worker-34 - cookbook ran by taavi@runko
* 17:05 wm-bot2: rebooted k8s node tools-k8s-worker-35 - cookbook ran by taavi@runko
* 17:04 wm-bot2: rebooted k8s node tools-k8s-worker-36 - cookbook ran by taavi@runko
* 17:02 wm-bot2: rebooted k8s node tools-k8s-worker-37 - cookbook ran by taavi@runko
* 17:00 wm-bot2: rebooted k8s node tools-k8s-worker-38 - cookbook ran by taavi@runko
* 16:58 wm-bot2: rebooted k8s node tools-k8s-worker-39 - cookbook ran by taavi@runko
* 16:56 wm-bot2: rebooted k8s node tools-k8s-worker-40 - cookbook ran by taavi@runko
* 16:55 wm-bot2: rebooted k8s node tools-k8s-worker-41 - cookbook ran by taavi@runko
* 16:53 wm-bot2: rebooted k8s node tools-k8s-worker-42 - cookbook ran by taavi@runko
* 16:51 wm-bot2: rebooted k8s node tools-k8s-worker-43 - cookbook ran by taavi@runko
* 16:49 wm-bot2: rebooted k8s node tools-k8s-worker-44 - cookbook ran by taavi@runko
* 16:45 wm-bot2: rebooted k8s node tools-k8s-worker-45 - cookbook ran by taavi@runko
* 16:43 wm-bot2: rebooted k8s node tools-k8s-worker-46 - cookbook ran by taavi@runko
* 16:41 wm-bot2: rebooted k8s node tools-k8s-worker-47 - cookbook ran by taavi@runko
* 16:40 wm-bot2: rebooted k8s node tools-k8s-worker-48 - cookbook ran by taavi@runko
* 16:38 wm-bot2: rebooted k8s node tools-k8s-worker-49 - cookbook ran by taavi@runko
* 16:36 wm-bot2: rebooted k8s node tools-k8s-worker-50 - cookbook ran by taavi@runko
* 16:35 wm-bot2: rebooted k8s node tools-k8s-worker-51 - cookbook ran by taavi@runko
* 16:33 wm-bot2: rebooted k8s node tools-k8s-worker-52 - cookbook ran by taavi@runko
* 16:31 wm-bot2: rebooted k8s node tools-k8s-worker-53 - cookbook ran by taavi@runko
* 16:28 wm-bot2: rebooted k8s node tools-k8s-worker-54 - cookbook ran by taavi@runko
* 16:27 wm-bot2: rebooted k8s node tools-k8s-worker-55 - cookbook ran by taavi@runko
* 16:25 wm-bot2: rebooted k8s node tools-k8s-worker-56 - cookbook ran by taavi@runko
* 16:23 wm-bot2: rebooted k8s node tools-k8s-worker-57 - cookbook ran by taavi@runko
* 16:21 wm-bot2: rebooted k8s node tools-k8s-worker-58 - cookbook ran by taavi@runko
* 16:20 wm-bot2: rebooted k8s node tools-k8s-worker-59 - cookbook ran by taavi@runko
* 16:18 wm-bot2: rebooted k8s node tools-k8s-worker-60 - cookbook ran by taavi@runko
* 16:09 wm-bot2: rebooted k8s node tools-k8s-worker-61 - cookbook ran by taavi@runko
* 16:07 wm-bot2: rebooted k8s node tools-k8s-worker-62 - cookbook ran by taavi@runko
* 16:01 wm-bot2: rebooted k8s node tools-k8s-worker-64 - cookbook ran by taavi@runko
* 16:00 wm-bot2: rebooting the whole tools k8s cluster (58 nodes) - cookbook ran by taavi@runko
* 15:58 wm-bot2: rebooted k8s node tools-k8s-worker-65 - cookbook ran by taavi@runko
* 15:56 wm-bot2: rebooted k8s node tools-k8s-worker-66 - cookbook ran by taavi@runko
* 15:48 wm-bot2: rebooted k8s node tools-k8s-worker-67 - cookbook ran by taavi@runko
* 15:38 wm-bot2: rebooted k8s node tools-k8s-worker-68 - cookbook ran by taavi@runko
* 15:36 wm-bot2: rebooted k8s node tools-k8s-worker-69 - cookbook ran by taavi@runko
* 15:34 wm-bot2: rebooted k8s node tools-k8s-worker-70 - cookbook ran by taavi@runko
* 15:32 wm-bot2: rebooted k8s node tools-k8s-worker-71 - cookbook ran by taavi@runko
* 15:30 wm-bot2: rebooted k8s node tools-k8s-worker-72 - cookbook ran by taavi@runko
* 15:28 wm-bot2: rebooted k8s node tools-k8s-worker-73 - cookbook ran by taavi@runko
* 15:26 wm-bot2: rebooted k8s node tools-k8s-worker-74 - cookbook ran by taavi@runko
* 15:24 wm-bot2: rebooted k8s node tools-k8s-worker-75 - cookbook ran by taavi@runko
* 15:22 wm-bot2: rebooting the whole tools k8s cluster (58 nodes) - cookbook ran by taavi@runko
* 15:17 wm-bot2: rebooted k8s node tools-k8s-worker-75 - cookbook ran by taavi@runko
* 15:14 wm-bot2: rebooted k8s node tools-k8s-worker-76 - cookbook ran by taavi@runko
* 15:12 wm-bot2: rebooted k8s node tools-k8s-worker-77 - cookbook ran by taavi@runko
* 15:10 wm-bot2: rebooted k8s node tools-k8s-worker-78 - cookbook ran by taavi@runko
* 15:08 wm-bot2: rebooted k8s node tools-k8s-worker-79 - cookbook ran by taavi@runko
* 15:06 wm-bot2: rebooted k8s node tools-k8s-worker-80 - cookbook ran by taavi@runko
* 14:59 wm-bot2: rebooted k8s node tools-k8s-worker-81 - cookbook ran by taavi@runko
* 14:41 wm-bot2: rebooted k8s node tools-k8s-worker-82 - cookbook ran by taavi@runko
* 14:38 wm-bot2: rebooting the whole tools k8s cluster (58 nodes) - cookbook ran by taavi@runko
* 14:13 andrewbogott: test log to see if stashbot is back working
* 13:19 andrewbogott: forcing puppet run on all toolforge VMs
* 08:28 taavi: stop exim4.service on tools-sgecron-2 [[phab:T333477|T333477]]
* 06:52 taavi: stop jobs-framework-emailer to prevent spam due to NFS being read-only [[phab:T333477|T333477]]

=== 2021-04-19 ===
* 10:53 dcaro: reverting setting prometheus data source in grafana to 'server', can't connect
* 10:51 dcaro: setting prometheus data source in grafana to 'server' to avoid CORS issues


=== 2023-03-29 ===
* 16:07 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/labs/tools/registry-admission-webhook ({{Gerrit|dc26f52}}) - cookbook ran by raymond@ubuntu
* 15:21 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/registry-admission:24115c7 from https://gerrit.wikimedia.org/r/labs/tools/registry-admission-webhook ({{Gerrit|24115c7}}) - cookbook ran by raymond@ubuntu

=== 2021-04-16 ===
* 23:15 bstorm: cleaned up all source files for the grid with the old domain name to enable future node creation [[phab:T277653|T277653]]
* 14:38 dcaro: added 'will get out of space in X days' panel to the dashboard https://grafana-labs.wikimedia.org/goto/kBlGd0uGk ([[phab:T279990|T279990]]), we got <5days xd
* 11:35 arturo: running `grid-configurator --all-domains` which basically added tools-sgebastion-10,11 as submit hosts and removed tools-sgegrid-master,shadow as submit hosts


=== 2023-03-28 ===
* 19:43 wm-bot2: deployed kubernetes component https://github.com/toolforge/buildpack-admission-controller ({{Gerrit|e1b9815}}) - cookbook ran by raymond@ubuntu

=== 2021-04-15 ===
* 17:45 bstorm: cleared error state from tools-sgeexec-0920.tools.eqiad.wmflabs for a failed job


=== 2023-03-27 ===
* 22:51 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-buildpack-admission-controller:70d550a from https://github.com/toolforge/buildpack-admission-controller ({{Gerrit|70d550a}}) - cookbook ran by raymond@ubuntu

=== 2021-04-13 ===
* 13:26 dcaro: upgrade puppet and python-wmflib on tools-prometheus-03
* 11:23 arturo: deleted shutoff VM tools-package-builder-02 ([[phab:T275864|T275864]])
* 11:21 arturo: deleted shutoff VM tools-sge-services-03,04 ([[phab:T278354|T278354]])
* 11:20 arturo: deleted shutoff VM tools-docker-registry-03,04 ([[phab:T278303|T278303]])
* 11:18 arturo: deleted shutoff VM tools-mail-02 ([[phab:T278538|T278538]])
* 11:17 arturo: deleted shutoff VMs tools-static-12,13 ([[phab:T278539|T278539]])


=== 2023-03-26 ===
* 20:28 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko

=== 2021-04-11 ===
* 16:07 bstorm: cleared E state from tools-sgeexec-0917 tools-sgeexec-0933 tools-sgeexec-0934 tools-sgeexec-0937 from failures of jobs 761759, 815031, 815056, 855676, 898936


=== 2023-03-24 ===
* 14:13 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@endurance

=== 2021-04-08 ===
* 18:25 bstorm: cleaned up the deprecated entries in /data/project/.system_sge/gridengine/etc/submithosts for tools-sgegrid-master and tools-sgegrid-shadow using the old fqdns [[phab:T277653|T277653]]
* 09:24 arturo: allocate & associate floating IP 185.15.56.122 for tools-sgebastion-11, also with DNS A record `dev-buster.toolforge.org` ([[phab:T275865|T275865]])
* 09:22 arturo: create DNS A record `login-buster.toolforge.org` pointing to 185.15.56.66 (tools-sgebastion-10) ([[phab:T275865|T275865]])
* 09:20 arturo: associate floating IP 185.15.56.66 to tools-sgebastion-10 ([[phab:T275865|T275865]])
* 09:13 arturo: created tools-sgebastion-11 (buster) ([[phab:T275865|T275865]])


=== 2023-03-21 ===
* 08:11 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko

=== 2021-04-07 ===
* 04:35 andrewbogott: replacing the mx record '10 mail.tools.wmcloud.org' with '10 mail.tools.wmcloud.org.' — trying to fix axfr for the tools.wmcloud.org zone


=== 2023-03-20 ===
* 13:39 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko
* 10:57 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@endurance

=== 2021-04-06 ===
* 15:16 bstorm: cleared queue state since a few had "errored" for failed jobs.
* 12:59 dcaro: Removing etcd member tools-k8s-etcd-7.tools.eqiad1.wikimedia.cloud to get an odd number  ([[phab:T267082|T267082]])
* 11:45 arturo: upgrading jobutils & misctools to 1.42 everywhere
* 11:39 arturo: cleaning up aptly: old package versions, old repos (jessie, trusty, precise) etc
* 10:31 dcaro: Removing etcd member tools-k8s-etcd-6.tools.eqiad.wmflabs  ([[phab:T267082|T267082]])
* 10:21 arturo: published jobutils & misctools 1.42 ([[phab:T278748|T278748]])
* 10:21 arturo: aptly repo had some weirdness due to the cinder volume: hardlinks created by aptly were broken, solved with `sudo aptly publish --skip-signing repo stretch-tools -force-overwrite`
* 10:07 dcaro: adding new etcd member using the cookbook wmcs.toolforge.add_etcd_node ([[phab:T267082|T267082]])
* 10:05 arturo: installed aptly from buster-backports on tools-services-05 to see if that makes any difference with an issue when publishing repos
* 09:53 dcaro: Removing etcd member tools-k8s-etcd-4.tools.eqiad.wmflabs  ([[phab:T267082|T267082]])
* 08:55 dcaro: adding new etcd member using the cookbook wmcs.toolforge.add_etcd_node ([[phab:T267082|T267082]])


=== 2023-03-19 ===
* 09:32 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko

=== 2021-04-05 ===
* 17:02 bstorm: chowned the data volume for the docker registry to docker-registry:docker-registry
* 09:56 arturo: make jhernandez (IRC joakino) projectadmin ([[phab:T278975|T278975]])


=== 2023-03-17 ===
* 15:56 andrewbogott: truncating .out, .err, and .log files to 10MB in anticipation of moving the NFS volumes

=== 2021-04-01 ===
* 20:43 bstorm: cleared error state from the grid queues caused by unspecified job errors
* 15:53 dcaro: Removed etcd member tools-k8s-etcd-5.tools.eqiad.wmflabs, adding a new member  ([[phab:T267082|T267082]])
* 15:43 dcaro: Removing etcd member tools-k8s-etcd-5.tools.eqiad.wmflabs  ([[phab:T267082|T267082]])
* 15:36 dcaro: Added new etcd member tools-k8s-etcd-9.tools.eqiad1.wikimedia.cloud ([[phab:T267082|T267082]])
* 15:18 dcaro: adding new etcd member using the cookbook wmcs.toolforge.add_etcd_node ([[phab:T267082|T267082]])


=== 2023-03-13 ===
* 09:50 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-buildpack-admission-controller:f90bd8f from https://github.com/toolforge/buildpack-admission-controller ({{Gerrit|f90bd8f}}) - cookbook ran by dcaro@vulcanus

=== 2021-03-31 ===
* 15:57 arturo: rebooting `tools-mail-03` after enabling NFS ([[phab:T267082|T267082]], [[phab:T278538|T278538]])
* 15:04 arturo: created MX record for `tools.wmcloud.org` pointing to `mail.tools.wmcloud.org`
* 15:03 arturo: created DNS A record `mail.tools.wmcloud.org` pointing to 185.15.56.63
* 14:56 arturo: shutoff tools-mail-02 ([[phab:T278538|T278538]])
* 14:55 arturo: point floating IP 185.15.56.63 to tools-mail-03 ([[phab:T278538|T278538]])
* 14:45 arturo: created VM `tools-mail-03` as Debian Buster ([[phab:T278538|T278538]])
* 14:39 arturo: relocate some of the hiera keys for email server from project-level to prefix
* 09:44 dcaro: running disk performance test on etcd-4 (round2)
* 09:05 dcaro: running disk performance test on etcd-8
* 08:43 dcaro: running disk performance test on etcd-4


=== 2023-03-12 ===
* 13:40 taavi: restart haproxy on tools-k8s-haproxy-3

=== 2021-03-30 ===
* 16:15 bstorm: added `labstore::traffic_shaping::egress: 800mbps` to tools-static prefix [[phab:T278539|T278539]]
* 15:44 arturo: shutoff tools-static-12/13 ([[phab:T278539|T278539]])
* 15:41 arturo: point horizon web proxy `tools-static.wmflabs.org` to tools-static-14  ([[phab:T278539|T278539]])
* 15:37 arturo: add `mount_nfs: true` to tools-static prefix ([[phab:T278539|T278539]])
* 15:26 arturo: create VM tools-static-14 with Debian Buster image ([[phab:T278539|T278539]])
* 12:19 arturo: introduce horizon proxy `deb-tools.wmcloud.org` ([[phab:T278436|T278436]])
* 12:15 arturo: shutdown tools-sgebastion-09 (stretch)
* 11:05 arturo: created VM `tools-sgebastion-10` as Debian Buster ([[phab:T275865|T275865]])
* 11:04 arturo: created server group `tools-bastion` with anti-affinity policy


=== 2023-03-11 ===
* 18:38 wm-bot2: removing grid node tools-sgeexec-10-11.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 18:36 wm-bot2: removing grid node tools-sgeexec-10-11.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 18:34 wm-bot2: removing grid node tools-sgeexec-10-11.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 18:31 taavi: reboot misbehaving tools-sgeexec-10-11

=== 2021-03-28 ===
* 19:31 legoktm: legoktm@tools-sgebastion-08:~$ sudo qdel -f {{Gerrit|9999704}} # [[phab:T278645|T278645]]


=== 2023-03-10 ===
* 16:36 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/labs/tools/maintain-kubeusers ({{Gerrit|8b42b15}}) - cookbook ran by taavi@runko

=== 2021-03-27 ===
* 02:48 Reedy: qdel -f {{Gerrit|9999895}} {{Gerrit|9999799}}


=== 2023-03-09 ===
* 10:13 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/labs/tools/maintain-kubeusers ({{Gerrit|53e7f81}}) - cookbook ran by taavi@runko
* 10:04 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/maintain-kubeusers:834807c from https://gerrit.wikimedia.org/r/labs/tools/maintain-kubeusers ({{Gerrit|834807c}}) - cookbook ran by taavi@runko

=== 2021-03-26 ===
* 12:21 arturo: shutdown tools-package-builder-02 (stretch), we keep -03 which is buster ([[phab:T275864|T275864]])


=== 2023-03-08 ===
* 22:31 bd808: Live hacked user-maintainer clusterrole to work around breakage in [[phab:T331572|T331572]]

=== 2021-03-25 ===
* 19:30 bstorm: forced deletion of all jobs stuck in a deleting state [[phab:T277653|T277653]]
* 17:46 arturo: rebooting tools-sgeexec-* nodes to account for new grid master ([[phab:T277653|T277653]])
* 16:20 arturo: rebuilding tools-sgegrid-master VM as debian buster ([[phab:T277653|T277653]])
* 16:18 arturo: icinga-downtime toolschecker for 2h
* 16:05 bstorm: failed over the tools grid to the shadow master [[phab:T277653|T277653]]
* 13:36 arturo: shutdown tools-sge-services-03 ([[phab:T278354|T278354]])
* 13:33 arturo: shutdown tools-sge-services-04 ([[phab:T278354|T278354]])
* 13:31 arturo: point aptly clients to `tools-services-05.tools.eqiad1.wikimedia.cloud` (hiera change) ([[phab:T278354|T278354]])
* 12:58 arturo: created VM `tools-services-05` as Debian Buster ([[phab:T278354|T278354]])
* 12:51 arturo: create cinder volume `tools-aptly-data` ([[phab:T278354|T278354]])


=== 2023-03-07 ===
* 11:34 wm-bot2: Increased quotas by 2 volumes - cookbook ran by fran@wmf3169
* 11:09 wm-bot2: Increased quotas by 6 snapshots - cookbook ran by fran@wmf3169
* 11:07 wm-bot2: Increased quotas by 4000 gigabytes - cookbook ran by fran@wmf3169

=== 2021-03-24 ===
* 12:46 arturo: shutoff the old stretch VMs `tools-docker-registry-03` and `tools-docker-registry-04` ([[phab:T278303|T278303]])
* 12:38 arturo: associate floating IP 185.15.56.67 with `tools-docker-registry-05` and refresh FQDN docker-registry.tools.wmflabs.org accordingly ([[phab:T278303|T278303]])
* 12:33 arturo: attach cinder volume `tools-docker-registry-data` to VM `tools-docker-registry-05` ([[phab:T278303|T278303]])
* 12:32 arturo: snapshot cinder volume `tools-docker-registry-data` into `tools-docker-registry-data-stretch-migration` ([[phab:T278303|T278303]])
* 12:32 arturo: bump cinder storage quota from 80G to 400G (without quota request task)
* 12:11 arturo: created VM `tools-docker-registry-06` as Debian Buster ([[phab:T278303|T278303]])
* 12:09 arturo: detach cinder volume `tools-docker-registry-data` ([[phab:T278303|T278303]])
* 11:46 arturo: attach cinder volume `tools-docker-registry-data` to VM `tools-docker-registry-03` to format it and pre-populate it with registry data ([[phab:T278303|T278303]])
* 11:20 arturo: created 80G cinder volume tools-docker-registry-data ([[phab:T278303|T278303]])
* 11:10 arturo: starting VM tools-docker-registry-04 which was stopped probably since 2021-03-09 due to hypervisor draining


=== 2023-03-06 ===
* 12:51 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/labs/tools/registry-admission-webhook ({{Gerrit|6688477}}) - cookbook ran by taavi@runko
* 12:33 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/registry-admission:e916fee from https://gerrit.wikimedia.org/r/labs/tools/registry-admission-webhook ({{Gerrit|e916fee}}) - cookbook ran by taavi@runko
* 12:16 arturo: delete calico deployment, redeploy from https://gitlab.wikimedia.org/repos/cloud/toolforge/calico ([[phab:T328539|T328539]])

=== 2021-03-23 ===
* 12:46 arturo: aborrero@tools-sgegrid-master:~$ sudo systemctl restart gridengine-master.service
* 12:16 arturo: delete & re-create VM tools-sgegrid-shadow as Debian Buster ([[phab:T277653|T277653]])
* 12:14 arturo: created puppet prefix 'tools-sgegrid-shadow' and migrated puppet configuration from VM-puppet
* 12:13 arturo: created server group 'tools-grid-master-shadow' with anti-affinity policy


=== 2023-03-05 ===
* 15:43 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/labs/tools/maintain-kubeusers ({{Gerrit|3e04025}}) - cookbook ran by taavi@runko

=== 2021-03-18 ===
* 19:24 bstorm: set profile::toolforge::infrastructure across the entire project with login_server set on the bastion and exec node-related prefixes
* 16:21 andrewbogott: enabling puppet tools-wide
* 16:20 andrewbogott: disabling puppet tools-wide to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/672456
* 16:19 bstorm: added profile::toolforge::infrastructure class to puppetmaster [[phab:T277756|T277756]]
* 04:12 bstorm: rebooted tools-sgeexec-0935.tools.eqiad.wmflabs because it forgot how to LDAP...likely root cause of the issues tonight
* 03:59 bstorm: rebooting grid master. sorry for the cron spam
* 03:49 bstorm: restarting sssd on tools-sgegrid-master
* 03:37 bstorm: deleted a massive number of stuck jobs that misfired from the cron server
* 03:35 bstorm: rebooting tools-sgecron-01 to try to clear up the ldap-related errors coming out of it
* 01:46 bstorm: killed the toolschecker cron job, which had an LDAP error, and ran it again by hand


=== 2023-03-02 ===
* 11:32 arturo: aborrero@tools-k8s-control-2:~$ sudo -i kubectl apply -f /etc/kubernetes/toolforge-tool-roles.yaml (https://gerrit.wikimedia.org/r/c/operations/puppet/+/889836)

=== 2021-03-17 ===
* 20:57 bstorm: deployed changes to rbac for kubernetes to add kubectl top access for tools
* 20:26 andrewbogott: moving tools-elastic-3 to cloudvirt1034; two elastic nodes shouldn't be on the same hv


=== 2023-03-01 ===
* 13:18 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|13eda9d}}) - cookbook ran by taavi@runko

=== 2021-03-16 ===
* 16:31 arturo: installing jobutils and misctools 1.41
* 15:55 bstorm: deleted a bunch of messed up grid jobs ({{Gerrit|9989481}},8813,81682,86317,122602,122623,583621,606945,606999)
* 12:32 arturo: add packages jobutils / misctools v1.41 to <nowiki>{</nowiki>stretch,buster<nowiki>}</nowiki>-tools aptly repository in tools-sge-services-03


=== 2023-02-28 ===
* 17:19 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway ({{Gerrit|9252af7}}) - cookbook ran by taavi@runko
* 17:04 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|e46da83}}) - cookbook ran by taavi@runko

=== 2021-03-12 ===
* 23:13 bstorm: cleared error state for all grid queues


=== 2023-02-23 ===
* 18:07 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway ({{Gerrit|efb60b3}}) - cookbook ran by taavi@runko
* 09:33 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/buildpack-admission:b34e2f8 from https://github.com/toolforge/buildpack-admission-controller.git ({{Gerrit|b34e2f8}}) - cookbook ran by taavi@runko

=== 2021-03-11 ===
* 17:40 bstorm: deployed metrics-server:0.4.1 to kubernetes
* 16:21 bstorm: add jobutils 1.40 and misctools 1.40 to stretch-tools
* 13:11 arturo: add misctools 1.37 to buster-tools{{!}}toolsbeta aptly repo for [[phab:T275865|T275865]]
* 13:10 arturo: add jobutils 1.40 to buster-tools aptly repo for [[phab:T275865|T275865]]


=== 2023-02-21 ===
* 09:37 arturo: hard-reboot tools-sgeexec-10-11 (unresponsive to ssh)

=== 2021-03-10 ===
* 10:56 arturo: briefly stopped VM tools-k8s-etcd-7 to disable VMX cpu flag


=== 2023-02-20 ===
* 11:24 taavi: redeploy volume-admission with helm and cert-manager certificates [[phab:T329530|T329530]] [[phab:T292238|T292238]]
* 11:15 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/volume-admission:7fd13ac from https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller ({{Gerrit|ede8bd0}}) - cookbook ran by taavi@runko
* 11:05 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-volume-admission-controller:7fd13ac from https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller ({{Gerrit|7fd13ac}}) - cookbook ran by taavi@runko
* 10:39 wm-bot2: Increased quotas by 4000 gigabytes - cookbook ran by fran@wmf3169
* 09:20 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@nostromo

=== 2021-03-09 ===
* 13:31 arturo: hard-reboot tools-docker-registry-04 because issues related to [[phab:T276922|T276922]]
* 12:34 arturo: briefly rebooting VM tools-docker-registry-04, we need to reboot the hypervisor cloudvirt1038 and failed to migrate away


=== 2023-02-19 ===
* 09:16 taavi: uncordon tools-k8s-worker-[80-82] after fixing security groups [[phab:T329378|T329378]]

=== 2021-03-05 ===
* 12:30 arturo: started tools-redis-1004 again
* 12:22 arturo: stop tools-redis-1004 to ease draining of cloudvirt1035


=== 2023-02-17 ===
* 11:32 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|eeeea4c}}) - cookbook ran by arturo@endurance
* 11:31 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config ({{Gerrit|7729b18}}) ([[phab:T254636|T254636]]) - cookbook ran by arturo@endurance
* 11:26 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:8a9b97e from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|eeeea4c}}) - cookbook ran by arturo@endurance
* 11:24 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:8a9b97e from https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-framework-api ({{Gerrit|618ab29}}) - cookbook ran by arturo@endurance
* 10:25 arturo: build and push mariadb-sssd/base docker image for Toolforge ([[phab:T320178|T320178]], [[phab:T254636|T254636]])

=== 2021-03-04 ===
* 11:25 arturo: rebooted tools-sgewebgrid-generic-0901, repool it again
* 09:58 arturo: depool tools-sgewebgrid-generic-0901 to reboot VM. It was stuck in MIGRATING state when draining cloudvirt1022


=== 2023-02-16 ===
* 15:58 wm-bot2: Increased quotas by 4000 gigabytes - cookbook ran by fran@wmf3169
* 15:30 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/cert-manager ({{Gerrit|d71994e}}) - cookbook ran by arturo@nostromo
* 13:52 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/ingress-admission-controller ({{Gerrit|7191997}}) - cookbook ran by taavi@runko
* 13:44 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/ingress-admission:1fe8ec4 from https://gerrit.wikimedia.org/r/cloud/toolforge/ingress-admission-controller ({{Gerrit|1fe8ec4}}) - cookbook ran by taavi@runko
* 12:47 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/ingress-admission:e9b9920 from https://gerrit.wikimedia.org/r/cloud/toolforge/ingress-admission-controller ({{Gerrit|e9b9920}}) - cookbook ran by taavi@runko

=== 2021-03-03 ===
* 15:17 arturo: shutting down tools-sgebastion-07 in an attempt to fix nova state and finish hypervisor migration
* 15:11 arturo: tools-sgebastion-07 triggered a neutron exception (unauthorized) while being live-migrated from cloudvirt1021 to 1029. Resetting nova state with `nova reset-state bd685d48-1011-404e-a755-{{Gerrit|372f6022f345}} --active` and try again
* 14:48 arturo: killed pywikibot instance running in tools-sgebastion-07 by user msyn
* 10:35 arturo: aborrero@tools-k8s-control-1:~$ sudo -i kubectl apply -f /etc/kubernetes/psp/base-pod-security-policies.yaml
* 09:48 arturo: grid engine was failed over to shadow server, manually put it back into normal https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Grid#GridEngine_Master
* 09:39 arturo: aborrero@tools-sgegrid-shadow:~$ sudo truncate -s 1G /var/log/syslog (was 17G, full root disk)


=== 2023-02-15 ===
* 18:03 taavi: deployed https://gerrit.wikimedia.org/r/c/operations/puppet/+/889585/ to increase amount of haproxy max connections
* 15:19 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@nostromo
* 09:50 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/cert-manager.git ({{Gerrit|e3f3ce1}}) ([[phab:T329453|T329453]]) - cookbook ran by taavi@runko
* 09:30 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@nostromo

=== 2021-03-02 ===
* 15:23 bstorm: depooling tools-sgewebgrid-lighttpd-0914.tools.eqiad.wmflabs for reboot. It isn't communicating right
* 15:22 bstorm: cleared queue error states...will need to keep a better eye on what's causing those


=== 2023-02-14 ===
* 15:07 taavi: import cert-manager components to local docker registry [[phab:T329453|T329453]]
* 12:12 arturo: the fixed webservicemonitor is starting a bunch of grid webservices ([[phab:T329611|T329611]])
* 12:10 arturo: included tools-manifests 0.25 in tools-buster aptly repo, deploying it now! ([[phab:T329611|T329611]], [[phab:T329467|T329467]], [[phab:T244809|T244809]])

=== 2021-02-27 ===
* 02:23 bstorm: deployed typo fix to maintain-kubeusers in an innocent effort to make the weekend better [[phab:T275910|T275910]]
* 02:00 bstorm: running a script to repair the dumps mount in all podpresets [[phab:T275371|T275371]]


=== 2023-02-13 ===
* 16:05 wm-bot2: Increased quotas by 4000 gigabytes - cookbook ran by fran@wmf3169
* 16:03 taavi: update maintain-kubeusers deployment to use helm
* 15:05 taavi: deploy jobs-api updates, improving some status messages
* 15:04 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|13d87c4}}) - cookbook ran by taavi@runko
* 15:00 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:390ed64 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|390ed64}}) - cookbook ran by taavi@runko
* 13:14 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/maintain-kubeusers:aac195b from https://gerrit.wikimedia.org/r/labs/tools/maintain-kubeusers ({{Gerrit|aac195b}}) - cookbook ran by taavi@runko

=== 2021-02-26 ===
* 22:04 bstorm: cleaned up grid jobs {{Gerrit|1230666}},{{Gerrit|1908277}},{{Gerrit|1908299}},{{Gerrit|2441500}},{{Gerrit|2441513}}
* 21:27 bstorm: hard rebooting tools-sgeexec-0947
* 21:21 bstorm: hard rebooting tools-sgeexec-0952.tools.eqiad.wmflabs
* 20:01 bd808: Deleted csr in strange state for tool-ores-inspect


=== 2023-02-10 ===
* 15:45 taavi: reboot tools-k8s-worker-82 to troubleshoot network issues
* 12:44 wm-bot2: Added a new k8s worker tools-k8s-worker-82.tools.eqiad1.wikimedia.cloud to the worker pool ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 12:31 wm-bot2: Adding a new k8s worker node ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 12:29 wm-bot2: Added a new k8s worker tools-k8s-worker-81.tools.eqiad1.wikimedia.cloud to the worker pool ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 12:15 wm-bot2: Adding a new k8s worker node ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 11:53 wm-bot2: Adding a new k8s worker node ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 11:44 wm-bot2: removing grid node tools-sgeweblight-10-23.tools.eqiad1.wikimedia.cloud ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 11:42 wm-bot2: removing grid node tools-sgeexec-10-5.tools.eqiad1.wikimedia.cloud ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 11:39 wm-bot2: removing grid node tools-sgeweblight-10-19.tools.eqiad1.wikimedia.cloud ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 11:26 wm-bot2: removing grid node tools-sgeweblight-10-12.tools.eqiad1.wikimedia.cloud ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 11:24 wm-bot2: removing grid node tools-sgeexec-10-1.tools.eqiad1.wikimedia.cloud ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko

=== 2021-02-24 ===
* 18:30 bd808: `sudo wmcs-openstack role remove --user zfilipin --project tools user` [[phab:T267313|T267313]]
* 01:04 bstorm: hard rebooting tools-k8s-worker-76 because it's in a sorry state


=== 2023-02-01 ===
* 16:03 taavi: deployed tools-webservice 0.89
* 15:43 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config ({{Gerrit|372037f}}) - cookbook ran by taavi@runko

=== 2021-02-23 ===
* 23:11 bstorm: draining a bunch of k8s workers to clean up after dumps changes [[phab:T272397|T272397]]
* 23:06 bstorm: draining tools-k8s-worker-55 to clean up after dumps changes [[phab:T272397|T272397]]


=== 2023-01-26 ===
* 15:05 taavi: drain and reboot tools-k8s-worker-74 which seems to have some issues with nfs
* 14:37 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|307f302}}) - cookbook ran by taavi@runko
* 14:30 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:05966c6 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|05966c6}}) - cookbook ran by taavi@runko

=== 2021-02-22 ===
* 20:40 bstorm: repooled tools-sgeexec-0918.tools.eqiad.wmflabs
* 19:09 bstorm: hard rebooted tools-sgeexec-0918 from openstack [[phab:T275411|T275411]]
* 19:07 bstorm: shutting down tools-sgeexec-0918 with the VM's command line (not libvirt directly yet) [[phab:T275411|T275411]]
* 19:05 bstorm: shutting down tools-sgeexec-0918 (with openstack to see what happens) [[phab:T275411|T275411]]
* 19:03 bstorm: depooled tools-sgeexec-0918 [[phab:T275411|T275411]]
* 18:56 bstorm: deleted job {{Gerrit|1962508}} from the grid to clear it up [[phab:T275301|T275301]]
* 16:58 bstorm: cleared error state on several grid queues


=== 2023-01-24 ===
* 12:04 taavi: deploying toolforge-jobs-framework-cli v10 [[phab:T327775|T327775]]
* 10:07 taavi: publish toolforge-jobs-framework-cli v9

=== 2021-02-19 ===
* 12:31 arturo: deploying new version of toolforge ingress admission controller


=== 2023-01-23 ===
* 11:31 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d5ae229}}) - cookbook ran by taavi@runko
* 11:23 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:d085c50 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d085c50}}) - cookbook ran by taavi@runko
* 11:17 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config ({{Gerrit|864171a}}) - cookbook ran by taavi@runko

=== 2021-02-17 ===
* 21:26 bstorm: deleted tools-puppetdb-01 since it is unused at this time (and undersized anyway)


=== 2023-01-20 ===
* 23:24 andrewbogott: truncating logfiles with find . -name '*.err'  -size +1G -exec truncate --size=100M <nowiki>{</nowiki><nowiki>}</nowiki> \;
* 21:24 andrewbogott: truncating logfiles with find . -name '*.out'  -size +1G -exec truncate --size=100M <nowiki>{</nowiki><nowiki>}</nowiki> \;
* 01:06 andrewbogott: truncating logfiles with find . -name '*.log'  -size +1G -exec truncate --size=100M <nowiki>{</nowiki><nowiki>}</nowiki> \;

=== 2021-02-04 ===
* 16:27 bstorm: rebooting tools-package-builder-02
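The truncation entries logged in this section all follow one GNU find/truncate pattern, run once per extension. A minimal sketch of that pattern, combining the three runs into one command (the working directory is assumed to be the shared tool data directory; thresholds are the ones from the log):

```shell
# Find *.err / *.out / *.log files that have grown past 1 GiB and cut
# each back to 100 MiB in place, as in the logged cleanup runs above.
find . \( -name '*.err' -o -name '*.out' -o -name '*.log' \) \
    -size +1G -exec truncate --size=100M {} \;
```

Truncating in place keeps the inode, so a process that still has the log file open continues writing to it; deleting the file instead would leave the writer appending to an unlinked inode until restart.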


=== 2023-01-19 ===
* 11:46 arturo: `aborrero@tools-k8s-control-1:~$ sudo -i kubectl delete clusterrolebinding jobs-api-psp` (cleanup unused stuff)

=== 2021-01-26 ===
* 16:27 bd808: Hard reboot of tools-sgeexec-0906 via Horizon for [[phab:T272978|T272978]]


=== 2023-01-18 ===
* 15:42 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|0ad4c66}}) - cookbook ran by arturo@nostromo
* 15:29 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:54cc15e from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|54cc15e}}) - cookbook ran by arturo@nostromo

=== 2021-01-22 ===
* 09:59 dcaro: added the record redis.svc.tools.eqiad1.wikimedia.cloud pointing to tools-redis1003 ([[phab:T272679|T272679]])


=== 2023-01-17 ===
* 13:55 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|8cf38a1}}) - cookbook ran by arturo@endurance
* 13:51 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|0d0a882}}) - cookbook ran by arturo@endurance
* 13:34 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:3a58c1d from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|3a58c1d}}) - cookbook ran by arturo@endurance

=== 2021-01-21 ===
* 23:58 bstorm: deployed new maintain-kubeusers to tools [[phab:T271847|T271847]]


=== 2023-01-10 ===
* 11:55 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|8e0a2f9}}) - cookbook ran by arturo@endurance
* 11:52 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:9514b00 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|8e0a2f9}}) - cookbook ran by arturo@endurance
* 11:36 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|0243967}}) - cookbook ran by arturo@endurance

=== 2021-01-19 ===
* 22:57 bstorm: truncated 75GB error log /data/project/robokobot/virgule.err [[phab:T272247|T272247]]
* 22:48 bstorm: truncated 100GB error log /data/project/magnus-toolserver/error.log [[phab:T272247|T272247]]
* 22:43 bstorm: truncated 107GB log '/data/project/meetbot/logs/messages.log' [[phab:T272247|T272247]]
* 22:34 bstorm: truncating 194 GB error log '/data/project/mix-n-match/mnm-microsync.err' [[phab:T272247|T272247]]
* 16:37 bd808: Added Jhernandez to root sudoers group


=== 2023-01-03 ===
* 17:17 andrewbogott: find -name '*.log'  -size +1G -exec truncate --size=1G <nowiki>{</nowiki><nowiki>}</nowiki> \;

=== 2021-01-14 ===
* 20:56 bstorm: setting bastions to have mostly-uncapped egress network and 40MBps nfs_read for better shared use
* 20:43 bstorm: running tc-setup across the k8s workers
* 20:40 bstorm: running tc-setup across the grid fleet
* 17:58 bstorm: hard rebooting tools-sgecron-01 following network issues during upgrade to stein [[phab:T261134|T261134]]


=== 2022-12-20 ===
* 09:07 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@nostromo

=== 2021-01-13 ===
* 10:02 arturo: delete floating IP allocation 185.15.56.245 ([[phab:T271867|T271867]])


=== 2022-12-12 ===
* 14:36 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus

=== 2021-01-12 ===
* 18:16 bstorm: deleted wedged CSR tool-adhs-wde to get maintain-kubeusers working again [[phab:T271842|T271842]]


=== 2022-12-09 ===
* 07:20 taavi: change the canonical tools-mail external hostname to use mail.tools.wmcloud.org and add valid spf to toolforge.org [[phab:T324809|T324809]]

=== 2021-01-05 ===
* 18:49 bstorm: changing the limits on k8s etcd nodes again, so disabling puppet on them [[phab:T267966|T267966]]


=== 2022-12-05 ===
* 11:06 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus

=== 2021-01-04 ===
* 18:21 bstorm: ran 'sudo systemctl stop getty@ttyS1.service && sudo systemctl disable getty@ttyS1.service' on tools-k8s-etcd-5 I have no idea why that keeps coming back.


=== 2022-11-30 ===
* 10:39 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|bc3529d}}) - cookbook ran by arturo@nostromo
* 10:17 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:c360d54 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|c360d54}}) - cookbook ran by arturo@nostromo

=== 2020-12-22 ===
* 18:22 bstorm: rebooting the grid master because it is misbehaving following the NFS outage
* 10:53 arturo: rebase & resolve ugly git merge conflict in labs/private.git


=== 2022-11-29 ===
* 19:52 taavi: clear puppet failure emails from exim queues

=== 2020-12-18 ===
* 18:37 bstorm: set profile::wmcs::kubeadm::etcd_latency_ms: 15 [[phab:T267966|T267966]]


=== 2022-11-09 ===
* 08:58 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@nostromo

=== 2020-12-17 ===
* 21:42 bstorm: doing the same procedure to increase the timeouts more [[phab:T267966|T267966]]
* 19:56 bstorm: puppet enabled one at a time, letting things catch up. Timeouts are now adjusted to something closer to fsync values [[phab:T267966|T267966]]
* 19:44 bstorm: set etcd timeouts seed value to 20 instead of the default 10 (profile::wmcs::kubeadm::etcd_latency_ms) [[phab:T267966|T267966]]
* 18:58 bstorm: disabling puppet on k8s-etcd servers to alter the timeouts [[phab:T267966|T267966]]
* 14:23 arturo: regenerating puppet cert with proper alt names in tools-k8s-etcd-4 ([[phab:T267966|T267966]])
* 14:21 arturo: regenerating puppet cert with proper alt names in tools-k8s-etcd-5 ([[phab:T267966|T267966]])
* 14:19 arturo: regenerating puppet cert with proper alt names in tools-k8s-etcd-6 ([[phab:T267966|T267966]])
* 14:17 arturo: regenerating puppet cert with proper alt names in tools-k8s-etcd-7 ([[phab:T267966|T267966]])
* 14:15 arturo: regenerating puppet cert with proper alt names in tools-k8s-etcd-8 ([[phab:T267966|T267966]])
* 14:12 arturo: updated kube-apiserver manifest with new etcd nodes ([[phab:T267966|T267966]])
* 13:56 arturo: adding etcd dns_alt_names hiera keys to the puppet prefix https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/beb27b45a74765a64552f2d4f70a40b217b4f4e9%5E%21/
* 13:12 arturo: making k8s api server aware of the new etcd nodes via hiera update https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/3761c4c4dab1c3ed0ab0a1133d2ccf3df6c28baf%5E%21/ ([[phab:T267966|T267966]])
* 12:54 arturo: joining new etcd nodes in the k8s etcd cluster ([[phab:T267966|T267966]])
* 12:52 arturo: adding more etcd nodes in the hiera key in tools-k8s-etcd puppet prefix https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/b4f60768078eccdabdfab4cd99c7c57076de51b2
* 12:50 arturo: dropping more unused hiera keys in the tools-k8s-etcd puppet prefix https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/e9e66a6787d9b91c08cf4742a27b90b3e6d05aac
* 12:49 arturo: dropping unused hiera keys in the tools-k8s-etcd puppet prefix https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/2b4cb4a41756e602fb0996e7d0210e9102172424
* 12:16 arturo: created VM `tools-k8s-etcd-8` ([[phab:T267966|T267966]])
* 12:15 arturo: created VM `tools-k8s-etcd-7` ([[phab:T267966|T267966]])
* 12:13 arturo: created `tools-k8s-etcd` anti-affinity server group


=== 2022-11-05 ===
* 19:28 andrewbogott: cleaning up nfs share with  root@labstore1004:/srv/tools/shared/tools# find -name '*.err'  -size +1G -exec truncate --size=1G <nowiki>{</nowiki><nowiki>}</nowiki> \;
* 13:26 andrewbogott: cleaning up nfs share with  root@labstore1004:/srv/tools/shared/tools# find -name '*.log'  -size +1G -exec truncate --size=1G <nowiki>{</nowiki><nowiki>}</nowiki> \;

=== 2020-12-11 ===
* 18:29 bstorm: certificatesigningrequest.certificates.k8s.io "tool-production-error-tasks-metrics" deleted to stop maintain-kubeusers issues
* 12:14 dcaro: upgrading stable/main (clinic duty)
* 12:12 dcaro: upgrading buster-wikimedia/main (clinic duty)
* 12:03 dcaro: upgrading stable-updates/main, mainly cacertificates (clinic duty)
* 12:01 dcaro: upgrading stretch-backports/main, mainly libuv (clinic duty)
* 11:58 dcaro: disabled all the repos blocking upgrades on tools-package-builder-02 (duplicated, other releases...)
* 11:35 arturo: uncordon tools-k8s-worker-71 and tools-k8s-worker-55, they weren't uncordoned yesterday for whatever reasons ([[phab:T263284|T263284]])
* 11:27 dcaro: upgrading stretch-wikimedia/main (clinic duty)
* 11:20 dcaro: upgrading stretch-wikimedia/thirdparty/mono-project-stretch (clinic duty)
* 11:08 dcaro: upgrade stretch-wikimedia/component/php72 (minor upgrades) (clinic duty)
* 11:04 dcaro: upgrade oldstable/main packages (clinic duty)
* 10:58 dcaro: upgrade kubectl done (clinic duty)
* 10:53 dcaro: upgrade kubectl (clinic duty)
* 10:16 dcaro: upgrading oldstable/main packages (clinic duty)


=== 2022-11-04 ===
* 20:41 andrewbogott: cleaning up nfs share with  root@labstore1004:/srv/tools/shared/tools# find -name '*.err' -not -newermt "Nov 1, 2021" -exec rm <nowiki>{</nowiki><nowiki>}</nowiki> \;
* 14:02 andrewbogott: cleaning up nfs share with  root@labstore1004:/srv/tools/shared/tools# find -name '*.log' -not -newermt "Nov 1, 2021" -exec rm <nowiki>{</nowiki><nowiki>}</nowiki> \;
* 12:20 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d464be4}}) ([[phab:T304900|T304900]]) - cookbook ran by arturo@nostromo
* 12:12 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:2b800f5 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|2b800f5}}) ([[phab:T304900|T304900]]) - cookbook ran by arturo@nostromo

=== 2020-12-10 ===
* 17:35 bstorm: k8s-control nodes upgraded to 1.17.13 [[phab:T263284|T263284]]
* 17:16 arturo: k8s control nodes were all upgraded to 1.17, now upgrading worker nodes ([[phab:T263284|T263284]])
* 15:50 dcaro: puppet upgraded to 5.5.10 on the hosts, ping me if you see anything weird (clinic duty)
* 15:41 arturo: icinga-downtime toolschecker for 2h ([[phab:T263284|T263284]])
* 15:35 dcaro: Puppet 5 on tools-sgebastion-09 ran well and without issues, upgrading the other sge nodes (clinic duty)
* 15:32 dcaro: Upgrading puppet from 4 to 5 on tools-sgebastion-09 (clinic duty)
* 12:41 arturo: set hiera `profile::wmcs::kubeadm::component: thirdparty/kubeadm-k8s-1-17` in project & tools-k8s-control prefix ([[phab:T263284|T263284]])
* 11:50 arturo: disabled puppet in all k8s nodes in preparation for version upgrade ([[phab:T263284|T263284]])
* 11:50 arturo: disabled puppet in all k8s nodes in preparation for version upgrade ([[phab:T263284|T263284]])
* 09:58 dcaro: successful tesseract upgrade on tools-sgewebgrid-lighttpd-0914, upgrading the rest of nodes (clinic duty)
* 09:49 dcaro: upgrading tesseract on tools-sgewebgrid-lighttpd-0914 (clinic duty)
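The labstore cleanup entries in this section use the age-based variant of the find pattern: deleting files by modification time instead of truncating by size. A minimal sketch of that variant (the `Nov 1, 2021` cutoff comes from the logged commands; the working directory is assumed):

```shell
# Remove *.log / *.err files not modified since the cutoff date, as in
# the logged labstore1004 cleanup. -newermt compares mtime against a
# date string; -not inverts it to select the stale files.
find . \( -name '*.log' -o -name '*.err' \) \
    -not -newermt "Nov 1, 2021" -exec rm {} \;
```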


=== 2022-11-01 ===
* 09:37 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master ([[phab:T322110|T322110]]) - cookbook ran by dcaro@vulcanus

=== 2020-12-08 ===
* 19:01 bstorm: pushed updated calico node image (v3.14.0) to internal docker registry as well [[phab:T269016|T269016]]


=== 2022-10-26 ===
* 08:45 dcaro: depooling and rebooting tools-sgeexec-10-22 to get nfs scratch working again

=== 2020-12-07 ===
* 22:56 bstorm: pushed updated local copies of the typha, calico-cni and calico-pod2daemon-flexvol images to the tools internal registry [[phab:T269016|T269016]]


=== 2022-10-25 ===
* 16:14 wm-bot2: Increased quotas by 5120 gigabytes - cookbook ran by fran@wmf3169
* 15:26 dcaro: pushed a newer docker-registry.tools.wmflabs.org/python:3.9-slim-bullseye (from upstream python:3.9-slim-bullseye)

=== 2020-12-03 ===
* 09:18 arturo: restarted kubelet systemd service on tools-k8s-worker-38. Node was NotReady, complaining about 'use of closed network connection'
* 09:16 arturo: restarted kubelet systemd service on tools-k8s-worker-59. Node was NotReady, complaining about 'use of closed network connection'


=== 2022-10-20 ===
* 16:54 andrewbogott: rebooting tools-package-builder-04
* 16:49 andrewbogott: rebooting redis nodes (one at a time)
* 10:54 taavi: rebuild mono68-sssd image with the expired DST Root CA X3 removed [[phab:T311466|T311466]]

=== 2020-11-28 ===
* 23:35 Krenair: Re-scheduled 4 continuous jobs from tools-sgeexec-0908 as it appears to be broken, at about 23:20 UTC
* 04:35 Krenair: Ran `sudo -i kubectl -n tool-mdbot delete cm maintain-kubeusers` on tools-k8s-control-1 for [[phab:T268904|T268904]], seems to have regenerated ~tools.mdbot/.kube/config


=== 2022-10-18 ===
* 11:52 taavi: deploy toolforge-jobs-framework-cli deb v8
* 10:30 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-emailer ({{Gerrit|64385e9}}) ([[phab:T320405|T320405]]) - cookbook ran by arturo@nostromo
* 10:27 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:9be2272 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|9be2272}}) - cookbook ran by taavi@runko
* 10:18 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-emailer:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-emailer ({{Gerrit|64385e9}}) ([[phab:T320405|T320405]]) - cookbook ran by arturo@nostromo

=== 2020-11-24 ===
* 17:44 arturo: rebased labs/private.git. 2 patches had merge conflicts
* 16:36 bd808: clush -w @all -b 'sudo -i apt-get purge nscd'
* 16:31 bd808: Ran `sudo -i apt-get purge nscd` on tools-sgeexec-0932 to try and fix apt state for puppet


=== 2022-10-17 ===
* 07:25 taavi: push updated perl532 images [[phab:T320824|T320824]]

=== 2020-11-10 ===
* 19:45 andrewbogott: rebooting  tools-sgeexec-0950; OOM


=== 2022-10-14 ===
* 07:54 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|0cc020e}}) ([[phab:T311466|T311466]]) - cookbook ran by taavi@runko

=== 2020-11-02 ===
* 13:35 arturo: (typo: dcaro)
* 13:35 arturo: added dcar as projectadmin & user ([[phab:T266068|T266068]])


=== 2022-10-13 ===
* 15:10 arturo: restart jobs-emailer pod

=== 2020-10-29 ===
* 21:33 legoktm: published docker-registry.tools.wmflabs.org/toolbeta-test image ([[phab:T265681|T265681]])
* 21:10 bstorm: Added another ingress node to k8s cluster in case the load spikes are the problem [[phab:T266506|T266506]]
* 17:33 bstorm: hard rebooting tools-sgeexec-0905 and tools-sgeexec-0916 to get the grid back to full capacity
* 04:03 legoktm: published docker-registry.tools.wmflabs.org/toolforge-buster0-builder:latest image ([[phab:T265686|T265686]])


=== 2022-10-12 ===
* 23:25 bd808: Rebuilding all Toolforge docker images ([[phab:T278436|T278436]], [[phab:T311466|T311466]], [[phab:T293552|T293552]])
* 20:43 bd808: Rebuilding all Toolforge docker images to pick up bug and security fix packages. Third try seems to be working. ([[phab:T316554|T316554]])
* 20:31 bd808: Rebuilding all Toolforge docker images to pick up bug and security fix packages after fixing bug in building the bullseye base image. ([[phab:T316554|T316554]])
* 16:26 dcaro: deploy the latest registry admission webhook, now for real (image tag {{Gerrit|07bc7db}})
* 12:48 dcaro: deploy the latest registry admission webhook (image tag {{Gerrit|07bc7db}})
* 09:26 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus
* 09:19 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus

=== 2020-10-28 ===
* 23:42 bstorm: dramatically elevated the egress cap on tools-k8s-ingress nodes that were affected by the NFS settings [[phab:T266506|T266506]]
* 22:10 bstorm: launching tools-k8s-ingress-3 to try and get an NFS-free node [[phab:T266506|T266506]]
* 21:58 bstorm: set 'mount_nfs: false' on the tools-k8s-ingress prefix [[phab:T266506|T266506]]


=== 2022-10-11 ===
* 13:52 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:8574c36 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|8574c36}}) - cookbook ran by taavi@runko

=== 2020-10-23 ===
* 22:22 legoktm: imported pack_0.14.2-1_amd64.deb into buster-tools ([[phab:T266270|T266270]])


=== 2022-10-10 ===
* 19:30 taavi: rebooting all k8s worker nodes to clean up labstore1006/7 remains
* 16:51 taavi: clean up labstore1006/7 mounts from k8s control nodes [[phab:T320425|T320425]]
* 11:35 arturo: aborrero@tools-k8s-control-1:~$ sudo -i kubectl -n jobs-emailer rollout restart deployment/jobs-emailer ([[phab:T317998|T317998]])
* 08:44 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|afa90ed}}) ([[phab:T320284|T320284]]) - cookbook ran by taavi@runko
* 08:39 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|afa90ed}}) - cookbook ran by taavi@runko

=== 2020-10-21 ===
* 17:58 legoktm: pushed toolforge-buster0-<nowiki>{</nowiki>build,run<nowiki>}</nowiki>:latest images to docker registry


=== 2022-10-09 ===
* 17:29 taavi: kill 10 idle tmux sessions of user 'hoi' on tools-sgebastion-10 [[phab:T320352|T320352]]

=== 2020-10-15 ===
* 22:00 bstorm: manually removing nscd from tools-sgebastion-08 and running puppet
* 18:23 andrewbogott: uncordoning tools-k8s-worker-53, 54, 55, 59
* 17:28 andrewbogott: depooling tools-k8s-worker-53, 54, 55, 59
* 17:27 andrewbogott: uncordoning tools-k8s-worker-35, 37, 45
* 16:44 andrewbogott: depooling tools-k8s-worker-35, 37, 45


=== 2022-10-07 ===
* 13:02 taavi: taavi@cloudcontrol1005 ~ $ sudo mark_tool --disable oncall # [[phab:T320240|T320240]]

=== 2020-10-14 ===
* 21:00 andrewbogott: repooling tools-sgewebgrid-generic-0901 and tools-sgewebgrid-lighttpd-0915
* 20:37 andrewbogott: depooling tools-sgewebgrid-generic-0901 and tools-sgewebgrid-lighttpd-0915
* 20:35 andrewbogott: repooling tools-sgewebgrid-lighttpd-0911, 12, 13, 16
* 20:31 bd808: Deployed toollabs-webservice v0.74
* 19:53 andrewbogott: depooling tools-sgewebgrid-lighttpd-0911, 12, 13, 16 and moving to Ceph
* 19:47 andrewbogott: repooling tools-sgeexec-0932, 33, 34 and moving to Ceph
* 19:07 andrewbogott: depooling tools-sgeexec-0932, 33, 34 and moving to Ceph
* 19:06 andrewbogott: repooling tools-sgeexec-0935, 36, 38, 40 and moving to Ceph
* 16:56 andrewbogott: depooling tools-sgeexec-0935, 36, 38, 40 and moving to Ceph


=== 2022-10-06 ===
* 00:39 bd808: Image rebuild failing with debian apt repo signature issue. Will investigate tomorrow. ([[phab:T316554|T316554]])
* 00:36 bd808: Rebuilding all Toolforge docker images to pick up bug and security fix packages. ([[phab:T316554|T316554]])
* 00:04 bd808: Building new php74-sssd-base & web images ([[phab:T310435|T310435]])

=== 2020-10-10 ===
* 17:07 bstorm: cleared errors on tools-sgeexec-0912.tools.eqiad.wmflabs to get the queue moving again


=== 2022-10-03 ===
* 14:36 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/volume-admission:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller ({{Gerrit|8da432b}}) - cookbook ran by taavi@runko

=== 2020-10-08 ===
* 17:07 bstorm: rebuilding docker images with locales-all [[phab:T263339|T263339]]


=== 2022-09-28 ===
* 21:23 lucaswerkmeister: on tools-sgebastion-10: run-puppet-agent # [[phab:T318858|T318858]]
* 21:22 lucaswerkmeister: on tools-sgebastion-10: apt remove emacs-common emacs-bin-common # fix package conflict, [[phab:T318858|T318858]]
* 21:15 lucaswerkmeister: added root SSH key for myself, manually ran puppet on tools-sgebastion-10 to apply it (seemingly successfully)

=== 2020-10-06 ===
* 19:04 andrewbogott: uncordoned tools-k8s-worker-38
* 18:51 andrewbogott: uncordoned tools-k8s-worker-52
* 18:40 andrewbogott: draining and cordoning tools-k8s-worker-52 and tools-k8s-worker-38 for ceph migration


=== 2022-09-22 ===
* 12:30 taavi: add TheresNoTime to the 'toollabs-trusted' gerrit group [[phab:T317438|T317438]]
* 12:27 taavi: add TheresNoTime as a project admin and to the roots sudo policy [[phab:T317438|T317438]]

=== 2020-10-02 ===
* 21:09 bstorm: rebooting tools-k8s-worker-70 because it seems to be unable to recover from an old NFS disconnect
* 17:37 andrewbogott: stopping tools-prometheus-03 to attempt a snapshot
* 16:03 bstorm: shutting down tools-prometheus-04 to try to fsck the disk


=== 2022-09-10 ===
* 07:39 wm-bot2: removing instance tools-prometheus-03 - cookbook ran by taavi@runko

=== 2020-10-01 ===
* 21:39 andrewbogott: migrating tools-proxy-06 to ceph
* 21:35 andrewbogott: moving  k8s.tools.eqiad1.wikimedia.cloud from 172.16.0.99 (toolsbeta-test-k8s-haproxy-1) to 172.16.0.108 (toolsbeta-test-k8s-haproxy-2) in anticipation of downtime for haproxy-1 tomorrow


=== 2022-09-07 ===
* 10:22 dcaro: Pushing the new toolforge builder image based on the new 0.8 buildpacks ([[phab:T316854|T316854]])

=== 2020-09-30 ===
* 18:34 andrewbogott: repooling tools-sgeexec-0918
* 18:29 andrewbogott: depooling tools-sgeexec-0918 so I can reboot cloudvirt1036


=== 2022-09-06 ===
* 08:06 dcaro_away: Published new toolforge-bullseye0-run and toolforge-bullseye0-build images for the toolforge buildpack builder ([[phab:T316854|T316854]])

=== 2020-09-23 ===
* 21:38 bstorm: ran an 'apt clean' across the fleet to get ahead of the new locale install


=== 2022-08-25 ===
* 10:40 taavi: tagged new version of the python39-web container with a shell implementation of webservice-runner [[phab:T293552|T293552]]

=== 2020-09-18 ===
* 19:41 andrewbogott: repooling tools-k8s-worker-30, 33, 34, 57, 60
* 19:04 andrewbogott: depooling tools-k8s-worker-30, 33, 34, 57, 60
* 19:02 andrewbogott: repooling tools-k8s-worker-41, 43, 44, 47, 48, 49, 50, 51
* 17:48 andrewbogott: depooling tools-k8s-worker-41, 43, 44, 47, 48, 49, 50, 51
* 17:47 andrewbogott: repooling tools-k8s-worker-31, 32, 36, 39, 40
* 16:40 andrewbogott: depooling tools-k8s-worker-31, 32, 36, 39, 40
* 16:38 andrewbogott: repooling tools-sgewebgrid-lighttpd-0914, tools-sgewebgrid-generic-0902, tools-sgewebgrid-lighttpd-0919, tools-sgewebgrid-lighttpd-0918
* 16:10 andrewbogott: depooling tools-sgewebgrid-lighttpd-0914, tools-sgewebgrid-generic-0902, tools-sgewebgrid-lighttpd-0919, tools-sgewebgrid-lighttpd-0918
* 13:54 andrewbogott: repooling tools-sgeexec-0913, tools-sgeexec-0915, tools-sgeexec-0916
* 13:50 andrewbogott: depooling tools-sgeexec-0913, tools-sgeexec-0915, tools-sgeexec-0916  for flavor update
* 01:20 andrewbogott: repooling tools-sgeexec-0901, tools-sgeexec-0905, tools-sgeexec-0910, tools-sgeexec-0911, tools-sgeexec-0912  after flavor update
* 01:11 andrewbogott: depooling tools-sgeexec-0901, tools-sgeexec-0905, tools-sgeexec-0910, tools-sgeexec-0911, tools-sgeexec-0912  for flavor update
* 01:08 andrewbogott: repooling tools-sgeexec-0917, tools-sgeexec-0918, tools-sgeexec-0919, tools-sgeexec-0920  after flavor update
* 01:00 andrewbogott: depooling tools-sgeexec-0917, tools-sgeexec-0918, tools-sgeexec-0919, tools-sgeexec-0920  for flavor update
* 00:58 andrewbogott: repooling tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 after flavor update
* 00:49 andrewbogott: depooling tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 for flavor update


=== 2022-08-24 ===
* 12:20 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-nginx ({{Gerrit|eba66bc}}) - cookbook ran by taavi@runko
* 12:20 taavi: upgrading ingress-nginx to v1.3

=== 2020-09-17 ===
* 21:56 bd808: Built and deployed tools-manifest v0.22 ([[phab:T263190|T263190]])
* 21:55 bd808: Built and deployed tools-manifest v0.22 ([[phab:T169695|T169695]])
* 20:34 bd808: Live hacked "--backend=gridengine" into webservicemonitor on tools-sgecron-01 ([[phab:T263190|T263190]])
* 20:21 bd808: Restarted webservicemonitor on tools-sgecron-01.tools.eqiad.wmflabs
* 20:09 andrewbogott: I didn't actually depool tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 because there was some kind of brief outage just now
* 19:58 andrewbogott: depooling tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 for flavor update
* 19:55 andrewbogott: repooling tools-k8s-worker-61,62,64,65,67,68,69 for flavor update
* 19:29 andrewbogott: depooling tools-k8s-worker-61,62,64,65,67,68,69 for flavor update
* 15:38 andrewbogott: repooling tools-k8s-worker-70 and tools-k8s-worker-66 after flavor remapping
* 15:34 andrewbogott: depooling tools-k8s-worker-70 and tools-k8s-worker-66 for flavor remapping
* 15:30 andrewbogott: repooling tools-sgeexec-0909, 0908, 0907, 0906, 0904
* 15:21 andrewbogott: depooling tools-sgeexec-0909, 0908, 0907, 0906, 0904 for flavor remapping
* 13:55 andrewbogott: depooled tools-sgewebgrid-lighttpd-0917 and tools-sgewebgrid-lighttpd-0920
* 13:55 andrewbogott: repooled tools-sgeexec-0937 after move to ceph
* 13:45 andrewbogott: depooled tools-sgeexec-0937 for move to ceph


=== 2022-08-20 ===
* 07:44 dcaro_away: all k8s nodes ready now \o/ ([[phab:T315718|T315718]])
* 07:43 dcaro_away: rebooted tools-k8s-control-2, seemed stuck trying to wait for tools home (nfs?), after reboot came back up ([[phab:T315718|T315718]])
* 07:41 dcaro_away: cloudvirt1023 down took out 3 workers, 1 control, and a grid exec and a weblight, they are taking long to restart, looking ([[phab:T315718|T315718]])

=== 2020-09-16 ===
* 23:20 andrewbogott: repooled tools-sgeexec-0941 and tools-sgeexec-0939 for move to ceph
* 23:03 andrewbogott: depooled tools-sgeexec-0941 and tools-sgeexec-0939 for move to ceph
* 23:02 andrewbogott: uncordoned tools-k8s-worker-58, tools-k8s-worker-56, tools-k8s-worker-42 for migration to ceph
* 22:29 andrewbogott: draining tools-k8s-worker-58, tools-k8s-worker-56, tools-k8s-worker-42 for migration to ceph
* 17:37 andrewbogott: service gridengine-master restart on tools-sgegrid-master


=== 2022-08-18 ===
* 14:45 andrewbogott: adding lucaswerkmeister  as projectadmin ([[phab:T314527|T314527]])
* 14:43 andrewbogott: removing some inactive projectadmins: rush, petrb, mdipietro, jeh, krenair

=== 2020-09-10 ===
* 15:37 arturo: hard-rebooting tools-proxy-05
* 15:33 arturo: rebooting tools-proxy-05 to try flushing local DNS caches
* 15:25 arturo: detected missing DNS record for k8s.tools.eqiad1.wikimedia.cloud which means the k8s cluster is down
* 10:22 arturo: enabling ingress dedicated worker nodes in the k8s cluster ([[phab:T250172|T250172]])


=== 2022-08-17 ===
* 16:34 taavi: kubectl sudo delete cm -n tool-wdml maintain-kubeusers # [[phab:T315459|T315459]]
* 08:30 taavi: failing the grid from the shadow back to the master, some disruption expected

=== 2020-09-09 ===
* 11:12 arturo: new ingress nodes added to the cluster, and tainted/labeled per the docs https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Deploying#ingress_nodes ([[phab:T250172|T250172]])
* 10:50 arturo: created puppet prefix `tools-k8s-ingress` ([[phab:T250172|T250172]])
* 10:42 arturo: created VMs tools-k8s-ingress-1 and tools-k8s-ingress-2 in the `tools-ingress` server group ([[phab:T250172|T250172]])
* 10:38 arturo: created server group `tools-ingress` with soft anti-affinity policy ([[phab:T250172|T250172]])


=== 2022-08-16 ===
* 17:28 taavi: fail over docker-registry, tools-docker-registry-06->docker-registry-05

=== 2020-09-08 ===
* 23:24 bstorm: clearing grid queue error states blocking job runs
* 22:53 bd808: forcing puppet run on tools-sgebastion-07


=== 2022-08-11 ===
* 16:57 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko
* 16:55 taavi: restart puppetdb on tools-puppetdb-1, crashed during the ceph issues

=== 2020-09-02 ===
* 18:13 andrewbogott: moving tools-sgeexec-0920 to ceph
* 17:57 andrewbogott: moving tools-sgeexec-0942 to ceph


=== 2022-08-05 ===
* 15:08 wm-bot2: removing grid node tools-sgewebgen-10-1.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:05 wm-bot2: removing grid node tools-sgeexec-10-12.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:00 wm-bot2: created node tools-sgewebgen-10-3.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko

=== 2020-08-31 ===
* 19:58 andrewbogott: migrating tools-sgeexec-091[0-9] to ceph
* 17:19 andrewbogott: migrating tools-sgeexec-090[4-9] to ceph
* 17:19 andrewbogott: repooled tools-sgeexec-0901
* 16:52 bstorm: `apt install uwsgi` was run on tools-checker-03 in the last log [[phab:T261677|T261677]]
* 16:51 bstorm: running `apt install uwsgi` with --allow-downgrades to fix the puppet setup there [[phab:T261677|T261677]]
* 14:26 andrewbogott: depooling tools-sgeexec-0901, migrating to ceph


=== 2022-08-03 ===
* 15:51 dhinus: recreated jobs-api pods to pick up new ConfigMap
* 15:02 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|c47ac41}}) - cookbook ran by fran@MacBook-Pro.station

=== 2020-08-30 ===
* 00:57 Krenair: also ran qconf -ds on each
* 00:35 Krenair: Tidied up SGE problems (it was spamming root@ every minute for hours) following host deletions some hours ago - removed tools-sgeexec-0921 through 0931 from @general, ran qmod -rj on all jobs registered for those nodes, then qdel -f on the remainders, then qconf -de on each deleted node
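The node-removal cleanup Krenair describes above is a fixed gridengine sequence: drop the hosts from the hostgroup, reschedule their registered jobs, force-delete the stragglers, then delete the exec host entries. A dry-run sketch of that sequence — it only echoes the commands, since they need a live grid master, and the `<...>` job arguments are placeholders for ids you would read off the queue state:

```shell
# Dry-run: print the decommission sequence for tools-sgeexec-0921..0931.
# The <...> arguments are placeholders; on a real master you would take the
# job ids from the queue/host listings. The -dattr form is one way to drop
# a host from the @general hostgroup.
plan() {
  for i in $(seq -w 921 931); do
    node="tools-sgeexec-0${i}.tools.eqiad.wmflabs"
    echo "qconf -dattr hostgroup hostlist ${node} @general"  # remove node from @general
    echo "qmod -rj <jobs-registered-on-${node}>"             # reschedule its jobs
    echo "qdel -f <remaining-jobs-on-${node}>"               # force-delete the stragglers
    echo "qconf -de ${node}"                                 # delete the exec host entry
  done
}
plan
```

The follow-up `qconf -ds` in the next entry additionally removes each node's submit-host registration.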


=== 2022-07-20 ===
* 19:31 taavi: reboot toolserver-proxy-01 to free up disk space probably held by stale file handles
* 08:06 wm-bot2: removing grid node tools-sgeexec-10-6.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko

=== 2020-08-29 ===
* 16:02 bstorm: deleting "tools-sgeexec-0931", "tools-sgeexec-0930", "tools-sgeexec-0929", "tools-sgeexec-0928", "tools-sgeexec-0927"
* 16:00 bstorm: deleting "tools-sgeexec-0926", "tools-sgeexec-0925", "tools-sgeexec-0924", "tools-sgeexec-0923", "tools-sgeexec-0922", "tools-sgeexec-0921"


=== 2022-07-19 ===
* 17:53 wm-bot2: created node tools-sgeexec-10-21.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 17:00 wm-bot2: removing grid node tools-sgeexec-10-3.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 16:58 wm-bot2: removing grid node tools-sgeexec-10-4.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 16:24 wm-bot2: created node tools-sgeexec-10-20.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 15:59 taavi: tag current maintain-kubernetes :beta image as :latest

=== 2020-08-26 ===
* 21:08 bd808: Disabled puppet on tools-proxy-06 to test fixes for a bug in the new [[phab:T251628|T251628]] code
* 08:54 arturo: merged several patches by bryan for toolforge front proxy (cleanups, etc) example: https://gerrit.wikimedia.org/r/c/operations/puppet/+/622435


=== 2022-07-17 ===
* 15:52 wm-bot2: removing grid node tools-sgeexec-10-10.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:43 wm-bot2: removing grid node tools-sgeexec-10-2.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 13:26 wm-bot2: created node tools-sgeexec-10-16.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko

=== 2020-08-25 ===
* 19:38 andrewbogott: deleting tools-sgeexec-0943.tools.eqiad.wmflabs, tools-sgeexec-0944.tools.eqiad.wmflabs, tools-sgeexec-0945.tools.eqiad.wmflabs, tools-sgeexec-0946.tools.eqiad.wmflabs, tools-sgeexec-0948.tools.eqiad.wmflabs, tools-sgeexec-0949.tools.eqiad.wmflabs, tools-sgeexec-0953.tools.eqiad.wmflabs — they are broken and we're not very curious why; will retry this exercise when everything is standardized on
* 15:03 andrewbogott: removing non-ceph nodes tools-sgeexec-0921 through tools-sgeexec-0931
* 15:02 andrewbogott: added new sge-exec nodes tools-sgeexec-0943 through tools-sgeexec-0953 (for real this time)


=== 2022-07-14 ===
* 13:48 taavi: rebooting tools-sgeexec-10-2

=== 2020-08-19 ===
* 21:29 andrewbogott: shutting down and removing tools-k8s-worker-20 through tools-k8s-worker-29; this load can now be handled by new nodes on ceph hosts
* 21:15 andrewbogott: shutting down and removing tools-k8s-worker-1 through tools-k8s-worker-19; this load can now be handled by new nodes on ceph hosts
* 18:40 andrewbogott: creating 13 new xlarge k8s worker nodes, tools-k8s-worker-67 through tools-k8s-worker-79


=== 2022-07-13 ===
* 12:09 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus

=== 2020-08-18 ===
* 15:24 bd808: Rebuilding all Docker containers to pick up newest versions of installed packages


=== 2022-07-11 ===
* 16:06 wm-bot2: Increased quotas by <nowiki>{</nowiki>self.increases<nowiki>}</nowiki> ([[phab:T312692|T312692]]) - cookbook ran by nskaggs@x1carbon

=== 2020-07-30 ===
* 16:28 andrewbogott: added new xlarge ceph-hosted worker nodes: tools-k8s-worker-61, 62, 63, 64, 65, 66. [[phab:T258663|T258663]]


=== 2022-07-07 ===
* 07:34 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus

=== 2020-07-29 ===
* 23:24 bd808: Pushed a copy of docker-registry.wikimedia.org/wikimedia-jessie:latest to docker-registry.tools.wmflabs.org/wikimedia-jessie:latest in preparation for the upstream image going away


=== 2022-06-28 ===
* 17:34 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master ([[phab:T311538|T311538]]) - cookbook ran by dcaro@vulcanus
* 15:51 taavi: add 4096G cinder quota [[phab:T311509|T311509]]

=== 2020-07-24 ===
* 22:33 bd808: Removed a few more ancient docker images: grrrit, jessie-toollabs, and nagf
* 21:02 bd808: Running cleanup script to delete the non-sssd toolforge images from docker-registry.tools.wmflabs.org
* 20:17 bd808: Forced garbage collection on docker-registry.tools.wmflabs.org
* 20:06 bd808: Running cleanup script to delete all of the old toollabs-* images from docker-registry.tools.wmflabs.org


=== 2022-06-27 ===
* 18:14 taavi: restart calico, appears to have got stuck after the ca replacement operation
* 18:02 taavi: switchover active cron server to tools-sgecron-2 [[phab:T284767|T284767]]
* 17:54 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0915.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:52 wm-bot2: removing grid node tools-sgewebgrid-generic-0902.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:49 wm-bot2: removing grid node tools-sgeexec-0942.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:15 taavi: [[phab:T311412|T311412]] updating ca used by k8s-apiserver->etcd communication, breakage may happen
* 14:58 taavi: renew puppet ca cert and certificate for tools-puppetmaster-02 [[phab:T311412|T311412]]
* 14:50 taavi: backup /var/lib/puppet/server to /root/puppet-ca-backup-2022-06-27.tar.gz on tools-puppetmaster-02 before we do anything else to it [[phab:T311412|T311412]]

=== 2020-07-22 ===
* 23:24 bstorm: created server group 'tools-k8s-worker' to create any new worker nodes in so that they have a low chance of being scheduled together by openstack unless it is necessary [[phab:T258663|T258663]]
* 23:22 bstorm: running puppet and NFS 4.2 remount on tools-k8s-worker-[56-60] [[phab:T257945|T257945]]
* 23:17 bstorm: running puppet and NFS 4.2 remount on tools-k8s-worker-[41-55] [[phab:T257945|T257945]]
* 23:14 bstorm: running puppet and NFS 4.2 remount on tools-k8s-worker-[21-40] [[phab:T257945|T257945]]
* 23:11 bstorm: running puppet and NFS remount on tools-k8s-worker-[1-15] [[phab:T257945|T257945]]
* 23:07 bstorm: disabling puppet on k8s workers to reduce the effect of changing the NFS mount version all at once [[phab:T257945|T257945]]
* 22:28 bstorm: setting tools-k8s-control prefix to mount NFS v4.2 [[phab:T257945|T257945]]
* 22:15 bstorm: set the tools-k8s-control nodes to also use 800MBps to prevent issues with toolforge ingress and api system
* 22:07 bstorm: set tools-k8s-haproxy-1 (the main load balancer for Toolforge) to an egress limit of 800MB per sec instead of the same as all the other servers


=== 2022-06-23 ===
* 17:51 wm-bot2: removing grid node tools-sgeexec-0941.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:49 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0916.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:46 wm-bot2: removing grid node tools-sgewebgrid-generic-0901.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:32 wm-bot2: removing grid node tools-sgeexec-0939.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:30 wm-bot2: removing grid node tools-sgeexec-0938.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:27 wm-bot2: removing grid node tools-sgeexec-0937.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:22 wm-bot2: removing grid node tools-sgeexec-0936.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:19 wm-bot2: removing grid node tools-sgeexec-0935.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:17 wm-bot2: removing grid node tools-sgeexec-0934.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:14 wm-bot2: removing grid node tools-sgeexec-0933.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:11 wm-bot2: removing grid node tools-sgeexec-0932.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:09 wm-bot2: removing grid node tools-sgeexec-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:30 wm-bot2: removing grid node tools-sgeexec-0947.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 13:59 taavi: removing remaining continuous jobs from the stretch grid [[phab:T277653|T277653]]

=== 2020-07-21 ===
* 16:09 bstorm: rebooting tools-sgegrid-shadow to remount NFS correctly
* 15:55 bstorm: set the bastion prefix to have explicitly set hiera value of profile::wmcs::nfsclient::nfs_version: '4'


=== 2022-06-22 ===
* 15:54 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0917.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:51 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0918.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:47 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0919.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:45 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko

=== 2020-07-17 ===
* 16:47 bd808: Enabled Puppet on tools-proxy-06 following successful test ([[phab:T102367|T102367]])
* 16:29 bd808: Disabled Puppet on tools-proxy-06 to test nginx config changes manually ([[phab:T102367|T102367]])


=== 2022-06-21 ===
* 15:23 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:20 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:18 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0913.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:07 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0912.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko

=== 2020-07-15 ===
* 23:11 bd808: Removed ssh root key for valhallasw from project hiera ([[phab:T255697|T255697]])


=== 2022-06-03 ===
* 20:07 wm-bot2: created node tools-sgeweblight-10-26.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:51 balloons: Scaling webservice nodes to 20, using new 8G swap flavor [[phab:T309821|T309821]]
* 19:35 wm-bot2: created node tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:03 wm-bot2: created node tools-sgeweblight-10-20.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:01 wm-bot2: created node tools-sgeweblight-10-19.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:00 balloons: depooled old nodes, bringing entirely new grid of nodes online [[phab:T309821|T309821]]
* 18:22 wm-bot2: created node tools-sgeweblight-10-17.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 17:54 wm-bot2: created node tools-sgeweblight-10-16.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 17:52 wm-bot2: created node tools-sgeweblight-10-15.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 16:59 andrewbogott: building a bunch of new lighttpd nodes (beginning with tools-sgeweblight-10-12) using a flavor with more swap space
* 16:56 wm-bot2: created node tools-sgeweblight-10-12.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 15:50 balloons: fix g3.cores4.ram8.disk20.swap24.ephem20 flavor to include swap; convert to g3.cores4.ram8.disk20.swap8.ephem20 flavor [[phab:T309821|T309821]]
* 15:50 balloons: temp add 1.0G swap to sgeweblight hosts [[phab:T309821|T309821]]
* 13:25 bd808: Upgrading fleet to tools-webservice 0.86 ([[phab:T309821|T309821]])
* 13:20 bd808: publish tools-webservice 0.86 ([[phab:T309821|T309821]])
* 12:46 taavi: start webservicemonitor on tools-sgecron-01 [[phab:T309821|T309821]]
* 10:36 taavi: draining each sgeweblight node one by one, and removing the jobs stuck in 'deleting' too
* 05:05 taavi: removing duplicate (there should be only one per tool) web service jobs from the grid [[phab:T309821|T309821]]
* 04:52 taavi: revert bd808's changes to profile::toolforge::active_proxy_host
* 03:21 bd808: Cleared queue error states after deploying new toolforge-webservice package ([[phab:T309821|T309821]])
* 03:10 bd808: publish tools-webservice 0.85 with hack for [[phab:T309821|T309821]]

=== 2020-07-09 ===
* 18:53 bd808: Updating git-review to 1.27 via clush across cluster ([[phab:T257496|T257496]])


=== 2022-06-02 ===
* 22:26 bd808: Rebooting tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud. Node is full of jobs that are not tracked by grid master and failing to spawn new jobs sent by the scheduler
* 21:56 bd808: Removed legacy "active_proxy_host" hiera setting
* 21:55 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for profile::toolforge::active_proxy_host key
* 21:41 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for active_redis key
* 21:23 wm-bot2: created node tools-sgeweblight-10-8.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 12:42 wm-bot2: rebooting stretch exec grid workers - cookbook ran by taavi@runko
* 12:13 wm-bot2: created node tools-sgeweblight-10-7.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 12:03 dcaro: refresh prometheus certs ([[phab:T308402|T308402]])
* 11:47 dcaro: refresh registry-admission-controller certs ([[phab:T308402|T308402]])
* 11:42 dcaro: refresh ingress-admission-controller certs ([[phab:T308402|T308402]])
* 11:36 dcaro: refresh volume-admission-controller certs ([[phab:T308402|T308402]])
* 11:24 wm-bot2: created node tools-sgeweblight-10-6.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 11:17 taavi: publish jobutils 1.44 that updates the grid default from stretch to buster [[phab:T277653|T277653]]
* 10:16 taavi: publish tools-webservice 0.84 that updates the grid default from stretch to buster [[phab:T277653|T277653]]
* 09:54 wm-bot2: created node tools-sgeexec-10-14.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko

=== 2020-07-08 ===
* 11:16 arturo: merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/610029 -- important change to front-proxy ([[phab:T234617|T234617]])
* 11:11 arturo: live-hacking puppetmaster with https://gerrit.wikimedia.org/r/c/operations/puppet/+/610029 ([[phab:T234617|T234617]])


=== 2022-06-01 ===
* 11:18 taavi: depool and remove tools-sgeexec-09[07-14]

=== 2020-07-07 ===
* 23:22 bd808: Rebuilding all Docker images to pick up webservice v0.73 ([[phab:T234617|T234617]], [[phab:T257229|T257229]])
* 23:19 bd808: Deploying webservice v0.73 via clush ([[phab:T234617|T234617]], [[phab:T257229|T257229]])
* 23:16 bd808: Building webservice v0.73 ([[phab:T234617|T234617]], [[phab:T257229|T257229]])
* 15:01 Reedy: killed python process from tools.experimental-embeddings using a lot of cpu on tools-sgebastion-07
* 15:01 Reedy: killed meno25 process running pwb.py on tools-sgebastion-07
* 09:59 arturo: point DNS tools.wmflabs.org A record to 185.15.56.60 (tools-legacy-redirector) ([[phab:T247236|T247236]])


=== 2022-05-31 ===
* 16:51 taavi: delete tools-sgeexec-0904 for [[phab:T309525|T309525]] experimentation

=== 2020-07-06 ===
* 11:54 arturo: briefly point DNS tools.wmflabs.org A record to 185.15.56.60 (tools-legacy-redirector) and then switch back to 185.15.56.11 (tools-proxy-05). The legacy redirector does HTTP/307 ([[phab:T247236|T247236]])
* 11:50 arturo: associate floating IP address 185.15.56.60 to tools-legacy-redirector ([[phab:T247236|T247236]])


=== 2022-05-30 ===
* 08:24 taavi: depool tools-sgeexec-[0901-0909] (7 nodes total) [[phab:T277653|T277653]]

=== 2020-07-01 ===
* 11:19 arturo: cleanup exim email queue (4 frozen messages)
* 11:02 arturo: live-hacking puppetmaster with https://gerrit.wikimedia.org/r/c/operations/puppet/+/608849 ([[phab:T256737|T256737]])
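Frozen messages like the four arturo cleared above are usually removed with exim's queue tools; the common idiom is `exiqgrep -z -i | xargs exim -Mrm`, where `-z` selects frozen messages and `-i` prints only their ids. A dry-run sketch with made-up message ids standing in for the `exiqgrep` output:

```shell
# Dry-run: hypothetical frozen-message ids standing in for `exiqgrep -z -i` output.
frozen_ids="1jzKq8-0001a2-3B 1jzKq9-0001a4-5C 1jzKqA-0001a6-7D 1jzKqB-0001a8-9E"
for id in $frozen_ids; do
  echo "exim -Mrm ${id}"  # -Mrm removes the message from the queue entirely
done
```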


=== 2022-05-26 ===
* 15:39 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|e6fa299}}) ([[phab:T309146|T309146]]) - cookbook ran by taavi@runko

=== 2020-06-30 ===
* 11:18 arturo: set some hiera keys for mtail in puppet prefix `tools-mail` ([[phab:T256737|T256737]])


=== 2022-05-22 ===
* 17:04 taavi: failover tools-redis to the updated cluster [[phab:T278541|T278541]]
* 16:42 wm-bot2: removing grid node tools-sgeexec-0940.tools.eqiad1.wikimedia.cloud ([[phab:T308982|T308982]]) - cookbook ran by taavi@runko

=== 2020-06-29 ===
* 22:48 legoktm: built html-sssd/web image ([[phab:T241817|T241817]])
* 22:23 legoktm: rebuild python<nowiki>{</nowiki>34,35,37<nowiki>}</nowiki>-sssd/web images for https://gerrit.wikimedia.org/r/608093
* 12:01 arturo: introduced spam filter in the mail server ([[phab:T120210|T120210]])


=== 2022-05-16 ===
* 14:02 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-nginx ({{Gerrit|7037eca}}) - cookbook ran by taavi@runko

=== 2020-06-25 ===
* 21:49 zhuyifei1999_: re-enabling puppet on tools-sgebastion-09 [[phab:T256426|T256426]]
* 21:39 zhuyifei1999_: disabling puppet on tools-sgebastion-09 so I can play with mount settings [[phab:T256426|T256426]]
* 21:24 bstorm: hard rebooting tools-sgebastion-09


=== 2022-05-14 ===
* 10:47 taavi: hard reboot unresponsive tools-sgeexec-0940

=== 2020-06-24 ===
* 12:36 arturo: live-hacking puppetmaster with exim prometheus stuff ([[phab:T175964|T175964]])
* 11:57 arturo: merging email ratelimiting patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/607320 ([[phab:T175964|T175964]])


=== 2022-05-12 ===
* 12:36 taavi: re-enable CronJobControllerV2 [[phab:T308205|T308205]]
* 09:28 taavi: deploy jobs-api update [[phab:T308204|T308204]]
* 09:15 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|e6fa299}}) ([[phab:T308204|T308204]]) - cookbook ran by taavi@runko

=== 2020-06-23 ===
* 17:55 arturo: killed procs for users `hamishz` and `msyn` which apparently were tools that should be running in the grid / kubernetes instead
* 16:08 arturo: created acme-chief cert `tools_mail` in the prefix hiera


=== 2022-05-10 ===
* 15:18 taavi: depool tools-k8s-worker-42 for experiments
* 13:54 taavi: enable distro-wikimedia unattended upgrades [[phab:T290494|T290494]]

=== 2020-06-17 ===
* 10:40 arturo: created VM tools-legacy-redirector, with the corresponding puppet prefix ([[phab:T247236|T247236]], [[phab:T234617|T234617]])


=== 2022-05-06 ===
* 19:46 bd808: Rebuilt toolforge-perl532-sssd-base & toolforge-perl532-sssd-web to add liblocale-codes-perl ([[phab:T307812|T307812]])

=== 2020-06-16 ===
* 23:01 bd808: Building new Docker images to pick up webservice 0.72
* 22:58 bd808: Deploying webservice 0.72 to bastions and grid
* 22:56 bd808: Building webservice 0.72
* 15:10 arturo: merging a patch with changes to the template for keepalived (used in the elastic cluster) https://gerrit.wikimedia.org/r/c/operations/puppet/+/605898


=== 2022-05-05 ===
* 17:28 taavi: deploy tools-webservice 0.83 [[phab:T307693|T307693]]

=== 2020-06-15 ===
* 21:28 bstorm_: cleaned up killgridjobs.sh on the tools bastions [[phab:T157792|T157792]]
* 18:14 bd808: Rebuilding all Docker images to pick up webservice 0.71 ([[phab:T254640|T254640]], [[phab:T253412|T253412]])
* 18:12 bd808: Deploying webservice 0.71 to bastions and grid via clush
* 18:05 bd808: Building webservice 0.71


=== 2022-05-03 ===
* 08:20 taavi: redis: start replication from the old cluster to the new one ([[phab:T278541|T278541]])

=== 2020-06-12 ===
* 13:13 arturo: live-hacking session in the puppetmaster ended
* 13:10 arturo: live-hacking puppet tree in tools-puppetmaster-02 for testing a PAWS-related patch (they share haproxy puppet code)
* 00:16 bstorm_: remounted NFS for tools-k8s-control-3 and tools-acme-chief-01


=== 2022-05-02 ===
* 08:54 taavi: restart acme-chief.service [[phab:T307333|T307333]]

=== 2020-06-11 ===
* 23:35 bstorm_: rebooting tools-k8s-control-2 because it seems to be confused on NFS, interestingly enough


=== 2022-04-25 ===
* 14:56 bd808: Rebuilding all docker images to pick up toolforge-webservice v0.82 ([[phab:T214343|T214343]])
* 14:46 bd808: Building toolforge-webservice v0.82

=== 2020-06-04 ===
* 13:32 bd808: Manually restored /etc/haproxy/conf.d/elastic.cfg on tools-elastic-*


=== 2022-04-23 ===
* 16:51 bd808: Built new perl532-sssd/<nowiki>{</nowiki>base,web<nowiki>}</nowiki> images and pushed to registry ([[phab:T214343|T214343]])

=== 2020-06-02 ===
* 12:23 arturo: renewed TLS cert for k8s metrics-server ([[phab:T250874|T250874]]) following docs: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Certificates#internal_API_access
* 11:00 arturo: renewed TLS cert for prometheus to contact toolforge k8s ([[phab:T250874|T250874]]) following docs: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Certificates#external_API_access


=== 2022-04-20 ===
* 16:58 taavi: reboot toolserver-proxy-01 to free up disk space from stale file handles(?)
* 07:51 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|8f37a04}}) - cookbook ran by taavi@runko

=== 2020-06-01 ===
* 23:51 bstorm_: refreshed certs for the custom webhook controllers on the k8s cluster [[phab:T250874|T250874]]
* 00:39 bd808: Ugh. Prior SAL message was about tools-sgeexec-0940
* 00:39 bd808: Compressed /var/log/account/pacct.0 ahead of rotation schedule to free some space on the root partition


=== 2022-04-16 ===
* 18:53 wm-bot: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/kubernetes-metrics ({{Gerrit|2c485e9}}) - cookbook ran by taavi@runko

=== 2020-05-29 ===
* 19:37 bstorm_: adding docker image for paws-public docker-registry.tools.wmflabs.org/paws-public-nginx:openresty [[phab:T252217|T252217]]


=== 2022-04-12 ===
* 21:32 bd808: Added komla to Gerrit group 'toollabs-trusted' ([[phab:T305986|T305986]])
* 21:27 bd808: Added komla to 'roots' sudoers policy ([[phab:T305986|T305986]])
* 21:24 bd808: Added komla as projectadmin ([[phab:T305986|T305986]])

=== 2020-05-28 ===
* 21:19 bd808: Killed 7 python processes run by user 'mattho69' on login.toolforge.org
* 21:06 bstorm_: upgrading tools-k8s-worker-[30-60] to kubernetes 1.16.10 [[phab:T246122|T246122]]
* 17:54 bstorm_: upgraded tools-k8s-worker-[11..15] and starting on 21-29 now [[phab:T246122|T246122]]
* 16:01 bstorm_: kubectl upgraded to 1.16.10 on all bastions [[phab:T246122|T246122]]
* 15:58 arturo: upgrading tools-k8s-worker-[1..10] to 1.16.10 ([[phab:T246122|T246122]])
* 15:41 arturo: upgrading tools-k8s-control-3 to 1.16.10 ([[phab:T246122|T246122]])
* 15:17 arturo: upgrading tools-k8s-control-2 to 1.16.10 ([[phab:T246122|T246122]])
* 15:09 arturo: upgrading tools-k8s-control-1 to 1.16.10 ([[phab:T246122|T246122]])
* 14:49 arturo: cleanup /etc/apt/sources.list.d/ directory in all tools-k8s-* VMs
* 11:27 arturo: merging change to front-proxy: https://gerrit.wikimedia.org/r/c/operations/puppet/+/599139 ([[phab:T253816|T253816]])


=== 2022-04-10 ===
* 18:43 taavi: deleted `/tmp/dwl02.out-20210915` on tools-sgebastion-07 (not touched since September, taking up 1.3G of disk space)

=== 2020-05-27 ===
* 17:23 bstorm_: deleting "tools-k8s-worker-20", "tools-k8s-worker-19", "tools-k8s-worker-18", "tools-k8s-worker-17", "tools-k8s-worker-16"


=== 2022-04-09 ===
* 15:30 taavi: manually prune user.log on tools-prometheus-03 to free up some space on /

=== 2020-05-26 ===
* 18:45 bstorm_: upgrading maintain-kubeusers to match what is in toolsbeta [[phab:T246059|T246059]] [[phab:T211096|T211096]]
* 16:20 bstorm_: fix incorrect volume name in kubeadm-config configmap [[phab:T246122|T246122]]


=== 2022-04-08 ===
* 10:44 arturo: disabled debug mode on the k8s jobs-emailer component

=== 2020-05-22 ===
* 20:00 bstorm_: rebooted tools-sgebastion-07 to clear up tmp file problems, with 10 min warning
* 19:12 bstorm_: running command to delete over 2000 tmp ca certs on tools-sgebastion-07 [[phab:T253412|T253412]]


=== 2022-04-05 ===
* 07:52 wm-bot: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d7d3463}}) - cookbook ran by arturo@nostromo
* 07:44 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d7d3463}}) - cookbook ran by arturo@nostromo
* 07:21 arturo: deploying toolforge-jobs-framework-cli v7

=== 2020-05-21 ===
* 22:40 bd808: Rebuilding all Docker containers for tools-webservice 0.70 ([[phab:T252700|T252700]])
* 22:36 bd808: Updated tools-webservice to 0.70 across instances ([[phab:T252700|T252700]])
* 22:29 bd808: Building tools-webservice 0.70 via wmcs-package-build.py


=== 2022-04-04 ===
* 17:05 wm-bot: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|cbcfc47}}) - cookbook ran by arturo@nostromo
* 16:56 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|cbcfc47}}) - cookbook ran by arturo@nostromo
* 09:28 arturo: deployed toolforge-jobs-framework-cli v6 into aptly and installed it on buster bastions

=== 2020-05-20 ===
* 09:59 arturo: now running tesseract-ocr v4.1.1-2~bpo9+1 in the Toolforge grid ([[phab:T247422|T247422]])
* 09:50 arturo: `aborrero@cloud-cumin-01:~$ sudo cumin --force -x 'O<nowiki>{</nowiki>project:tools name:tools-sge[bcew].*<nowiki>}</nowiki>' 'apt-get install tesseract-ocr -t stretch-backports -y'` ([[phab:T247422|T247422]])
* 09:35 arturo: `aborrero@cloud-cumin-01:~$ sudo cumin --force -x 'O<nowiki>{</nowiki>project:tools name:tools-sge[bcew].*<nowiki>}</nowiki>' 'rm /etc/apt/sources.lists.d/kubeadm-k8s-component-repo.list ; rm /etc/apt/sources.list.d/repository_thirdparty-kubeadm-k8s-1-15.list ; run-puppet-agent'` ([[phab:T247422|T247422]])
* 09:23 arturo: `aborrero@cloud-cumin-01:~$ sudo cumin --force -x 'O<nowiki>{</nowiki>project:tools name:tools-sge[bcew].*<nowiki>}</nowiki>' 'rm /etc/apt/preferences.d/* ; run-puppet-agent'` ([[phab:T247422|T247422]])


=== 2022-03-28 ===
* 09:32 wm-bot: cleaned up grid queue errors on tools-sgegrid-master.tools.eqiad1.wikimedia.cloud ([[phab:T304816|T304816]]) - cookbook ran by arturo@nostromo

=== 2020-05-19 ===
* 17:00 bstorm_: deleting/restarting the paws db-proxy pod because it cannot connect to the replicas... and I'm hoping that's due to depooling and such


=== 2022-03-15 ===
* 16:57 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-emailer:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-emailer ({{Gerrit|084ee51}}) - cookbook ran by arturo@nostromo
* 11:24 arturo: cleared error state on queue continuous@tools-sgeexec-0939.tools.eqiad.wmflabs (a job took a very long time to be scheduled...)

=== 2020-05-13 ===
* 18:14 bstorm_: upgrading calico to 3.14.0 with typha enabled in Toolforge K8s [[phab:T250863|T250863]]
* 18:10 bstorm_: set "profile::toolforge::k8s::typha_enabled: true" in tools project for calico upgrade [[phab:T250863|T250863]]


=== 2022-03-14 ===
* 11:44 arturo: deploy jobs-framework-emailer {{Gerrit|9470a5f339fd5a44c97c69ce97239aef30f5ee41}} ([[phab:T286135|T286135]])
* 10:48 dcaro: pushed v0.33.2 tekton control and webhook images, and bash 5.1.4, to the local repo ([[phab:T297090|T297090]])

=== 2020-05-09 ===
* 00:28 bstorm_: added nfs.* to ignored_fs_types for the prometheus::node_exporter params in project hiera [[phab:T252260|T252260]]


=== 2022-03-10 ===
* 09:42 arturo: cleaned grid queue error state @ tools-sgewebgrid-generic-0902

=== 2020-05-08 ===
* 18:17 bd808: Building all jessie-sssd derived images ([[phab:T197930|T197930]])
* 17:29 bd808: Building new jessie-sssd base image ([[phab:T197930|T197930]])


=== 2022-03-01 ===
* 13:41 dcaro: rebooting tools-sgeexec-0916 to clear any state ([[phab:T302702|T302702]])
* 12:11 dcaro: Cleared error state queues for sgeexec-0916 ([[phab:T302702|T302702]])
* 10:23 arturo: tools-sgeexec-0913/0916 are depooled, queue errors. Reboot them and clean errors by hand

=== 2020-05-07 ===
* 21:51 bstorm_: rebuilding the docker images for Toolforge k8s
* 19:03 bstorm_: toollabs-webservice 0.69 now pushed to the Toolforge bastions
* 18:57 bstorm_: pushing new toollabs-webservice package v0.69 to the tools repos


=== 2020-05-06 ===
=== 2022-02-28 ===
* 21:20 bd808: Kubectl delete node tools-k8s-worker-[16-20] ([[phab:T248702|T248702]])
* 08:02 taavi: reboot sgeexec-0916
* 18:24 bd808: Updated "profile::toolforge::k8s::worker_nodes" list in "tools-k8s-haproxy" prefix puppet ([[phab:T248702|T248702]])
* 07:49 taavi: depool tools-sgeexec-0916.tools as it is out of disk space on /
* 18:14 bd808: Shutdown tools-k8s-worker-[16-20] instances ([[phab:T248702|T248702]])
* 18:04 bd808: Draining tools-k8s-worker-[16-20] in preparation for decomm ([[phab:T248702|T248702]])
* 17:56 bd808: Cordoned tools-k8s-worker-[16-20] in preparation for decomm ([[phab:T248702|T248702]])
* 00:01 bd808: Joining tools-k8s-worker-60 to the k8s worker pool
* 00:00 bd808: Joining tools-k8s-worker-59 to the k8s worker pool


=== 2020-05-05 ===
=== 2022-02-17 ===
* 23:58 bd808: Joining tools-k8s-worker-58 to the k8s worker pool
* 08:23 taavi: deleted tools-clushmaster-02
* 23:55 bd808: Joining tools-k8s-worker-57 to the k8s worker pool
* 08:14 taavi: made tools-puppetmaster-02 its own client to fix `puppet node deactivate` puppetdb access
* 23:53 bd808: Joining tools-k8s-worker-56 to the k8s worker pool
* 21:51 bd808: Building 5 new k8s worker nodes ([[phab:T248702|T248702]])


=== 2020-05-04 ===
=== 2022-02-16 ===
* 22:08 bstorm_: deleting tools-elastic-01/2/3 [[phab:T236606|T236606]]
* 00:12 bd808: Image builds completed.
* 16:46 arturo: removing the now unused `/etc/apt/preferences.d/toolforge_k8s_kubeadmrepo*` files ([[phab:T250866|T250866]])
* 16:43 arturo: removing the now unused `/etc/apt/sources.list.d/toolforge-k8s-kubeadmrepo.list` file ([[phab:T250866|T250866]])


=== 2020-04-29 ===
=== 2022-02-15 ===
* 22:13 bstorm_: running a fixup script after fixing a bug [[phab:T247455|T247455]]
* 23:17 bd808: Image builds failed in buster php image with an apt error. The error looks transient, so starting builds over.
* 21:28 bstorm_: running the rewrite-psp-preset.sh script across all tools [[phab:T247455|T247455]]
* 23:06 bd808: Started full rebuild of Toolforge containers to pick up webservice 0.81 and other package updates in tmux session on tools-docker-imagebuilder-01
* 16:54 bstorm_: deleted the maintain-kubeusers pod to start running the new image [[phab:T247455|T247455]]
* 22:58 bd808: `sudo apt-get update && sudo apt-get install toolforge-webservice` on all bastions to pick up 0.81
* 16:52 bstorm_: tagged docker-registry.tools.wmflabs.org/maintain-kubeusers:beta to latest to deploy to toolforge [[phab:T247455|T247455]]
* 22:50 bd808: Built new toollabs-webservice 0.81
* 18:43 bd808: Enabled puppet on tools-proxy-05
* 18:38 bd808: Disabled puppet on tools-proxy-05 for manual testing of nginx config changes
* 18:21 taavi: delete tools-package-builder-03
* 11:49 arturo: invalidate sssd cache in all bastions to debug [[phab:T301736|T301736]]
* 11:16 arturo: purge debian package `unscd` on tools-sgebastion-10/11 for [[phab:T301736|T301736]]
* 11:15 arturo: reboot tools-sgebastion-10 for [[phab:T301736|T301736]]


=== 2020-04-28 ===
=== 2022-02-10 ===
* 22:58 bstorm_: rebuilding docker-registry.tools.wmflabs.org/maintain-kubeusers:beta [[phab:T247455|T247455]]
* 15:07 taavi: shutdown tools-clushmaster-02 [[phab:T298191|T298191]]
* 13:25 wm-bot: trying to join node tools-sgewebgen-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:24 wm-bot: trying to join node tools-sgewebgen-10-1 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:07 wm-bot: trying to join node tools-sgeweblight-10-5 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:06 wm-bot: trying to join node tools-sgeweblight-10-4 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:05 wm-bot: trying to join node tools-sgeweblight-10-3 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:03 wm-bot: trying to join node tools-sgeweblight-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 12:54 wm-bot: trying to join node tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 08:45 taavi: set `profile::base::manage_ssh_keys: true` globally [[phab:T214427|T214427]]
* 08:16 taavi: enable puppetdb and re-enable puppet with puppetdb ssh key management disabled (profile::base::manage_ssh_keys: false) - [[phab:T214427|T214427]]
* 08:06 taavi: disable puppet globally for enabling puppetdb [[phab:T214427|T214427]]


=== 2020-04-23 ===
=== 2022-02-09 ===
* 19:22 bd808: Increased Kubernetes services quota for bd808-test tool.
* 19:29 taavi: installed tools-puppetdb-1, not configured on puppetmaster side yet [[phab:T214427|T214427]]
* 18:56 wm-bot: pooled 10 grid nodes tools-sgeweblight-10-[1-5],tools-sgewebgen-10-[1,2],tools-sgeexec-10-[1-10] ([[phab:T277653|T277653]]) - cookbook ran by arturo@nostromo
* 18:30 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo
* 18:25 arturo: ignore last message
* 18:24 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo
* 14:04 taavi: created tools-cumin-1/toolsbeta-cumin-1 [[phab:T298191|T298191]]


=== 2020-04-21 ===
=== 2022-02-07 ===
* 23:06 bstorm_: repooled tools-k8s-worker-38/52, tools-sgewebgrid-lighttpd-0918/9 and tools-sgeexec-0901 [[phab:T250869|T250869]]
* 17:37 taavi: generated authdns_acmechief ssh key and stored password in a text file in local labs/private repository ([[phab:T288406|T288406]])
* 22:09 bstorm_: depooling tools-sgewebgrid-lighttpd-0918/9 and tools-sgeexec-0901 [[phab:T250869|T250869]]
* 12:52 taavi: updated maintain-kubeusers for [[phab:T301081|T301081]]
* 22:02 bstorm_: draining tools-k8s-worker-38 and tools-k8s-worker-52 as they are on the crashed host [[phab:T250869|T250869]]


=== 2020-04-20 ===
=== 2022-02-04 ===
* 15:31 bd808: Rebuilding Docker containers to pick up tools-webservice v0.68 ([[phab:T250625|T250625]])
* 22:33 taavi: `root@tools-sgebastion-10:/data/project/ru_monuments/.kube# mv config old_config` # experimenting with [[phab:T301015|T301015]]
* 14:47 arturo: added joakino to tools.admin LDAP group
* 21:36 taavi: clear error state from some webgrid nodes
* 13:28 jeh: shutdown elasticsearch v5 cluster running Jessie [[phab:T236606|T236606]]
* 12:46 arturo: uploading tools-webservice v0.68 to aptly stretch-tools and update it on relevant servers ([[phab:T250625|T250625]])
* 12:06 arturo: uploaded tools-webservice v0.68 to stretch-toolsbeta for testing
* 11:59 arturo: `root@tools-sge-services-03:~# aptly db cleanup` removed 340 unreferenced packages, and 2 unreferenced files


=== 2020-04-15 ===
=== 2022-02-03 ===
* 23:20 bd808: Building ruby25-sssd/base and children ([[phab:T141388|T141388]], [[phab:T250118|T250118]])
* 09:06 taavi: run `sudo apt-get clean` on login-buster/dev-buster to clean up disk space
* 20:09 jeh: update default security group to allow prometheus01.metricsinfra.eqiad.wmflabs TCP 9100 [[phab:T250206|T250206]]
* 08:01 taavi: restart acme-chief to force renewal of toolserver.org certificate


=== 2020-04-14 ===
=== 2022-01-30 ===
* 18:26 bstorm_: Deployed new code and RBAC for maintain-kubeusers [[phab:T246123|T246123]]
* 14:41 taavi: created a neutron port with ip 172.16.2.46 for a service ip for toolforge redis automatic failover [[phab:T278541|T278541]]
* 18:19 bstorm_: updating the maintain-kubeusers:latest image [[phab:T246123|T246123]]
* 14:22 taavi: creating a cluster of 3 bullseye redis hosts for [[phab:T278541|T278541]]
* 17:32 bstorm_: updating the maintain-kubeusers:beta image on tools-docker-imagebuilder-01 [[phab:T246123|T246123]]


=== 2020-04-10 ===
=== 2022-01-26 ===
* 21:33 bd808: Rebuilding all Docker images for the Kubernetes cluster ([[phab:T249843|T249843]])
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-10 - cookbook ran by arturo@nostromo
* 19:36 bstorm_: after testing deploying toollabs-webservice 0.67 to tools repos [[phab:T249843|T249843]]
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-9 - cookbook ran by arturo@nostromo
* 14:53 arturo: live-hacking tools-puppetmaster-02 with https://gerrit.wikimedia.org/r/c/operations/puppet/+/587991 for [[phab:T249837|T249837]]
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-8 - cookbook ran by arturo@nostromo
* 18:32 wm-bot: depooled grid node tools-sgeexec-10-7 - cookbook ran by arturo@nostromo
* 18:32 wm-bot: depooled grid node tools-sgeexec-10-6 - cookbook ran by arturo@nostromo
* 18:31 wm-bot: depooled grid node tools-sgeexec-10-5 - cookbook ran by arturo@nostromo
* 18:30 wm-bot: depooled grid node tools-sgeexec-10-4 - cookbook ran by arturo@nostromo
* 18:28 wm-bot: depooled grid node tools-sgeexec-10-3 - cookbook ran by arturo@nostromo
* 18:27 wm-bot: depooled grid node tools-sgeexec-10-2 - cookbook ran by arturo@nostromo
* 18:27 wm-bot: depooled grid node tools-sgeexec-10-1 - cookbook ran by arturo@nostromo
* 13:55 arturo: scaling up the buster web grid with 5 lighttd and 2 generic nodes ([[phab:T277653|T277653]])


=== 2020-04-09 ===
=== 2022-01-25 ===
* 15:13 bd808: Rebuilding all stretch and buster Docker images. Jessie is broken at the moment due to package version mismatches
* 11:50 wm-bot: reconfiguring the grid by using grid-configurator - cookbook ran by arturo@nostromo
* 11:18 arturo: bump nproc limit in bastions https://gerrit.wikimedia.org/r/c/operations/puppet/+/587715 ([[phab:T219070|T219070]])
* 11:44 arturo: rebooting buster exec nodes
* 04:29 bd808: Running rebuild_all for Docker images to pick up toollabs-webservice v0.66 [try #2] ([[phab:T154504|T154504]], [[phab:T234617|T234617]])
* 08:34 taavi: sign puppet certificate for tools-sgeexec-10-4
* 04:19 bd808: python3 build.py --image-prefix toolforge --tag latest --no-cache --push --single jessie-sssd
* 00:20 bd808: Docker rebuild failed in toolforge-python2-sssd-base: "zlib1g-dev : Depends: zlib1g (= 1:1.2.8.dfsg-2+b1) but 1:1.2.8.dfsg-2+deb8u1 is to be installed"


=== 2020-04-08 ===
=== 2022-01-24 ===
* 23:49 bd808: Running rebuild_all for Docker images to pick up toollabs-webservice v0.66 ([[phab:T154504|T154504]], [[phab:T234617|T234617]])
* 17:44 wm-bot: reconfiguring the grid by using grid-configurator - cookbook ran by arturo@nostromo
* 23:35 bstorm_: deploy toollabs-webservice v0.66 [[phab:T154504|T154504]] [[phab:T234617|T234617]]
* 15:23 arturo: scaling up the grid with 10 buster exec nodes ([[phab:T277653|T277653]])


=== 2020-04-07 ===
=== 2022-01-20 ===
* 20:06 andrewbogott: sss_cache -E on tools-sgebastion-08 and  tools-sgebastion-09
* 17:05 arturo: drop 9 of the 10 buster exec nodes created earlier. They didn't get DNS records
* 20:00 andrewbogott: sss_cache -E on tools-sgebastion-07
* 12:56 arturo: scaling up the grid with 10 buster exec nodes ([[phab:T277653|T277653]])


=== 2020-04-06 ===
=== 2022-01-19 ===
* 19:16 bstorm_: deleted tools-redis-1001/2 [[phab:T248929|T248929]]
* 17:34 andrewbogott: rebooting tools-sgeexec-0913.tools.eqiad1.wikimedia.cloud to recover from (presumed) fallout from the scratch/nfs move


=== 2020-04-03 ===
=== 2022-01-14 ===
* 22:40 bstorm_: shut down tools-redis-1001/2 [[phab:T248929|T248929]]
* 19:09 taavi: set /var/run/lighttpd as world-writable on all lighttpd webgrid nodes, [[phab:T299243|T299243]]
* 22:32 bstorm_: switch tools-redis-1003 to the active redis server [[phab:T248929|T248929]]
* 20:41 bstorm_: deleting tools-redis-1003/4 to attach them to an anti-affinity group [[phab:T248929|T248929]]
* 18:53 bstorm_: spin up tools-redis-1004 on stretch and connect to cluster [[phab:T248929|T248929]]
* 18:23 bstorm_: spin up tools-redis-1003 on stretch and connect to the cluster [[phab:T248929|T248929]]
* 16:50 bstorm_: launching tools-redis-03 (Buster) to see what happens


=== 2020-03-30 ===
=== 2022-01-12 ===
* 18:28 bstorm_: Beginning rolling depool, remount, repool of k8s workers for [[phab:T248702|T248702]]
* 11:27 arturo: created puppet prefix `tools-sgeweblight`, drop `tools-sgeweblig`
* 18:22 bstorm_: disabled puppet across tools-k8s-worker-[1-55].tools.eqiad.wmflabs [[phab:T248702|T248702]]
* 11:03 arturo: created puppet prefix 'tools-sgeweblig'
* 16:56 arturo: dropping `_psl.toolforge.org` TXT record ([[phab:T168677|T168677]])
* 11:02 arturo: created puppet prefix 'toolsbeta-sgeweblig'


=== 2020-03-27 ===
=== 2022-01-04 ===
* 21:22 bstorm_: removed puppet prefix tools-docker-builder [[phab:T248703|T248703]]
* 17:18 bd808: tools-acme-chief-01: sudo service acme-chief restart
* 21:15 bstorm_: deleted tools-docker-builder-06 [[phab:T248703|T248703]]
* 08:12 taavi: disable puppet & exim4 on [[phab:T298501|T298501]]
* 18:55 bstorm_: launching tools-docker-imagebuilder-01 [[phab:T248703|T248703]]
* 12:52 arturo: install python3-pykube on tools-k8s-control-3 for some tests interaction with the API from python
 
=== 2020-03-24 ===
* 11:44 arturo: trying to solve a rebase/merge conflict in labs/private.git in tools-puppetmaster-02
* 11:33 arturo: merging tools-proxy patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/579952/ ([[phab:T234617|T234617]]) (second try with some additional bits in LUA)
* 10:16 arturo: merging tools-proxy patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/579952/ ([[phab:T234617|T234617]])
 
=== 2020-03-18 ===
* 19:07 bstorm_: removed role::toollabs::logging::sender from project puppet (it wouldn't work anyway)
* 18:04 bstorm_: removed puppet prefix tools-flannel-etcd [[phab:T246689|T246689]]
* 17:58 bstorm_: removed puppet prefix tools-worker [[phab:T246689|T246689]]
* 17:57 bstorm_: removed puppet prefix tools-k8s-master [[phab:T246689|T246689]]
* 17:36 bstorm_: removed lots of deprecated hiera keys from horizon for the old cluster [[phab:T246689|T246689]]
* 16:59 bstorm_: deleting "tools-worker-1002", "tools-worker-1001", "tools-k8s-master-01", "tools-flannel-etcd-03", "tools-k8s-etcd-03", "tools-flannel-etcd-02", "tools-k8s-etcd-02", "tools-flannel-etcd-01", "tools-k8s-etcd-01" [[phab:T246689|T246689]]
 
=== 2020-03-17 ===
* 13:29 arturo: set `profile::toolforge::bastion::nproc: 200` for tools-sgebastion-08 ([[phab:T219070|T219070]])
* 00:08 bstorm_: shut off tools-flannel-etcd-01/02/03 [[phab:T246689|T246689]]
 
=== 2020-03-16 ===
* 22:01 bstorm_: shut off tools-k8s-etcd-01/02/03 [[phab:T246689|T246689]]
* 22:00 bstorm_: shut off tools-k8s-master-01 [[phab:T246689|T246689]]
* 21:59 bstorm_: shut down tools-worker-1001 and tools-worker-1002 [[phab:T246689|T246689]]
 
=== 2020-03-11 ===
* 17:00 jeh: clean up apt cache on tools-sgebastion-07
 
=== 2020-03-06 ===
* 16:25 bstorm_: updating maintain-kubeusers image to filter invalid tool names
 
=== 2020-03-03 ===
* 18:16 jeh: create OpenStack DNS record for elasticsearch.svc.tools.eqiad1.wikimedia.cloud (eqiad1 subdomain change) [[phab:T236606|T236606]]
* 18:02 jeh: create OpenStack DNS record for elasticsearch.svc.tools.eqiad.wikimedia.cloud [[phab:T236606|T236606]]
* 17:31 jeh: create a OpenStack virtual ip address for the new elasticsearch cluster [[phab:T236606|T236606]]
* 10:54 arturo: deleted VMs `tools-worker-[1003-1020]` (legacy k8s cluster) ([[phab:T246689|T246689]])
* 10:51 arturo: cordoned/drained all legacy k8s worker nodes except 1001/1002 ([[phab:T246689|T246689]])
 
=== 2020-03-02 ===
* 22:26 jeh: starting first pass of elasticsearch data migration to new cluster [[phab:T236606|T236606]]
 
=== 2020-03-01 ===
* 01:48 bstorm_: old version of kubectl removed. Anyone who needs it can download it with `curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.4.12/bin/linux/amd64/kubectl`
* 01:27 bstorm_: running the force-migrate command to make sure any new kubernetes deployments are on the new cluster.
 
=== 2020-02-28 ===
* 22:14 bstorm_: shutting down the old maintain-kubeusers and taking the gloves off the new one (removing --gentle-mode)
* 16:51 bstorm_: node/tools-k8s-worker-15 uncordoned
* 16:44 bstorm_: drained tools-k8s-worker-15 and hard rebooting it because it wasn't happy
* 16:36 bstorm_: rebooting k8s workers 1-35 on the 2020 cluster to clear a strange nologin condition that has been there since the NFS maintenance
* 16:14 bstorm_: rebooted tools-k8s-worker-7 to clear some puppet issues
* 16:00 bd808: Devoicing stashbot in #wikimedia-cloud to reduce irc spam while migrating tools to 2020 Kubernetes cluster
* 15:28 jeh: create OpenStack server group tools-elastic with anti-affinty policy enabled [[phab:T236606|T236606]]
* 15:09 jeh: create 3 new elasticsearch VMs tools-elastic-[1,2,3] [[phab:T236606|T236606]]
* 14:20 jeh: create new puppet prefixes for existing (no change in data) and new elasticsearch VMs
* 04:35 bd808: Joined tools-k8s-worker-54 to 2020 Kubernetes cluster
* 04:34 bd808: Joined tools-k8s-worker-53 to 2020 Kubernetes cluster
* 04:32 bd808: Joined tools-k8s-worker-52 to 2020 Kubernetes cluster
* 04:31 bd808: Joined tools-k8s-worker-51 to 2020 Kubernetes cluster
* 04:28 bd808: Joined tools-k8s-worker-50 to 2020 Kubernetes cluster
* 04:24 bd808: Joined tools-k8s-worker-49 to 2020 Kubernetes cluster
* 04:23 bd808: Joined tools-k8s-worker-48 to 2020 Kubernetes cluster
* 04:21 bd808: Joined tools-k8s-worker-47 to 2020 Kubernetes cluster
* 04:21 bd808: Joined tools-k8s-worker-46 to 2020 Kubernetes cluster
* 04:19 bd808: Joined tools-k8s-worker-45 to 2020 Kubernetes cluster
* 04:14 bd808: Joined tools-k8s-worker-44 to 2020 Kubernetes cluster
* 04:13 bd808: Joined tools-k8s-worker-43 to 2020 Kubernetes cluster
* 04:12 bd808: Joined tools-k8s-worker-42 to 2020 Kubernetes cluster
* 04:10 bd808: Joined tools-k8s-worker-41 to 2020 Kubernetes cluster
* 04:09 bd808: Joined tools-k8s-worker-40 to 2020 Kubernetes cluster
* 04:08 bd808: Joined tools-k8s-worker-39 to 2020 Kubernetes cluster
* 04:07 bd808: Joined tools-k8s-worker-38 to 2020 Kubernetes cluster
* 04:06 bd808: Joined tools-k8s-worker-37 to 2020 Kubernetes cluster
* 03:49 bd808: Joined tools-k8s-worker-36 to 2020 Kubernetes cluster
* 00:50 bstorm_: rebuilt all docker images to include webservice 0.64
 
=== 2020-02-27 ===
* 23:27 bstorm_: installed toollabs-webservice 0.64 on the bastions
* 23:24 bstorm_: pushed toollabs-webservice version 0.64 to all toolforge repos
* 21:03 jeh: add reindex service account to elasticsearch for data migration [[phab:T236606|T236606]]
* 20:57 bstorm_: upgrading toollabs-webservice to stretch-toolsbeta version for jdk8:testing image only
* 20:19 jeh: update elasticsearch VPS security group to allow toolsbeta-elastic7-1 access on tcp 80 [[phab:T236606|T236606]]
* 18:53 bstorm_: hard rebooted a rather stuck tools-sgecron-01
* 18:20 bd808: Building tools-k8s-worker-[36-55]
* 17:56 bd808: Deleted instances tools-worker-10[21-40]
* 16:14 bd808: Decommissioning tools-worker-10[21-40]
* 16:02 bd808: Drained tools-worker-1021
* 15:51 bd808: Drained tools-worker-1022
* 15:44 bd808: Drained tools-worker-1023 (there is no tools-worker-1024)
* 15:39 bd808: Drained tools-worker-1025
* 15:39 bd808: Drained tools-worker-1026
* 15:11 bd808: Drained tools-worker-1027
* 15:09 bd808: Drained tools-worker-1028 (there is no tools-worker-1029)
* 15:07 bd808: Drained tools-worker-1030
* 15:06 bd808: Uncordoned tools-worker-10[16-20]. Was overly optimistic about repacking the legacy Kubernetes cluster into 15 instances. Will keep 20 for now.
* 15:00 bd808: Drained tools-worker-1031
* 14:54 bd808: Hard reboot tools-worker-1016. Direct virsh console unresponsive. Stuck in shutdown since 2020-01-22?
* 14:44 bd808: Uncordoned tools-worker-1009.tools.eqiad.wmflabs
* 14:41 bd808: Drained tools-worker-1032
* 14:37 bd808: Drained tools-worker-1033
* 14:35 bd808: Drained tools-worker-1034
* 14:34 bd808: Drained tools-worker-1035
* 14:33 bd808: Drained tools-worker-1036
* 14:33 bd808: Drained tools-worker-10<nowiki>{</nowiki>39,38,37<nowiki>}</nowiki> yesterday but did not !log
* 00:29 bd808: Drained tools-worker-1009 for reboot (NFS flakey)
* 00:11 bd808: Uncordoned tools-worker-1009.tools.eqiad.wmflabs
* 00:08 bd808: Uncordoned tools-worker-1002.tools.eqiad.wmflabs
* 00:02 bd808: Rebooting tools-worker-1002
* 00:00 bd808: Draining tools-worker-1002 to reboot for NFS problems
 
=== 2020-02-26 ===
* 23:42 bd808: Drained tools-worker-1040
* 23:41 bd808: Cordoned tools-worker-10[16-40] in preparation for shrinking legacy Kubernetes cluster
* 23:12 bstorm_: replacing all tool limit-ranges in the 2020 cluster with a lower cpu request version
* 22:29 bstorm_: deleted pod maintain-kubeusers-6d9c45f4bc-5bqq5 to deploy new image
* 21:06 bstorm_: deleting loads of stuck grid jobs
* 20:27 jeh: rebooting tools-worker-[1008,1015,1021]
* 20:15 bstorm_: rebooting tools-sgegrid-master because it actually had the permissions thing going on still
* 18:03 bstorm_: downtimed toolschecker for nfs maintenance
 
=== 2020-02-25 ===
* 15:31 bd808: `wmcs-k8s-enable-cluster-monitor toolschecker`
 
=== 2020-02-23 ===
* 00:40 Krenair: [[phab:T245932|T245932]]
 
=== 2020-02-21 ===
* 16:02 andrewbogott: moving tools-sgecron-01 to cloudvirt1022
 
=== 2020-02-20 ===
* 14:49 andrewbogott: moving tools-k8s-worker-19 and tools-k8s-worker-18 to cloudvirt1022 (as part of draining 1014)
* 00:04 Krenair: Shut off tools-puppetmaster-01 - to be deleted in one week [[phab:T245365|T245365]]
 
=== 2020-02-19 ===
* 22:05 Krenair: Project-wide hiera change to swap puppetmaster to tools-puppetmaster-02 [[phab:T245365|T245365]]
* 15:36 bstorm_: setting 'puppetmaster: tools-puppetmaster-02.tools.eqiad.wmflabs' on tools-sgeexec-0942 to test new puppetmaster on grid [[phab:T245365|T245365]]
* 11:50 arturo: fix invalid yaml format in horizon puppet prefix 'tools-k8s-haproxy' that prevented clean puppet run in the VMs
* 00:59 bd808: Live hacked the "nginx-configuration" ConfigMap for [[phab:T245426|T245426]] (done several hours ago, but I forgot to !log it)
 
=== 2020-02-18 ===
* 23:26 bstorm_: added tools-sgegrid-master.tools.eqiad1.wikimedia.cloud and tools-sgegrid-shadow.tools.eqiad1.wikimedia.cloud to gridengine admin host lists
* 09:50 arturo: temporarily delete DNS zone tools.wmcloud.org to try re-creating it
 
=== 2020-02-17 ===
* 18:53 arturo: [[phab:T168677|T168677]] created DNS TXT record _psl.toolforge.org. with value `https://github.com/publicsuffix/list/pull/970`
* 13:22 arturo: relocating tools-sgewebgrid-lighttpd-0914 to cloudvirt1012 to spread same VMs across different hypervisors
 
=== 2020-02-14 ===
* 00:38 bd808: Added tools-k8s-worker-35 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:34 bd808: Added tools-k8s-worker-34 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:32 bd808: Added tools-k8s-worker-33 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:29 bd808: Added tools-k8s-worker-32 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:25 bd808: Added tools-k8s-worker-31 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:25 bd808: Added tools-k8s-worker-30 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:17 bd808: Added tools-k8s-worker-29 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:15 bd808: Added tools-k8s-worker-28 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:13 bd808: Added tools-k8s-worker-27 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:07 bd808: Added tools-k8s-worker-26 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:03 bd808: Added tools-k8s-worker-25 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
 
=== 2020-02-13 ===
* 23:53 bd808: Added tools-k8s-worker-24 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 23:50 bd808: Added tools-k8s-worker-23 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 23:38 bd808: Added tools-k8s-worker-22 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 21:35 bd808: Deleted tools-sgewebgrid-lighttpd-092<nowiki>{</nowiki>1,2,3,4,5,6,7,8<nowiki>}</nowiki> & tools-sgewebgrid-generic-090<nowiki>{</nowiki>3,4<nowiki>}</nowiki> ([[phab:T244791|T244791]])
* 21:33 bd808: Removed tools-sgewebgrid-lighttpd-092<nowiki>{</nowiki>1,2,3,4,5,6,7,8<nowiki>}</nowiki> & tools-sgewebgrid-generic-090<nowiki>{</nowiki>3,4<nowiki>}</nowiki> from grid engine config ([[phab:T244791|T244791]])
* 17:43 andrewbogott: migrating b24e29d7-a468-4882-9652-{{Gerrit|9863c8acfb88}} to cloudvirt1022
 
=== 2020-02-12 ===
* 19:29 bd808: Rebuilding all Docker images to pick up toollabs-webservice (0.63) ([[phab:T244954|T244954]])
* 19:15 bd808: Deployed toollabs-webservice (0.63) on bastions ([[phab:T244954|T244954]])
* 00:20 bd808: Depooling tools-sgewebgrid-generic-0903 ([[phab:T244791|T244791]])
* 00:19 bd808: Depooling tools-sgewebgrid-generic-0904 ([[phab:T244791|T244791]])
* 00:14 bd808: Depooling tools-sgewebgrid-lighttpd-0921 ([[phab:T244791|T244791]])
* 00:09 bd808: Depooling tools-sgewebgrid-lighttpd-0922 ([[phab:T244791|T244791]])
* 00:05 bd808: Depooling tools-sgewebgrid-lighttpd-0923 ([[phab:T244791|T244791]])
* 00:05 bd808: Depooling tools-sgewebgrid-lighttpd-0924 ([[phab:T244791|T244791]])
 
=== 2020-02-11 ===
* 23:58 bd808: Depooling tools-sgewebgrid-lighttpd-0925 ([[phab:T244791|T244791]])
* 23:56 bd808: Depooling tools-sgewebgrid-lighttpd-0926 ([[phab:T244791|T244791]])
* 23:38 bd808: Depooling tools-sgewebgrid-lighttpd-0927 ([[phab:T244791|T244791]])
 
=== 2020-02-10 ===
* 23:39 bstorm_: updated tools-manifest to 0.21 on aptly for stretch
* 22:51 bstorm_: all docker images now use webservice 0.62
* 22:01 bd808: Manually starting webservices for tools that were running on tools-sgewebgrid-lighttpd-0928 ([[phab:T244791|T244791]])
* 21:47 bd808: Depooling tools-sgewebgrid-lighttpd-0928 ([[phab:T244791|T244791]])
* 21:25 bstorm_: upgraded toollabs-webservice package for tools to 0.62 [[phab:T244293|T244293]] [[phab:T244289|T244289]] [[phab:T234617|T234617]] [[phab:T156626|T156626]]
 
=== 2020-02-07 ===
* 10:55 arturo: drop jessie VM instances tools-prometheus-<nowiki>{</nowiki>01,02<nowiki>}</nowiki> which were shutdown ([[phab:T238096|T238096]])
 
=== 2020-02-06 ===
* 10:44 arturo: merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/565556 which is a behavior change to the Toolforge front proxy ([[phab:T234617|T234617]])
* 10:27 arturo: shutdown again tools-prometheus-01, no longer in use ([[phab:T238096|T238096]])
* 05:07 andrewbogott: cleared out old /tmp and /var/log files on tools-sgebastion-07
 
=== 2020-02-05 ===
* 11:22 arturo: restarting ferm fleet-wide to account for prometheus servers changed IP (but same hostname) ([[phab:T238096|T238096]])
 
=== 2020-02-04 ===
* 11:38 arturo: start again tools-prometheus-01 again to sync data to the new tools-prometheus-03/04 VMs ([[phab:T238096|T238096]])
* 11:37 arturo: re-create tools-prometheus-03/04 as 'bigdisk2' instances (300GB) [[phab:T238096|T238096]]
 
=== 2020-02-03 ===
* 14:12 arturo: move tools-prometheus-04 from cloudvirt1022 to cloudvirt1013
* 12:48 arturo: shutdown tools-prometheus-01 and tools-prometheus-02, after fixing the proxy `tools-prometheus.wmflabs.org` to tools-prometheus-03, data synced ([[phab:T238096|T238096]])
* 09:38 arturo: tools-prometheus-01: systemctl stop prometheus@tools. Another try to migrate data to tools-prometheus-<nowiki>{</nowiki>03,04<nowiki>}</nowiki> ([[phab:T238096|T238096]])
 
=== 2020-01-31 ===
* 14:06 arturo: leave tools-prometheus-01 as the backend for tools-prometheus.wmflabs.org for the weekend so grafana dashboards keep working ([[phab:T238096|T238096]])
* 14:00 arturo: syncing again prometheus data from tools-prometheus-01 to tools-prometheus-0<nowiki>{</nowiki>3,4<nowiki>}</nowiki> due to some inconsistencies preventing prometheus from starting ([[phab:T238096|T238096]])
 
=== 2020-01-30 ===
* 21:04 andrewbogott: also apt-get install python3-novaclient  on tools-prometheus-03 and tools-prometheus-04 to suppress cronspam.  Possible real fix for this is https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/569084/
* 20:39 andrewbogott: apt-get install python3-keystoneclient  on tools-prometheus-03 and tools-prometheus-04 to suppress cronspam.  Possible real fix for this is https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/569084/
* 16:27 arturo: create VM tools-prometheus-04 as cold standby of tools-prometheus-03 ([[phab:T238096|T238096]])
* 16:25 arturo: point tools-prometheus.wmflabs.org proxy to tools-prometheus-03 ([[phab:T238096|T238096]])
* 13:42 arturo: disable puppet in prometheus servers while syncing metric data ([[phab:T238096|T238096]])
* 13:15 arturo: drop floating IP 185.15.56.60 and FQDN `prometheus.tools.wmcloud.org` because this is not how the prometheus setup is right now. Use a web proxy instead `tools-prometheus-new.wmflabs.org` ([[phab:T238096|T238096]])
* 13:09 arturo: created FQDN `prometheus.tools.wmcloud.org` pointing to IPv4 185.15.56.60 (tools-prometheus-03) to test [[phab:T238096|T238096]]
* 12:59 arturo: associated floating IPv4 185.15.56.60 to tools-prometheus-03 ([[phab:T238096|T238096]])
* 12:57 arturo: created domain `tools.wmcloud.org` in the tools project after some back and forth with designate, permissions and the database. I plan to use this domain to test the new Debian Buster-based prometheus setup ([[phab:T238096|T238096]])
* 10:20 arturo: create new VM instance tools-prometheus-03 ([[phab:T238096|T238096]])
 
=== 2020-01-29 ===
* 20:07 bd808: Created <nowiki>{</nowiki>bastion,login,dev<nowiki>}</nowiki>.toolforge.org service names for Toolforge bastions using Horizon & Designate
 
=== 2020-01-28 ===
* 13:35 arturo: `aborrero@tools-clushmaster-02:~$ clush -w @exec-stretch 'for i in $(ps aux {{!}} grep [t]ools.j {{!}} awk -F" " "<nowiki>{</nowiki>print \$2<nowiki>}</nowiki>") ; do  echo "killing $i" ; sudo kill $i ; done {{!}}{{!}} true'` ([[phab:T243831|T243831]])
 
=== 2020-01-27 ===
* 07:05 zhuyifei1999_: wrong package. uninstalled. the correct one is bpfcc-tools and seems only available in buster+. [[phab:T115231|T115231]]
* 07:01 zhuyifei1999_: apt installing bcc on tools-worker-1037 to see who is sending SIGTERM, will uninstall after done. dependency: bin86. [[phab:T115231|T115231]]
 
=== 2020-01-24 ===
* 20:58 bd808: Built tools-k8s-worker-21 to test out build script following openstack client upgrade
* 15:45 bd808: Rebuilding all Docker containers again because I failed to actually update the build server git clone properly last time I did this
* 05:23 bd808: Building 6 new tools-k8s-worker instances for the 2020 Kubernetes cluster (take 2)
* 04:41 bd808: Rebuilding all Docker images to pick up webservice-python-bootstrap changes
 
=== 2020-01-23 ===
* 23:38 bd808: Halted tools-k8s-worker build script after first instance (tools-k8s-worker-10) stuck in "scheduling" state for 20 minutes
* 23:16 bd808: Building 6 new tools-k8s-worker instances for the 2020 Kubernetes cluster
* 05:15 bd808: Building tools-elastic-04
* 04:39 bd808: wmcs-openstack quota set --instances 192
* 04:36 bd808: wmcs-openstack quota set --cores 768 --ram {{Gerrit|1536000}}
 
=== 2020-01-22 ===
* 12:43 arturo: for the record, issue with tools-worker-1016 was memory exhaustion apparently
* 12:35 arturo: hard-reboot tools-worker-1016 (not responding to even console access)
 
=== 2020-01-21 ===
* 19:25 bstorm_: hard rebooting tools-sgeexec-0913/14/35 because they aren't even on the network
* 19:17 bstorm_: depooled and rebooted tools-sgeexec-0914 because it was acting funny
* 18:30 bstorm_: depooling and rebooting tools-sgeexec-[0911,0913,0919,0921,0924,0931,0933,0935,0939,0941].tools.eqiad.wmflabs
* 17:21 bstorm_: rebooting toolschecker to recover stale nfs handle
 
=== 2020-01-16 ===
* 23:54 bstorm_: rebooting tools-docker-builder-06 because there are a couple running containers that don't want to die cleanly
* 23:45 bstorm_: rebuilding docker containers to include new webservice version (0.58)
* 23:41 bstorm_: deployed toollabs-webservice 0.58 to everything that isn't a container
* 16:45 bstorm_: ran configurator to set the gridengine web queues to `rerun FALSE` [[phab:T242397|T242397]]
 
=== 2020-01-14 ===
* 15:29 bstorm_: failed the gridengine master back to the master server from the shadow
* 02:23 andrewbogott: rebooting tools-paws-worker-1006  to resolve hangs associated with an old NFS failure
 
=== 2020-01-13 ===
* 17:48 bd808: Running `puppet ca destroy` for each unsigned cert on tools-puppetmaster-01 ([[phab:T242642|T242642]])
* 16:42 bd808: Cordoned and fixed puppet on tools-k8s-worker-12. Rebooting now. [[phab:T242559|T242559]]
* 16:33 bd808: Cordoned and fixed puppet on tools-k8s-worker-11. Rebooting now. [[phab:T242559|T242559]]
* 16:31 bd808: Cordoned and fixed puppet on tools-k8s-worker-10. Rebooting now. [[phab:T242559|T242559]]
* 16:26 bd808: Cordoned and fixed puppet on tools-k8s-worker-9. Rebooting now. [[phab:T242559|T242559]]
 
=== 2020-01-12 ===
* 22:31 Krenair: same on -13 and -14
* 22:28 Krenair: same on -8
* 22:18 Krenair: same on -7
* 22:11 Krenair: Did usual new instance creation puppet dance on tools-k8s-worker-6, /data/project got created
 
=== 2020-01-11 ===
* 01:33 bstorm_: updated toollabs-webservice package to 0.57, which should allow persisting mem and cpu in manifests with burstable qos.
 
=== 2020-01-10 ===
* 23:31 bstorm_: updated toollabs-webservice package to 0.56
* 15:45 bstorm_: depooled tools-paws-worker-1013 to reboot because I think it is the last tools server with that mount issue (I hope)
* 15:35 bstorm_: depooling and rebooting tools-worker-1016 because it still had the leftover mount problems
* 15:30 bstorm_: git stash-ing local puppet changes in hopes that arturo has that material locally, and it doesn't break anything to do so
 
=== 2020-01-09 ===
* 23:35 bstorm_: depooled tools-sgeexec-0939 because it isn't acting right and rebooting it
* 18:26 bstorm_: re-joining the k8s nodes OF THE PAWS CLUSTER to the cluster one at a time to rotate the certs [[phab:T242353|T242353]]
* 18:25 bstorm_: re-joining the k8s nodes to the cluster one at a time to rotate the certs [[phab:T242353|T242353]]
* 18:06 bstorm_: rebooting tools-paws-master-01 [[phab:T242353|T242353]]
* 17:46 bstorm_: refreshing the paws cluster's entire x509 environment [[phab:T242353|T242353]]
 
=== 2020-01-07 ===
* 22:40 bstorm_: rebooted tools-worker-1007 to recover it from disk full and general badness
* 16:33 arturo: deleted pod metrics/cadvisor-5pd46 by hand due to prometheus having issues scraping it
* 15:46 bd808: Rebooting tools-k8s-worker-[6-14]
* 15:35 bstorm_: changed kubeadm-config to use a list instead of a hash for extravols on the apiserver in the new k8s cluster [[phab:T242067|T242067]]
* 14:02 arturo: `root@tools-k8s-control-3:~# wmcs-k8s-secret-for-cert -n metrics -s metrics-server-certs -a metrics-server` ([[phab:T241853|T241853]])
* 13:33 arturo: upload docker-registry.tools.wmflabs.org/coreos/kube-state-metrics:v1.8.0 copied from quay.io/coreos/kube-state-metrics:v1.8.0 ([[phab:T241853|T241853]])
* 13:31 arturo: upload docker-registry.tools.wmflabs.org/metrics-server-amd64:v0.3.6 copied from k8s.gcr.io/metrics-server-amd64:v0.3.6 ([[phab:T241853|T241853]])
* 13:23 arturo: [new k8s] doing changes to kube-state-metrics and metrics-server trying to relocate them to the 'metrics' namespace ([[phab:T241853|T241853]])
* 05:28 bd808: Creating tools-k8s-worker-[6-14] (again)
* 05:20 bd808: Deleting busted tools-k8s-worker-[6-14]
* 05:02 bd808: Creating tools-k8s-worker-[6-14]
* 00:26 bstorm_: repooled tools-sgewebgrid-lighttpd-0919
* 00:17 bstorm_: repooled tools-sgewebgrid-lighttpd-0918
* 00:15 bstorm_: moving tools-sgewebgrid-lighttpd-0918 and -0919 to cloudvirt1004 from cloudvirt1029 to rebalance load
* 00:02 bstorm_: depooled tools-sgewebgrid-lighttpd-0918 and 0919 to move to cloudvirt1004 to improve spread
 
=== 2020-01-06 ===
* 23:40 bd808: Deleted tools-sgewebgrid-lighttpd-09<nowiki>{</nowiki>0[1-9],10<nowiki>}</nowiki>
* 23:36 bd808: Shutdown tools-sgewebgrid-lighttpd-09<nowiki>{</nowiki>0[1-9],10<nowiki>}</nowiki>
* 23:34 bd808: Decommissioned tools-sgewebgrid-lighttpd-09<nowiki>{</nowiki>0[1-9],10<nowiki>}</nowiki>
* 23:13 bstorm_: Repooled tools-sgeexec-0922 because I don't know why it was depooled
* 23:01 bd808: Depooled tools-sgewebgrid-lighttpd-0910.tools.eqiad.wmflabs
* 22:58 bd808: Depooling tools-sgewebgrid-lighttpd-090[2-9]
* 22:57 bd808: Disabling queues on tools-sgewebgrid-lighttpd-090[2-9]
* 21:07 bd808: Restarted kube2proxy on tools-proxy-05 to try and refresh admin tool's routes
* 18:54 bstorm_: edited /etc/fstab to remove NFS and unmounted the nfs volumes tools-k8s-haproxy-1 [[phab:T241908|T241908]]
* 18:49 bstorm_: edited /etc/fstab to remove NFS and rebooted to clear stale mounts on tools-k8s-haproxy-2 [[phab:T241908|T241908]]
* 18:47 bstorm_: added mount_nfs=false to tools-k8s-haproxy puppet prefix [[phab:T241908|T241908]]
* 18:24 bd808: Deleted shutdown instance tools-worker-1029 (was an SSSD testing instance)
* 16:42 bstorm_: failed sge-shadow-master back to the main grid master
* 16:42 bstorm_: Removed files for old S1tty that wasn't working on sge-grid-master
 
=== 2020-01-04 ===
* 18:11 bd808: Shutdown tools-worker-1029
* 18:10 bd808: kubectl delete node tools-worker-1029.tools.eqiad.wmflabs
* 18:06 bd808: Removed tools-worker-1029.tools.eqiad.wmflabs from k8s::worker_hosts hiera in preparation for decom
* 16:54 bstorm_: moving VMs tools-worker-1012/1028/1005 from cloudvirt1024 to cloudvirt1003 due to hardware errors [[phab:T241884|T241884]]
* 16:47 bstorm_: moving VM tools-flannel-etcd-02 from cloudvirt1024 to cloudvirt1003 due to hardware errors [[phab:T241884|T241884]]
* 16:16 bd808: Draining tools-worker-10<nowiki>{</nowiki>05,12,28<nowiki>}</nowiki> due to hardware errors ([[phab:T241884|T241884]])
* 16:13 arturo: moving VM tools-sgewebgrid-lighttpd-0927 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:11 arturo: moving VM tools-sgewebgrid-lighttpd-0926 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:09 arturo: moving VM tools-sgewebgrid-lighttpd-0925 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:08 arturo: moving VM tools-sgewebgrid-lighttpd-0924 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:07 arturo: moving VM tools-sgewebgrid-lighttpd-0923 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:06 arturo: moving VM tools-sgewebgrid-lighttpd-0909 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:04 arturo: moving VM tools-sgeexec-0923 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:02 arturo: moving VM tools-sgeexec-0910 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241873|T241873]])
 
=== 2020-01-03 ===
* 16:48 bstorm_: updated the ValidatingWebhookConfiguration for the ingress admission controller to the working settings
* 11:51 arturo: [new k8s] deploy cadvisor as in https://gerrit.wikimedia.org/r/c/operations/puppet/+/561654 ([[phab:T237643|T237643]])
* 11:21 arturo: upload k8s.gcr.io/cadvisor:v0.30.2 docker image to the docker registry as docker-registry.tools.wmflabs.org/cadvisor:0.30.2 for [[phab:T237643|T237643]]
* 03:04 bd808: Really rebuilding all <nowiki>{</nowiki>jessie,stretch,buster<nowiki>}</nowiki>-sssd images. Last time I forgot to actually update the git clone.
* 00:11 bd808: Rebuilding all stretch-sssd Docker images to pick up busybox
 
=== 2020-01-02 ===
* 23:54 bd808: Rebuilding all buster-sssd Docker images to pick up busybox
 
=== 2019-12-30 ===
* 05:02 andrewbogott: moving tools-worker-1012 to cloudvirt1024 for [[phab:T241523|T241523]]
* 04:49 andrewbogott: draining and rebooting tools-worker-1031, its drive is full
 
=== 2019-12-29 ===
* 01:38 Krenair: Cordoned tools-worker-1012 and deleted pods associated with dplbot and dewikigreetbot as well as my own testing one, host seems to be under heavy load - [[phab:T241523|T241523]]
 
=== 2019-12-27 ===
* 15:06 Krenair: Killed a "python parse_page.py outreachy" process by aikochou that was hogging IO on tools-sgebastion-07
 
=== 2019-12-25 ===
* 16:07 zhuyifei1999_: pkilled 5 `python pwb.py` processes belonging to `tools.kaleem-bot` on tools-sgebastion-07
 
=== 2019-12-22 ===
* 20:13 bd808: Enabled Puppet on tools-proxy-06.tools.eqiad.wmflabs after nginx config test ([[phab:T241310|T241310]])
* 18:52 bd808: Disabled Puppet on tools-proxy-06.tools.eqiad.wmflabs to test nginx config change ([[phab:T241310|T241310]])
 
=== 2019-12-20 ===
* 22:28 bd808: Re-enabled Puppet on tools-sgebastion-09. Reason for disable was "arturo raising systemd limits"
* 11:33 arturo: reboot tools-k8s-control-3 to fix some stale NFS mount issues
 
=== 2019-12-18 ===
* 17:33 bstorm_: updated package in aptly for toollabs-webservice to 0.53
* 11:49 arturo: introduce placeholder DNS records for toolforge.org domain. No services are provided under this domain yet for end users, this is just us testing (SSL, proxy stuff etc). This may be reverted anytime.
 
=== 2019-12-17 ===
* 20:25 bd808: Fixed https://tools.wmflabs.org/ to redirect to https://tools.wmflabs.org/admin/
* 19:21 bstorm_: deployed the changes to the live proxy to enable the new kubernetes cluster [[phab:T234037|T234037]]
* 16:53 bstorm_: maintain-kubeusers app deployed fully in tools for new kubernetes cluster [[phab:T214513|T214513]] [[phab:T228499|T228499]]
* 16:50 bstorm_: updated the maintain-kubeusers docker image for beta and tools
* 04:48 bstorm_: completed first run of maintain-kubeusers 2 in the new cluster [[phab:T214513|T214513]]
* 01:26 bstorm_: running the first run of maintain-kubeusers 2.0 for the new cluster [[phab:T214513|T214513]] (more successfully this time)
* 01:25 bstorm_: unset the immutable bit from 1704 tool kubeconfigs [[phab:T214513|T214513]]
* 01:05 bstorm_: beginning the first run of the new maintain-kubeusers in gentle-mode -- but it was just killed by some files setting the immutable bit [[phab:T214513|T214513]]
* 00:45 bstorm_: enabled encryption at rest on the new k8s cluster
 
=== 2019-12-16 ===
* 22:04 bd808: Added 'ALLOW IPv4 25/tcp from 0.0.0.0/0' to "MTA" security group applied to tools-mail-02
* 19:05 bstorm_: deployed the maintain-kubeusers operations pod to the new cluster
 
=== 2019-12-14 ===
* 10:48 valhallasw`cloud: re-enabling puppet on tools-sgeexec-0912, likely left-over from NFS maintenance (no reason was specified).
 
=== 2019-12-13 ===
* 18:46 bstorm_: updated tools-k8s-control-2 and 3 to the new config as well
* 17:56 bstorm_: updated tools-k8s-control-1 to the new control plane configuration
* 17:47 bstorm_: edited kubeadm-config configMap object to match the new init config
* 17:32 bstorm_: rebooting tools-k8s-control-2 to correct mount issue
* 00:45 bstorm_: rebooting tools-static-13
* 00:28 bstorm_: rebooting the k8s master to clear NFS errors
* 00:15 bstorm_: switch tools-acme-chief config to match the new authdns_servers format upstream
 
=== 2019-12-12 ===
* 23:36 bstorm_: rebooting toolschecker after downtiming the services
* 22:58 bstorm_: rebooting tools-acme-chief-01
* 22:53 bstorm_: rebooting the cron server, tools-sgecron-01, as it hadn't recovered from last night's maintenance
* 11:20 arturo: rolling reboot for all grid & k8s worker nodes due to NFS staleness
* 09:22 arturo: reboot tools-sgeexec-0911 to try fixing weird NFS state
* 08:46 arturo: doing `run-puppet-agent` in all VMs to see state of NFS
* 08:34 arturo: reboot tools-worker-1033/1034 and tools-sgebastion-08 to try to correct NFS mount issues
 
=== 2019-12-11 ===
* 18:13 bd808: Restarted maintain-dbusers on labstore1004. Process had not logged any account creations since 2019-12-01T22:45:45.
* 17:24 andrewbogott: deleted and/or truncated a bunch of logfiles on tools-worker-1031
 
=== 2019-12-10 ===
* 13:59 arturo: set pod replicas to 3 in the new k8s cluster ([[phab:T239405|T239405]])
 
=== 2019-12-09 ===
* 11:06 andrewbogott: deleting unused security groups: catgraph, devpi, MTA, mysql, syslog, test [[phab:T91619|T91619]]
 
=== 2019-12-04 ===
* 13:45 arturo: drop puppet prefix `tools-cron`, deprecated and no longer in use
 
=== 2019-11-29 ===
* 11:45 arturo: created 3 new VMs `tools-k8s-worker-[3,4,5]` ([[phab:T239403|T239403]])
* 10:28 arturo: re-arm keyholder in tools-acme-chief-01 (password in labs/private.git @ tools-puppetmaster-01)
* 10:27 arturo: re-arm keyholder in tools-acme-chief-02 (password in labs/private.git @ tools-puppetmaster-01)
 
=== 2019-11-26 ===
* 23:25 bstorm_: rebuilding docker images to include the new webservice 0.52 in all versions instead of just the stretch ones [[phab:T236202|T236202]]
* 22:57 bstorm_: push upgraded webservice 0.52 to the buster and jessie repos for container rebuilds [[phab:T236202|T236202]]
* 19:55 phamhi: drained tools-worker-1002,8,15,32 to rebalance the cluster
* 19:45 phamhi: cleaned up a container that was taking up 16G of disk space on tools-worker-1020 in order to re-run the puppet client
* 14:01 arturo: drop hiera references to `tools-test-proxy-01.tools.eqiad.wmflabs`. Such VM no longer exists
* 14:00 arturo: introduce the `profile::toolforge::proxies` hiera key in the global puppet config
 
=== 2019-11-25 ===
* 10:35 arturo: refresh puppet certs for tools-k8s-etcd-[4-6] nodes ([[phab:T238655|T238655]])
* 10:35 arturo: add puppet cert SANs via instance hiera to tools-k8s-etcd-[4-6] nodes ([[phab:T238655|T238655]])
 
=== 2019-11-22 ===
* 13:32 arturo: created security group `tools-new-k8s-full-connectivity` and add new k8s VMs to it ([[phab:T238654|T238654]])
* 05:55 jeh: add Riley Huntley `riley` to base tools project
 
=== 2019-11-21 ===
* 12:48 arturo: reboot the new k8s cluster after the upgrade
* 11:49 arturo: upgrading new k8s kubectl version to 1.15.6 ([[phab:T238654|T238654]])
* 11:44 arturo: upgrading new k8s kubelet version to 1.15.6 ([[phab:T238654|T238654]])
* 10:29 arturo: upgrading new k8s cluster version to 1.15.6 using kubeadm ([[phab:T238654|T238654]])
* 10:28 arturo: install kubeadm 1.15.6 on worker/control nodes in the new k8s cluster ([[phab:T238654|T238654]])
 
=== 2019-11-19 ===
* 13:49 arturo: re-create nginx-ingress pod due to deployment template refresh ([[phab:T237643|T237643]])
* 12:46 arturo: deploy changes to tools-prometheus to account for the new k8s cluster ([[phab:T237643|T237643]])
 
=== 2019-11-15 ===
* 14:44 arturo: stop live-hacks on tools-prometheus-01 [[phab:T237643|T237643]]
 
=== 2019-11-13 ===
* 17:20 arturo: live-hacking tools-prometheus-01 to test some experimental configs for the new k8s cluster ([[phab:T237643|T237643]])
 
=== 2019-11-12 ===
* 12:52 arturo: reboot tools-proxy-06 to reset iptables setup [[phab:T238058|T238058]]
 
=== 2019-11-10 ===
* 02:17 bd808: Building new Docker images for [[phab:T237836|T237836]] (retrying after cleaning out old images on tools-docker-builder-06)
* 02:15 bd808: Cleaned up old images on tools-docker-builder-06 using instructions from https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes#Building_toolforge_specific_images
* 02:10 bd808: Building new Docker images for [[phab:T237836|T237836]]
* 01:45 bstorm_: deploying bugfix for webservice in tools and toolsbeta [[phab:T237836|T237836]]
 
=== 2019-11-08 ===
* 22:47 bstorm_: adding rsync::server::wrap_with_stunnel: false to the tools-docker-registry-03/4 servers to unbreak puppet
* 18:40 bstorm_: pushed new webservice package to the bastions [[phab:T230961|T230961]]
* 18:37 bstorm_: pushed new webservice package supporting buster containers to repo [[phab:T230961|T230961]]
* 18:36 bstorm_: pushed buster-sssd images to the docker repo
* 17:15 phamhi: pushed new buster images with the prefix name "toolforge"
 
=== 2019-11-07 ===
* 13:27 arturo: deployed registry-admission-webhook and ingress-admission-controller into the new k8s cluster ([[phab:T236826|T236826]])
* 13:01 arturo: creating puppet prefix `tools-k8s-worker` and a couple of VMs `tools-k8s-worker-[1,2]` [[phab:T236826|T236826]]
* 12:57 arturo: increasing project quota [[phab:T237633|T237633]]
* 11:54 arturo: point `k8s.tools.eqiad1.wikimedia.cloud` to tools-k8s-haproxy-1 [[phab:T236826|T236826]]
* 11:43 arturo: create VMs `tools-k8s-haproxy-[1,2]` [[phab:T236826|T236826]]
* 11:43 arturo: create puppet prefix `tools-k8s-haproxy` [[phab:T236826|T236826]]
 
=== 2019-11-06 ===
* 22:32 bstorm_: added rsync::server::wrap_with_stunnel: false to tools-sge-services prefix to fix puppet
* 21:33 bstorm_: docker images needed for kubernetes cluster upgrade deployed [[phab:T215531|T215531]]
* 20:26 bstorm_: building and pushing docker images needed for kubernetes cluster upgrade
* 16:10 arturo: new k8s cluster control nodes are bootstrapped ([[phab:T236826|T236826]])
* 13:51 arturo: created FQDN `k8s.tools.eqiad1.wikimedia.cloud` pointing to `tools-k8s-control-1` for the initial bootstrap ([[phab:T236826|T236826]])
* 13:50 arturo: created 3 VMs `tools-k8s-control-[1,2,3]` ([[phab:T236826|T236826]])
* 13:43 arturo: created `tools-k8s-control` puppet prefix [[phab:T236826|T236826]]
* 11:57 phamhi: restarted all webservices in grid ([[phab:T233347|T233347]])
 
=== 2019-11-05 ===
* 23:08 Krenair: Dropped {{Gerrit|59a77a3}}, {{Gerrit|3830802}}, and {{Gerrit|83df61f}} from tools-puppetmaster-01:/var/lib/git/labs/private cherry-picks as these are no longer required [[phab:T206235|T206235]]
* 22:49 Krenair: Disassociated floating IP 185.15.56.60 from tools-static-13, traffic to this host goes via the project-proxy now. DNS was already changed a few days ago. [[phab:T236952|T236952]]
* 22:35 bstorm_: upgraded libpython3.4 libpython3.4-dbg libpython3.4-minimal libpython3.4-stdlib python3.4 python3.4-dbg python3.4-minimal to fix an old broken patch [[phab:T237468|T237468]]
* 22:12 bstorm_: pushed docker-registry.tools.wmflabs.org/maintain-kubeusers:beta to the registry to deploy in toolsbeta
* 17:38 phamhi: restarted lighttpd-based webservice pods on tools-worker-103x and 1040 ([[phab:T233347|T233347]])
* 17:34 phamhi: restarted lighttpd-based webservice pods on tools-worker-102[0-9] ([[phab:T233347|T233347]])
* 17:06 phamhi: restarted lighttpd-based webservice pods on tools-worker-101[0-9] ([[phab:T233347|T233347]])
* 16:44 phamhi: restarted lighttpd-based webservice pods on tools-worker-100[1-9] ([[phab:T233347|T233347]])
* 13:55 arturo: created 3 new VMs: `tools-k8s-etcd-[4,5,6]` [[phab:T236826|T236826]]
 
=== 2019-11-04 ===
* 14:45 phamhi: Built and pushed ruby25 docker image based on buster ([[phab:T230961|T230961]])
* 14:45 phamhi: Built and pushed golang111 docker image based on buster ([[phab:T230961|T230961]])
* 14:45 phamhi: Built and pushed jdk11 docker image based on buster ([[phab:T230961|T230961]])
* 14:45 phamhi: Built and pushed php73 docker image based on buster ([[phab:T230961|T230961]])
* 11:10 phamhi: Built and pushed python37 docker image based on buster ([[phab:T230961|T230961]])
 
=== 2019-11-01 ===
* 21:00 Krenair: Removed tools-checker.wmflabs.org A record to 208.80.155.229 as that target IP is in the old pre-neutron range that is no longer routed
* 20:57 Krenair: Removed trusty.tools.wmflabs.org CNAME to login-trusty.tools.wmflabs.org as that target record does not exist, presumably deleted ages ago
* 20:56 Krenair: Removed tools-trusty.wmflabs.org CNAME to login-trusty.tools.wmflabs.org as that target record does not exist, presumably deleted ages ago
* 20:38 Krenair: Updated A record for tools-static.wmflabs.org to point towards project-proxy [[phab:T236952|T236952]]
 
=== 2019-10-31 ===
* 18:47 andrewbogott: deleted and/or truncated a bunch of logfiles on tools-worker-1001. Runaway logfiles filled up the drive, which prevented puppet from running. If puppet had run, it would have prevented the runaway logfiles.
* 13:59 arturo: update puppet prefix `tools-k8s-etcd-` to use the `role::wmcs::toolforge::k8s::etcd` [[phab:T236826|T236826]]
* 13:41 arturo: disabling puppet in tools-k8s-etcd- nodes to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/546995
* 10:15 arturo: SSL cert replacement for tools-docker-registry and tools-k8s-master went fine apparently ([[phab:T236962|T236962]])
* 10:02 arturo: icinga downtime toolschecker for 1h for replacing SSL certs in tools-docker-registry and tools-k8s-master ([[phab:T236962|T236962]])
 
=== 2019-10-30 ===
* 13:53 arturo: replacing SSL cert in tools-proxy-x server apparently OK (merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/545679) [[phab:T235252|T235252]]
* 13:48 arturo: replacing SSL cert in tools-proxy-x server (live-hacking https://gerrit.wikimedia.org/r/c/operations/puppet/+/545679 first for testing) [[phab:T235252|T235252]]
* 13:40 arturo: icinga downtime toolschecker for 1h for replacing SSL cert [[phab:T235252|T235252]]
 
=== 2019-10-29 ===
* 10:49 arturo: deleting VMs tools-test-proxy-01, no longer in use
* 10:07 arturo: deleting old jessie VMs tools-proxy-03/04 [[phab:T235627|T235627]]
 
=== 2019-10-28 ===
* 16:06 arturo: delete VM instance `tools-test-proxy-01` and the puppet prefix `tools-test-proxy`
* 15:54 arturo: tools-proxy-05 has now the 185.15.56.11 floating IP as active proxy. Old one 185.15.56.6 has been freed [[phab:T235627|T235627]]
* 15:54 arturo: shutting down tools-proxy-03 [[phab:T235627|T235627]]
* 15:26 bd808: Killed all processes owned by jem on tools-sgebastion-08
* 15:16 arturo: tools-proxy-05 has now the 185.15.56.5 floating IP as active proxy [[phab:T235627|T235627]]
* 15:14 arturo: refresh hiera to use tools-proxy-05 as active proxy [[phab:T235627|T235627]]
* 15:11 bd808: Killed ircbot.php processes started by jem on tools-sgebastion-08 per request on irc
* 14:58 arturo: added `webproxy` security group to tools-proxy-05 and tools-proxy-06 ([[phab:T235627|T235627]])
* 14:57 phamhi: drained tools-worker-1031.tools.eqiad.wmflabs to clean up disk space
* 14:45 arturo: created VMs tools-proxy-05 and tools-proxy-06 ([[phab:T235627|T235627]])
* 14:43 arturo: adding `role::wmcs::toolforge::proxy` to the `tools-proxy` puppet prefix ([[phab:T235627|T235627]])
* 14:42 arturo: deleted `role::toollabs::proxy` from the `tools-proxy` puppet profile ([[phab:T235627|T235627]])
* 14:34 arturo: icinga downtime toolschecker for 1h ([[phab:T235627|T235627]])
* 12:25 arturo: upload image `coredns` v1.3.1 ({{Gerrit|eb516548c180}}) to docker registry ([[phab:T236249|T236249]])
* 12:23 arturo: upload image `kube-apiserver` v1.15.1 ({{Gerrit|68c3eb07bfc3}}) to docker registry ([[phab:T236249|T236249]])
* 12:22 arturo: upload image `kube-controller-manager` v1.15.1 ({{Gerrit|d75082f1d121}}) to docker registry ([[phab:T236249|T236249]])
* 12:20 arturo: upload image `kube-proxy` v1.15.1 ({{Gerrit|89a062da739d}}) to docker registry ([[phab:T236249|T236249]])
* 12:19 arturo: upload image `kube-scheduler` v1.15.1 ({{Gerrit|b0b3c4c404da}}) to docker registry ([[phab:T236249|T236249]])
* 12:04 arturo: upload image `calico/node` v3.8.0 ({{Gerrit|cd3efa20ff37}}) to docker registry ([[phab:T236249|T236249]])
* 12:03 arturo: upload image `calico/calico/pod2daemon-flexvol` v3.8.0 ({{Gerrit|f68c8f870a03}}) to docker registry ([[phab:T236249|T236249]])
* 12:01 arturo: upload image `calico/cni` v3.8.0 ({{Gerrit|539ca36a4c13}}) to docker registry ([[phab:T236249|T236249]])
* 11:58 arturo: upload image `calico/kube-controllers` v3.8.0 ({{Gerrit|df5ff96cd966}}) to docker registry ([[phab:T236249|T236249]])
* 11:47 arturo: upload image `nginx-ingress-controller` v0.25.1 ({{Gerrit|0439eb3e11f1}}) to docker registry ([[phab:T236249|T236249]])
 
=== 2019-10-24 ===
* 16:32 bstorm_: set the prod rsyslog config for kubernetes to false for Toolforge
 
=== 2019-10-23 ===
* 20:00 phamhi: Rebuilding all jessie and stretch docker images to pick up toollabs-webservice 0.47 ([[phab:T233347|T233347]])
* 12:09 phamhi: Deployed toollabs-webservice 0.47 to buster-tools and stretch-tools ([[phab:T233347|T233347]])
* 09:13 arturo: 9 tools-sgeexec nodes and 6 other related VMs are down because their hypervisor is rebooting
* 09:03 arturo: tools-sgebastion-08 is down because its hypervisor is rebooting
 
=== 2019-10-22 ===
* 16:56 bstorm_: drained tools-worker-1025.tools.eqiad.wmflabs, which was malfunctioning
* 09:25 arturo: created the `tools.eqiad1.wikimedia.cloud.` DNS zone
 
=== 2019-10-21 ===
* 17:32 phamhi: Rebuilding all jessie and stretch docker images to pick up toollabs-webservice 0.46
 
=== 2019-10-18 ===
* 22:15 bd808: Rescheduled continuous jobs away from tools-sgeexec-0904 because of high system load
* 22:09 bd808: Cleared error state of webgrid-generic@tools-sgewebgrid-generic-0901, webgrid-lighttpd@tools-sgewebgrid-lighttpd-09{12,15,19,20,26}
* 21:29 bd808: Rescheduled all grid engine webservice jobs ([[phab:T217815|T217815]])
 
=== 2019-10-16 ===
* 16:21 phamhi: Deployed toollabs-webservice 0.46 to buster-tools and stretch-tools ([[phab:T218461|T218461]])
* 09:29 arturo: toolforge has recovered from the reboot of cloudvirt1029
* 09:17 arturo: due to the reboot of cloudvirt1029, several sgeexec nodes (8) are offline, also sgewebgrid-lighttpd (8) and tools-worker (3) and the main toolforge proxy (tools-proxy-03)
 
=== 2019-10-15 ===
* 17:10 phamhi: restart tools-worker-1035 because it is no longer responding
 
=== 2019-10-14 ===
* 09:26 arturo: cleaned-up updatetools from tools-sge-services nodes ([[phab:T229261|T229261]])
 
=== 2019-10-11 ===
* 19:52 bstorm_: restarted docker on tools-docker-builder after phamhi noticed the daemon had a routing issue (blank iptables)
* 11:55 arturo: create tools-test-proxy-01 VM for testing [[phab:T235059|T235059]] and a puppet prefix for it
* 10:53 arturo: added kubernetes-node_1.4.6-7_amd64.deb to buster-tools and buster-toolsbeta (aptly) for [[phab:T235059|T235059]]
* 10:51 arturo: added docker-engine_1.12.6-0~debian-jessie_amd64.deb to buster-tools and buster-toolsbeta (aptly) for [[phab:T235059|T235059]]
* 10:46 arturo: added logster_0.0.10-2~jessie1_all.deb to buster-tools and buster-toolsbeta (aptly) for [[phab:T235059|T235059]]
 
=== 2019-10-10 ===
* 02:33 bd808: Rebooting tools-sgewebgrid-lighttpd-0903. Instance hung.
 
=== 2019-10-09 ===
* 22:52 jeh: removing test instances tools-sssd-sgeexec-test-[12] from SGE
* 15:32 phamhi: drained tools-worker-1020/23/33/35/36/40 to rebalance the cluster
* 14:46 phamhi: drained and cordoned tools-worker-1029 after status reset on reboot
* 12:37 arturo: drain tools-worker-1038 to rebalance load in the k8s cluster
* 12:35 arturo: uncordon tools-worker-1029 (was disabled for unknown reasons)
* 12:33 arturo: drain tools-worker-1010 to rebalance load
* 10:33 arturo: several sgewebgrid-lighttpd nodes (9) not available because cloudvirt1013 is rebooting
* 10:21 arturo: several worker nodes (7) not available because cloudvirt1012 is rebooting
* 10:08 arturo: several worker nodes (6) not available because cloudvirt1009 is rebooting
* 09:59 arturo: several worker nodes (5) not available because cloudvirt1008 is rebooting
 
=== 2019-10-08 ===
* 19:40 bstorm_: drained tools-worker-1007/8 to rebalance the cluster
* 19:34 bstorm_: drained tools-worker-1009 and then 1014 for rebalancing
* 19:27 bstorm_: drained tools-worker-1005 for rebalancing (and put these back in service as I went)
* 19:24 bstorm_: drained tools-worker-1003 and 1009 for rebalancing
* 15:41 arturo: deleted VM instance tools-sgebastion-0test. No longer in use.
 
=== 2019-10-07 ===
* 20:17 bd808: Dropped backlog of messages for delivery to tools.usrd-tools
* 20:16 bd808: Dropped backlog of messages for delivery to tools.mix-n-match
* 20:13 bd808: Dropped backlog of frozen messages for delivery (240 dropped)
* 19:25 bstorm_: deleted tools-puppetmaster-02
* 19:20 Krenair: reboot tools-k8s-master-01 due to nfs stale issue
* 19:18 Krenair: reboot tools-paws-worker-1006 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1040 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1039 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1038 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1037 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1036 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1035 due to nfs stale issue
* 19:15 phamhi: reboot tools-worker-1034 due to nfs stale issue
* 19:15 phamhi: reboot tools-worker-1033 due to nfs stale issue
* 19:15 phamhi: reboot tools-worker-1032 due to nfs stale issue
* 19:15 phamhi: reboot tools-worker-1031 due to nfs stale issue
* 19:15 phamhi: reboot tools-worker-1030 due to nfs stale issue
* 19:10 Krenair: reboot tools-puppetmaster-02 due to nfs stale issue
* 19:09 Krenair: reboot tools-sgebastion-0test due to nfs stale issue
* 19:08 Krenair: reboot tools-sgebastion-09 due to nfs stale issue
* 19:08 Krenair: reboot tools-sge-services-04 due to nfs stale issue
* 19:07 Krenair: reboot tools-paws-worker-1002 due to nfs stale issue
* 19:06 Krenair: reboot tools-mail-02 due to nfs stale issue
* 19:06 Krenair: reboot tools-docker-registry-03 due to nfs stale issue
* 19:04 Krenair: reboot tools-worker-1029 due to nfs stale issue
* 19:00 Krenair: reboot tools-static-12 tools-docker-registry-04 and tools-clushmaster-02 due to NFS stale issue
* 18:55 phamhi: reboot tools-worker-1028 due to nfs stale issue
* 18:55 phamhi: reboot tools-worker-1027 due to nfs stale issue
* 18:55 phamhi: reboot tools-worker-1026 due to nfs stale issue
* 18:55 phamhi: reboot tools-worker-1025 due to nfs stale issue
* 18:47 phamhi: reboot tools-worker-1023 due to nfs stale issue
* 18:47 phamhi: reboot tools-worker-1022 due to nfs stale issue
* 18:46 phamhi: reboot tools-worker-1021 due to nfs stale issue
* 18:46 phamhi: reboot tools-worker-1020 due to nfs stale issue
* 18:46 phamhi: reboot tools-worker-1019 due to nfs stale issue
* 18:46 phamhi: reboot tools-worker-1018 due to nfs stale issue
* 18:34 phamhi: reboot tools-worker-1017 due to nfs stale issue
* 18:34 phamhi: reboot tools-worker-1016 due to nfs stale issue
* 18:32 phamhi: reboot tools-worker-1015 due to nfs stale issue
* 18:32 phamhi: reboot tools-worker-1014 due to nfs stale issue
* 18:23 phamhi: reboot tools-worker-1013 due to nfs stale issue
* 18:21 phamhi: reboot tools-worker-1012 due to nfs stale issue
* 18:12 phamhi: reboot tools-worker-1011 due to nfs stale issue
* 18:12 phamhi: reboot tools-worker-1010 due to nfs stale issue
* 18:08 phamhi: reboot tools-worker-1009 due to nfs stale issue
* 18:07 phamhi: reboot tools-worker-1008 due to nfs stale issue
* 17:58 phamhi: reboot tools-worker-1007 due to nfs stale issue
* 17:57 phamhi: reboot tools-worker-1006 due to nfs stale issue
* 17:47 phamhi: reboot tools-worker-1005 due to nfs stale issue
* 17:47 phamhi: reboot tools-worker-1004 due to nfs stale issue
* 17:43 phamhi: reboot tools-worker-1002.tools.eqiad.wmflabs due to nfs stale issue
* 17:35 phamhi: drained and uncordoned tools-worker-100[1-5]
* 17:32 bstorm_: reboot tools-sgewebgrid-lighttpd-0912
* 17:30 bstorm_: reboot tools-sgewebgrid-lighttpd-0923/24/08
* 17:01 bstorm_: rebooting tools-sgegrid-master and tools-sgegrid-shadow 😭
* 16:58 bstorm_: rebooting tools-sgewebgrid-lighttpd-0902/4/6/7/8/19
* 16:53 bstorm_: rebooting tools-sgewebgrid-generic-0902/4
* 16:50 bstorm_: rebooting tools-sgeexec-0915/18/19/23/26
* 16:49 bstorm_: rebooting tools-sgeexec-0901 and tools-sgeexec-0909/10/11
* 16:46 bd808: `sudo shutdown -r now` for tools-sgebastion-08
* 16:41 bstorm_: reboot tools-sgebastion-07
* 16:39 bd808: `sudo service nslcd restart` on tools-sgebastion-08
 
=== 2019-10-04 ===
* 21:43 bd808: `sudo exec-manage repool tools-sgeexec-0923.tools.eqiad.wmflabs`
* 21:26 bd808: Rebooting tools-sgeexec-0923 after lots of messing about with a broken update-initramfs build
* 20:35 bd808: Manually running `/usr/bin/python3 /usr/bin/unattended-upgrade` on tools-sgeexec-0923
* 20:33 bd808: Killed 2 /usr/bin/unattended-upgrade procs on tools-sgeexec-0923 that seemed stuck
* 13:33 arturo: remove /etc/init.d/rsyslog on tools-worker-XXXX nodes so the rsyslog deb prerm script doesn't prevent the package from being updated
 
=== 2019-10-03 ===
* 13:05 arturo: delete servers tools-sssd-sgeexec-test-[1,2], no longer required
 
=== 2019-09-27 ===
* 16:59 bd808: Set "profile::rsyslog::kafka_shipper::kafka_brokers: []" in tools-elastic prefix puppet
* 00:40 bstorm_: depooled and rebooted tools-sgewebgrid-lighttpd-0927
 
=== 2019-09-25 ===
* 19:08 andrewbogott: moving tools-sgewebgrid-lighttpd-0903 to cloudvirt1021
 
=== 2019-09-23 ===
* 16:58 bstorm_: deployed tools-manifest 0.20 and restarted webservicemonitor
* 06:01 bd808: Restarted maintain-dbusers process on labstore1004. ([[phab:T233530|T233530]])
 
=== 2019-09-12 ===
* 20:48 phamhi: Deleted tools-puppetdb-01.tools as it is no longer in use
 
=== 2019-09-11 ===
* 13:30 jeh: restart tools-sgeexec-0912
 
=== 2019-09-09 ===
* 22:44 bstorm_: uncordoned tools-worker-1030 and tools-worker-1038
 
=== 2019-09-06 ===
* 15:11 bd808: `sudo kill -9 10635` on tools-k8s-master-01 ([[phab:T194859|T194859]])
 
=== 2019-09-05 ===
* 21:02 bd808: Enabled Puppet on tools-docker-registry-03 and forced puppet run ([[phab:T232135|T232135]])
* 18:13 bd808: Disabled Puppet on tools-docker-registry-03 to investigate docker-registry issue (no phab task yet)
 
=== 2019-09-01 ===
* 20:51 Reedy: `sudo service maintain-kubeusers restart` on tools-k8s-master-01
 
=== 2019-08-30 ===
* 16:54 phamhi: restart maintain-kubeusers service in tools-k8s-master-01
* 16:21 bstorm_: depooling tools-sgewebgrid-lighttpd-0923 to reboot -- high iowait likely from NFS mounts
 
=== 2019-08-29 ===
* 22:18 bd808: Finished building new stretch Docker images for Toolforge Kubernetes use
* 22:06 bd808: Starting process of building new stretch Docker images for Toolforge Kubernetes use
* 22:05 bd808: Jessie Docker image rebuild complete
* 21:31 bd808: Starting process of building new jessie Docker images for Toolforge Kubernetes use
 
=== 2019-08-27 ===
* 19:10 bd808: Restarted maintain-kubeusers after complaint on irc. It was stuck in limbo again
 
=== 2019-08-26 ===
* 21:48 bstorm_: repooled tools-sgewebgrid-generic-0902, tools-sgewebgrid-lighttpd-0902, tools-sgewebgrid-lighttpd-0903 and tools-sgeexec-0905
 
=== 2019-08-18 ===
* 08:11 arturo: restart maintain-kubeusers service in tools-k8s-master-01
 
=== 2019-08-17 ===
* 10:56 arturo: force-reboot tools-worker-1006. Is completely stuck
 
=== 2019-08-15 ===
* 15:32 jeh: upgraded jobutils debian package to 1.38 [[phab:T229551|T229551]]
* 09:22 arturo: restart maintain-kubeusers service in tools-k8s-master-01 because some tools were missing their namespaces
 
=== 2019-08-13 ===
* 22:00 bstorm_: truncated exim paniclog on tools-sgecron-01 because it was being spammy
* 13:41 jeh: Set icinga downtime for toolschecker labs showmount [[phab:T229448|T229448]]
 
=== 2019-08-12 ===
* 16:08 phamhi: updated prometheus-node-exporter from 0.14.0~git20170523-1 to 0.17.0+ds-3 in tools-worker-[1030-1040] nodes ([[phab:T230147|T230147]])
 
=== 2019-08-08 ===
* 19:26 jeh: restarting tools-sgewebgrid-lighttpd-0915 [[phab:T230157|T230157]]
 
=== 2019-08-07 ===
* 19:07 bd808: Disassociated SUL and Phabricator accounts from user Lophi ([[phab:T229713|T229713]])
 
=== 2019-08-06 ===
* 16:18 arturo: add phamhi as user/projectadmin ([[phab:T228942|T228942]]) and delete hpham
* 15:59 arturo: add hpham as user/projectadmin ([[phab:T228942|T228942]])
* 13:44 jeh: disabling puppet on tools-checker-03 while testing nginx timeouts [[phab:T221301|T221301]]
 
=== 2019-08-05 ===
* 22:49 bstorm_: launching tools-worker-1040
* 20:36 andrewbogott: rebooting oom tools-worker-1026
* 16:10 jeh: `tools-k8s-master-01: systemctl restart maintain-kubeusers` [[phab:T229846|T229846]]
* 09:39 arturo: `root@tools-checker-03:~# toolscheckerctl restart` again ([[phab:T229787|T229787]])
* 09:30 arturo: `root@tools-checker-03:~# toolscheckerctl restart` ([[phab:T229787|T229787]])
 
=== 2019-08-02 ===
* 14:00 andrewbogott_: rebooting tools-worker-1022 as it is unresponsive
 
=== 2019-07-31 ===
* 18:07 bstorm_: drained tools-worker-1015/05/03/17 to rebalance load
* 17:41 bstorm_: drained tools-worker-1025 and 1026 to rebalance load
* 17:32 bstorm_: drained tools-worker-1028 to rebalance load
* 17:29 bstorm_: drained tools-worker-1008 to rebalance load
* 17:23 bstorm_: drained tools-worker-1021 to rebalance load
* 17:17 bstorm_: drained tools-worker-1007 to rebalance load
* 17:07 bstorm_: drained tools-worker-1004 to rebalance load
* 16:27 andrewbogott: moving tools-static-12 to cloudvirt1018
* 15:33 bstorm_: [[phab:T228573|T228573]] spinning up 5 worker nodes for kubernetes cluster (tools-worker-1035-9)
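The drain entries above rebalance pods by evicting them from a loaded Kubernetes worker and letting the scheduler place them elsewhere, then re-enabling the node. A hypothetical sketch of that cycle; the node name is an example and the kubectl commands are printed, not run against a cluster:

```shell
# Sketch of the drain/rebalance cycle logged above. The node name is
# illustrative; kubectl commands are printed, not executed.
rebalance_node() {
  local node="$1"
  # Evict pods so the scheduler places them on less-loaded workers
  echo "kubectl drain ${node} --ignore-daemonsets --delete-local-data"
  # Allow scheduling on the node again once load has settled
  echo "kubectl uncordon ${node}"
}

rebalance_node tools-worker-1004.tools.eqiad.wmflabs
```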
 
=== 2019-07-27 ===
* 23:00 zhuyifei1999_: a past probably related ticket: [[phab:T194859|T194859]]
* 22:57 zhuyifei1999_: maintain-kubeusers seems stuck. Traceback: https://phabricator.wikimedia.org/P8812, core dump: /root/core.17898. Restarting
 
=== 2019-07-26 ===
* 17:39 bstorm_: restarted maintain-kubeusers because it was suspiciously tardy and quiet
* 17:14 bstorm_: drained tools-worker-1013.tools.eqiad.wmflabs to rebalance load
* 17:09 bstorm_: draining tools-worker-1020.tools.eqiad.wmflabs to rebalance load
* 16:32 bstorm_: created tools-worker-1034 - [[phab:T228573|T228573]]
* 15:57 bstorm_: created tools-worker-1032 and 1033 - [[phab:T228573|T228573]]
* 15:55 bstorm_: created tools-worker-1031 - [[phab:T228573|T228573]]
 
=== 2019-07-25 ===
* 22:01 bstorm_: [[phab:T228573|T228573]] created tools-worker-1030
* 21:22 jeh: rebooting tools-worker-1016 unresponsive
 
=== 2019-07-24 ===
* 10:14 arturo: reallocating tools-puppetmaster-01 from cloudvirt1027 to cloudvirt1028 ([[phab:T227539|T227539]])
* 10:12 arturo: reallocating tools-docker-registry-04 from cloudvirt1027 to cloudvirt1028 ([[phab:T227539|T227539]])
 
=== 2019-07-22 ===
* 18:39 bstorm_: repooled tools-sgeexec-0905 after reboot
* 18:33 bstorm_: depooled tools-sgeexec-0905 because it's acting kind of weird and not responding to prometheus
* 18:32 bstorm_: repooled tools-sgewebgrid-lighttpd-0902 after restarting the grid-exec service
* 18:28 bstorm_: depooled tools-sgewebgrid-lighttpd-0902 to find out why it is behaving weird
* 17:55 bstorm_: draining tools-worker-1023 since it is having issues
* 17:38 bstorm_: Adding the prometheus servers to the ferm rules via wikitech hiera for kubelet stats [[phab:T228573|T228573]]
 
=== 2019-07-20 ===
* 19:52 andrewbogott: rebooting tools-worker-1023
 
=== 2019-07-17 ===
* 20:23 andrewbogott: migrating tools-sgegrid-shadow to cloudvirt1014
 
=== 2019-07-15 ===
* 14:50 bstorm_: cleared error state from tools-sgeexec-0911 which went offline after error from job {{Gerrit|5190035}}
 
=== 2019-06-25 ===
* 09:30 arturo: detected puppet issue in all VMs: [[phab:T226480|T226480]]
 
=== 2019-06-24 ===
* 17:42 andrewbogott: moving tools-sgeexec-0905 to cloudvirt1015
 
=== 2019-06-17 ===
* 14:07 andrewbogott: moving tools-sgewebgrid-lighttpd-0903 to cloudvirt1015
* 13:59 andrewbogott: moving tools-sgewebgrid-generic-0902 and tools-sgewebgrid-lighttpd-0902 to cloudvirt1015 (optimistic re: [[phab:T220853|T220853]] )
 
=== 2019-06-11 ===
* 18:03 bstorm_: deleted anomalous kubernetes node tools-worker-1019.eqiad.wmflabs
 
=== 2019-06-05 ===
* 18:33 andrewbogott: repooled tools-sgeexec-0921 and tools-sgeexec-0929
* 18:16 andrewbogott: depooling and moving tools-sgeexec-0921 and tools-sgeexec-0929
 
=== 2019-05-30 ===
* 13:01 arturo: uncordon/repool tools-worker-1001/2/3. They should be fine now. I'm only leaving 1029 cordoned for testing purposes
* 13:01 arturo: reboot tools-worker-1003 to cleanup sssd config and let nslcd/nscd start freshly
* 12:47 arturo: reboot tools-worker-1002 to cleanup sssd config and let nslcd/nscd start freshly
* 12:42 arturo: reboot tools-worker-1001 to cleanup sssd config and let nslcd/nscd start freshly
* 12:35 arturo: enable puppet in tools-worker nodes
* 12:29 arturo: switch hiera setting back to classic/sudoldap for tools-worker because [[phab:T224651|T224651]] ([[phab:T224558|T224558]])
* 12:25 arturo: cordon/drain tools-worker-1002 because [[phab:T224651|T224651]] and [[phab:T224651|T224651]]
* 12:23 arturo: cordon/drain tools-worker-1001 because [[phab:T224651|T224651]] and [[phab:T224651|T224651]]
* 12:22 arturo: cordon/drain tools-worker-1029 because [[phab:T224651|T224651]] and [[phab:T224651|T224651]]
* 12:20 arturo: cordon/drain tools-worker-1003 because [[phab:T224651|T224651]] and [[phab:T224651|T224651]]
* 11:59 arturo: [[phab:T224558|T224558]] repool tools-worker-1003 (using sssd/sudo now!)
* 11:23 arturo: [[phab:T224558|T224558]] depool tools-worker-1003
* 10:48 arturo: [[phab:T224558|T224558]] drop/build a VM for tools-worker-1002. It didn't like the sssd/sudo change :-(
* 10:33 arturo: [[phab:T224558|T224558]] switch tools-worker-1002 to sssd/sudo. Includes drain/depool/reboot/repool
* 10:28 arturo: [[phab:T224558|T224558]] use hiera config in prefix tools-worker for sssd/sudo
* 10:27 arturo: [[phab:T224558|T224558]] switch tools-worker-1001 to sssd/sudo. Includes drain/depool/reboot/repool
* 10:09 arturo: [[phab:T224558|T224558]] disable puppet in all tools-worker- nodes
* 10:01 arturo: [[phab:T224558|T224558]] add tools-worker-1029 to the nodes pool of k8s
* 09:58 arturo: [[phab:T224558|T224558]] reboot tools-worker-1029 after puppet changes for sssd/sudo in jessie
 
=== 2019-05-29 ===
* 11:13 arturo: briefly tested some sssd config changes in tools-sgebastion-09
* 10:13 arturo: enroll the tools-worker-1029 VM into toolforge k8s, but leave it cordoned for sssd testing purposes ([[phab:T221225|T221225]])
* 10:12 arturo: re-create the tools-worker-1001 VM, already enrolled into toolforge k8s
* 09:34 arturo: delete tools-worker-1001, it was totally malfunctioning
 
=== 2019-05-28 ===
* 18:15 arturo: [[phab:T221225|T221225]] for the record, tools-worker-1001 is not working after trying with sssd
* 18:13 arturo: [[phab:T221225|T221225]] created tools-worker-1029 to test sssd/sudo stuff
* 17:49 arturo: [[phab:T221225|T221225]] repool tools-worker-1002 (using nscd/nslcd and sudoldap)
* 17:44 arturo: [[phab:T221225|T221225]] back to classic/ldap hiera config in the tools-worker puppet prefix
* 17:35 arturo: [[phab:T221225|T221225]] hard reboot tools-worker-1001 again
* 17:27 arturo: [[phab:T221225|T221225]] hard reboot tools-worker-1001
* 17:12 arturo: [[phab:T221225|T221225]] depool & switch to sssd/sudo & reboot & repool tools-worker-1002
* 17:09 arturo: [[phab:T221225|T221225]] depool & switch to sssd/sudo & reboot & repool tools-worker-1001
* 17:08 arturo: [[phab:T221225|T221225]] switch to sssd/sudo in puppet prefix for tools-worker
* 13:04 arturo: [[phab:T221225|T221225]] depool and rebooted tools-worker-1001 in preparation for sssd migration
* 12:39 arturo: [[phab:T221225|T221225]] disable puppet in all tools-worker nodes in preparation for sssd
* 12:32 arturo: drop the tools-bastion puppet prefix, unused
* 12:31 arturo: [[phab:T221225|T221225]] set sssd/sudo in the hiera config for the tools-checker prefix, and reboot tools-checker-03
* 12:27 arturo: [[phab:T221225|T221225]] set sssd/sudo in the hiera config for the tools-docker-registry prefix, and reboot tools-docker-registry-[03-04]
* 12:16 arturo: [[phab:T221225|T221225]] set sssd/sudo in the hiera config for the tools-sgebastion prefix, and reboot tools-sgebastion-07/08
* 11:26 arturo: merged change to the sudo module to allow sssd transition
 
=== 2019-05-27 ===
* 09:47 arturo: run `apt-get clean` to wipe 4GB of unused .deb packages, usage on / (root) was > 90% (on tools-sgebastion-08)
* 09:47 arturo: run `apt-get clean` to wipe 4GB of unused .deb packages, usage on / (root) was > 90%
 
=== 2019-05-21 ===
* 12:35 arturo: [[phab:T223992|T223992]] rebooting tools-redis-1002
 
=== 2019-05-20 ===
* 11:25 arturo: [[phab:T223332|T223332]] enable puppet agent in tools-k8s-master and tools-docker-registry nodes and deploy new SSL cert
* 10:53 arturo: [[phab:T223332|T223332]] disable puppet agent in tools-k8s-master and tools-docker-registry nodes
 
=== 2019-05-18 ===
* 11:13 chicocvenancio: PAWS update helm chart to point to new singleuser image ([[phab:T217908|T217908]])
* 09:06 bd808: Rebuilding all stretch docker images to pick up toollabs-webservice 0.45
 
=== 2019-05-17 ===
* 17:36 bd808: Rebuilding all docker images to pick up toollabs-webservice 0.45
* 17:35 bd808: Deployed toollabs-webservice 0.45 (python 3.5 and nodejs 10 containers)
 
=== 2019-05-16 ===
* 11:22 chicocvenancio: PAWS: restart hub to get new configured announcement
* 11:05 chicocvenancio: PAWS: change configmap to reference WMHACK 2019 as busiest time
 
=== 2019-05-15 ===
* 16:20 arturo: [[phab:T223148|T223148]] repool both tools-sgeexec-0921 and -0929
* 15:32 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0921 and move to cloudvirt1014
* 15:32 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0920 and move to cloudvirt1014
* 12:29 arturo: [[phab:T223148|T223148]] repool both tools-sgeexec-09[37,39]
* 12:13 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0937 and move to cloudvirt1008
* 12:13 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0939 and move to cloudvirt1007
* 11:34 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0940
* 11:20 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0940 and move to cloudvirt1006
* 11:11 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0941
* 10:46 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0941 and move to cloudvirt1005
* 09:44 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0901
* 09:00 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0901 and reallocate to cloudvirt1004
 
=== 2019-05-14 ===
* 17:12 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0920
* 16:37 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0920 and reallocate to cloudvirt1003
* 16:36 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0911
* 15:56 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0911 and reallocate to cloudvirt1003
* 15:52 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0909
* 15:24 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0909 and reallocate to cloudvirt1002
* 15:24 arturo: [[phab:T223148|T223148]] last SAL entry is bogus, please ignore (depool tools-worker-1009)
* 15:23 arturo: [[phab:T223148|T223148]] depool tools-worker-1009
* 15:13 arturo: [[phab:T223148|T223148]] repool tools-worker-1023
* 13:16 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0942
* 13:03 arturo: [[phab:T223148|T223148]] repool tools-sgewebgrid-generic-0904
* 12:58 arturo: [[phab:T223148|T223148]] reallocating tools-worker-1023 to cloudvirt1001
* 12:56 arturo: [[phab:T223148|T223148]] depool tools-worker-1023
* 12:52 arturo: [[phab:T223148|T223148]] reallocating tools-sgeexec-0942 to cloudvirt1001
* 12:50 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0942
* 12:49 arturo: [[phab:T223148|T223148]] reallocating tools-sgewebgrid-generic-0904 to cloudvirt1001
* 12:43 arturo: [[phab:T223148|T223148]] depool tools-sgewebgrid-generic-0904
 
=== 2019-05-13 ===
* 08:15 zhuyifei1999_: `truncate -s 0 /var/log/exim4/paniclog` on tools-sgecron-01.tools.eqiad.wmflabs & tools-sgewebgrid-lighttpd-0921.tools.eqiad.wmflabs
 
=== 2019-05-07 ===
* 14:38 arturo: [[phab:T222718|T222718]] uncordon tools-worker-1019, I couldn't find a reason for it to be cordoned
* 14:31 arturo: [[phab:T222718|T222718]] reboot tools-worker-1009 and 1022 after being drained
* 14:28 arturo: k8s drain tools-worker-1009 and 1022
* 11:46 arturo: [[phab:T219362|T219362]] enable puppet in tools-redis servers and use the new puppet role
* 11:33 arturo: [[phab:T219362|T219362]] disable puppet in tools-redis servers for puppet code cleanup
* 11:12 arturo: [[phab:T219362|T219362]] drop the `tools-services` puppet prefix (we are actually using `tools-sgeservices`)
* 11:10 arturo: [[phab:T219362|T219362]] enable puppet in tools-static servers and use new puppet role
* 11:01 arturo: [[phab:T219362|T219362]] disable puppet in tools-static servers for puppet code cleanup
* 10:16 arturo: [[phab:T219362|T219362]] drop the `tools-webgrid-lighttpd` puppet prefix
* 10:14 arturo: [[phab:T219362|T219362]] drop the `tools-webgrid-generic` puppet prefix
* 10:06 arturo: [[phab:T219362|T219362]] drop the `tools-exec-1` puppet prefix
 
=== 2019-05-06 ===
* 11:34 arturo: [[phab:T221225|T221225]] reenable puppet
* 10:53 arturo: [[phab:T221225|T221225]] disable puppet in all toolforge servers for testing sssd patch (puppetmaster livehack)
 
=== 2019-05-03 ===
* 09:43 arturo: fixed puppet in tools-puppetdb-01 too
* 09:39 arturo: puppet should be now fine across toolforge (except tools-puppetdb-01 which is WIP I think)
* 09:37 arturo: fix puppet in tools-elastic-03, archived jessie repos, weird rsyslog-kafka package situation
* 09:33 arturo: fix puppet in tools-elastic-02, archived jessie repos, weird rsyslog-kafka package situation
* 09:24 arturo: fix puppet in tools-elastic-01, archived jessie repos, weird rsyslog-kafka package situation
* 09:18 arturo: solve a weird apt situation in tools-puppetmaster-01 regarding the rsyslog-kafka package (puppet agent was failing)
* 09:16 arturo: solve a weird apt situation in tools-worker-1028 regarding the rsyslog-kafka package
 
=== 2019-04-30 ===
* 12:50 arturo: enable puppet in all servers [[phab:T221225|T221225]]
* 12:45 arturo: adding `sudo_flavor: sudo` hiera config to all puppet prefixes with sssd ([[phab:T221225|T221225]])
* 12:45 arturo: adding `sudo_flavor: sudo` hiera config to all puppet prefixes with sssd
* 11:07 arturo: [[phab:T221225|T221225]] disable puppet in toolforge
* 10:56 arturo: [[phab:T221225|T221225]] create tools-sgebastion-0test for more sssd tests
 
=== 2019-04-29 ===
* 11:22 arturo: [[phab:T221225|T221225]] re-enable puppet agent in all toolforge servers
* 10:27 arturo: [[phab:T221225|T221225]] reboot tool-sgebastion-09 for testing sssd
* 10:21 arturo: disable puppet in all servers to livehack tools-puppetmaster-01 to test [[phab:T221225|T221225]]
* 08:29 arturo: cleanup disk in tools-sgebastion-09, was full of debug logs and unused apt packages
 
=== 2019-04-26 ===
* 12:20 andrewbogott: rescheduling every pod everywhere
* 12:18 andrewbogott: rescheduling all pods on tools-worker-1023.tools.eqiad.wmflabs
 
=== 2019-04-25 ===
* 12:49 arturo: [[phab:T221225|T221225]] using `profile::ldap::client::labs::client_stack: sssd` in horizon for tools-sgebastion-09 (testing)
* 11:43 arturo: [[phab:T221793|T221793]] removing prometheus crontab and letting puppet agent re-create it again to resolve staleness
 
=== 2019-04-24 ===
* 12:54 arturo: puppet broken, fixing right now
* 09:18 arturo: [[phab:T221225|T221225]] reallocating tools-sgebastion-09 to cloudvirt1008
 
=== 2019-04-23 ===
* 15:26 arturo: [[phab:T221225|T221225]] rebooting tools-sgebastion-08 to cleanup sssd
* 15:19 arturo: [[phab:T221225|T221225]] creating tools-sgebastion-09 for testing sssd stuff
* 13:06 arturo: [[phab:T221225|T221225]] use `profile::ldap::client::labs::client_stack: classic` in the puppet bastion prefix, again. Rollback again.
* 12:57 arturo: [[phab:T221225|T221225]] use `profile::ldap::client::labs::client_stack: sssd` in the puppet bastion prefix, try again with sssd in the bastions, reboot them
* 10:28 arturo: [[phab:T221225|T221225]] use `profile::ldap::client::labs::client_stack: classic` in the puppet bastion prefix
* 10:27 arturo: [[phab:T221225|T221225]] rebooting tools-sgebastion-07 to clean sssd configuration
* 10:16 arturo: [[phab:T221225|T221225]] disable puppet in tools-sgebastion-08 for sssd testing
* 09:49 arturo: [[phab:T221225|T221225]] run puppet agent in the bastions and reboot them with sssd
* 09:43 arturo: [[phab:T221225|T221225]] use `profile::ldap::client::labs::client_stack: sssd` in the puppet bastion prefix
* 09:41 arturo: [[phab:T221225|T221225]] disable puppet agent in the bastions
 
=== 2019-04-17 ===
* 12:09 arturo: [[phab:T221225|T221225]] rebooting bastions to clean sssd. We are back to nscd/nslcd until we figure out what's wrong here
* 11:59 arturo: [[phab:T221205|T221205]] sssd was deployed successfully into all webgrid nodes
* 11:39 arturo: deploy sssd to tools-sge-services-03/04 (includes reboot)
* 11:31 arturo: reboot bastions for sssd deployment
* 11:30 arturo: deploy sssd to bastions
* 11:24 arturo: disable puppet in bastions to deploy sssd
* 09:52 arturo: [[phab:T221205|T221205]] tools-sgewebgrid-lighttpd-0915 requires some manual intervention because issues in the dpkg database prevent deleting nscd/nslcd packages
* 09:45 arturo: [[phab:T221205|T221205]] tools-sgewebgrid-lighttpd-0913 requires some manual intervention because unconfigured packages prevent a clean puppet agent run
* 09:12 arturo: [[phab:T221205|T221205]] start deploying sssd to sgewebgrid nodes
* 09:00 arturo: [[phab:T221205|T221205]] add `profile::ldap::client::labs::client_stack: sssd` in horizon for the puppet prefixes `tools-sgewebgrid-lighttpd` and `tools-sgewebgrid-generic`
* 08:57 arturo: [[phab:T221205|T221205]] disable puppet in all tools-sgewebgrid-* nodes
 
=== 2019-04-16 ===
* 20:49 chicocvenancio: change paws announcement in configmap hub-config back to a welcome message
* 17:15 chicocvenancio: add paws outage announcement in configmap hub-config
* 17:00 andrewbogott: moving tools-k8s-master-01 to eqiad1-r
 
=== 2019-04-15 ===
* 18:50 andrewbogott: moving tools-elastic-01 to cloudvirt1008 to make spreadcheck happy
* 15:01 andrewbogott: moving tools-redis-1001 to eqiad1-r
 
=== 2019-04-14 ===
* 16:23 andrewbogott: moved all tools-worker nodes off of cloudvirt1015 and uncordoned them
 
=== 2019-04-13 ===
* 21:09 bstorm_: Moving tools-prometheus-01 to cloudvirt1009 and tools-clushmaster-02 to cloudvirt1008 for [[phab:T220853|T220853]]
* 20:36 bstorm_: moving tools-elastic-02 to cloudvirt1009 for [[phab:T220853|T220853]]
* 19:58 bstorm_: started migrating tools-k8s-etcd-03 to cloudvirt1012 [[phab:T220853|T220853]]
* 19:51 bstorm_: started migrating tools-flannel-etcd-02 to cloudvirt1013 [[phab:T220853|T220853]]
 
=== 2019-04-11 ===
* 22:38 andrewbogott: moving tools-paws-worker-1005 to cloudvirt1009 to make spreadcheck happier
* 21:49 bd808: Re-enabled puppet on tools-elastic-02 and forced puppet run
* 21:44 andrewbogott: moving tools-mail-02 to eqiad1-r
* 20:56 andrewbogott: shutting down tools-logs-02 — seems unused
* 19:44 bd808: Disabled puppet on tools-elastic-02 and set to 1-node master
* 19:34 andrewbogott: moving tools-puppetmaster-01 to eqiad1-r
* 15:40 andrewbogott: moving tools-redis-1002  to eqiad1-r
* 13:52 andrewbogott: moving tools-prometheus-01 and tools-elastic-01 to eqiad1-r
* 12:01 arturo: [[phab:T151704|T151704]] deploying oidentd
* 11:54 arturo: disable puppet in all hosts to deploy oidentd
* 02:33 andrewbogott: tools-paws-worker-1005,  tools-paws-worker-1006 to eqiad1-r
* 00:03 andrewbogott: tools-paws-worker-1002,  tools-paws-worker-1003 to eqiad1-r
 
=== 2019-04-10 ===
* 23:58 andrewbogott: moving tools-clushmaster-02, tools-elastic-03 and tools-paws-worker-1001 to eqiad1-r
* 18:52 bstorm_: depooled and rebooted tools-sgeexec-0929 because systemd was in a weird state
* 18:46 bstorm_: depooled and rebooted tools-sgewebgrid-lighttpd-0913 because high load was caused by ancient lsof processes
* 14:49 bstorm_: cleared E state from 5 queues
* 13:06 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0906
* 12:31 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0926
* 12:27 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0925
* 12:06 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0901
* 11:55 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0924
* 11:47 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0921
* 11:23 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0940
* 11:03 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0928
* 10:49 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0923
* 10:43 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0915
* 10:27 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0935
* 10:19 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0914
* 10:02 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0907
* 09:41 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0918
* 09:27 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0932
* 09:26 arturo: [[phab:T218216|T218216]] hard reboot tools-sgeexec-0932
* 09:04 arturo: [[phab:T218216|T218216]] add `profile::ldap::client::labs::client_stack: sssd` to prefix puppet for sge-exec nodes
* 09:03 arturo: [[phab:T218216|T218216]] do a controlled rollover of sssd, depooling sgeexec nodes, reboot and repool
* 08:39 arturo: [[phab:T218216|T218216]] disable puppet in all tools-sgeexec-XXXX nodes for controlled sssd rollout
* 00:32 andrewbogott: migrating tools-worker-1022, 1023, 1025, 1026 to eqiad1-r
 
=== 2019-04-09 ===
* 22:04 bstorm_: added the new region on port 80 to the elasticsearch security group for stashbot
* 21:16 andrewbogott: moving tools-flannel-etcd-03 to eqiad1-r
* 20:43 andrewbogott: moving tools-worker-1018, 1019, 1020, 1021 to eqiad1-r
* 20:04 andrewbogott: moving tools-k8s-etcd-03 to eqiad1-r
* 19:54 andrewbogott: moving tools-flannel-etcd-02 to eqiad1-r
* 18:36 andrewbogott: moving tools-worker-1016, tools-worker-1017 to eqiad1-r
* 18:05 andrewbogott: migrating tools-k8s-etcd-02 to eqiad1-r
* 18:00 andrewbogott: migrating tools-flannel-etcd-01 to eqiad1-r
* 17:36 andrewbogott: moving tools-worker-1014, tools-worker-1015 to eqiad1-r
* 17:05 andrewbogott: migrating  tools-k8s-etcd-01 to eqiad1-r
* 15:56 andrewbogott: moving tools-worker-1012, tools-worker-1013 to eqiad1-r
* 14:56 bstorm_: cleared 4 queues on gridengine of E status (ldap again)
* 14:07 andrewbogott: moving tools-worker-1010, tools-worker-1011, tools-worker-1001 to eqiad1-r
* 03:48 andrewbogott: moving tools-worker-1008 and tools-worker-1009 to eqiad1-r
* 02:07 bstorm_: reloaded ferm on tools-flannel-etcd-0[1-3] to get  the k8s node moves to register
 
=== 2019-04-08 ===
* 22:36 andrewbogott: moving tools-worker-1006 and tools-worker-1007 to eqiad1-r
* 20:03 andrewbogott: moving tools-worker-1003 and tools-worker-1004 to eqiad1-r
 
=== 2019-04-07 ===
* 16:54 zhuyifei1999_: tools-sgeexec-0928 unresponsive since around 22 UTC. No data on Graphite. Can't ssh in even as root. Hard rebooting via Horizon
* 01:06 bstorm_: cleared E state from 6 queues
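The recurring "cleared E state" entries refer to gridengine queue instances that enter an error state after a failed job and stop accepting work until cleared. A hypothetical sketch of the clearing step; the queue@host name is an example and the command is printed, not executed:

```shell
# Sketch of clearing a gridengine queue-instance error ("E") state, as
# in the "cleared E state" entries above. The queue@host name is an
# example; the qmod command is printed, not executed.
clear_error_state() {
  local queue_instance="$1"
  echo "sudo qmod -c '${queue_instance}'"
}

clear_error_state 'continuous@tools-sgeexec-0915.tools.eqiad.wmflabs'
```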
 
=== 2019-04-05 ===
* 15:44 bstorm_: cleared E state from two exec queues
 
=== 2019-04-04 ===
* 21:21 bd808: Uncordoned tools-worker-1013.tools.eqiad.wmflabs after reboot and forced puppet run
* 20:53 bd808: Rebooting tools-worker-1013
* 20:50 bd808: Draining tools-worker-1013.tools.eqiad.wmflabs
* 20:29 bd808: Released floating IP and deleted instance tools-checker-01 via Horizon
* 20:28 bd808: Shutdown tools-checker-01 via Horizon
* 20:17 bd808: Repooled tools-webgrid-lighttpd-0906 after reboot, apt-get dist-upgrade, and forced puppet run
* 20:13 bd808: Hard reboot of tools-sgewebgrid-lighttpd-0906 via Horizon
* 20:09 bd808: Repooled tools-webgrid-lighttpd-0912 after reboot, apt-get dist-upgrade, and forced puppet run
* 20:05 bd808: Depooled and rebooted tools-sgewebgrid-lighttpd-0912
* 20:05 bstorm_: rebooted tools-webgrid-lighttpd-0912
* 20:03 bstorm_: depooled  tools-webgrid-lighttpd-0912
* 19:59 bstorm_: depooling and rebooting tools-webgrid-lighttpd-0906
* 19:43 bd808: Repooled tools-sgewebgrid-lighttpd-0926 after reboot, apt-get dist-upgrade, and forced puppet run
* 19:36 bd808: Hard reboot of tools-sgewebgrid-lighttpd-0926 via Horizon
* 19:30 bd808: Rebooting tools-sgewebgrid-lighttpd-0926
* 19:28 bd808: Depooled tools-sgewebgrid-lighttpd-0926
* 19:13 bstorm_: cleared E state from 7 queues
* 17:32 andrewbogott: moving tools-static-12 to cloudvirt1023 to keep the two static nodes off the same host
 
=== 2019-04-03 ===
* 11:22 arturo: puppet breakage due to me introducing openstack-mitaka-jessie repo by mistake. Cleaning up already
 
=== 2019-04-02 ===
* 12:11 arturo: icinga downtime toolschecker for 1 month [[phab:T219243|T219243]]
* 03:55 bd808: Added etcd service group to tools-k8s-etcd-* ([[phab:T219243|T219243]])
 
=== 2019-04-01 ===
* 19:44 bd808: Deleted tools-checker-02 via Horizon ([[phab:T219243|T219243]])
* 19:43 bd808: Shutdown tools-checker-02 via Horizon ([[phab:T219243|T219243]])
* 16:53 bstorm_: cleared E state on 6 grid queues
* 14:54 andrewbogott: moving tools-static-12 to eqiad1-r (for real this time maybe)
 
=== 2019-03-29 ===
* 21:13 bstorm_: depooled tools-sgewebgrid-generic-0903 because of some stuck jobs and odd load characteristics
* 21:09 bd808: Updated cherry-pick of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/500095/ on tools-puppetmaster-01 ([[phab:T219243|T219243]])
* 20:48 bd808: Using root console to fix broken initial puppet run on tools-checker-03.
* 20:32 bd808: Creating tools-checker-03 with role::wmcs::toolforge::checker ([[phab:T219243|T219243]])
* 20:24 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/500095/ to tools-puppetmaster-01 for testing ([[phab:T219243|T219243]])
* 20:22 bd808: Disabled puppet on tools-checker-0{1,2} to make testing new role::wmcs::toolforge::checker easier ([[phab:T219243|T219243]])
* 17:25 bd808: Cleared the "Eqw" state of 44 jobs with `qstat -u '*' {{!}} grep Eqw {{!}} awk '{print $1;}' {{!}} xargs -L1 sudo qmod -cj` on tools-sgegrid-master
* 17:16 andrewbogott: aborted move of tools-static-12; will wait until tomorrow and give DNS caches more time to update
* 17:11 bd808: Restarted nginx on tools-static-13
* 16:53 andrewbogott: moving tools-static-12 to eqiad1-r
* 16:49 bstorm_: cleared E state from 21 queues
* 14:34 andrewbogott: moving tools-static.wmflabs.org to point to tools-static-13 in eqiad1-r
* 13:54 andrewbogott: moving tools-static-13 to eqiad1-r
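The Eqw-clearing one-liner logged above at 17:25 extracts the IDs of jobs stuck in the error-wait state from `qstat` output and feeds each to `qmod -cj`. The text-processing half can be exercised on sample data; the qstat lines below are fabricated for illustration:

```shell
# The job-ID extraction half of the Eqw-clearing pipeline logged at
# 17:25. The qstat output below is fabricated sample data.
sample_qstat_output() {
  cat <<'EOF'
1234567 0.30001 job-a      tools.foo    r     03/29/2019 12:00:01
1234568 0.30001 job-b      tools.bar    Eqw   03/29/2019 12:00:02
1234569 0.30001 job-c      tools.baz    Eqw   03/29/2019 12:00:03
EOF
}

# grep/awk stage: keep only Eqw jobs and print their job IDs
eqw_job_ids() {
  sample_qstat_output | grep Eqw | awk '{print $1;}'
}

eqw_job_ids   # each ID would then be passed to: sudo qmod -cj <id>
```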
 
=== 2019-03-28 ===
* 01:00 bstorm_: cleared error states from two queues
* 00:23 bstorm_: [[phab:T216060|T216060]] created tools-sgewebgrid-generic-0901...again!
 
=== 2019-03-27 ===
* 23:35 bstorm_: rebooted tools-paws-master-01 for NFS issue [[phab:T219460|T219460]]
* 14:45 bstorm_: cleared several "E" state queues
* 12:26 gtirloni: truncated exim4/paniclog on tools-sgewebgrid-lighttpd-0921
* 12:25 gtirloni: truncated exim4/paniclog on tools-sgecron-01
* 12:15 arturo: [[phab:T218126|T218126]] `aborrero@tools-sgegrid-master:~$ sudo qmod -d 'test@tools-sssd-sgeexec-test-2'` (and 1)
 
=== 2019-03-26 ===
* 22:00 gtirloni: downtimed toolschecker
* 17:31 arturo: [[phab:T218126|T218126]] create VM instances tools-sssd-sgeexec-test-[12]
* 00:26 bd808: Deleted DNS record for login-trusty.tools.wmflabs.org
* 00:26 bd808: Deleted DNS record for trusty-dev.tools.wmflabs.org
 
=== 2019-03-25 ===
* 21:21 bd808: All Trusty grid engine hosts shutdown and deleted ([[phab:T217152|T217152]])
* 21:19 bd808: Deleted tools-grid-{master,shadow} ([[phab:T217152|T217152]])
* 21:18 bd808: Deleted tools-webgrid-lighttpd-14*  ([[phab:T217152|T217152]])
* 20:55 bstorm_: reboot tools-sgewebgrid-generic-0903 to clear up some issues
* 20:52 bstorm_: rebooting tools-package-builder-02 due to lots of hung /usr/bin/lsof +c 15 -nXd DEL processes
* 20:51 bd808: Deleted tools-webgrid-generic-14* ([[phab:T217152|T217152]])
* 20:49 bd808: Deleted tools-exec-143* ([[phab:T217152|T217152]])
* 20:49 bd808: Deleted tools-exec-142* ([[phab:T217152|T217152]])
* 20:48 bd808: Deleted tools-exec-141* ([[phab:T217152|T217152]])
* 20:47 bd808: Deleted tools-exec-140* ([[phab:T217152|T217152]])
* 20:43 bd808: Deleted  tools-cron-01 ([[phab:T217152|T217152]])
* 20:42 bd808: Deleted tools-bastion-0{2,3} ([[phab:T217152|T217152]])
* 20:35 bstorm_: rebooted tools-worker-1025 and tools-worker-1021
* 19:59 bd808: Shutdown tools-exec-143* ([[phab:T217152|T217152]])
* 19:51 bd808: Shutdown tools-exec-142* ([[phab:T217152|T217152]])
* 19:47 bstorm_: depooling tools-worker-1025.tools.eqiad.wmflabs because it's not responding and showing insane load
* 19:33 bd808: Shutdown tools-exec-141* ([[phab:T217152|T217152]])
* 19:31 bd808: Shutdown tools-bastion-0{2,3} ([[phab:T217152|T217152]])
* 19:19 bd808: Shutdown tools-exec-140* ([[phab:T217152|T217152]])
* 19:12 bd808: Shutdown tools-webgrid-generic-14* ([[phab:T217152|T217152]])
* 19:11 bd808: Shutdown tools-webgrid-lighttpd-14* ([[phab:T217152|T217152]])
* 18:53 bd808: Shutdown tools-grid-master ([[phab:T217152|T217152]])
* 18:53 bd808: Shutdown tools-grid-shadow ([[phab:T217152|T217152]])
* 18:49 bd808: All jobs still running on the Trusty job grid force deleted.
* 18:46 bd808: All Trusty job grid queues marked as disabled. This should stop all new Trusty job submissions.
* 18:43 arturo: icinga downtime tools-checker for 24h due to trusty grid shutdown
* 18:39 bd808: Shutdown tools-cron-01.tools.eqiad.wmflabs ([[phab:T217152|T217152]])
* 15:27 bd808: Copied all crontab files still on tools-cron-01 to each tool's $HOME/crontab.trusty.save
* 02:34 bd808: Disassociated floating IPs and deleted shutdown Trusty grid nodes tools-exec-14{33,34,35,36,37,38,39,40,41,42} ([[phab:T217152|T217152]])
* 02:26 bd808: Deleted shutdown Trusty grid nodes tools-webgrid-lighttpd-14{20,21,22,24,25,26,27,28} ([[phab:T217152|T217152]])
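The Trusty shutdown entries above follow the usual gridengine decommission pattern: disable all queues so nothing new is scheduled, force-delete any jobs still running, then remove the exec hosts from the grid configuration before deleting the VMs in Horizon. A minimal sketch of that sequence, assuming admin rights on the grid master (hostnames and the loop list are illustrative, not taken from the log):

```shell
# Sketch of the gridengine decommission sequence logged above (2019-03-25).
# Run on the grid master as an admin user; hostnames are illustrative.

# 1. Disable every queue so no new jobs can be submitted or scheduled.
qconf -sql | xargs -I{} sudo qmod -d '{}'

# 2. Force-delete all jobs still running on the grid.
sudo qdel -f -u '*'

# 3. Remove each exec host from the grid configuration, then shut the
#    instance down (and later delete it) via Horizon.
for host in tools-exec-1401 tools-exec-1402; do
    sudo qconf -de "${host}.tools.eqiad.wmflabs"   # delete exec host entry
done
```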
 
=== 2019-03-22 ===
* 17:16 andrewbogott: switching all instances to use ldap-ro.eqiad.wikimedia.org as both primary and secondary ldap server
* 16:12 bstorm_: cleared errored out stretch grid queues
* 15:56 bd808: Rebooting tools-static-12
* 03:09 bstorm_: [[phab:T217280|T217280]] depooled and rebooted 15 other nodes.  Entire stretch grid is in a good state for now.
* 02:31 bstorm_: [[phab:T217280|T217280]] depooled and rebooted tools-sgeexec-0908 since it had no jobs but very high load from an NFS event that was no longer happening
* 02:09 bstorm_: [[phab:T217280|T217280]] depooled and rebooted tools-sgewebgrid-lighttpd-0924
* 00:39 bstorm_: [[phab:T217280|T217280]] depooled and rebooted tools-sgewebgrid-lighttpd-0902
 
=== 2019-03-21 ===
* 23:28 bstorm_: [[phab:T217280|T217280]] depooled, reloaded and repooled tools-sgeexec-0938
* 21:53 bstorm_: [[phab:T217280|T217280]] rebooted and cleared "unknown status" from tools-sgeexec-0914 after depooling
* 21:51 bstorm_: [[phab:T217280|T217280]] rebooted and cleared "unknown status" from tools-sgeexec-0909 after depooling
* 21:26 bstorm_: [[phab:T217280|T217280]] cleared error state from a couple queues and rebooted tools-sgeexec-0901 and 04 to clear other issues related
 
=== 2019-03-18 ===
* 18:43 bd808: Rebooting tools-static-12
* 18:42 chicocvenancio: PAWS: 3 nodes still in not ready state, `worker-10(01|07|10)`, all else working
* 18:41 chicocvenancio: PAWS: deleting pods stuck in Unknown state with ` --grace-period=0 --force`
* 18:40 andrewbogott: rebooting tools-static-13 in hopes of fixing some nfs mounts
* 18:25 chicocvenancio: removing postStart hook for PWB update and restarting hub while gerrit.wikimedia.org is down
 
=== 2019-03-17 ===
* 23:41 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/497210/ as a quick fix for [[phab:T218494|T218494]]
* 22:30 bd808: Investigating strange system state on tools-bastion-03.
* 17:48 bstorm_: [[phab:T218514|T218514]] rebooting tools-worker-1009 and 1012
* 17:46 bstorm_: depooling tools-worker-1009 and tools-worker-1012 for [[phab:T218514|T218514]]
* 17:13 bstorm_: depooled and rebooting tools-worker-1018
* 15:09 andrewbogott: running 'killall dpkg' and 'dpkg --configure -a' on all nodes to try to work around a race with initramfs
 
=== 2019-03-16 ===
* 22:34 bstorm_: clearing errored out queues again
 
=== 2019-03-15 ===
* 21:08 bstorm_: cleared error state on several queues [[phab:T217280|T217280]]
* 15:58 gtirloni: rebooted tools-clushmaster-02
* 14:40 mutante: tools-sgebastion-07 - dpkg-reconfigure locales and adding Korean ko_KR.EUC-KR - [[phab:T130532|T130532]]
* 14:32 mutante: tools-sgebastion-07 - generating locales for user request in [[phab:T130532|T130532]]
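The recurring "cleared error state" entries in this log refer to gridengine queue instances stuck in the E state after LDAP or NFS hiccups; they are reset with `qmod -c`. A sketch of the usual procedure (the queue instance name is illustrative):

```shell
# Clearing gridengine queue error states, as in the entries above.
# Run on the grid master; the queue instance name is illustrative.

qstat -f -explain E     # list queue instances and show why each one errored

# Clear the error state on one specific queue instance...
sudo qmod -c 'webgrid-lighttpd@tools-sgewebgrid-lighttpd-0902'

# ...or clear it everywhere at once.
sudo qmod -c '*'
```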
 
=== 2019-03-14 ===
* 23:52 bd808: Disabled job queues and rescheduled continuous jobs away from tools-exec-14{21,22,23,24,25,26,27,28,29,30,31,32} ([[phab:T217152|T217152]])
* 23:28 bd808: Deleted tools-bastion-05 ([[phab:T217152|T217152]])
* 22:30 bd808: Removed obsolete submit hosts from Trusty grid config
* 22:20 bd808: Removed tools-webgrid-lighttpd-142{0,1,2,5} from the grid and shutdown instances via horizon ([[phab:T217152|T217152]])
* 22:10 bd808: Depooled tools-webgrid-lighttpd-142{0,1,2,5} ([[phab:T217152|T217152]])
* 21:55 bd808: Removed submit host flag from tools-bastion-05.tools.eqiad.wmflabs, removed floating ip, and shutdown instance via horizon ([[phab:T217152|T217152]])
* 21:48 bd808: Removed tools-exec-14{33,34,35,36,37,38,39,40,41,42} from the grid and shutdown instances via horizon ([[phab:T217152|T217152]])
* 21:38 gtirloni: rebooted tools-sgewebgrid-generic-0904 ([[phab:T218341|T218341]])
* 21:32 gtirloni: rebooted tools-exec-1020 ([[phab:T218341|T218341]])
* 21:23 gtirloni: rebooted tools-sgeexec-0919, tools-sgeexec-0934, tools-worker-1018 ([[phab:T218341|T218341]])
* 21:19 bd808: Killed jobs still running on tools-exec-14{33,34,35,36,37,38,39,40,41,42}.tools.eqiad.wmflabs 2 weeks after being depooled ([[phab:T217152|T217152]])
* 20:58 bd808: Repooled tools-sgeexec-0941 following reboot
* 20:57 bd808: Hard reboot of tools-sgeexec-0941 via horizon
* 20:54 bd808: Depooled and rebooted tools-sgeexec-0941.tools.eqiad.wmflabs
* 20:53 bd808: Repooled tools-sgeexec-0917 following reboot
* 20:52 bd808: Hard reboot of tools-sgeexec-0917 via horizon
* 20:47 bd808: Depooled and rebooted tools-sgeexec-0917
* 20:44 bd808: Repooled tools-sgeexec-0908 after reboot
* 20:36 bd808: depooled and rebooted tools-sgeexec-0908
* 19:08 gtirloni: rebooted tools-worker-1028 ([[phab:T218341|T218341]])
* 19:08 gtirloni: rebooted tools-sgewebgrid-lighttpd-0914 ([[phab:T218341|T218341]])
* 19:07 gtirloni: rebooted tools-sgewebgrid-lighttpd-0914
* 18:13 gtirloni: drained tools-worker-1028 for reboot (processes in D state)
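Several entries above use Toolforge's `exec-manage` helper to take a grid exec node out of rotation before rebooting it and return it afterwards. A sketch of that depool/reboot/repool cycle (hostname illustrative):

```shell
# Depool/reboot/repool cycle for a grid exec node, as logged above.
# exec-manage is Toolforge's node management helper; hostname illustrative.
node=tools-sgeexec-0908.tools.eqiad.wmflabs

sudo exec-manage depool "$node"   # stop scheduling new jobs onto the node
# reboot the instance itself (ssh + reboot, or a hard reboot via Horizon
# if the node is unresponsive, as in the 20:52/20:57 entries above)
sudo exec-manage repool "$node"   # return the node to service once it is back
```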
 
=== 2019-03-13 ===
* 23:30 bd808: Rebuilding stretch Kubernetes images
* 22:55 bd808: Rebuilding jessie Kubernetes images
* 17:11 bstorm_: specifically rebooted SGE cron server tools-sgecron-01
* 17:10 bstorm_: rebooted cron server
* 16:10 bd808: Updated DNS for dev.tools.wmflabs.org to point to Stretch secondary bastion. This was missed on 2019-03-07
* 12:33 arturo: reboot tools-sgebastion-08 ([[phab:T215154|T215154]])
* 12:17 arturo: reboot tools-sgebastion-07 ([[phab:T215154|T215154]])
* 11:53 arturo: enable puppet in tools-sgebastion-07 ([[phab:T215154|T215154]])
* 11:20 arturo: disable puppet in tools-sgebastion-07 for testing [[phab:T215154|T215154]]
* 05:07 bstorm_: re-enabled puppet for tools-sgebastion-07
* 04:59 bstorm_: disabled puppet for a little bit on tools-sgebastion-07
* 00:22 bd808: Raise web-memlimit for isbn tool to 6G for tomcat8 ([[phab:T217406|T217406]])
 
=== 2019-03-11 ===
* 15:53 bd808: Manually started `service gridengine-master` on tools-sgegrid-master after reboot ([[phab:T218038|T218038]])
* 15:47 bd808: Hard reboot of tools-sgegrid-master via Horizon UI ([[phab:T218038|T218038]])
* 15:42 bd808: Rebooting tools-sgegrid-master ([[phab:T218038|T218038]])
* 14:49 gtirloni: deleted tools-webgrid-lighttpd-1419
* 00:53 bd808: Re-enabled 13 queue instances that had been disabled by LDAP failures during job initialization ([[phab:T217280|T217280]])
 
=== 2019-03-10 ===
* 22:36 gtirloni: increased nscd group TTL from 60 to 300sec
 
=== 2019-03-08 ===
* 19:48 andrewbogott: repooling tools-exec-1430 and tools-sgeexec-0905 to compare ldap usage
* 19:21 andrewbogott: depooling tools-exec-1430 and tools-sgeexec-0905 to compare ldap usage
* 17:49 bd808: Re-enabled 4 queue instances that had been disabled by LDAP failures during job initialization ([[phab:T217280|T217280]])
* 00:30 bd808: DNS record created for trusty-dev.tools.wmflabs.org (Trusty secondary bastion)
 
=== 2019-03-07 ===
* 23:31 bd808: Updated DNS to point login.tools.wmflabs.org at 185.15.56.48 (Stretch bastion)
* 04:15 bd808: Killed 3 orphan processes on Trusty grid
* 04:01 bd808: Cleared error state on a large number of Stretch grid queues which had been disabled by LDAP and/or NFS hiccups ([[phab:T217280|T217280]])
* 00:49 zhuyifei1999_: clushed misctools 1.37 upgrade on @bastion,@cron,@bastion-stretch [[phab:T217406|T217406]]
* 00:38 zhuyifei1999_: published misctools 1.37 [[phab:T217406|T217406]]
* 00:34 zhuyifei1999_: begin building misctools 1.37 using debuild [[phab:T217406|T217406]]
 
=== 2019-03-06 ===
* 13:57 gtirloni: fixed SSH warnings in tools-clushmaster-02
 
=== 2019-03-04 ===
* 19:07 bstorm_: umounted /mnt/nfs/dumps-labstore1006.wikimedia.org for [[phab:T217473|T217473]]
* 14:05 gtirloni: rebooted tools-docker-registry-{03,04}, tools-puppetmaster-02 and tools-puppetdb-01 (load avg >45, not accessible)
 
=== 2019-03-03 ===
* 20:54 andrewbogott: cleaning out /tmp on tools-exec-1412
 
=== 2019-02-28 ===
* 19:36 zhuyifei1999_: built with debuild instead [[phab:T217297|T217297]]
* 19:08 zhuyifei1999_: test failures during build, see ticket
* 18:55 zhuyifei1999_: start building jobutils 1.36 [[phab:T217297|T217297]]
 
=== 2019-02-27 ===
* 20:41 andrewbogott: restarting nginx on tools-checker-01
* 19:34 andrewbogott: uncordoning tools-worker-1028, 1002 and 1005, now in eqiad1-r
* 16:20 zhuyifei1999_: regenerating k8s creds for tools.whichsub & tools.permission-denied-test [[phab:T176027|T176027]]
* 15:40 andrewbogott: moving tools-worker-1002, 1005, 1028 to eqiad1-r
* 01:36 bd808: Shutdown tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs via horizon ([[phab:T217152|T217152]])
* 01:29 bd808: Depooled tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs ([[phab:T217152|T217152]])
* 01:26 bd808: Disabled job queues and rescheduled continuous jobs away from tools-exec-14{33,34,35,36,37,38,39,40,41,42}.tools.eqiad.wmflabs ([[phab:T217152|T217152]])
 
=== 2019-02-26 ===
* 20:51 gtirloni: reboot tools-package-builder-02 (unresponsive)
* 19:01 gtirloni: pushed updated docker images
* 17:30 andrewbogott: draining and cordoning tools-worker-1027 for a region migration test
 
=== 2019-02-25 ===
* 23:20 bstorm_: Depooled tools-sgeexec-0914 and tools-sgeexec-0915 for [[phab:T217066|T217066]]
* 21:41 andrewbogott: depooling tools-sgeexec-0911, tools-sgeexec-0912, tools-sgeexec-0913 to test [[phab:T217066|T217066]]
* 13:11 chicocvenancio: PAWS:  Stopped AABot notebook pod [[phab:T217010|T217010]]
* 12:54 chicocvenancio: PAWS:  Restarted Criscod notebook pod [[phab:T217010|T217010]]
* 12:21 chicocvenancio: PAWS: killed proxy and hub pods to attempt to get it to see routes to open notebooks servers to no avail. Restarted BernhardHumm's notebook pod [[phab:T217010|T217010]]
* 09:50 gtirloni: rebooted tools-sgeexec-09{16,22,40} ([[phab:T216988|T216988]])
* 09:41 gtirloni: rebooted tools-sgeexec-09{16,22,40}
* 08:37 zhuyifei1999_: uncordon tools-worker-1015.tools.eqiad.wmflabs
* 08:34 legoktm: hard rebooted tools-worker-1015 via horizon
* 07:48 zhuyifei1999_: systemd stuck in D state. :(
* 07:44 zhuyifei1999_: I saved dmesg and process list to a few files in /root if that helps debugging
* 07:43 zhuyifei1999_: D states are not responding to SIGKILL. Will reboot.
* 07:37 zhuyifei1999_: tools-worker-1015.tools.eqiad.wmflabs having severe NFS issues (all NFS accessing processes are stuck in D state). Draining.
 
=== 2019-02-22 ===
* 16:29 gtirloni: upgraded and rebooted tools-puppetmaster-01 (new kernel)
* 15:59 gtirloni: started tools-puppetmaster-01 (new size: m1.large)
* 15:13 gtirloni: shutdown tools-puppetmaster-01
 
=== 2019-02-21 ===
* 09:59 gtirloni: upgraded all packages in all stretch nodes
* 00:12 zhuyifei1999_: forcing puppet run on tools-k8s-master-01
* 00:08 zhuyifei1999_: running /usr/local/bin/git-sync-upstream on tools-puppetmaster-01 to speed puppet changes up
 
=== 2019-02-20 ===
* 23:30 zhuyifei1999_: begin rebuilding all docker images [[phab:T178601|T178601]] [[phab:T193646|T193646]] [[phab:T215683|T215683]]
* 23:25 zhuyifei1999_: upgraded toollabs-webservice on tools-bastion-02 to 0.44 (newly-built version)
* 23:19 zhuyifei1999_: this was built for stretch. hopefully it works for all distros
* 23:17 zhuyifei1999_: begin build new tools-webservice package [[phab:T178601|T178601]] [[phab:T193646|T193646]] [[phab:T215683|T215683]]
* 21:57 andrewbogott: moving tools-static-13  to a new virt host
* 21:34 andrewbogott: moving the tools-static IP from tools-static-13 to tools-static-12
* 19:17 andrewbogott: moving tools-bastion-02 to labvirt1004
* 16:56 andrewbogott: moving tools-paws-worker-1003
* 15:53 andrewbogott: moving tools-worker-1017, tools-worker-1027, tools-worker-1028
* 15:04 andrewbogott: moving tools-exec-1413 and tools-exec-1442
 
=== 2019-02-19 ===
* 01:49 bd808: Revoked Toolforge project membership for user DannyS712 ([[phab:T215092|T215092]])
 
=== 2019-02-18 ===
* 20:45 gtirloni: upgraded and rebooted tools-sgebastion-07 (login-stretch)
* 20:22 gtirloni: enabled toolsdb monitoring in Icinga
* 20:03 gtirloni: pointed tools-db.eqiad.wmflabs to 172.16.7.153
* 18:50 chicocvenancio: moving paws back to toolsdb [[phab:T216208|T216208]]
* 13:47 arturo: rebooting tools-sgebastion-07 to try fixing general slowness
 
=== 2019-02-17 ===
* 22:23 zhuyifei1999_: uncordon tools-worker-1010.tools.eqiad.wmflabs
* 22:11 zhuyifei1999_: rebooting tools-worker-1010.tools.eqiad.wmflabs
* 22:10 zhuyifei1999_: draining tools-worker-1010.tools.eqiad.wmflabs, `docker ps` is hanging. no idea why. also other weirdness like ContainerCreating forever
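The drain/reboot/uncordon cycle used on the Kubernetes workers above (and in several other entries in this log) is the standard kubectl procedure; a sketch, with the node name illustrative and the flags as used by kubectl versions of this era:

```shell
# Drain a misbehaving Kubernetes worker, reboot it, and return it to the pool.
node=tools-worker-1010.tools.eqiad.wmflabs

# Evict pods and cordon the node so nothing new is scheduled on it.
kubectl drain "$node" --ignore-daemonsets --delete-local-data

# Reboot the instance (ssh + reboot, or a hard reboot via Horizon if hung).

kubectl uncordon "$node"   # allow the scheduler to place pods on it again
```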
 
=== 2019-02-16 ===
* 05:00 zhuyifei1999_: fixed by restarting flannel. another puppet run simply started kubelet
* 04:58 zhuyifei1999_: puppet logs: https://phabricator.wikimedia.org/P8097. Docker is failing with 'Failed to load environment files: No such file or directory'
* 04:52 zhuyifei1999_: copied the resolv.conf from tools-k8s-master-01, removing secondary DNS to make sure puppet fixes that, and starting puppet
* 04:48 zhuyifei1999_: that host's resolv.conf is badly broken https://phabricator.wikimedia.org/P8096. The last Puppet run was at Thu Feb 14 15:21:09 UTC 2019 (2247 minutes ago)
* 04:44 zhuyifei1999_: puppet is also failing bad here 'Error: Could not request certificate: getaddrinfo: Name or service not known'
* 04:43 zhuyifei1999_: this one has logs full of 'Can't contact LDAP server'
* 04:41 zhuyifei1999_: nslcd also broken on tools-worker-1005
* 04:34 zhuyifei1999_: uncordon tools-worker-1014.tools.eqiad.wmflabs
* 04:33 zhuyifei1999_: the issue was, /var/run/nslcd/socket was somehow a directory, AFAICT
* 04:31 zhuyifei1999_: then started nslcd vis systemctl and `id zhuyifei1999` returns correct stuffs
* 04:30 zhuyifei1999_: `nslcd -nd` complains about 'nslcd: bind() to /var/run/nslcd/socket failed: Address already in use'. SIGTERMed a background nslcd, `rmdir /var/run/nslcd/socket`, and `nslcd -nd` seemingly starts to work
* 04:23 zhuyifei1999_: drained tools-worker-1014.tools.eqiad.wmflabs
* 04:16 zhuyifei1999_: logs: https://phabricator.wikimedia.org/P8095
* 04:14 zhuyifei1999_: restarting nslcd on tools-worker-1014 in an attempt to fix that, service failed to start, looking into logs
* 04:12 zhuyifei1999_: restarting nscd on tools-worker-1014 in an attempt to fix seemingly-not-attached-to-LDAP
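The nslcd recovery worked through above (read bottom-up) came down to `/var/run/nslcd/socket` having somehow become a directory, so nslcd's bind() failed with "Address already in use". A sketch of the diagnosis and fix:

```shell
# Recovery for the nslcd failure diagnosed above: the socket path was a
# stray directory, not a unix socket.
ls -ld /var/run/nslcd/socket       # confirm: directory instead of a socket
sudo nslcd -nd                     # foreground debug run surfaces the bind() error
sudo rmdir /var/run/nslcd/socket   # remove the stray directory
sudo systemctl start nslcd         # restart; `id <user>` should resolve via LDAP again
```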
 
=== 2019-02-14 ===
* 21:57 bd808: Deleted old tools-proxy-02 instance
* 21:57 bd808: Deleted old tools-proxy-01 instance
* 21:56 bd808: Deleted old tools-package-builder-01 instance
* 20:57 andrewbogott: rebooting tools-worker-1005
* 20:34 andrewbogott: moving tools-exec-1409, tools-exec-1410, tools-exec-1414, tools-exec-1419
* 19:55 andrewbogott: moving tools-webgrid-generic-1401 and tools-webgrid-lighttpd-1419
* 19:33 andrewbogott: moving tools-checker-01 to labvirt1003
* 19:25 andrewbogott: moving tools-elastic-02 to labvirt1003
* 19:11 andrewbogott: moving tools-k8s-etcd-01 to labvirt1002
* 18:37 andrewbogott: moving tools-exec-1418, tools-exec-1424 to labvirt1003
* 18:34 andrewbogott: moving tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1406, tools-webgrid-lighttpd-1410 to labvirt1002
* 17:35 arturo: [[phab:T215154|T215154]] tools-sgebastion-07 now running systemd 239 and starts enforcing user limits
* 15:33 andrewbogott: moving tools-worker-1002, 1003, 1005, 1006, 1007, 1010, 1013, 1014 to different labvirts in order to move labvirt1012 to eqiad1-r
 
=== 2019-02-13 ===
* 19:16 andrewbogott: deleting tools-sgewebgrid-generic-0901, tools-sgewebgrid-lighttpd-0901, tools-sgebastion-06
* 15:16 zhuyifei1999_: `sudo /usr/local/bin/grid-configurator --all-domains --observer-pass $(grep OS_PASSWORD /etc/novaobserver.yaml|awk '{gsub(/"/,"",$2);print $2}')` on tools-sgegrid-master to attempt to make it recognize -sgebastion-07 [[phab:T216042|T216042]]
* 15:06 zhuyifei1999_: `sudo systemctl restart gridengine-master` on tools-sgegrid-master to attempt to make it recognize -sgebastion-07 [[phab:T216042|T216042]]
* 13:03 arturo: [[phab:T216030|T216030]] switch login-stretch.tools.wmflabs.org floating IP to tools-sgebastion-07
 
=== 2019-02-12 ===
* 01:24 bd808: Stopped maintain-kubeusers, edited /etc/kubernetes/tokenauth, restarted maintain-kubeusers ([[phab:T215704|T215704]])
 
=== 2019-02-11 ===
* 22:57 bd808: Shutoff tools-webgrid-lighttpd-14{01,13,24,26,27,28} via Horizon UI
* 22:34 bd808: Decommissioned tools-webgrid-lighttpd-14{01,13,24,26,27,28}
* 22:23 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1401.tools.eqiad.wmflabs
* 22:21 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1413.tools.eqiad.wmflabs
* 22:18 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1428.tools.eqiad.wmflabs
* 22:07 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1427.tools.eqiad.wmflabs
* 22:06 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1424.tools.eqiad.wmflabs
* 22:05 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1426.tools.eqiad.wmflabs
* 20:06 bstorm_: Ran apt-get clean on tools-sgebastion-07 since it was running out of disk (and lots of it was the apt cache)
* 19:09 bd808: Upgraded tools-manifest on tools-cron-01 to v0.19 ([[phab:T107878|T107878]])
* 18:57 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.19 ([[phab:T107878|T107878]])
* 18:57 bd808: Built tools-manifest_0.19_all.deb and published to aptly repos ([[phab:T107878|T107878]])
* 18:26 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.18 ([[phab:T107878|T107878]])
* 18:25 bd808: Built tools-manifest_0.18_all.deb and published to aptly repos ([[phab:T107878|T107878]])
* 18:12 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.17 ([[phab:T107878|T107878]])
* 18:08 bd808: Built tools-manifest_0.17_all.deb and published to aptly repos ([[phab:T107878|T107878]])
* 10:41 godog: flip tools-prometheus proxy back to tools-prometheus-01 and upgrade to prometheus 2.7.1
 
=== 2019-02-08 ===
* 19:17 hauskatze: Stopped webservice of `tools.sulinfo`, which redirects to `tools.quentinv57-tools`, which is also unavailable
* 18:32 hauskatze: Stopped webservice for `tools.quentinv57-tools` for [[phab:T210829|T210829]].
* 13:49 gtirloni: upgraded all packages in SGE cluster
* 12:25 arturo: install aptitude in tools-sgebastion-06
* 11:08 godog: flip tools-prometheus.wmflabs.org to tools-prometheus-02 - [[phab:T215272|T215272]]
* 01:07 bd808: Creating tools-sgebastion-07
 
=== 2019-02-07 ===
* 23:48 bd808: Updated DNS to make tools-trusty.wmflabs.org and trusty.tools.wmflabs.org CNAMEs for login-trusty.tools.wmflabs.org
* 20:18 gtirloni: cleared mail queue on tools-mail-02
* 08:41 godog: upgrade prometheus-02 to prometheus 2.6 - [[phab:T215272|T215272]]
 
=== 2019-02-04 ===
* 13:20 arturo: [[phab:T215154|T215154]] another reboot for tools-sgebastion-06
* 12:26 arturo: [[phab:T215154|T215154]] another reboot for tools-sgebastion-06. Puppet is disabled
* 11:38 arturo: [[phab:T215154|T215154]] reboot tools-sgebastion-06 to totally refresh systemd status
* 11:36 arturo: [[phab:T215154|T215154]] manually install systemd 239 in tools-sgebastion-06
 
=== 2019-01-30 ===
* 23:54 gtirloni: cleared apt cache on sge* hosts
 
=== 2019-01-25 ===
* 20:50 bd808: Deployed new tcl/web Kubernetes image based on Debian Stretch ([[phab:T214668|T214668]])
* 14:22 andrewbogott: draining and moving tools-worker-1016 to a new labvirt for [[phab:T214447|T214447]]
* 14:22 andrewbogott: draining and moving tools-worker-1021 to a new labvirt for [[phab:T214447|T214447]]
 
=== 2019-01-24 ===
* 11:09 arturo: [[phab:T213421|T213421]] delete tools-services-01/02
* 09:46 arturo: [[phab:T213418|T213418]] delete tools-docker-registry-02
* 09:45 arturo: [[phab:T213418|T213418]] delete tools-docker-builder-05 and tools-docker-registry-01
* 03:28 bd808: Fixed rebase conflict in labs/private on tools-puppetmaster-01
 
=== 2019-01-23 ===
* 22:18 bd808: Building new tools-sgewebgrid-lighttpd-0904 instance using Stretch base image ([[phab:T214519|T214519]])
* 22:09 bd808: Deleted tools-sgewebgrid-lighttpd-0904 instance via Horizon, used wrong base image ([[phab:T214519|T214519]])
* 21:04 bd808: Building new tools-sgewebgrid-lighttpd-0904 instance ([[phab:T214519|T214519]])
* 20:53 bd808: Deleted broken tools-sgewebgrid-lighttpd-0904 instance via Horizon ([[phab:T214519|T214519]])
* 19:49 andrewbogott: shutting down eqiad-region proxies tools-proxy-01 and tools-proxy-02
* 17:44 bd808: Added rules to default security group for prometheus monitoring on port 9100 ([[phab:T211684|T211684]])
 
=== 2019-01-22 ===
* 20:21 gtirloni: published new docker images (all)
* 18:57 bd808: Changed deb-tools.wmflabs.org proxy to point to tools-sge-services-03.tools.eqiad.wmflabs
 
=== 2019-01-21 ===
* 05:25 andrewbogott: restarted tools-sgeexec-0906 and tools-sgeexec-0904; they seem better now but I have not repooled them yet
 
=== 2019-01-18 ===
* 21:22 bd808: Forcing php-igbinary update via clush for [[phab:T213666|T213666]]
 
=== 2019-01-17 ===
* 23:37 bd808: Shutdown tools-package-builder-01. Use tools-package-builder-02 instead!
* 22:09 bd808: Upgrading tools-manifest to 0.16 on tools-sgecron-01
* 22:05 bd808: Upgrading tools-manifest to 0.16 on tools-cron-01
* 21:51 bd808: Upgrading tools-manifest to 0.15 on tools-cron-01
* 20:41 bd808: Building tools-package-builder-02 to replace tools-package-builder-01
* 17:16 arturo: [[phab:T213421|T213421]] shutdown tools-services-01/02. Will delete VMs after a grace period
* 12:54 arturo: add webservice security group to tools-sge-services-03/04
 
=== 2019-01-16 ===
* 17:29 andrewbogott: depooling and moving tools-sgeexec-0904 tools-sgeexec-0906 tools-sgewebgrid-lighttpd-0904
* 16:38 arturo: [[phab:T213418|T213418]] shutdown tools-docker-registry-01 and 02. Will delete the instances in a week or so
* 14:34 arturo: [[phab:T213418|T213418]] point docker-registry.tools.wmflabs.org to tools-docker-registry-03 (was in -02)
* 14:24 arturo: [[phab:T213418|T213418]] allocate floating IPs for tools-docker-registry-03 & 04
 
=== 2019-01-15 ===
* 21:02 bstorm_: restarting webservicemonitor on tools-services-02 -- acting funny